Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Thousands of customers use Amazon Redshift data sharing to enable instant, granular, and fast data access across Redshift provisioned clusters and serverless workgroups. This allows you to scale your read workloads to thousands of concurrent users without having to move or copy data.
Now, we are announcing general availability (GA) of Amazon Redshift multi-data warehouse writes through data sharing. This new capability allows you to scale your write workloads and achieve better performance for extract, transform, and load (ETL) workloads by using different warehouses of different types and sizes based on your workload needs. You can make your ETL job runs more predictable by distributing them across different data warehouses with just a few clicks. Other benefits include the ability to monitor and control costs for each data warehouse, and enabling data collaboration across different teams because you can write to each other’s databases. The data is live and available across all warehouses as soon as it’s committed, even when it’s written across accounts or Regions. To learn more about the reasons for using multiple warehouses to write to the same databases, refer to this previous blog on multi-warehouse writes through data sharing.
As organizations continue to migrate workloads to AWS, they are also looking for mechanisms to manage costs efficiently. A good understanding of the cost of running your business workload, and the value that business workload brings to the organization, allows you to have confidence in the efficiency of your financial management strategy in AWS.
In this post, we demonstrate how you can develop a business chargeback model by adopting the multi-warehouse architecture of Amazon Redshift using data sharing. You can now attribute cost to different business units and at the same time gain more insights to drive efficient spending.
Use case
In this use case, we consider a fictional retail company (AnyCompany) that operates several Redshift provisioned clusters and serverless workgroups, each tailored to a particular business unit, such as the sales, marketing, and development teams. AnyCompany is a large enterprise organization that previously migrated large volumes of enterprise workloads into Amazon Redshift, and is now in the process of breaking down data silos by migrating business-owned workloads into Amazon Redshift. AnyCompany has a highly technical community of business users who want to retain autonomy over the pipelines that enrich the enterprise data with their business-centric data. The enterprise IT team wants to eliminate data silos and the data duplication that results from them, so despite this segregation of workloads, it mandates that all business units access a shared, centralized database, which also strengthens data governance by the centralized enterprise IT team. In this intended architecture, each team is responsible for data ingestion and transformation before writing to the same or different tables residing in the central database. To facilitate this, teams will use their own Redshift workgroup or cluster for computation, enabling separate chargeback to respective cost centers.
In the following sections, we walk you through how to use multi-warehouse writes to ingest data into the same databases using data sharing and develop an end-to-end business chargeback model. This chargeback model can help you attribute cost to individual business units, gain higher visibility into your spending, and implement more cost controls and optimizations.
Solution overview
The following diagram illustrates the solution architecture.
The workflow includes the following steps:
- Steps 1a, 1b, and 1c – Ingestion from various sources is isolated by using separate Amazon Redshift Serverless workgroups and a Redshift provisioned cluster.
- Steps 2a, 2b, and 2c – All producers write data to the primary ETL storage in their own respective schemas and tables. For example, the Sales workgroup writes data into the Sales schema, and the Marketing workgroup writes data into the Marketing schema, both belonging to the storage of the ETL provisioned cluster. They can also apply transformations at the schema object level depending on their business requirements.
- Step 2d – Both the Redshift Serverless producer workgroups and the Redshift producer cluster can insert and update data into a common table, `ETL_Audit`, residing in the Audit schema in the primary ETL storage.
- Steps 3a, 3b, and 3c – The same Redshift Serverless workgroups and provisioned cluster used for ingestion are also used for consumption and are maintained by different business teams and billed separately.
The high-level steps to implement this architecture are as follows:
- Set up the primary ETL cluster (producer)
- Create the datashare
- Grant permissions on schemas and objects
- Grant permissions to the Sales and Marketing consumer namespaces
- Set up the Sales warehouse (consumer)
- Create a sales database from the datashare
- Start writing to the etl and sales datashare
- Set up the Marketing warehouse (consumer)
- Create a marketing database from the datashare
- Start writing to the etl and marketing datashare
- Calculate the cost for chargeback to sales and marketing business units
Prerequisites
To follow along with this post, you should have the following prerequisites:
- Three Redshift warehouses of desired sizes: one provisioned cluster and two serverless workgroups, all in the same account and AWS Region.
- Access to a superuser in all three warehouses.
- An AWS Identity and Access Management (IAM) role that is able to ingest data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift.
- For cross-account only, you need access to an IAM user or role that is allowed to authorize datashares. For the IAM policy, refer to Sharing datashares.
Refer to Getting started with multi-warehouse for the most up-to-date information.
Set up the primary ETL cluster (producer)
In this section, we show how to set up the primary ETL producer cluster to store your data.
Connect to the producer
Complete the following steps to connect to the producer:
- In the query editor v2, you can see all the warehouses you have access to in the left pane. You can expand them to see their databases.
- Connect to your primary ETL warehouse using a superuser.
- Run the following command to create the prod database:
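A minimal sketch of this command follows; `prod` is the database name used throughout this walkthrough:

```sql
-- Create the primary ETL database on the producer warehouse
CREATE DATABASE prod;
```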
Create the database objects to share
Complete the following steps to create your database objects to share:
- After you create the prod database, switch your database connection to `prod`. You may need to refresh your page to be able to see it.
- Run the following commands to create the three schemas you intend to share (a combined sketch of these and the table DDL statements follows this list).
- Create the tables in the `ETL` schema to share with the Sales and Marketing consumer warehouses. These are standard DDL statements coming from the AWS Labs TPC-DS DDL file with modified table names.
- Create the tables in the `SALES` schema to share with the Sales consumer warehouse.
- Create the tables in the `MARKETING` schema to share with the Marketing consumer warehouse.
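The following is a condensed, illustrative sketch of these DDL statements. The schema names match the walkthrough; the table definitions are trimmed-down stand-ins for the AWS Labs TPC-DS DDL (inventory, store_sales, customer, and promotion), and the etl_audit_logs column layout is an assumption, so adapt both to your actual DDL:

```sql
-- Create the three schemas to be shared
CREATE SCHEMA etl;
CREATE SCHEMA sales;
CREATE SCHEMA marketing;

-- Shared audit table in the etl schema (illustrative columns)
CREATE TABLE etl.etl_audit_logs (
    job_name      VARCHAR(100),
    run_timestamp TIMESTAMP,
    rows_loaded   BIGINT,
    status        VARCHAR(20)
);

-- Trimmed-down TPC-DS inventory table in the etl schema
CREATE TABLE etl.inventory (
    inv_date_sk          INT NOT NULL,
    inv_item_sk          INT NOT NULL,
    inv_warehouse_sk     INT NOT NULL,
    inv_quantity_on_hand INT
);

-- Trimmed-down TPC-DS store_sales table in the sales schema
CREATE TABLE sales.store_sales (
    ss_sold_date_sk INT,
    ss_item_sk      INT NOT NULL,
    ss_customer_sk  INT,
    ss_quantity     INT,
    ss_net_paid     DECIMAL(7,2)
);

-- Trimmed-down TPC-DS customer and promotion tables in the marketing schema
CREATE TABLE marketing.customer (
    c_customer_sk INT NOT NULL,
    c_customer_id CHAR(16) NOT NULL,
    c_first_name  CHAR(20),
    c_last_name   CHAR(30)
);

CREATE TABLE marketing.promotion (
    p_promo_sk      INT NOT NULL,
    p_promo_id      CHAR(16) NOT NULL,
    p_start_date_sk INT,
    p_end_date_sk   INT
);
```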
Create the datashare
Create datashares for the Sales and Marketing business units with the following command:
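A minimal sketch, assuming the datashares are named salesds and marketingds (placeholder names reused in the sketches that follow):

```sql
-- One datashare per consuming business unit
CREATE DATASHARE salesds;
CREATE DATASHARE marketingds;
```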
Grant permissions on schemas to the datashare
To add objects with permissions to the datashare, use the grant syntax, specifying the datashare you want to grant the permissions to.
- Allow the datashare consumers (Sales and Marketing business units) to use objects added to the `ETL` schema:
- Allow the datashare consumer (Sales business unit) to use objects added to the `SALES` schema:
- Allow the datashare consumer (Marketing business unit) to use objects added to the `MARKETING` schema:
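A combined sketch of these grants, using the assumed salesds and marketingds datashare names:

```sql
-- Let both consumers use objects in the shared etl schema
GRANT USAGE ON SCHEMA etl TO DATASHARE salesds;
GRANT USAGE ON SCHEMA etl TO DATASHARE marketingds;

-- Sales consumer only
GRANT USAGE ON SCHEMA sales TO DATASHARE salesds;

-- Marketing consumer only
GRANT USAGE ON SCHEMA marketing TO DATASHARE marketingds;
```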
Grant permissions on tables to the datashare
Now you can grant access to tables to the datashare using the grant syntax, specifying the permissions and the datashare.
- Grant `select` and `insert` scoped privileges on the `etl_audit_logs` table to the Sales and Marketing datashares:
- Grant `all` privileges on all tables in the `SALES` schema to the Sales datashare:
- Grant `all` privileges on all tables in the `MARKETING` schema to the Marketing datashare:
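A combined sketch of these grants, again using the assumed datashare names:

```sql
-- Scoped read/write access to the shared audit table for both consumers
GRANT SELECT, INSERT ON TABLE etl.etl_audit_logs TO DATASHARE salesds;
GRANT SELECT, INSERT ON TABLE etl.etl_audit_logs TO DATASHARE marketingds;

-- Full access to all tables in each business unit's own schema
GRANT ALL ON ALL TABLES IN SCHEMA sales TO DATASHARE salesds;
GRANT ALL ON ALL TABLES IN SCHEMA marketing TO DATASHARE marketingds;
```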
You can optionally choose to include new objects to be automatically shared. The following code will automatically add new objects in the `etl`, `sales`, and `marketing` schemas to the two datashares:
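A sketch of that option, using the ALTER DATASHARE ... SET INCLUDENEW syntax:

```sql
-- Automatically add objects created later in these schemas to the datashares
ALTER DATASHARE salesds SET INCLUDENEW = TRUE FOR SCHEMA etl;
ALTER DATASHARE salesds SET INCLUDENEW = TRUE FOR SCHEMA sales;
ALTER DATASHARE marketingds SET INCLUDENEW = TRUE FOR SCHEMA etl;
ALTER DATASHARE marketingds SET INCLUDENEW = TRUE FOR SCHEMA marketing;
```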
Grant permissions to the Sales and Marketing namespaces
You can grant permissions to the Sales and Marketing namespaces by specifying the namespace IDs. There are two ways to find namespace IDs:
- On the Redshift Serverless console, find the namespace ID on the namespace details page
- From the Redshift query editor v2, run `select current_namespace;` on both consumers
You can then grant access to the other namespace with the following command (replace the consumer namespace with the namespace IDs of your own Sales and Marketing warehouses):
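A sketch of these grants; the namespace IDs are placeholders for the values you looked up above:

```sql
-- Grant each consumer namespace access to its datashare
GRANT USAGE ON DATASHARE salesds
    TO NAMESPACE '<sales-consumer-namespace-id>';

GRANT USAGE ON DATASHARE marketingds
    TO NAMESPACE '<marketing-consumer-namespace-id>';
```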
Set up and run an ETL job in the ETL producer
Complete the following steps to set up and run an ETL job:
- Create a stored procedure to perform the following steps:
  - Copy data from the S3 bucket to the inventory table in the ETL schema.
  - Insert an audit record in the `etl_audit_logs` table in the ETL schema.
- Run the stored procedure and validate data in the ETL logging table:
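The walkthrough doesn't prescribe the procedure body, so the following is a minimal sketch. The procedure name, S3 path, IAM role ARN, and audit-log columns are assumptions to adapt to your environment:

```sql
CREATE OR REPLACE PROCEDURE etl.sp_load_inventory()
AS $$
DECLARE
    loaded_rows BIGINT;
BEGIN
    -- Copy data from the S3 bucket into the inventory table in the etl schema
    COPY etl.inventory
    FROM 's3://<your-tpcds-bucket>/inventory/'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    DELIMITER '|' GZIP;

    -- Capture the row count for the audit record
    SELECT INTO loaded_rows COUNT(*) FROM etl.inventory;

    -- Insert an audit record in the etl_audit_logs table in the etl schema
    INSERT INTO etl.etl_audit_logs (job_name, run_timestamp, rows_loaded, status)
    VALUES ('etl_inventory_load', SYSDATE, loaded_rows, 'SUCCESS');
END;
$$ LANGUAGE plpgsql;

-- Run the ETL job, then validate the audit entry
CALL etl.sp_load_inventory();
SELECT * FROM etl.etl_audit_logs ORDER BY run_timestamp DESC;
```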
Set up the Sales warehouse (consumer)
At this point, you’re ready to set up your Sales consumer warehouse to start writing data to the shared objects in the ETL producer namespace.
Create a database from the datashare
Complete the following steps to create your database:
- In the query editor v2, switch to the Sales warehouse.
- Run the command `show datashares;` to see the etl and sales datashares as well as the datashare producer’s namespace.
- Use that namespace to create a database from the datashare, as shown in the following code:
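A sketch of this step; sales_db is an assumed name for the local database, and the namespace ID is a placeholder for the value returned by `show datashares;`:

```sql
SHOW DATASHARES;

-- Create a local database that maps to the shared objects, keeping granular permissions
CREATE DATABASE sales_db
    WITH PERMISSIONS
    FROM DATASHARE salesds
    OF NAMESPACE '<producer-namespace-id>';
```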
Specifying WITH PERMISSIONS allows you to grant granular permissions to individual database users and roles. Without it, if you grant usage permissions on the datashare database, users and roles get all permissions on all objects within the datashare database.
Start writing to the datashare database
In this section, we show you how to write to the datashare database using the `use` command and using three-part notation (`database_name.schema_name.table_name`).
Let’s try the use command method first. Run the following command:
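A minimal sketch, assuming the datashare database was created as sales_db in the previous step:

```sql
-- Subsequent two-part references (schema.table) resolve against sales_db
USE sales_db;
```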
Ingest data into the datashare tables
Complete the following steps to ingest the data:
- Copy the TPC-DS data from the AWS Labs public S3 bucket into the tables in the producer’s `sales` schema:
- Insert an entry in the `etl_audit_logs` table in the producer’s `etl` schema. To insert the data, let’s try three-part notation this time:
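A sketch of both steps, with placeholder S3 paths and IAM role ARN; the COPY relies on the preceding USE command, while the audit insert uses the full three-part name:

```sql
-- With USE sales_db in effect, two-part names resolve into the datashare database
COPY sales.store_sales
FROM 's3://<your-tpcds-bucket>/store_sales/'
IAM_ROLE '<your-redshift-iam-role-arn>'
DELIMITER '|' GZIP;

-- Three-part notation (database.schema.table) works without the USE command
INSERT INTO sales_db.etl.etl_audit_logs (job_name, run_timestamp, rows_loaded, status)
SELECT 'sales_store_sales_load', SYSDATE, COUNT(*), 'SUCCESS'
FROM sales.store_sales;
```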
Set up the Marketing warehouse (consumer)
Now, you’re ready to set up your Marketing consumer warehouse to start writing data to the shared objects in the ETL producer namespace. The following steps are similar to the ones previously completed while setting up the Sales warehouse consumer.
Create a database from the datashare
Complete the following steps to create your database:
- In the query editor v2, switch to the Marketing warehouse.
- Run the command `show datashares;` to see the etl and marketing datashares as well as the datashare producer’s namespace.
- Use that namespace to create a database from the datashare, as shown in the following code:
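A sketch of this step; marketing_db is an assumed name for the local database, and the namespace ID is again a placeholder:

```sql
SHOW DATASHARES;

CREATE DATABASE marketing_db
    WITH PERMISSIONS
    FROM DATASHARE marketingds
    OF NAMESPACE '<producer-namespace-id>';
```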
Start writing to the datashare database
In this section, we show you how to write to the datashare database by calling a stored procedure.
Set up and run an ETL job in the ETL producer
Complete the following steps to set up and run an ETL job:
- Create a stored procedure to perform the following steps:
  - Copy data from the S3 bucket to the customer and promotion tables in the MARKETING schema of the producer’s namespace.
  - Insert an audit record in the `etl_audit_logs` table in the `ETL` schema of the producer’s namespace.
- Run the stored procedure:
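A minimal sketch covering both steps. The procedure is assumed to be created in the Marketing consumer’s local database and to write to the shared database with three-part notation; the procedure name, S3 paths, IAM role ARN, and audit columns are assumptions:

```sql
CREATE OR REPLACE PROCEDURE public.sp_load_marketing()
AS $$
BEGIN
    -- Copy data into the customer and promotion tables in the producer's marketing schema
    COPY marketing_db.marketing.customer
    FROM 's3://<your-tpcds-bucket>/customer/'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    DELIMITER '|' GZIP;

    COPY marketing_db.marketing.promotion
    FROM 's3://<your-tpcds-bucket>/promotion/'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    DELIMITER '|' GZIP;

    -- Insert an audit record in the etl_audit_logs table in the producer's etl schema
    INSERT INTO marketing_db.etl.etl_audit_logs (job_name, run_timestamp, rows_loaded, status)
    SELECT 'marketing_customer_promotion_load', SYSDATE, COUNT(*), 'SUCCESS'
    FROM marketing_db.marketing.customer;
END;
$$ LANGUAGE plpgsql;

-- Run the stored procedure from the Marketing consumer
CALL public.sp_load_marketing();
```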
At this point, you’ve completed ingesting the data to the primary ETL namespace. You can query the tables in the `etl`, `sales`, and `marketing` schemas from both the ETL producer warehouse and the Sales and Marketing consumer warehouses and see the same data.
Calculate chargeback to business units
Because the business units’ specific workloads have been isolated to dedicated consumers, you can now attribute cost based on compute capacity utilization. Compute capacity in Redshift Serverless is measured in Redshift Processing Units (RPUs), and you pay for the workloads you run in RPU-hours on a per-second basis. A Redshift administrator can query the SYS_SERVERLESS_USAGE view on individual consumer workgroups to view the details of Redshift Serverless resource usage and the related cost.
For example, to get the total charges for RPU hours used for a time interval, run the following query on the Sales and Marketing business units’ respective consumer workgroups:
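A sketch of such a query against SYS_SERVERLESS_USAGE; the date range and the price per RPU-hour are placeholders to replace with your own interval and your Region's rate:

```sql
-- Approximate daily cost for this workgroup based on charged RPU-seconds
SELECT
    TRUNC(start_time) AS usage_day,
    SUM(charged_seconds) / 3600.0 AS charged_rpu_hours,
    SUM(charged_seconds) / 3600.0 * 0.375 AS estimated_cost_usd  -- example rate; use your Region's price per RPU-hour
FROM sys_serverless_usage
WHERE start_time >= '2024-06-01' AND start_time < '2024-07-01'   -- placeholder interval
GROUP BY 1
ORDER BY 1;
```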
Clean up
When you’re done, remove any resources that you no longer need to avoid ongoing charges:
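A sketch of the SQL portion of the cleanup, assuming the object names used in this walkthrough; also delete or pause any serverless workgroups and provisioned clusters you created solely for this exercise:

```sql
-- On the Sales and Marketing consumers (run while connected to another database, such as dev)
DROP DATABASE sales_db;
DROP DATABASE marketing_db;

-- On the ETL producer
DROP DATASHARE salesds;
DROP DATASHARE marketingds;
DROP DATABASE prod;
```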
Conclusion
In this post, we showed you how you can isolate business units’ specific workloads to multiple consumer warehouses writing the data to the same producer database. This solution has the following benefits:
- Straightforward cost attribution and chargeback to business units
- Ability to use provisioned clusters and serverless workgroups of different sizes to write to the same databases
- Ability to write across accounts and Regions
- Data is live and available to all warehouses as soon as it’s committed
- Writes work even if the producer warehouse (the warehouse that owns the database) is paused
You can engage an Amazon Redshift specialist to answer questions and discuss how we can further help your organization.
About the authors
Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.
Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.
Saurav Das is part of the Amazon Redshift Product Management team. He has more than 16 years of experience in working with relational databases technologies and data protection. He has a deep interest in solving customer challenges centered around high availability and disaster recovery.