
Get started with Amazon DynamoDB zero-ETL integration with Amazon Redshift


We’re excited to announce the general availability (GA) of Amazon DynamoDB zero-ETL integration with Amazon Redshift, which enables you to run high-performance analytics on your DynamoDB data in Amazon Redshift with little to no impact on production workloads running on DynamoDB. As data is written into a DynamoDB table, it’s seamlessly made available in Amazon Redshift, eliminating the need to build and maintain complex data pipelines.

Zero-ETL integrations facilitate point-to-point data movement without the need to create and manage data pipelines. You can create a zero-ETL integration on an Amazon Redshift Serverless workgroup or an Amazon Redshift provisioned cluster that uses RA3 instance types. You can then run enhanced analysis on this DynamoDB data with the rich capabilities of Amazon Redshift, such as high-performance SQL, built-in machine learning (ML) and Spark integrations, materialized views (MVs) with automatic and incremental refresh, data sharing, and the ability to join data across multiple data stores and data lakes.

The DynamoDB zero-ETL integration with Amazon Redshift has helped our customers simplify their extract, transform, and load (ETL) pipelines. The following is a testimonial from Keith McDuffee, Director of DevOps at Verisk Analytics, a customer who replaced their homegrown solution with the zero-ETL integration and benefited from the seamless replication it provides:

“We have dashboards built on top of our transactional data in Amazon Redshift. Earlier, we used our homegrown solution to move data from DynamoDB to Amazon Redshift, but those jobs would often time out and lead to a lot of operational burden and missed insights on Amazon Redshift. Using the DynamoDB zero-ETL integration with Amazon Redshift, we no longer run into such issues and the integration seamlessly and continuously replicates data to Amazon Redshift.”

In this post, we showcase how an ecommerce application can use this zero-ETL integration to analyze the distribution of customers by attributes such as location and customer signup date. You can also use the integration for retention and churn analysis by calculating retention rates by comparing the number of active profiles over different time periods.

Solution overview

The zero-ETL integration provides an end-to-end, fully managed process that seamlessly moves data from DynamoDB tables to Amazon Redshift without manual ETL processes, ensuring efficient, incremental updates in the Amazon Redshift environment. It uses DynamoDB exports to incrementally replicate data changes from DynamoDB to Amazon Redshift every 15–30 minutes. The initial data load is a full load, which may take longer depending on the data volume. The integration can also replicate data from multiple DynamoDB tables into a single Amazon Redshift provisioned cluster or serverless workgroup, providing a holistic view of data across various applications.

This replication is done with little to no performance or availability impact to your DynamoDB tables and without consuming DynamoDB read capacity units (RCUs). Your applications will continue to use DynamoDB while data from those tables will be seamlessly replicated to Amazon Redshift for analytics workloads such as reporting and dashboards.

The following diagram illustrates this architecture.

In the following sections, we show how to get started with DynamoDB zero-ETL integration with Amazon Redshift. This general availability release supports creating and managing zero-ETL integrations using the AWS Command Line Interface (AWS CLI), AWS SDKs, the API, and the AWS Management Console. In this post, we demonstrate using the console.

Prerequisites

Complete the following prerequisite steps:

  1. Enable point-in-time recovery (PITR) on the source DynamoDB table (see the CLI sketch after this list).
  2. Enable case sensitivity for the target Redshift data warehouse.
  3. Attach the required resource-based policies to both the DynamoDB table and the Redshift data warehouse, as described in the zero-ETL integration documentation.
  4. Make sure the AWS Identity and Access Management (IAM) user or role creating the integration has an identity-based policy that authorizes the actions listed in that documentation.
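
For reference, the first two prerequisites can be scripted with the AWS CLI. The following is a minimal sketch, assuming the customerprofiles table used later in this post and a hypothetical Redshift Serverless workgroup named my-zetl-workgroup (for a provisioned cluster, set the enable_case_sensitive_identifier parameter in the cluster's parameter group instead):

## Enable PITR on the source DynamoDB table
aws dynamodb update-continuous-backups --table-name customerprofiles --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

## Turn on case-sensitive identifiers for a Redshift Serverless workgroup
aws redshift-serverless update-workgroup --workgroup-name my-zetl-workgroup --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true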

Create the DynamoDB zero-ETL integration

You can create the integration either on the DynamoDB console or Amazon Redshift console. The following steps use the Amazon Redshift console.

  1. On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create DynamoDB integration.

If you choose to create the integration on the DynamoDB console instead, choose Integrations in the navigation pane, then choose Create integration and choose Amazon Redshift.

  3. For Integration name, enter a name (for example, ddb-rs-customerprofiles-zetl-integration).
  4. Choose Next.
  5. Choose Browse DynamoDB tables and choose the table that will be the source for this integration.
  6. Choose Next.

You can only choose one table. If you need data from multiple tables in a single Redshift cluster, you need to create a separate integration for each table.

If you don’t have PITR enabled on the source DynamoDB table, an error appears when you choose the source. In this case, you can select Fix it for me to have DynamoDB enable PITR on your source table. Review the changes and choose Continue.

  7. Choose Next.
  8. Choose your target Redshift data warehouse. If it’s in the same account, you can browse and choose the target. If the target resides in a different account, provide the Amazon Resource Name (ARN) of the target Redshift data warehouse.

If you get an error about the resource policy, select Fix it for me for Amazon Redshift to fix the policies as part of this creation process. Alternatively, you can add the resource policies for Amazon Redshift manually prior to creating the zero-ETL integration. Review the changes and choose Reboot and continue.
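
If you take the manual route on the DynamoDB side, you can attach the table's resource-based policy with the AWS CLI. The following is an illustrative sketch only; the account ID and Region are placeholders, and you should take the authoritative policy actions from the zero-ETL integration documentation:

## Attach a resource-based policy to the source table (illustrative; verify the required actions in the documentation)
aws dynamodb put-resource-policy --resource-arn arn:aws:dynamodb:us-east-1:123456789012:table/customerprofiles --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "redshift.amazonaws.com" },
    "Action": [ "dynamodb:ExportTableToPointInTime", "dynamodb:DescribeTable" ],
    "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/customerprofiles"
  }]
}'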

  9. Choose Next and complete your integration.

The zero-ETL integration creation should show the status Creating. Wait for the status to change to Active.
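
You can also check the integration state from the target data warehouse by querying the SVV_INTEGRATION system view. A minimal sketch, assuming the column names documented for that view:

SELECT integration_id, target_database, source, state
FROM svv_integration;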

Create a Redshift database from the integration

Complete the following steps to create a Redshift database:

  1. On the Amazon Redshift console, navigate to the recently created zero-ETL integration.
  2. Choose Create database from integration.

  3. For Destination database name, enter a name (for example, ddb_rs_customerprofiles_zetl_db).
  4. Choose Create database (or create it with SQL, as shown after these steps).
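
Alternatively, you can create the database from the query editor. The following is a minimal sketch; replace the placeholder with the integration ID shown on the integration's detail page (or returned by SVV_INTEGRATION):

CREATE DATABASE ddb_rs_customerprofiles_zetl_db FROM INTEGRATION '<integration_id>';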

After you create the database, the database state should change from Creating to Active. This starts the replication of data from the source DynamoDB table to the target Redshift table, which is created under the public schema of the destination database (ddb_rs_customerprofiles_zetl_db).

Now you can query your data in Amazon Redshift using the integration with DynamoDB.

Understanding your data

Data exported from DynamoDB to Amazon Redshift is stored in the Redshift database that you created from your zero-ETL integration (ddb_rs_customerprofiles_zetl_db). A single table with the same name as the DynamoDB source table is created under the default (public) Redshift schema. DynamoDB only enforces schemas for the primary key attributes (partition key and, optionally, sort key). Because of this, your DynamoDB table structure is replicated to Amazon Redshift in three columns: the partition key, the sort key, and a SUPER data type column named value that contains all the attributes. The data in this value column is in DynamoDB JSON format. For information about the data format, see DynamoDB table export output format.

The DynamoDB partition key is used as the Redshift table distribution key, and the combination of the DynamoDB partition and sort keys is used as the Redshift table sort key. Amazon Redshift also allows changing the sort key on zero-ETL replicated tables using the ALTER TABLE ... ALTER SORTKEY command.
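
For example, the following hypothetical statement, run while connected to the destination database, changes the sort key of the replicated table to the email attribute:

ALTER TABLE public.customerprofiles ALTER SORTKEY (email);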

The DynamoDB data in Amazon Redshift is read-only data. After the data is available in the Amazon Redshift table, you can query the value column as a SUPER data type using PartiQL SQL or create and query materialized views on the table, which are incrementally refreshed automatically.

For more information about the SUPER data type, see Semistructured data in Amazon Redshift.

Query the data

To validate the ingested records, you can use the Amazon Redshift query editor to query the target table in Amazon Redshift using PartiQL SQL. For example, the following query selects email and unnests the data in the value column to retrieve the customer name and address:

select email,
       value.custname."S"::text custname,
       value.address."S"::text custaddress,
       value
from "ddb_rs_customerprofiles_zetl_db".public."customerprofiles";

To demonstrate the replication of incremental changes in action, we make the following updates to the source DynamoDB table:

  1. Add two new items to the DynamoDB table:
    ## Incremental changes: add two items
    
    aws dynamodb put-item --table-name customerprofiles --item  '{ "email": { "S": "sarah.wilson@example.com" }, "custname": { "S": "Sarah Wilson" }, "username": { "S": "swilson789" }, "phone": { "S": "555-012-3456" }, "address": { "S": "789 Oak St, Chicago, IL 60601" }, "custcreatedt": { "S": "2023-04-01T09:00:00Z" }, "custupddt": { "S": "2023-04-01T09:00:00Z" }, "status": { "S": "active" } }'
    
    aws dynamodb put-item --table-name customerprofiles --item  '{ "email": { "S": "michael.taylor@example.com" }, "custname": { "S": "Michael Taylor" }, "username": { "S": "mtaylor123" }, "phone": { "S": "555-246-8024" }, "address": { "S": "246 Maple Ave, Los Angeles, CA 90001" }, "custcreatedt": { "S": "2022-11-01T08:00:00Z" }, "custupddt": { "S": "2022-11-01T08:00:00Z" }, "status": { "S": "active" } }'

  2. Update the address for one of the items in the DynamoDB table:
    ##update an item
    aws dynamodb update-item --table-name customerprofiles --key '{"email": {"S": "sarahjones@example.com"}}' --update-expression "SET address = :a" --expression-attribute-values '{":a":{"S":"124 Main St, Somewhereville USA "}}' 

  3. Delete the item where email is michaelwilson@example.com:
    ## delete an item
    
    aws dynamodb delete-item --table-name customerprofiles --key '{"email": {"S": "michaelwilson@example.com"}}' 

With these changes, the DynamoDB table customerprofiles now has four items (the three existing items, plus two new ones, minus one deleted), as shown in the following screenshot.

Next, you can go to the query editor to validate these changes. At this point, you can expect the incremental changes to be reflected in the Redshift table (four records in the table).
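
For example, a simple count should return four records after the next incremental replication completes:

select count(*)
from "ddb_rs_customerprofiles_zetl_db".public."customerprofiles";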

Create materialized views on zero-ETL replicated tables

Common analytics use cases generally involve aggregating data across multiple source tables using complex queries to generate reports and dashboards for downstream applications. Customers often create late-binding views for such use cases, but these aren't always able to meet stringent query SLAs because of long underlying query runtimes. Another option is to create a table that stores the combined data from multiple source tables, which brings the challenge of incrementally updating and refreshing that data as the source tables change.

To serve such use cases and get around the challenges of these traditional options, you can create materialized views on top of zero-ETL replicated tables in Amazon Redshift, which can be automatically and incrementally refreshed as the underlying data changes. Materialized views are also convenient for storing frequently accessed data in flattened form, by unnesting and shredding the data that the zero-ETL integration stores in the SUPER column value.

For example, we can use the following query to create a materialized view on the customerprofiles table to analyze customer data:

CREATE MATERIALIZED VIEW dev.public.customer_mv
AUTO REFRESH YES
AS
SELECT value."custname"."S"::varchar(30) as cust_name,
       value."username"."S"::varchar(100) as user_name,
       value."email"."S"::varchar(60) as cust_email,
       value."address"."S"::varchar(100) as cust_address,
       value."phone"."S"::varchar(100) as cust_phone_nbr,
       value."status"."S"::varchar(10) as cust_status,
       value."custcreatedt"."S"::varchar(10) as cust_create_dt,
       value."custupddt"."S"::varchar(10) as cust_update_dt
FROM "ddb_rs_customerprofiles_zetl_db"."public"."customerprofiles"
group by 1,2,3,4,5,6,7,8;

This view is set to AUTO REFRESH, which means it will be automatically and incrementally refreshed when new data arrives in the underlying source table customerprofiles.

Now let’s say you want to understand the distribution of customers across different status categories. You can query the materialized view customer_mv created from the zero-ETL DynamoDB table as follows:

-- Customer count by status
select cust_status,count(distinct user_name) cust_status_count
from dev.public.customer_mv
group by 1;

Next, let’s say you want to compare the number of active customer profiles over different time periods. You can run the following query on customer_mv to get that data:

-- Customer active count by date
select cust_create_dt,count(distinct user_name) cust_count
from dev.public.customer_mv
where cust_status = 'active'
group by 1;

Let’s make a few more incremental changes, this time two new items and one delete, on the source DynamoDB table using the following AWS CLI commands:

aws dynamodb put-item --table-name customerprofiles --item  '{ "email": { "S": "robert.davis@example.com" }, "custname": { "S": "Robert Davis" }, "username": { "S": "rdavis789" }, "phone": { "S": "555-012-3456" }, "address": { "S": "789 Pine St, Seattle, WA 98101" }, "custcreatedt": { "S": "2022-07-01T14:00:00Z" }, "custupddt": { "S": "2023-04-01T11:30:00Z" }, "status": { "S": "inactive" } }'

aws dynamodb put-item --table-name customerprofiles --item '{ "email": { "S": "william.jones@example.com" }, "custname": { "S": "William Jones" }, "username": { "S": "wjones456" }, "phone": { "S": "555-789-0123" }, "address": { "S": "456 Elm St, Atlanta, GA 30301" }, "custcreatedt": { "S": "2022-09-15T12:30:00Z" }, "custupddt": { "S": "2022-09-15T12:30:00Z" }, "status": { "S": "active" } }'

aws dynamodb delete-item --table-name customerprofiles --key '{"email": {"S": "emily.brown@example.com"}}'

Validate the incremental refresh of the materialized view

To monitor the history of materialized view refreshes, you can use the SYS_MV_REFRESH_HISTORY system view. Its output shows that the materialized view customer_mv was incrementally refreshed.
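
A minimal sketch of such a query, assuming the column names documented for SYS_MV_REFRESH_HISTORY:

SELECT mv_name, refresh_type, status, start_time, end_time
FROM sys_mv_refresh_history
WHERE mv_name = 'customer_mv'
ORDER BY start_time DESC;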

Now let’s query the materialized view created from the zero-ETL table. The two new records appear and the deleted record is gone; the changes were propagated into the materialized view with an incremental refresh.
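
For example:

select cust_name, cust_email, cust_status
from dev.public.customer_mv
order by cust_name;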

Monitor the zero-ETL integration

There are several options to obtain metrics on the performance and status of the DynamoDB zero-ETL integration with Amazon Redshift.

On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane. You can choose the zero-ETL integration you want and display Amazon CloudWatch metrics related to the integration. These metrics are also directly available in CloudWatch.

For each integration, there are two tabs with information available:

  • Integration metrics – Shows metrics such as the lag (in minutes) and data transferred (in KBps)
  • Table statistics – Shows details about tables replicated from DynamoDB to Amazon Redshift such as status, last updated time, table row count, and table size

After inserting, deleting, and updating rows in the source DynamoDB table, the Table statistics section displays the details, as shown in the following screenshot.

In addition to the CloudWatch metrics, you can query Redshift system views such as SVV_INTEGRATION (integration configuration and state, used earlier in this post) and SVV_INTEGRATION_TABLE_STATE (per-table replication state), which provide information about the integrations.

Pricing

AWS does not charge an additional fee for the zero-ETL integration. You pay for existing DynamoDB and Amazon Redshift resources used to create and process the change data created as part of a zero-ETL integration. These include DynamoDB PITR, DynamoDB exports for the initial and ongoing data changes to your DynamoDB data, additional Amazon Redshift storage for storing replicated data, and Amazon Redshift compute on the target. For pricing on DynamoDB PITR and DynamoDB exports, see Amazon DynamoDB pricing. For pricing on Redshift clusters, see Amazon Redshift pricing.

Clean up

When you delete a zero-ETL integration, your data isn’t deleted from the DynamoDB table or Amazon Redshift, but data changes that happen after that point in time aren’t sent to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon Redshift console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete and on the Actions menu, choose Delete.
  3. To confirm the deletion, enter confirm and choose Delete.

Conclusion

In this post, we explained how you can set up the zero-ETL integration from DynamoDB to Amazon Redshift to derive holistic insights across many applications, break data silos in your organization, and gain significant cost savings and operational efficiencies.

To learn more about zero-ETL integration, refer to the documentation.


About the authors

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Raghu Kuppala is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Veerendra Nayak is a Principal Database Solutions Architect based in the Bay Area, California. He works with customers to share best practices on database migrations, resiliency, and integrating operational data with analytics and AI services.
