
Incremental refresh for Amazon Redshift materialized views on data lake tables


Amazon Redshift is a fast, fully managed cloud data warehouse that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. You can use Amazon Redshift to analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using AWS designed hardware and automated machine learning (ML)-based tuning to deliver top-tier price performance at scale.

Amazon Redshift delivers price performance right out of the box. However, it also offers additional optimizations that you can use to further improve this performance and achieve even faster query response times from your data warehouse.

One such optimization for reducing query runtime is to precompute query results in the form of a materialized view. Materialized views in Amazon Redshift speed up queries against large tables, and they are especially useful for queries that involve aggregations and multi-table joins. A materialized view stores the precomputed result set of such a query and, for local tables, also supports incremental refresh.
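
For example, a materialized view can precompute an expensive aggregation so that repeated dashboard queries read the stored result instead of rescanning the base table. The following is a minimal sketch; the sales table and its columns are illustrative and not part of this post's dataset.

    -- Precompute daily order counts and revenue once, instead of
    -- aggregating the full sales table on every query.
    CREATE MATERIALIZED VIEW daily_sales_mv AS
    SELECT saledate,
           COUNT(*)       AS order_count,
           SUM(pricepaid) AS revenue
    FROM sales
    GROUP BY saledate;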

Customers use data lake tables to achieve cost effective storage and interoperability with other tools. With open table formats (OTFs) such as Apache Iceberg, data is continuously being added and updated.

Amazon Redshift now provides the ability to incrementally refresh your materialized views on data lake tables including open file and table formats such as Apache Iceberg.

In this post, we show you, step by step, which operations are supported on both open file formats and transactional data lake tables to enable incremental refresh of materialized views.

Prerequisites

To walk through the examples in this post, you need the following prerequisites:

  1. You can test the incremental refresh of materialized views on standard data lake tables in your account using an existing Redshift data warehouse and data lake. However, if you want to test the examples using sample data, download the sample data. The sample files are ‘|’ delimited text files.
  2. An AWS Identity and Access Management (IAM) role attached to Amazon Redshift to grant the minimum permissions required to use Redshift Spectrum with Amazon Simple Storage Service (Amazon S3) and AWS Glue.
  3. Set the IAM Role as the default role in Amazon Redshift.

Incremental materialized view refresh on standard data lake tables

In this section, you learn how to build and incrementally refresh materialized views in Amazon Redshift on standard text files in Amazon S3, maintaining data freshness with a cost-effective approach.

  1. Upload the first file, customer.tbl.1, that you downloaded as part of the prerequisites, to your desired S3 bucket under the prefix customer.
  2. Connect to your Amazon Redshift Serverless workgroup or Redshift provisioned cluster using Query editor v2.
  3. Create an external schema.
    create external schema datalake_mv_demo
    from data catalog   
    database 'datalake-mv-demo'
    iam_role default;

  4. Create an external table named customer in the external schema datalake_mv_demo created in the preceding step.
    create external table datalake_mv_demo.customer(
            c_custkey int8,
            c_name varchar(25),
            c_address varchar(40),
            c_nationkey int4,
            c_phone char(15),
            c_acctbal numeric(12, 2),
            c_mktsegment char(10),
            c_comment varchar(117)
        ) row format delimited fields terminated by '|' stored as textfile location 's3:///customer/';
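
You can optionally confirm that the external table is registered in the data catalog by querying the SVV_EXTERNAL_TABLES system view.

    -- List the external tables under the datalake_mv_demo schema,
    -- including the S3 location each table points to.
    select schemaname, tablename, location
    from svv_external_tables
    where schemaname = 'datalake_mv_demo';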

  5. Validate the sample data in the external table customer.
    select * from datalake_mv_demo.customer;

  6. Create a materialized view on the external table.
    CREATE MATERIALIZED VIEW customer_mv 
    AS
    select * from datalake_mv_demo.customer;

  7. Validate the data in the materialized view.
    select * from customer_mv limit 5;
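
You can also check the SVV_MV_INFO system view, which shows whether the materialized view is stale relative to its base table and whether it can be refreshed incrementally.

    -- state indicates whether the view is refreshed incrementally or fully
    -- recomputed; is_stale indicates whether base-table changes are pending.
    select name, schema, state, is_stale, autorefresh
    from svv_mv_info
    where name = 'customer_mv';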

  8. Upload a new file customer.tbl.2 in the same S3 bucket and customer prefix location. This file contains one additional record.
  9. Using Query editor v2, refresh the materialized view customer_mv.
    REFRESH MATERIALIZED VIEW customer_mv;

  10. Validate the incremental refresh of the materialized view when the new file is added.
    select mv_name, status, start_time, end_time
    from SYS_MV_REFRESH_HISTORY
    where mv_name='customer_mv'
    order by start_time DESC;

  11. Retrieve the current number of rows present in the materialized view customer_mv.
    select count(*) from customer_mv;

  12. Delete the existing file customer.tbl.1 from the same S3 bucket and prefix customer. You should only have customer.tbl.2 in the customer prefix of your S3 bucket.
  13. Using Query editor v2, refresh the materialized view customer_mv again.
    REFRESH MATERIALIZED VIEW customer_mv;

  14. Verify that the materialized view is refreshed incrementally when the existing file is deleted.
    select mv_name, status, start_time, end_time
    from SYS_MV_REFRESH_HISTORY
    where mv_name='customer_mv'
    order by start_time DESC;

  15. Retrieve the current row count in the materialized view customer_mv. It should now have one record as present in the customer.tbl.2 file.
    select count(*) from customer_mv;

  16. Modify the contents of the previously downloaded customer.tbl.2 file by altering the customer key from 999999999 to 111111111.
  17. Save the modified file and upload it again to the same S3 bucket, overwriting the existing file within the customer prefix.
  18. Using Query editor v2, refresh the materialized view customer_mv.
    REFRESH MATERIALIZED VIEW customer_mv;

  19. Validate that the materialized view was incrementally refreshed after the data was modified in the file.
    select mv_name, status, start_time, end_time
    from SYS_MV_REFRESH_HISTORY
    where mv_name='customer_mv'
    order by start_time DESC;

  20. Validate that the data in the materialized view reflects your prior data changes from 999999999 to 111111111.
    select * from customer_mv;
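
For example, you can filter on the new customer key from step 16 to confirm that the modified record is present.

    -- The record previously keyed 999999999 should now appear as 111111111.
    select c_custkey, c_name
    from customer_mv
    where c_custkey = 111111111;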

Incremental materialized view refresh on Apache Iceberg data lake tables

Apache Iceberg is a data lake open table format that’s rapidly becoming an industry standard for managing data in data lakes. Iceberg introduces new capabilities that enable multiple applications to work together on the same data in a transactionally consistent manner.

In this section, we will explore how Amazon Redshift can seamlessly integrate with Apache Iceberg. You can use this integration to build materialized views and incrementally refresh them using a cost-effective approach, maintaining the freshness of the stored data.

  1. Sign in to the AWS Management Console, go to Amazon Athena, and execute the following SQL to create a database in an AWS Glue catalog.
    create database iceberg_mv_demo;

  2. Create a new Iceberg table.
    create table iceberg_mv_demo.category (
      catid int ,
      catgroup string ,
      catname string ,
      catdesc string)
      PARTITIONED BY (catid, bucket(16,catid))
      LOCATION 's3:///iceberg/'
      TBLPROPERTIES (
      'table_type'='iceberg',
      'write_compression'='snappy',
      'format'='parquet');

  3. Add some sample data to iceberg_mv_demo.category.
    insert into iceberg_mv_demo.category values
    (1, 'Sports', 'MLB', 'Major League Baseball'),
    (2, 'Sports', 'NHL', 'National Hockey League'),
    (3, 'Sports', 'NFL', 'National Football League'),
    (4, 'Sports', 'NBA', 'National Basketball Association'),
    (5, 'Sports', 'MLS', 'Major League Soccer');

  4. Validate the sample data in iceberg_mv_demo.category.
    select * from iceberg_mv_demo.category;

  5. Connect to your Amazon Redshift Serverless workgroup or Redshift provisioned cluster using Query editor v2.
  6. Create an external schema.
    CREATE external schema iceberg_schema
    from data catalog
    database 'iceberg_mv_demo'
    region 'us-east-1'
    iam_role default;

  7. Query the Iceberg table data from Amazon Redshift.
    SELECT * FROM "dev"."iceberg_schema"."category";

  8. Create a materialized view using the external schema.
    CREATE MATERIALIZED VIEW mv_category AS
    SELECT * FROM "dev"."iceberg_schema"."category";

  9. Validate the data in the materialized view.
    select * from mv_category;

  10. Using Amazon Athena, modify the Iceberg table iceberg_mv_demo.category and insert sample data.
    insert into iceberg_mv_demo.category values
    (12, 'Concerts', 'Comedy', 'All stand-up comedy performances'),
    (13, 'Concerts', 'Other', 'General');
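
Each write in Athena commits a new Iceberg snapshot, and the incremental refresh consumes these snapshots. If you want to inspect the snapshot history, Athena exposes Iceberg metadata tables such as $history; this check is optional and the output can vary by engine version.

    -- Show the snapshots committed to the category table.
    SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor
    FROM "iceberg_mv_demo"."category$history";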

  11. Using Query editor v2, refresh the materialized view mv_category.
    REFRESH MATERIALIZED VIEW mv_category;

  12. Validate the incremental refresh of the materialized view after the additional data was populated in the Iceberg table.
    select mv_name, status, start_time, end_time
    from SYS_MV_REFRESH_HISTORY
    where mv_name='mv_category'
    order by start_time DESC;

  13. Using Amazon Athena, modify the Iceberg table iceberg_mv_demo.category by deleting and updating records.
    delete from iceberg_mv_demo.category
    where catid = 3;
     
    update iceberg_mv_demo.category
    set catdesc='American National Basketball Association'
    where catid=4;

  14. Validate the sample data in iceberg_mv_demo.category to confirm that catid=4 has been updated and catid=3 has been deleted from the table.
    select * from iceberg_mv_demo.category;

  15. Using Query editor v2, refresh the materialized view mv_category.
    REFRESH MATERIALIZED VIEW mv_category;

  16. Validate the incremental refresh of the materialized view after one row was updated and another was deleted.
    select mv_name, status, start_time, end_time
    from SYS_MV_REFRESH_HISTORY
    where mv_name='mv_category'
    order by start_time DESC;
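
As a final check, query the materialized view itself to confirm that it reflects the delete of catid 3 and the updated description for catid 4.

    -- catid 3 should be absent and catid 4 should show the new description.
    select * from mv_category order by catid;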

Performance improvements

To understand the performance improvements of incremental refresh over full recompute, we used the industry-standard TPC-DS benchmark with a 3 TB dataset for Iceberg tables configured in copy-on-write mode. In our benchmark, fact tables are stored on Amazon S3, while dimension tables are in Redshift. We created 34 materialized views representing different customer use cases on a 4-node ra3.4xlarge Redshift provisioned cluster. We applied 1% inserts and deletes to the fact tables store_sales, catalog_sales, and web_sales, running the inserts and deletes with Spark SQL on Amazon EMR Serverless. We then refreshed all 34 materialized views using incremental refresh and measured refresh latencies, and repeated the experiment using full recompute.
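
The exact statements used in the benchmark aren't shown here; as an illustrative sketch, a delete of roughly 1% of a fact table can be applied with a Spark SQL statement such as the following, where the predicate is hypothetical.

    -- Illustrative Spark SQL delete against a copy-on-write Iceberg fact table.
    DELETE FROM store_sales
    WHERE ss_sold_date_sk BETWEEN 2451545 AND 2451550;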

Our experiments show that incremental refresh provides substantial performance gains over full recompute. After insertions, incremental refresh was 13.5X faster on average than full recompute (maximum 43.8X, minimum 1.8X). After deletions, incremental refresh was 15X faster on average (maximum 47X, minimum 1.2X). The following graphs illustrate the latency of refresh.

Figure: Refresh latency after inserts (incremental refresh compared with full recompute)

Figure: Refresh latency after deletes (incremental refresh compared with full recompute)

Clean up

When you’re done, remove any resources that you no longer need to avoid ongoing charges.

  1. Run the following script to clean up the Amazon Redshift objects.
    DROP MATERIALIZED VIEW mv_category;

    DROP MATERIALIZED VIEW customer_mv;

  2. Run the following script to clean up the Apache Iceberg tables using Amazon Athena.
    DROP TABLE iceberg_mv_demo.category;

Conclusion

Materialized views on Amazon Redshift can be a powerful optimization tool. With incremental refresh of materialized views on data lake tables, you can store pre-computed results of your queries over one or more base tables, providing a cost-effective approach to maintaining fresh data. We encourage you to update your data lake workloads and use the incremental materialized view feature. If you’re new to Amazon Redshift, try the Getting Started tutorial and use the free trial to create and provision your first cluster and experiment with the feature.

See Materialized views on external data lake tables in Amazon Redshift Spectrum for considerations and best practices.


About the authors

Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.

Tahir Aziz is an Analytics Solution Architect at AWS. He has been building data warehouses and big data solutions for more than 15 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Enrico Siragusa is a Senior Software Development Engineer at Amazon Redshift. He contributed to query processing and materialized views. Enrico holds a M.Sc. in Computer Science from the University of Paris-Est and a Ph.D. in Bioinformatics from the International Max Planck Research School in Computational Biology and Scientific Computing in Berlin.
