
Using Amazon S3 Tables with Amazon Redshift to query Apache Iceberg tables


Amazon Redshift supports querying data stored in Apache Iceberg tables, an open table format that simplifies management of tabular data residing in data lakes on Amazon Simple Storage Service (Amazon S3). Amazon S3 Tables delivers the first cloud object store with built-in Iceberg support and streamlines storing tabular data at scale, including continual table optimizations that help improve query performance. Amazon SageMaker Lakehouse unifies your data across S3 data lakes, including S3 Tables, and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. You can query data stored in S3 Tables without complex extract, transform, and load (ETL) or data movement processes, take advantage of the scalability of S3 Tables to store and manage large volumes of data, optimize costs by avoiding additional data movement steps, and simplify data management through centralized fine-grained access control from SageMaker Lakehouse.

In this post, we demonstrate how to get started with S3 Tables and Amazon Redshift Serverless for querying data in Iceberg tables. We show how to set up S3 Tables, load data, register them in the unified data lake catalog, set up basic access controls in SageMaker Lakehouse through AWS Lake Formation, and query the data using Amazon Redshift.

Note – Amazon Redshift is just one option for querying data stored in S3 Tables. You can learn more about S3 Tables and additional ways to query and analyze data on the S3 Tables product page.

Solution overview

In this solution, we show how to query Iceberg tables managed in S3 Tables using Amazon Redshift. Specifically, we load a dataset into S3 Tables, link the data in S3 Tables to a Redshift Serverless workgroup with appropriate permissions, and finally run queries to analyze our dataset for trends and insights. The following diagram illustrates this workflow.

In this post, we will walk through the following steps:

  1. Create a table bucket in S3 Tables and integrate with other AWS analytics services.
  2. Set up permissions and create Iceberg tables with SageMaker Lakehouse using Lake Formation.
  3. Load data with Amazon Athena. There are different ways to ingest data into S3 Tables, but for this post, we show how we can quickly get started with Athena.
  4. Use Amazon Redshift to query your Iceberg tables stored in S3 Tables through the auto mounted catalog.

Prerequisites

The examples in this post require the following AWS services and features:

  - An AWS account with permissions to administer AWS Lake Formation and create table buckets in Amazon S3 Tables
  - Amazon Athena, used here to load sample data
  - An Amazon Redshift Serverless workgroup (patch 188 or higher)

Create a table bucket in S3 Tables

Before you can use Amazon Redshift to query the data in S3 Tables, you must first create a table bucket. Complete the following steps:

  1. In the Amazon S3 console, choose Table buckets on the left navigation pane.
  2. In the Integration with AWS analytics services section, choose Enable integration if you haven’t previously set this up.

This sets up the integration with AWS analytics services, including Amazon Redshift, Amazon EMR, and Athena.

After a few seconds, the status will change to Enabled.

  3. Choose Create table bucket.
  4. Enter a bucket name. For this example, we use the bucket name redshifticeberg.
  5. Choose Create table bucket.

After the S3 table bucket is created, you will be redirected to the table buckets list.

Now that your table bucket is created, the next step is to configure the unified catalog in SageMaker Lakehouse through the Lake Formation console. This will make the table bucket in S3 Tables available to Amazon Redshift for querying Iceberg tables.

Publishing Iceberg tables in S3 Tables to SageMaker Lakehouse

Before you can query Iceberg tables in S3 Tables with Amazon Redshift, you must first make the table bucket available in the unified catalog in SageMaker Lakehouse. You can do this through the Lake Formation console, which lets you publish catalogs and manage tables through the catalogs feature, and assign permissions to users. The following steps show you how to set up Lake Formation so you can use Amazon Redshift to query Iceberg tables in your table bucket:

  1. If you’ve never visited the Lake Formation console before, you must first visit it as an AWS user with administrative permissions to activate Lake Formation.

You will be redirected to the Catalogs page on the Lake Formation console. You will see that one of the catalogs available is the s3tablescatalog, which maintains a catalog of the table buckets you’ve created. The following steps will configure Lake Formation to make data in the s3tablescatalog catalog available to Amazon Redshift.

Next, you need to create a database in Lake Formation. The Lake Formation database maps to a Redshift schema.

  2. Choose Databases under Data Catalog in the navigation pane.
  3. On the Create menu, choose Database.

  4. Enter a name for this database. This example uses icebergsons3.
  5. For Catalog, choose the table bucket that you created. In this example, the name will have the format :s3tablescatalog/redshifticeberg.
  6. Choose Create database.

You will be redirected to a page on the Lake Formation console that shows more information about your new database. Now you can create an Iceberg table in S3 Tables.

  7. On the database details page, on the View menu, choose Tables.

This will open up a new browser window with the table editor for this database.

  8. After the table view loads, choose Create table to start creating the table.

  9. In the editor, enter the name of the table. We call this table examples.
  10. Choose the catalog (:s3tablescatalog/redshifticeberg) and database (icebergsons3).

Next, add columns to your table.

  11. In the Schema section, choose Add column, and add a column that represents an ID, such as id (long).

  12. Repeat this step and add columns for additional data:
    1. category_id (long)
    2. insert_date (date)
    3. data (string)

The final schema looks like the following screenshot.

  13. Choose Submit to create the table.
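Alternatively, if you prefer SQL to the console’s table editor, the same table can be created from the Athena query editor once the analytics integration is enabled. The following DDL is a minimal sketch, assuming the editor’s catalog is set to s3tablescatalog/redshifticeberg and the database to icebergsons3; because S3 Tables manages the underlying storage, no LOCATION clause is specified:

-- Sketch: DDL equivalent of the console steps above (run in Athena).
CREATE TABLE icebergsons3.examples (
    id          bigint,
    category_id bigint,
    insert_date date,
    data        string
)
TBLPROPERTIES ('table_type' = 'ICEBERG');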

Next, you need to set up a read-only permission so you can query Iceberg data in S3 Tables using the Amazon Redshift Query Editor v2. For more information, see Prerequisites for managing Amazon Redshift namespaces in the AWS Glue Data Catalog.

  14. Under Administration in the navigation pane, choose Administrative roles and tasks.
  15. In the Data lake administrators section, choose Add.

  16. For Access type, select Read-only administrator.
  17. For IAM users and roles, enter AWSServiceRoleForRedshift.

AWSServiceRoleForRedshift is a service-linked role that’s managed by AWS.

  18. Choose Confirm.

You have now configured SageMaker Lakehouse using Lake Formation to allow Amazon Redshift to query Iceberg tables in S3 Tables. Next, you populate some data into the Iceberg table, and query it with Amazon Redshift.

Use SQL to query Iceberg data with Amazon Redshift

For this example, we use Athena to load data into our Iceberg table. This is one option for ingesting data into an Iceberg table; see Using Amazon S3 Tables with AWS analytics services for other options, including Amazon EMR with Spark, Amazon Data Firehose, and AWS Glue ETL.

  1. On the Athena console, navigate to the query editor.
  2. If this is your first time using Athena, you must first specify a query result location before executing your first query.
  3. In the query editor, under Data, choose your data source (AwsDataCatalog).
  4. For Catalog, choose the table bucket you created (s3tablescatalog/redshifticeberg).
  5. For Database, choose the database you created (icebergsons3).

  6. Let’s run a query to generate data for the examples table. The following query generates over 1.5 million rows (50,000 rows for each of 31 days). Enter the query and choose Run.
INSERT INTO icebergsons3.examples
SELECT
    b.id * (date_diff('day', CURRENT_DATE, a.insert_date) + 1), -- id
    b.id % 1000,                                                -- category_id (0-999)
    a.insert_date,                                              -- insert_date
    CAST(random() AS varchar)                                   -- data (random value, such as '0.2...')
FROM
    -- Cross join 31 consecutive dates with 50,000 IDs: 1,550,000 rows total.
    unnest(
        sequence(CURRENT_DATE, CURRENT_DATE + INTERVAL '30' DAY, INTERVAL '1' DAY)
    ) AS a(insert_date),
    unnest(sequence(1, 50000)) AS b(id);

The following screenshot shows our query.

The query takes about 10 seconds to execute.
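Optionally, you can sanity-check the load from the same Athena editor before moving on. The following query is a small sketch that counts rows per day; with the INSERT above, each generated date should hold exactly 50,000 rows:

-- Verify the load: every generated date should contain 50,000 rows.
SELECT insert_date, count(*) AS rows_per_day
FROM icebergsons3.examples
GROUP BY insert_date
ORDER BY insert_date
LIMIT 5;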

Now you can use Redshift Serverless to query the data.

  7. On the Redshift Serverless console, provision a Redshift Serverless workgroup if you haven’t already done so. For instructions, see the Get started with Amazon Redshift Serverless data warehouses guide. In this example, we use a Redshift Serverless workgroup called iceberg.
  8. Make sure that your Amazon Redshift patch version is patch 188 or higher.

  9. Choose Query data to open the Amazon Redshift Query Editor v2.

  10. In the query editor, choose the workgroup you want to use.

A pop-up window will appear, prompting you to choose which user to connect as.

  11. Select Federated user, which will use your current account, and choose Create connection.

It will take a few seconds to start the connection. When you’re connected, you will see a list of available databases.

  12. Choose External databases.

You will see the table bucket from S3 Tables in the view (in this example, this is redshifticeberg@s3tablescatalog).

  13. If you continue expanding the tree, you will see the examples table, which is the Iceberg table you previously created that’s stored in the table bucket.

You can now use Amazon Redshift to query the Iceberg table in S3 Tables.
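If you prefer SQL to browsing the tree, the following sketch lists the same objects; it assumes the SHOW SCHEMAS and SHOW TABLES commands available in recent Redshift versions (the quoted identifier syntax is explained next):

-- List the schemas mounted from the table bucket, then the tables in icebergsons3.
SHOW SCHEMAS FROM DATABASE "redshifticeberg@s3tablescatalog";
SHOW TABLES FROM SCHEMA "redshifticeberg@s3tablescatalog".icebergsons3;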

Before you execute the query, review the Amazon Redshift syntax for querying catalogs registered in SageMaker Lakehouse. Amazon Redshift uses the following syntax to reference a table: database@namespace.schema.table or "database@namespace".schema.table.

In this example, we use the following syntax to query the examples table in the table bucket: redshifticeberg@s3tablescatalog.icebergsons3.examples.

Learn more about this mapping in Using Amazon S3 Tables with AWS analytics services.
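As a quick illustration, both statements in the following sketch reference the same examples table; the quoted form treats the database@namespace pair as a single identifier:

-- Equivalent references to the same Iceberg table.
SELECT count(*) FROM redshifticeberg@s3tablescatalog.icebergsons3.examples;
SELECT count(*) FROM "redshifticeberg@s3tablescatalog".icebergsons3.examples;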

Let’s run some queries. First, let’s see how many rows are in the examples table.

  14. Run the following query in the query editor:
SELECT count(*)
FROM redshifticeberg@s3tablescatalog.icebergsons3.examples; 

The query will take a few seconds to execute and returns a count of 1,550,000 (31 days × 50,000 rows per day).

Let’s try a slightly more complicated query. In this case, we want to find all the days that had example data values starting with 0.2 and a category_id between 50 and 75, with more than 130 matching rows per day. We will order the results from most to least.

  15. Run the following query:
SELECT examples.insert_date, count(*)
FROM redshifticeberg@s3tablescatalog.icebergsons3.examples
WHERE
    examples.data LIKE '0.2%' AND
    examples.category_id BETWEEN 50 AND 75
GROUP BY examples.insert_date
HAVING count(*) > 130
ORDER BY count DESC;

You might see different results than the following screenshot due to the randomly generated source data.

Congratulations, you have set up and queried Iceberg data in S3 Tables from Amazon Redshift!

Clean up

If you implemented the example and want to remove the resources, complete the following steps:

  1. If you no longer need your Redshift Serverless workgroup, delete the workgroup.
  2. If you don’t need to access your SageMaker Lakehouse data from the Amazon Redshift Query Editor v2, remove the data lake administrator:
    1. On the Lake Formation console, choose Administrative roles and tasks in the navigation pane.
    2. Remove the read-only data lake administrator that has the AWSServiceRoleForRedshift privilege.
  3. If you want to permanently delete the data from this post, delete the database:
    1. On the Lake Formation console, choose Databases in the navigation pane.
    2. Delete the icebergsons3 database.
  4. If you no longer need the table bucket, delete the table bucket.
  5. If you want to deactivate the integration between S3 Tables and AWS analytics services, see Migrating to the updated integration process.

Conclusion

In this post, we showed how to get started with Amazon Redshift to query Iceberg tables stored in S3 Tables. This is just the beginning of what you can do with Amazon Redshift and your Iceberg data in S3 Tables. You can combine this with other Amazon Redshift features, including writing queries that join data from Iceberg tables stored in S3 Tables and Redshift Managed Storage (RMS), or implementing data access controls that give you fine-grained access control rules for different users across S3 Tables. Additionally, you can use features like Redshift Serverless to automatically select the amount of compute for analyzing your Iceberg tables, and use AI to intelligently scale on demand and optimize query performance characteristics for your analytical workload.
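For example, the following sketch joins the Iceberg table from this post with a hypothetical categories dimension table stored in RMS; public.categories and its columns are assumptions for illustration only, not objects created in this post:

-- Hypothetical cross-storage join: Iceberg data in S3 Tables with a local RMS table.
SELECT c.category_name, count(*) AS example_count
FROM redshifticeberg@s3tablescatalog.icebergsons3.examples AS e
JOIN public.categories AS c ON c.category_id = e.category_id
GROUP BY c.category_name
ORDER BY example_count DESC;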

We invite you to leave feedback in the comments.


About the Authors

Jonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including to the PostgreSQL and pgvector projects.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.
