
Configure cross-account access of Amazon SageMaker Lakehouse multi-catalog tables using AWS Glue 5.0 Spark


An IAM role, Glue-execution-role, in the consumer account, with the following policies:

  1. Attach the AWS managed policies AWSGlueServiceRole and AmazonRedshiftDataFullAccess.
  2. Create a new inline policy with the following permissions and attach it:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFandRSserverlessAccess",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "redshift-serverless:GetCredentials"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "iam:PassedToService": "glue.amazonaws.com"
                    }
                }
            }
        ]
    }

  3. Add the following trust policy to Glue-execution-role, allowing AWS Glue to assume this role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "glue.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
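If you prefer to script this prerequisite, the role and its policies can be created with boto3. The following sketch reproduces the trust and inline policies shown above; the role name default and the single wrapper function are illustrative, not part of the original walkthrough.

```python
import json


def build_trust_policy():
    # Trust policy from step 3 above: lets AWS Glue assume the role.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": ["glue.amazonaws.com"]},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_inline_policy():
    # Inline policy from step 2 above: Lake Formation data access,
    # Redshift Serverless credentials, and iam:PassRole scoped to Glue.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFandRSserverlessAccess",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "redshift-serverless:GetCredentials",
                ],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "glue.amazonaws.com"}
                },
            },
        ],
    }


def create_glue_execution_role(role_name="Glue-execution-role"):
    """Create the role, attach both managed policies, and add the inline policy."""
    import boto3  # AWS SDK for Python; requires credentials when actually run

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
    )
    for arn in (
        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
        "arn:aws:iam::aws:policy/AmazonRedshiftDataFullAccess",
    ):
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="LFandRSserverlessAccess",
        PolicyDocument=json.dumps(build_inline_policy()),
    )
```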

Steps for producer account setup

For the producer account setup, you can either use your IAM administrator role added as Lake Formation administrator or use a Lake Formation administrator role with permissions added as discussed in the prerequisites. For illustration purposes, we use the IAM admin role Admin added as Lake Formation administrator.


Configure your catalog

Complete the following steps to set up your catalog:

  1. Log in to AWS Management Console as Admin.
  2. On the Amazon Redshift console, follow the instructions in Registering Amazon Redshift clusters and namespaces to the AWS Glue Data Catalog.
  3. After the registration is initiated, you will see the invite from Amazon Redshift on the Lake Formation console.
  4. Select the pending catalog invitation and choose Approve and create catalog.


  5. On the Set catalog details page, configure your catalog:
    1. For Name, enter a name (for this post, redshiftserverless1-uswest2).
    2. Select Access this catalog from Apache Iceberg compatible engines.
    3. Choose the IAM role you created for the data transfer.
    4. Choose Next.
  6. On the Grant permissions – optional page, choose Add permissions.
    1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
    2. Choose Add.
  7. Verify the granted permission on the next page and choose Next.
  8. Review the details on the Review and create page and choose Create catalog.

Wait a few seconds for the catalog to show up.

  9. Choose Catalogs in the navigation pane and verify that the redshiftserverless1-uswest2 catalog is created.
  10. Explore the catalog detail page to verify the ordersdb.public database.
  11. On the database's View dropdown menu, choose View tables and verify that the orderstbl table shows up.

As the Admin role, you can also query the orderstbl in Amazon Athena and confirm the data is available.


Grant permissions on the tables from the producer account to the consumer account

In this step, we share the Amazon Redshift federated catalog database redshiftserverless1-uswest2:ordersdb.public and its table orderstbl, as well as the Amazon S3 based Iceberg table returnstbl_iceberg and its database customerdb from the default catalog, to the consumer account. A catalog can't be shared with external accounts as a catalog-level permission, so we share only the databases and tables.

  1. On the Lake Formation console, choose Data permissions in the navigation pane.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the consumer account ID.
  5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
  6. For Catalogs, choose the account ID that represents the default catalog.
  7. For Databases, choose customerdb.
  8. Select Describe under both Database permissions and Grantable permissions.
  9. Choose Grant.
  10. Repeat these steps and grant table-level Select and Describe permissions on returnstbl_iceberg.
  11. Repeat these steps again to grant database- and table-level permissions for the orderstbl table of the federated catalog database redshiftserverless1-uswest2/ordersdb.

The following screenshots show the configuration for database-level permissions.


The following screenshots show the configuration for table-level permissions.


  1. Choose Data permissions in the navigation pane and verify that the consumer account has been granted database- and table-level permissions for both orderstbl from the federated catalog and returnstbl_iceberg from the default catalog.
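The console grants above can also be issued through the Lake Formation API. The following boto3 sketch builds a grant_permissions request for one shared table; the account IDs are placeholders, and the federated catalog ID format mentioned in the docstring is an assumption, not confirmed by this post.

```python
def cross_account_table_grant(consumer_account, producer_account, database, table,
                              permissions=("SELECT", "DESCRIBE"), catalog_id=None):
    """Build kwargs for lakeformation.grant_permissions() on one shared table.

    catalog_id defaults to the producer's default catalog; for the Redshift
    federated catalog, a value like
    '<producer-id>:redshiftserverless1-uswest2/ordersdb' (format assumed)
    would be passed instead.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account},
        "Resource": {
            "Table": {
                "CatalogId": catalog_id or producer_account,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": list(permissions),
    }


# Example: grant SELECT/DESCRIBE on returnstbl_iceberg to the consumer account.
kwargs = cross_account_table_grant(
    "111122223333",   # consumer account ID (placeholder)
    "444455556666",   # producer account ID (placeholder)
    "customerdb",
    "returnstbl_iceberg",
)
# Live call (requires producer-account credentials):
# import boto3
# boto3.client("lakeformation").grant_permissions(**kwargs)
```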

Register the Amazon S3 location of returnstbl_iceberg with Lake Formation

In this step, we register the Amazon S3 based Iceberg table returnstbl_iceberg data location with Lake Formation to be managed by Lake Formation permissions. Complete the following steps:

  1. On the Lake Formation console, choose Data lake locations in the navigation pane.
  2. Choose Register location.
  3. For Amazon S3 path, enter the path for your S3 bucket that you provided while creating the Iceberg table returnstbl_iceberg.
  4. For IAM role, provide the user-defined role LakeFormationS3Registration_custom that you created as a prerequisite.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.
  7. Choose Data lake locations in the navigation pane to verify the Amazon S3 registration.
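The registration step can likewise be scripted with the Lake Formation API. This boto3 sketch builds the register_resource request for Lake Formation permission mode; the bucket path and role ARN are placeholders for your own values.

```python
def register_location_kwargs(bucket_path, role_arn):
    """Build kwargs for lakeformation.register_resource() in Lake Formation
    permission mode (a user-defined role rather than the service-linked role)."""
    return {
        # S3 data location ARNs take the form arn:aws:s3:::bucket/prefix
        "ResourceArn": "arn:aws:s3:::" + bucket_path.removeprefix("s3://").rstrip("/"),
        "RoleArn": role_arn,
        "UseServiceLinkedRole": False,
    }


kwargs = register_location_kwargs(
    "s3://my-lakehouse-bucket/returnstbl_iceberg/",  # placeholder bucket path
    "arn:aws:iam::444455556666:role/LakeFormationS3Registration_custom",
)
# Live call (requires producer-account credentials):
# import boto3
# boto3.client("lakeformation").register_resource(**kwargs)
```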

With this step, the producer account setup is complete.

Steps for consumer account setup

For the consumer account setup, we use the IAM admin role Admin, added as a Lake Formation administrator.

The steps in the consumer account are quite involved. In the consumer account, a Lake Formation administrator will accept the AWS Resource Access Manager (AWS RAM) shares and create the required resource links that point to the shared catalog, database, and tables. The Lake Formation admin verifies that the shared resources are accessible by running test queries in Athena. The admin further grants permissions to the role Glue-execution-role on the resource links, database, and tables. The admin then runs a join query in AWS Glue 5.0 Spark using Glue-execution-role.

Accept and verify the shared resources

Lake Formation uses AWS RAM resource shares to enable cross-account sharing of Data Catalog resources. To view and verify the resources shared from the producer account, complete the following steps:

  1. Log in to the consumer AWS console and set the AWS Region to match the producer’s shared resource Region. For this post, we use us-west-2.
  2. Open the Lake Formation console. You will see a message indicating there is a pending invite and asking you to accept it on the AWS RAM console.
  3. Follow the instructions in Accepting a resource share invitation from AWS RAM to review and accept the pending invites.
  4. When the invite status changes to Accepted, choose Shared resources under Shared with me in the navigation pane.
  5. Verify that the Redshift Serverless federated catalog redshiftserverless1-uswest2, the default catalog database customerdb, the table returnstbl_iceberg, and the producer account ID under the Owner ID column display correctly.
  6. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  7. Search by the producer account ID.
    You should see the customerdb and public databases. You can further select each database, choose View tables on the Actions dropdown menu, and verify the table names.


You will not see an AWS RAM share invite at the catalog level on the Lake Formation console, because catalog-level sharing isn't possible. You can review the shared federated catalog and Amazon Redshift managed catalog names on the AWS RAM console, or by using the AWS Command Line Interface (AWS CLI) or SDK.
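When accepting invites programmatically, the AWS RAM API exposes the same flow. The sketch below filters a get_resource_share_invitations response down to pending invites and shows (commented out) how each would be accepted; the sample response is abbreviated and illustrative.

```python
def pending_invitations(response):
    """Filter a ram.get_resource_share_invitations() response down to the
    invitations that are still waiting to be accepted."""
    return [
        inv for inv in response.get("resourceShareInvitations", [])
        if inv.get("status") == "PENDING"
    ]


# Abbreviated sample of the API response shape (ARNs are placeholders).
sample = {
    "resourceShareInvitations": [
        {"resourceShareInvitationArn":
             "arn:aws:ram:us-west-2:444455556666:resource-share-invitation/abc",
         "status": "PENDING"},
        {"resourceShareInvitationArn":
             "arn:aws:ram:us-west-2:444455556666:resource-share-invitation/def",
         "status": "ACCEPTED"},
    ]
}

for inv in pending_invitations(sample):
    # Live calls (require consumer-account credentials):
    # import boto3
    # ram = boto3.client("ram")
    # ram.accept_resource_share_invitation(
    #     resourceShareInvitationArn=inv["resourceShareInvitationArn"])
    pass
```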

Create a catalog link container and resource links

A catalog link container is a Data Catalog object that references a local or cross-account federated database-level catalog from other AWS accounts. For more details, refer to Accessing a shared federated catalog. Catalog link containers are essentially Lake Formation resource links at the catalog level that reference or point to a Redshift cluster federated catalog or Amazon Redshift managed catalog object from other accounts.

In the following steps, we create a catalog link container that points to the producer shared federated catalog redshiftserverless1-uswest2. Inside the catalog link container, we create a database. Inside the database, we create a resource link for the table that points to the shared federated catalog table <>:redshiftserverless1-uswest2/ordersdb.public.orderstbl.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs.
  2. Choose Create catalog.


  3. Provide the following details for the catalog:
    1. For Name, enter a name for the catalog (for this post, rl_link_container_ordersdb).
    2. For Type, choose Catalog Link container.
    3. For Source, choose Redshift.
    4. For Target Redshift Catalog, enter the Amazon Resource Name (ARN) of the producer federated catalog (arn:aws:glue:us-west-2:<>:catalog/redshiftserverless1-uswest2/ordersdb).
    5. Under Access from engines, select Access this catalog from Apache Iceberg compatible engines.
    6. For IAM role, provide the Redshift-S3 data transfer role that you created in the prerequisites.
    7. Choose Next.
  4. On the Grant permissions – optional page, choose Add permissions.
    1. Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
    2. Choose Add and then choose Next.
  5. Review the details on the Review and create page and choose Create catalog.

Wait a few seconds for the catalog to show up.

  6. In the navigation pane, choose Catalogs.
  7. Verify that rl_link_container_ordersdb is created.

Create a database under rl_link_container_ordersdb

Complete the following steps:

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. On the Choose catalog dropdown menu, choose rl_link_container_ordersdb.
  3. Choose Create database.

Alternatively, you can choose the Create dropdown menu and then choose Database.

  4. Provide details for the database:
    1. For Name, enter a name (for this post, public_db).
    2. For Catalog, choose rl_link_container_ordersdb.
    3. Leave Location – optional blank.
    4. Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
    5. Choose Create database.
  5. Choose Catalogs in the navigation pane to verify that public_db is created under rl_link_container_ordersdb.

Create a table resource link for the shared federated catalog table

A resource link to a shared federated catalog table can reside only inside the database of a catalog link container. A resource link for such tables will not work if created inside the default catalog. For more details on resource links, refer to Creating a resource link to a shared Data Catalog table.

Complete the following steps to create a table resource link:

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Tables.
  2. On the Create dropdown menu, choose Resource link.


  3. Provide details for the table resource link:
    1. For Resource link name, enter a name (for this post, rl_orderstbl).
    2. For Destination catalog, choose rl_link_container_ordersdb.
    3. For Database, choose public_db.
    4. For Shared table’s region, choose US West (Oregon).
    5. For Shared table, choose orderstbl.
    6. After the shared table is selected, Shared table’s database and Shared table’s catalog ID should be populated automatically.
    7. Choose Create.
  4. In the navigation pane, choose Databases to verify that rl_orderstbl is created under public_db, inside rl_link_container_ordersdb.
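Scripted, a table resource link is a glue.create_table call whose TableInput carries a TargetTable pointing at the shared table. In the sketch below the account IDs and the account:catalog multi-catalog ID format are assumptions; only the input-building helper is exercised here.

```python
def table_resource_link_input(name, shared_catalog_id, shared_db, shared_table,
                              region="us-west-2"):
    """TableInput for glue.create_table() that creates a resource link
    pointing at the producer's shared federated-catalog table."""
    return {
        "Name": name,
        "TargetTable": {
            "CatalogId": shared_catalog_id,
            "DatabaseName": shared_db,
            "Name": shared_table,
            "Region": region,
        },
    }


table_input = table_resource_link_input(
    "rl_orderstbl",
    # Producer federated catalog ID; format is an assumption, not from this post.
    "444455556666:redshiftserverless1-uswest2/ordersdb",
    "public",
    "orderstbl",
)
# Live call (requires consumer-account credentials; catalog ID format assumed):
# import boto3
# boto3.client("glue").create_table(
#     CatalogId="111122223333:rl_link_container_ordersdb",
#     DatabaseName="public_db",
#     TableInput=table_input,
# )
```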

Create a database resource link for the shared default catalog database

Now we create a database resource link in the default catalog to query the Amazon S3 based Iceberg table shared from the producer. For details on database resource links, refer to Creating a resource link to a shared Data Catalog database.

Though we are able to see the shared database in the default catalog of the consumer, a resource link is required to query it from analytics engines such as Athena, Amazon EMR, and AWS Glue. When using AWS Glue with Lake Formation tables, the resource link needs to be named identically to the source account's resource. For additional details on using AWS Glue with Lake Formation, refer to Considerations and limitations.

Complete the following steps to create a database resource link:

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. On the Choose catalog dropdown menu, choose the account ID to choose the default catalog.
  3. Search for customerdb.

You should see the shared database name customerdb with the Owner account ID as that of your producer account ID.

  4. Select customerdb, and on the Create dropdown menu, choose Resource link.
  5. Provide details for the resource link:
    1. For Resource link name, enter a name (for this post, customerdb).
    2. The rest of the fields should be already populated.
    3. Choose Create.
  6. In the navigation pane, choose Databases and verify that customerdb is created under the default catalog. Resource link names are shown in italic font.

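A database resource link can similarly be created with glue.create_database, where DatabaseInput carries a TargetDatabase pointing at the producer's shared database. The account IDs below are placeholders.

```python
def database_resource_link_input(name, producer_account, shared_db,
                                 region="us-west-2"):
    """DatabaseInput for glue.create_database() that creates a resource link
    in the consumer's default catalog pointing at the shared database."""
    return {
        "Name": name,
        "TargetDatabase": {
            "CatalogId": producer_account,
            "DatabaseName": shared_db,
            "Region": region,
        },
    }


# Name the link identically to the source database, as the post requires
# for AWS Glue to resolve it.
db_input = database_resource_link_input("customerdb", "444455556666", "customerdb")
# Live call (requires consumer-account credentials):
# import boto3
# boto3.client("glue").create_database(DatabaseInput=db_input)
```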

Verify access as Admin using Athena

Now you can verify your access using Athena. Complete the following steps:

  1. Open the Athena console.
  2. Make sure an S3 bucket is provided to store the Athena query results. For details, refer to Specify a query result location using the Athena console.
  3. In the navigation pane, verify both the default catalog and federated catalog tables by previewing them.
  4. You can also run a join query as follows. Pay attention to the three-part notation for referring to the tables from two different catalogs:
SELECT
returns_tb.market as Market,
sum(orders_tb.quantity) as Total_Quantity
FROM rl_link_container_ordersdb.public_db.rl_orderstbl as orders_tb
JOIN awsdatacatalog.customerdb.returnstbl_iceberg as returns_tb
ON orders_tb.order_id = returns_tb.order_id
GROUP BY returns_tb.market;


This verifies the new capability of SageMaker Lakehouse, which enables accessing Redshift cluster tables and Amazon S3 based Iceberg tables in the same query, across AWS accounts, through the Data Catalog, using Lake Formation permissions.

Grant permissions to Glue-execution-role

Now we will share the resources from the producer account with additional IAM principals in the consumer account. Usually, the data lake admin grants permissions to data analysts, data scientists, and data engineers in the consumer account to do their job functions, such as processing and analyzing the data.

We set up Lake Formation permissions on the catalog link container, databases, tables, and resource links to the AWS Glue job execution role Glue-execution-role that we created in the prerequisites.

Resource links allow only Describe and Drop permissions. You need to use the Grant on target configuration to provide database Describe and table Select permissions.

Complete the following steps:

  1. On the Lake Formation console, choose Data permissions in the navigation pane.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. For IAM users and roles, enter Glue-execution-role.
  5. Under LF-Tags or catalog resources, select Named Data Catalog resources.
  6. For Catalogs, choose rl_link_container_ordersdb and the consumer account ID, which indicates the default catalog.
  7. Under Catalog permissions, select Describe for Catalog permissions.
  8. Choose Grant.


  9. Repeat these steps for the catalog rl_link_container_ordersdb:
    1. On the Databases dropdown menu, choose public_db.
    2. Under Database permissions, select Describe.
    3. Choose Grant.
  10. Repeat these steps again, but after choosing rl_link_container_ordersdb and public_db, on the Tables dropdown menu, choose rl_orderstbl.
    1. Under Resource link permissions, select Describe.
    2. Choose Grant.
  11. Repeat these steps to grant additional permissions to Glue-execution-role:
    1. For this iteration, grant Describe permissions on the default catalog databases public and customerdb.
    2. Grant Describe permission on the resource link customerdb.
    3. Grant Select permission on the tables returnstbl_iceberg and orderstbl.

The following screenshots show the configuration for permissions on the public and customerdb databases.


The following screenshots show the configuration for resource link customerdb permissions.


The following screenshots show the configuration for table returnstbl_iceberg permissions.


The following screenshots show the configuration for table orderstbl permissions.


  12. In the navigation pane, choose Data permissions and verify permissions on Glue-execution-role.

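You can also verify the grants programmatically with lakeformation.list_permissions. The helper below reduces a response to (resource type, permissions) pairs for one principal; the sample response is abbreviated and the role ARN is a placeholder.

```python
def summarize_grants(response, principal_arn):
    """Reduce a lakeformation.list_permissions() response to
    (resource_type, permissions) pairs for one principal."""
    out = []
    for entry in response.get("PrincipalResourcePermissions", []):
        if entry["Principal"]["DataLakePrincipalIdentifier"] == principal_arn:
            resource_type = next(iter(entry["Resource"]))  # e.g. "Table"
            out.append((resource_type, tuple(entry["Permissions"])))
    return out


role_arn = "arn:aws:iam::111122223333:role/Glue-execution-role"  # placeholder
# Abbreviated sample of the API response shape.
sample = {
    "PrincipalResourcePermissions": [
        {"Principal": {"DataLakePrincipalIdentifier": role_arn},
         "Resource": {"Table": {"DatabaseName": "customerdb",
                                "Name": "returnstbl_iceberg"}},
         "Permissions": ["SELECT"]},
    ]
}
# Live call (requires consumer-account credentials):
# import boto3
# resp = boto3.client("lakeformation").list_permissions(
#     Principal={"DataLakePrincipalIdentifier": role_arn})
print(summarize_grants(sample, role_arn))  # → [('Table', ('SELECT',))]
```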

Run a PySpark job in AWS Glue 5.0

Download the PySpark script LakeHouseGlueSparkJob.py. This AWS Glue PySpark script runs Spark SQL by joining the producer shared federated orderstbl table and Amazon S3 based returns table in the consumer account to analyze the data and identify the total orders placed per market.

Replace <> in the script with your consumer account ID. Complete the following steps to create and run an AWS Glue job:

  1. On the AWS Glue console, in the navigation pane, choose ETL jobs.
  2. Choose Create job, then choose Script editor.


  3. For Engine, choose Spark.
  4. For Options, choose Start fresh.
  5. Choose Upload script.
  6. Browse to the location where you downloaded and edited the script, select the script, and choose Open.
  7. On the Job details tab, provide the following information:
    1. For Name, enter a name (for this post, LakeHouseGlueSparkJob).
    2. Under Basic properties, for IAM role, choose Glue-execution-role.
    3. For Glue version, select Glue 5.0.
    4. Under Advanced properties, for Job parameters, choose Add new parameter.
    5. Add the parameters --datalake-formats = iceberg and --enable-lakeformation-fine-grained-access = true.
  8. Save the job.
  9. Choose Run to execute the AWS Glue job, and wait for the job to complete.
  10. Review the job run details from the Output logs.

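As a rough sketch of what such a job does (the downloaded script remains the source of truth), the fragment below assembles the cross-catalog join with three-part table names and would run it in a Glue 5.0 Spark session. The names mirror this post's examples; the helper and the run_job wrapper are illustrative, not the actual script.

```python
def build_join_sql(link_container, rl_db, rl_table, default_db, iceberg_table):
    """Assemble the cross-catalog join; the engine resolves the resource links."""
    return (
        "SELECT r.market AS Market, SUM(o.quantity) AS Total_Quantity "
        f"FROM {link_container}.{rl_db}.{rl_table} o "
        f"JOIN {default_db}.{iceberg_table} r "
        "ON o.order_id = r.order_id GROUP BY r.market"
    )


sql = build_join_sql("rl_link_container_ordersdb", "public_db", "rl_orderstbl",
                     "customerdb", "returnstbl_iceberg")


def run_job():
    # Inside a Glue 5.0 job submitted with --datalake-formats=iceberg and
    # --enable-lakeformation-fine-grained-access=true, a plain SparkSession
    # is already wired to the Data Catalog (assumption for this sketch).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql(sql).show()
```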

Clean up

To avoid incurring costs on your AWS accounts, clean up the resources you created:

  1. Delete the Lake Formation permissions, catalog link container, database, and tables in the consumer account.
  2. Delete the AWS Glue job in the consumer account.
  3. Delete the federated catalog, database, and table resources in the producer account.
  4. Delete the Redshift Serverless namespace in the producer account.
  5. Delete the S3 buckets you created as part of data transfer in both accounts and the Athena query results bucket in the consumer account.
  6. Clean up the IAM roles you created for the SageMaker Lakehouse setup as part of the prerequisites.

Conclusion

In this post, we illustrated how to bring your existing Redshift tables to SageMaker Lakehouse and share them securely with external AWS accounts. We also showed how to query the shared data warehouse and data lakehouse tables in the same Spark session, from a recipient account, using Spark in AWS Glue 5.0.

We hope you find this useful to integrate your Redshift tables with an existing data mesh and access the tables using AWS Glue Spark. Test this solution in your accounts and share feedback in the comments section. Stay tuned for more updates and feel free to explore the features of SageMaker Lakehouse and AWS Glue versions.

Appendix: Table creation

Complete the following steps to create a returns table in the Amazon S3 based default catalog and an orders table in Amazon Redshift:

  1. Download the CSV format datasets orders and returns.
  2. Upload them to your S3 bucket under the corresponding table prefix path.
  3. Use the following SQL statements in Athena. First-time users of Athena should refer to Specify a query result location.
CREATE DATABASE customerdb;
CREATE EXTERNAL TABLE customerdb.returnstbl_csv(
  `returned` string, 
  `order_id` string, 
  `market` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ';' 
LOCATION
  's3:////'
TBLPROPERTIES (
  'skip.header.line.count'='1'
);

select * from customerdb.returnstbl_csv limit 10; 


  4. Create an Iceberg format table in the default catalog and insert data from the CSV format table:
CREATE TABLE customerdb.returnstbl_iceberg(
  `returned` string, 
  `order_id` string, 
  `market` string)
LOCATION 's3:///returnstbl_iceberg/' 
TBLPROPERTIES (
  'table_type'='ICEBERG'
);

INSERT INTO customerdb.returnstbl_iceberg
SELECT *
FROM customerdb.returnstbl_csv;

SELECT * FROM customerdb.returnstbl_iceberg LIMIT 10; 


  5. To create the orders table in the Redshift Serverless namespace, open the Query Editor v2 on the Amazon Redshift console.
  6. Connect to the default namespace using your database admin user credentials.
  7. Run the following commands in the SQL editor to create the database ordersdb and the table orderstbl in it, and copy the data from the S3 location of your orders data into orderstbl:
create database ordersdb;
use ordersdb;

create table orderstbl(
  row_id int, 
  order_id VARCHAR, 
  order_date VARCHAR, 
  ship_date VARCHAR, 
  ship_mode VARCHAR, 
  customer_id VARCHAR, 
  customer_name VARCHAR, 
  segment VARCHAR, 
  city VARCHAR, 
  state VARCHAR, 
  country VARCHAR, 
  postal_code int, 
  market VARCHAR, 
  region VARCHAR, 
  product_id VARCHAR, 
  category VARCHAR, 
  sub_category VARCHAR, 
  product_name VARCHAR, 
  sales VARCHAR, 
  quantity bigint, 
  discount VARCHAR, 
  profit VARCHAR, 
  shipping_cost VARCHAR, 
  order_priority VARCHAR
  );

copy orderstbl
from 's3:///ordersdatacsv/orders.csv' 
iam_role 'arn:aws:iam:::role/service-role/'
CSV 
DELIMITER ';'
IGNOREHEADER 1
;

select * from ordersdb.orderstbl limit 5;

About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lakehouse solutions, and establishes best practices for data governance.

Subhasis Sarkar is a Senior Data Engineer with Amazon. Subhasis thrives on solving complex technological challenges with innovative solutions. He specializes in AWS data architectures, particularly data mesh implementations using AWS CDK components.
