- Create an IAM role named Glue-execution-role in the consumer account with the following policies:
  - Attach the AWS managed policies AWSGlueServiceRole and AmazonRedshiftDataFullAccess.
  - Create a new inline policy with the following permissions and attach it:
  - Add the following trust policy to Glue-execution-role, allowing AWS Glue to assume this role:
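The trust policy itself is not reproduced in this excerpt. A minimal sketch of a standard AWS Glue service trust policy, built here as a Python dict for clarity, looks like the following; verify it against your organization's security requirements before attaching it:

```python
import json

# Sketch of a standard trust policy allowing the AWS Glue service
# to assume Glue-execution-role. Confirm against your own requirements.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```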
Steps for producer account setup
For the producer account setup, you can either use your IAM administrator role added as a Lake Formation administrator or use a Lake Formation administrator role with permissions added as discussed in the prerequisites. For illustration purposes, we use the IAM admin role Admin added as a Lake Formation administrator.
Configure your catalog
Complete the following steps to set up your catalog:
- Log in to the AWS Management Console as Admin.
- On the Amazon Redshift console, follow the instructions in Registering Amazon Redshift clusters and namespaces to the AWS Glue Data Catalog.
- After the registration is initiated, you will see the invite from Amazon Redshift on the Lake Formation console.
- Select the pending catalog invitation and choose Approve and create catalog.
- On the Set catalog details page, configure your catalog:
  - For Name, enter a name (for this post, redshiftserverless1-uswest2).
  - Select Access this catalog from Apache Iceberg compatible engines.
  - Choose the IAM role you created for the data transfer.
  - Choose Next.
- On the Grant permissions – optional page, choose Add permissions.
- Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
- Choose Add.
- Verify the granted permission on the next page and choose Next.
- Review the details on the Review and create page and choose Create catalog.
Wait a few seconds for the catalog to show up.
- Choose Catalogs in the navigation pane and verify that the redshiftserverless1-uswest2 catalog is created.
- Explore the catalog detail page to verify the ordersdb.public database.
- On the database View dropdown menu, view the table and verify that the orderstbl table shows up.
As the Admin role, you can also query the orderstbl table in Amazon Athena and confirm the data is available.
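As an illustration, a preview query in Athena against the federated catalog might look like the following sketch. The catalog and schema names follow this post's naming, and the exact quoting of the mounted catalog name may differ in your setup:

```python
# Hypothetical Athena preview query against the federated catalog table.
# Athena typically addresses a mounted federated catalog as
# "<catalog>/<database>"."schema"."table"; names follow this post's setup.
catalog = "redshiftserverless1-uswest2/ordersdb"
preview_query = f'SELECT * FROM "{catalog}"."public"."orderstbl" LIMIT 10'
print(preview_query)
```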
Grant permissions on the tables from the producer account to the consumer account
In this step, we share the Amazon Redshift federated catalog database redshiftserverless1-uswest2:ordersdb.public and table orderstbl, as well as the Amazon S3 based Iceberg table returnstbl_iceberg and its database customerdb from the default catalog, to the consumer account. We can't share an entire catalog to external accounts as a catalog-level permission; we share only the database and table.
- On the Lake Formation console, choose Data permissions in the navigation pane.
- Choose Grant.
- Under Principals, select External accounts.
- Provide the consumer account ID.
- Under LF-Tags or catalog resources, select Named Data Catalog resources.
- For Catalogs, choose the account ID that represents the default catalog.
- For Databases, choose customerdb.
- Under Database permissions, select Describe under Database permissions and Grantable permissions.
- Choose Grant.
- Repeat these steps and grant table-level Select and Describe permissions on returnstbl_iceberg.
- Repeat these steps again to grant database- and table-level permissions for the orderstbl table of the federated catalog database redshiftserverless1-uswest2/ordersdb.
The following screenshots show the configuration for database-level permissions.
The following screenshots show the configuration for table-level permissions.
- Choose Data permissions in the navigation pane and verify that the consumer account has been granted database- and table-level permissions for both orderstbl from the federated catalog and returnstbl_iceberg from the default catalog.
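The console grants above can also be scripted with the Lake Formation API. The following is a sketch of a `boto3` `grant_permissions` request for the returnstbl_iceberg table grant; the account IDs are placeholders and the actual call is left commented out:

```python
# Sketch of the table-level cross-account grant via the Lake Formation API.
# CONSUMER_ACCOUNT_ID and PRODUCER_ACCOUNT_ID are placeholders.
CONSUMER_ACCOUNT_ID = "111111111111"
PRODUCER_ACCOUNT_ID = "222222222222"

grant_request = {
    "Principal": {"DataLakePrincipal": CONSUMER_ACCOUNT_ID},
    "Resource": {
        "Table": {
            "CatalogId": PRODUCER_ACCOUNT_ID,  # producer's default catalog
            "DatabaseName": "customerdb",
            "Name": "returnstbl_iceberg",
        }
    },
    "Permissions": ["SELECT", "DESCRIBE"],
    "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
}

# import boto3
# boto3.client("lakeformation").grant_permissions(**grant_request)
print(grant_request["Permissions"])
```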
Register the Amazon S3 location of returnstbl_iceberg with Lake Formation
In this step, we register the Amazon S3 based Iceberg table returnstbl_iceberg
data location with Lake Formation to be managed by Lake Formation permissions. Complete the following steps:
- On the Lake Formation console, choose Data lake locations in the navigation pane.
- Choose Register location.
- For Amazon S3 path, enter the path for the S3 bucket that you provided while creating the Iceberg table returnstbl_iceberg.
- For IAM role, provide the user-defined role LakeFormationS3Registration_custom that you created as a prerequisite.
- For Permission mode, select Lake Formation.
- Choose Register location.
- Choose Data lake locations in the navigation pane to verify the Amazon S3 registration.
With this step, the producer account setup is complete.
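Registration can likewise be done with the Lake Formation `register_resource` API. The following sketch shows the request shape; the bucket path and account ID are placeholders:

```python
# Sketch of registering the Iceberg table's S3 location with Lake Formation.
# The bucket path and account ID below are placeholders.
register_request = {
    "ResourceArn": "arn:aws:s3:::your-bucket/returnstbl_iceberg/",
    "RoleArn": "arn:aws:iam::222222222222:role/LakeFormationS3Registration_custom",
    # HybridAccessEnabled=False corresponds to the "Lake Formation"
    # permission mode selected in the console.
    "HybridAccessEnabled": False,
}

# import boto3
# boto3.client("lakeformation").register_resource(**register_request)
print(register_request["ResourceArn"])
```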
Steps for consumer account setup
For the consumer account setup, we use the IAM admin role Admin, added as a Lake Formation administrator.
The steps in the consumer account are more involved. A Lake Formation administrator accepts the AWS Resource Access Manager (AWS RAM) shares and creates the required resource links that point to the shared catalog, database, and tables. The admin verifies that the shared resources are accessible by running test queries in Athena, grants permissions on the resource links, database, and tables to Glue-execution-role, and then runs a join query in AWS Glue 5.0 Spark using Glue-execution-role.
Accept and verify the shared resources
Lake Formation uses AWS RAM resource shares to enable cross-account sharing of Data Catalog resources. To view and verify the resources shared from the producer account, complete the following steps:
- Log in to the consumer AWS console and set the AWS Region to match the producer's shared resource Region. For this post, we use us-west-2.
- Open the Lake Formation console. You will see a message indicating there is a pending invite and asking you to accept it on the AWS RAM console.
- Follow the instructions in Accepting a resource share invitation from AWS RAM to review and accept the pending invites.
- When the invite status changes to Accepted, choose Shared resources under Shared with me in the navigation pane.
- Verify that the Redshift Serverless federated catalog redshiftserverless1-uswest2, the default catalog database customerdb, the table returnstbl_iceberg, and the producer account ID under the Owner ID column display correctly.
- On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
- Search by the producer account ID.
You should see the customerdb and public databases. You can further select each database and choose View tables on the Actions dropdown menu to verify the table names.
You will not see an AWS RAM share invite for the catalog level on the Lake Formation console, because catalog-level sharing isn’t possible. You can review the shared federated catalog and Amazon Redshift managed catalog names on the AWS RAM console, or using the AWS Command Line Interface (AWS CLI) or SDK.
Create a catalog link container and resource links
A catalog link container is a Data Catalog object that references a local or cross-account federated database-level catalog from other AWS accounts. For more details, refer to Accessing a shared federated catalog. Catalog link containers are essentially Lake Formation resource links at the catalog level that reference or point to a Redshift cluster federated catalog or Amazon Redshift managed catalog object from other accounts.
In the following steps, we create a catalog link container that points to the producer shared federated catalog redshiftserverless1-uswest2. Inside the catalog link container, we create a database. Inside the database, we create a resource link that points to the shared federated catalog table orderstbl.
- On the Lake Formation console, under Data Catalog in the navigation pane, choose Catalogs.
- Choose Create catalog.
- Provide the following details for the catalog:
  - For Name, enter a name for the catalog (for this post, rl_link_container_ordersdb).
  - For Type, choose Catalog Link container.
  - For Source, choose Redshift.
  - For Target Redshift Catalog, enter the Amazon Resource Name (ARN) of the producer federated catalog (arn:aws:glue:us-west-2:<producer-account-id>:catalog/redshiftserverless1-uswest2/ordersdb).
  - Under Access from engines, select Access this catalog from Apache Iceberg compatible engines.
  - For IAM role, provide the Redshift-S3 data transfer role that you created in the prerequisites.
  - Choose Next.
- On the Grant permissions – optional page, choose Add permissions.
- Grant the Admin user Super user permissions for Catalog permissions and Grantable permissions.
- Choose Add and then choose Next.
- Review the details on the Review and create page and choose Create catalog.
Wait a few seconds for the catalog to show up.
- In the navigation pane, choose Catalogs.
- Verify that rl_link_container_ordersdb is created.
Create a database under rl_link_container_ordersdb
Complete the following steps:
- On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
- On the Choose catalog dropdown menu, choose rl_link_container_ordersdb.
- Choose Create database.
Alternatively, you can choose the Create dropdown menu and then choose Database.
- Provide details for the database:
  - For Name, enter a name (for this post, public_db).
  - For Catalog, choose rl_link_container_ordersdb.
  - Leave Location – optional blank.
  - Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
  - Choose Create database.
- Choose Catalogs in the navigation pane to verify that public_db is created under rl_link_container_ordersdb.
Create a table resource link for the shared federated catalog table
A resource link to a shared federated catalog table can reside only inside the database of a catalog link container. A resource link for such tables will not work if created inside the default catalog. For more details on resource links, refer to Creating a resource link to a shared Data Catalog table.
Complete the following steps to create a table resource link:
- On the Lake Formation console, under Data Catalog in the navigation pane, choose Tables.
- On the Create dropdown menu, choose Resource link.
- Provide details for the table resource link:
  - For Resource link name, enter a name (for this post, rl_orderstbl).
  - For Destination catalog, choose rl_link_container_ordersdb.
  - For Database, choose public_db.
  - For Shared table's region, choose US West (Oregon).
  - For Shared table, choose orderstbl.
  - After the shared table is selected, Shared table's database and Shared table's catalog ID should be populated automatically.
  - Choose Create.
- In the navigation pane, choose Databases to verify that rl_orderstbl is created under public_db, inside rl_link_container_ordersdb.
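Resource links can also be created programmatically. The following is a sketch of the Glue `create_table` request for rl_orderstbl; the producer account ID is a placeholder, and the exact addressing of databases inside a catalog link container should be verified for your setup:

```python
# Sketch: creating a table resource link via the Glue CreateTable API.
# A resource link is a table whose TargetTable points at the shared table.
# PRODUCER_ACCOUNT_ID is a placeholder; verify the CatalogId/DatabaseName
# addressing for catalog link containers in your environment.
PRODUCER_ACCOUNT_ID = "222222222222"

create_link_request = {
    # Database inside the catalog link container, assumed here to be
    # addressed as "<container>/<database>".
    "DatabaseName": "rl_link_container_ordersdb/public_db",
    "TableInput": {
        "Name": "rl_orderstbl",
        "TargetTable": {
            "CatalogId": f"{PRODUCER_ACCOUNT_ID}:redshiftserverless1-uswest2/ordersdb",
            "DatabaseName": "public",
            "Name": "orderstbl",
        },
    },
}

# import boto3
# boto3.client("glue").create_table(**create_link_request)
print(create_link_request["TableInput"]["Name"])
```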
Create a database resource link for the shared default catalog database
Now we create a database resource link in the default catalog to query the Amazon S3 based Iceberg table shared from the producer. For details on database resource links, refer to Creating a resource link to a shared Data Catalog database.
Although the shared database is visible in the consumer's default catalog, a resource link is required to query it from analytics engines such as Athena, Amazon EMR, and AWS Glue. When using AWS Glue with Lake Formation tables, the resource link must be named identically to the source account's resource. For additional details on using AWS Glue with Lake Formation, refer to Considerations and limitations.
Complete the following steps to create a database resource link:
- On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
- On the Choose catalog dropdown menu, choose the account ID that represents the default catalog.
- Search for customerdb.
You should see the shared database customerdb with the Owner account ID matching your producer account ID.
- Select customerdb, and on the Create dropdown menu, choose Resource link.
- Provide details for the resource link:
  - For Resource link name, enter a name (for this post, customerdb).
  - The rest of the fields should already be populated.
  - Choose Create.
- In the navigation pane, choose Databases and verify that customerdb is created under the default catalog. Resource link names will show in italicized font.
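A database resource link can be created the same way with the Glue `create_database` API. A sketch of the request, with the producer account ID as a placeholder:

```python
# Sketch: creating the customerdb database resource link in the
# consumer's default catalog. PRODUCER_ACCOUNT_ID is a placeholder.
PRODUCER_ACCOUNT_ID = "222222222222"

create_db_link_request = {
    "DatabaseInput": {
        # The name must match the source database name for AWS Glue jobs.
        "Name": "customerdb",
        "TargetDatabase": {
            "CatalogId": PRODUCER_ACCOUNT_ID,
            "DatabaseName": "customerdb",
        },
    },
}

# import boto3
# boto3.client("glue").create_database(**create_db_link_request)
print(create_db_link_request["DatabaseInput"]["Name"])
```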
Verify access as Admin using Athena
Now you can verify your access using Athena. Complete the following steps:
- Open the Athena console.
- Make sure an S3 bucket is provided to store the Athena query results. For details, refer to Specify a query result location using the Athena console.
- In the navigation pane, verify both the default catalog and federated catalog tables by previewing them.
- You can also run a join query as follows. Pay attention to the three-point notation for referring to the tables from two different catalogs:
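The query snippet itself is not reproduced in this excerpt. A join along the following lines illustrates the multi-part notation; the catalog and table names follow this post's setup, while the join key and the market column are assumptions about the sample datasets:

```python
# Illustrative Athena join across the federated catalog and the default
# catalog. Column names (order_id, market) are assumptions about the
# sample datasets; adjust to match your schema.
join_query = """
SELECT o.market, COUNT(*) AS total_orders
FROM "redshiftserverless1-uswest2/ordersdb"."public"."orderstbl" o
JOIN "customerdb"."returnstbl_iceberg" r
  ON o.order_id = r.order_id
GROUP BY o.market
ORDER BY total_orders DESC
"""
print(join_query.strip())
```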
This verifies the new capability of SageMaker Lakehouse, which enables accessing Redshift cluster tables and Amazon S3 based Iceberg tables in the same query, across AWS accounts, through the Data Catalog, using Lake Formation permissions.
Grant permissions to Glue-execution-role
Now we will share the resources from the producer account with additional IAM principals in the consumer account. Usually, the data lake admin grants permissions to data analysts, data scientists, and data engineers in the consumer account to do their job functions, such as processing and analyzing the data.
We set up Lake Formation permissions on the catalog link container, databases, tables, and resource links to the AWS Glue job execution role Glue-execution-role
that we created in the prerequisites.
Resource links allow only Describe and Drop permissions. You need to use the Grant on target configuration to provide database Describe and table Select permissions.
Complete the following steps:
- On the Lake Formation console, choose Data permissions in the navigation pane.
- Choose Grant.
- Under Principals, select IAM users and roles.
- For IAM users and roles, enter Glue-execution-role.
- Under LF-Tags or catalog resources, select Named Data Catalog resources.
- For Catalogs, choose rl_link_container_ordersdb and the consumer account ID, which indicates the default catalog.
- Under Catalog permissions, select Describe for Catalog permissions.
- Choose Grant.
- Repeat these steps for the catalog rl_link_container_ordersdb:
  - On the Databases dropdown menu, choose public_db.
  - Under Database permissions, select Describe.
  - Choose Grant.
- Repeat these steps again, but after choosing rl_link_container_ordersdb and public_db, on the Tables dropdown menu, choose rl_orderstbl:
  - Under Resource link permissions, select Describe.
  - Choose Grant.
- Repeat these steps to grant additional permissions to Glue-execution-role:
  - Grant Describe permissions on the default catalog databases public and customerdb.
  - Grant Describe permission on the resource link customerdb.
  - Grant Select permission on the tables returnstbl_iceberg and orderstbl.
The following screenshots show the configuration for database public and customerdb permissions.
The following screenshots show the configuration for resource link customerdb permissions.
The following screenshots show the configuration for table returnstbl_iceberg permissions.
The following screenshots show the configuration for table orderstbl permissions.
- In the navigation pane, choose Data permissions and verify the permissions granted to Glue-execution-role.
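The "grant on target" pattern described above can be sketched as two Lake Formation API calls: Describe on the resource link (a consumer-account object), and Select on the underlying shared table (a producer-account object). Account IDs below are placeholders:

```python
# Sketch of the "grant on target" pattern for Glue-execution-role:
# resource links accept only DESCRIBE/DROP, so data permissions such as
# SELECT are granted on the target table. Account IDs are placeholders.
CONSUMER_ACCOUNT_ID = "111111111111"
PRODUCER_ACCOUNT_ID = "222222222222"
principal = {
    "DataLakePrincipal": f"arn:aws:iam::{CONSUMER_ACCOUNT_ID}:role/Glue-execution-role"
}

# DESCRIBE on the customerdb database resource link in the consumer catalog
describe_on_link = {
    "Principal": principal,
    "Resource": {"Database": {"CatalogId": CONSUMER_ACCOUNT_ID, "Name": "customerdb"}},
    "Permissions": ["DESCRIBE"],
}

# SELECT on the underlying shared table ("grant on target")
select_on_target = {
    "Principal": principal,
    "Resource": {
        "Table": {
            "CatalogId": PRODUCER_ACCOUNT_ID,
            "DatabaseName": "customerdb",
            "Name": "returnstbl_iceberg",
        }
    },
    "Permissions": ["SELECT"],
}

# import boto3
# lf = boto3.client("lakeformation")
# lf.grant_permissions(**describe_on_link)
# lf.grant_permissions(**select_on_target)
print(describe_on_link["Permissions"], select_on_target["Permissions"])
```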
Run a PySpark job in AWS Glue 5.0
Download the PySpark script LakeHouseGlueSparkJob.py. This AWS Glue PySpark script runs Spark SQL that joins the producer-shared federated orderstbl table and the Amazon S3 based returnstbl_iceberg table in the consumer account to analyze the data and identify the total orders placed per market.
Replace <consumer-account-id> in the script with your consumer account ID. Complete the following steps to create and run an AWS Glue job:
- On the AWS Glue console, in the navigation pane, choose ETL jobs.
- Choose Create job, then choose Script editor.
- For Engine, choose Spark.
- For Options, choose Start fresh.
- Choose Upload script.
- Browse to the location where you downloaded and edited the script, select the script, and choose Open.
- On the Job details tab, provide the following information:
  - For Name, enter a name (for this post, LakeHouseGlueSparkJob).
  - Under Basic properties, for IAM role, choose Glue-execution-role.
  - For Glue version, select Glue 5.0.
  - Under Advanced properties, for Job parameters, choose Add new parameter.
  - Add the parameters --datalake-formats = iceberg and --enable-lakeformation-fine-grained-access = true.
- Save the job.
- Choose Run to execute the AWS Glue job, and wait for the job to complete.
- Review the job run details in the Output logs section.
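The same job definition can be created via the Glue API. A sketch of the `create_job` request, with the script's S3 location as a placeholder:

```python
# Sketch of the Glue 5.0 job definition used in this post.
# The ScriptLocation S3 path is a placeholder.
create_job_request = {
    "Name": "LakeHouseGlueSparkJob",
    "Role": "Glue-execution-role",
    "GlueVersion": "5.0",
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/LakeHouseGlueSparkJob.py",
        "PythonVersion": "3",
    },
    "DefaultArguments": {
        "--datalake-formats": "iceberg",
        "--enable-lakeformation-fine-grained-access": "true",
    },
}

# import boto3
# glue = boto3.client("glue")
# glue.create_job(**create_job_request)
# glue.start_job_run(JobName="LakeHouseGlueSparkJob")
print(create_job_request["GlueVersion"])
```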
Clean up
To avoid incurring costs on your AWS accounts, clean up the resources you created:
- Delete the Lake Formation permissions, catalog link container, database, and tables in the consumer account.
- Delete the AWS Glue job in the consumer account.
- Delete the federated catalog, database, and table resources in the producer account.
- Delete the Redshift Serverless namespace in the producer account.
- Delete the S3 buckets you created as part of data transfer in both accounts and the Athena query results bucket in the consumer account.
- Clean up the IAM roles you created for the SageMaker Lakehouse setup as part of the prerequisites.
Conclusion
In this post, we illustrated how to bring your existing Redshift tables to SageMaker Lakehouse and share them securely with external AWS accounts. We also showed how to query the shared data warehouse and data lakehouse tables in the same Spark session, from a recipient account, using Spark in AWS Glue 5.0.
We hope you find this useful to integrate your Redshift tables with an existing data mesh and access the tables using AWS Glue Spark. Test this solution in your accounts and share feedback in the comments section. Stay tuned for more updates and feel free to explore the features of SageMaker Lakehouse and AWS Glue versions.
Appendix: Table creation
Complete the following steps to create a returns table in the Amazon S3 based default catalog and an orders table in Amazon Redshift:
- Download the CSV format datasets orders and returns.
- Upload them to your S3 bucket under the corresponding table prefix path.
- Use the following SQL statements in Athena. First-time users of Athena should refer to Specify a query result location.
- Create an Iceberg format table in the default catalog and insert data from the CSV format table:
- To create the orders table in the Redshift Serverless namespace, open the Query Editor v2 on the Amazon Redshift console.
- Connect to the default namespace using your database admin user credentials.
- Run the following commands in the SQL editor to create the database ordersdb and the table orderstbl in it, and copy the data from the S3 location of the orders data into orderstbl:
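The SQL statements themselves are not included in this excerpt. The following sketches collect both sets of statements as strings; the bucket paths, column definitions, and the COPY IAM role are assumptions about the sample CSV datasets and should be adjusted to match your data:

```python
# Sketches of the appendix SQL. Column names/types, S3 paths, and the
# IAM role ARN are assumptions; adjust to match your datasets.

# Athena: create an Iceberg table in customerdb from a CSV-backed
# staging table (assumed here to be named returnstbl_csv).
athena_iceberg_ddl = """
CREATE TABLE customerdb.returnstbl_iceberg
WITH (table_type = 'ICEBERG',
      location = 's3://your-bucket/returnstbl_iceberg/',
      format = 'PARQUET')
AS SELECT * FROM customerdb.returnstbl_csv
"""

# Redshift Query Editor v2: database, table, and COPY from S3.
redshift_sql = """
CREATE DATABASE ordersdb;

CREATE TABLE public.orderstbl (
    order_id   VARCHAR(32),
    market     VARCHAR(32),
    order_date DATE
);

COPY public.orderstbl
FROM 's3://your-bucket/orders/'
IAM_ROLE 'arn:aws:iam::222222222222:role/redshift-s3-role'
FORMAT CSV IGNOREHEADER 1;
"""
print("sketch statements defined")
```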
About the Authors
Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lakehouse solutions, and establishes best practices for data governance.
Subhasis Sarkar is a Senior Data Engineer with Amazon. Subhasis thrives on solving complex technological challenges with innovative solutions. He specializes in AWS data architectures, particularly data mesh implementations using AWS CDK components.