terça-feira, abril 29, 2025
HomeBig DataAccess your existing data and resources through Amazon SageMaker Unified Studio, Part...

Access your existing data and resources through Amazon SageMaker Unified Studio, Part 1: AWS Glue Data Catalog and Amazon Redshift


Amazon SageMaker Unified Studio provides a unified environment for data, analytics, machine learning (ML), and AI workloads. Part of the next generation of Amazon SageMaker, SageMaker Unified Studio allows you to discover your data and put it to work using familiar AWS tools to complete end-to-end development workflows, including data analysis, data processing, model training, generative AI app development, and more, in a single governed environment. You can create or join projects to collaborate with your teams, share AI and analytics artifacts securely, and discover and use your data stored in different data sources through Amazon SageMaker Lakehouse.

This series of posts demonstrates how you can onboard and access existing AWS data sources using SageMaker Unified Studio. This post focuses on onboarding existing AWS Glue Data Catalog tables and database tables available in Amazon Redshift. Part 2 discusses using Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR.

This series primarily focuses on the UI experience. If you prefer script-based automation, refer to Bringing existing resources into Amazon SageMaker Unified Studio.

Access management with SageMaker Unified Studio

The SageMaker Unified Studio authorization model is a hierarchical access control list (ACL) based on the resource type such as a domain or a project. For example, at the domain level, a user might have a domain owner designation and at the project level, the user can be an owner or contributor. You can configure these profiles at AWS Identity and Access Management (IAM) user, single sign-on (SSO) user, and SSO group level.

Each project has a project role. When the user interacts with resources within SageMaker Unified Studio, it generates IAM session credentials based on the user’s effective profile in the specific project context, and then users can use tools such as Amazon Athena or Amazon Redshift to query the relevant data. The project owner can add or remove project members for their project, create publishing agreements with a domain, and publish assets to a domain.

SageMaker Unified Studio can be accessed by IAM users or SSO authenticated users, and IAM roles can interact with the SageMaker Unified Studio through its APIs.

Solution overview

AWS Lake Formation enables you to define fine-grained access control on the Data Catalog, where you can configure access at database, table, row, or column level or define permissions with tags. When setting up Lake Formation, you can configure it with hybrid access mode, where you get flexibility to selectively enable Lake Formation permissions for specific databases and tables, and continue using IAM permissions for others. SageMaker Unified Studio supports Lake Formation hybrid mode.

When you create a project in SageMaker Unified Studio, an AWS Glue database is added by default as part of the project. Assets published into that database don’t need any additional permissions, but if you want to publish or subscribe assets from an existing AWS Glue database, then you need to provide explicit permissions to SageMaker Unified Studio to be able to access the database and tables. For more details, see Configure Lake Formation permissions for Amazon SageMaker Unified Studio.

Let’s understand how we can access existing datasets through SageMaker Unified Studio.

Prerequisites

To run the instruction, you must complete the following prerequisites:

  • An AWS account
  • A SageMaker Unified Studio domain
  • A SageMaker Unified Studio project with All capabilities project profile

In the SageMaker Unified Studio, select the project and navigate to the Project overview page. Copy the Project role ARN as highlighted in the screenshot. This project role will be used further in the post to provide permissions on existing datasets and resources.

Use existing AWS Glue tables

This section has following prerequisites:

One extra prerequisite step is to revoke IAMAllowedPrincipals group permission on both database and table to enforce Lake Formation permission for access. For detailed instruction see Revoking permission using the Lake Formation console.

To access existing Data Catalog tables in SageMaker Unified Studio, complete the following steps:

  1. On the Lake Formation console using the data lake administrator, choose Data lake locations in the navigation pane and choose Register location.
  2. Enter the S3 prefix for Amazon S3 path.
  3. For IAM role, choose your Lake Formation data access IAM role, which is not a service linked role.
  4. Select Lake Formation for Permission mode and choose Register location.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.

  1. For IAM users and roles, choose the project role.
  2. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  3. For Databases, choose your existing Data Catalog database.

  1. For Database permissions, select Describe and choose Grant.

The next step is to grant the permission on the tables to the project role.

  1. On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  2. Select the existing Data Catalog database.
  3. From the Actions menu, choose Grant to grant permissions to the project role.
  4. For IAM users and roles, choose the project role.
  5. Select Named Data Catalog resources, and for Catalogs, choose the default catalog.
  6. For Databases, choose your Data Catalog database.
  7. For Tables, select the tables that you need to provide permission to the project role.

  1. For Table permissions, select Select and Describe.
  2. For Grantable permissions, select Select and Describe.
  3. Choose Grant.

You should revoke any existing permissions of IAMAllowedPrincipals on the databases and tables within Lake Formation.

Now let’s verify that we can access the existing AWS Glue table from the SageMaker Unified Studio Query Editor.

  1. In SageMaker Unified Studio, navigate to your project.
  2. On the project page, under Lakehouse, choose Data.
  3. Next to the Data Catalog table, choose the options menu (three dots), and choose Query with Athena.

SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, and Scala Spark. It also supports unified access across different compute runtimes such as Amazon Redshift and Athena for SQL, Amazon EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark. To access the data through the unified JupyterLab experience, complete the following steps:

  1. On the SageMaker Unified Studio project page, on the top menu, choose Build, and under IDE & APPLICATIONS, choose
  2. Wait for the space to be ready.
  3. Choose the plus sign and for Notebook, choose Python 3.
  4. In the notebook, switch the connection type to PySpark, choose spark.fineGrained, and query the existing Data Catalog table:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_sql = spark.sql("""
select * from retaildb.salesorders
""" )

df_sql.show()

Use existing Redshift clusters

This section has following prerequisites:

To bring in existing Redshift clusters, follow these steps:

  1. To use your provisioned Redshift cluster or a Redshift Serverless workgroup, add either of the following tags (key/value) to the resource:
    1. Add AmazonDataZoneProject: if you want to allow only a specific SageMaker Unified Studio project to access the Amazon Redshift resource. Replace with the ID of the project created in SageMaker Unified Studio.
    2. Add for-use-with-all-datazone-projects: true if you want to allow all SageMaker Unified Studio projects to access the Amazon Redshift resource.

  1. To add the compute connection in SageMaker Unified Studio, you can authenticate the cluster using either the user name and password of the database, IAM credentials, or AWS Secrets Manager. To provide the authentication using Secrets Manager, add either of the following tags. This will enable the existing secret to appear on the dropdown menu, while defining the connection in SageMaker Unified Studio.
    1. AmazonDataZoneProject:
    2. for-use-with-all-datazone-projects: true

In the following screenshot, you can see the tag configuration section within Secrets Manager settings for Redshift Serverless compute. To understand how to create a secret for a database in a Redshift cluster using Secrets Manager, refer to Managing Amazon Redshift admin passwords using AWS Secrets Manager.

  1. After the tags are applied, log in to SageMaker Unified Studio and choose the project.
  2. Go to the Compute section of your project, and on the Data warehouse tab, choose Add compute.
  3. Select Connect to existing compute resources.
  4. Choose the compute type: Amazon Redshift Provisioned cluster or Amazon Redshift Serverless.
  5. Configure the parameters by selecting the existing compute and authentication and choose Add compute.

The detailed walkthrough process is illustrated in the following screenshot.

Use Redshift tables with existing compute

This section has following prerequisites:

In this section, we illustrate steps to create a federated connection for an existing Amazon Redshift data source. You can register an existing Redshift provisioned cluster as well as Redshift Serverless with the Data Catalog using SageMaker Unified Studio. This creates a federated multi-level catalog and provides the ability to centrally manage permissions to the data with fine-grained access control using Lake Formation. By mounting Amazon Redshift data in the Data Catalog, you can query it using your preferred tools such as Athena or AWS Glue extract, transform, and load (ETL) without having to copy or move the data.

Create an Amazon Redshift managed VPC endpoint for Amazon Redshift

Amazon Redshift managed virtual private cloud (VPC) endpoints use AWS PrivateLink to allow one VPC to privately access resources in another VPC as if they were local to the same VPC. With an Amazon Redshift managed VPC endpoint, you can connect to your private Redshift cluster with the RA3 instance type or Redshift Serverless within your VPC.

In this section, we explain how to create an Amazon Redshift managed VPC endpoint for both Redshift Serverless and an Amazon Redshift provisioned cluster in a single account. The managed VPC endpoint needs to be created only if your Redshift provisioned or Redshift Serverless cluster is in a different VPC than the SageMaker Unified Studio domain VPC.

If the SageMaker Unified Studio domain account is in a different account, allow the additional AWS accounts to create cluster endpoints. For steps to authorize your Amazon Redshift provisioned or Redshift Serverless cluster to deploy endpoints in additional accounts and grant access to the cross-account VPC, refer to Granting access to a VPC.

Redshift Serverless

For Redshift Serverless, follow these instructions.

The common practice is to allow port 5439 (Amazon Redshift connectivity port) to the security group or CIDR range in which your consumption workloads run.

  1. In the security group associated with the Redshift cluster, add an inbound rule with Type as Redshift, Protocol as TCP, Port range as 5439 (Amazon Redshift connectivity port), and Source as the CIDR range in which your consumption workloads run.

  1. On the Amazon Redshift console of the workgroup, go to Redshift-managed VPC endpoints.
  2. Choose Create endpoint.
  3. In the Endpoint settings section, choose the VPC, associated private subnet, and security group created for the SageMaker Unified Studio domain account to deploy the endpoint against.

The following screenshot shows the Amazon Redshift managed VPC endpoint created for Redshift Serverless.

Redshift provisioned

For Amazon Redshift provisioned, follow these instructions:

  1. To implement an Amazon Redshift managed VPC endpoint for a provisioned cluster, you need to enable cluster relocation and create subnet groups. In the cluster subnet group, choose the VPC and subnets of the SageMaker Unified Studio domain account.
  2. On the Amazon Redshift console, choose Configurations in the navigation pane.
  3. Provide the endpoint details, then choose Create endpoint.

Create a federated connection for Amazon Redshift

Complete the following steps to create a federated catalog in the Data Catalog to query the data using various preferred analytics tools such as Athena, visual ETL in SageMaker Unified Studio, Amazon EMR, and more:

  1. On the SageMaker Unified Studio console, choose your project.
  2. Choose Data in the navigation pane.
  3. In the data explorer, choose the plus sign to add a data source.
  4. Under Add a data source, choose Add connection, then choose Amazon Redshift.
  5. Enter the following parameters in the connection details, and choose Add data.
    1. Name: Enter the connection name.
    2. Host: Enter the Amazon Redshift managed VPC endpoint.
    3. Port: Enter the port number (Amazon Redshift uses 5439 as the default port).
    4. Database: Enter the database name.
    5. Authentication: Choose either the database user name and password credentials or Secrets Manager.

After the connection is established, you will see that the federated catalog is created, as shown in the following screenshot. This catalog uses the AWS Glue connection to connect to Amazon Redshift. The databases, tables, and views are automatically cataloged in the Data Catalog and registered with Lake Formation.

With Athena, data analysts can run federated SQL queries to scan data from multiple data sources in-place without creating complex data pipelines or data replication.

Use existing Data Catalog tables and Amazon Redshift assets in the SageMaker Unified Studio business data catalog

You can use the SageMaker Unified Studio business data catalog to catalog the data across your organization with business context. To use Amazon SageMaker Catalog, you must bring your existing data assets into the inventory of your project. Follow the instructions in this section to bring your existing Data Catalogs and Amazon Redshift assets into the project inventory.

Add an existing Data Catalog to the project inventory

To enrich the asset with business context and share your assets outside your own project, you must first bring the metadata to SageMaker Catalog. To import the metadata of the assets into the project’s inventory, you need to create a data source in the project catalog.

  1. In SageMaker Unified Studio, navigate to the Project catalog page within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose AWS Glue (Lakehouse) for Data source type.
  6. For Data selection, choose the Database name and choose Next.
  7. Keep the rest as default and choose CREATE.
  8. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Add existing Redshift tables and views to the project inventory

Create a data source to bring in the existing Redshift tables and views to add to the project’s inventory:

  1. In SageMaker Unified Studio, navigate to the Project catalog within the project.
  2. Choose Data sources.
  3. Choose CREATE DATA SOURCE.
  4. For Name, provide the name of the data source.
  5. Choose Amazon Redshift for Data source type.
  6. For Connection, choose the name of the Redshift connection.
  7. For Database name, choose dev and for Schema, enter public.
  8. Keep the rest as default and choose CREATE.
  9. Choose RUN to import the metadata.

After the data source successfully completes its run, metadata of all the data assets gets added to the project’s inventory.

Conclusion

This post explained how you can access existing data and resources available in the Data Catalog and Amazon Redshift using SageMaker Unified Studio. SageMaker Unified Studio provides an integrated environment for analytics and AI. Being able to access existing datasets available in your AWS account helps reduce operational overhead because users of your organization can access a common interface, collaborate, and share datasets. It also brings in efficiency for administrators because they can manage permissions for domains and projects in a common place.

In the next post, we will demonstrate how you can onboard and access other existing data sources such as Amazon S3, Amazon RDS, DynamoDB, and Amazon EMR.


About the Authors

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance. She can be reached via LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is also the author of the book Serverless ETL and Analytics with AWS Glue. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Sakti Mishra is a Principal Data and AI Solutions Architect at AWS, where he helps customers modernize their data architecture and define end-to end-data strategies, including data security, accessibility, governance, and more. He is also the author of Simplify Big Data Analytics with Amazon EMR and AWS Certified Data Engineer Study Guide. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family. He can be reached via LinkedIn.

Daiyan Alamgir is a Principal Frontend Engineer on the Amazon SageMaker Unified Studio team based in New York.

Vipin Mohan is a Principal Product Manager at AWS, leading the launch of generative AI capabilities in Amazon SageMaker Unified Studio. He is committed to shaping impactful products by working backward from customer insights, championing user-focused solutions, and delivering scalable results.

Chanu Damarla is a Principal Product Manager on the Amazon SageMaker Unified Studio team. He works with customers around the globe to translate business and technical requirements into products that delight customers and enable them to be more productive with their data, analytics, and AI.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
Google search engine

Most Popular

Recent Comments