
How to Read Unity Catalog Tables in Snowflake, in 4 Easy Steps


Databricks pioneered the open data lakehouse architecture and has been at the forefront of format interoperability. We’re excited to see more platforms adopt the lakehouse architecture and embrace interoperable formats and standards. Interoperability lets customers reduce expensive data duplication by using a single copy of data with their choice of analytics and AI tools. In particular, a common pattern among our customers is to use Databricks’ best-in-class ETL price/performance for upstream data and then access that data from BI and analytics tools such as Snowflake.

 

Unity Catalog is a unified and open governance solution for data and AI assets. A key feature of Unity Catalog is its implementation of the Iceberg REST Catalog APIs. This makes it simple to use an Iceberg-compliant reader without having to manually refresh your metadata location. 

 

In this blog post, we will cover why the Iceberg REST catalog is useful and walk through an example of how to read Unity Catalog tables in Snowflake.

 

Note: This functionality is available across cloud providers. The following instructions are specific to AWS S3, but it is possible to use other object storage platforms such as Azure Data Lake Storage (ADLS) or Google Cloud Storage (GCS).

 

Image depicts the architecture that lets you 1. Write a Delta table in Unity Catalog, 2. Create an Iceberg table with a catalog integration in Snowflake, and 3. Read a Unity-Catalog managed table as Iceberg in Snowflake

 

Iceberg REST API Catalog Integration

 

Apache Iceberg™  maintains atomicity and consistency by creating new metadata files for each table change. This ensures that incomplete writes do not corrupt an existing metadata file. The Iceberg catalog tracks the new metadata per write. However, not all engines can connect to every Iceberg catalog, forcing customers to manually keep track of the new metadata file location.

 

Iceberg solves interoperability across engines and catalogs with the Iceberg REST Catalog API. The Iceberg REST catalog is a standardized, open API specification that provides a unified interface to Iceberg catalogs, decoupling catalog implementations from clients.

 

Unity Catalog has implemented the Iceberg REST Catalog APIs since the launch of Universal Format (UniForm) in 2023. Unity Catalog exposes the latest table metadata, guaranteeing interoperability with any Iceberg client compatible with the Iceberg REST Catalog, such as Apache Spark™, Trino, and Snowflake. Unity Catalog’s Iceberg REST Catalog endpoints also extend governance and Delta Lake table features, such as Change Data Feed, to those clients.

 

Snowflake’s REST API catalog integration lets you connect to Unity Catalog’s Iceberg REST APIs to retrieve the latest metadata file location. This means that with Unity Catalog, you can read tables directly in Snowflake as if they were Iceberg.

 

Note: As of writing, Snowflake’s support of the Iceberg REST Catalog is in Public Preview. However, Unity Catalog’s Iceberg REST APIs are Generally Available.

 

There are four steps to reading a Unity Catalog table in Snowflake with a REST catalog integration:

  1. Enable UniForm on a Delta Lake table in Databricks to generate Iceberg metadata
  2. Register Unity Catalog in Snowflake as your catalog
  3. Register an S3 Bucket in Snowflake so it recognizes the source data
  4. Create an Iceberg table in Snowflake so you can query your data

 

Getting Started

 

We’ll start in Databricks, with our Unity Catalog-managed table, and we’ll ensure it can be read as Iceberg. Then, we’ll move to Snowflake to complete the remaining steps.

 

Before we start, there are a few components needed:

  • A Databricks account with Unity Catalog (This is enabled by default for new workspaces)
  • An AWS S3 bucket and IAM privileges
  • A Snowflake account that can access your Databricks instance and S3

 

Unity Catalog namespaces follow a catalog_name.schema_name.table_name format. In the example below, we’ll use uc_catalog_name.uc_schema_name.uc_table_name for our Databricks table. 
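
For reference, here is what a fully qualified reference to this table looks like in a Databricks SQL query (a minimal example using the names above):

-- Three-level namespace: catalog.schema.table
SELECT * FROM uc_catalog_name.uc_schema_name.uc_table_name LIMIT 10;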

 

Step 1: Enable UniForm on a Delta table in Databricks

 

In Databricks, you can enable UniForm on a Delta Lake table. By default, new tables are managed by Unity Catalog. Full instructions are available in the UniForm documentation but are also included below.

For a new table, you can enable UniForm during table creation in your workspace:

CREATE TABLE uc_table_name(c1 INT) TBLPROPERTIES(
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);

 

If you have an existing table, you can do this via an ALTER TABLE command:

ALTER TABLE uc_table_name SET TBLPROPERTIES(
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);

You can confirm that a Delta table has UniForm enabled in the Catalog Explorer, under the Details tab, where the Iceberg metadata location is shown. It should look something like this:

Image shows a screenshot of the Catalog Explorer UI
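
If you prefer to verify from a SQL editor rather than the UI, inspecting the table properties should also show the UniForm settings (a quick check using our example table name):

-- The output should include 'delta.universalFormat.enabledFormats' = 'iceberg'
SHOW TBLPROPERTIES uc_table_name;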

 

Step 2: Register Unity Catalog in Snowflake

 

While still in Databricks, create a service principal from the workspace admin settings and generate the accompanying secret and client ID. Instead of a service principal, you can also authenticate with personal tokens for debugging and testing purposes, but we recommend using a service principal for development and production workloads. From this step, you will need your <deployment-name> and the values for your OAuth <client-id> and <secret> so you can authenticate the integration in Snowflake.

 

Now switch over to your Snowflake account.

 

Note: There are a few naming differences between Databricks and Snowflake that may be confusing:

  • A “catalog” in Databricks is a “warehouse” in the Snowflake Iceberg catalog integration configuration.
  • A “schema” in Databricks is a “catalog_namespace” in the Snowflake Iceberg catalog integration.

 

You’ll see in the example below that the CATALOG_NAMESPACE value is uc_schema_name from our Unity Catalog table. 

 

In Snowflake, create a catalog integration for the Iceberg REST catalog, as shown below:

CREATE OR REPLACE CATALOG INTEGRATION unity_catalog_int_oauth
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'uc_schema_name'
  REST_CONFIG = (
    CATALOG_URI = 'https://<deployment-name>.cloud.databricks.com/api/2.1/unity-catalog/iceberg'
    WAREHOUSE = 'uc_catalog_name'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_TOKEN_URI = 'https://<deployment-name>.cloud.databricks.com/oidc/v1/token'
    OAUTH_CLIENT_ID = '<client-id>'
    OAUTH_CLIENT_SECRET = '<secret>'
    OAUTH_ALLOWED_SCOPES = ('all-apis', 'sql')
  )
  ENABLED = TRUE
  REFRESH_INTERVAL_SECONDS = '<refresh-interval-in-seconds>';

 

The REST API catalog integration also unlocks time-based automatic refresh. With automatic refresh, Snowflake polls Unity Catalog for the latest metadata location on the time interval defined for the catalog integration. However, automatic refresh is incompatible with manual refresh, so after a table update you may need to wait up to a full refresh interval before the change is visible in Snowflake. The REFRESH_INTERVAL_SECONDS parameter configured on the catalog integration applies to all Snowflake Iceberg tables created with that integration and cannot be customized per table.
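
To double-check the integration and its configured refresh interval after creating it, you can describe it in Snowflake (a quick sanity check):

-- Lists the integration's properties, including REFRESH_INTERVAL_SECONDS
DESCRIBE CATALOG INTEGRATION unity_catalog_int_oauth;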

 

Step 3: Register your S3 Bucket in Snowflake

 

In Snowflake, configure an external volume for Amazon S3. This involves creating an IAM role in AWS, configuring the role’s trust policy, and then creating an external volume in Snowflake using the role’s ARN.

 

For this step, you’ll use the same S3 bucket that Unity Catalog is pointed to.
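
If you’re not sure which bucket that is, one way to check from Databricks is to inspect the table’s storage location (a minimal sketch using our example names; the Location field in the output shows the S3 path):

-- The Location row shows where Unity Catalog stores this table's data
DESCRIBE TABLE EXTENDED uc_catalog_name.uc_schema_name.uc_table_name;

Then create the external volume: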

CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'my-s3-us-west-2'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://<bucket-name>/'
            STORAGE_AWS_ROLE_ARN = '<iam-role-arn>'
            STORAGE_AWS_EXTERNAL_ID = '<external-id>'
         )
      );

 

Step 4: Create an Apache Iceberg™  table in Snowflake

 

In Snowflake, create an Iceberg table with the previously created catalog integration and external volume to connect to the Delta Lake table. You can choose the name for your Iceberg table in Snowflake; it does not need to match the Delta Lake table in Databricks.

 

Note: The correct mapping for the CATALOG_TABLE_NAME in Snowflake is the Databricks table name. In our example, this is uc_table_name. You do not need to specify the catalog or schema at this step, because they were already specified in the catalog integration. 

 

CREATE OR REPLACE ICEBERG TABLE <snowflake_table_name>
  EXTERNAL_VOLUME = 'iceberg_external_volume'
  CATALOG = 'unity_catalog_int_oauth'
  CATALOG_TABLE_NAME = 'uc_table_name'
  AUTO_REFRESH = TRUE;

 

The example above enables auto-refresh by setting AUTO_REFRESH = TRUE, which uses the time interval defined on the catalog integration; omit the parameter if you prefer to refresh manually. Note that if auto-refresh is enabled, manual refresh is disabled.

 

You can now read the Delta Lake table in Snowflake as an Iceberg table.
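
For example, a simple query against the new Iceberg table (using the placeholder table name from the command above):

-- Read the Unity Catalog-managed table through Snowflake
SELECT * FROM <snowflake_table_name> LIMIT 10;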

 

Finishing Up: Test the Connection

 

In Databricks, update the Delta table data by inserting a new row.
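
For example, with the single-column table created in Step 1, the insert might look like this (the value itself is arbitrary):

-- Add a test row to the UniForm-enabled Delta table
INSERT INTO uc_catalog_name.uc_schema_name.uc_table_name VALUES (1);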

 

If you previously enabled auto-refresh, the table will update automatically on the specified time interval. If you did not, you can refresh manually by running ALTER ICEBERG TABLE <snowflake_table_name> REFRESH.
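
A minimal verification sequence in Snowflake could look like the following (skip the manual refresh if auto-refresh is enabled):

-- Pull the latest metadata from Unity Catalog (manual refresh only)
ALTER ICEBERG TABLE <snowflake_table_name> REFRESH;

-- The newly inserted row should now be visible
SELECT * FROM <snowflake_table_name>;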

 

Note: If you previously enabled auto-refresh, you cannot run the manual refresh command and will need to wait for the next auto-refresh for the table to update.

 

Video Demo

If you would like a video tutorial, this video demonstrates how to bring these steps together to read Delta tables with UniForm in Snowflake.

 

We are thrilled by continued support for the lakehouse architecture. Customers no longer have to duplicate data, reducing cost and complexity. This architecture also allows customers to choose the right tool for the right workload.

 

The key to an open lakehouse is storing your data in an open format such as Delta Lake or Iceberg. Proprietary formats lock customers into an engine, while open formats give you flexibility and portability. No matter the platform, we encourage customers to always own their own data as the first step toward interoperability. In the coming months, we will continue to build features that make it simpler to manage an open data lakehouse with Unity Catalog.

 

 
