When serving machine learning models, the latency between requesting a prediction and receiving a response is one of the most critical metrics for the end user. Latency includes the time a request takes to reach the endpoint, be processed by the model, and return to the user. Serving models to users based in a different region can significantly increase both the request and response times. Imagine a company with a multi-region customer base that hosts and serves a model in a different region from the one where its customers are based. This geographic dispersion incurs higher egress costs when data is moved from cloud storage, and traffic that crosses the public internet is less secure than traffic over a peering connection between two virtual networks.
To illustrate the impact of latency across regions, a request from Europe to a U.S.-deployed model endpoint can add 100-150 milliseconds of network latency. In contrast, a U.S.-based request may add only 50 milliseconds, according to Azure's published network round-trip latency statistics.
This difference can significantly impact user experience for latency-sensitive applications. Moreover, a simple API call often involves additional networking processes—such as calls to a database, authentication services, or other microservices—which can further increase the total latency by 3 to 5 times. Deploying models in multiple regions ensures users are served from closer endpoints, reducing latency and providing faster, more reliable responses globally.
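As a rough back-of-the-envelope check, the figures above can be combined in a few lines of Python. This is only a sketch: it assumes the 3 to 5 times multiplier applies to the network round trip alone and ignores model compute time, and the function name is ours.

```python
def total_latency_range_ms(network_ms, multipliers=(3, 5)):
    """Return (low, high) total latency if downstream calls multiply the network latency."""
    low, high = multipliers
    return network_ms * low, network_ms * high


# Figures quoted in the text: about 50 ms within the U.S., 100-150 ms from Europe.
us_range = total_latency_range_ms(50)
eu_low = total_latency_range_ms(100)
eu_high = total_latency_range_ms(150)

print(f"U.S. request:   {us_range[0]}-{us_range[1]} ms")
print(f"Europe request: {eu_low[0]}-{eu_high[1]} ms")
```

Under these assumptions the cross-region request lands in the hundreds of milliseconds, which is where latency-sensitive applications start to feel slow.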
In this blog, a collaboration with Aimpoint Digital, we explore how Databricks supports multi-region model serving with Delta Sharing to help decrease latency for real-time AI use cases.
Approach
For multi-region model serving, Databricks workspaces in different regions are connected using Delta Sharing for seamless replication of data and AI objects from the primary region to the replica region. Delta Sharing offers three methods for sharing data: the Databricks-to-Databricks sharing protocol, the open sharing protocol, and customer-managed implementations using the open source Delta Sharing server. In this blog, we focus on the first option: Databricks-to-Databricks sharing. This method enables the secure sharing of data and AI assets between two Unity Catalog-enabled Databricks workspaces, making it ideal for sharing models between regions.
In the primary region, the data science team can continuously develop, test, and promote new models or updated versions of existing models, ensuring they meet specific performance and quality standards. With Delta Sharing and VPC peering in place, the model can be securely shared across regions without exposing the data or models to the public internet. This setup allows other regions to have read-only access, enabling them to use the models for batch inference or to deploy regional endpoints. The result is a multi-region model deployment that reduces latency, delivering faster responses to users no matter where they are located.
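To make the sharing step more concrete, the sketch below assembles the kind of SQL statements that Databricks-to-Databricks sharing typically involves: the primary region creates a share, adds the model, and grants it to a recipient, and the replica region mounts the share as a read-only catalog. All object names and the sharing identifier are hypothetical placeholders, and the exact syntax should be checked against the Databricks SQL reference for your workspace.

```python
def provider_statements(share, model, recipient, recipient_sharing_id):
    """SQL for the primary region (Region 1): share a registered model with a recipient."""
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD MODEL {model}",
        f"CREATE RECIPIENT IF NOT EXISTS {recipient} USING ID '{recipient_sharing_id}'",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


def recipient_statements(catalog, provider, share):
    """SQL for the replica region (Region 2): mount the share as a read-only catalog."""
    return [f"CREATE CATALOG IF NOT EXISTS {catalog} USING SHARE {provider}.{share}"]


# Hypothetical names, for illustration only.
provider_sql = provider_statements(
    share="ml_models_share",
    model="prod.ml.churn_model",
    recipient="region2_workspace",
    recipient_sharing_id="<region-2-sharing-identifier>",
)
recipient_sql = recipient_statements("shared_models", "region1_provider", "ml_models_share")

for statement in provider_sql + recipient_sql:
    print(statement + ";")
```

In a notebook, each statement could be passed to `spark.sql(...)` in the corresponding workspace.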
The reference architecture above illustrates that when a model version is registered to a shared catalog in the main region (Region 1), it is automatically shared within seconds to an external region (Region 2) using Delta Sharing through VPC peering.
After the model artifacts are shared across regions, the Databricks Asset Bundle (DAB) enables seamless and consistent deployment of the Deployment Workflow. It can be integrated with existing CI/CD tools like GitHub Actions, Jenkins, or Azure DevOps, allowing the deployment process to be reproduced effortlessly and in parallel with a simple command, ensuring consistency regardless of the region.
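As an illustration of how one command per region might look, the sketch below builds a `databricks bundle deploy` invocation for each bundle target. The target names `region1` and `region2` are assumptions; they would need to match the targets defined in the bundle.

```python
import shlex


def bundle_deploy_commands(targets):
    """Build one `databricks bundle deploy` command per bundle target (one target per region)."""
    return [["databricks", "bundle", "deploy", "-t", target] for target in targets]


# Hypothetical target names, one per regional workspace.
commands = bundle_deploy_commands(["region1", "region2"])
for command in commands:
    print(shlex.join(command))
```

A CI/CD job could hand each command list to `subprocess.run`, either sequentially or in parallel, so every region is deployed from the same bundle definition.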
The example deployment workflow above consists of three steps:
- The model serving endpoint is updated to the latest model version in the shared catalog.
- The model serving endpoint is evaluated using several test scenarios such as health checks, load testing, and other pre-defined edge cases. A/B testing is another viable option within Databricks, where endpoints can be configured to host multiple model variants. In this approach, a proportion of the traffic is routed to the challenger model (model B) and the remainder is sent to the champion model (model A); see traffic_config for more information. The results of the two models are then compared on production traffic to decide which one to keep serving.
- If the model serving endpoint fails the tests, it will be rolled back to the previous model version in the shared catalog.
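The control flow of these three steps can be sketched as follows. This is a minimal illustration, not the actual workflow code: `update_endpoint` and `run_tests` are hypothetical callables standing in for the tasks that update and test the real serving endpoint.

```python
def run_deployment(update_endpoint, run_tests, current_version, new_version):
    """Update the endpoint, test it, and roll back if the tests fail.

    Returns the model version left serving traffic.
    """
    update_endpoint(new_version)       # step 1: point the endpoint at the new version
    if run_tests():                    # step 2: health checks, load tests, edge cases
        return new_version
    update_endpoint(current_version)   # step 3: roll back to the previous version
    return current_version


# Demo with in-memory stand-ins: the tests fail, so the endpoint is rolled back.
served_versions = []
final_version = run_deployment(
    update_endpoint=served_versions.append,
    run_tests=lambda: False,
    current_version=3,
    new_version=4,
)
print(final_version, served_versions)
```

Passing the update and test steps as callables keeps the same logic reusable in every region, with only the workspace-specific clients changing.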
The deployment workflow described above is for illustrative purposes. The model deployment workflow’s tasks may vary based on the specific machine learning use case. For the remainder of this post, we discuss the Databricks features that enable multi-region model serving.
Databricks Model Serving Endpoints
Databricks Model Serving provides highly available, low-latency model endpoints to support mission-critical and high-performance applications. The endpoints are backed by serverless compute, which automatically scales up and down based on the workload. Databricks Model Serving endpoints are also highly resilient to failures when updating to a newer model version. If updating to a newer model version fails, the endpoint will continue handling live traffic requests by automatically reverting to the previous model version.
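The traffic split used for A/B testing is part of the endpoint configuration. The sketch below builds such a configuration as a plain dictionary. The field names (`served_entities`, `traffic_config`, `routes`, `served_model_name`, `traffic_percentage`) follow the serving endpoints REST API as we understand it and should be verified against the current API reference; the model name, versions, and workload settings are hypothetical.

```python
def ab_endpoint_config(model_name, champion_version, challenger_version, challenger_pct):
    """Build an endpoint config that splits traffic between champion and challenger."""
    if not 0 <= challenger_pct <= 100:
        raise ValueError("challenger_pct must be between 0 and 100")

    def entity(name, version):
        return {
            "name": name,
            "entity_name": model_name,
            "entity_version": str(version),
            "workload_size": "Small",          # assumed sizing, adjust per workload
            "scale_to_zero_enabled": True,     # assumed setting
        }

    return {
        "served_entities": [
            entity("champion", champion_version),
            entity("challenger", challenger_version),
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "champion", "traffic_percentage": 100 - challenger_pct},
                {"served_model_name": "challenger", "traffic_percentage": challenger_pct},
            ]
        },
    }


# Hypothetical model in the shared catalog: 10% of traffic goes to the challenger.
config = ab_endpoint_config("shared_models.ml.churn_model", 3, 4, challenger_pct=10)
print(config["traffic_config"])
```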
Delta Sharing
A key benefit of Delta Sharing is its ability to maintain a single source of truth, even when accessed by multiple environments across different regions. For instance, development pipelines in various environments can access read-only tables from the central data store, ensuring consistency and avoiding redundancy.
Additional advantages include centralized governance, the ability to share live data without replication, and freedom from vendor lock-in, thanks to Delta Sharing’s open protocol. This architecture also supports advanced use cases like data clean rooms and integration with the Databricks Marketplace.
AWS VPC Peering
AWS VPC Peering is a crucial networking feature that facilitates secure and efficient connectivity between virtual private clouds (VPCs). A VPC is a virtual network dedicated to an AWS account, providing isolation and control over the network environment. When a user establishes a VPC peering connection, they can route traffic between two VPCs using private IP addresses, making it possible for instances in either VPC to communicate as if they are on the same network.
When deploying Databricks workspaces across multiple regions, AWS VPC Peering plays a pivotal role. By connecting the VPCs of Databricks workspaces in different regions, VPC Peering ensures that data sharing and communication occur entirely within private networks. This setup significantly enhances security by avoiding exposure to the public internet and reduces egress costs associated with data transfer over the internet. In summary, AWS VPC Peering is not just about connecting networks; it’s about optimizing security and cost-efficiency in multi-region Databricks deployments.
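Because peered VPCs route to each other by private IP address, their CIDR blocks must not overlap. The sketch below is a simple pre-flight check using Python's standard `ipaddress` module; the CIDR ranges are hypothetical examples, not values from a real deployment.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# Hypothetical CIDR blocks for the two regional workspace VPCs.
region1_cidr = "10.0.0.0/16"
region2_cidr = "10.1.0.0/16"

if cidrs_overlap(region1_cidr, region2_cidr):
    print("CIDR blocks overlap: choose a different range before peering")
else:
    print("CIDR blocks are disjoint: peering routes can be added")
```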
Databricks Asset Bundles
A Databricks Asset Bundle (DAB) is a project-like structure that uses an infrastructure-as-code approach to help manage complicated machine learning use cases in Databricks. In multi-region model serving, the DAB is crucial for orchestrating model deployment to Databricks Model Serving endpoints via Databricks workflows across regions. Specifying each region’s Databricks workspace in the DAB’s databricks.yml streamlines the deployment of code (Python notebooks) and resources (jobs, pipelines, DS models) across regions. Additionally, DABs offer flexibility by allowing incremental updates and scalability, ensuring that deployments remain consistent and manageable even as the number of regions or model endpoints grows.
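A minimal databricks.yml with one target per region might look like the fragment below. This is an illustrative sketch only: the bundle name, target names, resource path, and workspace URLs are placeholders to be replaced with real values.

```yaml
# Illustrative databricks.yml: one target per regional workspace (placeholder values).
bundle:
  name: multi_region_model_serving

include:
  - resources/*.yml   # job and workflow definitions shared by every region

targets:
  region1:
    workspace:
      host: https://<region-1-workspace-url>
  region2:
    workspace:
      host: https://<region-2-workspace-url>
```

Adding a region then amounts to adding another entry under `targets`, while the resource definitions stay unchanged.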
Next Steps
- Showcase how different deployment strategies (A/B testing, Canary Deployment, etc.) can be implemented in DABs as part of the multi-region deployment.
- Use before-and-after performance metrics to show how latency was reduced by using this approach.
- Use a PoC to compare user satisfaction with a multi-region approach vs. a single-region approach.
- Ensure that multi-region data sharing and model serving comply with regional data protection laws (e.g., GDPR in Europe). Assess whether any legal considerations affect where data and models can be hosted.
Aimpoint Digital is a market-leading analytics firm at the forefront of solving the most complex business and economic challenges through data and analytical technology. From the integration of self-service analytics to implementing AI at scale and modernizing data infrastructure environments, Aimpoint Digital operates across transformative domains to improve the performance of organizations. Learn more by visiting: https://www.aimpointdigital.com/