As organizations scale their Amazon Web Services (AWS) infrastructure, they frequently encounter challenges in orchestrating data and analytics workloads across multiple AWS accounts and AWS Regions. While a multi-account strategy is essential for organizational separation and governance, it creates complexity in maintaining secure data pipelines and managing fine-grained permissions, particularly when different teams manage resources in separate accounts.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the AWS Cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. With Amazon MWAA, you can use Apache Airflow to create workflows without having to manage the underlying infrastructure for scalability, availability, and security.
In this blog post, we demonstrate how to use Amazon MWAA for centralized orchestration, while distributing data processing and machine learning tasks across different AWS accounts and Regions for optimal performance and compliance.
Solution overview
Let’s consider an example of a global enterprise with teams distributed across different AWS Regions. Each team generates and processes valuable data that other teams often require for comprehensive insights and streamlined operations. In this post, we consider a scenario where the data processing team sits in one Region, the machine learning (ML) team sits in another Region, and a central team manages the tasks between the two.
To address this complex challenge of orchestrating dependent teams across geographic regions, we’ve designed a data pipeline that spans multiple AWS accounts across different AWS Regions and is centrally orchestrated using Amazon MWAA. This design enables seamless data flow between teams, making sure that each team has access to the necessary data from other AWS accounts and Regions while maintaining compliance and operational efficiency.
Here’s a high-level overview of the architecture:
- Centralized orchestration hub (Account A, us-east-1)
- Amazon MWAA serves as the central orchestrator, coordinating operations across all regional data pipelines.
- Regional data pipelines (Account B, two Regions)
- Region 1 (for example, us-east-1)
- Region 2 (for example, us-west-2)
This architecture maintains the concept of separate regional operations within Account B, with data processing in AWS Region 1 and ML in AWS Region 2. The central Amazon MWAA instance in Account A orchestrates these operations across AWS Regions, enabling different teams to work with the data they need. It enables scalability, automation, and streamlined data processing and ML workflows across multiple AWS environments.
Prerequisites
This solution requires two AWS accounts:
- Account A: Central managed account for the Amazon MWAA environment.
- Account B: Data processing and ML operations
- Primary Region: US East (N. Virginia) [us-east-1]: Data processing workloads
- Secondary Region: US West (Oregon) [us-west-2]: ML workloads
Step 1: Set up Account B (data processing and ML tasks)
Deploy the provided AWS CloudFormation template in us-east-1 and provide the Account A ID as input. This template creates the following three stacks:
- Stack in us-east-1: Creates the required roles for stackset execution.
- Second stack in us-east-1: Creates an Amazon Simple Storage Service (Amazon S3) bucket, S3 folders, and an AWS Glue job.
- Stack in us-west-2: Creates an S3 bucket, S3 folders, an Amazon SageMaker config file, a cross-account role, and an AWS Lambda function.
Collect stack outputs: After successful deployment, gather the following output values from the created stacks. These outputs will be used in subsequent steps of the setup process.
- From the us-east-1 stack:
- The value of SourceBucketName
- From the us-west-2 stack:
- The value of DestinationBucketName
- The value of CrossAccountRoleArn
Step 2: Set up Account A (central orchestration)
Deploy the CloudFormation template in us-east-1. Provide the value of CrossAccountRoleArn from the Account B setup as input. This template does the following:
- Deploys an Amazon MWAA environment
- Sets up an Amazon MWAA Execution role with a cross-account trust policy.
Step 3: Set up S3 CRR and bucket policies in Account B
Deploy the CloudFormation stack in us-east-1 to set up cross-Region replication (CRR) from the S3 data-processing bucket in us-east-1 to the ML pipeline bucket in us-west-2. Provide the values of SourceBucketName, DestinationBucketName, and AccountAId as input parameters.
This stack should be deployed after completing the Amazon MWAA setup. This sequence is necessary because you need to grant the Amazon MWAA execution role appropriate permissions to access both the source and destination buckets.
Step 4: Implement cross-account, cross-Region orchestration
IAM cross-account role in Account B
The stack in Step 2 created an AWS Identity and Access Management (IAM) role in Account B with a trust relationship that allows the Amazon MWAA execution role from Account A (the central orchestration account) to assume it. Additionally, this role is granted the necessary permissions to access AWS resources in both Regions of Account B.
This setup enables the Amazon MWAA environment in Account A to securely perform actions and access resources across different Regions in Account B, maintaining the principle of least privilege while allowing for flexible, cross-account orchestration.
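The trust policy attached to the cross-account role looks similar to the following sketch; the account ID and role name shown are placeholders for the values from your own environment:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<ACCOUNT_A_ID>:role/<MWAA_EXECUTION_ROLE_NAME>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```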
Airflow connection in Account A
To establish cross-account connections in Amazon MWAA:
Create a connection for us-east-1. Open the Airflow UI and navigate to Admin and then to Connections. Choose the plus (+) icon to add a new connection and enter the following details:
- Connection ID: Enter aws_crossaccount_role_conn_east1
- Connection type: Select Amazon Web Services.
- Extras: Add the cross-account role and Region name using the following code. Replace the placeholder with the cross-account role Amazon Resource Name (ARN) created while setting up Account B in Step 1, in Region 2 (us-west-2):
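For example, the Extras field might look like the following; the role ARN shown is a placeholder:

```json
{
  "role_arn": "arn:aws:iam::<ACCOUNT_B_ID>:role/<CROSS_ACCOUNT_ROLE_NAME>",
  "region_name": "us-east-1"
}
```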
Create a second connection for us-west-2.
- Connection ID: Enter aws_crossaccount_role_conn_west2
- Connection type: Select Amazon Web Services.
- Extras: Add the CrossAccountRoleArn and Region name using the following code:
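For example, the Extras field for the second connection might look like the following; again, the role ARN is a placeholder:

```json
{
  "role_arn": "arn:aws:iam::<ACCOUNT_B_ID>:role/<CROSS_ACCOUNT_ROLE_NAME>",
  "region_name": "us-west-2"
}
```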
By setting up these Airflow connections, Amazon MWAA can securely access resources in both us-east-1 and us-west-2, helping to ensure seamless workflow execution.
Implement cross-account workflows in Account A
Now that your environment is set up with the necessary IAM roles and Airflow connections, you can create data processing and ML workflows that span across accounts and Regions.
DAG 1: Cross-account data processing
The directed acyclic graph (DAG) depicted in the preceding figure demonstrates a cross-account data processing workflow using Amazon MWAA and AWS services.
To implement this DAG:
Here’s a description of its key operators:
- S3KeySensor: This sensor monitors a specified S3 bucket for the presence of a raw data file (raw/ml_train_data.csv). It uses a cross-account AWS connection (aws_crossaccount_role_conn_east1) to access the S3 bucket in a different AWS account. The sensor checks every 60 seconds and times out after 1 hour if the file is not detected.
- GlueJobOperator: This operator triggers an AWS Glue job (mwaa_glue_raw_to_transform) for data preprocessing. It passes the bucket name as a script argument to the AWS Glue job. Like the S3KeySensor, it uses the cross-account AWS connection to execute the AWS Glue job in the target account.
DAG 2: Cross-account and cross-Region ML
The DAG in the preceding figure demonstrates a cross-account machine learning workflow using Amazon MWAA and AWS services. It shows Airflow’s flexibility in enabling users to write custom operators for specific use cases, particularly for cross-account operations.
To implement this DAG:
Here’s a description of the custom operators and key components:
- CrossAccountSageMakerHook: This custom hook extends the SageMakerHook to enable cross-account access. It uses AWS Security Token Service (AWS STS) to assume a role in the target account, enabling seamless interaction with SageMaker across account boundaries.
- CrossAccountSageMakerTrainingOperator: Building on the CrossAccountSageMakerHook, this operator enables SageMaker training jobs to be executed in a different AWS account. It overrides the default SageMakerTrainingOperator to use the cross-account hook.
- S3KeySensor: Monitors a specified S3 bucket for the presence of training data, verifying that the required data is available before the machine learning workflow proceeds. It uses a cross-account AWS connection (aws_crossaccount_role_conn_west2) to access the S3 bucket in a different AWS account.
- SageMakerTrainingOperator: Uses the custom CrossAccountSageMakerTrainingOperator to initiate a SageMaker training job in the target account. The configuration for this job is dynamically loaded from an S3 bucket.
- LambdaInvokeFunctionOperator: Invokes a Lambda function named dagcleanup after the SageMaker training job completes. This can be used for post-processing or cleanup tasks.
Step 5: Schedule and verify the Airflow DAGs
- To schedule the DAGs, copy the Python scripts cross_account_data_processing_dag.py and cross_account_machine_learning_dag.py to the S3 location associated with Amazon MWAA in the central Account A. Go to the Amazon MWAA environment created in Account A (us-east-1), locate the linked S3 bucket, and upload the scripts to the dags folder.
- Upload the data file to the source bucket created in Account B (us-east-1), under the raw folder.
- Navigate to the Airflow UI.
- Locate your DAG in the DAGs tab. The DAG automatically syncs from Amazon S3 to the Airflow UI. Choose the toggle button to enable the DAGs.
- Trigger the DAG runs.

Best practices for cross-account integration
When implementing cross-account, cross-Region workflows with Amazon MWAA, consider the following best practices to help ensure security, efficiency, and maintainability.
- Secrets management: Use AWS Secrets Manager to securely store and manage sensitive information such as database credentials, API keys, or cross-account role ARNs. Rotate secrets regularly using Secrets Manager automatic rotation. For more information, see Using a secret key in AWS Secrets Manager for an Apache Airflow connection.
- Networking: Choose the appropriate networking solution (AWS Transit Gateway, VPC Peering, AWS PrivateLink) based on your specific requirements, considering factors such as the number of VPCs, security needs, and scalability requirements. Implement appropriate security groups and network ACLs to control traffic flow between connected networks.
- IAM role management: Follow the principle of least privilege when creating IAM roles for cross-account access.
- Error handling and retries: Implement robust error handling in your DAGs to manage cross-account access issues. Use Airflow’s retry mechanisms to handle transient failures in cross-account operations.
- Managing Python dependencies: Use a requirements.txt file to specify exact versions of required packages. Test your dependencies locally using the Amazon MWAA local runner before deploying to production. For more information, see Amazon MWAA best practices for managing Python dependencies.
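As an illustration of the secrets management practice, Amazon MWAA can be pointed at Secrets Manager by setting the following Airflow configuration options on the environment; the prefixes shown are common conventions, not requirements:

```
secrets.backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
secrets.backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
```

With this configuration, a secret named airflow/connections/aws_crossaccount_role_conn_east1 would be resolved as the Airflow connection of the same ID.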
Clean up
To avoid future charges, remove any resources you created for this solution.
- Empty the S3 buckets: Manually delete all objects within each bucket, verify they are empty, then delete the buckets themselves.
- Delete the CloudFormation stacks: Identify and delete the stacks associated with the architecture.
- Verify resource cleanup: Make sure that resources for Amazon MWAA, AWS Glue, SageMaker, Lambda, and other services have been deleted.
- Remove remaining resources: Delete any manually created IAM roles, policies, or security groups.
Conclusion
By using Airflow connections, custom operators, and features such as Amazon S3 cross-Region replication, you can create a sophisticated workflow that seamlessly operates across multiple AWS accounts and Regions. This approach allows for complex, distributed data processing and machine learning pipelines that can take advantage of resources spread across your entire AWS infrastructure. The combination of cross-account access, cross-Region replication, and custom operators provides a powerful toolkit for building scalable and flexible data workflows. As always, careful planning and adherence to security best practices are crucial when implementing these advanced multi-account, multi-Region architectures.
Ready to tackle your own cross-account orchestration challenges? Test this approach and share your experience in the comments section.
About the authors
Suba Palanisamy is a Senior Technical Account Manager helping customers achieve operational excellence using AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.
Anubhav Gupta is a Solutions Architect at AWS supporting enterprise greenfield customers, focusing on the financial services industry. He has worked with hundreds of customers worldwide building their cloud foundational environments and platforms, architecting new workloads, and creating governance strategy for their cloud environments. In his free time, he enjoys traveling and spending time outdoors.
Anusha Pininti is a Solutions Architect guiding enterprise greenfield customers through every stage of their cloud transformation, specializing in data analytics. She supports customers across various industries, helping them achieve their business objectives through cloud-based solutions. In her free time, Anusha loves to travel, spend time with family, and experiment with new dishes.
Sriharsh Adari is a Senior Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core area of expertise includes technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, watching TV shows, and playing Tabla.
Geetha Penmatsa is a Solutions Architect supporting enterprise greenfield customers through their cloud journey. She helps customers across various industries transform their business with the AWS Cloud. She has a background in data analytics and specializes in Amazon Connect Cloud contact center to help transform customer experience at scale. Outside work, Geetha loves to travel, ski, hike, and spend time with friends and family.