Organizations rely on Amazon EMR on EC2 clusters to process large-scale data workloads using frameworks like Apache Spark, Apache Hive, and Trino. Events such as TV advertisements or unplanned promotions might lead to an increase in demand for compute capacity, making effective capacity planning necessary so that your workloads don't hit capacity limits or fail.
A common scenario is to run daily Spark jobs on Amazon EMR using consistent Amazon Elastic Compute Cloud (Amazon EC2) instance types (for example, a single instance size and family for the cluster). Although this might work well to sustain the baseline, a spike can trigger auto scaling or force you to stop and relaunch a larger EMR cluster, and the specific On-Demand Instance pool might lack the capacity to meet that demand.
In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and using a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.
Solution overview
Instance fleets in Amazon EMR offer a flexible and robust way to manage EC2 instances within your cluster. This feature allows you to specify target capacities for On-Demand and Spot Instances, select up to five EC2 instance types per fleet (or 30 when using the AWS Command Line Interface [AWS CLI] and API with an allocation strategy), and use multiple subnets across different Availability Zones. Importantly, instance fleets support the use of ODCRs, enabling you to align your EMR clusters with pre-purchased EC2 capacity. You can configure your instance fleet to prefer or require capacity reservations, making sure that your EMR clusters use your reserved capacity efficiently.
EMR workload patterns typically fall into two categories: stable and variable (spiky). In the following sections, we explore how to optimize for each pattern using various options available with instance fleets, starting with stable workloads and then addressing variable workloads.
Stable workloads are workloads with a predictable pattern of resource utilization over time; for example, a pharmaceutical provider needs to process 21 TB of research data, patient records, and other information daily. The workload is consistent and needs to run reliably every day on long-running persistent clusters. For critical business operations requiring high reliability and guaranteed capacity, we recommend reserving the baseline capacity as part of your capacity planning. We demonstrate the following steps:
- Use AWS Cost and Usage Reports (AWS CUR) to estimate the baseline of existing workloads.
- Reserve the baseline capacity using ODCR.
- Configure Amazon EMR to use the targeted ODCR.
Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands. These surges can be triggered by various factors (such as batch processing, real-time data streaming, or seasonal business fluctuations) that trigger Amazon EMR to request more capacity to match the demand. We address the resource allocation by using instance and Availability Zone flexibility, with the following steps:
- Introduce EC2 instance flexibility with EMR instance fleets.
- Achieve resiliency through intelligent subnet selection with EMR instance fleets.
- Use managed scaling to automatically manage scaling in and out.
Stable workloads
In this section, we demonstrate how to define your baseline, configure AWS Identity and Access Management (IAM) permissions, create an ODCR, associate your reservations with a capacity reservation resource group, and configure Amazon EMR to use targeted ODCRs. You can opt for a mixed ODCR strategy; for example, one short-duration ODCR that supports the launch of your EMR cluster, and another longer-duration ODCR that supports your task nodes based on the baseline capacity reservation.
Estimate the baseline
Make sure to activate the AWS generated cost allocation tag aws:elasticmapreduce:job-flow-id. This enables the resource_tags_aws_elasticmapreduce_job_flow_id field in the AWS CUR to be populated with the EMR cluster ID, which is used by the SQL queries in this solution. To activate the cost allocation tag from the AWS Billing console, complete the following steps:
- On the AWS Billing and Cost Management console, choose Cost allocation tags in the navigation pane.
- Under AWS generated cost allocation tags, choose the aws:elasticmapreduce:job-flow-id tag.
- Choose Activate.
It can take up to 24 hours for tags to activate. For more information, refer to the AWS Billing documentation on activating AWS-generated cost allocation tags.
After the tags are activated, you can use AWS CUR and run the following query on Amazon Athena to find the compute resources used by a given EMR cluster over time. For more details, see Querying Cost and Usage Reports using Amazon Athena. Update the following query with your CUR table name, EMR cluster ID, desired timestamps, and AWS account ID, and run the query on Athena:
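The example below is a minimal sketch that assumes the standard CUR Athena column names; the table name, account ID, cluster ID, and time range are placeholders. It counts the EC2 instances tagged with your EMR cluster ID for each hour and instance type:

```sql
SELECT
    date_trunc('hour', line_item_usage_start_date) AS usage_hour,
    product_instance_type,
    COUNT(DISTINCT line_item_resource_id) AS instance_count  -- requires resource IDs in your CUR
FROM cur_database.cur_table                                  -- replace with your CUR database.table
WHERE line_item_usage_account_id = '111122223333'            -- replace with your AWS account ID
  AND line_item_product_code = 'AmazonEC2'
  AND line_item_line_item_type = 'Usage'
  AND product_instance_type <> ''
  AND resource_tags_aws_elasticmapreduce_job_flow_id = 'j-XXXXXXXXXXXXX'  -- replace with your EMR cluster ID
  AND line_item_usage_start_date BETWEEN TIMESTAMP '2024-06-01 00:00:00'
                                     AND TIMESTAMP '2024-12-01 00:00:00'  -- replace with your time range
GROUP BY 1, 2
ORDER BY 1, 2;
```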
As an example, the preceding query filters instance usage per hour for a given account and EMR cluster over a 6-month period to generate the following figure. You can export the results in CSV format and analyze the data. Now that you have a visual representation of your workload's baseline and bursts, you can define the strategy and configuration of your EMR cluster.
Create an ODCR to reserve the baseline capacity
ODCRs can be either open or targeted:
- With an open ODCR, new instances and existing instances that have matching attributes (such as instance type, platform, and Availability Zone) automatically run in the reserved capacity first.
- With a targeted ODCR, instances must match the attributes of the ODCR specification and the ODCR is specifically targeted at launch. This approach is recommended if you have multiple concurrent EMR clusters consuming capacity from the shared On-Demand pool of EC2 instances. EMR clusters larger than the targeted ODCR quantity will fall back to On-Demand Instances that are in the same Availability Zone.
In this example, we use a targeted ODCR with an EMR instance fleet in the us-east-1a Availability Zone. The following diagram illustrates the workflow.
Complete the following steps:
- Use the create-capacity-reservation AWS CLI command to create the ODCR and make a note of the CapacityReservationArn value in the output:
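The following example uses placeholder values (an r8g.8xlarge instance type, a count of 10, and the us-east-1a Availability Zone); size the reservation to the baseline you estimated from the CUR query:

```bash
# Create a targeted capacity reservation in the Availability Zone used by the EMR cluster
aws ec2 create-capacity-reservation \
  --instance-type r8g.8xlarge \
  --instance-platform Linux/UNIX \
  --availability-zone us-east-1a \
  --instance-count 10 \
  --instance-match-criteria targeted \
  --end-date-type unlimited
```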
We get the following output:
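The identifiers in this abbreviated example are placeholders:

```json
{
    "CapacityReservation": {
        "CapacityReservationId": "cr-0123456789abcdef0",
        "OwnerId": "111122223333",
        "CapacityReservationArn": "arn:aws:ec2:us-east-1:111122223333:capacity-reservation/cr-0123456789abcdef0",
        "InstanceType": "r8g.8xlarge",
        "AvailabilityZone": "us-east-1a",
        "TotalInstanceCount": 10,
        "AvailableInstanceCount": 10,
        "InstanceMatchCriteria": "targeted",
        "State": "active"
    }
}
```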
You can use Amazon CloudWatch to monitor ODCR usage and trigger an alert for unused capacity. For more details, see Monitor Capacity Reservations usage with CloudWatch metrics.
- Create a resource group named EMRSparkSteadyStateGroup and make a note of the GroupArn value in the output:
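The following example follows the documented pattern for a Capacity Reservation resource group, which only allows capacity reservations as members:

```bash
# Create a resource group that can hold EC2 capacity reservations
aws resource-groups create-group \
  --name EMRSparkSteadyStateGroup \
  --configuration '[{"Type":"AWS::EC2::CapacityReservationPool"},
                    {"Type":"AWS::ResourceGroups::Generic",
                     "Parameters":[{"Name":"allowed-resource-types",
                                    "Values":["AWS::EC2::CapacityReservation"]}]}]'
```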
We get the following output:
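The account ID in this abbreviated example is a placeholder:

```json
{
    "Group": {
        "GroupArn": "arn:aws:resource-groups:us-east-1:111122223333:group/EMRSparkSteadyStateGroup",
        "Name": "EMRSparkSteadyStateGroup"
    }
}
```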
- Use the following code to associate the capacity reservation with the resource group. You can have multiple capacity reservations associated with a resource group.
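The following example associates the placeholder reservation ARN from the earlier output with the group:

```bash
# Add the capacity reservation to the resource group
aws resource-groups group-resources \
  --group EMRSparkSteadyStateGroup \
  --resource-arns "arn:aws:ec2:us-east-1:111122223333:capacity-reservation/cr-0123456789abcdef0"
```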
- As a best practice for effective management and cleanup, create a Purpose=EMR-Spark-Steady-State tag on the newly created ODCR and the resource group.
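For example, assuming the placeholder reservation ID and group ARN used earlier, you can apply the tag as follows:

```bash
# Tag the capacity reservation
aws ec2 create-tags \
  --resources cr-0123456789abcdef0 \
  --tags Key=Purpose,Value=EMR-Spark-Steady-State

# Tag the resource group
aws resource-groups tag \
  --arn arn:aws:resource-groups:us-east-1:111122223333:group/EMRSparkSteadyStateGroup \
  --tags Purpose=EMR-Spark-Steady-State
```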
Implement Amazon EMR with ODCR
Complete the following steps to create an EMR cluster tagged with the specific targeted ODCR:
- Add the required permissions to the EMR service role before using capacity reservations. In the policy, you can lock down the resource to the specific Amazon Resource Name (ARN) of the resource group you created, as shown in the following code:
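The following example policy is a sketch based on the permissions documented for using capacity reservations with instance fleets (ec2:DescribeCapacityReservations and resource-groups:ListGroupResources); the resource group ARN is a placeholder, and you should verify the exact permission set against the current Amazon EMR documentation:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ec2:DescribeCapacityReservations",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "resource-groups:ListGroupResources",
            "Resource": "arn:aws:resource-groups:us-east-1:111122223333:group/EMRSparkSteadyStateGroup"
        }
    ]
}
```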
- Configure the EMR cluster to use the ODCR with instance fleets. We use the CapacityReservationOptions parameter to configure the EMR cluster, as shown in the following example:
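The following sketch defines a minimal core fleet (the fleet name, instance type, and target capacity are placeholders) that consumes the targeted reservation first through CapacityReservationOptions and falls back to lowest-price On-Demand capacity when the reservation is exhausted:

```json
[
    {
        "Name": "CoreFleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 10,
        "InstanceTypeConfigs": [
            { "InstanceType": "r8g.8xlarge" }
        ],
        "LaunchSpecifications": {
            "OnDemandSpecification": {
                "AllocationStrategy": "lowest-price",
                "CapacityReservationOptions": {
                    "UsageStrategy": "use-capacity-reservations-first",
                    "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:us-east-1:111122223333:group/EMRSparkSteadyStateGroup"
                }
            }
        }
    }
]
```

You can pass a file such as this to aws emr create-cluster with --instance-fleets file://fleet-config.json, along with your usual release label, applications, and roles.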
The following step-by-step breakdown illustrates the Amazon EMR decision-making process when prioritizing targeted capacity reservations, from core node provisioning through task node allocation:
- Cluster provisioning initiation:
- The user chooses to override the lowest-price allocation strategy.
- The user specifies targeted capacity reservations in the launch request.
- Core node provisioning:
- Amazon EMR evaluates all EC2 instance capacity pools with targeted capacity reservations, and selects the pool with the lowest price that has sufficient capacity for all requested core nodes.
- If no pool with targeted reservations has sufficient capacity, Amazon EMR reevaluates all specified EC2 instance capacity pools and selects the lowest-priced pool with sufficient capacity for core nodes. Available open capacity reservations are applied automatically.
- Availability Zone selection:
- After the core capacity is acquired, Amazon EMR locks in the Availability Zone for your cluster.
- Primary and task node provisioning:
- Amazon EMR evaluates EC2 instance capacity pools within that Availability Zone for primary and task fleets. First, Amazon EMR evaluates all the pools with targeted ODCRs specified in the request, ordered by lowest price by default.
- From the ordered list, Amazon EMR launches as much capacity as possible from the unused targeted ODCRs of each instance pool until the request is fulfilled.
- If the unused targeted ODCRs don’t fulfill the request yet, Amazon EMR continues to launch the remaining capacity into On-Demand pools, in the lowest-price order by default.
For more details about the allocation strategy, refer to Allocation strategy for instance fleets or Amazon EMR Support for Targeted ODCR.
Spiky workloads
Spiky workloads are defined by unpredictable and often significant fluctuations in processing demands, triggered by factors such as infrequent but resource-intensive periodic batch processing jobs. For example, a geographic information system processes location data from millions of users in real time to provide up-to-date traffic information, calculate routes, and suggest points of interest. User location data is constantly being generated, but the volume can spike dramatically during rush hour or special events, as illustrated in the following figure. This graph shows the number of EC2 instances used per hour; it varies from 1 instance when the cluster scales in while waiting for jobs, to spikes of 1,000 nodes.
If you’re running spiky workloads with limited flexibility in instance type, family, and Availability Zone, you might face ICE errors when the available capacity can’t meet the cluster’s scaling requirements. To address this, we explore a set of best practices for EMR cluster creation to maximize availability and balance price-performance. Although spiky workloads present a unique challenge in resource management, configuring EMR instance fleets offers a powerful solution. By using diverse instance types, prioritized allocation strategies, Availability Zone flexibility, and managed scaling, organizations can create a robust, cost-effective infrastructure capable of handling unpredictable workload patterns. This configuration offers the following benefits:
- Improved availability – By diversifying instance types and using multiple Availability Zones, the cluster mitigates insufficient capacity issues
- Cost savings – Allocation strategies reduce costs while minimizing interruptions
- Resilience for spiky workloads – Prioritizing instance generations provides seamless scaling under varying demands
- Optimized performance – Managed scaling dynamically adjusts resources to meet workload demands efficiently
Introduce EC2 instance flexibility and instance fleets with a prioritized allocation strategy
Amazon EMR supports instance flexibility with instance fleet deployment. Instance fleets give you a wider variety of options and intelligence around instance provisioning. You can provide a list of up to 30 instance types with corresponding weighted capacities and Spot bid prices (including Spot blocks) using the AWS CLI or AWS CloudFormation, and Amazon EMR will automatically provision On-Demand and Spot capacity across these instance types when creating your cluster. This can make it more straightforward and more cost-effective to quickly obtain and maintain your desired capacity for your clusters.

In August 2024, Amazon EMR introduced the prioritized allocation strategy to enhance instance flexibility with instance fleets. This feature allows you to specify priority levels for your instance types, enabling Amazon EMR to allocate capacity to the highest-priority instances first. This strategy helps improve cost savings and reduces the time required to launch clusters, even in scenarios with limited capacity. For more details, see Amazon EMR support prioritized and capacity-optimized-prioritized allocation strategies for EC2 instances.

To maximize cost-efficiency and availability for spiky workloads, combine the price-performance advantages of new-generation instances with the broader availability of previous-generation instances. For workloads with strict latency requirements, fix the instance size to maintain consistent performance. This approach takes advantage of the strengths of both instance generations, providing flexibility and reliability while decreasing the likelihood of capacity constraints. For On-Demand nodes, choose the prioritized allocation strategy so the cluster tries to use newer-generation instances first. While configuring the instance fleet, arrange instances in a prioritized order reflecting price-performance and availability trade-offs, for example:
- Primary node – m8g.12xlarge > m8g.16xlarge > m7g.12xlarge > m7g.16xlarge
- Core node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
- Task node – r8g.8xlarge > r8g.12xlarge > r7g.8xlarge > r6g.16xlarge > r5.16xlarge
For Spot Instances, make sure the capacity-optimized-prioritized allocation strategy is selected to reduce interruptions. See the following CloudFormation template snippet as an example:
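The following sketch defines a task fleet as an AWS::EMR::InstanceFleetConfig resource; the cluster reference, target capacities, and timeout settings are placeholders, and the Priority values (lower numbers launch first) are assumed to follow the order shown in the preceding list:

```yaml
TaskInstanceFleet:
  Type: AWS::EMR::InstanceFleetConfig
  Properties:
    ClusterId: !Ref EMRCluster          # placeholder reference to your EMR cluster resource
    InstanceFleetType: TASK
    Name: TaskFleet
    TargetOnDemandCapacity: 2
    TargetSpotCapacity: 10
    InstanceTypeConfigs:
      - InstanceType: r8g.8xlarge
        Priority: 1                     # lowest number, tried first
      - InstanceType: r8g.12xlarge
        Priority: 2
      - InstanceType: r7g.8xlarge
        Priority: 3
      - InstanceType: r6g.16xlarge
        Priority: 4
      - InstanceType: r5.16xlarge
        Priority: 5
    LaunchSpecifications:
      OnDemandSpecification:
        AllocationStrategy: prioritized
      SpotSpecification:
        AllocationStrategy: capacity-optimized-prioritized
        TimeoutAction: SWITCH_TO_ON_DEMAND
        TimeoutDurationMinutes: 10
```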
Select subnets with EMR instance fleets
When creating a cluster, specify multiple EC2 subnets within a virtual private cloud (VPC), each corresponding to a different Availability Zone. Amazon EMR provides multiple subnet (Availability Zone) options by employing subnet filtering at cluster launch, and selects one of the subnets that has adequate available IP addresses to successfully launch all instance fleets. If Amazon EMR can’t find a subnet with sufficient IP addresses to launch the whole cluster, it will prioritize the subnet that can at least launch the core and primary instance fleets.
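For example, a create-cluster call can pass multiple subnets across Availability Zones; the cluster name, release label, fleet configuration file, and subnet IDs below are placeholders:

```bash
# Provide one subnet per Availability Zone; Amazon EMR selects a suitable subnet at launch
aws emr create-cluster \
  --name "spiky-workload-cluster" \
  --release-label emr-7.2.0 \
  --instance-fleets file://fleet-config.json \
  --ec2-attributes '{"SubnetIds":["subnet-aaaa1111","subnet-bbbb2222","subnet-cccc3333"]}' \
  --use-default-roles
```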
Use managed scaling
Managed scaling is another powerful feature of Amazon EMR that automatically adjusts the number of instances in your cluster based on workload demands. This makes sure that your cluster scales up during periods of high demand to meet processing requirements and scales down during idle times to save costs. With managed scaling, you can set minimum and maximum scaling limits, giving you control over costs while benefiting from optimized and efficient cluster performance.
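For example, you can attach a managed scaling policy with the AWS CLI; the cluster ID and capacity limits below are placeholders:

```bash
# Set minimum and maximum instance counts for managed scaling
aws emr put-managed-scaling-policy \
  --cluster-id j-XXXXXXXXXXXXX \
  --managed-scaling-policy ComputeLimits='{MinimumCapacityUnits=2,MaximumCapacityUnits=100,UnitType=Instances}'
```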
The following workflow illustrates Amazon EMR configured with instance fleets and managed scaling.
The workflow consists of the following steps:
- The user defines the EMR instance configurations and instance types, along with their launch priority.
- The user selects subnets for the Amazon EMR configuration to provide Availability Zone flexibility.
- Amazon EMR calls the Amazon EC2 Fleet API to provision instances based on the allocation strategy.
- The EMR instance fleet is launched.
- The cycle is repeated for scaling operations within the launched Availability Zone, providing optimized performance and scalability.
Conclusion
In this post, we demonstrated how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. As you implement any of the preceding strategies, remember to continuously monitor your cluster’s performance and adjust configurations based on your specific workload patterns and business needs. With the right approach, the challenges of spiky workloads can be transformed into opportunities for optimized performance and cost savings.
To effectively manage workloads with both baseline demands and unexpected spikes, consider implementing a hybrid approach in Amazon EMR. Use ODCRs for consistent baseline capacity and configure instance fleets with a strategic mix of ODCR, On-Demand, and Spot Instances prioritizing ODCR usage.
Try these strategies with your own use case, and leave your questions in the comments.
About the Authors
Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!
Suba Palanisamy is a Senior Technical Account Manager, helping customers achieve operational excellence on AWS. Suba is passionate about all things data and analytics. She enjoys traveling with her family and playing board games.
Flavio Torres is a Principal Technical Account Manager at AWS. Flavio helps Enterprise Support customers design, deploy, and scale resilient cloud applications. Outside of work, he enjoys hiking and barbecuing.