Monitoring and troubleshooting Apache Spark applications becomes increasingly complex as companies scale their data analytics workloads. As data processing requirements grow, enterprises deploy these applications across multiple Amazon EMR on EKS clusters to handle diverse workloads efficiently. However, this approach creates a challenge in maintaining comprehensive visibility into Spark applications running across these separate clusters. Data engineers and platform teams need a unified view to effectively monitor and optimize their Spark applications.
Although Spark provides powerful built-in monitoring capabilities through Spark History Server (SHS), implementing a scalable and secure observability solution across multiple clusters requires careful architectural considerations. Organizations need a solution that not only consolidates Spark application metrics but extends its features by adding other performance monitoring and troubleshooting packages while providing secure access to these insights and maintaining operational efficiency.
This post demonstrates how to build a centralized observability platform using SHS for Spark applications running on EMR on EKS. We showcase how to enhance SHS with performance monitoring tools, with a pattern applicable to many monitoring solutions such as SparkMeasure and DataFlint. In this post, we use DataFlint as an example to demonstrate how you can integrate additional monitoring features. We explain how to collect Spark events from multiple EMR on EKS clusters into a central Amazon Simple Storage Service (Amazon S3) bucket; deploy SHS on a dedicated Amazon Elastic Kubernetes Service (Amazon EKS) cluster; and configure secure access using AWS Load Balancer Controller, AWS Private Certificate Authority, Amazon Route 53, and AWS Client VPN. This solution provides teams with a single, secure interface to monitor, analyze, and troubleshoot Spark applications across multiple clusters.
Overview of solution
Consider DataCorp Analytics, a data-driven enterprise running multiple business units with diverse Spark workloads. Their Financial Analytics team processes time-sensitive trading data requiring strict processing times and dedicated resources, and their Marketing Analytics team handles customer behavior data with flexible requirements, requiring multiple EMR on EKS clusters to accommodate these distinct workload patterns. As their Spark applications grow in number and complexity across these clusters, data and platform engineers struggle to maintain comprehensive visibility while maintaining secure access to monitoring tools.
This scenario presents an ideal use case for implementing a centralized observability platform using SHS and DataFlint. The solution deploys SHS on a dedicated EKS cluster, configured to read events from multiple EMR on EKS clusters through a centralized S3 bucket. Access is secured through Load Balancer Controller, AWS Private CA, Route 53, and Client VPN, and DataFlint enhances the monitoring capabilities with additional insights and visualizations. The following architecture diagram illustrates the components and their interactions.
The solution workflow is as follows:
- Spark applications on EMR on EKS use a custom EMR Docker image that includes the DataFlint JARs for enhanced metrics collection. These applications generate detailed event logs containing execution metrics, performance data, and DataFlint-specific insights. The logs are written to a centralized Amazon S3 location through the job's configurationOverrides section (a sample request is sketched after this list). For additional information, explore the StartJobRun guide to learn how to run Spark jobs and review the StartJobRun API reference.
- A dedicated SHS deployed on Amazon EKS reads these centralized logs. SHS is configured to point at the central Amazon S3 location (the relevant settings are shown in the SHS setup section later in this post).
- We configure Load Balancer Controller, AWS Private CA, a Route 53 hosted zone, and Client VPN to securely access the SHS UI using a web browser.
- Finally, users can access the SHS web interface at https://spark-history-server.example.internal/.
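The request files themselves are generated later in this post by the repository scripts. The following is a minimal sketch of what such a StartJobRun request could look like, with placeholder values for the virtual cluster, execution role, and log bucket; the configurationOverrides section is where the Spark event logs and job logs are routed to the central bucket:

```json
{
  "name": "spark-history-demo",
  "virtualClusterId": "<virtual-cluster-id>",
  "executionRoleArn": "<job-execution-role-arn>",
  "releaseLabel": "emr-7.2.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<artifacts-bucket>/spark_history_demo.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2"
    }
  },
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.eventLog.enabled": "true",
          "spark.eventLog.dir": "s3://emr-spark-logs-<suffix>/eventlogs/"
        }
      }
    ],
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://emr-spark-logs-<suffix>/joblogs/"
      }
    }
  }
}
```

The files generated by the repository may also enable the DataFlint plugin through additional Spark properties.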
You can find the code base in the AWS Samples GitHub repository.
Prerequisites
Before you deploy this solution, make sure the following prerequisites are in place:
Set up a common infrastructure
Complete the following steps to set up the infrastructure:
- Clone the repository to your local machine and set the two environment variables, replacing the placeholder value with the AWS Region where you want to deploy these resources.
- Execute the following script to create the common infrastructure. The script creates a secure virtual private cloud (VPC) networking environment with public and private subnets and an encrypted S3 bucket to store Spark application logs.
- To verify successful infrastructure deployment, open the AWS CloudFormation console, choose your stack, and check the Events, Resources, and Outputs tabs for completion status, details, and list of resources created.
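If you prefer the AWS CLI over the console, the same check can be done as follows, where the stack name is a placeholder for the name reported by the deployment script:

```bash
# Check the overall stack status (stack name is a placeholder)
aws cloudformation describe-stacks \
  --stack-name <common-infra-stack-name> \
  --query "Stacks[0].StackStatus"

# List the resources the stack created
aws cloudformation list-stack-resources \
  --stack-name <common-infra-stack-name> \
  --query "StackResourceSummaries[].[LogicalResourceId,ResourceStatus]" \
  --output table
```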
Set up EMR on EKS clusters
This section covers building a custom EMR on EKS Docker image with DataFlint integration, launching two EMR on EKS clusters (datascience-cluster-v and analytics-cluster-v), and configuring the clusters for job submission. Additionally, we set up the necessary IAM roles for service accounts (IRSA) to enable Spark jobs to write events to the centralized S3 bucket. Complete the following steps:
- Deploy two EMR on EKS clusters:
- To verify successful creation of the EMR on EKS clusters using the AWS CLI, execute the following command:
- Execute the following command for the datascience-cluster-v and analytics-cluster-v clusters to verify their respective states, container provider information, and associated EKS cluster details. Replace the virtual cluster ID placeholder with the ID of each cluster obtained from the list-virtual-clusters output. (A sketch of both commands follows this list.)
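A sketch of both verification commands; the virtual cluster ID is a placeholder taken from the list output:

```bash
# List the virtual clusters and note the IDs of the datascience and analytics clusters
aws emr-containers list-virtual-clusters \
  --query "virtualClusters[?state=='RUNNING'].[id,name,state]" \
  --output table

# Describe each cluster to confirm its state and the underlying EKS cluster details
aws emr-containers describe-virtual-cluster --id <virtual-cluster-id>
```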
Configure and execute Spark jobs on EMR on EKS clusters
Complete the following steps to configure and execute Spark jobs on the EMR on EKS clusters:
- Generate the custom EMR on EKS image and StartJobRun request JSON files to run Spark jobs:
The script performs the following tasks:
- Prepares the environment by uploading the sample Spark application spark_history_demo.py to a designated S3 bucket for job execution.
- Creates a custom Amazon EMR container image by extending the base EMR 7.2.0 image with the DataFlint JAR for additional insights and publishing it to an Amazon Elastic Container Registry (Amazon ECR) repository.
- Generates cluster-specific StartJobRun request JSON files for datascience-cluster-v and analytics-cluster-v.
Review start-job-run-request-datascience-cluster-v.json and start-job-run-request-analytics-cluster-v.json for additional details.
- Execute the following commands to submit Spark jobs on the EMR on EKS virtual clusters (a sketch follows this list):
- Verify the successful generation of the logs in the S3 bucket:
aws s3 ls s3://emr-spark-logs-
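A sketch of the submission commands, assuming the generated request files are in the current directory; the describe-job-run call is optional and uses placeholder IDs from the start-job-run output:

```bash
# Submit a job to each virtual cluster using the generated request files
aws emr-containers start-job-run \
  --cli-input-json file://start-job-run-request-datascience-cluster-v.json

aws emr-containers start-job-run \
  --cli-input-json file://start-job-run-request-analytics-cluster-v.json

# Optionally, check a job's status (IDs are placeholders from the start-job-run output)
aws emr-containers describe-job-run \
  --virtual-cluster-id <virtual-cluster-id> \
  --id <job-run-id>
```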
You have successfully set up an EMR on EKS environment, executed Spark jobs, and collected the logs in the centralized S3 bucket. Next, we will deploy SHS, configure its secure access, and visualize the logs using it.
Set up AWS Private CA and create a Route 53 private hosted zone
Use the following code to deploy AWS Private CA and create a Route 53 private hosted zone. This will provide a user-friendly URL to connect to SHS over HTTPS.
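The repository provides a script for this step. As a rough sketch of the kind of AWS CLI calls involved, assuming a root CA and the example.internal private domain used later in this post (the full setup also issues and imports the root certificate, which the script handles):

```bash
# Create a private root CA (the subject is a placeholder)
aws acm-pca create-certificate-authority \
  --certificate-authority-type ROOT \
  --certificate-authority-configuration '{
    "KeyAlgorithm": "RSA_2048",
    "SigningAlgorithm": "SHA256WITHRSA",
    "Subject": {"CommonName": "example.internal"}
  }'

# Create a Route 53 private hosted zone associated with the solution VPC
aws route53 create-hosted-zone \
  --name example.internal \
  --vpc VPCRegion=<aws-region>,VPCId=<vpc-id> \
  --caller-reference "shs-$(date +%s)"
```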
Set up SHS on Amazon EKS
Complete the following steps to build a Docker image containing SHS with DataFlint, deploy it on an EKS cluster using a Helm chart, and expose it through a Kubernetes service of type LoadBalancer. We use a Spark 3.5.0 base image, which includes SHS by default. Although this simplifies deployment, it results in a larger image size; for environments where image size is critical, consider building a custom image with just the standalone SHS component instead of using the complete Spark distribution.
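For reference, the setting that points SHS at the central bucket is spark.history.fs.logDirectory. A minimal sketch of the relevant properties, assuming placeholder bucket and prefix names and IRSA-based credentials (the actual values are supplied through the Helm chart in the repository):

```
# Spark History Server properties (sketch; bucket name and prefix are placeholders)
spark.history.fs.logDirectory    s3a://emr-spark-logs-<suffix>/eventlogs/
# How often SHS scans the bucket for new event logs
spark.history.fs.update.interval 10s
# Assumes IRSA; adjust the credentials provider to your environment
spark.hadoop.fs.s3a.aws.credentials.provider com.amazonaws.auth.WebIdentityTokenCredentialsProvider
```

Note that the open-source Spark image reads the bucket through the s3a connector, whereas the EMR jobs write event logs with the s3 scheme; both refer to the same location.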
- Deploy SHS on the spark-history-server EKS cluster:
- Verify the deployment by listing the pods and viewing the pod logs (see the kubectl sketch after this list):
- Review the logs and confirm there are no errors or exceptions.
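A sketch of the verification commands, assuming a dedicated namespace and a label selector that matches the Helm chart (both are placeholders):

```bash
# List the SHS pods (namespace is a placeholder)
kubectl get pods -n <shs-namespace>

# Tail the SHS pod logs and look for errors or exceptions
kubectl logs -n <shs-namespace> -l app.kubernetes.io/name=spark-history-server --tail=100

# Confirm the LoadBalancer service received an external hostname
kubectl get svc -n <shs-namespace>
```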
You have successfully deployed SHS on the spark-history-server EKS cluster and configured it to read logs from the emr-spark-logs- S3 bucket.
Deploy Client VPN and add entry to Route 53 for secure access
Complete the following steps to deploy Client VPN to securely connect your client machine (such as your laptop) to SHS and configure Route 53 to generate a user-friendly URL:
- Deploy the Client VPN:
- Add an entry to Route 53 for the SHS hostname (a sketch of the record change follows):
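The record in the private hosted zone maps spark-history-server.example.internal to the load balancer that fronts the SHS service. As a rough sketch of such an update, where the hosted zone ID and load balancer DNS name are placeholders (the repository script may create an alias record instead of a CNAME):

```bash
# Upsert a record pointing the friendly hostname at the load balancer DNS name
aws route53 change-resource-record-sets \
  --hosted-zone-id <private-hosted-zone-id> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "spark-history-server.example.internal",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "<shs-load-balancer-dns-name>"}]
      }
    }]
  }'
```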
Add certificates to local trusted stores
Complete the following steps to add the SSL certificate to your operating system’s trusted certificate stores for secure connections:
- For macOS users, using Keychain Access (GUI):
- Open Keychain Access from Applications, Utilities, choose the System keychain in the navigation pane, and choose File, Import Items.
- Browse to and choose ${REPO_DIR}/ssl/certificates/ca-certificate.pem, then choose the imported certificate.
- Expand the Trust section and set When using this certificate to Always Trust.
- Close the window, enter your password when prompted, and save.
- Alternatively, you can execute a single command to add the certificate to the System keychain and trust it (see the sketch after the Windows steps).
- For Windows users:
- Rename ca-certificate.pem to ca-certificate.crt.
- Choose (right-click) ca-certificate.crt and choose Install Certificate.
- Choose Local Machine (admin rights required).
- Select Place all certificates in the following store.
- Choose Browse and choose Trusted Root Certification Authorities.
- Complete the installation by choosing Next and Finish.
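The macOS command-line alternative mentioned above could look like the following; the path assumes the repository layout used earlier in this post:

```bash
# Add the CA certificate to the System keychain and mark it as trusted (requires sudo)
sudo security add-trusted-cert -d -r trustRoot \
  -k /Library/Keychains/System.keychain \
  ${REPO_DIR}/ssl/certificates/ca-certificate.pem
```

On Windows, an equivalent command from an elevated prompt is certutil -addstore -f Root ca-certificate.crt.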
Set up Client VPN on your client machine for secure access
Complete the following steps to install and configure Client VPN on your client machine (such as your laptop) and create a VPN connection to the AWS Cloud:
- Download, install, and launch the Client VPN application from the official download page for your operating system.
- Create your VPN profile:
- Choose File in the menu bar, choose Manage Profiles, and choose Add Profile.
- Enter a name for your profile, for example, SparkHistoryServerUI.
- Browse to ${REPO_DIR}/vpn/client_vpn_certs/client-config.ovpn, choose the configuration file, and choose Add Profile to save your configuration.
- Select your newly created profile, choose Connect, and wait for the connection confirmation to establish the VPN connection.
When you’re connected, you will have secure access to the AWS resources in your environment.
Securely access the SHS URL
Complete the following steps to securely access SHS using a web browser:
- Execute the following command to get the SHS URL: https://spark-history-server.example.internal/
- Copy this URL and enter it into your web browser to access the SHS UI.
The following screenshot shows an example of the UI.
- Choose an App ID to view its detailed execution information and metrics.
- Choose the DataFlint tab to view detailed application insights and analytics.
DataFlint displays various helpful metrics, including alerts, as shown in the following screenshot.
Clean up
To avoid incurring ongoing charges, delete the resources created in this walkthrough when you're finished.
Conclusion
In this post, we demonstrated how to build a centralized observability platform for Spark applications using SHS and enhance SHS with performance monitoring tools like DataFlint. The solution aggregates Spark events from multiple EMR on EKS clusters into a unified monitoring interface, providing comprehensive visibility into your Spark applications’ performance and resource utilization. By using a custom EMR image with performance monitoring tool integration, we enhanced the standard Spark metrics to gain deeper insights into application behavior. If your environment uses a mix of EMR on EKS, Amazon EMR on EC2, or Amazon EMR Serverless, you can seamlessly extend this architecture to aggregate the logs from EMR on EC2 and EMR Serverless in a similar way and visualize them using SHS.
Although this solution provides a robust foundation for Spark monitoring, production deployments should consider implementing authentication and authorization. SHS supports custom authentication through javax servlet filters and fine-grained authorization through access control lists (ACLs). We encourage you to explore implementing authentication filters for secure access control, configuring user- and group-based ACLs for view and modify permissions, and setting up group mapping providers for role-based access. For detailed guidance, refer to Spark’s web UI security documentation and SHS security features.
Although AWS endeavors to apply security best practices in this example, each organization has its own policies. Make sure to apply your organization's specific policies when deploying this solution, and use it as a starting point for implementing centralized Spark monitoring in your data processing environment.
About the authors
Sri Potluri is a Cloud Infrastructure Architect at AWS. He is passionate about solving complex problems and delivering well-structured solutions for diverse customers. His expertise spans across a range of cloud technologies, providing scalable and reliable infrastructures tailored to each project’s unique challenges.
Suvojit Dasgupta is a Principal Data Architect at AWS. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.