Databricks is excited to introduce enhanced streaming observability within Workflows and Delta Live Tables (DLT) pipelines. This feature gives data engineering teams robust tools for monitoring and optimizing real-time data processing. An intuitive user interface lets users track key backlog metrics, including backlog duration (seconds), backlog bytes, backlog records, and backlog files, across popular streaming sources such as Kafka, Kinesis, Delta, and Auto Loader.
With proactive, task-level alerts, teams no longer have to guess at the state of their backlogs, which makes compute utilization more efficient and keeps data fresh. These capabilities let organizations scale real-time analytics with confidence, improving decision-making through reliable, high-performance streaming pipelines.
Common Challenges in Streaming Monitoring and Alerting
A growing backlog often indicates underlying issues, which may range from one-time fixes to the need for reconfiguration or optimization to handle increased data volumes. Below are some critical areas engineering teams focus on to maintain the throughput and reliability of a streaming pipeline.
- Capacity Planning: Determining when to scale vertically (adding more power to existing resources) or horizontally (adding more nodes) to sustain high throughput and maintain system stability.
- Operational Insights: Monitoring for bursty input patterns, sustained periods of high throughput, or slowdowns in downstream systems. Early detection of anomalies or spikes enables proactive responses that keep operations running smoothly.
- Data Freshness Guarantees: For real-time applications, such as machine learning models or business logic embedded in the stream, access to the freshest data is paramount. Stale data can lead to inaccurate decisions, making it essential to prioritize data freshness in streaming workflows.
- Error Detection and Troubleshooting: Flagging issues, surfacing actionable insights, and enabling engineers to take corrective action quickly requires robust monitoring and alerting.
Understanding a stream’s backlog previously required several steps. In Delta Live Tables, this involved continuously parsing the pipeline event log to extract the relevant information. For Structured Streaming, engineers often relied on Spark’s StreamingQueryListener to capture backlog metrics and push them to third-party tools, which introduced additional development and maintenance overhead. Setting up alerting mechanisms added further complexity, requiring yet more custom code and configuration.
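For Structured Streaming jobs, that custom code often looked something like the sketch below: a listener that pulls offset information out of each progress event and forwards it to an external system. The push_to_monitoring helper is a hypothetical stand-in for whatever metrics client a team actually used, and the snippet assumes a Databricks notebook where spark is already defined.

```python
from pyspark.sql.streaming import StreamingQueryListener


def push_to_monitoring(**metrics):
    """Hypothetical stand-in for a real metrics client (StatsD, Prometheus, etc.)."""
    print(metrics)


class BacklogMetricsListener(StreamingQueryListener):
    """Manually exports per-source progress so backlog can be tracked externally."""

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        progress = event.progress
        for source in progress.sources:
            # The gap between the source's latest available offset and the offset
            # reached in this micro-batch approximates the remaining backlog.
            push_to_monitoring(
                query_id=str(progress.id),
                source=source.description,
                num_input_rows=source.numInputRows,
                end_offset=source.endOffset,
                latest_offset=source.latestOffset,
            )

    def onQueryTerminated(self, event):
        pass


# Register once per Spark session; every micro-batch then reports its progress.
spark.streams.addListener(BacklogMetricsListener())
```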
After metrics are delivered, challenges remain in managing expectations around the time required to clear the backlog. Providing accurate estimates for when data will catch up involves variables such as throughput, resource availability, and the dynamic nature of streaming workloads, making precise predictions difficult.
Workflows and Delta Live Tables Now Display Backlog Metrics
With the release of streaming observability, data engineers can now easily detect and manage backlogs through visual indicators in the Workflows and DLT UIs. The streaming backlog metrics sit side by side with the Databricks notebook code in the Workflows UI.
The streaming metrics graph, displayed in the right pane of the Workflows UI, highlights the backlog by plotting the volume of unprocessed data over time. When the data processing rate lags behind the data input rate, a backlog begins to accumulate, which is clearly visible in the graph.
Alerting on Backlog Metrics from the Workflows UI
Databricks is also enhancing its alerting functionality by incorporating backlog metrics alongside its existing capabilities, which include alerts for start, duration, failure, and success. Users can set thresholds for streaming metrics inside the Workflows UI, ensuring notifications are triggered whenever these limits are exceeded. Alerts can be configured to send notifications via email, Slack, Microsoft Teams, webhooks, or PagerDuty. The recommended best practice for implementing notifications on DLT pipelines is to orchestrate them using a Databricks Workflow.
The notification above was delivered through email and lets you click directly into the Workflows UI.
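For teams that prefer to manage these thresholds as configuration rather than through the UI, the sketch below shows roughly what the equivalent Jobs API call could look like. The workspace URL, token, job ID, and the health-rule metric name (STREAMING_BACKLOG_SECONDS here) are placeholders and assumptions to verify against the Jobs API reference.

```python
import requests

# Sketch of setting a streaming-backlog health rule on a job via the Jobs API.
# URL, token, job ID, and the metric identifier are placeholders/assumptions.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "job_id": 123456,  # hypothetical job ID
    "new_settings": {
        "health": {
            "rules": [
                # Alert when the backlog represents more than 10 minutes of
                # unprocessed data.
                {"metric": "STREAMING_BACKLOG_SECONDS", "op": "GREATER_THAN", "value": 600}
            ]
        }
    },
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()
```

When such a rule is breached, the alert is delivered to whichever notification destinations (email, Slack, Microsoft Teams, webhooks, or PagerDuty) are configured for the job.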
Enhancing Streaming Pipeline Performance through Real-Time Backlog Metrics in DLT
Managing and optimizing streaming pipelines in Delta Live Tables is a significant challenge, particularly for teams dealing with high-throughput data sources like Kafka. As data volume scales, backlogs grow and performance degrades. In serverless DLT, features like stream pipelining and vertical autoscaling help maintain system performance; on non-serverless compute, these capabilities are not available.
A major issue has been the lack of real-time visibility into backlog metrics, which hinders a team's ability to quickly identify problems and make informed decisions to optimize the pipeline. Without the new feature, DLT pipelines rely on event log metrics, which require custom dashboards or monitoring solutions to track backlogs effectively.
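Tracking backlog that way typically meant querying flow_progress events out of the pipeline's event log, along the lines of the sketch below. The pipeline ID is a placeholder, and the details:flow_progress field paths are assumptions to check against your own event log schema before relying on the query.

```python
# Sketch of the do-it-yourself approach: pulling backlog figures out of the DLT
# event log. Pipeline ID and JSON field paths are placeholders/assumptions.
backlog_history = spark.sql("""
    SELECT
        timestamp,
        origin.flow_name,
        details:flow_progress.metrics.backlog_bytes AS backlog_bytes
    FROM event_log('<pipeline-id>')
    WHERE event_type = 'flow_progress'
    ORDER BY timestamp DESC
""")
display(backlog_history)
```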
However, the new streaming observability feature allows data engineers to swiftly identify and manage backlogs through the DLT UI, enhancing the efficiency of monitoring and optimization.
Let’s examine a Delta Live Tables pipeline that ingests data from Kafka and writes it to a streaming Delta table. The code below sketches the table definition in DLT.
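Here is a simplified version, assuming the Python DLT API; the Kafka broker address and topic name are placeholders rather than the pipeline's actual values:

```python
import dlt

@dlt.table(
    name="kafka_stream_bronze",
    comment="Raw events ingested from Kafka"
)
def kafka_stream_bronze():
    return (
        spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "<broker-host:9092>")  # placeholder
            .option("subscribe", "<topic-name>")                      # placeholder
            .option("startingOffsets", "earliest")
            # Cap each micro-batch at 1,000 Kafka offsets.
            .option("maxOffsetsPerTrigger", "1000")
            .load()
    )
```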
kafka_stream_bronze is a streaming Delta table created in the pipeline and designed for continuous data processing. The maxOffsetsPerTrigger setting, configured to 1000, caps the number of Kafka offsets that can be processed per trigger interval within the DLT pipeline. This value was determined by analyzing the required processing rate for the data volume at the time. As part of its initial setup, the pipeline is processing historical data from Kafka.
Initially, the Kafka streams were producing fewer than 1000 records per second, and the backlog metrics showed a steady decline (as shown in image 1). When the volume of incoming data from Kafka starts to increase, the system begins to show signs of strain (as shown in images 2 and 3), indicating that processing is struggling to keep up with the growing data volume. Left as is, the initial configuration would lead to growing processing delays, prompting a reevaluation of the instance and configuration settings.
It became clear that the initial configuration, which limited maxOffsetsPerTrigger to 1000, was insufficient to handle the growing load. To resolve this, the configuration was adjusted to allow up to 10,000 offsets per trigger, as shown below.
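The same sketch, with the limit raised to 10,000 (connection details remain placeholders):

```python
import dlt

@dlt.table(
    name="kafka_stream_bronze",
    comment="Raw events ingested from Kafka"
)
def kafka_stream_bronze():
    return (
        spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "<broker-host:9092>")  # placeholder
            .option("subscribe", "<topic-name>")                      # placeholder
            .option("startingOffsets", "earliest")
            # Raised from 1,000 to 10,000 so each micro-batch works through a
            # larger slice of the accumulated backlog.
            .option("maxOffsetsPerTrigger", "10000")
            .load()
    )
```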
This allowed the pipeline to process larger data batches in each trigger, significantly boosting throughput. After making this adjustment, we saw a consistent reduction in backlog metrics (image 4), indicating that the system was catching up with the incoming data stream. The shrinking backlog improved overall system performance.
This experience underlines the importance of visualizing stream backlog metrics, as it enables proactive adjustments to configurations and ensures that the pipeline can effectively manage changing data needs. Real-time tracking of backlog enabled us to optimize the Kafka streaming pipeline, reducing delays and improving data throughput without the need for complex event log queries or Spark UI navigation.
Don’t let bottlenecks catch you off guard. Leverage our new observability capabilities to monitor backlog, freshness, and throughput. Try it today and experience stress-free data pipeline management.