Data pipelines are a crucial component of any data-driven organization. An efficient data pipeline boosts operational efficiency by automating the data flow and increasing its quality as it travels from one point to another throughout the pipeline. It helps enterprises to make data-driven and well-informed decisions by providing a holistic view of operations and the market.
A slight lag in data pipelines' performance can cause substantial financial losses in time-sensitive industries, like financial services and investment, where milliseconds affect the trading outcomes.
At 73 Strings, we are committed to enhancing human capabilities with innovative solutions. Let’s look at how our team designed and deployed a solution using Databricks, Debezium, Kafka, and Elasticsearch to overcome the limitations of traditional ETL pipelines.
Building a Real-Time Analytics Pipeline
Modern data pipelines should be able to load, transform, and analyze data in real time. Traditional ETL pipelines rely on disk-based processing, which slows down transformation. This approach suits batch processing, where data is processed at scheduled intervals, but it cannot meet the demands of real-time processing.
For real-time workloads, disk-based processing should therefore be replaced by in-memory processing so that businesses can respond to insights quickly. In a real-time analytics pipeline, data from different sources, like databases, IoT devices, messaging systems, and log files, is ingested with minimal delay. Log-based Change Data Capture (CDC) is widely regarded as the gold standard for producing such a stream, because it reads changes from the database's transaction log instead of repeatedly querying the tables.
CDC provides an efficient method to move the data across a wide area network, making it suitable for modern cloud architecture.
73 Strings - Building Innovative Solutions to Convert Challenges Into Opportunities
Debezium and Kafka can be used for CDC as Debezium captures the real-time changes from databases, like MySQL, without overwhelming the database. Apache Kafka serves as the backbone for event streaming. It ensures that real-time data flows from databases to the processing layer. Let’s understand the architecture of our solution:
Step 1. Set up the Source Database
The core of this real-time pipeline is MySQL, the primary transactional database and the source of the data. It stores critical business information, like transactions, financial records, and application logs. All insert, update, and delete operations are recorded in its binary log (binlog).
Step 2. Change Data Capture (CDC) using Debezium
All the data changes in source databases are captured in real time using Debezium. Debezium’s flexibility, lightweight architecture, and low-latency streaming make it suitable for CDC. Debezium is built on top of Kafka Connect, which is a framework that helps in integrating Kafka with external systems.
Debezium provides a pre-built source connector for MySQL. This connector reads the database binlog to detect changes, along with details like the table, the changed columns, and the type of operation. Each change event is structured as a Kafka message in JSON or Avro format. The events are streamed into Kafka topics, typically one topic per table.
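As a rough illustration of the connector described above, the sketch below builds a registration payload in the shape Kafka Connect's REST API accepts. The property names follow Debezium 2.x for MySQL; the host, credentials, server ID, topic prefix, and table names are placeholders and not values from our deployment. Authentication settings for the Kafka cluster are left out for brevity.

```python
import json


def build_mysql_connector_config(name, host, user, password, topic_prefix,
                                 tables, kafka_bootstrap):
    """Return a Kafka Connect registration payload for Debezium's MySQL connector.

    Every value passed in here is a placeholder for illustration.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,
            "database.port": "3306",
            "database.user": user,
            "database.password": password,
            # Must be unique among all clients reading this server's binlog.
            "database.server.id": "184054",
            # Prefix for topic names: <prefix>.<database>.<table>
            "topic.prefix": topic_prefix,
            "table.include.list": ",".join(tables),
            # Debezium keeps the table DDL history in its own Kafka topic.
            "schema.history.internal.kafka.bootstrap.servers": kafka_bootstrap,
            "schema.history.internal.kafka.topic": f"{topic_prefix}.schema-history",
        },
    }


def topic_for(topic_prefix, qualified_table):
    """Default Debezium topic name for a table given as 'database.table'."""
    return f"{topic_prefix}.{qualified_table}"


payload = build_mysql_connector_config(
    name="mysql-cdc-example",
    host="mysql.example.internal",
    user="debezium",
    password="change-me",
    topic_prefix="appdb",
    tables=["sales.transactions", "sales.accounts"],
    kafka_bootstrap="broker.example.internal:9092",
)
print(json.dumps(payload, indent=2))
print(topic_for("appdb", "sales.transactions"))
```

A payload like this would be sent with an HTTP POST to the `/connectors` endpoint of a Kafka Connect worker. With these placeholder values, changes to `sales.transactions` would land in the topic `appdb.sales.transactions`.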
The stream processors subscribe to these topics and act on the changes, providing real-time data replication and syncing with downstream systems like Elasticsearch.
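To make "acting on the changes" concrete, here is a minimal sketch of a consumer applying Debezium-style change events to an in-memory replica. It relies on the standard Debezium envelope fields (`op`, `before`, `after`); the sample rows and the `id` key field are hypothetical, and a real consumer would read the events from Kafka rather than from a list.

```python
def apply_change(replica, event, key_field="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    # With the JSON converter and schemas enabled, the envelope sits under "payload".
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = payload["after"]
        replica[row[key_field]] = row
    elif op == "d":                # delete: only "before" is populated
        replica.pop(payload["before"][key_field], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return replica


# Hypothetical events for a transactions table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "amount": 100}},
    {"op": "c", "before": None, "after": {"id": 2, "amount": 250}},
    {"op": "u", "before": {"id": 1, "amount": 100}, "after": {"id": 1, "amount": 120}},
    {"op": "d", "before": {"id": 2, "amount": 250}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)  # {1: {'id': 1, 'amount': 120}}
```

The replica ends up with only row 1 at its updated amount, which is the same state the source table would hold after those four operations.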
Step 3. Kafka (Confluent Cloud) as the Event Streaming Platform
Confluent adds features on top of Apache Kafka that make it easier to stream data reliably. Confluent Cloud is a fully managed Kafka service that requires no infrastructure management and supports automatic scaling and resilience.
Confluent Cloud ingests the real-time changes from Debezium and acts as a message broker, temporarily storing the data and streaming it to consumers like Databricks and Elasticsearch. It decouples the source database from the operational analytics systems, creating a loosely coupled and scalable architecture.
The deployed Debezium connector is pointed at the Confluent Cloud cluster. Confluent Cloud also offers a built-in Schema Registry that registers the event schemas, enforcing type safety and supporting schema evolution.
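Schema evolution is easier to picture with an example. The sketch below is a deliberately simplified, plain-Python version of a backward-compatibility check between two Avro-style record schemas. It is not the Schema Registry's implementation and ignores rules such as Avro type promotions; the schemas and field names are made up for illustration.

```python
def backward_compatible(old_schema, new_schema):
    """Simplified backward-compatibility check for Avro-style record schemas.

    Returns a list of problems. An empty list means data written with
    old_schema can be read with new_schema under these simplified rules.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            # A field the old data lacks needs a default to fall back on.
            if "default" not in field:
                problems.append(f"new field {field['name']!r} has no default")
        elif old["type"] != field["type"]:
            # Real Avro allows some promotions (e.g. int to long); this sketch does not.
            problems.append(f"field {field['name']!r} changed type")
    return problems


v1 = {"type": "record", "name": "Transaction", "fields": [
    {"name": "id", "type": "long"},
    {"name": "amount", "type": "double"},
]}
# Adds a field with a default: old events can still be read.
v2 = {"type": "record", "name": "Transaction", "fields": v1["fields"] + [
    {"name": "currency", "type": "string", "default": "USD"},
]}
# Adds a field without a default: old events cannot supply a value.
v3 = {"type": "record", "name": "Transaction", "fields": v1["fields"] + [
    {"name": "note", "type": "string"},
]}

print(backward_compatible(v1, v2))  # []
print(backward_compatible(v1, v3))  # ["new field 'note' has no default"]
```

A registry that enforces a rule like this would accept `v2` and reject `v3`, so consumers never receive events they cannot decode.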
Step 4. Databricks as the Unified Analytics Platform for Real-Time Stream Processing
Databricks is an enterprise-grade data platform built on Apache Spark. Here it acts as a consumer that subscribes to the Kafka topics and reads messages as they arrive in Confluent Cloud.
With Confluent and Databricks, streaming data sets can be prepped, enriched, joined, and queried in Databricks SQL for fast analytics on streaming data. Databricks applies many runtime optimizations that improve performance for real-time applications. The high-volume streaming data is stored directly in Delta tables for fast reads, versioning, and reliability.
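A minimal sketch of how such a consumer could be wired up with Spark Structured Streaming follows. The option keys are those of Spark's Kafka source; the bootstrap server, topic, credentials, table name, and checkpoint path are placeholders. The `kafkashaded.` prefix on the login module is an assumption that holds on Databricks runtimes; open-source Spark uses the unprefixed class name.

```python
def kafka_source_options(bootstrap_servers, topic, api_key, api_secret):
    """Options for Spark's Kafka source against a SASL_SSL-secured cluster."""
    jaas = (
        # Assumption: Databricks runtime, where the Kafka client is shaded.
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="{api_key}" password="{api_secret}";'
    )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def start_bronze_stream(spark, options, target_table, checkpoint_path):
    """Stream raw Kafka records into a Delta table.

    `spark` is an active SparkSession; this function is not called here
    because it needs a running cluster.
    """
    raw = spark.readStream.format("kafka").options(**options).load()
    return (
        raw.selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "topic", "partition", "offset", "timestamp",
        )
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )


# Placeholder values for illustration only.
options = kafka_source_options(
    bootstrap_servers="broker.example.internal:9092",
    topic="appdb.sales.transactions",
    api_key="EXAMPLE_KEY",
    api_secret="EXAMPLE_SECRET",
)
print(sorted(options))
```

In a notebook, `start_bronze_stream(spark, options, "bronze.transactions_raw", "/tmp/checkpoints/transactions")` would start the stream; the checkpoint location lets it resume from the last processed offsets after a restart.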
Step 5. Delta Live Tables (DLT) pipeline for ingestion and transformation
Delta Live Tables provides a declarative framework for developing and running automated and scalable streaming data pipelines. With DLT, the data flows can be defined in SQL or Python, and Databricks takes care of the orchestration, monitoring, quality checks, and optimizations. Automatic orchestration and declarative processing are the two main benefits DLT provides over pipelines hand-coded directly in Apache Spark.
DLT ingests the raw event data from Kafka, which Debezium streams from MySQL in real time. Instead of requiring complex ingestion logic to be coded by hand, DLT takes care of pipeline orchestration, job execution, error handling, and monitoring. This makes building and maintaining real-time pipelines far more efficient and less error-prone.
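The quality checks mentioned above are declared as named rules in DLT (for example with its `expect_or_drop` decorator in Python). The sketch below is not DLT code. It is a plain-Python illustration of the drop-on-violation idea, with hypothetical rule names and fields, so the behavior can be seen without a Databricks workspace.

```python
def apply_expectations(rows, expectations):
    """Keep rows that satisfy every expectation and count violations per rule."""
    kept = []
    violations = {name: 0 for name in expectations}
    for row in rows:
        failed = [name for name, check in expectations.items() if not check(row)]
        for name in failed:
            violations[name] += 1
        if not failed:
            kept.append(row)
    return kept, violations


# Hypothetical rules: each is declared once, by name, as a predicate on a row.
expectations = {
    "valid_id": lambda r: r.get("id") is not None,
    "non_negative_amount": lambda r: r.get("amount", 0) >= 0,
}

rows = [
    {"id": 1, "amount": 120},
    {"id": None, "amount": 50},
    {"id": 3, "amount": -10},
]

kept, violations = apply_expectations(rows, expectations)
print(kept)        # [{'id': 1, 'amount': 120}]
print(violations)  # {'valid_id': 1, 'non_negative_amount': 1}
```

The point of the declarative style is that the rules live next to the table definition and the per-rule violation counts come for free, rather than being scattered through hand-written filter logic.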
Step 6. Periodic polling and batch loading for data stored in Elasticsearch
Elasticsearch is used as a high-speed search and query engine in this architecture. After real-time data is processed and transformed by Databricks and Delta Live Tables, a portion of that data is indexed into Elasticsearch to support use cases like keyword search, filtering, and instant querying in downstream applications or dashboards.
The solution uses periodic polling and batch data loading. This approach involves scheduling jobs that periodically query the processed data (e.g., from Delta tables) and then load or refresh the relevant datasets in Elasticsearch at regular intervals, for example every few seconds or minutes, depending on the use case and latency requirements.
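One common way to implement this kind of polling is with a watermark: each run picks up only the rows changed since the previous run. The sketch below assumes every row carries a hypothetical `updated_at` timestamp and an `id`, and it shapes the changed rows as action dictionaries of the kind accepted by the bulk helper in the official Elasticsearch Python client. The query against the Delta table and the call to Elasticsearch are left out, and the index name is a placeholder.

```python
def build_bulk_actions(rows, index, watermark, id_field="id", ts_field="updated_at"):
    """Select rows changed since `watermark` and shape them as bulk index actions.

    Returns (actions, new_watermark).
    """
    changed = [r for r in rows if r[ts_field] > watermark]
    actions = [
        {
            "_op_type": "index",   # insert, or overwrite the document with the same _id
            "_index": index,
            "_id": r[id_field],
            "_source": r,
        }
        for r in changed
    ]
    # Advance the watermark only as far as the newest row actually seen.
    new_watermark = max((r[ts_field] for r in changed), default=watermark)
    return actions, new_watermark


# Hypothetical processed rows; updated_at is in epoch seconds.
rows = [
    {"id": 1, "amount": 120, "updated_at": 90},
    {"id": 2, "amount": 250, "updated_at": 110},
    {"id": 3, "amount": 75, "updated_at": 130},
]

actions, watermark = build_bulk_actions(rows, "transactions", watermark=100)
print(len(actions), watermark)   # 2 130

# A second poll with no new changes produces nothing to load.
actions_again, watermark = build_bulk_actions(rows, "transactions", watermark)
print(len(actions_again), watermark)  # 0 130
```

In a scheduled job, the returned actions would be passed to `elasticsearch.helpers.bulk(client, actions)` and the new watermark persisted for the next run. Because each action uses the row's `id` as the document `_id`, re-sending a row simply overwrites the earlier version.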
Key Benefits of this Architecture
73 Strings’ real-time pipeline delivers live data to the analytics BI tool. From the moment a change occurs in the database, it takes less than 10 seconds for that change to appear in the BI tool. Let’s understand the key benefits of this architecture:
1. Minimized database load: Because Debezium reads changes from the binlog instead of querying the tables, the Debezium and Kafka CDC approach reduces the load on operational databases and improves their performance.
2. Centralized data lake: Built using Databricks and Delta Lake, the solution unifies the data from different sources, like MySQL and Elasticsearch, into a single, centralized platform. This makes it easy to access the data and makes governance simpler.
3. Real-time reporting: The CDC streaming pipeline architecture ensures that the stakeholders get up-to-date dashboards and at-the-moment alerts, and not yesterday’s data.
4. Scalability: The Databricks and Kafka integration can scale horizontally to handle spikes in data volume and velocity. Even if the solution processes hundreds of records per second, the pipeline remains reliable and future-proof.
5. Advanced analytics: The pipeline can support advanced analytics and predictive analytics with clean, consistent, and centralized data.
The Bottom Line
Building scalable, near-real-time data pipelines is necessary for organizations to make the best use of their data, especially investment and financial organizations, where every minute is crucial in responding to unpredictable markets. 73 Strings is your partner for brainstorming solutions that make the best use of existing technologies to overcome these challenges. Our real-time analytics pipeline supports stakeholders with near-real-time decision-making and powers their AI-driven investment strategies.