The distinction between batch and stream processing is the distinction between looking at yesterday's newspaper and watching live news. For many business decisions — fraud detection, operational alerting, real-time personalisation, supply chain exception management — yesterday's data is too late. The business needs to know what is happening now, and act on it within seconds or minutes.
Apache Kafka is the foundational infrastructure for real-time data pipelines. At its core, Kafka is a distributed, partitioned, replicated commit log: an append-only sequence of events that any number of producers can write to and any number of consumers can read from, independently and at their own pace. This decoupling — producers do not need to know who will consume their data, consumers do not need to synchronise with producers — makes Kafka the standard integration backbone for event-driven architectures.
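To make the log abstraction concrete, here is a minimal in-memory sketch in Python. It is a toy model, not a Kafka client: the class `PartitionedLog`, its method names and the CRC32-based partitioner are assumptions made for illustration (Kafka's own default partitioner uses a different hash). It shows the two properties described above: records are only ever appended, and each consumer group tracks its own read position, so groups read independently of each other and of the producer.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy in-memory model of a partitioned, append-only log (illustrative only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # Next offset to read, tracked separately per (consumer group, partition).
        self.offsets = defaultdict(int)

    def append(self, key, value):
        # The same key always maps to the same partition, preserving per-key order.
        # CRC32 is a stand-in hash chosen for this sketch.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def poll(self, group, partition, max_records=10):
        # Reading never removes records; it only advances this group's offset.
        start = self.offsets[(group, partition)]
        records = self.partitions[partition][start:start + max_records]
        self.offsets[(group, partition)] = start + len(records)
        return records


log = PartitionedLog(num_partitions=1)
for i in range(3):
    log.append("order-42", f"event-{i}")

# Two hypothetical consumer groups read the same partition at their own pace.
fraud_batch = log.poll("fraud", 0, max_records=2)
audit_batch = log.poll("audit", 0)
```

Here the "fraud" group has consumed two records and the "audit" group all three, yet neither affects the other and the producer is unaware of both.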
Kafka handles ingestion and transport. Apache Flink handles stateful stream processing — the computations that require context across multiple events. Flink's programming model allows developers to express complex analytics as streaming dataflows: count events in a sliding five-minute window, detect sequences of events that match a fraud pattern, aggregate IoT sensor readings by device and compute rolling averages, join two event streams on a shared key within a time window. These computations happen continuously, in real time, as events flow through the system.
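The sliding-window count mentioned above can be sketched in plain Python. This is not Flink code: the function names are hypothetical, and the sketch ignores watermarks, late events and state cleanup, all of which a real stream processor has to handle. It only illustrates the core idea that, with a window size larger than the slide interval, each event belongs to several overlapping windows and is counted once in each.

```python
from collections import Counter


def assign_sliding_windows(ts, size, slide):
    """Return the start of every window [start, start + size) that contains ts."""
    start = ts - (ts % slide)  # latest window start at or before ts
    starts = []
    while start > ts - size:
        starts.append(start)
        start -= slide
    return starts


def count_per_window(events, size, slide):
    """Count events per (key, window start) for (key, timestamp) pairs."""
    counts = Counter()
    for key, ts in events:
        for start in assign_sliding_windows(ts, size, slide):
            counts[(key, start)] += 1
    return counts


# Timestamps are in arbitrary units: a window of 10 that slides every 5.
events = [("card-1", 7), ("card-1", 12)]
counts = count_per_window(events, size=10, slide=5)
```

With timestamps in milliseconds, a five-minute window that advances every minute would use `size=300_000` and `slide=60_000`, placing each event in five overlapping windows.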
Running Kafka and Flink at production scale requires attention to partitioning strategy, consumer group management, serialisation format, and a schema registry for managing schema evolution without breaking consumers. Avro and Protobuf are the usual production choices for serialisation: both are compact binary formats with explicit schemas, whereas JSON repeats every field name in every message and is comparatively verbose for high-throughput streams.
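The size argument for schema-based binary formats can be illustrated with a rough comparison using only the Python standard library. The fixed-width `struct` layout below is a stand-in for Avro or Protobuf, not an implementation of either (both use their own encodings), and the event fields are hypothetical. The point is only that when field names and types live in a shared schema, each message carries values alone.

```python
import json
import struct

# Hypothetical sensor reading used only for this comparison.
event = {"device_id": 1234, "temperature": 21.5, "timestamp_ms": 1700000000000}

# JSON carries the field names inside every single message.
json_bytes = json.dumps(event).encode("utf-8")

# Stand-in "schema": little-endian uint32, float64, uint64, agreed on out of band.
SCHEMA = struct.Struct("<IdQ")
binary_bytes = SCHEMA.pack(
    event["device_id"], event["temperature"], event["timestamp_ms"]
)

# A reader that knows the schema recovers the values without any field names.
device_id, temperature, timestamp_ms = SCHEMA.unpack(binary_bytes)

print(len(json_bytes), len(binary_bytes))
```

The binary form is 20 bytes here, several times smaller than the JSON form. The trade-off is that readers and writers must agree on the schema, which is the coordination problem a schema registry addresses.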
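Managed Kafka services such as Confluent Cloud and Amazon MSK, along with Kafka-compatible endpoints such as Azure Event Hubs, reduce the operational burden of running Kafka clusters at scale. Flink is available as a managed service on AWS (Amazon Managed Service for Apache Flink, formerly Amazon Kinesis Data Analytics) and on Confluent Cloud. Google Cloud Dataflow provides comparable managed stream processing, but it runs Apache Beam pipelines rather than Flink jobs. These managed options lower the barrier for organisations without specialised distributed systems expertise.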
