The concept of stream processing has been around for a while and most software systems operate as simple stream processors at their core: they read data in, process it, and maybe emit some data out. So why are there so many stream processing frameworks, all with different terminology, and why does it seem so complex to get up and running? What benefits does each stream processing system provide, and more importantly, what are they missing?
This presentation will start by abstracting away the individual frameworks and describe the key features and benefits that stream processing frameworks provide. These core features include scalability and parallelism through data partitioning, fault tolerance and event processing order guarantees, support for stateful stream processing, and handy stream processing primitives such as windowing. Understanding these features will enable you to map practical data problems to stream processing, write applications that process streams of data at scale, and understand how the different frameworks fit into the stream processing framework design space.
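To make one of these primitives concrete: a tumbling window splits a stream into fixed, non-overlapping time buckets and aggregates within each one. The following is a minimal, framework-independent sketch in plain Python; the function name `tumbling_count` and its signature are illustrative, not taken from any stream processing API.

```python
from collections import defaultdict

def tumbling_count(events, window_size_ms):
    """Count events per key within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_ms, key) pairs.
    Returns {window_start_ms: {key: count}}.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Each timestamp falls into exactly one window, identified by
        # the window's start time.
        window_start = (ts // window_size_ms) * window_size_ms
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in windows.items()}

# Three events split across two 1-second windows.
events = [(100, "a"), (900, "a"), (1500, "b")]
print(tumbling_count(events, 1000))
# {0: {'a': 2}, 1000: {'b': 1}}
```

Real frameworks layer more on top of this idea (out-of-order arrival, state stores, window retention), but the core bucketing logic is this simple.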
Next, we’ll describe Kafka’s new stream processing library, Kafka Streams, and the design decisions and tradeoffs it makes. Kafka Streams represents a new design point in the stream processing space. Where most frameworks provide a service for running stream processing applications, Kafka Streams emphasizes low-overhead development that feels like developing any other application. This trades off the benefits of centrally managed stream processing infrastructure for an easier adoption path and straightforward integration with your existing deployment tooling. Kafka Streams is also designed to work solely with Kafka. This limits its use to data that is already in Kafka and requires additional tools to import and export data from other systems, but it allows Kafka Streams to leverage unique Kafka features such as consumer groups to keep implementation complexity low and to get scalability and fault tolerance nearly for free. Combined, these decisions represent a new design point for stream processing applications, one that we believe addresses use cases not well served by today’s popular frameworks.
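The consumer-group idea behind that "nearly for free" scalability can be sketched in a few lines: topic partitions are divided among the members of a group, so starting another application instance spreads the work, and a failed instance's partitions are handed to the survivors. This toy round-robin assignment is illustrative only; real Kafka rebalancing is driven by the broker-side group coordinator and pluggable assignors.

```python
def assign_partitions(partitions, members):
    """Round-robin partitions across sorted group members.

    Returns {member: [partition, ...]}. A toy stand-in for Kafka's
    group rebalancing protocol.
    """
    ordered_members = sorted(members)
    assignment = {m: [] for m in ordered_members}
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered_members[i % len(ordered_members)]].append(p)
    return assignment

# Two instances of the same application share four partitions.
print(assign_partitions(range(4), ["app-1", "app-2"]))
# {'app-1': [0, 2], 'app-2': [1, 3]}

# If app-2 dies, rerunning assignment gives everything to app-1.
print(assign_partitions(range(4), ["app-1"]))
# {'app-1': [0, 1, 2, 3]}
```

Because the partition is the unit of both parallelism and recovery, the application author writes single-partition logic and lets the group protocol handle scaling out and failing over.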
Ewen Cheslack-Postava is a Kafka committer and engineer at Confluent building a stream data platform based on Apache Kafka to help organizations reliably and robustly capture and leverage all their real-time data. He received his PhD from Stanford University where he developed Sirikata, an open source system for massive virtual environments. His dissertation defined a novel type of spatial query giving significantly improved visual fidelity, and described a system for efficiently processing these queries at scale.