Spark has become a data processing giant, but it still leaves much as an exercise for the user: developers must write specialized logic to move between batch and streaming modes, handle late or out-of-order data by hand, and explicitly wire complex flows together.
This talk looks at how we tackled these problems over a multi-petabyte dataset at Cerner. We start with how hand-written solutions to these problems evolved into prescriptive practices, opening up development of such systems to a wider audience. From there we look at how the emergence of Google’s Dataflow on Spark is helping us take the next step: the tradeoffs between correctness, latency, and cost become a simple, easily changeable decision rather than a deep analysis for each new need. Finally, we look at challenges unique to processing at large organizations, such as making independent units of processing composable into larger pipelines and usable in both batch and streaming modes.
Ryan Brush is a software engineer at Cerner, where he works on Hadoop-based systems to bring together and make sense of the world’s health data. He dabbles in writing, having contributed chapters to Hadoop: The Definitive Guide and 97 Things Every Programmer Should Know. He is also the author of Clara, an open-source rule engine in Clojure. Ryan’s recent focus is on ways to declaratively express domain expertise and apply it at scale.