spark

Philly ETE 2016 #10 – Ryan Brush – Untangling Healthcare with Spark and Dataflow

Spark is becoming a data processing giant, but it leaves much as an exercise for the user. Developers need to write specialized logic to move between batch and streaming modes, manually deal with late or out-of-order data, and explicitly wire complex flows together. This talk looks at how we tackled these problems over a multi-petabyte dataset at Cerner.

Philly ETE 2016 #4 – Evan Chan – NoLambda: A new architecture combining streaming, ad hoc, machine learning, and batch analytics

In today’s world of exploding big and fast data, developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda—a path for fast streaming analytics using NoSQL stores such as Cassandra and HBase with a separate batch path involving HDFS and Parquet. While this approach works, it involves too many moving parts, too many technologies for ops, and too many engineering hours. Helena Edelson and Evan Chan highlight a much simpler approach to combine streaming and ad hoc/batch analysis using what they call the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka), plus FiloDB, a new entrant to the distributed-database world that combines streaming and ad hoc analytics.

Philly ETE 2016 – Ryan Brush – Untangling Healthcare with Spark and Dataflow

Spark is becoming a data processing giant, but it leaves much as an exercise for the user. Developers need to write specialized logic to move between batch and streaming modes, manually deal with late or out-of-order data, and explicitly wire complex flows together. This talk looks at how we tackled these problems over a multi-petabyte dataset at Cerner.

Philly ETE 2016 – Evan Chan – NoLambda: A new architecture combining streaming, ad hoc, machine learning, and batch analytics

In today’s world of exploding big and fast data, developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda—a path for fast streaming analytics using NoSQL stores such as Cassandra and HBase with a separate batch path involving HDFS and Parquet. While this approach works, it involves too many moving parts, too many technologies for ops, and too many engineering hours. Helena Edelson and Evan Chan highlight a much simpler approach to combine streaming and ad hoc/batch analysis using what they call the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka), plus FiloDB, a new entrant to the distributed-database world that combines streaming and ad hoc analytics.

Philly ETE 2015 – Jay Kreps – Putting Apache Kafka to Use

I will discuss the experience at LinkedIn and elsewhere moving from batch-oriented ETL to real-time streams using Apache Kafka. I’ll talk about how the design and implementation of Kafka was driven by this goal of acting as a real-time platform for event data.