TechCast #102 – Sameer Farooqui and Brian Clapper on Spark


Today’s podcast features Ken Rimple’s interview with Sameer Farooqui (@blueplastic) and Brian Clapper (@brianclapper) of Databricks. The creators of Databricks are also the creators of Apache Spark – a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics.

Ken asks them what makes Spark so powerful, covering not only the underlying infrastructure and its use of Resilient Distributed Datasets (RDDs), but also how Apache Spark 2.0 selects the optimal query strategy.

We also talk about the popularity of Spark and what it takes to be a Spark developer.

We talk to Sameer and Brian about:

  • What is Databricks? Databricks is a Just-in-Time Data platform that empowers anyone to easily build and deploy advanced analytics solutions with Apache Spark.
  • How does Apache Spark work, and what does it mean to enterprises and server-side developers? Spark is an open-source, big data processing engine built around speed, ease of use, and sophisticated analytics. It’s Apache-licensed, written in Scala, and offers developers different perspectives on their data. What makes Spark so useful is a unified API across SQL, DataFrames, streaming, machine learning, and graph processing, giving developers the ability to use one framework and intermix all their code between these data paradigms.
  • Persisting data to disk slows things down, so one of the fundamental goals of Spark’s developers was to keep as much data in memory as possible during processing.
  • RDDs – resilient distributed datasets – what they are, how they get created, and their benefits within the Spark ecosystem.
  • Even though Brian doesn’t make tech predictions – is 2016 the year Spark really takes off?
  • The best use cases for Apache Spark. If you really need data operations to be performed in < 10 milliseconds, then Spark Streaming probably isn't for you. In fact, from Sameer and Brian's research at Databricks, 90-95% of streaming use cases can be satisfied with a batch interval of half a second. The true power of Spark is that it doesn't just do stream analysis - it can be mixed with machine learning, graph processing, and SQL queries, saving the programmer from the cognitive overload of having to learn a bunch of new engines.
  • Sameer’s favorite Spark use case, which involves an incredibly high throughput of nervous system imagery data of zebrafish brains.
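
To make the half-second batch interval from the streaming discussion a little more concrete, here is a minimal plain-Python sketch of the micro-batching idea: incoming events are grouped into fixed-width time windows and each window is then processed as a small batch. This is only an illustration, not Spark's API. The function name `micro_batches`, the constant `BATCH_INTERVAL_S`, and the sample events are all hypothetical.

```python
from collections import defaultdict

# Hypothetical constant: the half-second batch interval mentioned above.
BATCH_INTERVAL_S = 0.5


def micro_batches(events, interval=BATCH_INTERVAL_S):
    """Group (timestamp_in_seconds, value) pairs into fixed-width batches.

    Returns the batches in time order. Windows that received no events
    are skipped in this toy version.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Every event falls into the window whose index is floor(t / interval).
        batches[int(timestamp // interval)].append(value)
    return [batches[index] for index in sorted(batches)]


# Made-up sample events: (arrival time in seconds, payload).
sample_events = [(0.10, "a"), (0.45, "b"), (0.50, "c"), (1.20, "d")]
print(micro_batches(sample_events))
```

Because results are only produced once per window, latency cannot drop below the batch interval, which is why a hard requirement of under 10 milliseconds is a poor fit for this model.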