I recently built a Spark job that runs every morning to collect the previous day’s data from a few different datasources, join some reference data, perform a few aggregations and write all of the results to Cassandra. All in roughly three minutes (not too shabby).
Today’s podcast features Ken Rimple’s interview with Sameer Farooqui and Brian Clapper of DataBricks, the creators of the Spark Big Data engine.
SbtTestGrouping Running tests that use a HiveContext On our current project, we utilize Spark SQL and have several ScalaTest based suites which require a SparkContext and HiveContext. These are started before a suite runs and shut down after it completes via the BeforeAfterAll mixin trait. Unfortunately due to this bug (also see this related pull … Read More
This talk presents Apache Spark, Spark Streaming, Apache Kafka, Apache Cassandra and Akka as supporting Lambda architecture in the context of a fault tolerant, streaming big data pipeline.