Apache Spark

IoT on AWS – That’s Not A Data Lake…

This talk will review two common use cases for captured metric data: 1) real-time analysis, visualization, and quality assurance, and 2) ad-hoc analysis. To support these use cases, metric data must be ingested using a robust, fault-tolerant streaming framework. The most common open source streaming options will be mentioned, but this talk will be concerned with Apache Flink specifically. A brief discussion of Apache Beam will also be included in the context of the larger discussion of a unified data processing model.
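To give a flavor of what that ingestion layer might look like, here is a minimal Flink sketch in Scala. It is illustrative only: the `Metric` case class, the socket source, and the field names are all invented for the example (a real deployment would read from a Kafka or Kinesis connector), and it computes per-device one-minute sums.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical metric record; the fields are placeholders for this sketch.
case class Metric(deviceId: String, value: Double, timestamp: Long)

object MetricIngest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in source: a production pipeline would use a Kafka/Kinesis connector.
    val raw = env.socketTextStream("localhost", 9999)

    // Parse "deviceId,value,timestamp" lines into Metric records.
    val metrics = raw.map { line =>
      val Array(id, v, ts) = line.split(",")
      Metric(id, v.toDouble, ts.toLong)
    }

    // One-minute tumbling windows per device, summing the metric value.
    metrics
      .keyBy(_.deviceId)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
      .sum("value")
      .print()

    env.execute("metric-ingest")
  }
}
```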

Best practices around data persistence will be discussed. An attempt will be made to eliminate confusion about the format data should take when it is 'at rest'. Different serialization formats will be compared in the context of the most typical analysis use cases. Finally, fully managed solutions such as AWS Data Lake will be mentioned briefly, along with their relative advantages and disadvantages.
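To make the 'at rest' question concrete, here is one hypothetical Spark snippet (the bucket paths and the `date` column are invented for illustration) that converts raw JSON into partitioned Parquet, a columnar format that typically serves ad-hoc analytical queries far better than row-oriented text:

```scala
import org.apache.spark.sql.SparkSession

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("persist-demo").getOrCreate()

    // Hypothetical landing zone; substitute your own raw data location.
    val raw = spark.read.json("s3://example-bucket/raw/metrics/")

    // Columnar, compressed, and partitioned so queries can prune by date.
    raw.write
      .mode("overwrite")
      .partitionBy("date") // assumes the raw records carry a 'date' column
      .parquet("s3://example-bucket/curated/metrics/")

    spark.stop()
  }
}
```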

By Eric Snyder, Software Architect at Chariot Solutions

Real World Spark Lessons

I recently built a Spark job that runs every morning to collect the previous day's data from a few different data sources, join some reference data, perform a few aggregations, and write all of the results to Cassandra. All in roughly three minutes (not too shabby).
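The overall shape of such a job is roughly the sketch below. The source paths, column names, and Cassandra keyspace/table are all placeholders rather than the job's real inputs, and it assumes the spark-cassandra-connector is on the classpath:

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

object DailyRollup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

    // Hypothetical inputs: yesterday's events plus slowly-changing reference data.
    val events    = spark.read.parquet("s3://example-bucket/events/2021-01-01/")
    val reference = spark.read.parquet("s3://example-bucket/reference/")

    // Join reference data, then aggregate per customer and region.
    val daily = events
      .join(reference, Seq("customer_id"))
      .groupBy("customer_id", "region")
      .agg(
        F.sum("amount").as("total_amount"),
        F.count(F.lit(1)).as("event_count")
      )

    // Write the results to Cassandra via the spark-cassandra-connector.
    daily.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "daily_totals"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```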

SBT: Group annotated tests to run in forked JVMs

Running tests that use a HiveContext: on our current project, we use Spark SQL and have several ScalaTest-based suites that require a SparkContext and HiveContext. These are started before a suite runs and shut down after it completes via the BeforeAndAfterAll mixin trait. Unfortunately, due to this bug (also see this related pull … Read More
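For reference, the core of the technique looks something like this build.sbt sketch (sbt 1.x slash syntax). The full post groups only the annotated suites; this simplified version forks every suite into its own JVM so that suites starting a SparkContext/HiveContext never share process state:

```scala
// build.sbt fragment: give each test suite its own forked JVM.
import sbt.Tests.{Group, SubProcess}

def forkedGroups(tests: Seq[TestDefinition]): Seq[Group] =
  tests.map { test =>
    Group(
      name = test.name,          // one group per suite
      tests = Seq(test),
      runPolicy = SubProcess(ForkOptions()) // run the group in a forked JVM
    )
  }

Test / testGrouping := forkedGroups((Test / definedTests).value)
```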