The O'Reilly AI Conference
I recently attended the O’Reilly AI Conference in New York, where artificial intelligence practitioners showcased the impressive strides they’ve made so far in using AI for real-world applications.
The tech industry is in the middle of a massive, uncontrolled social experiment. Having made commercial mass surveillance the economic foundation of our industry, we are now learning how indiscriminate collections of personal data, and the machine learning algorithms they fuel, can be put to effective political use. Unfortunately, these experiments are being run in … Read More
We will talk about Spotify’s story of migrating our big data infrastructure to Google Cloud. Over the past year or so we moved away from maintaining our own 2500+ node Hadoop cluster to managed services in the cloud. We replaced two key components in our data processing stack, Hive and Scalding, with BigQuery and Scio … Read More
Today’s podcast features Ken Rimple’s interview with Sameer Farooqui and Brian Clapper of Databricks, the company founded by the creators of the Apache Spark big data engine.
Spark is becoming a data processing giant, but it leaves much as an exercise for the user. Developers need to write specialized logic to move between batch and streaming modes, manually deal with late or out-of-order data, and explicitly wire complex flows together. This talk looks at how we tackled these problems over a multi-petabyte dataset at Cerner.
In this talk we present a business use case in which Capital One needs to process customer activities in real time and react to events appropriately. We then share our experience building a real-time analytics application that serves the business using a set of open source software frameworks, with Apache Flink at its core as the real-time stream processing engine.
Kafka Streams represents a new design point in the stream processing space. Where most frameworks provide a service for running stream processing applications, Kafka Streams emphasizes low-overhead development that feels more like developing any other application.
In today’s world of exploding big and fast data, developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda—a path for fast streaming analytics using NoSQL stores such as Cassandra and HBase with a separate batch path involving HDFS and Parquet. While this approach works, it involves too many moving parts, too many technologies for ops, and too many engineering hours. Helena Edelson and Evan Chan highlight a much simpler approach to combine streaming and ad hoc/batch analysis using what they call the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka), plus FiloDB, a new entrant to the distributed-database world that combines streaming and ad hoc analytics.