stream processing

Philly ETE 2016 – Evan Chan – NoLambda: A new architecture combining streaming, ad hoc, machine learning, and batch analytics

In today’s world of exploding big and fast data, developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda—a path for fast streaming analytics using NoSQL stores such as Cassandra and HBase with a separate batch path involving HDFS and Parquet. While this approach works, it involves too many moving parts, too many technologies for ops, and too many engineering hours. Helena Edelson and Evan Chan highlight a much simpler approach to combine streaming and ad hoc/batch analysis using what they call the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka), plus FiloDB, a new entrant to the distributed-database world that combines streaming and ad hoc analytics.

Philly ETE 2016 – Srinivas Palthepu – Emergence of Real-Time Analytics: Real-time Analysis of Customer Financial Activities With Apache Flink

In this talk we present a business use case in which Capital One needs to process customer activities in real time and react to events appropriately. We then present our experience building a real-time analytics application that serves the business using a set of open source software frameworks, with Apache Flink at its core as the real-time stream processing engine.

Data I/O 2013 – Web-scale Data Processing: Practical approaches for low-latency and batch – Edward Capriolo

In this talk, Hive and Cassandra author (and Hive committer and PMC member) Edward Capriolo will discuss common big-data software challenges and how they can be solved using both batch and stream processing. The technology focus will primarily be on Apache Kafka for publish-subscribe messaging, Storm for stream processing, and Apache Cassandra as a NoSQL data store.

PhillyETE Screencast #21 – Stream Processing – Philosophy, Concepts, and Technologies – Dan Frank

From the abstract: “Stream processing has emerged in recent years as a very fast-growing paradigm in data science infrastructure. This rise can be partly attributed to some factors external to system design, such as business demands for near-realtime data or inability of hardware to manage an ever-growing data set. However, this paradigm also possesses many …