Bigger, faster, smarter - October 30, 2013
Data IO showcases how to leverage today's applications and open-source frameworks to connect tens, hundreds, or even millions of devices, things, and people, and, with any luck, make some sense of the data being captured. We'll show you not only how the pieces fit together, but also why these platforms are capable of doing what they do.
Topics discussed include machine learning, graph databases with Neo4j, Amazon Web Services, and more.
Check this page frequently for additional speakers and a schedule.
Time            Room A    Room B
8:30 - 9:00     TBA       TBA
9:00 - 9:50     TBA       TBA
10:00 - 10:50   TBA       TBA
11:00 - 11:50   TBA       TBA
11:50 - 1:00    TBA       TBA
1:00 - 1:50     TBA       TBA
2:00 - 2:50     TBA       TBA
3:00 - 3:50     TBA       TBA
Speakers / Topics in Detail
Lance Ball - Red Hat
Vert.x: Async Data from Cluster to Browser
Edward Capriolo - Dstillery
Web-scale data processing: practical approaches for low-latency and batch
Apache Hadoop provides a useful implementation of the MapReduce paradigm for performing ETL (Extract, Transform, Load) processes. Apache Hive enhances Hadoop's capabilities by allowing users to interact with Hadoop via a structured query language. Even with Hadoop and Hive, building some solutions in a batch-driven system can be cumbersome, particularly when dealing with time-sensitive latency constraints. Stream processing can be used in tandem with, or as an alternative to, batch processing, but it brings its own set of challenges at web scale, including new software stacks, application programming interfaces, failure modes, and software life-cycle management.
In this talk, Hive and Cassandra author (and Hive committer and PMC member) Edward Capriolo will discuss common big-data software challenges and how they can be solved using both batch and stream processing. The technology focus will be primarily on Apache Kafka for publish-subscribe messaging, Storm for stream processing, and Apache Cassandra as a NoSQL data store.
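For a taste of the publish-subscribe model the talk centers on, here is a minimal sketch using the kafka-python client; the broker address, topic name, and payload are invented for illustration, and the Storm and Cassandra pieces are omitted.

```python
# A minimal publish-subscribe sketch with the kafka-python client.
# Broker address, topic name, and payload are invented for illustration.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": 42, "url": "/products/7"}')
producer.flush()  # block until the message is actually sent

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # replay the topic from the beginning
    consumer_timeout_ms=5000,       # stop iterating once the topic is drained
)
for message in consumer:
    print(message.value)            # raw bytes; deserialization is up to you
```

In a full pipeline, a Storm topology (or another stream processor) would sit on the consumer side, with results landing in Cassandra.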
Edward Capriolo is a developer at Dstillery where he helps design and maintain applications for distributed data storage and processing systems for the internet advertising industry. He is a member of the Apache Software Foundation and on the Project Management Committee for the Hadoop-Hive project. He has authored two books on big data, Programming Hive (O'Reilly) and the Cassandra High Performance Cookbook (Packt).
Max De Marzi is a graph database enthusiast. He built the Neography Ruby gem, a REST API wrapper for the Neo4j graph database. He is addicted to learning new things, loves a challenge, and enjoys finding pragmatic solutions.
Camille Fournier - Rent the Runway
This will be an intro-level talk about ZooKeeper: why it's useful, and what you should do with it now that you've got it running. I will cover the high-level purpose of ZooKeeper and the guarantees it provides, as well as some of the basic use cases and operational concerns.
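For a flavor of what "now that you've got it running" might look like, here is a small sketch using the kazoo Python client; the connection string, paths, and data are placeholders.

```python
# A minimal ZooKeeper sketch with the kazoo client; connection string
# and paths are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# An ephemeral, sequential znode disappears automatically when this
# session ends: the building block for service discovery and
# leader election recipes.
zk.create("/services/worker-", b"10.0.0.5:8080",
          ephemeral=True, sequence=True, makepath=True)

print(zk.get_children("/services"))   # e.g. ['worker-0000000000']

zk.stop()
```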
Camille Fournier is the Head of Engineering at Rent the Runway. Prior to joining Rent the Runway she was an infrastructure engineer at Goldman Sachs.
In her limited spare time she is a distributed systems hacker, serving as a PMC member and committer for the Apache ZooKeeper project.
Lars George - Cloudera
This talk will address valuable lessons learned with the current versions of HBase. There are inherent architectural features that warrant careful evaluation of the data schema and of how to scale out a cluster. The audience will get a best-practices summary of where the limitations in HBase's design lie and how to avoid them. In particular, we will discuss issues like proper memory tuning (for reads and writes), optimal flush file sizing, compaction tuning, and the number of write-ahead logs required. We will also compare the theoretical write performance with that observed on real clusters. A collection of cheat sheets and example calculations for cluster sizing rounds out the talk.
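The schema-design point is easy to demonstrate: row keys determine how load spreads across regions. Below is a small sketch using the happybase Python client (which talks to HBase through its Thrift gateway); the host, table, and key layout are invented for illustration.

```python
# A sketch with the happybase client, which talks to HBase through its
# Thrift gateway; host, table, and key layout are invented.
import happybase

connection = happybase.Connection("localhost")
connection.create_table("metrics", {"d": dict(max_versions=1)})

table = connection.table("metrics")

# Row-key design is part of the schema evaluation the talk describes:
# prefixing with a sensor id (rather than a raw timestamp) spreads
# time-ordered writes across regions instead of hot-spotting one.
table.put(b"sensor42|20131030-1200", {b"d:temp": b"21.5"})

for key, data in table.scan(row_prefix=b"sensor42|"):
    print(key, data)
```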
Loading data into Hadoop is easy, since it behaves like a file system. Reading and analyzing it, though, depends very much on the questions being asked. If the data is composed of nested data structures, as provided for example by Apache Avro, it is advantageous to choose a file format that lends itself to the analytical processing needs. For nested records, a columnar format with an explicit type system can store data very efficiently - for example, by applying a compression and encoding algorithm specific to each field's type. Parquet is based on the ideas presented in Google's Dremel paper and implements them in the open-source Hadoop ecosystem. This allows data processing tools like Hive and Impala to process very large datasets as effectively as possible. This presentation will explain what Parquet is, what its goals are, and how it is used within Hadoop.
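As a small illustration of the columnar idea behind Parquet (separate from the talk's Hive and Impala focus), here is a sketch using the pyarrow Python bindings; the file name, columns, and data are invented.

```python
# A sketch of the columnar idea using the pyarrow Parquet bindings;
# file name, columns, and data are invented.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "clicks":  [10, 3, 7],
})
pq.write_table(table, "events.parquet", compression="snappy")

# The payoff on read: only the requested columns are decoded,
# each stored with type-specific encoding and compression.
subset = pq.read_table("events.parquet", columns=["country", "clicks"])
print(subset.to_pydict())
```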
Lars George has been involved with HBase since 2007, and became a full HBase committer in 2009. He has spoken at many Hadoop User Group meetings, and at conferences such as ApacheCon, FOSDEM, QCon, Hadoop World, and Hadoop Summit. He also started the Munich OpenHUG meetings. Lars now works for Cloudera as the EMEA Chief Architect, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, helping them build their next data-driven solution. Lars is the author of O'Reilly's HBase: The Definitive Guide.
Grant Ingersoll - LucidWorks
Whether it's enabling core search, powering a next-generation product recommendation engine, or building agile business intelligence tools, Apache Lucene and Solr are highly capable, open-source search technologies that make it easy for organizations to drastically enhance data access. With the release of version 4.4 of Lucene and Solr, it is easier than ever to add search capabilities to your data-driven application and scale them. In this talk, Lucene and Solr committer Grant Ingersoll will walk you through the latest and greatest capabilities in Lucene and Solr related to relevance, distributed search, and faceting, and show you how to leverage these capabilities to build fast, efficient, scalable, next-generation data-driven applications.
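Solr exposes all of this over a simple HTTP API, so a faceted query needs nothing more than an HTTP client; in the sketch below, the core name, field names, and query are placeholders.

```python
# A faceted Solr query over its HTTP API; core name, field names,
# and query are placeholders.
import requests

resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={
        "q": "title:laptop",        # full-text query
        "facet": "true",
        "facet.field": "manufacturer",
        "wt": "json",               # ask for a JSON response
    },
)
body = resp.json()
print(body["response"]["numFound"])
# Facet counts come back as a flat [term, count, term, count, ...] list.
print(body["facet_counts"]["facet_fields"]["manufacturer"])
```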
As Mahout rolls towards a 1.0 release, Mahout committer and co-founder Grant Ingersoll will provide an overview of what's happening with the machine learning project and what to look forward to next.
Grant Ingersoll is the CTO and co-founder of LucidWorks as well as an active member of the Lucene community – a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project, and a long-standing member of the Apache Software Foundation. Grant's prior experience includes work in natural language processing and information retrieval at the Center for Natural Language Processing at Syracuse University. Grant earned his B.S. from Amherst College in Math and Computer Science and his M.S. in Computer Science from Syracuse University. Grant is also the co-author of Taming Text from Manning Publications.
This talk will be an introduction to scientific programming in Python, focusing on the NumPy and SciPy packages. NumPy extends Python with classes for fast vectorized operations on multidimensional arrays. SciPy is a collection of algorithms built on top of NumPy, and provides modules for many common scientific algorithms and mathematical tools, from FFTs and signal processing to clustering and image processing. NumPy and SciPy can deliver the performance of highly optimized C and Fortran code with the ease of use of a modern interpreted scripting language.
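A minimal sketch of the kind of code the talk covers, assuming recent NumPy and SciPy; the signal and data below are synthetic.

```python
# Synthetic examples of vectorized NumPy and a couple of SciPy modules.
import numpy as np
from scipy.fft import fft
from scipy.cluster.vq import kmeans, whiten

# Vectorized math: no explicit Python loop over the 500 samples.
t = np.linspace(0.0, 1.0, 500)
signal = np.sin(2 * np.pi * 5 * t)          # a 5 Hz sine wave
spectrum = np.abs(fft(signal))
print(spectrum[:250].argmax())              # dominant frequency bin: 5

# k-means clustering on random 2-D points.
points = whiten(np.random.rand(100, 2))     # normalize each feature
centroids, distortion = kmeans(points, 3)
print(centroids)
```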
Walt recently finished his PhD in Computer Science at Drexel University. His dissertation was on canonical behavior patterns, which he'll be glad to explain to you if you ask nicely. Prior to grad school, he worked on high-throughput online systems at the Philadelphia Stock Exchange, QVC and SIG. He also wrote code to do statistical analysis of DNA microarray data at the Wistar Institute. Walt is currently back at Drexel working as a postdoc, where he's doing computational image sequence analysis of stem cell movies. In his spare time he runs the Philadelphia Perl Mongers.
Claudia Perlich - Dstillery
All the data and still not enough???
Predictive modeling is one of the figureheads of big data. Machine learning theory asserts that the more data, the better, and empirical observations suggest that the more granular the data, the better the performance (provided you have modern algorithms and big data). But the paradox of predictive modeling is that when you need models the most, even all the data is not enough: there are only so many people buying luxury cars online. So even in the day and age of big data, there remains an art to predictive modeling in situations where the right data is scarce. This talk will present a number of cases where enough of the right data is simply not obtainable. For those instances, we discuss some tricks of the trade, including transfer learning and quantile estimation.
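As a toy sketch of the transfer-learning trick mentioned above, assuming scikit-learn: a model is trained on an abundant proxy label and then used to rank prospects for the scarce target. All data, sizes, and names here are synthetic.

```python
# A toy sketch of transfer learning with scikit-learn: train on an
# abundant proxy label, then rank prospects for the scarce target.
# All data, sizes, and names are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Proxy task: plenty of users labeled by a cheap signal (e.g. whether
# they visited the advertiser's site).
X_proxy = rng.normal(size=(10_000, 20))
y_proxy = (X_proxy[:, 0] + rng.normal(size=10_000) > 1).astype(int)
proxy_model = LogisticRegression().fit(X_proxy, y_proxy)

# Target task: the actual conversion (e.g. a luxury-car purchase) is
# far too rare to train on directly, so score prospects with the
# proxy model and keep the top quantile.
X_prospects = rng.normal(size=(500, 20))
scores = proxy_model.predict_proba(X_prospects)[:, 1]
top_decile = np.argsort(scores)[-50:]
print(top_decile[:5])
```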
Claudia Perlich is a renowned data scientist who currently serves as Chief Scientist at Dstillery. In this role, Claudia designs, develops, analyzes and optimizes the machine learning that informs brands on how to find their best prospective customers. She and the team of Dstillery scientists live and breathe web-wide data to drive new business and marketplace intelligence. An active industry speaker and frequent contributor to industry publications, Claudia thrives on advocating best data practices. She has published over 50 scientific articles, and holds multiple patents in machine learning. She has won many data mining competitions, including the prestigious KDD Cup three times for her work on movie ratings in 2007, breast cancer detection in 2008, and churn and propensity predictions for telecom customers in 2009, as well as the KDD best paper award for data leakage in 2011 and bid optimization in 2012.
Eric Snyder
Is Amazon's new managed, lower-cost, petabyte-scale warehousing solution a game changer? We'll review the costs and discuss what does (or does not) make Amazon Redshift reliable, scalable, and effective. We'll dive into the technical details behind the query and storage engines and expose what works well and what does not. This talk should benefit both those who are and those who are not already part of the Amazon Web Services ecosystem.
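One practical detail worth knowing up front: Redshift speaks the PostgreSQL wire protocol, so a standard PostgreSQL driver is enough to query it. A minimal sketch with psycopg2, where the cluster endpoint, credentials, and table are placeholders:

```python
# Redshift speaks the PostgreSQL wire protocol, so a standard driver
# like psycopg2 works; endpoint, credentials, and table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,                      # Redshift's default port
    dbname="analytics",
    user="admin",
    password="...",
)
cur = conn.cursor()
cur.execute("SELECT country, COUNT(*) FROM events GROUP BY country;")
for row in cur.fetchall():
    print(row)
conn.close()
```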
TL;DR - Eric was a consultant, then a "big data" engineer.
Eric Snyder is a software engineer who currently enjoys building large-scale, cloud-based big data solutions with Amazon Redshift, Amazon Elastic MapReduce, Hadoop, Hive, and other technologies. He has previously worked with businesses in a variety of markets, building solutions that range from complex web applications to home automation.