In this talk, Hive and Cassandra author (and Hive committer and PMC member) Edward Capriolo will discuss common big-data software challenges and how they can be solved using both batch and stream processing. The talk will focus primarily on Apache Kafka for publish-subscribe messaging, Storm for stream processing, and Apache Cassandra as a NoSQL data store.
As Mahout rolls toward a 1.0 release, Mahout committer and co-founder Grant Ingersoll will provide an overview of what’s happening with the machine learning project and what to look forward to next.
Predictive modeling is one of the figureheads of big data. Machine learning theory asserts that the more data, the better, and empirical observations suggest that the more granular the data, the better the performance (provided you have modern algorithms and big data). The paradox of predictive modeling, however, is that when you need models the most, even all the data is not enough.
Vert.x is an asynchronous, event-driven application platform similar in style to Node.js, except that it runs on the JVM. It supports several JVM languages, including JavaScript, and uses a multi-reactor event loop to handle a very high number of concurrent connections. Learn about it in this screencast from Data I/O 2013.
This talk will address valuable lessons learned with the current versions of HBase. There are inherent architectural features that warrant careful evaluation of the data schema and of how to scale out a cluster. The audience will get a best-practices summary of where the design of HBase has limitations and how to avoid them. In particular, we will discuss issues like proper memory tuning (for reads and writes), optimal flush file sizing, compaction tuning, and the number of write-ahead logs required. We will also compare the theoretical write performance with that observed on real clusters. A collection of cheat sheets and example calculations for cluster sizing rounds out the talk.
Is Amazon’s new managed, lower-cost, petabyte-scale warehousing solution a game changer? We’ll review the costs and discuss what does (or does not) make Amazon Redshift reliable, scalable, and effective. We’ll dive into the technical details behind the query and storage engines and expose what works well and what does not. This talk should benefit both those who are already part of the Amazon Web Services ecosystem and those who are not.
This will be an intro-level talk about ZooKeeper, why it’s useful, and what you should do with it now that you’ve got it running. I will cover the high-level purpose of ZooKeeper and the guarantees it provides, as well as some of the basic use cases and operational concerns.
With the release of version 4.4 of Lucene and Solr, it is easier than ever to add search capabilities to your data-driven application and to scale them. In this talk, Lucene and Solr committer Grant Ingersoll will walk you through the latest and greatest capabilities in Lucene and Solr related to relevance, distributed search, and faceting, and show you how to leverage these capabilities to build fast, efficient, scalable, next-generation data-driven applications.