The DataPhilly Meetups

So many great user groups, but so little time! This year I’ve started going to the DataPhilly meetups, and I think I’m hooked. The bottom line is DataPhilly talks are very intriguing, expose you to topics you don’t encounter every day, and give you the chance to meet “non-traditional” developers (scientists and statisticians), whose ranks are rapidly growing.
First off, a lot of the talks use Python (and sometimes R). There are a ton of Python libraries for statistics, data analysis, machine learning, NLP, etc., so that’s not surprising. What I like is getting exposure to a language I don’t code in frequently and the ecosystem around it. In addition, a lot of the time the real gems are hidden in the questions and discussions during the talks and in the lightning round sessions at the end.
What follows is a summary of the last meetup:
The first half was on mrjob (https://github.com/Yelp/mrjob), a pretty lightweight (from what I can tell) MapReduce library written in Python that lets you run Hadoop streaming jobs locally, on Amazon EMR (Elastic MapReduce), or on your own Hadoop cluster. The talk started with word counting, which I swear is the “Hello World” of big data, but as promised quickly moved on to detailed, specific examples (backed by code) of how it’s used at Monetate to generate product recommendations and gather statistics on user behavior. On the surface it seems super easy to write the MapReduce jobs. The best part was when they discussed user behavior statistics and calculating variance: a Temple U. professor in the audience got on his soapbox and vehemently warned of the dangers of calculating variance using the “single-pass algorithm” with single-precision numbers due to underflow and overflow. (More on that at the end.)
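For anyone who hasn’t seen the “Hello World” in question, here is a minimal sketch of the map and reduce steps for word counting. It is plain Python and deliberately does not use mrjob’s API; the names `mapper`, `reducer`, and `run_job` are made up for illustration, and `run_job` is only a stand-in for the grouping (“shuffle”) work a real framework does between the two steps.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce step: add up all the counts collected for a single word."""
    return word, sum(counts)


def run_job(lines):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "The lazy dog and the fox"]))
```

The point of the split is that each `mapper` call only sees one line and each `reducer` call only sees one word, which is what lets a cluster spread the work across machines.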
The second half was on Scrapy (http://scrapy.org/), a Python library – scratch that – actually an “appliance” for crawling and web scraping. It was a super basic intro that went over its high-level architecture, spiders, and how to parse the HTML. For those who didn’t know anything about it, it was a decent intro. Of course “Little Bobby” asked the obvious question: how do you maintain the heaping pile of scraping code? The speaker mentioned that Monetate used it heavily back in the day when they were smaller, but now some of their clients give them direct data feeds instead, so they don’t always have to resort to scraping. It’s sort of crazy to think how much data is still “trapped” in web pages. I don’t think that problem is going away anytime soon.
At the end of the night, two people presented during the lightning round:
1. “How airline crew schedules are made” – The speaker presented a modern twist on this combinatorial problem that now tries to minimize pilot fatigue when generating schedules. He discussed the features of a schedule that take this into account (pilot awake time, trips at night, number of time zones crossed, etc.) and how these features are combined with empirical data on fatigue (cognitive tests based on sleep deprivation, etc.) to come up with a “best fit” formula to use in a cost function. Hey, pilots are human – they get sleepy and jet-lagged just like the rest of us!
2. The aforementioned Temple U. professor quickly set up an example in R to back up his earlier claim, showing how dangerous it is to calculate variance with single-precision numbers and the single-pass algorithm – you can end up with NEGATIVE variance, which makes no sense. He then showed it again using double precision and the two-pass algorithm. Here’s a Wikipedia link that describes it in more detail: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance – What I personally took away from this: when you perform anything more than very simple computations, know what the library does under the hood, and use Google to find out whether good numeric algorithms already exist. Remember, computers are limited-precision machines!
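To make the variance point concrete, here is a minimal sketch in Python rather than R (the professor’s actual code isn’t reproduced here). Python floats are double precision, so this assumes a much larger offset (1e9, with sample values in the style of the Wikipedia article above) to provoke the same cancellation that single precision runs into far sooner. The function names are illustrative only.

```python
def naive_variance(xs):
    """Single-pass sample variance: (sum(x^2) - sum(x)^2 / n) / (n - 1).

    The two terms being subtracted are huge and nearly equal, so the
    rounding error in each can swamp the small true difference.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - (total * total) / n) / (n - 1)


def two_pass_variance(xs):
    """Two-pass sample variance: compute the mean first, then sum squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Four values with a true sample variance of 30, shifted by a large offset.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

if __name__ == "__main__":
    print(naive_variance(data))     # negative (about -170.67) with IEEE 754 doubles
    print(two_pass_variance(data))  # 30.0
```

The deviations from the mean are small numbers (-6, -3, 3, 6), so the two-pass version never has to subtract two giant, nearly equal quantities. If a second pass over the data is too expensive, the Wikipedia article also describes single-pass alternatives that are far more stable than the naive formula.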
BTW, last month, one of the talks was on scikit-learn (http://scikit-learn.org/stable/), a machine learning library for Python. The example fed Wikipedia articles written in different languages to scikit-learn as training data and showed a program that identifies which language an arbitrary body of text is written in. It was a gentle intro to machine learning, “document vectorizing”, and classification.
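To give a feel for what “document vectorizing” and classification mean, here is a toy sketch in plain Python. It is not scikit-learn and not the speaker’s code: it assumes a very simple approach (count character bigrams, then pick the language whose profile is closest by cosine similarity), and the two training sentences are made up stand-ins for the Wikipedia articles used in the talk.

```python
import math
from collections import Counter


def vectorize(text, n=2):
    """Turn text into a sparse count vector of character n-grams."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one summed n-gram profile per language from (language, text) pairs."""
    profiles = {}
    for language, text in samples:
        profiles.setdefault(language, Counter()).update(vectorize(text))
    return profiles


def classify(profiles, text):
    """Return the language whose profile is most similar to the text."""
    vec = vectorize(text)
    return max(profiles, key=lambda language: cosine(vec, profiles[language]))


# Toy training data, purely illustrative.
SAMPLES = [
    ("en", "the cat sat on the mat"),
    ("es", "el gato se sentó en la alfombra"),
]

if __name__ == "__main__":
    profiles = train(SAMPLES)
    print(classify(profiles, "the cat sat on the mat"))
```

With only two sentences of training data this is nowhere near a usable language detector; a real one needs far more text per language and, most likely, a proper classifier from a library.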
Here is a link to the meetup: http://www.meetup.com/DataPhilly/
Seriously, go check it out!