Back in 2013, I stumbled across the New York City open data project; I was like a nerd at her first comic con. Datasets ranging from just a few hundred records all the way up to gigabytes' worth of rows were readily available for the mining. As a New Yorker, the restaurant grade data immediately caught my eye, and as a nerd, the 100,000-plus record count made the project worthwhile.
About the Data
In New York (and many other cities in the United States), restaurants receive a grade based on their compliance with local health code regulations. NYC.gov outlines the program as follows:
“The Health Department conducts unannounced inspections of restaurants at least once a year. Inspectors check for compliance in food handling, food temperature, personal hygiene and vermin control. Each violation of a regulation gets a certain number of points. At the end of the inspection, the inspector totals the points, and this number is the restaurant’s inspection score—the lower the score, the better the Grade.”
Is there a cleanliness trend correlated to cuisine type?
In this constant drive to unearth big data use cases, I thought I was really onto something that could influence the travel and restaurant industries' marketing efforts in the New York City metropolitan area.
Datameer was the tool of choice because I was able to ingest, cleanse, parse, and model the data in less than 10 minutes. Moreover, the analytics would give me instant feedback. I was sure my assumptions about certain cuisines would come to light, and I already imagined how the sunburst chart would look.
How I did it
The analytics were simple once I got the data ingested. I did some basic aggregations on the cuisines, joined the cuisine type from a separate data source that translated the code, joined the violation detail data, counted the number of letter grades, and normalized the data with some relative averages to make sure the cuisines with a lot of locations didn’t appear to be the worst performers.
The analytics themselves took a little over an hour, and I completed them with two joins and two sheets. I'm running an embedded Hadoop on my laptop and the processing time was less than a minute. Even though this was a big data analytics project, instant gratification awaited me!
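For readers who want to see the shape of that normalization outside Datameer, here is a minimal sketch in plain Python. It is not the workbook I built: the sample rows, the cuisine codes, and the `grade_shares` helper are all hypothetical, made up for illustration. The idea is simply to turn raw grade counts into per-cuisine shares so that cuisines with many locations are not penalized for their size.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows of (cuisine_code, grade); not real NYC data.
inspections = [
    ("03", "A"), ("03", "A"), ("03", "B"), ("03", "A"),
    ("48", "A"), ("48", "C"),
]

# Hypothetical lookup table that translates cuisine codes to names.
cuisine_names = {"03": "American", "48": "Italian"}


def grade_shares(rows, names):
    """Count letter grades per cuisine, then convert counts to shares."""
    counts = defaultdict(Counter)
    for code, grade in rows:
        # "Join" the code to its readable name; fall back to the raw code.
        counts[names.get(code, code)][grade] += 1

    shares = {}
    for cuisine, grade_counts in counts.items():
        total = sum(grade_counts.values())
        # Dividing by the cuisine's own total makes cuisines comparable
        # regardless of how many locations each one has.
        shares[cuisine] = {g: n / total for g, n in grade_counts.items()}
    return shares


shares = grade_shares(inspections, cuisine_names)
for cuisine, dist in sorted(shares.items()):
    print(cuisine, dist)
```

Comparing shares rather than raw counts is what keeps a cuisine with thousands of locations from looking like the worst performer just because it has the most B and C grades in absolute terms.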
What I found
All of my biases, assumptions, and, more importantly, my 2 precious hours of work taught me that there is no meaningful correlation between cuisine type and restaurant health inspection grade.
I thought about trying to figure out whether zip code or other factors contributed, but I was a bit discouraged by my findings and the project was tabled.
1 Year Goes by…
With the release of the Smart Analytics features and Flipside in Datameer, I was met with this deep sense of wonder: would these features have saved me any time?
The answer was absolutely, and more.
Flipside was a huge contributor to my discovery. With a quick flip of the original dataset, I could see that there were irregularities in the grade results, such as P, Z, and nulls, that were skewing my original analytics. I expected 3 unique values (A, B, C) and saw immediately that there were 6, and my cleansing was effortless: I simply applied a filter and moved forward. In my previous attempt, I didn't discover the undesirable values until my aggregations were complete.
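The same check is easy to picture in code. The sketch below is not how Flipside works internally; it only illustrates the idea of profiling a column's distinct values before aggregating, using a made-up list of grades and a hypothetical `VALID_GRADES` set.

```python
from collections import Counter

# Hypothetical grade column, including the kinds of stray values
# mentioned above (P, Z, and nulls).
grades = ["A", "B", "A", "P", None, "C", "Z", "A", None]

# Profile the column: how many distinct values are there, and how often
# does each one occur?
profile = Counter(grades)

# The three letter grades the analysis expects.
VALID_GRADES = {"A", "B", "C"}

# Anything outside the expected set is flagged for review.
unexpected = sorted((g for g in profile if g not in VALID_GRADES), key=str)

# The "filter": keep only rows with an expected grade.
clean = [g for g in grades if g in VALID_GRADES]

print(len(profile), "distinct values; unexpected:", unexpected)
print(len(clean), "of", len(grades), "rows kept")
```

Running the profile first is the whole point: the stray values show up before any aggregation is built on top of them.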
Column dependencies is one of the Smart Analytics features; given a set of variables, it gives you a correlation score from 0 to 1, with 1 being a 100% correlation. This seemed like exactly what I had tried to build manually, but much prettier and with a fraction of the effort. I revisited the same data set, plugged in all the variables, and found out within 30 seconds that not only is there almost no correlation between cuisine type and grade, there is almost no correlation between any of the other data points and grade either. As you can see below, the score and current grade are the most highly correlated data points, and that's simply because the score determines the grade.
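To make a 0-to-1 dependency score between two categorical columns more concrete, here is a small sketch using Cramér's V, a standard association measure built on the chi-squared statistic. This is a stand-in chosen for illustration: I am not claiming it is the metric Datameer's column dependencies feature computes, and the sample columns are invented.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V for two equal-length categorical columns (0 to 1)."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    # Degrees-of-freedom term: the smaller number of categories, minus one.
    k = min(len(x_counts), len(y_counts)) - 1
    if n == 0 or k == 0:
        return 0.0  # a constant column carries no association

    pair_counts = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            # Count we would expect in this cell if the columns were independent.
            expected = x_n * y_n / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Invented example columns, not real inspection data.
cuisine = ["pizza", "pizza", "sushi", "sushi"]
grade_independent = ["A", "B", "A", "B"]  # grade unrelated to cuisine
grade_dependent = ["A", "A", "B", "B"]    # grade fully determined by cuisine

print(cramers_v(cuisine, grade_independent))
print(cramers_v(cuisine, grade_dependent))
```

A score near 0 for cuisine versus grade and a score near 1 for score versus grade would match the pattern described above, though the exact numbers depend on the measure used.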
What I learned
A friendly reminder about biases when working with data was the first lesson for me in this situation. I spent a lot of time heading down an assumed path and ran into a complete dead end.
The most rewarding lesson was how powerful data discovery is, if you're open to it. Not only did the quick column dependencies graphic show me there was no high frequency path to follow, it also led me down a more prescriptive path: using a decision tree to measure which factors contribute to certain scores and restaurant performance. And instead of a commitment of multiple hours, it was a commitment of just a few minutes.
So while much of the big data community is at Strata next week, keep this in mind as you explore the city and all the wonderful cuisine options it offers. Hopefully you'll use Datameer to find some new insights of your own! We also just released our latest version, Datameer 5.0 with Smart Execution, which takes an even bigger step by intelligently selecting the most effective compute framework to execute each job, whether that be optimized MapReduce, single node, or in-memory. To find out more, stop by our booth 409 or fill out this meeting request form; we'd love to talk to you personally about how Datameer can help you discover insights.