
Datameer Blog

Rain Rain Go Away

By Jason Arrigo on November 5, 2014

If there’s anything New Yorkers love, it’s talking about New York. The rainy season has just started here and we’re already missing our sidewalk dining, Central Park walks, and non-damp subway experiences.

I set out on a mission to see whether the rain has any effect on tweets with the hashtag #NYC. It turns out it does! But it might not be what you think.

It seems that right after the intensity of the rain increases, there’s a surge of negativity from the NYC-tagging Twitter community. But as fall sets in and we realize that the season has its own beauty, the positive-sentiment tweets make a comeback.


About the Data

To kick things off, I used our Twitter API to pull hashtag information, which I then merged with forecast data. Once that was ported into Datameer, I was able to use our sentiment analysis to determine whether there was a correlation between mood and weather.

Let’s break this down a bit: 

Twitter data is available via the Twitter API, so that’s an easy source. The queries can be configured directly in Datameer, and its schema-on-read nature lets users capture just the columns of interest; this is a pretty powerful option considering the results are typically something like 40 columns wide.

 


For the rain data, I used the Forecast API. This is a popular API that powers iOS apps such as Dark Sky and Weather Line; it offers hourly, micro-GPS precipitation data. It provides a simple JSON output, and Datameer lets you quickly extract just the key/value pairs that are of interest to you with the JSON_VALUE function.
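For readers following along outside Datameer, the idea of keeping only a few key/value pairs from an hourly JSON payload can be sketched in Python. The payload shape below (an `hourly.data` list with `time` and `precipIntensity` keys) and its values are assumptions made for illustration, not output captured from the Forecast API, and `json_value` is a hypothetical helper rather than Datameer’s JSON_VALUE.

```python
import json

# Illustrative payload only: field names and values are assumed for this sketch.
SAMPLE = '''
{"hourly": {"data": [
  {"time": 1415188800, "precipIntensity": 0.0, "summary": "Clear"},
  {"time": 1415192400, "precipIntensity": 0.12, "summary": "Rain"}
]}}
'''

def json_value(obj, path):
    """Hypothetical helper: walk a dotted path such as 'hourly.data' through nested dicts."""
    for key in path.split("."):
        obj = obj[key]
    return obj

def extract_rain(raw):
    """Keep only the two keys of interest from each hourly record."""
    doc = json.loads(raw)
    return [
        {"time": rec["time"], "precipIntensity": rec["precipIntensity"]}
        for rec in json_value(doc, "hourly.data")
    ]

rain_rows = extract_rain(SAMPLE)
```

Everything else in each record (here, `summary`) is simply never copied, which is the same effect as selecting only the columns you care about.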


Working with Twitter Data

“Raw” Twitter data can be intimidating. It’s full of hashtags, symbols, codes, and offensive language; the last thing you want is the F-bomb to be the biggest word in your tag cloud. Datameer understands the relevance of cleansing and how difficult text mining can be, which is why you have functions available to you that make it fast and simple to weed out “stop words” and foul language.
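As a rough picture of what that kind of cleansing involves, here is a minimal Python sketch. The word lists are tiny stand-ins invented for this example, and `clean_tokens` is a hypothetical helper, not a Datameer function.

```python
import re

# Tiny illustrative lists; a real project would use much larger ones.
STOP_WORDS = {"the", "a", "is", "in", "it", "and", "to"}
BLOCKED_WORDS = {"darn"}  # stand-in for a profanity list

def clean_tokens(text):
    """Lowercase, drop URLs, strip symbols, and remove stop and blocked words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z0-9']+", text)
    return [t for t in tokens if t not in STOP_WORDS and t not in BLOCKED_WORDS]

cleaned = clean_tokens("It is raining in #NYC and the darn subway is damp http://t.co/abc")
```

Only the words that carry meaning survive, so a tag cloud built from `cleaned` is not dominated by filler or foul language.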

Here’s what my initial raw Twitter import looked like:

[Screenshot: raw Twitter import]

Here’s what it looked like with a little cleanup:

[Screenshot: Twitter data after cleanup]

What did I do? 

— Normalized the date to a whole hour so that I could match it with my rain data (we’ll talk more about that in a minute)

— Implemented the CONTENTS_BY_TAG_NAME function to pull the ‘source’ out of the HTML element <a>. I thought it might be interesting later to see where people are tweeting from when it rains (i.e. more at home vs. mobile)

— Filtered to users whose location was listed as NYC or contained New York (in this case, I’m not interested if someone outside of NYC is tagging NYC when it rains)

— Applied JSON_ELEMENTS to the JSON object that held all of the hashtags in the entities_hashtags blob from Twitter

— Used JSON_VALUE to pull out the individual hashtags from the array represented by the ‘text’ key, the same function I used to glean the Forecast data
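The source, location, and hashtag steps above can be approximated in plain Python. This is a sketch under assumptions: the column names (`source`, `user_location`, `entities_hashtags`) and the sample rows are made up for illustration, and the helper functions are hypothetical stand-ins for the Datameer functions named in the list.

```python
import json
import re

def source_name(source_html):
    """Return the text inside the <a> element of a tweet's 'source' field."""
    match = re.search(r"<a[^>]*>(.*?)</a>", source_html)
    return match.group(1) if match else source_html

def in_nyc(location):
    """True when the free-text profile location is NYC or mentions New York."""
    loc = (location or "").strip().lower()
    return loc == "nyc" or "new york" in loc

def hashtag_texts(entities_hashtags):
    """Pull the 'text' value out of each element of the hashtags JSON array."""
    return [tag["text"] for tag in json.loads(entities_hashtags)]

# Made-up rows standing in for the raw import.
tweets = [
    {
        "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        "user_location": "New York, NY",
        "entities_hashtags": '[{"text": "NYC", "indices": [0, 4]}, {"text": "rain", "indices": [5, 10]}]',
    },
    {
        "source": '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
        "user_location": "London",
        "entities_hashtags": '[{"text": "NYC", "indices": [0, 4]}]',
    },
]

cleaned = [
    {"source": source_name(t["source"]), "hashtags": hashtag_texts(t["entities_hashtags"])}
    for t in tweets
    if in_nyc(t["user_location"])
]
```

The London row is dropped by the location filter, and the surviving row keeps a readable source name and a flat list of hashtags.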

It might seem like a lot, but this cleansing and prepping took about 15 minutes. It’s a process that applies to all of my Twitter analyses, and I can even re-use this cleansing sheet for other projects by swapping the data with a different Twitter query. #recycle #efficiency #worksmarter

Joining Disparate Data Sources

One difficult challenge when looking for trends in seemingly unrelated data sources is finding the commonality that makes them join-able.

The rain data, as you can see in the sample screenshot, gives hourly data on the hour, but tweets don’t happen on the hour at all; they carry unique timestamps. All I had to do to give these data sets a join key was normalize the timestamp. Using the FORMATDATE function, I applied the format pattern “MMM dd yyyy HH”, which truncates every timestamp to the hour and makes these datasets capable of a fruitful inner join.
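To make the join key concrete, here is a minimal Python sketch of the same idea. It assumes tweets carry a `created_at` string in Twitter’s classic format and that rain rows carry a `time` in epoch seconds; both field names, the sample values, and the helper functions are assumptions for illustration. The strftime pattern `%b %d %Y %H` plays the role of “MMM dd yyyy HH”.

```python
from datetime import datetime, timezone

HOUR_KEY = "%b %d %Y %H"  # strftime counterpart of the "MMM dd yyyy HH" pattern

def tweet_hour(created_at):
    """Truncate a tweet timestamp to its hour, expressed in UTC."""
    dt = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return dt.astimezone(timezone.utc).strftime(HOUR_KEY)

def rain_hour(epoch_seconds):
    """Format an hourly rain reading with the same key pattern."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(HOUR_KEY)

def inner_join(tweets, rain_rows):
    """Keep only tweets whose hour has a matching rain reading."""
    rain_by_hour = {rain_hour(r["time"]): r for r in rain_rows}
    joined = []
    for t in tweets:
        key = tweet_hour(t["created_at"])
        if key in rain_by_hour:
            joined.append({
                "hour": key,
                "hashtag": t["hashtag"],
                "precipIntensity": rain_by_hour[key]["precipIntensity"],
            })
    return joined

# Made-up sample rows.
tweets = [
    {"created_at": "Wed Nov 05 13:23:51 +0000 2014", "hashtag": "soaked"},
    {"created_at": "Wed Nov 05 09:10:02 +0000 2014", "hashtag": "sunny"},
]
rain_rows = [{"time": 1415192400, "precipIntensity": 0.12}]

joined = inner_join(tweets, rain_rows)
```

Both sides are converted to UTC before formatting, so a tweet and a rain reading from the same hour always produce the same key. The 09:10 tweet has no matching rain row, so the inner join drops it.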

Take a look at the now enriched data:

[Screenshot: joined Twitter and rain data]

Now for the fun!

How do you measure the polarity of a word? 

Sentiment analysis used to be hard.

Just point the ANALYZE_POLARITY function at the hashtags column and then go to lunch early because you’ve earned it. I filtered out the ‘undecided’ or ‘neutral’ results to keep it simple. But we won’t tell anyone that you didn’t have to upload a sentiment dictionary, reverse outer join it to your two datasets, filter out blanks and errors, and then produce polarity results.
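For a sense of the manual route described above, here is a minimal dictionary-lookup sketch in Python. It is not how ANALYZE_POLARITY works internally (the post does not say); the two word sets are invented for this example, and `polarity` is a hypothetical helper.

```python
# Tiny invented lexicon; a real sentiment dictionary would be far larger.
POSITIVE = {"love", "beautiful", "cozy", "happy"}
NEGATIVE = {"hate", "soaked", "gross", "delays"}

def polarity(hashtag):
    """Classify one hashtag by looking it up in the two word sets."""
    word = hashtag.lstrip("#").lower()
    if word in POSITIVE:
        return "positive"
    if word in NEGATIVE:
        return "negative"
    return "neutral"

tags = ["#Love", "#soaked", "#NYC", "delays"]
scored = [(tag, polarity(tag)) for tag in tags]

# Mirror the post: drop the neutral results to keep things simple.
decided = [(tag, label) for tag, label in scored if label != "neutral"]
```

A plain lookup like this misses sarcasm, negation, and misspellings, which is part of why a built-in polarity function is a relief.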


So what’s left? 

All that remains is some basic grouping by date/time and counting of the hashtags that were positive vs. negative in each hour; those counts are what feed the infographic representation of this idea.
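That final grouping and counting step can be sketched in a few lines of Python. The rows below are made up, and the hour strings simply reuse the “MMM dd yyyy HH” style of key from the join.

```python
from collections import Counter

# Made-up (hour, polarity) rows standing in for the scored hashtags.
rows = [
    ("Nov 05 2014 12", "positive"),
    ("Nov 05 2014 12", "positive"),
    ("Nov 05 2014 13", "negative"),
    ("Nov 05 2014 13", "negative"),
    ("Nov 05 2014 13", "positive"),
]

# Count each (hour, polarity) pair, then lay the counts out one row per hour.
counts = Counter(rows)
hours = sorted({hour for hour, _ in rows})
summary = [
    {
        "hour": hour,
        "positive": counts[(hour, "positive")],
        "negative": counts[(hour, "negative")],
    }
    for hour in hours
]
```

Each entry in `summary` is one point on the chart: an hour with its positive and negative hashtag counts, ready to plot against rain intensity.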

 




Jason Arrigo
