As a reminder, my goal with this project is to use Tweets containing education-related keywords to monitor attitudes towards education in different geographical locations (for this initial work, different locations around the San Francisco Bay Area). A lot of what I’ve done has been based on code/assignments from Bill Howe’s Coursera course on Data Science (which appears to no longer be on Coursera, but still has an excellent GitHub repository here). Currently, my workflow consists of three steps, each corresponding to a different piece of Python code (which can be found in my GitHub repository, for those who are interested):
- Twitterstream.py: Collects messages from Twitter, within a specified geographical bounding box (in my case, the S.F. Bay Area), and pipes them into a text file in JSON format.
- ExtractTweetData.py: Reads the raw data from the stream file and converts it into a tab-delimited file. This file does not contain all possible metadata, just the fields of interest for my particular project (e.g., geographical location data). A simple sentiment score based on the AFINN-111 word list is also calculated for each Tweet.
- ParsedTweetReader.py: Takes the tab-delimited tweet file from the previous step, along with a set of query terms, and calculates the frequency and average sentiment score for each term in each geographical location.
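To give a flavor of step 2, here is a minimal sketch of AFINN-based scoring (function names are illustrative, not the actual contents of ExtractTweetData.py; AFINN-111 is distributed as a tab-delimited file of term/score pairs):

```python
def load_afinn(path="AFINN-111.txt"):
    """Load the AFINN word list into a dict of term -> integer score."""
    scores = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, score = line.rstrip("\n").split("\t")
            scores[term] = int(score)
    return scores

def tweet_sentiment(text, afinn):
    """Sum the AFINN scores of the words in a tweet; unknown words add 0."""
    return sum(afinn.get(word, 0) for word in text.lower().split())
```

Note that a tweet containing no AFINN words simply sums to 0, which is why so many messages end up with a neutral score (a limitation I return to below).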
First, here is a simple geographical frequency distribution of Tweets within my sample (just under a month of discontinuously collected data; only locales with > 10,000 Tweets are shown):
There seem to be some baseline differences in sentiment scores associated with the different cities, although all the average sentiment scores are quite small in magnitude (error bars show standard error of the mean):
When I pull out only messages containing certain keywords (in this case, related to education) and calculate their average sentiment scores in different locales, I again observe some differences in scores, but the overall magnitudes are small and the errors large (bars indicate standard error of the mean). For example, for the keywords “school,” “class,” and “college,” three of the most common education-related terms in my dataset, the following patterns are observed:
As a check that the sentiment scores are detecting something meaningful, we can also examine the scores for “homework,” which would generally be expected to carry a more negative sentiment (especially around the start of school):
“Homework” does indeed tend to have a negative sentiment score, though for some of the cities toward the right-hand side of the graph (those with smaller overall samples), the results are probably not very reliable because the number of term-containing tweets is small (as few as 11 for Redwood City).
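The per-term, per-locale statistics above can be sketched roughly as follows (a simplified stand-in for ParsedTweetReader.py; the tuple layout and names are illustrative assumptions, not the actual file format):

```python
from collections import defaultdict

def term_stats(rows, terms):
    """For each (locale, term) pair, count matching tweets and average
    their sentiment. `rows` is an iterable of (locale, text, score)
    tuples, standing in for the tab-delimited file from step 2."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for locale, text, score in rows:
        words = set(text.lower().split())
        for term in terms:
            if term in words:
                counts[(locale, term)] += 1
                totals[(locale, term)] += score
    # Return count and mean sentiment per (locale, term) pair.
    return {k: (counts[k], totals[k] / counts[k]) for k in counts}
```

With counts this small per locale, the mean is noisy, which is exactly the pattern visible in the graphs.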
- Data Collection: My data collection is discontinuous (it consists of semi-random chunks) because I collect data on an old computer whose disk fills up, which gets accidentally shut off by the cleaning lady, which gets disconnected by the Twitter streaming endpoint, and so on. A more robust data collection method would allow me to do analyses over time, which might reveal interesting trends at the weekly and monthly scales.
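One way to make the collection more robust would be a wrapper that reconnects with exponential backoff after disconnects, so a single dropped connection doesn't end a run. This is a generic sketch, not tied to any particular streaming library; `open_stream` is a hypothetical callable that yields one raw JSON line per tweet:

```python
import time

def next_backoff(seconds, cap=320):
    """Double the wait after each failure, up to a cap."""
    return min(seconds * 2, cap)

def collect(open_stream, out_path, max_failures=10, initial_backoff=5):
    """Append raw JSON lines from a stream to out_path, reconnecting
    with exponential backoff on network errors."""
    backoff, failures = initial_backoff, 0
    while failures < max_failures:
        try:
            with open(out_path, "a") as out:
                for line in open_stream():
                    out.write(line + "\n")
                    backoff = initial_backoff  # healthy stream: reset
            return  # stream ended cleanly
        except (IOError, ConnectionError):
            failures += 1
            time.sleep(backoff)
            backoff = next_backoff(backoff)
```

Rotating `out_path` periodically (e.g., one file per day) would also keep any single crash or full disk from losing the whole dataset.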
- Sentiment Score Calculation: The method I’m using to calculate sentiment scores is a very rudimentary one. It only works at all for messages that contain one of a fixed set of words (those in the AFINN-111 word list); all other messages receive a neutral score of 0. The word list covers only standard English, and emoticons and punctuation are not taken into consideration. Calculating a better sentiment score, perhaps using machine learning approaches and taking full advantage of emoticons, is an area I’d really like to follow up on. (E.g., I’d like to try training random forests on positive/negative messages, then using the trained classifier on new messages and assigning the signed % certainty of classification as the message sentiment score.)
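The classifier idea above could be sketched with scikit-learn roughly as follows. The training examples here are toy placeholders, and the model choice and features (simple word counts) are my own illustrative assumptions, not results from this project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy labeled data: 1 = positive, 0 = negative (placeholders only).
train_texts = ["what a great day", "love this class", "awful homework again", "so bad today"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(train_texts)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, train_labels)

def classifier_score(text):
    """Signed certainty in [-1, 1]: maps P(positive) of 1.0 to +1,
    0.5 to 0, and 0.0 to -1."""
    p_pos = clf.predict_proba(vec.transform([text]))[0][1]
    return 2 * p_pos - 1
```

Unlike the AFINN sum, this gives every message a graded score rather than defaulting most of them to exactly 0, and with the right tokenizer it could learn from emoticons as well.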
- Sampling Design: One problem with my current sample is that, for any given geographical location, it’s just not very large (the number of messages containing any given query term is small). I had initially wanted to sample the Bay Area because I had some idea of how locales might vary due to socioeconomic factors, but if I were to do this experiment again, I think I might try a larger scale (e.g., splitting the U.S. up by states, or choosing only major metropolitan areas) so as to have significantly bigger datasets.
Okay, that’s already more time than I was supposed to spend on this today, so I’d better get back to the in situs. But thank you for reading, and I hope you’ll check back soon for the next tutorial, which will be on parsing JSON-format files.