Sunday, September 28, 2014

Streaming Twitter: Getting a basic Twitter stream

Everything I know about streaming Twitter data, I learned from Coursera. Well, maybe not everything. But it’s true that when I looked at the first assignment for Bill Howe’s Introduction to Data Science Course, I realized that if I could figure out how to do the assignment, I would have all the basic tools I needed to acquire, parse, and analyze Twitter data. Mind you, I haven’t actually watched the course videos yet (maybe after I graduate), but the assignment instructions were very useful, particularly for getting and using the hardwired credentials necessary to access the Twitter API. With a little trial and error, I was able to build on that foundation using the Twitter documentation, and was ultimately able to get what I needed for my project – a reasonably high-volume stream of data originating from (or at least, somehow associated with) a particular geographical region.

In this two-part series of posts, I’ll discuss several aspects of how to get data from Twitter. In Part 1, I’ll simply describe how to get some Twitter data (a random-ish sample of all global Twitter data) streaming on your computer, mostly by directing the reader to Bill Howe's excellent guide on Coursera. In Part 2, I’ll talk about ways to filter or gate your request so that you get only a subsample of Tweets with particular properties – e.g., from a certain location, in a particular language, or containing a given keyword.

A few notes on system requirements and background: 
  • My programs are written using Python 2.7.3 in the IDLE development environment. If you want to use my code or write your own similar code, you will also need to install and use Python.
  • This blog will not teach you how to program in Python, but if you have some coding experience (or even if you don’t), Python is an easy language to pick up. Some good resources for self-teaching Python include the Codecademy course and Google’s Python class.
  • All my programs are tested on a Windows XP or Windows 7 computer. I think most of the code should be portable to a different OS, but setup details (e.g., how to install packages) may differ.

Step 1: Getting a Basic Twitter Stream

In the scheme of programming feats, streaming data from Twitter is not very complicated. In fact, if you go to the Github repository for Introduction to Data Science, you can find a nice, simple piece of working code that does just that – all you have to do is press the button. Right?

Well, yes and no. For one thing, that code has some credential and package dependencies that have to be satisfied, and which (while not intellectually interesting) can be a barrier if you’ve never queried an API before…which I most definitely hadn’t. Also, while the code may work for you once you’ve satisfied these dependencies, it may not be especially clear how it works – which is important if you want to modify it, e.g., to write data to a file in different formats, or to make queries of different types. I’ll discuss how to modify queries in the second post of this series.
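To give a taste of the kind of modification I mean, here’s a minimal sketch of writing the stream to a file instead of printing it. The function name is my own invention, and `stream` stands in for any line-by-line iterable of raw tweet JSON (such as the response object the template loops over):

```python
# Minimal sketch: save raw JSON lines from a stream to a file.
# `stream` is any iterable of raw tweet lines; `max_lines` caps the
# collection so the script eventually stops on its own.
def save_stream(stream, path, max_lines=1000):
    with open(path, "w") as f:
        for i, line in enumerate(stream):
            f.write(line.strip() + "\n")
            if i + 1 >= max_lines:
                break

# Stand-in usage with a fake two-line "stream":
save_stream(['{"text": "hi"}', '{"text": "bye"}'], "tweets.json")
```

Swapping the `"w"` mode for `"a"` would append across runs instead of overwriting the file each time.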

The code that I'll point you to today isn’t mine at all – rather, it’s the template code in the Introduction to Data Science Assignment 1 folder on Github. If you are not familiar with Github, it’s an online code repository with useful features for version control. However, you don’t need to understand these features to get the Assignment 1 materials. You can simply access the link above and click the “Download ZIP” button on the right-hand side of the page. This should give you a ZIP folder containing several Python files, including twitterstream.py (the streaming template), as well as an HTML file (“assignment1.html”) and some text files.

Open up “assignment1.html” and scroll down to the header “Problem 1: Get Twitter Data.” As described in the instructions, you need to complete several setup steps before you can begin streaming data: 
  • First, you will need to use the Twitter website to get authentication credentials that you can use to connect securely to the Twitter API. These are a set of unique codes that identify you, and which are necessary in order to query the Twitter API. (To use these codes, paste them into the twitterstream.py file in the appropriate spot, as shown in the instructions.)
  • Second, you will need to install the oauth2 library on your computer. oauth2 enables you to submit your credentials and thus make a secure request to the Twitter API. To install oauth2, you may first wish to install setuptools (which provides the easy_install tool), then run “easy_install oauth2” from the Windows command line. If you are still unable to import oauth2 in Python, you may need to figure out where easy_install put the oauth2 package, and add that location to the PYTHONPATH environment variable.
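To make the credential placement concrete, the codes end up as four plain string variables near the top of twitterstream.py. The variable names below are illustrative — match whatever your copy of the template uses — and the values are placeholders for your own keys:

```python
# Illustrative placeholders -- variable names may differ in your copy of
# twitterstream.py. Paste in the real values from your Twitter developer
# account, and never share or publish them.
api_key = "YOUR_API_KEY"
api_secret = "YOUR_API_SECRET"
access_token_key = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```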
Now, you should be ready to run twitterstream.py and get your first stream of Twitter data. You can run the script either from the Windows command line (navigate to the directory that contains the file, then type “twitterstream.py”) or from within a development environment like IDLE. The output should be a stream of tweets, each accompanied by a lot of metadata in a braces-and-quotes dictionary format (this is JSON, which I’ll discuss a couple of posts down the road).
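If you’re curious what those bracketed dictionaries contain, each line of output is a single JSON object that Python can parse with the standard json module. Here’s a sketch using a made-up, heavily trimmed tweet (real tweets carry dozens more metadata fields):

```python
import json

# A made-up, heavily trimmed line of stream output; real tweets
# include many more fields (timestamps, geo data, entities, etc.).
raw_line = '{"text": "Hello world", "lang": "en", "user": {"screen_name": "example"}}'

tweet = json.loads(raw_line)         # one line = one dictionary
print(tweet["text"])                 # the tweet body: Hello world
print(tweet["user"]["screen_name"])  # nested user metadata: example
```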

It can be kind of a rush the first time you see the data flowing – I remember how psyched I was that I had gotten the code to work, and how powerful I suddenly felt, as if no data-mining challenge were beyond my reach. Overconfidence? Maybe a tad. But getting a datastream going is the first step towards being able to collect the specific Twitter sample you need...which will be the topic of the second post in this series, “Putting filters on your query.”


Monday, September 22, 2014

So you want to stream Twitter...


Around the end of June this year, I was bitten by a bug. To be precise, a programming bug.

To give you some context, I was at the time a 5th (almost 6th) year graduate student in molecular biology, less than six months away from my thesis defense date. My work was on the development of stomata, plant leaf pores that regulate gas exchange, and had next to nothing to do with programming.  The amount of free time I had, while not a negative number, was not a large one either. In short, it wasn’t a particularly intuitive or opportune time to kick off some recreational programming.

So why, you may ask, did I start programming? The reason why I started was not a particularly noble one. Five years into my Ph.D., I was feeling less and less certain that I was cut out for an academic career, and becoming more and more aware that the options for biology Ph.D.s outside of academia were limited (particularly if you studied a small, weedy plant rather than, say, immunology). Around that time, I saw an email circular about a program called Insight, which was designed to help Ph.D.s from various scientific disciplines transition to data analysis jobs in the tech sector.  My dad is a programmer (yes, he’s that Michael Abrash), and so I knew that while the tech sector had its problems, it also had some excellent features like abundant jobs, decent pay, and the chance to do something insanely cool that had a major effect on the world.

I thought about the program for about two days, and the longer I thought, the more convinced I became that Insight was a once-in-a-lifetime chance, a tiny moving window; and that if I wanted to escape a lifetime of teaching high school, editing technical publications, or working for Monsanto, I had to bodily fling myself through that window, with every ounce of strength that I had. And what that meant was developing data science credentials, and quickly. What that meant was, I needed to do some programming.

So, that was why I started. That was why I wrote a very embarrassed email to a Nobel laureate (the extremely nice Andy Fire), asking if I could join his Python for Biologists course a week and a half late. That was why I gingerly tried out the first homework assignment, in which I induced Python to spit out the Fibonacci series and some trimmed small RNA sequences.

It was not, however, why I continued. Over the course of my Ph.D., I’ve thought I wanted to be a lot of things: journal editor, technical writer, high-school teacher, undergraduate-institution professor. Even a novelist or, on the lowest of days, a dental hygienist. The unifying characteristic of all these career aspirations is that they haven’t stuck. I’ve been excited about them for some period of time, then lost interest as I’ve realized that the reality of the job was considerably less appealing than my mental abstraction. Honestly, I expected programming to go the same way.

It didn’t. What I discovered, in the course of just a few Python for Biologists assignments, was that programming was awesome. It was more than awesome; it was a magical zone into which I could drop completely, something in which I could immerse myself with complete concentration, my often distractible mind focusing itself into a narrow beam of energy. I forgot everything else: lab work, worries, food, time of day. Whenever I started to program, I went into the zone and didn’t want to come out. That feeling was why I continued programming; that feeling, and the thought that if I could simply prove myself, if I could just be good enough, I would get to do this every single day.

All of which is to explain why, in the middle of August, I started on my Twitter project. My original rationale for picking Twitter was simply that I wanted some big data to play with, and Twitter’s happened to be publicly available. Being an intractable Child of the Sacred Heart, I soon realized that I could use Twitter to analyze something socially redeeming. So, I set out to measure attitudes towards education in different geographical regions of the Bay Area. The ultimate goal was to identify factors (co-occurring words/ideas, user characteristics, and/or user connectivity characteristics) that predisposed high school students from underprivileged areas towards positive or negative attitudes about college.

I planned to do all of this during night and weekend hours, in about a month, while finishing my Ph.D. thesis. Needless to say, I have yet to fulfill the grandeur of my original vision. However, I have managed to meet some incremental goals: stream messages from Twitter, parse them, split them by locale, assign them affect scores, and measure average affect for education-related keywords. On top of that, I’ve learned some other useful things: that my dataset is way too small for what I planned to do; that my program architecture becomes painfully clunky as locales and query terms are scaled up; and that several weeks of Twitter data (even slow, geo-filtered Twitter data) will explode the brain of my ten-year-old Dell laptop.

So, I am not here to showcase a glossy, perfect project. Instead, I am here because I’ve learned some things that I think other people might find useful, like how to stream Twitter data, filter for geographical location (hint: get ready to draw lots of overlapping coordinate boxes), parse JSON files, work with UTF-8 encoding (you don’t want to lose those emoticons, do you?), and calculate a couple different kinds of affect scores.

My goal is to share what I’ve learned, so that someone else could easily implement a similar analysis for her/his queries, locations, and message properties of interest. And maybe even avoid some of the stupid (but highly educational) mistakes I've made.

Thanks for reading, and I hope you’ll tune in next time for my first technical article: “How to stream Twitter data”!