Monday, September 22, 2014

So you want to stream Twitter...


Around the end of June this year, I was bitten by a bug. To be precise, a programming bug.

To give you some context, I was at the time a 5th (almost 6th) year graduate student in molecular biology, less than six months away from my thesis defense date. My work was on the development of stomata, plant leaf pores that regulate gas exchange, and had next to nothing to do with programming.  The amount of free time I had, while not a negative number, was not a large one either. In short, it wasn’t a particularly intuitive or opportune time to kick off some recreational programming.

So why, you may ask, did I start programming? The reason why I started was not a particularly noble one. Five years into my Ph.D., I was feeling less and less certain that I was cut out for an academic career, and becoming more and more aware that the options for biology Ph.D.s outside of academia were limited (particularly if you studied a small, weedy plant rather than, say, immunology). Around that time, I saw an email circular about a program called Insight, which was designed to help Ph.D.s from various scientific disciplines transition to data analysis jobs in the tech sector.  My dad is a programmer (yes, he’s that Michael Abrash), and so I knew that while the tech sector had its problems, it also had some excellent features like abundant jobs, decent pay, and the chance to do something insanely cool that had a major effect on the world.

I thought about the program for about two days, and the longer I thought, the more convinced I became that Insight was a once-in-a-lifetime chance, a tiny moving window; and that if I wanted to escape a lifetime of teaching high school, editing technical publications, or working for Monsanto, I had to bodily fling myself through that window, with every ounce of strength that I had. And what that meant was developing data science credentials, and quickly. What that meant was, I needed to do some programming.

So, that was why I started. That was why I wrote a very embarrassed email to a Nobel laureate (the extremely nice Andy Fire), asking if I could join his Python for Biologists course a week and a half late. That was why I gingerly tried out the first homework assignment, in which I induced Python to spit out the Fibonacci series and some trimmed small RNA sequences.

It was not, however, why I continued. Over the course of my Ph.D., I’ve thought I wanted to be a lot of things: journal editor, technical writer, high-school teacher, undergraduate-institution professor. Even a novelist or, on the lowest of days, a dental hygienist. The unifying characteristic of all these career aspirations is that they haven’t stuck. I’ve been excited about them for some period of time, then lost interest as I’ve realized that the reality of the job was considerably less appealing than my mental abstraction. Honestly, I expected programming to go the same way.

It didn’t. What I discovered, in the course of just a few Python for Biologists assignments, was that programming was awesome. It was more than awesome; it was a magical zone into which I could drop completely, something in which I could immerse myself with complete concentration, my often distractible mind focusing itself into a narrow beam of energy. I forgot everything else: lab work, worries, food, time of day. Whenever I started to program, I went into the zone and didn’t want to come out. That feeling was why I continued programming; that feeling, and the thought that if I could simply prove myself, if I could just be good enough, I would get to do this every single day.

All of which is to explain why, in the middle of August, I started on my Twitter project. My original rationale for picking Twitter was simply that I wanted some big data to play with, and Twitter’s happened to be publicly available. Being an intractable Child of the Sacred Heart, I soon realized that I could use Twitter to analyze something socially redeeming. So, I set out to measure attitudes towards education in different geographical regions of the Bay Area. The ultimate goal was to identify factors (co-occurring words/ideas, user characteristics, and/or user connectivity characteristics) that predisposed high school students from underprivileged areas towards positive or negative attitudes about college.

I planned to do all of this during night and weekend hours, in about a month, while finishing my Ph.D. thesis. Needless to say, I have yet to fulfill the grandeur of my original vision. However, I have managed to meet some incremental goals: stream messages from Twitter, parse them, split them by locale, assign them affect scores, and measure average affect for education-related keywords. On top of that, I’ve learned some other useful things: that my dataset is way too small for what I planned to do; that my program architecture becomes painfully clunky as locales and query terms are scaled up; and that several weeks of Twitter data (even slow, geo-filtered Twitter data) will explode the brain of my ten-year-old Dell laptop.

So, I am not here to showcase a glossy, perfect project. Instead, I am here because I’ve learned some things that I think other people might find useful, like how to stream Twitter data, filter for geographical location (hint: get ready to draw lots of overlapping coordinate boxes), parse JSON files, work with UTF-8 encoding (you don’t want to lose those emoticons, do you?), and calculate a couple of different kinds of affect scores.
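As a rough preview of those steps, here is a minimal sketch in Python 3. It is not the project's actual code. The bounding-box coordinates, the tiny affect lexicon, and the function names (`locale_of`, `affect_score`, `parse_line`) are all hypothetical, and the sketch assumes each streamed message is one JSON object with a `text` field and a GeoJSON-style `coordinates` field that lists longitude before latitude.

```python
import json

# Hypothetical bounding boxes as (west_lon, south_lat, east_lon, north_lat).
# The numbers are rough illustrations, not the boxes used in this project.
LOCALES = {
    "san_francisco": (-122.52, 37.70, -122.35, 37.83),
    "oakland": (-122.35, 37.72, -122.11, 37.89),
}

# Hypothetical toy affect lexicon: word -> score.
AFFECT = {"love": 1.0, "great": 0.8, "hate": -1.0, "boring": -0.6}


def locale_of(lon, lat):
    """Return the first locale whose box contains the point, else None."""
    for name, (west, south, east, north) in LOCALES.items():
        if west <= lon <= east and south <= lat <= north:
            return name
    return None


def affect_score(text):
    """Average the lexicon scores of the known words in a message."""
    scores = [AFFECT[w] for w in text.lower().split() if w in AFFECT]
    return sum(scores) / len(scores) if scores else None


def parse_line(line):
    """Parse one JSON-encoded message; return (locale, text) or None."""
    msg = json.loads(line)
    # Assumed layout: {"coordinates": {"coordinates": [lon, lat]}, "text": ...}
    coords = (msg.get("coordinates") or {}).get("coordinates")
    if not coords:
        return None  # message carries no point location
    lon, lat = coords
    return locale_of(lon, lat), msg.get("text", "")
```

When reading saved messages from disk, opening the file with `open(path, encoding="utf-8")` should keep emoticons and other non-ASCII characters intact. Real boxes would need to be drawn more carefully, and overlapping boxes would need a tie-breaking rule that this sketch does not have.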

My goal is to share what I’ve learned, so that someone else could easily implement a similar analysis for her/his queries, locations, and message properties of interest. And maybe even avoid some of the stupid (but highly educational) mistakes I've made.

Thanks for reading, and I hope you’ll tune in next time for my first technical article: “How to stream Twitter data”!
