Around the end of June this year, I was bitten by a bug. To be precise, a programming bug.
To give you some context, I was at the time a 5th
(almost 6th) year graduate student in molecular biology, less than six
months away from my thesis defense date. My work was on the development of
stomata, plant leaf pores that regulate gas exchange, and had next to nothing
to do with programming. The amount of
free time I had, while not a negative number, was not a large one either. In
short, it wasn’t a particularly intuitive or opportune time to kick off some
recreational programming.
So why, you may ask, did I start programming? The reason
was not a particularly noble one. Five years into my Ph.D., I was
feeling less and less certain that I was cut out for an academic career, and
becoming more and more aware that the options for biology Ph.D.s outside of
academia were limited (particularly if you studied a small, weedy plant rather
than, say, immunology). Around that time, I saw an email circular about a
program called Insight, which was designed to help Ph.D.s from various scientific disciplines
transition to data analysis jobs in the tech sector. My dad is a programmer (yes, he’s that Michael Abrash), and so I knew that while the tech sector had its problems,
it also had some excellent features like abundant jobs, decent pay, and the
chance to do something insanely cool that had a major effect on the world.
I thought about the program for about two
days, and the longer I thought, the more convinced I became that Insight was a
once-in-a-lifetime chance, a tiny moving window; and that if I wanted to escape
a lifetime of teaching high school, editing technical publications, or working
for Monsanto, I had to bodily fling myself through that window, with every
ounce of strength that I had. And what that meant was developing data science
credentials, and quickly. What that meant was, I needed to do some programming.
So,
that was why I started. That was why I wrote a very embarrassed email to a
Nobel laureate (the extremely nice Andy Fire), asking if I could join his
Python for Biologists course a week and a half late. That was why I gingerly
tried out the first homework assignment, in which I induced Python to spit out
the Fibonacci series and some trimmed small RNA sequences.
It was not, however, why I continued. Over
the course of my Ph.D., I’ve thought I wanted to be a lot of things: journal editor,
technical writer, high-school teacher, undergraduate-institution professor. Even
a novelist or, on the lowest of days, a dental hygienist. The unifying
characteristic of all these career aspirations is that they haven’t stuck. I’ve
been excited about them for some period of time, then lost interest as I’ve
realized that the reality of the job was considerably less appealing than my
mental abstraction. Honestly, I expected programming to go the same way.
It didn’t. What I discovered, in the course
of just a few Python for Biologists assignments, was that programming was awesome. It was more than awesome; it
was a magical zone into which I could drop completely, something in which I
could immerse myself with complete concentration, my often distractible mind
focusing itself into a narrow beam of energy. I forgot everything else: lab
work, worries, food, time of day. Whenever I started to program, I went into
the zone and didn’t want to come out. That feeling was why I continued
programming; that feeling, and the thought that if I could simply prove myself,
if I could just be good enough, I would get to do this every single day.
All
of which is to explain why, in the middle of August, I started on my Twitter project.
My original rationale for picking Twitter was simply that I wanted some big
data to play with, and Twitter’s happened to be publicly available. Being an
intractable Child of the Sacred Heart, I soon realized that I could use Twitter to analyze
something socially redeeming. So, I set out to measure attitudes towards
education in different geographical
regions of the Bay Area. The ultimate goal was to identify factors (co-occurring
words/ideas, user characteristics, and/or user connectivity characteristics)
that predisposed high school students from underprivileged areas towards
positive or negative attitudes about college.
I planned to do all of this
during night and weekend hours, in about a month, while finishing my Ph.D.
thesis. Needless to say, I have yet to fulfill the grandeur of my
original vision. However, I have managed to meet some incremental goals: stream
messages from Twitter, parse them, split them by locale, assign them affect
scores, and measure average affect for education-related keywords. On top of
that, I’ve learned some other useful things: that my dataset is way
too small for what I planned to do; that my program architecture becomes painfully
clunky as locales and query terms are scaled up; and that several weeks of Twitter data (even slow, geo-filtered
Twitter data) will explode the brain of my ten-year-old Dell laptop.
So, I am not here to showcase a glossy,
perfect project. Instead, I am here because I’ve learned some things that I
think other people might find useful, like how to stream Twitter data, filter
for geographical location (hint: get ready to draw lots of overlapping
coordinate boxes), parse JSON files, work with UTF-8 encoding (you don’t want
to lose those emoticons, do you?), and calculate a couple of different kinds of affect
scores.
My goal is to share what I’ve learned, so that someone else could easily implement a similar analysis
for her/his queries, locations, and message properties of interest. And maybe
even avoid some of the stupid (but highly educational) mistakes I've made.
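To give a rough flavor of those steps, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's actual code: the bounding boxes, the tiny affect lexicon, and the function names (`locale_of`, `affect_score`, `average_affect_by_locale`) are all hypothetical, and it assumes each stored message is one JSON object per line with a `text` field and a GeoJSON-style `coordinates` field given as [longitude, latitude].

```python
import json

# Hypothetical bounding boxes as (west, south, east, north) in degrees.
# The values are illustrative only; a real locale may need several boxes.
LOCALE_BOXES = {
    "san_francisco": [(-122.52, 37.70, -122.35, 37.83)],
    "oakland": [(-122.35, 37.70, -122.11, 37.89)],
}

# Hypothetical toy lexicon mapping words to affect values in [-1, 1].
AFFECT_LEXICON = {"love": 1.0, "great": 0.8, "boring": -0.6, "hate": -1.0}


def locale_of(lon, lat):
    """Return the first locale whose boxes contain the point, else None."""
    for name, boxes in LOCALE_BOXES.items():
        for west, south, east, north in boxes:
            if west <= lon <= east and south <= lat <= north:
                return name
    return None


def affect_score(text):
    """Average the lexicon values of the words found in text, else None."""
    words = [w.strip(".,!?#\"'").lower() for w in text.split()]
    scores = [AFFECT_LEXICON[w] for w in words if w in AFFECT_LEXICON]
    return sum(scores) / len(scores) if scores else None


def average_affect_by_locale(json_lines, keyword):
    """Parse JSON lines and average affect per locale for one keyword."""
    totals = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Assumed layout: {"coordinates": {"coordinates": [lon, lat]}} or null.
        coords = (tweet.get("coordinates") or {}).get("coordinates")
        text = tweet.get("text", "")
        if not coords or keyword.lower() not in text.lower():
            continue
        locale = locale_of(coords[0], coords[1])
        score = affect_score(text)
        if locale is None or score is None:
            continue
        total, count = totals.get(locale, (0.0, 0))
        totals[locale] = (total + score, count + 1)
    return {loc: total / count for loc, (total, count) in totals.items()}
```

When reading saved messages from disk, opening the file with `open(path, encoding="utf-8")` and writing with `json.dumps(obj, ensure_ascii=False)` keeps emoticons and other non-ASCII characters readable instead of turning them into escape sequences.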
Thanks for reading, and I hope you’ll
tune in next time for my first technical article: “How to stream Twitter data”!