In this two-part series of posts, I’ll discuss several aspects of how to get data from Twitter. In Part 1, I’ll simply describe how to get some Twitter data (a random-ish sample of all global Twitter data) streaming on your computer, mostly by directing the reader to Bill Howe's excellent guide on Coursera. In Part 2, I’ll talk about ways to filter or gate your request so that you get only a subsample of Tweets with particular properties – e.g., from a certain location, in a particular language, or containing a given keyword.
A few notes on system requirements and background:
- My programs are written using Python 2.7.3 in the IDLE development environment. If you want to use my code or write your own similar code, you will also need to install and use Python.
- This blog will not teach you how to program in Python, but if you have some coding experience (or even if you don’t), Python is an easy language to pick up. Some good resources for teaching yourself Python include the Codecademy course and Google’s Python Class.
- All my programs are tested on a Windows XP or Windows 7 computer. I think most of the code should be portable to a different OS, but setup details (e.g., how to install packages) may differ.
Step 1: Getting a Basic Twitter Stream
In the scheme of programming feats, streaming data from Twitter is not very complicated. In fact, if you go to the Github repository for Introduction to Data Science, you can find a nice, simple piece of working code that does just that – all you have to do is press the button. Right?
Well, yes and no. For one thing, that code has some credential and package dependencies that have to be satisfied, and which (while not intellectually interesting) can be a barrier if you’ve never queried an API before…which I most definitely hadn’t. Also, while the code may work for you once you’ve satisfied these dependencies, it may not be especially clear how it works – which is important if you want to modify it, e.g., to write data to a file in different formats, or to make queries of different types. I’ll discuss how to modify queries in the second post of this series.
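To make the file-writing modification concrete, here’s a minimal sketch of one way to do it: append each raw tweet line to a text file. The function name save_stream and the fake input lines are my own placeholders for illustration, not part of the course template:

```python
# A hedged sketch (my own, not from the assignment code): instead of printing
# each raw JSON line from the stream, append it to a file, one tweet per row.
def save_stream(lines, filename):
    """Write each raw JSON line to its own row of a text file."""
    with open(filename, "a") as out:
        for line in lines:
            out.write(line.rstrip("\n") + "\n")

# Example usage with made-up stand-in data:
save_stream(['{"text": "first"}', '{"text": "second"}'], "tweets.txt")
```

In real use, the `lines` argument would be whatever iterable of raw JSON strings your streaming loop produces; appending (mode "a") rather than overwriting means you can stop and restart the stream without losing earlier data.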
The code that I'll point you to today isn’t mine at all – rather, it’s the template code in the Introduction to Data Science Assignment 1 folder on Github. If you are not familiar with Github, it’s an online code repository with useful features for version control. However, you don’t need to understand these features to get the Assignment 1 materials. You can simply access the link above and click the “Download ZIP” button on the right-hand side of the page. This should give you a ZIP folder containing several Python files, including twitterstream.py (the streaming template), as well as an HTML file (“assignment1.html”) and some text files.
Open up “assignment1.html” and scroll down to the header “Problem 1: Get Twitter Data.” As described in the instructions, you need to complete several setup steps before you can begin streaming data:
- First, you will need to use the Twitter website to get authentication credentials that you can use to connect securely to the Twitter API. These are a set of unique codes that identify you, and which are necessary in order to query the Twitter API. (To use these codes, paste them into the twitterstream.py file in the appropriate spot, as shown in the instructions.)
- Second, you will need to install the oauth2 library on your computer. oauth2 enables you to submit your credentials and thus make a secure request to the Twitter API. To install oauth2, you may first wish to install setuptools (which provides the easy_install utility), then run the command “easy_install oauth2” from the Windows command line. If you are still unable to import oauth2 in Python, you may need to figure out where easy_install put the oauth2 package, and add that location to the PYTHONPATH environment variable.
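If you hit that last snag, here is a quick diagnostic of my own (not part of the assignment code) for checking whether a package such as oauth2 is importable, and if so where it lives, so you know what might need to go on PYTHONPATH:

```python
# A small diagnostic sketch: report where a package was loaded from, or None
# if Python cannot import it at all.
import sys

def locate_package(name):
    """Return the file a package was loaded from, or None if not importable."""
    try:
        module = __import__(name)
    except ImportError:
        return None
    return getattr(module, "__file__", None)

print(locate_package("oauth2"))  # a path string if installed, None otherwise
print(sys.path)                  # the directories Python searches for imports
```

If locate_package prints None, compare the folder where easy_install placed oauth2 against the directories listed in sys.path; adding the missing folder to PYTHONPATH should fix the import.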
Now, you should be ready to run twitterstream.py and get your first stream of Twitter data. You can run the script either from the Windows command line (navigate to the directory that contains the file, then type “python twitterstream.py” – or redirect the output to a file with “python twitterstream.py > output.txt” if you want to keep it) or from within a development environment like IDLE. The output should be a stream of tweets, accompanied by a lot of metadata in a bracketed-dictionary format (JSON, which I’ll discuss a couple of posts down the road).

It can be kind of a rush the first time you see the data flowing – I remember how psyched I was that I had gotten the code to work, and how powerful I suddenly felt, as if no data-mining challenge were beyond my reach. Overconfidence? Maybe a tad. But getting a datastream going is the first step toward being able to collect the specific Twitter sample you need...which will be the topic of the second post in this series, “Putting filters on your query.”
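As a small taste of what’s to come: each line the stream prints is one JSON object, which Python’s built-in json module can parse into a dictionary. The sample tweet below is made up for illustration, though fields like “text”, “user”, and “lang” do appear in real tweets:

```python
# Parsing one line of streamed output. The sample line is fabricated;
# real tweets carry these fields plus a great deal more metadata.
import json

sample_line = '{"text": "Hello, world!", "user": {"screen_name": "example"}, "lang": "en"}'

tweet = json.loads(sample_line)      # parse the JSON string into a dict
print(tweet["text"])                 # -> Hello, world!
print(tweet["user"]["screen_name"])  # -> example
```

Once each tweet is a dictionary like this, filtering by language, location, or keyword becomes a matter of inspecting the right fields – which is exactly where Part 2 picks up.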