Depending on your interests, a simple, real-time stream that samples worldwide
Twitter data may be exactly what you need. For some applications, however, it’s
helpful to sample a more specific subset of Tweets – for instance, ones from a certain
location, containing a particular term or written in a given language. It’s
indeed possible to query Twitter using filters like these, but there are several
different ways of doing so, each with its own pros and cons.
Streaming vs. Search APIs. One key distinction, which took
me a while to understand – and which I understand at more of an operational than
a technical level – is that between streaming
APIs (e.g., Twitter stream) and REST
APIs (e.g., Twitter search). Basically, a streaming API allows you to open
a persistent connection to Twitter’s remote server, and to collect a continuous
stream of data from this server. A REST API, on the other hand, allows you to
open a transient connection to Twitter’s remote server for a specific query,
then returns the data and closes the connection. For a nice explanation with
diagrams, I recommend looking at Twitter’s overview of streaming vs. REST APIs.
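The distinction can be sketched in a few lines of Python using only the standard library (a real Twitter connection would also require OAuth credentials, which are omitted here; the function names are my own):

```python
import urllib.request

def rest_query(url):
    # REST-style: open a transient connection, return the whole
    # response, and close the connection immediately.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def stream_query(url, handle_line):
    # Streaming-style: hold the connection open and hand each line
    # of data to a callback as it arrives.
    with urllib.request.urlopen(url) as resp:
        for line in resp:
            handle_line(line)
```

With a streaming endpoint, `stream_query` would keep running (and keep calling your callback) until you or the server closed the connection.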
You can request Tweets with specific properties using either the
streaming or the search Twitter API, and the “better” choice will depend on what
you’re trying to do. If, like me, you want to do a data-mining project, you’ll probably
want to use Twitter stream rather than Twitter search. A persistent stream
means that you capture as much of your target Tweet population as possible, and
also means that you don’t have to worry about rate limits (limits to the number
of queries you can make in a given time period), which seem to be much more
stringent for Twitter search than for Twitter stream. If, instead of
data-mining, you’re trying to make an application that takes input from a user,
queries Twitter using that input, and returns a result to the user, Twitter
search might be more appropriate. Because I’ve worked primarily with streaming
APIs, the remainder of the post will focus on these.
Filter Types. Using the streaming API, you can apply a number of filters to your query. Some of the most useful filters, from a data-mining perspective,
include:
- language: this parameter enables you to limit the stream to Tweets written in a certain language. The following line of code requests Tweets whose autodetected language is Spanish: url = "https://stream.twitter.com/1.1/statuses/sample.json?language=es"
- track: this parameter allows you to specify a search term or list of terms, and only returns Tweets containing at least one of the search terms. The following line of code requests Tweets containing the term “school”: url = "https://stream.twitter.com/1/statuses/filter.json?track=school"
- locations: this parameter allows you to specify the location(s) from which Tweets should be retrieved. You will need to specify locations as bounding boxes, which are defined by pairs of coordinates representing the southwest and northeast corners of the box. Note that Twitter, unlike most other sources, places the longitude coordinate first and the latitude coordinate second. The following query returns Tweets from the San Francisco Bay Area: url = "https://stream.twitter.com/1.1/statuses/filter.json?locations=-122.544708,37.208457,-121.929474,38.041602"
The map below is from http://earthexplorer.usgs.gov/, which is helpful for finding the coordinates of your desired bounding box. It shows the SW coordinate (1), the NE coordinate (2), and the bounding box they specify (white outline):
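If you're building these query strings by hand often, it can help to assemble them programmatically. Here's a hypothetical pair of helpers (the names `sample_url` and `filter_url` are my own), assuming the v1.1 streaming endpoints shown above:

```python
from urllib.parse import urlencode

STREAM_BASE = "https://stream.twitter.com/1.1/statuses"

def sample_url(language=None):
    # "sample" endpoint: a random sample of public Tweets,
    # optionally limited to one autodetected language.
    if language is None:
        return f"{STREAM_BASE}/sample.json"
    return f"{STREAM_BASE}/sample.json?" + urlencode({"language": language})

def filter_url(track=None, locations=None):
    # "filter" endpoint: requires at least one of track/locations.
    params = {}
    if track:
        params["track"] = ",".join(track)
    if locations:
        params["locations"] = locations  # pre-formatted coordinate string
    if not params:
        raise ValueError("filter.json needs a track or locations filter")
    return f"{STREAM_BASE}/filter.json?" + urlencode(params)
```

For example, `sample_url("es")` and `filter_url(track=["school"])` reproduce the two queries above.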
Sounds simple, right? Well, yes and no. Filters are very useful for
sampling a specific subset of Twitter data, but they have some quirks and peculiarities
to watch out for. Specifically, certain filters can only be applied to certain streaming
endpoints, and not all filters can be used in combination with each other.
Different Streaming Endpoints. You’ll notice that in the examples above, the language filter is included in a query with syntax statuses/sample, while the track and locations filters are included in queries with syntax statuses/filter. “Sample” and “filter” are two different streaming endpoints, and have different tolerances for filters, as discussed in more detail in the Twitter documentation (referenced above). If you use the “filter” endpoint, you have to specify either a track or a locations filter (you can also use follow, an option I don’t discuss here). If you instead use the “sample” endpoint, you can use the language filter, but will get an error if you try to use a track or locations filter.
Combinable (and Non-Combinable) Filters. An important consideration if you are trying to get data with specific properties is that certain filters can only be combined with a logical OR (not with a logical AND). For instance, if you specify both track and locations using the “filter” endpoint, you’ll get Tweets that are either from the Bay Area or contain the keyword “Obama” – not only Tweets with both those properties. If you want a sample of Tweets that is limited by both location and subject, you’ll need to stream based on one of these two parameters (generally, I would recommend location), then manually filter your data for the other in Python.
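That manual second pass might look something like this sketch (`contains_term` is a name of my own; the "text" field is where Twitter's JSON puts the tweet body):

```python
def contains_term(tweet, terms):
    # Case-insensitive check of the tweet text against a list of keywords.
    text = tweet.get("text", "").lower()
    return any(term.lower() in text for term in terms)

# e.g., tweets collected with a Bay Area locations filter,
# then narrowed by subject after the fact:
streamed = [
    {"text": "Obama speaks in Oakland"},
    {"text": "great burrito in the Mission"},
]
on_topic = [t for t in streamed if contains_term(t, ["Obama"])]
```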
Non-Rectangular Location Queries. When you are filtering Tweets by location, not every area you might want to cover is going to be shaped like a rectangle. In some cases, you might want to cover several separate, rectangular areas (e.g., major U.S. metropolitan areas, including New York, San Francisco, etc.), while in others, you might want to cover a single area that happens not to be shaped like a rectangle (e.g., Brazil). In these cases, you can specify multiple bounding boxes (by adding additional pairs of SW/NE corner coordinates), either covering distinct areas or building up a large, irregularly shaped area from smaller rectangles. For example, the following example would (roughly) cover the continental U.S.: url = "https://stream.twitter.com/1.1/statuses/filter.json?locations=-169.9,51.3,-141.7,72.2,-98.9,25.9,-65.7,49.7,-106.9,25.9,-99.1,49.8,-124.6,30.06,-106.9,49.7"
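Flattening several boxes into that comma-separated list is easy to get wrong by hand, so here is a small hypothetical helper, assuming each box is a (sw_lon, sw_lat, ne_lon, ne_lat) tuple with longitude first, as Twitter expects:

```python
def locations_param(boxes):
    # Flatten a list of (sw_lon, sw_lat, ne_lon, ne_lat) boxes into the
    # single comma-separated list the locations filter expects.
    return ",".join(str(coord) for box in boxes for coord in box)

# the four boxes from the rough continental-U.S. example above
boxes = [
    (-169.9, 51.3, -141.7, 72.2),
    (-98.9, 25.9, -65.7, 49.7),
    (-106.9, 25.9, -99.1, 49.8),
    (-124.6, 30.06, -106.9, 49.7),
]
```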
It’s okay if your bounding boxes aren’t perfect (e.g., if you get a little bit of Mexico or Canada in your U.S. sample) – you can always check the metadata of your tweets at a later processing step to confirm that they’re in your target region.
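That later check might be sketched as follows, assuming you've already pulled a longitude/latitude pair out of the tweet's metadata (`in_bbox` is a name of my own):

```python
def in_bbox(lon, lat, box):
    # box is (sw_lon, sw_lat, ne_lon, ne_lat), longitude first,
    # matching the order Twitter uses for the locations filter.
    sw_lon, sw_lat, ne_lon, ne_lat = box
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat

BAY_AREA = (-122.544708, 37.208457, -121.929474, 38.041602)
```

For example, a tweet geotagged in San Francisco passes, while one from Mexico City (which might slip into a sloppy U.S. bounding box) can be dropped.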
Once you’ve formulated a query to retrieve Tweets with the desired properties, you can begin collecting data. In my case, this meant leaving my decade-old laptop on all the time, patiently gathering tweets as they came in. Once you have some stored data to play with, you’ll be ready to parse your tweets (which come in JSON format, containing UTF-8 text and lots of metadata) and start doing some real analysis. So, thanks for reading, and I hope you’ll tune in next time for “How to parse tweets”!
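As a tiny preview, each line the stream delivers is one JSON object, so a stored tweet can be loaded with the standard json module (the fields shown here are a small, simplified subset of the real metadata):

```python
import json

# one line of stored stream output (abbreviated for illustration)
line = '{"text": "hola mundo", "lang": "es", "coordinates": null}'
tweet = json.loads(line)  # a plain Python dict
```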
