Working with streaming data: Using the Twitter API to capture tweets

If you’ve done any data science or data analysis work, you’ve probably read in a CSV file or connected to a database and queried rows. A typical data analysis workflow involves retrieving stored data, loading it into an analysis tool, and then exploring it. This works well when you’re dealing with historical data, for example when analyzing which products a customer at your online store is most likely to purchase, or whether people’s diets changed in response to advertising. But what if you want to predict stock prices in real time? Or figure out what people are watching on television right now?

As more data is generated, it’s becoming increasingly important to be able to work with real-time data. Real-time, or streaming, data is generated continuously; the stock market, for example, can generate millions of rows every hour. Due to size and time constraints, there often isn’t a neat dataset that you can analyze – you’ll need to either store the data to analyze later, or analyze it in real time, as you get it.
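
To make the two options concrete, here is a minimal sketch in Python. It doesn't touch any real data source: `simulated_price_stream` and `running_mean` are hypothetical helpers invented for this illustration, and the "prices" are random numbers standing in for a live feed. The first approach stores every value and analyzes the full list afterwards; the second updates its result as each value arrives and never keeps the raw data.

```python
import random
from typing import Iterable, Iterator


def simulated_price_stream(n_ticks: int, seed: int = 0) -> Iterator[float]:
    """Yield made-up prices one at a time (a stand-in for a live feed)."""
    rng = random.Random(seed)
    price = 100.0
    for _ in range(n_ticks):
        price += rng.uniform(-1.0, 1.0)
        yield price


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, without storing them."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


# Option 1: store the data first, then analyze it later.
stored = list(simulated_price_stream(1000))
batch_mean = sum(stored) / len(stored)

# Option 2: analyze the data as it arrives, keeping only a count and a total.
stream_mean = None
for stream_mean in running_mean(simulated_price_stream(1000)):
    pass  # in a real application you would act on each updated value here
```

Both approaches end up with the same average here, but the second only ever holds two numbers in memory, which is what matters once the stream is too large or too fast to store comfortably.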

Being able to work with streaming data is a critical skill for any aspiring data scientist. In this post, we’ll talk about…