Down the Tweet Chain Rabbit Hole

Today I stumbled upon an apparently interesting tweet by @aendu:

The link in the tweet points to another tweet:

And, as you may have guessed, this link points to yet another tweet. About 20
levels deep, I stopped clicking on the links and wrote a Python script instead:

# -*- coding: utf-8 -*-
"""
Tracing the twitter chain, down the rabbit hole.

Dependencies:

  - requests
  - beautifulsoup4

"""
from __future__ import print_function, division, absolute_import, unicode_literals

import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup


START_URL = 'https://twitter.com/aendu/status/433586683615784960'


def inception(url):
    # Request tweet page
    r = requests.get(url)
    if r.status_code == 404:
        print('TWEET DELETED, CHAIN BROKEN :(')
        return
    # Name the parser explicitly so results don't depend on what is installed
    soup = BeautifulSoup(r.text, 'html.parser')
    tweet = soup.select('div.tweet.permalink-tweet')[0]

    # Parse out & print tweet info
    text = tweet.find('p', class_='tweet-text').text
    user = tweet.get('data-screen-name')
    timestamp = tweet.find('span', class_='js-relative-timestamp').get('data-time')
    dt = datetime.fromtimestamp(int(timestamp))
    print('{0} @{1}: {2}'.format(dt.isoformat().replace('T', ' '), user, text))

    # And we need to go deeper!
    links = tweet.find('p', class_='tweet-text').find_all('a')
    for link in links:
        url = link.get('data-expanded-url')
        if not url:
            continue
        # Escape the dots so they only match literal "." characters
        if re.match(r'^https?://(www\.)?twitter\.com/.*status.*$', url):
            return url


if __name__ == '__main__':
    url = START_URL
    while url:
        url = inception(url)

The code is on Gist, feel free to
fork it! Here’s the result: https://gist.github.com/dbrgn/8956214

The first thing you’ll notice is that the chain is broken after 57 tweets,
because @charlescwcooke deleted his tweet in the chain. Too bad.

There are also some other things we can see from this data, for example the
time distribution:

Time distribution scatter plot

You can clearly see that the tweets were sent in “bursts” throughout the day.
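
A rough way to make such bursts visible without a plot is to count tweets per
hour. The following is only a minimal sketch: the timestamps are made-up sample
values, not the actual chain data, and `tweets_per_hour` is a hypothetical
helper name. It assumes Unix timestamps like the `data-time` values the script
reads.

```python
from collections import Counter
from datetime import datetime, timezone


def tweets_per_hour(timestamps):
    """Count tweets per (date, hour) bucket from Unix timestamps (UTC)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(int(ts), tz=timezone.utc)
        counts[dt.strftime('%Y-%m-%d %H:00')] += 1
    return counts


if __name__ == '__main__':
    # Made-up sample timestamps, NOT the real chain data.
    sample = [1392200000, 1392200300, 1392200900, 1392210000, 1392210100]
    # Print one line per hour with a simple text bar.
    for bucket, count in sorted(tweets_per_hour(sample).items()):
        print('{0}  {1}'.format(bucket, '#' * count))
```

Hours with long bars next to empty stretches would correspond to the bursts
visible in the scatter plot.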

Unfortunately, following just one branch of the chain is not too interesting. A
much more interesting analysis would be to see how many branches there are, and
which one is the longest. This could be done in two ways: If you’re very rich,
you can buy access to the Twitter Firehose in order to analyze all the tweets
sent in a few days. The other possibility is to do some kind of backtracking. I
didn’t use the Twitter API because I was too lazy to register a new app, and
resorted to HTML scraping instead. But by using the API, one could first follow
the chain down to the last working tweet, and then use the search API to find
tweets containing a URL to that tweet, repeating this for every tweet found. By
doing that, you could build a dataset containing all branches, and start to
analyze them.
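
The backtracking idea can be sketched independently of any concrete API. In the
sketch below, `find_referrers` is a hypothetical callable standing in for a
search query that returns the URLs of tweets linking to a given tweet. It is
not a real Twitter API function, and the sample data is invented. Starting from
the last working tweet, the code walks the links in reverse and reports the
longest branch.

```python
from collections import deque


def build_branches(root_url, find_referrers):
    """Breadth-first walk from the root tweet to every tweet linking to it.

    Returns a dict mapping each discovered tweet URL to the tweet it
    links to (its parent); the root maps to None.
    """
    parent = {root_url: None}
    queue = deque([root_url])
    while queue:
        current = queue.popleft()
        for referrer in find_referrers(current):
            if referrer not in parent:  # skip tweets we have already seen
                parent[referrer] = current
                queue.append(referrer)
    return parent


def longest_branch(parent):
    """Return the longest path, from its newest tweet down to the root."""
    def path_from(url):
        path = []
        while url is not None:
            path.append(url)
            url = parent[url]
        return path
    return max((path_from(url) for url in parent), key=len)


if __name__ == '__main__':
    # Invented sample data: key = tweet, value = tweets that link to it.
    fake_index = {
        'root': ['a', 'b'],
        'a': ['c'],
        'c': ['d'],
        'b': [],
    }
    tree = build_branches('root', lambda url: fake_index.get(url, []))
    print(' -> '.join(longest_branch(tree)))
```

The number of branches is then the number of tweets in the resulting dict that
no other tweet links to.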

But I leave that to somebody else 🙂