GitHub in 2013: Follow Events

In this IPython notebook I give a brief overview of GitHub follow events in 2013, based on data obtained from the GitHub Archive. This is a follow-up post to the Event Types 2013 notebook. The source code of this notebook is available in this GitHub repository. The CSV file with the follow events is not included due to its size. If there is demand, I will look into uploading it to a data sharing service; suggestions for which service to use are welcome.


First load the necessary packages, set a global footer text and a limit for the bar charts below.

In [1]:
import datetime
import itertools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import graphs

limit = 30
footer = 'Data: - Source: / @coderstats'

Read the follow events from a compressed CSV file, set the first column as the index and specify that it contains dates. Also add a Count column with every value set to 1, so that summing it aggregates events by followers and followed users.

In [2]:
df_follows = pd.read_csv('csv/githubarchive/2013/follow_events.csv.bz2', compression='bz2', index_col=0, parse_dates=[0])
df_follows['Count'] = 1
                     Actor       Target          Count
2013-01-01 08:00:48  kelsonzhao  pjhyett         1
2013-01-01 08:02:05  bedreamer   t-k-            1
2013-01-01 08:03:17  zonuexe     0xrofi          1
2013-01-01 08:03:56  adulau      FredericJacobs  1
2013-01-01 08:04:09  conikeec    garyburd        1
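
To make the loading step easier to try without the full dataset, here is a minimal sketch that writes a few rows to a temporary bz2 file and reads them back with the same `read_csv` arguments. The header name `Created` and the temporary file name are assumptions for illustration; the real file's header is not shown in this notebook.

```python
import bz2
import os
import tempfile

import pandas as pd

# Hypothetical sample in the Actor/Target layout shown above.
# "Created" is an assumed name for the timestamp column.
raw = (
    "Created,Actor,Target\n"
    "2013-01-01 08:00:48,kelsonzhao,pjhyett\n"
    "2013-01-01 08:02:05,bedreamer,t-k-\n"
    "2013-01-01 08:03:17,zonuexe,0xrofi\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_follow_events.csv.bz2")
    # Write the sample as bz2-compressed text.
    with bz2.open(path, "wt", encoding="utf-8") as fh:
        fh.write(raw)
    # The first column becomes the index and is parsed as datetimes.
    sample = pd.read_csv(path, compression="bz2", index_col=0, parse_dates=[0])

# One row per follow event, so summing Count later yields event counts.
sample["Count"] = 1
print(sample)
```

With a datetime index in place, date-based filtering and grouping work directly on the index.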

Most Following Users

One eye-catching observation in the previous notebook was a huge spike in the follow events timeline, so let's shed some light on this by looking at the users who followed the most people.

In [3]:
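# Sum the Count column per follower and keep the `limit` largest totals
# (sort_values supersedes the older DataFrame.sort).
top_followers = df_follows.groupby('Actor')[['Count']].sum().sort_values('Count').tail(limit)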
            'img/%d-most-following-github-users-2013.png' % limit,
            figsize=(12, 16),
            title='%d most following GitHub users from 2013-01-01 to 2013-12-11' % limit,

The user (or bot) threejs-cn, which unsurprisingly no longer exists on GitHub, stands out in particular, although a few of the others have quite impressive follow counts as well.
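
The ranking behind such a chart can be sketched on toy data as follows. The account names and the `top_n` variable are made up for illustration, and the plotting helper from `utils.graphs` is left out because its interface is not shown here.

```python
import pandas as pd

# Hypothetical toy events; the account names are invented.
events = pd.DataFrame(
    {
        "Actor": ["alice", "bob", "alice", "carol", "alice", "bob"],
        "Target": ["x", "x", "y", "y", "z", "z"],
    }
)
events["Count"] = 1

# Sum only the Count column per follower, sort ascending and keep the
# last `top_n` rows, i.e. the accounts with the most follow events.
top_n = 2
ranking = (
    events.groupby("Actor")[["Count"]]
    .sum()
    .sort_values("Count")
    .tail(top_n)
)
print(ranking)
```

Selecting `[['Count']]` before summing keeps the result a one-column DataFrame and avoids aggregating the string-valued Target column.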

Let's look at threejs-cn's activity aggregated by date.

In [4]:
# Copy the filtered rows so that adding a column does not modify a view of df_follows.
bot = df_follows[df_follows['Actor'] == 'threejs-cn'].copy()
bot['date'] =
2013-10-18 1
2013-11-05 5
2013-11-14 40934

This is obviously the account responsible for the spike we saw. Interestingly, one day earlier, on November 13, 2013, there was a DDoS attack on GitHub Pages, but this might just be a coincidence. In any case, I assume that it is no longer possible to follow 40k users on GitHub in a single day.
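
A per-day aggregation for a single account can be sketched like this. It assumes the calendar day is derived from the datetime index via `index.date`; the exact expression used in the cell above is cut off, so this is one plausible way to do it rather than the original code. The event data is invented.

```python
import pandas as pd

# Hypothetical follow events with a DatetimeIndex, mirroring df_follows.
events = pd.DataFrame(
    {
        "Actor": ["bot", "bot", "human", "bot"],
        "Target": ["a", "b", "c", "d"],
        "Count": 1,
    },
    index=pd.to_datetime(
        [
            "2013-11-05 09:00:00",
            "2013-11-14 10:00:00",
            "2013-11-14 10:00:01",
            "2013-11-14 10:00:02",
        ]
    ),
)

# Copy the filtered rows so the new column is not written to a view.
bot_events = events[events["Actor"] == "bot"].copy()

# Assumption: the calendar day is taken from the index timestamps.
bot_events["date"] = bot_events.index.date

# Number of follow events by this account per day.
daily = bot_events.groupby("date")["Count"].sum()
print(daily)
```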

Most Followed Users

Let's now look at who gained the most followers last year, or more precisely from 2013-01-01 to 2013-12-11.

In [5]:
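# Sum the Count column per followed user and keep the `limit` largest totals
# (sort_values supersedes the older DataFrame.sort).
top_followed = df_follows.groupby('Target')[['Count']].sum().sort_values('Count').tail(limit)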
            'img/%d-most-followed-github-users-2013.png' % limit,
            figsize=(12, 16),
            title='%d most followed GitHub users from 2013-01-01 to 2013-12-11' % limit,

I wonder why Tom Preston-Werner (mojombo) got so many more followers than his co-founder Chris Wanstrath (defunkt). Both of them have some pretty popular repositories on GitHub. Maybe looking at their timelines of follow events reveals something.

Top Followed Users Timelines

The code below will plot multiple line-graphs for the 20 most followed users in 2013.

In [6]:
top = top_followed.tail(20)

fig, axes = plt.subplots(nrows=10, ncols=2)
fig.suptitle('GitHub top followed users timelines for 2013', y=1.01, fontsize=14)

for idx, coords in enumerate(itertools.product(range(10), range(2))):
    ax = axes[coords[0], coords[1]]
    user = top.index[idx]
    # Copy the filtered rows so the date column is not written to a view.
    user_follows = df_follows[df_follows['Target'] == user].copy()
    user_follows['date'] =
    grouped = user_follows.groupby('date').sum()
    grouped.plot(ax=ax, legend=False, rot=45)
    ax.set_xlabel('', visible=False)

fig.text(0, 0, footer, fontsize=12)
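
As an alternative to filtering the frame once per user, the daily counts for several users can be built with a single grouping step. This is a sketch on invented data, not the notebook's original approach, and it assumes that events are bucketed by the calendar day of the index.

```python
import pandas as pd

# Hypothetical follow events for two followed users.
events = pd.DataFrame(
    {
        "Target": ["u1", "u2", "u1", "u1", "u2"],
        "Count": 1,
    },
    index=pd.to_datetime(
        [
            "2013-01-01 08:00",
            "2013-01-01 09:00",
            "2013-01-02 10:00",
            "2013-01-02 11:00",
            "2013-01-04 12:00",
        ]
    ),
)

# Assumption: bucket events by calendar day (timestamps floored to midnight).
day = events.index.normalize()

# One row per day, one column per followed user; a user without
# events on a given day gets 0.
timelines = (
    events.groupby([day, "Target"])["Count"]
    .sum()
    .unstack("Target", fill_value=0)
)
print(timelines)
```

A frame in this shape could then be passed to `timelines.plot(subplots=True, layout=(10, 2))` to draw one panel per user, provided the number of columns fits the layout.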