GitHub in 2013: Pull Request Actors

In this IPython notebook I give an overview of GitHub pull request events in 2013, based on data obtained from the GitHub Archive. This is part of a series of posts about GitHub in 2013. The source code of this notebook is available in this GitHub repository; the CSV file with the pull request events is not included due to its size. If there is demand, I will look into uploading it to a data sharing service. Suggestions for which service to use are welcome.

Preliminaries

First, load the necessary packages and set a global footer text and a limit for the charts below.

In [1]:
import datetime
import itertools

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import graphs3 as graphs

limit = 20
footer = 'Data: githubarchive.org - Source: coderstats.net / @coderstats'

Read the event data from a compressed CSV file, set the first column as the index, and specify that it contains dates. Also add a Count column and set its values to 1 for aggregating events.

In [2]:
df_pulls = pd.read_csv('csv/githubarchive/2013/pullrequest_events.csv.bz2', compression='bz2', index_col=0, parse_dates=[0], low_memory=False)
df_pulls['Count'] = 1
df_pulls.head()
Out[2]:
                     Action  Actor          Actor Type  Repo Forks  Repo Language  Repo Name                       Repo Owner     Repo Size  Repo Stars  Repo Watchers  Repo is Fork  Count
2013-01-01 08:03:58  opened  CODeRUS        User        17          Python         yowsup                          tgalal         196        47          47             False         1
2013-01-01 08:05:25  closed  tnm            User        59          C              rugged                          libgit2        356        488         488            False         1
2013-01-01 08:06:06  opened  thebinaryhood  User        0           Ruby           beastmaster                     thebinaryhood  172        0           0              False         1
2013-01-01 08:07:29  closed  thebinaryhood  User        0           Ruby           beastmaster                     thebinaryhood  172        0           0              False         1
2013-01-01 08:09:58  opened  lichenbo       User        1           NaN            2012-a-year-of-no-significance  leon-huang     424        0           0              False         1

5 rows × 12 columns
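
The effect of index_col and parse_dates can be tried without the full data set. The following is a minimal sketch that reads a small in-memory CSV; the three rows and the reduced column set are made up for illustration and only mimic the layout of the real file.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the first (unnamed) column holds the timestamp.
sample_csv = io.StringIO(
    ",Action,Actor,Repo Language\n"
    "2013-01-01 08:03:58,opened,CODeRUS,Python\n"
    "2013-01-01 08:05:25,closed,tnm,C\n"
    "2013-01-01 08:06:06,opened,thebinaryhood,Ruby\n"
)

# index_col=0 makes the first column the index; parse_dates=[0] parses it as datetimes.
df_sample = pd.read_csv(sample_csv, index_col=0, parse_dates=[0])

# A constant column of ones, so that summing it counts events.
df_sample['Count'] = 1

print(type(df_sample.index).__name__)
print(df_sample['Count'].sum())
```

With a datetime index in place, the events could later be sliced or resampled by date, although this notebook only aggregates by actor and language.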

Distribution of pull request actions

Each pull request event has an associated action, so let's first look at the distribution of different actions.

In [3]:
df_pulls['Action'].value_counts().plot(kind='bar', rot=0)
plt.show()

The most common action is opening a pull request. Most pull requests are closed sooner or later, which doesn't necessarily mean they were accepted.
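
To compare the actions as shares rather than absolute numbers, value_counts can normalize the result. This is a small sketch on made-up action values, not on the real events, so the numbers carry no meaning beyond showing the call.

```python
import pandas as pd

# Hypothetical action values, not taken from the real data set.
actions = pd.Series(['opened', 'opened', 'closed', 'opened', 'closed', 'reopened'])

# Absolute counts, most frequent action first.
counts = actions.value_counts()

# The same counts expressed as fractions of all events.
shares = actions.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```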

Opened pull requests

In this notebook we'll focus on pull requests opened by users (not organizations) in 2013.

In [4]:
df_opened = df_pulls[(df_pulls['Action'] == 'opened') & (df_pulls['Actor Type'] == 'User')]
df_opened.head()
Out[4]:
                     Action  Actor          Actor Type  Repo Forks  Repo Language  Repo Name                       Repo Owner      Repo Size  Repo Stars  Repo Watchers  Repo is Fork  Count
2013-01-01 08:03:58  opened  CODeRUS        User        17          Python         yowsup                          tgalal          196        47          47             False         1
2013-01-01 08:06:06  opened  thebinaryhood  User        0           Ruby           beastmaster                     thebinaryhood   172        0           0              False         1
2013-01-01 08:09:58  opened  lichenbo       User        1           NaN            2012-a-year-of-no-significance  leon-huang      424        0           0              False         1
2013-01-01 08:10:12  opened  CODeRUS        User        17          Python         yowsup                          tgalal          196        47          47             False         1
2013-01-01 08:12:34  opened  kefirfromperm  User        10          Groovy         grails-quartz                   grails-plugins  160        11          11             True          1

5 rows × 12 columns

In [5]:
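# Count opened pull requests per actor
actor_grouped = df_opened.groupby('Actor')[['Count']].sum()
# Sort ascending by count and keep the `limit` most active actors
top_actor = actor_grouped.sort_values('Count').tail(limit)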
graphs.barh(top_actor.index,
            top_actor.Count,
            'img/%d-top-actors-pulls-github-2013.png' % limit,
            figsize=(12, limit / 2),
            title='%d top actors by opened pull requests on GitHub in 2013' % limit,
            footer=footer)

Let me just say that some of these users look very fishy. Obviously ideatest1 automates pull requests to test ideas, whatever they are about. The first user in this top 20 who doesn't seem to be a bot is juliocamarero, with an impressive 1338 pull requests (why not one less, man?), or on average 3.67 per day, every day in 2013.
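
The daily average quoted above is simply the yearly total divided by the number of days in 2013. A quick check, with the day count derived from the calendar rather than hard-coded:

```python
import datetime

pulls = 1338  # pull requests opened by juliocamarero in 2013, as quoted above

# Number of days in 2013, computed from the first day of each year.
days = (datetime.date(2014, 1, 1) - datetime.date(2013, 1, 1)).days

per_day = pulls / days
print(days, round(per_day, 2))
```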

Most hyperpolyglot actors

In natural language a person who speaks 6 or more languages is considered a hyperpolyglot, a term coined by the linguist Richard Hudson. Adapting this to the world of programming, a person who writes code in 6 or more programming languages could be considered a hyperpolyglot programmer.

Every GitHub repository is assigned a main language, provided one is detected by GitHub's linguist, so we're going to group pull requests by actors and languages to find out who the most hyperpolyglot programmers on GitHub are. This is not without problems, because a pull request can be just a typo fix in the readme file or even a code change that affects a part of the code base that is not written in its main language. Also, GitHub's linguist first looks at file extensions, so it relies on conventions, which can obviously be broken by coders. Keep this in mind here and whenever you see an analysis of programming languages on GitHub.

In [6]:
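actor_lang_grouped = df_opened.groupby(['Actor', 'Repo Language'])[['Count']].sum()
# One row per (actor, language) pair, so summing a column of ones per actor
# gives the number of distinct languages for that actor
actor_lang_grouped['Lang Count'] = 1
actor_lang_counts = actor_lang_grouped['Lang Count'].groupby(level=0).sum()
actor_lang_counts = actor_lang_counts.sort_values()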

top_actor_lang = actor_lang_counts.tail(limit)
graphs.barh(top_actor_lang.index,
            top_actor_lang,
            'img/%d-top-actors-pulls-languages-github-2013.png' % limit,
            figsize=(12, limit / 2),
            title='%d top actors by opened pull request languages on GitHub in 2013' % limit,
            footer=footer)

If you look at the profiles of some of these users, chances are very good that they are indeed hyperpolyglot programmers. But let's have a look at the distribution of language counts by actors' pull requests.

In [7]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

actor_lang_counts.hist(bins=53, ax=axes[0])
actor_lang_counts[actor_lang_counts < 15].hist(bins=14, ax=axes[1])
actor_lang_counts[actor_lang_counts < 15].hist(bins=14, log=True, ax=axes[2])
plt.show()

By far most coders submit pull requests to repositories of a single language. Without digging further into it, I assume that most actors submit just one or a few pull requests to the same repo. That makes it quite impressive how diversified some GitHubbers seem to be.
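
The claim about single-language actors could be backed with a number. The sketch below uses a handful of made-up rows; it counts distinct languages per actor with nunique, which is an alternative to the column-of-ones approach used in this notebook, and then computes the share of actors with exactly one language. On the real data, df_opened would take the place of the hypothetical df.

```python
import pandas as pd

# Hypothetical opened-pull-request rows; the actor names are made up.
df = pd.DataFrame({
    'Actor': ['alice', 'alice', 'alice', 'bob', 'bob', 'carol', 'dave'],
    'Repo Language': ['Python', 'C', 'Python', 'Ruby', None, 'Go', 'Go'],
})

# nunique() counts the distinct non-null languages per actor.
lang_counts = df.groupby('Actor')['Repo Language'].nunique()

# The mean of a boolean Series is the fraction of True values.
single_share = (lang_counts == 1).mean()

print(lang_counts.to_dict())
print(single_share)
```

Repositories without a detected language are ignored by nunique, just as they drop out of the groupby on Repo Language.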

Language combinations of hyperpolyglots

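The last thing we'll look at here are the top language combinations of those hyperpolyglots. We'll only take users with more than 6 and fewer than 15 different languages into account. Note that this is slightly stricter than the definition above, since users with exactly 6 languages are left out.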

In [8]:
hyper = actor_lang_counts[(actor_lang_counts > 6) & (actor_lang_counts < 15)]
is_hyper = df_opened['Actor'].isin(hyper.index)
df_hyper = df_opened[is_hyper]
actor_lang_grouped = df_hyper.groupby(['Actor', 'Repo Language']).count()
actor_lang_grouped.head()
Out[8]:
Actor       Repo Language  Action  Actor  Actor Type  Repo Forks  Repo Language  Repo Name  Repo Owner  Repo Size  Repo Stars  Repo Watchers  Repo is Fork  Count
9034725985  C              38      38     38          38          38             38         38          38         38          38             38            38
            C#             2       2      2           2           2              2          2           2          2           2              2             2
            CSS            4       4      4           4           4              4          4           4          4           4              4             4
            Java           15      15     15          15          15             15         15          15         15          15             15            15
            JavaScript     54      54     54          54          54             54         54          54         54          54             54            54

5 rows × 12 columns

Now create a dictionary of language combinations and their counts and turn it into a DataFrame.

In [9]:
actor_langs = actor_lang_grouped.groupby(level=0)
lang_combs = {}
for g in actor_langs.groups.values():
    langs = [i[1] for i in g]
    for c in itertools.combinations(langs, 2):
        lang_combs[c] = lang_combs.get(c, 0) + 1

df_lang_combs = pd.DataFrame.from_dict(lang_combs, orient='index')
df_lang_combs.columns = ['Count']
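
The pair counting can also be written with collections.Counter from the standard library. This is a sketch on three made-up actors and is not the code used for the chart; it only illustrates how itertools.combinations turns each actor's language list into pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical actors and the languages of the repos they sent pull requests to.
actor_languages = {
    'alice': ['Python', 'C', 'Java'],
    'bob': ['Java', 'Python'],
    'carol': ['C', 'Python'],
}

pair_counts = Counter()
for langs in actor_languages.values():
    # Sorting gives every pair a canonical order, so ('C', 'Java')
    # and ('Java', 'C') are never counted as two different pairs.
    pair_counts.update(combinations(sorted(langs), 2))

print(pair_counts.most_common())
```

An actor with n languages contributes n * (n - 1) / 2 pairs, so actors with many languages weigh more heavily in the result.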

Finally, plot a graph of the top language combinations.

In [10]:
limit = 50
# Sort ascending by count and keep the `limit` most frequent combinations
top = df_lang_combs.sort_values('Count').tail(limit)
graphs.barh(top.index.map(lambda x: ' and '.join(x)),
            top.Count,
            'img/%d-top-hyper-language-combinations-pulls-github-2013.png' % limit,
            figsize=(12, limit / 2),
            title='%d top hyperpolyglot language combinations for pull requests on GitHub in 2013' % limit,
            footer=footer)

This post was written by Ramiro Gómez and published on August 01, 2014.