Each square in the chart measures the rank correlation between two languages; positive correlations are in blue, and negative correlations are in red. A little more than halfway down the first column is a medium-red square showing the correlation between Go and ActionScript. Because the square is red, we know that as users write more Go, they write less ActionScript. From the intensity of the red, we know that this rule of thumb is fairly reliable. Rank correlations are symmetric, so ActionScript fans also write less Go (and there’s an identically-colored square in the bottom row to show it). Take a moment to find your favorite languages, then read on for the details (or skip to the conclusions)!
GitHub’s own Brian Doll published a data set titled “Programming Language Correlations,” but as an astute commenter pointed out, it’s really a set of conditional probabilities. For our question, that’s an important distinction — while conditional probabilities let us say, “87.9% of CoffeeScript programmers also code in Ruby,” they don’t allow us to say, “People who write more CoffeeScript also tend to write more Ruby.” To tackle our question, we’ll need access to some more granular data.
Rather than banging on the API, we can use the GitHub Archive. This fantastic resource archives every public GitHub event and makes the whole data set accessible via Google BigQuery. BigQuery has a web-based console and a comfortably SQL-like query language, so it’s easy to get the data we need (all the code in this post is also in a single Gist):
    select actor, repository_language, count(repository_language) as pushes
    from [githubarchive:github.timeline]
    where type='PushEvent'
        and repository_language != ''
        and PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-01-01 00:00:00')
        and PARSE_UTC_USEC(created_at) < PARSE_UTC_USEC('2013-01-01 00:00:00')
    group by actor, repository_language;
The results of this query are in a stacked format, where each combination of user and language is on a separate row:
Stacked formats are often convenient in database schemas, but they’re not very useful for analysis. We’d rather unstack the data so that there’s one row per user and one column per language:
With the help of the brilliant pandas library, we can perform this transformation with one Python command:
    import pandas as pd

    # Reshape from one row per (user, language) pair to one row per user.
    pushes = pd.read_csv('stacked_language_by_user.csv').pivot(
        index='actor', columns='repository_language', values='pushes')
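To make the reshaping concrete, here’s a minimal sketch on a few made-up rows. The users and push counts below are invented for illustration and are not taken from the GitHub Archive; the only thing assumed from the real data is the three column names returned by the query above.

```python
import pandas as pd

# Hypothetical rows in the stacked layout: one row per (user, language) pair.
stacked = pd.DataFrame({
    'actor': ['alice', 'alice', 'bob', 'carol', 'carol'],
    'repository_language': ['Python', 'Ruby', 'Python', 'Go', 'Ruby'],
    'pushes': [12, 3, 7, 5, 9],
})

# pivot() turns each distinct language into a column and each user into a row.
unstacked = stacked.pivot(
    index='actor', columns='repository_language', values='pushes')

print(stacked)
print(unstacked)
```

Any user-language pair that doesn’t appear in the stacked rows comes out as NaN in the unstacked table, which matters once correlations are calculated.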
Exporting these results from BigQuery was an enormous pain (and required a paid account), so I’ll keep a zipped copy of the stacked and unstacked results available.
GitHub recognizes lots of different languages, including some that are fairly obscure, so our unstacked data set has too many columns to visualize. Let’s just keep the most popular languages:
    import numpy as np

    # Keep only the languages (columns) with more than 50,000 total pushes.
    popular = pushes.loc[:, pushes.sum() > 50000]
Now that our data’s formatted and filtered, it’s time to actually calculate our correlation matrix and draw a plot. Again, pandas makes the number-crunching ridiculously simple:
    import matplotlib.pyplot as plt
    import numpy as np

    def plot_correlation(dataframe, filename, title='', corr_type=''):
        """Draw a correlation matrix as a red/blue heatmap and save it to a file."""
        lang_names = dataframe.columns.tolist()
        # Offset the ticks by 0.5 so each label sits at the center of its cell.
        tick_indices = np.arange(0.5, len(lang_names) + 0.5)
        plt.figure()
        plt.pcolor(dataframe.values, cmap='RdBu', vmin=-1, vmax=1)
        colorbar = plt.colorbar()
        colorbar.set_label(corr_type)
        plt.title(title)
        plt.xticks(tick_indices, lang_names, rotation='vertical')
        plt.yticks(tick_indices, lang_names)
        plt.savefig(filename)

    spearman_corr = popular.corr(method='spearman')
    plot_correlation(
        spearman_corr,
        'spearman_language_correlation.svg',
        title='2012 GitHub Language Correlations',
        corr_type="Spearman's Rank Correlation")
Update: As conjugateprior notes in the comments below, this calculation ignores any rows with missing data (for example, the Python-Ruby correlation ignores any users who haven’t used both Python and Ruby). We could fill the missing values with zeroes (which Corey has already done - check out his updated code and plot), or we could also calculate significance for each correlation.
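Here’s a rough sketch of how much that choice can matter, using invented push counts (the `toy_pushes` table and its values are hypothetical). By default, `DataFrame.corr` only uses the rows where both columns have a value, while filling with zeroes keeps every user in the calculation:

```python
import numpy as np
import pandas as pd

# Made-up push counts; NaN means the user never pushed in that language.
toy_pushes = pd.DataFrame({
    'Python': [10, 20, 30, np.nan, np.nan, 5],
    'Ruby': [1, 2, 3, 40, 50, np.nan],
})

# Default behavior: each pair of columns only uses rows with both values present.
pairwise = toy_pushes.corr(method='spearman')

# Alternative: treat "never pushed" as zero pushes, so every user is counted.
zero_filled = toy_pushes.fillna(0).corr(method='spearman')

print(pairwise.loc['Python', 'Ruby'])
print(zero_filled.loc['Python', 'Ruby'])
```

In this toy data, the three users who wrote both languages are perfectly rank-correlated, but once the single-language users are counted as zeroes the correlation turns negative. Real data won’t necessarily swing this far.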
It’s better to use Spearman’s rank correlation here instead of the usual Pearson correlation for two reasons:
- We don’t really care whether the relationship between languages is strictly linear.
- There are quite a few outliers in our data set, and rank correlations are less distorted by these outliers.
If that doesn’t convince you, it’s easy to calculate the Pearson correlation: it’s the default in pandas, so removing the method='spearman' argument above should do the trick. If you’re impatient, you can just peek at the
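As a quick illustration of the outlier point, here’s a small sketch with made-up push counts in which one hypothetical user pushes far more Python than anyone else. The two columns rise together, so the rank correlation is perfect, while the Pearson correlation gets dragged down by the single extreme value:

```python
import pandas as pd

# Invented data: the last user is an extreme outlier in Python pushes.
toy = pd.DataFrame({
    'Python': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000],
    'Ruby': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
})

pearson = toy.corr(method='pearson').loc['Python', 'Ruby']
spearman = toy.corr(method='spearman').loc['Python', 'Ruby']

print(f'Pearson:  {pearson:.2f}')
print(f'Spearman: {spearman:.2f}')
```

Spearman only looks at the ordering of the users, so it doesn’t care how far ahead the outlier is.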
The most striking thing about this chart is its blueness. Despite our tribalism, writing scads of C# doesn’t make programmers any less likely to hack on some R. Even PHP, perhaps the most hated programming language on earth, has a slight positive correlation with Haskell. After seeing so many flamewars in forums, on mailing lists, and even in person, I expected language communities to be more insular. I’m particularly surprised by the positive correlations between the languages associated with proprietary platforms (C#, Objective-C, and ActionScript) and the traditionally open-source languages.
Not surprisingly, special-purpose languages are the exception to this rule. R, Matlab, and Puppet show a greater number of strong correlations (both positive and negative) than most languages, likely because of their niche roles in data analysis and devops.
Like any analysis project, this one comes with a few caveats:
- GitHub pushes aren’t a perfect measure of activity. Then again, neither are commits, lines of code changed, or anything else I’ve heard of.
- This data only considers public projects on GitHub, many of which are open source. Open-source programmers and projects may behave quite differently from their closed-source counterparts.
- Correlation isn’t causation.
I’ve only scratched the surface here — if you’ve got some ideas, leave a comment!