GitHub Social Graphs with Groovy and GraphViz

Kelly Robinson, June 14th, 2012 (Last Updated: October 22nd, 2012)


The Goal

The goal is to use the GitHub API, Groovy and GraphViz to determine, interpret and render a graph of the relationships between GitHub users, based on the watchers of their repositories. The end result can look something like this.

The GitHub V3 API

You can find the full documentation for the GitHub V3 API here. They do a great job of documenting the various endpoints and their behaviour as well as demonstrating usage of the API extensively with curl. For the purposes of this post the API calls I’m making are simple GET requests that do not require authentication. In particular I’m targeting two specific endpoints: repositories for a specific user and watchers for a repository.

Limitations of the API

Although a huge upgrade from the 50 requests per hour rate limit on the V2 API, I found it fairly easy to exhaust the 5000 requests per hour provided by the V3 API while gathering data. Fortunately, every response from GitHub includes a convenient X-RateLimit-Remaining header we can use to check our limit. This allows us to stop processing before we run out of requests, after which GitHub will return errors for every request. For each user we make one request to find their repositories, and for each of those repositories a separate request to find all of the watchers. While executing these requests, using my own GitHub account as the center point, I was able to gather repository information about 1143 users and find 31142 total watchers, 18023 of which were unique in the data collected. These figures are incomplete: every time I reached the rate limit, there were far more nodes left to process in the queue than had already been processed. I only have 31 total repository watchers myself, but the graph also includes users like igrigorik, an employee of Google with 529 repository watchers, and that tends to skew the results somewhat. The end result is that the data provided here is far from complete, I’m sorry to say, but that doesn’t mean it’s not interesting to visualize.
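To get a feel for how quickly the budget disappears, here is a small sketch of the arithmetic described above: one request per user, plus one request per repository whose watchers are looked up. The closure name requestsNeeded and the sample repository counts are made up for illustration and are not part of the original script.

```groovy
// Hypothetical helper: one request for a user's repository list,
// plus one request per repository whose watchers we look up.
def requestsNeeded = { List<Integer> reposPerUser ->
    reposPerUser.sum { 1 + it }
}

// Assumed sample: three users with 4, 10 and 0 repositories worth checking.
def sample = [4, 10, 0]
def used = requestsNeeded(sample)
println "requests used: $used"

// The 5000 requests per hour limit mentioned above, minus what we used.
def remaining = 5000 - used
println "requests remaining: $remaining"
```

A user with many repositories costs many requests, which is why a few prolific accounts can eat a large share of the hourly limit.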

Groovy and HttpBuilder

Groovy and the HttpBuilder DSL abstract away most of the details of handling the HTTP connections. The graph I’m building starts with one central GitHub user and links that user to everyone that is presently watching one of their repositories. This requires a single GET request to load all of the repositories for the given user, and a GET request per repository to find the watchers. These two HTTP operations are very easily encapsulated with Closures using the HttpBuilder wrapper around HttpClient. Each call returns both the X-RateLimit-Remaining value and the requested data. Here’s what the configuration of HttpBuilder looks like:

final String rootUrl = 'https://api.github.com'
final HTTPBuilder builder = new HTTPBuilder(rootUrl)

The builder object is created and fixed at the GitHub API URL, simplifying the syntax for future calls. Now we define two closures, each of which targets a specific URL and extracts the appropriate data from the JSON response (already automagically unmarshalled by HttpBuilder). The findWatchers Closure has a little bit more logic in it to remove duplicate entries, and to exclude the user themselves from the list, as by default GitHub records a self-referential link for all users with their own repositories.

final String RATE_LIMIT_HEADER = 'X-RateLimit-Remaining'
final Closure findReposForUser = { HTTPBuilder http, username ->
    http.get(path: "/users/$username/repos", contentType: JSON) { resp, json ->
        return [resp.headers[RATE_LIMIT_HEADER].value as int, json.toList()]
    }
}
final Closure findWatchers = { HTTPBuilder http, username, repo ->
    http.get(path: "/repos/$username/$repo/watchers", contentType: JSON) { resp, json ->
        return [resp.headers[RATE_LIMIT_HEADER].value as int, json.toList()*.login.flatten().unique() - username]
    }
}
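The de-duplication step in findWatchers can be tried without touching the network. The sketch below applies the same expression to a hand-written list standing in for the parsed JSON; the logins other than kellyrob99 are invented sample data.

```groovy
// Stand-in for the parsed JSON body of a /watchers response (assumed data).
def json = [
    [login: 'alice'],
    [login: 'bob'],
    [login: 'alice'],      // duplicate entry
    [login: 'kellyrob99']  // the repository owner watching their own repo
]
def username = 'kellyrob99'

// Same expression as in findWatchers: collect logins, drop duplicates,
// then remove the owner from the list.
def watchers = json.toList()*.login.flatten().unique() - username
println watchers
```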

Out of this data we’re only interested (for now) in keeping a simple map of Username -> Watchers which we can easily marshal as a JSON object and store in a file. The complete Groovy script code for loading the data can be run from the command line using the following code or executed remotely from a GitHub gist on the command line by calling groovy https://raw.github.com/gist/2468052/5d536c5a35154defb5614bed78b325eeadbdc1a7/repos.groovy {username}. In either case you should pass in the username you would like to center the graph on. The results will be output to a file called ‘reposOutput.json’ in the working directory. Please be patient, as this is going to take a little while; progress is output to the console as each user is processed so you can follow along.

@Grab('org.codehaus.groovy.modules.http-builder:http-builder:0.5.2')
import groovy.json.JsonBuilder
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.ContentType.JSON
 
final rootUser = args[0]
final String RATE_LIMIT_HEADER = 'X-RateLimit-Remaining'
final String rootUrl = 'https://api.github.com'
final Closure<Boolean> hasWatchers = {it.watchers > 1}
final Closure findReposForUser = { HTTPBuilder http, username ->
    http.get(path: "/users/$username/repos", contentType: JSON) { resp, json ->
        return [resp.headers[RATE_LIMIT_HEADER].value as int, json.toList()]
    }
}
final Closure findWatchers = { HTTPBuilder http, username, repo ->
    http.get(path: "/repos/$username/$repo/watchers", contentType: JSON) { resp, json ->
        return [resp.headers[RATE_LIMIT_HEADER].value as int, json.toList()*.login.flatten().unique() - username]
    }
}
 
LinkedList nodes = [rootUser] as LinkedList
Map<String, List> usersToRepos = [:]
Map<String, List<String>> watcherMap = [:]
boolean hasRemainingCalls = true
final HTTPBuilder builder = new HTTPBuilder(rootUrl)
while(!nodes.isEmpty() && hasRemainingCalls)
{
    String username = nodes.remove()
    println "processing $username"
    println "remaining nodes = ${nodes.size()}"
 
    def remainingApiCalls, repos, watchers
    (remainingApiCalls, repos) = findReposForUser(builder, username)
    usersToRepos[username] = repos
    hasRemainingCalls = remainingApiCalls > 300
    repos.findAll(hasWatchers).each{ repo ->
        (remainingApiCalls, watchers) =  findWatchers(builder, username, repo.name)
        def oldValue = watcherMap.get(username, [] as LinkedHashSet)
        oldValue.addAll(watchers)
        watcherMap[username] =  oldValue
        nodes.addAll(watchers)
        nodes.removeAll(watcherMap.keySet())
        hasRemainingCalls = remainingApiCalls > 300
    }
    if(!hasRemainingCalls)
    {
        println "Stopped with $remainingApiCalls api calls left."
        println "Still have not processed ${nodes.size()} users."
    }
}
 
new File('reposOutput.json').withWriter {writer ->
    writer << new JsonBuilder(watcherMap).toPrettyString()
}
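The queue-driven traversal in the script can be easier to follow with the HTTP calls taken out. The following is a simplified variation, not the script itself: fakeApi is an assumed in-memory stand-in for GitHub, budget plays the role of the remaining API calls, and unlike the original it skips users that are already queued.

```groovy
// Assumed in-memory stand-in for the GitHub API: user -> watchers of their repos.
def fakeApi = [
    root: ['a', 'b'],
    a   : ['b', 'c'],
    b   : ['root'],
    c   : []
]

LinkedList<String> nodes = ['root'] as LinkedList
Map<String, Set<String>> watcherMap = [:]
int budget = 3   // hypothetical cap on lookups, standing in for the rate limit

while (!nodes.isEmpty() && budget > 0) {
    String username = nodes.remove()
    def watchers = fakeApi[username] ?: []
    budget--
    watcherMap[username] = watchers as LinkedHashSet
    // Queue newly seen watchers, skipping anyone already processed or queued.
    watchers.each { w ->
        if (!watcherMap.containsKey(w) && !nodes.contains(w)) {
            nodes << w
        }
    }
}
println watcherMap
println "unprocessed: $nodes"
```

With a budget of three lookups the loop stops while user c is still waiting in the queue, which is the same situation the real script reports when it runs low on API calls.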

The JSON file contains very simple data that looks like this:

"bmuschko": [
        "claymccoy",
        "AskDrCatcher",
        "roycef",
        "btilford",
        "madsloen",
        "phaggood",
        "jpelgrim",
        "mrdanparker",
        "rahimhirani",
        "seymores",
        "AlBaker",
        "david-resnick", ...

Now we need to take this data and turn it into a representation that GraphViz can understand. We’re also going to add information about the number of watchers for each user and a link back to their GitHub page.

Generating a GraphViz file in dot format

GraphViz is a popular framework for generating graphs. Its cornerstone is a simple text format for describing a directed graph (commonly referred to as a ‘dot’ file), combined with a variety of different layouts for displaying the graph. For the purposes of this post, I want to describe the following for inclusion in the graph:

An edge from each watcher to the user whose repository they are watching.
A label on each node which includes the user’s name and the count of watchers for all of their repositories.
An embedded HTML link to the user’s GitHub page on each node.
Highlighting the starting user in the graph by coloring that node red.
Assigning a ‘rank’ attribute to nodes that links all users with the same number of watchers.

The script I’m using to create the ‘dot’ file is pretty much just brute force string processing and the full source code is available as a gist, but here are the interesting parts. First, loading in the JSON file that was output in the last step; converting it to a map structure is very simple:

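import groovy.json.JsonSlurper

// filename is the path to the JSON file written in the previous step
def data
new File(filename).withReader {reader ->
   data = new JsonSlurper().parse(reader)
}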

From this data structure we can extract particular details and group everything by the number of watchers per user.

println "Number of mapped users = ${data.size()}"
println "Number of watchers = ${data.values().flatten().size()}"
println "Number of unique watchers = ${data.values().flatten().unique().size()}"
 
//group the data by the number of watchers
final Map groupedData = data.groupBy {it.value.size()}.sort {-it.key}
final Set allWatchers = data.collect {it.value}.flatten()
final Set allUsernames = data.keySet()
final Set leafNodes = allWatchers - allUsernames
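To see what these expressions produce, here they are applied to a tiny hand-made map in the same shape as reposOutput.json. The usernames are invented sample data.

```groovy
// Assumed sample in the same shape as reposOutput.json.
def data = [
    alice: ['bob', 'carol', 'dave'],
    bob  : ['alice'],
    erin : ['bob']
]

// Same expressions as above, applied to the sample.
final Map groupedData = data.groupBy { it.value.size() }.sort { -it.key }
final Set allWatchers = data.collect { it.value }.flatten()
final Set allUsernames = data.keySet()
final Set leafNodes = allWatchers - allUsernames

println groupedData   // users keyed by watcher count, largest count first
println leafNodes     // watchers that were never processed as users themselves
```

Here carol and dave are leaf nodes: they watch somebody’s repository but do not appear as keys in the map.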

Given this data, we create individual nodes with styling details like so:

StringWriter writer = new StringWriter()
groupedData.each {count, users ->
    users.each { username, watchers ->
        def user = "\t\"$username\""
        def attrs = generateNodeAttrsMemoized(username, count)
        def rootAttrs = "fillcolor=red style=filled $attrs"
        if (username == rootUser) {
            writer << "$user [$rootAttrs];\n"
        } else {
            writer << "$user [$attrs ${extraAttrsMemoized(count, username)}];\n"
        }
    }
}
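The generateNodeAttrsMemoized helper is not shown in this post. As a rough guess at its shape, based only on the node output in the generated file, it could be a closure that builds the label and URL attributes and is wrapped with Closure.memoize(). The body below is my own reconstruction, not the code from the gist.

```groovy
// Hypothetical reconstruction; the real helper is only available in the gist.
def generateNodeAttrs = { String username, int count ->
    "label=\"$username = $count\" URL=\"https://github.com/$username\""
}
// Closure.memoize() caches results per argument list, so repeated
// lookups for the same user do not rebuild the string.
def generateNodeAttrsMemoized = generateNodeAttrs.memoize()

def attrs = generateNodeAttrsMemoized('gyurisc', 31)
println "\t\"gyurisc\" [$attrs];"
```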

And this generates node and edge descriptions that look like this:

     ...
      "gyurisc" [label="gyurisc = 31" URL="https://github.com/gyurisc" ];
      "kellyrob99" [fillcolor=red style=filled label="kellyrob99 = 31"
                      URL="https://github.com/kellyrob99"];
     ...
      "JulianDevilleSmith" -> "cfxram";
      "rhyolight" -> "aalmiray";
      "kellyrob99" -> "aalmiray";
     ...
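The edge and rank parts of the file can be sketched the same way. This is an assumption about how the generator might do it rather than the gist’s actual code: each watcher gets an edge to the watched user, and users with the same watcher count are placed in a rank=same group. The graph name github is arbitrary.

```groovy
// Small sample taken from the edge output shown above.
def data = [aalmiray: ['rhyolight', 'kellyrob99'], cfxram: ['JulianDevilleSmith']]

StringWriter writer = new StringWriter()
writer << 'digraph github {\n'
// One edge per watcher, pointing at the watched user.
data.each { username, watchers ->
    watchers.each { watcher ->
        writer << "\t\"$watcher\" -> \"$username\";\n"
    }
}
// Users with the same watcher count share a rank.
data.groupBy { it.value.size() }.each { count, users ->
    def names = users.keySet().collect { "\"$it\"" }.join('; ')
    writer << "\t{ rank=same; $names; }\n"
}
writer << '}\n'
println writer
```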

If you created the JSON data already, you can run this command in the same directory in order to generate the GraphViz dot file: groovy https://raw.github.com/gist/2475460/78642d81dd9bc95f099e0f96c3d87389a1ef6967/githubWatcherDigraphGenerator.groovy {username} reposOutput.json. This will create a file named ‘reposDigraph.dot’ in that directory. From there the last step is to interpret the graph definition into an image.

Turning a ‘dot’ file into an image

I was looking for a quick and easy way to generate multiple visualizations from the same model for comparison, and settled on using GPars to generate them concurrently. We have to be a little careful here as some of the layout/format combinations can require a fair bit of memory and CPU: in the worst cases as much as 2GB of memory and processing times in the range of an hour. My recommendation is to stick with the sfdp and twopi layouts (see the online documentation here) for graphs of similar size to the one described here. If you’re after a huge, stunning graphic with complete detail, expect a png image to weigh in somewhere north of 150MB whereas the corresponding svg file will be less than 10MB. This Groovy script depends on having the GraphViz command line ‘dot’ executable already installed, exercises six of the available layout algorithms and generates png and svg files using four concurrent threads.

import groovyx.gpars.GParsPool
 
def inputfile = args[0]
def layouts = [ 'dot', 'neato', 'twopi', 'sfdp', 'osage', 'circo' ] //NOTE some of these will fail to process large graphs
def formats = [ 'png', 'svg']
def combinations = [layouts, formats].combinations()
 
GParsPool.withPool(4) {
    combinations.eachParallel { combination ->
        String layout = combination[0]
        String format = combination[1]
        List args = [ '/usr/local/bin/dot', "-K$layout", '-Goverlap=prism', '-Goverlap_scaling=-10', "-T$format",
                '-o', "${inputfile}.${layout}.$format", inputfile ]
        println args
        final Process execute = args.execute()
        execute.waitFor()
        println execute.exitValue()
    }
}
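If you only want the recommended layouts, it can help to preview the commands before running anything. The sketch below only builds the argument lists and does not execute them. buildDotCommand is a hypothetical helper, and it assumes dot is on the PATH instead of the hard-coded /usr/local/bin/dot.

```groovy
def layouts = ['twopi', 'sfdp']
def formats = ['png', 'svg']

// Hypothetical helper mirroring the argument list built in the script above.
def buildDotCommand = { String dotPath, String layout, String format, String inputfile ->
    [dotPath, "-K$layout", "-T$format", '-o', "${inputfile}.${layout}.$format", inputfile]*.toString()
}

// combinations() yields every [layout, format] pair.
def commands = [layouts, formats].combinations().collect { layout, format ->
    buildDotCommand('dot', layout, format, 'reposDigraph.dot')
}
commands.each { println it.join(' ') }
```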

Here’s a gallery with some examples of the images created and scaled down to be web friendly. The full size graphs I generated using this data weighed in as large as 300MB for a single PNG file. The SVG format takes up significantly less space but still more than 10MB. I also had trouble finding a viewer for the SVG format that a) could show the large graph in a navigable way and b) didn’t crash my browser due to memory usage.

And just for fun

Originally I had intended to publish this functionality as an application on the Google App Engine using Gaelyk, but since the API limit would make it suitable for pretty much one request per hour, and likely get me in trouble with GitHub, I ended up forgoing that bit. But along the way I developed a very simple page that will load all of the publicly available Gists for a particular user and display them in a table. This is a pretty clean example of how you can whip up a quick and dirty application and make it publicly available using GAE + Gaelyk. This involved setting up the infrastructure using the gradle-gaelyk-plugin combined with the gradle-gae-plugin, and using Gradle to build, test and deploy the app to the web; all told, about an hour’s worth of effort. Try this link to load up all of my publicly available Gists, and replace the username parameter if you’d like to check out somebody else. Please give it a second, as GAE will undeploy the application if it hasn’t been requested in a while, so the first call can take a few seconds.
http://publicgists.appspot.com/gist?username=kellyrob99

Here’s the Groovlet implementation that loads the data and then forwards to the template page.

def username =  request.getParameter('username') ?: 'kellyrob99'
def text = "https://gist.github.com/api/v1/json/gists/$username".toURL().text
log.info text
request.setAttribute('rawJSON', text)
request.setAttribute('username', username)
 
forward '/WEB-INF/pages/gist.gtpl'

And the accompanying template page which renders a simple tabular view of the API request.

<% include '/WEB-INF/includes/header.gtpl' %>
<% import groovy.json.JsonSlurper %>
<%
   def gistMap = new JsonSlurper().parseText(request['rawJSON'])
%>
<h1>Public Gists for username : ${request['username']} </h1>
 
<p>
    <table class = "gridtable">
        <th>Description</th>
        <th>Web page</th>
        <th>Repo</th>
        <th>Owner</th>
        <th>Files</th>
        <th>Created at</th>
        <%
        gistMap.gists.each { data ->
            def repo = data.repo
        %>
            <tr>
                <td>${data.description ?: ''}</td>
                <td>
                    <a href="https://gist.github.com/${repo}">${repo}</a>
                </td>
                <td>
                    <a href= "git://gist.github.com/${repo}.git">${repo}</a>
                </td>
                <td>${data.owner}</td>
                <td>${data.files}</td>
                <td>${data.created_at}</td>
            </tr>
        <% } %>
    </table>
</p>
<% include '/WEB-INF/includes/footer.gtpl' %>

Reference: GitHub Social Graphs with Groovy and GraphViz from our JCG partner Kelly Robinson at The Kaptain on … stuff blog.