Software Development

R: Scraping the release dates of github projects

Continuing on from my blog post about scraping Neo4j’s release dates I thought it’d be even more interesting to chart the release dates of some github projects.

In theory the release dates should be accessible through the github API but the few that I looked at weren’t returning any data so I scraped the data together.

We’ll be using rvest again and I first wrote the following function to extract the release versions and dates from a single page:

library(dplyr)
library(rvest)
 
process_page = function(releases, session) {
  rows = session %>% html_nodes("ul.release-timeline-tags li")
 
  for(row in rows) {
    date = row %>% html_node("span.date")
    version = row %>% html_node("div.tag-info a")
 
    if(!is.null(version) && !is.null(date)) {
      date = date %>% html_text() %>% str_trim()
      version = version %>% html_text() %>% str_trim()
      releases = rbind(releases, data.frame(date = date, version = version))
    }  
  }
  return(releases)
}

Let’s try it out on the Cassandra release page and see what it comes back with:

> r = process_page(data.frame(), html_session("https://github.com/apache/cassandra/releases"))
> r
           date               version
1  Jun 22, 2015       cassandra-2.1.7
2  Jun 22, 2015      cassandra-2.0.16
3   Jun 8, 2015       cassandra-2.1.6
4   Jun 8, 2015   cassandra-2.2.0-rc1
5  May 19, 2015 cassandra-2.2.0-beta1
6  May 18, 2015      cassandra-2.0.15
7  Apr 29, 2015       cassandra-2.1.5
8   Apr 1, 2015      cassandra-2.0.14
9   Apr 1, 2015       cassandra-2.1.4
10 Mar 16, 2015      cassandra-2.0.13

That works pretty well but it’s only one page! To get all the pages we can use the follow_link function to follow the ‘Next’ link until there aren’t anymore pages to process.

We end up with the following function to do this:

find_all_releases = function(starting_page) {
  s = html_session(starting_page)
  releases = data.frame()
 
  next_page = TRUE
  while(next_page) {
    possibleError = tryCatch({  
      releases = process_page(releases, s)
      s = s %>% follow_link("Next") 
    }, error = function(e) { e })
 
    if(inherits(possibleError, "error")){
      next_page = FALSE
    }
  }
  return(releases)
}

Let’s try it out starting from the Cassandra page:

> cassandra = find_all_releases("https://github.com/apache/cassandra/releases")
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.8
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.13
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-2.0.0-rc1
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.2.0-beta2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.10
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.6
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-1.0.0-rc2
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.7
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.7.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.6.4
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.5.0-rc3
Navigating to https://github.com/apache/cassandra/releases?after=cassandra-0.4.0-final
 
> cassandra %>% sample_n(10)
            date               version
151 Mar 13, 2010   cassandra-0.5.0-rc2
25   Jul 3, 2014      cassandra-1.2.18
51  Jul 27, 2013       cassandra-1.2.8
21  Aug 19, 2014   cassandra-2.1.0-rc6
73  Sep 24, 2012 cassandra-1.2.0-beta1
158 Mar 13, 2010   cassandra-0.4.0-rc2
113 May 20, 2011     cassandra-0.7.6-2
15  Oct 24, 2014       cassandra-2.1.1
103 Sep 15, 2011 cassandra-1.0.0-beta1
93  Nov 29, 2011       cassandra-1.0.4

I want to plot when the different releases happened in time and in order to do that we need to create an extra column containing the ‘release series’ which we can do with the following transformation:

series = function(version) {
  parts = strsplit(as.character(version), "\\.")  
  return(unlist(lapply(parts, function(p) paste(p %>% unlist %>% head(2), collapse = "."))))  
}
 
bySeries = cassandra %>%
  mutate(date2 = mdy(date), series = series(version),
         short_version = gsub("cassandra-", "", version),
         short_series = series(short_version))
 
> bySeries %>% sample_n(10)
            date               version      date2        series short_version short_series
3    Jun 8, 2015       cassandra-2.1.6 2015-06-08 cassandra-2.1         2.1.6          2.1
161 Mar 13, 2010 cassandra-0.4.0-beta1 2010-03-13 cassandra-0.4   0.4.0-beta1          0.4
62  Feb 15, 2013      cassandra-1.1.10 2013-02-15 cassandra-1.1        1.1.10          1.1
153 Mar 13, 2010 cassandra-0.5.0-beta2 2010-03-13 cassandra-0.5   0.5.0-beta2          0.5
37   Feb 7, 2014       cassandra-2.0.5 2014-02-07 cassandra-2.0         2.0.5          2.0
36   Feb 7, 2014      cassandra-1.2.15 2014-02-07 cassandra-1.2        1.2.15          1.2
29   Jun 2, 2014   cassandra-2.1.0-rc1 2014-06-02 cassandra-2.1     2.1.0-rc1          2.1
21  Aug 19, 2014   cassandra-2.1.0-rc6 2014-08-19 cassandra-2.1     2.1.0-rc6          2.1
123 Feb 16, 2011       cassandra-0.7.2 2011-02-16 cassandra-0.7         0.7.2          0.7
135  Nov 1, 2010 cassandra-0.7.0-beta3 2010-11-01 cassandra-0.7   0.7.0-beta3          0.7

Now let’s plot those releases and see what we get:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(!grepl("beta|rc", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015-06-23_22-59-19

An interesting thing we can see from this visualisation is what overlap the various series of versions have. Most of the time there are only two series of versions overlapping but the 1.2, 2.0 and 2.1 series all overlap which is unusual.

In this chart we excluded all beta and RC versions. Let’s bring those back in and just show the last 3 versions:

ggplot(aes(x = date2, y = short_series), 
       data = bySeries %>% filter(grepl("2\\.[012]\\.|1\\.2\\.", short_version))) +     
  geom_text(aes(label=short_version),hjust=0.5, vjust=0.5, size = 4, angle = 90) + 
  theme_bw()

2015-06-23_23-08-04

From this chart it’s clearer that the 2.0 and 2.1 series have recent releases so there will probably be three overlapping versions when the 2.2 series is released as well.

The chart is still a bit cluttered although less than before. I’m not sure of a better way of visualising this type of data so if you have any ideas do let me know!

Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button