Author Archives: Mark Needham

SparkR: Add new column to data frame by concatenating other columns

Continuing with my exploration of the Land Registry open data set using SparkR I wanted to see which road in the UK has had the most property sales over the last 20 years. To recap, this is what the data frame looks like:

    ./spark-1.5.0-bin-hadoop2.6/bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0

    > sales <- read.df(sqlContext, "pp-complete.csv", "com.databricks.spark.csv", header="false")
    > head(sales)
    C0 C1 C2 ...

Unix: Redirecting stderr to stdout

I’ve been trying to optimise some Neo4j import queries over the last couple of days and, as part of the script I’ve been executing, I wanted to redirect the output of a couple of commands into a file to parse afterwards. I started with the following script, which doesn’t do any explicit redirection of the output:

    #!/bin/sh

    ./neo4j-community-2.2.3/bin/neo4j start ...

Sed: Using environment variables

I’ve been playing around with the BBC football data set that I wrote about a couple of months ago and I wanted to write some code that would take the import script and replace all instances of remote URIs with a file system path. For example the import file contains several lines similar to this: LOAD CSV WITH HEADERS FROM ...

Record Linkage: Playing around with Duke

I’ve become quite interested in record linkage recently and came across the Duke project, which provides some tools to help solve this problem. I thought I’d give it a try. The typical problem when doing record linkage is that we have two records from different data sets which represent the same entity but don’t have a common key that we ...

R: Bootstrap confidence intervals

I recently came across an interesting post on Julia Evans’ blog showing how to generate a bigger set of data points by sampling the small set of data points that we actually have using bootstrapping. Julia’s examples are all in Python so I thought it’d be a fun exercise to translate them into R. We’re doing the bootstrapping to simulate ...

R: Blog post frequency anomaly detection

I came across Twitter’s anomaly detection library last year but hadn’t yet had a reason to take it for a test run, so having got my blog post frequency data into shape I thought it’d be fun to run it through the algorithm. I wanted to see if it would detect any periods of time when the number of posts ...

Neo4j: The football transfers graph

Given we’re still in pre-season transfer madness as far as European football is concerned, I thought it’d be interesting to put together a football transfers graph to see whether there are any interesting insights to be had. It took me a while to find an appropriate source but I eventually came across transfermarkt.co.uk which contains transfers going back at ...

R: Wimbledon – How do the seeds get on?

Continuing with the Wimbledon data set I’ve been playing with, I wanted to explore how the seeded players have fared over the years. Taking the last 10 years’ worth of data, there have always been 32 seeds, and with the following function we can feed in a seeding and get back the round they would be ...

R: Speeding up the Wimbledon scraping job

Over the past few days I’ve written a few blog posts about a Wimbledon data set I’ve been building, and after running the scripts a few times I noticed that they were taking much longer to run than I expected. To recap, I started out with the following function which takes in a URI and returns a data frame containing ...
