Author Archives: Mark Needham

Clojure: First steps with reducers

I’ve been playing around with Clojure a bit today in preparation for a talk I’m giving next week and found myself writing the following code to apply the same function to three different scores: (defn log2 [n] (/ (Math/log n) (Math/log 2)))   (defn score-item [n] (if (= n 0) 0 (log2 n)))   (+ (score-item 12) (score-item 13) (score-item ...

Neo4j: Specific relationship vs Generic relationship + property

For optimal traversal speed in Neo4j queries we should make our relationship types as specific as possible. Let’s take a look at an example from the ‘modelling a recommendations engine’ talk I presented at Skillsmatter a couple of weeks ago. I needed to decide how to model the ‘RSVP’ relationship between a Member and an Event. A person can RSVP ...

Hadoop: HDFS – java.lang.NoSuchMethodError: org.apache.hadoop.fs.FSOutputSummer.<init>(Ljava/util/zip/Checksum;II)V

I wanted to write a little program to check that one machine could communicate with an HDFS server running on the other and adapted some code from the Hadoop wiki as follows: package org.playground;   import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FSDataInputStream; import org.apache.hadoop.fs.FSDataOutputStream; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path;   import java.io.IOException;   public class HadoopDFSFileReadWrite {   static void printAndExit(String str) { System.err.println( str ...

SparkR: Add new column to data frame by concatenating other columns

Continuing with my exploration of the Land Registry open data set using SparkR I wanted to see which road in the UK has had the most property sales over the last 20 years. To recap, this is what the data frame looks like: ./spark-1.5.0-bin-hadoop2.6/bin/sparkR --packages com.databricks:spark-csv_2.11:1.2.0   > sales <- read.df(sqlContext, "pp-complete.csv", "com.databricks.spark.csv", header="false")   > head(sales) C0 C1 C2 ...

Unix: Redirecting stderr to stdout

I’ve been trying to optimise some Neo4j import queries over the last couple of days and as part of the script I’ve been executing I wanted to redirect the output of a couple of commands into a file to parse afterwards. I started with the following script which doesn’t do any explicit redirection of the output: #!/bin/sh   ./neo4j-community-2.2.3/bin/neo4j start ...
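As a rough sketch of the redirection the title refers to, assuming a POSIX-style shell with `mktemp` available: `2>&1` points stderr at wherever stdout currently goes, so the order of the redirections matters. The `emit` function and the log file names are hypothetical stand-ins, not the commands from the original script.

```shell
#!/bin/sh
# Hypothetical stand-in for a command that writes to both streams.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# Scratch directory for the illustrative log files.
log_dir=$(mktemp -d)

# stdout is sent to the file first, then stderr is pointed at the same place,
# so both lines end up in the file.
emit > "$log_dir/both.log" 2>&1

# Reversed order: stderr is duplicated onto the current stdout (here, the
# command substitution) before stdout is sent to the file, so the file only
# receives the stdout line and the stderr line is captured in $leaked.
leaked=$(emit 2>&1 > "$log_dir/stdout_only.log")
```

Writing `> file 2>&1` is the form to reach for when both streams should be parsed from one file afterwards.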

Sed: Using environment variables

I’ve been playing around with the BBC football data set that I wrote about a couple of months ago and I wanted to write some code that would take the import script and replace all instances of remote URIs with a file system path. For example the import file contains several lines similar to this: LOAD CSV WITH HEADERS FROM ...
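One way this kind of substitution might look, as a minimal sketch rather than the script from the post: the URI, the local path, and the sample `LOAD CSV` line below are all hypothetical, since the excerpt does not show the real values. The key point is that the sed expression is wrapped in double quotes so the shell expands the variables before sed runs.

```shell
#!/bin/sh
# Hypothetical values: neither the URI nor the path comes from the original post.
export REMOTE_URI="https://example.com/data"
export LOCAL_PATH="file:///tmp/data"

# Hypothetical line in the style of an import script.
line='LOAD CSV WITH HEADERS FROM "https://example.com/data/matches.csv" AS row'

# Double quotes let the shell expand the variables inside the sed expression.
# '|' is used as the delimiter so the slashes in the values need no escaping.
rewritten=$(printf '%s\n' "$line" | sed "s|$REMOTE_URI|$LOCAL_PATH|g")

echo "$rewritten"
```

With single quotes around the expression, sed would receive the literal text `$REMOTE_URI` instead of its value. Values that contain `|`, `&`, or regular expression metacharacters would need escaping before being used this way.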

Record Linkage: Playing around with Duke

I’ve become quite interested in record linkage recently and came across the Duke project which provides some tools to help solve this problem. I thought I’d give it a try. The typical problem when doing record linkage is that we have two records from different data sets which represent the same entity but don’t have a common key that we ...

R: Bootstrap confidence intervals

I recently came across an interesting post on Julia Evans’ blog showing how to generate a bigger set of data points by sampling the small set of data points that we actually have using bootstrapping. Julia’s examples are all in Python so I thought it’d be a fun exercise to translate them into R. We’re doing the bootstrapping to simulate ...

R: Blog post frequency anomaly detection

I came across Twitter’s anomaly detection library last year but haven’t yet had a reason to take it for a test run, so having got my blog post frequency data into shape, I thought it’d be fun to run it through the algorithm. I wanted to see if it would detect any periods of time when the number of posts ...
