Home » Author Archives: Mark Needham (page 2)

Author Archives: Mark Needham

Neo4j: Detecting rogue spaces in CSV headers with LOAD CSV

Last week I was helping someone load the data from a CSV file into Neo4j and we were having trouble filtering out rows which contained a null value in one of the columns. This is what the data looked like: load csv with headers from "file:///foo.csv" as row RETURN row ╒══════════════════════════════════╕ │row │ ╞══════════════════════════════════╡ │{key1: a, key2: (null), key3: c}│ ...

Read More »

Hadoop: DataNode not starting

In my continued playing with Mahout I eventually decided to give up using my local file system and use a local Hadoop instead since that seems to have much less friction when following any examples. Unfortunately all my attempts to upload any files from my local file system to HDFS were being met with the following exception: java.io.IOException: File /user/markneedham/book2.txt ...

Read More »

Neo4j: Cypher – Detecting duplicates using relationships

I’ve been building a graph of computer science papers on and off for a couple of months and now that I’ve got a few thousand loaded in I realised that there are quite a few duplicates. They’re not duplicates in the sense that there are multiple entries with the same identifier but rather have different identifiers but seem to be ...

Read More »

Neo4j vs Relational: Refactoring – Extracting node/table

In my previous blog post I showed how to add a new property/field to a node with a label/record in a table for a football transfers dataset that I’ve been playing with. After introducing this ‘nationality’ property I realised that I now had some duplication in the model:               players.nationality and clubs.country are referring ...

Read More »

Neo4j: A procedure for the SLM clustering algorithm

In the middle of last year I blogged about the Smart Local Moving algorithm which is used for community detection in networks and with the upcoming introduction of procedures in Neo4j I thought it’d be fun to make that code accessible as one. If you want to grab the code and follow along it’s sitting on the SLM repository on ...

Read More »

Clojure: First steps with reducers

I’ve been playing around with Clojure a bit today in preparation for a talk I’m giving next week and found myself writing the following code to apply the same function to three different scores: (defn log2 [n] (/ (Math/log n) (Math/log 2)))   (defn score-item [n] (if (= n 0) 0 (log2 n)))   (+ (score-item 12) (score-item 13) (score-item ...

Read More »

Neo4j: Specific relationship vs Generic relationship + property

For optimal traversal speed in Neo4j queries we should make our relationship types as specific as possible. Let’s take a look at an example from the ‘modelling a recommendations engine‘ talk I presented at Skillsmatter a couple of weeks ago. I needed to decided how to model the ‘RSVP’ relationship between a Member and an Event. A person can RSVP ...

Read More »

Hadoop: HDFS – java.lang.NoSuchMethodError: org.apache.hadoop.fs.FSOutputSummer.(Ljava/util/zip/Checksum;II)V

I wanted to write a little program to check that one machine could communicate a HDFS server running on the other and adapted some code from the Hadoop wiki as follows: package org.playground;   import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FSDataInputStream; import org.apache.hadoop.fs.FSDataOutputStream; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path;   import java.io.IOException;   public class HadoopDFSFileReadWrite {   static void printAndExit(String str) { System.err.println( str ...

Read More »

Want to take your Java skills to the next level?

Grab our programming books for FREE!

Here are some of the eBooks you will get:

  • Spring Interview QnA
  • Multithreading & Concurrency QnA
  • JPA Minibook
  • JVM Troubleshooting Guide
  • Advanced Java
  • Java Interview QnA
  • Java Design Patterns