
Author Archives: Mark Needham

Spark: Write to CSV file


A couple of weeks ago I wrote about how I’d been using Spark to explore a City of Chicago Crime data set. Having worked out how many of each crime had been committed, I wanted to write that to a CSV file. Spark provides a saveAsTextFile function which allows us to save RDDs, so I refactored my code into the ...

Read More »

Spark: Write to CSV file with header using saveAsFile


In my last blog post I showed how to write to a single CSV file using Spark and Hadoop, and the next thing I wanted to do was add a header row to the resulting file. Hadoop’s FileUtil#copyMerge function does take a String parameter, but it adds this text to the end of each partition file, which isn’t quite what ...

Read More »

Spark: Parse CSV file and group by column value


I’ve found myself working with large CSV files quite frequently and, realising that my existing toolset didn’t let me explore them quickly, I thought I’d spend a bit of time looking at Spark to see if it could help. I’m working with a crime data set released by the City of Chicago: it’s 1GB in size and contains details of ...

Read More »
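
As a rough illustration of the "group by column value" idea in the teaser above, here is a minimal sketch in plain Python rather than Spark. It is not the code from the post: the function name `count_by_column`, the sample rows, and the `Primary Type` column name are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_text, column):
    """Count how many rows share each value of `column` (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Made-up sample rows; the real data set is far larger and has more columns.
sample = """ID,Primary Type
1,THEFT
2,BATTERY
3,THEFT
"""

counts = count_by_column(sample, "Primary Type")
for crime_type, n in counts.most_common():
    print(f"{crime_type},{n}")
```

For a file of around 1GB this single-process approach would still work, since it streams rows and only keeps the counts in memory, but it does not parallelise the way a Spark job would.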

Neo4j: Cypher – Avoiding the Eager


Although I love how easy Cypher’s LOAD CSV command makes it to get data into Neo4j, it currently breaks the rule of least surprise in the way it eagerly loads in all rows for some queries, even those using periodic commit. This is something that my colleague Michael noted in the second of his blog posts explaining how to ...

Read More »

Conceptual Model vs Graph Model


We’ve started running some sessions on graph modelling in London, and during the first session it was pointed out that the process I’d described was very similar to the one used when modelling for a relational database. I thought I’d better do some reading on the way relational models are derived, and I came across an excellent video by Joe Maguire titled ...

Read More »

R: A first attempt at linear regression


I’ve been working through the videos that accompany the Introduction to Statistical Learning with Applications in R book and thought it’d be interesting to try out the linear regression algorithm against my meetup data set. I wanted to see how well a linear regression algorithm could predict how many people were likely to RSVP to a particular event. I started ...

Read More »

Neo4j: Generic/Vague relationship names


An approach to modelling that I often see while working with Neo4j users is creating very generic relationships (e.g. HAS, CONTAINS, IS) and filtering on a relationship property or on a property/label at the end node. Intuitively this doesn’t seem to make best use of the graph model as it means that you have to evaluate many relationships and nodes ...

Read More »

Neo4j: COLLECTing multiple values


One of my favourite functions in Neo4j’s Cypher query language is COLLECT, which allows us to group items into an array for later consumption. However, I’ve noticed that people sometimes have trouble working out how to collect multiple items with COLLECT and struggle to find a way to do so. Consider the following data set: create ...

Read More »

R: Calculating rolling or moving averages


I’ve been playing around with some time series data in R and, since there’s a bit of variation between consecutive points, I wanted to smooth the data out by calculating the moving average. I struggled to find a built-in function to do this but came across Didier Ruedin’s blog post, which described the following function to do the job: ...

Read More »
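
To make the smoothing idea in the teaser above concrete, here is a minimal sketch of a trailing simple moving average. It is written in plain Python rather than R and is not the function from the referenced blog post; the name `moving_average` and the choice to emit values only once a full window is available are assumptions made for the example.

```python
from collections import deque


def moving_average(values, window):
    """Trailing simple moving average (hypothetical helper).

    Returns one average per position where a full window of
    `window` points is available, so the result is shorter than the input.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)  # holds the most recent `window` points
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # the oldest point is about to be dropped
        buf.append(v)
        total += v
        if len(buf) == window:
            out.append(total / window)
    return out


smoothed = moving_average([1, 2, 3, 4, 5], 3)
print(smoothed)
```

A larger window smooths more aggressively but also shortens the output and lags further behind the most recent points.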