
Author Archives: Mark Needham

R: Vectorising all the things

After my last post about finding the distance a date/time is from the weekend, Hadley Wickham suggested I could improve the function by vectorising it. His suggestion (December 14, 2014): "@markhneedham vectorise with pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)". So I thought I'd try and vectorise ...
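The following is a rough sketch of what that suggestion looks like when wrapped in a function. It's my own reconstruction rather than the code from the post: the before and after arguments are assumed to be POSIXct vectors holding the end of the previous weekend and the start of the next one, and I've made the hour units explicit with difftime.

    library(lubridate)

    # Sketch based on the tweet, not the post's exact code.
    # 'before' = end of the previous weekend, 'after' = start of the next weekend,
    # both assumed to be POSIXct vectors the same length as dateToLookup.
    hours_from_weekend = function(dateToLookup, before, after) {
      to_previous = pmax(as.numeric(difftime(dateToLookup, before, units = "hours")), 0)
      to_next     = pmax(as.numeric(difftime(after, dateToLookup, units = "hours")), 0)
      pmin(to_previous, to_next)
    }

    dates  = ymd_hms(c("2014-12-10 09:00:00", "2014-12-13 12:00:00"))
    before = ymd_hms(rep("2014-12-07 23:59:59", 2))
    after  = ymd_hms(rep("2014-12-13 00:00:00", 2))
    hours_from_weekend(dates, before, after)   # roughly 57 hours, and 0 for the Saturday

Because pmin/pmax operate on whole vectors, the function handles every date in one pass instead of looping row by row.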

Read More »

R: Time to/from the weekend

In my last post I showed some examples using R's lubridate package, and another problem it made really easy to solve was working out how close a particular date/time was to the weekend. I wanted to write a function which would return the previous Sunday or upcoming Saturday, depending on which was closer. lubridate's floor_date and ceiling_date functions make ...
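As a minimal illustration of how those two functions fit together (my own sketch, not the post's code, and assuming lubridate's default of weeks starting on Sunday):

    library(lubridate)

    # Previous Sunday vs upcoming Saturday: return whichever is closer.
    closest_weekend_day = function(dateToLookup) {
      previous_sunday   = floor_date(dateToLookup, "week")              # start of the week (Sunday)
      upcoming_saturday = ceiling_date(dateToLookup, "week") - days(1)  # next Sunday minus a day
      to_previous = as.numeric(difftime(dateToLookup, previous_sunday, units = "hours"))
      to_next     = as.numeric(difftime(upcoming_saturday, dateToLookup, units = "hours"))
      if (to_previous < to_next) previous_sunday else upcoming_saturday
    }

    closest_weekend_day(ymd_hms("2014-12-10 18:30:00"))   # a Wednesday evening: closer to the upcoming Saturday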

Read More »

R: Cleaning up and plotting Google Trends data

I recently came across an excellent article written by Stian Haklev in which he describes things he wishes he'd been told before starting out with R, one of them being to do all data clean-up in code, which I thought I'd give a try. My goal is to leave the raw data completely ...
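A minimal sketch of that principle, reading the raw export and doing all of the tidying in code so the original file is never edited. The file path and the assumption that data rows look like "2014-12-07 - 2014-12-13,45" are mine for illustration; the real Google Trends export layout may differ.

    library(dplyr)

    raw_lines = readLines("/tmp/trends.csv")   # hypothetical path to the untouched export

    search_interest = data.frame(raw = raw_lines, stringsAsFactors = FALSE) %>%
      filter(grepl("^\\d{4}-\\d{2}-\\d{2} - \\d{4}-\\d{2}-\\d{2},", raw)) %>%   # keep only data rows
      mutate(week     = as.Date(substr(raw, 1, 10)),
             interest = as.numeric(sub(".*,", "", raw))) %>%
      select(week, interest)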

Read More »

R: Applying a function to every row of a data frame

In my continued exploration of London's meetups I wanted to calculate the distance from meetup venues to a centre point in London. I've created a gist containing the coordinates of some of the venues that host NoSQL meetups in London town if you want to follow along:

    library(dplyr)

    # https://gist.github.com/mneedham/7e926a213bf76febf5ed
    venues = read.csv("/tmp/venues.csv")

...
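Something along these lines would do it, although this is a sketch rather than the post's code: it assumes the CSV has lat and lon columns, uses a centre coordinate I've picked for illustration, and applies geosphere's distHaversine to each row via mapply.

    library(dplyr)
    library(geosphere)

    venues = read.csv("/tmp/venues.csv")

    # Hypothetical centre point (roughly central London); geosphere expects (lon, lat).
    centre = c(-0.128, 51.507)

    # Compute the distance in metres from each venue to the centre point, row by row.
    venues = venues %>%
      mutate(distanceFromCentre = mapply(function(lon, lat) distHaversine(c(lon, lat), centre),
                                         lon, lat))

    head(venues %>% arrange(distanceFromCentre))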

Read More »

Spark: Write to CSV file

A couple of weeks ago I wrote about how I'd been using Spark to explore a City of Chicago Crime data set and, having worked out how many of each crime had been committed, I wanted to write that to a CSV file. Spark provides a saveAsTextFile function which allows us to save RDDs, so I refactored my code into the ...

Read More »

Spark: Write to CSV file with header using saveAsFile

In my last blog post I showed how to write to a single CSV file using Spark and Hadoop, and the next thing I wanted to do was add a header row to the resulting file. Hadoop's FileUtil#copyMerge function does take a String parameter, but it adds this text to the end of each partition file, which isn't quite what ...

Read More »

Spark: Parse CSV file and group by column value

I've found myself working with large CSV files quite frequently and, realising that my existing toolset didn't let me explore them quickly, I thought I'd spend a bit of time looking at Spark to see if it could help. I'm working with a crime data set released by the City of Chicago: it's 1GB in size and contains details of ...

Read More »

Neo4j: Cypher – Avoiding the Eager

Although I love how easy Cypher's LOAD CSV command makes it to get data into Neo4j, it currently breaks the rule of least surprise in the way it eagerly loads in all rows for some queries, even those using periodic commit. This is something that my colleague Michael noted in the second of his blog posts explaining how to ...

Read More »

Conceptual Model vs Graph Model

We've started running some sessions on graph modelling in London, and during the first session it was pointed out that the process I'd described was very similar to the one followed when modelling for a relational database. I thought I'd better do some reading on the way relational models are derived, and I came across an excellent video by Joe Maguire titled ...

Read More »