Clojure: Reading and writing a reasonably sized file

In a post a couple of days ago I described some code I’d written in R to find out all the features with zero variance in the Kaggle Digit Recognizer data set and yesterday I started working on some code to remove those features.

Jen and I had previously written some code to parse the training data in Clojure so I thought I’d try and adapt that to write out a new file without the unwanted pixels.
In the first version we’d encapsulated the reading of the file and parsing of it into a more useful data structure like so:

(defn get-pixels [pix] (map #( Integer/parseInt %) pix))

(defn create-tuple [[ head & rem]] {:pixels (get-pixels rem) :label head})

(defn tuples [rows] (map create-tuple rows))

(defn parse-row [row] (map #(clojure.string/split % #",") row))

(defn read-raw [path n] 
  (with-open [reader ( path)] (vec (take n (rest  (line-seq reader))))))

(def read-train-set-raw  (partial read-raw "data/train.csv"))

(def parsed-rows (tuples (parse-row (read-train-set-raw 42000))))

So the def parsed-rows gives an in memory representation of a row where we’ve separated the label and pixels into different key entries in a map. We wanted to remove any pixels which had a variance of 0 across the data set which in this case means that they always have a value of 0:

(def dead-to-us-pixels
  [0 1 2 3 4 5 6 7 8 9 10 11 16 17 18 19 20  21  22  23  24  25  26  27 28 29  30 31 52 53 54 55 56 57 82 83 84 85 111 112 139 140 141 168 196 392 420 421 448 476 532 560 644 645 671 672 673 699 700 701 727 728 729 730 731 754 755 756 757 758 759 760 780 781 782 783])

(defn in? 
  "true if seq contains elm"
  [seq elm]  
  (some #(= elm %) seq))

(defn dead-to-us? [pixel-with-index]
  (in? dead-to-us-pixels (first pixel-with-index)))

(defn remove-unwanted-pixels [row]
  (let [new-pixels
        (->> row :pixels (map-indexed vector) (remove dead-to-us?) (map second))]
    {:pixels new-pixels :label (:label row)}))

(defn -main []
  (with-open [wrt ( "/tmp/attempt-1.txt")]
    (doseq [line parsed-rows]
      (let [line-without-pixels (to-file-format (remove-unwanted-pixels line))]
        (.write wrt (str line-without-pixels "\n"))))))

We then ran the main method using ‘leon run’ which wrote out the new file. A print screen of the heap space usage while this function was running looks like this:

Encapsulated read tiff

While I was writing this version of the function I made a mistake somewhere and ended up passing the wrong data structure to one of the functions which resulted in all the intermediate steps that the data structure goes through getting stored in memory and caused an OutOfMemory exception.

A heap dump showed the following:

Gone wrong tiff

When I reduced the size of the erroneous collection by using a ‘take 10′ I got an exception indicating that the function couldn’t process the data structure which allowed me to sort it out.

I initially thought that the problem was to do with the loading of the file into memory at all but since the above seems to work I don’t think it is. When I was working along that theory Jen suggested it might make more sense to do the reading and writing of the files within a ‘with-open’ which tallies with a suggestion I came across in a StackOverflow post.

I ended up with the following code:

(defn split-on-comma [line]
  (string/split line #","))

(defn clean-train-file []
  (with-open [rdr ( "data/train.csv")
              wrt ( "/tmp/attempt-2.csv")]
    (doseq [line (drop 1 (line-seq rdr))]
      (let [line-with-removed-pixels
             ((comp to-file-format remove-unwanted-pixels create-tuple split-on-comma) line)]
        (.write wrt (str line-with-removed-pixels "\n"))))))

Which got called in the main method like this:

(defn -main [] (clean-train-file))

This version had the following heap usage:

All in with open tiff

Its peaks are slightly lower than the first one and it seems like it buffers a bunch of lines, writes them out to the file (and therefore out of memory) and repeats.

Reference: Clojure: Reading and writing a reasonably sized file  from our JCG partner Markh Needham at the Mark Needham Blog.

Related Whitepaper:

Java Essential Training

Author David Gassner explores Java SE (Standard Edition), the language used to build mobile apps for Android devices, enterprise server applications, and more!

The course demonstrates how to install both Java and the Eclipse IDE and dives into the particulars of programming. The course also explains the fundamentals of Java, from creating simple variables, assigning values, and declaring methods to working with strings, arrays, and subclasses; reading and writing to text files; and implementing object oriented programming concepts. Exercise files are included with the course.

Get it Now!  

Leave a Reply

eight − = 4

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.

Sign up for our Newsletter

15,153 insiders are already enjoying weekly updates and complimentary whitepapers! Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

As an extra bonus, by joining you will get our brand new e-books, published by Java Code Geeks and their JCG partners for your reading pleasure! Enter your info and stay on top of things,

  • Fresh trends
  • Cases and examples
  • Research and insights
  • Two complimentary e-books