Clojure: Reading and writing a reasonably sized file

In a post a couple of days ago I described some code I’d written in R to find all the features with zero variance in the Kaggle Digit Recognizer data set, and yesterday I started working on some code to remove those features.

Jen and I had previously written some code to parse the training data in Clojure so I thought I’d try and adapt that to write out a new file without the unwanted pixels.
In the first version we’d encapsulated the reading of the file and parsing of it into a more useful data structure like so:

(defn get-pixels [pix] (map #(Integer/parseInt %) pix))

(defn create-tuple [[head & rem]] {:pixels (get-pixels rem) :label head})

(defn tuples [rows] (map create-tuple rows))

(defn parse-row [row] (map #(clojure.string/split % #",") row))

(defn read-raw [path n]
  (with-open [reader (clojure.java.io/reader path)]
    (vec (take n (rest (line-seq reader))))))

(def read-train-set-raw  (partial read-raw "data/train.csv"))

(def parsed-rows (tuples (parse-row (read-train-set-raw 42000))))

So the def parsed-rows gives an in memory representation of a row where we’ve separated the label and pixels into different key entries in a map. We wanted to remove any pixels which had a variance of 0 across the data set which in this case means that they always have a value of 0:
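To make that structure concrete, here is the parsing step run on a made-up three-pixel line (the real file has a label followed by 784 pixel columns; the short line is just for readability):

```clojure
;; Self-contained sketch of the parsing step on a toy line.
(defn get-pixels [pix] (map #(Integer/parseInt %) pix))

(defn create-tuple [[head & rem]] {:pixels (get-pixels rem) :label head})

(def sample-line ["1" "0" "128" "255"])   ; label "1", three pixel values

(create-tuple sample-line)
;; => {:pixels (0 128 255), :label "1"}
```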

(def dead-to-us-pixels
  [0 1 2 3 4 5 6 7 8 9 10 11 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 52 53 54 55 56 57 82 83 84 85 111 112 139 140 141 168 196 392 420 421 448 476 532 560 644 645 671 672 673 699 700 701 727 728 729 730 731 754 755 756 757 758 759 760 780 781 782 783])

(defn in? 
  "true if seq contains elm"
  [seq elm]  
  (some #(= elm %) seq))

(defn dead-to-us? [pixel-with-index]
  (in? dead-to-us-pixels (first pixel-with-index)))
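The linear `some` scan works, but since the dead-pixel list never changes, the more idiomatic option would be to hold it in a set and use `contains?`, which is a near-constant-time lookup. A sketch with a toy subset of the list (the names mirror the post's, the set idea is my suggestion):

```clojure
;; Sets give cheap membership checks; the vector-based in? scans linearly.
(def dead-pixel-set (set [0 1 2 3 4 5]))   ; toy subset of the full list

(defn dead-to-us? [[idx _]]
  (contains? dead-pixel-set idx))

(dead-to-us? [3 255])   ; => true  (index 3 is in the dead set)
(dead-to-us? [10 0])    ; => false
```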

(defn remove-unwanted-pixels [row]
  (let [new-pixels
        (->> row :pixels (map-indexed vector) (remove dead-to-us?) (map second))]
    {:pixels new-pixels :label (:label row)}))
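To see the whole filtering step in one place, here is a self-contained run of the post's functions with a toy dead-pixel list of two entries (the real one has 76):

```clojure
(def dead-to-us-pixels [0 2])   ; toy list standing in for the real 76 indexes

(defn in? [seq elm] (some #(= elm %) seq))

(defn dead-to-us? [pixel-with-index]
  (in? dead-to-us-pixels (first pixel-with-index)))

(defn remove-unwanted-pixels [row]
  (let [new-pixels
        (->> row :pixels (map-indexed vector) (remove dead-to-us?) (map second))]
    {:pixels new-pixels :label (:label row)}))

;; Pixels at indexes 0 and 2 are dropped; the rest survive in order.
(remove-unwanted-pixels {:label "7" :pixels [10 20 30 40]})
;; => {:pixels (20 40), :label "7"}
```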

(defn -main []
  (with-open [wrt (clojure.java.io/writer "/tmp/attempt-1.txt")]
    (doseq [line parsed-rows]
      (let [line-without-pixels (to-file-format (remove-unwanted-pixels line))]
        (.write wrt (str line-without-pixels "\n"))))))
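The `to-file-format` helper isn't shown here; a plausible definition (an assumption on my part, including the name of the joining logic) that turns the map back into a comma-separated line would be:

```clojure
;; Hypothetical reconstruction of to-file-format: join the label and the
;; surviving pixels back into one CSV line.
(require '[clojure.string :as string])

(defn to-file-format [row]
  (str (:label row) "," (string/join "," (:pixels row))))

(to-file-format {:label "7" :pixels [20 40]})
;; => "7,20,40"
```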

We then ran the main method using ‘lein run’, which wrote out the new file. A screenshot of the heap space usage while this function was running looks like this:

[Image: heap usage while running the encapsulated-read version]

While writing this version of the function I made a mistake somewhere and passed the wrong data structure to one of the functions, which meant all the intermediate steps the data structure goes through were held in memory and caused an OutOfMemory exception.

A heap dump showed the following:

[Image: heap dump from the failed run]

When I reduced the size of the erroneous collection by using a ‘take 10’ I got an exception indicating that the function couldn’t process the data structure, which allowed me to track down the mistake and sort it out.

I initially thought the problem was caused by loading the file into memory at all, but since the version above seems to work I don’t think it is. While I was working along that theory, Jen suggested it might make more sense to do the reading and writing of the files within a ‘with-open’, which tallies with a suggestion I came across in a StackOverflow post.

I ended up with the following code:

(defn split-on-comma [line]
  (string/split line #","))

(defn clean-train-file []
  (with-open [rdr (clojure.java.io/reader "data/train.csv")
              wrt (clojure.java.io/writer "/tmp/attempt-2.csv")]
    (doseq [line (drop 1 (line-seq rdr))]
      (let [line-with-removed-pixels
             ((comp to-file-format remove-unwanted-pixels create-tuple split-on-comma) line)]
        (.write wrt (str line-with-removed-pixels "\n"))))))

Which got called in the main method like this:

(defn -main [] (clean-train-file))

This version had the following heap usage:

[Image: heap usage for the all-in-with-open version]

Its peaks are slightly lower than the first version’s, and it seems to buffer a batch of lines, write them out to the file (and therefore out of memory), and repeat.

Reference: Clojure: Reading and writing a reasonably sized file from our JCG partner Mark Needham at the Mark Needham Blog.
