Scala

Scala Tutorial – scala.io.Source, accessing files, flatMap, mutable Maps

Preface

This is part 8 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for. Additionally you can find this and other tutorial series on the JCG Java Tutorials page.

This tutorial is about accessing the file system in order to work with text files. The previous tutorial showed how to build a Map that contains the counts of each word type in a given text. However, it was assumed that the text was available in a String variable, and typically we are interested in knowing things about files that live on the file system, or on the internet. This tutorial shows how to read a file’s contents into Scala for processing, both by building a single String for the file or by consuming it line-by-line in a streaming fashion. Along the way, immutable Maps are introduced as a way to enable word counting without reading an entire file into memory.

Word count on the contents of a file

As an example, we’ll use the complete Sherlock Holmes from project Gutenberg. Download it, put it into a directory, and then start up the Scala REPL in that directory. To access files, we’ll use the Source class, so to start you need to import it.

scala> import scala.io.Source
import scala.io.Source

Source provides a number of ways to interact with files and make them accessible to you in your Scala program. The fromFile method is the one you’ll probably need most.

scala> Source.fromFile("pg1661.txt")
res3: scala.io.BufferedSource = non-empty iterator

This creates a BufferedSource, from which you can easily get all of file’s contents as a String.

scala> val holmes = Source.fromFile("pg1661.txt").mkString
holmes: String =
"Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net
<...many more lines...>

With this, you can do the same things as shown it tutorial 7 to get the word counts (except that here we’ll split on white space sequences rather than just a single space).

scala> val counts = holmes.split("\\s+").groupBy(x=>x).mapValues(x=>x.length)
counts: scala.collection.immutable.Map[java.lang.String,Int] = Map(wood-work, -> 1, "Pray, -> 1, herself. -> 2, stern-post -> 1, "Should -> 1, incident -> 8, serious -> 14, earth--" -> 2, sinister -> 10, comply -> 7, breaks -> 1, forgotten -> 3, precious -> 10, 'It -> 3, compliment -> 2, suite, -> 1, "DEAR -> 1, summarise. -> 1, "Done -> 1, fine.' -> 1, lover -> 5, of. -> 2, lead. -> 1, plentiful -> 1, 'Lone -> 4, malignant -> 1, terrible -> 14, rate -> 1, mole -> 1, assert -> 1, lights -> 2, Stevenson, -> 1, submitted -> 4, tap. -> 1, beard, -> 1, band--a -> 1, force! -> 1, snow -> 7, Produced -> 2, ask, -> 1, purchasing -> 1, Hall, -> 1, wall. -> 5, remarked -> 32, laughing -> 4, member." -> 1, 30,000 -> 2, Redistributing -> 1, coat, -> 6, "'One -> 2, 'band,' -> 1, relapsed -> 1, apol...

scala> counts("Holmes")
res2: Int = 197

scala> counts("Watson")
res3: Int = 4

Lest you think it strange that Watson only shows up four times, keep in mind that we split on whitespace, and that means that in a sentence like the following, the token of interest is Watson,” rather than Watson.

“You could not possibly have come at a better time, my dear Watson,” he said cordially.

Looking that and others up shows more tokens containing Watson in the story.

scala> counts("Watson,\"")
res4: Int = 19

scala> counts("Watson,")
res5: Int = 40

scala> counts("Watson.")
res6: Int = 10

Of course, the real problem is that tokenizing on whitespace is too crude. To do this properly generally takes a good hand-built tokenizer (which is able to keep tokens like e.g. and Mr. and Yahoo! while splitting punctuation off most words) or a machine learned one that is trained on data hand-labeled for tokens. For an example of the latter, see the Apache OpenNLP toolkit tokenizers, which includes pre-trained models for English.

Working line by line

Quite often, you need to work through a file line by line, rather than reading the entire thing in as a single string as we did above. For example, you might need to process each line differently, so just having it as a single String isn’t particular convenient. Or, you might be working with a large file that cannot easily fit into memory (which is what happens when you read in the entire string). You can obtain the lines in the file as an Iterator[String], in which each item is a single line from the file, using the getLines method.

scala> Source.fromFile("pg1661.txt").getLines
res4: Iterator[String] = non-empty iterator

This iterator is ready for you to consume lines, but it doesn’t read all of the file into memory right away — instead it buffers it such that each line will be available for you as you ask for it, essentially reading off disk as you demand more lines. You can think of this as streaming the file to your Scala program, much like modern audio and video content is streamed to your computer: it is never actually stored, but is just transferred in parts to where it is needed, when it is needed.

Of course, Iterators share much with sequence data structures like Lists: once we have an Iterator, we can use foreach, for, map, etc. on it. So to print out all of the lines in the file, we can do the following.

scala> Source.fromFile("pg1661.txt").getLines.foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle
<...many more lines...>

That creates a lot of output, but it shows you how you can easily create your own Scala implementation of the Unix cat program: just save the following line in a file called cat.scala:

scala.io.Source.fromFile(args(0)).getLines.foreach(println)

And then call that with the name of the file to list its contents.

$ scala cat.scala pg1661.txt

Back in the REPL, it is somewhat less-than-ideal to see the entire file. If you just want to see the start of the file, use the take method on the Iterator before the foreach.

scala> Source.fromFile("pg1661.txt").getLines.take(5).foreach(println)
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included

The take method is quite useful in general with any sequence, and provides the complement of the drop method, as shown in the following examples on a simple List[Int].

scala> val numbers = 1 to 10 toList
numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3)
res12: List[Int] = List(1, 2, 3)

scala> numbers.drop(3)
res13: List[Int] = List(4, 5, 6, 7, 8, 9, 10)

scala> numbers.take(3) ::: numbers.drop(3)
res14: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Word counting line by line, first try

Now that we’ve seen how to read a file and start working with it line-by-line, how do we count the number of occurrences of each word? Recall from tutorial 7 and above that the starting point was to have a sequence (Array, List, etc) of Strings in which each element is a word token. To start moving toward that, we can simply use the toList method on the Iterator[String] obtained from getLines.

scala> val holmes = Source.fromFile("pg1661.txt").getLines.toList
holmes: List[String] = List(The Project Gutenberg EBook of The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle, (#15 in our series by Sir Arthur Conan Doyle), "", Copyright laws are changing all over the world. Be sure to check the, copyright laws for your country before downloading or redistributing, this or any other Project Gutenberg eBook., "", This header should be the first thing seen when viewing this Project, Gutenberg file.  Please do not remove it.  Do not change or edit the, header without written permission., "", Please read the "legal small print," and other information about the, eBook and Project Gutenberg at the bottom of this file.  Included is, important information about your specific rights and restrictions in, how the file may be used.  You can also find ou...

We now have the contents of the file as a List[String], and may proceed to do useful things with it. For example, we could map each line (Strings) to be sequences of whitespace-separated Strings.

scala> val listOfListOfWords = Source.fromFile("pg1661.txt").getLines.toList.map(x => x.split(" ").toList)
listOfListOfWords: List[List[java.lang.String]] = List(List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle), List(""), List(This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with), List(almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or), List(re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included), List(with, this, eBook, or, online, at, www.gutenberg.net), List(""), List(""), List(Title:, The, Adventures, of, Sherlock, Holmes), List(""), List(Author:, Arthur, Conan, Doyle), List(""), List(Posting, Date:, April, 18,, 2011, [EBook, #1661]), List(First, Posted:, November, 29,, 2002), List(""), List(Language:, English), List(""), List(""), List(***, START, OF, THIS, PRO...

And, as we saw in tutorial 7, when we have a List of Lists, we can use flatten to create one big List.

scala> val listOfWords = listOfListOfWords.flatten
listOfWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project, Gut...

But, now you might recognize that this is the map-then-flatten pattern we saw previously, which means we can flatMap it instead.

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.toList.flatMap(x => x.split(" "))
flatMappedWords: List[java.lang.String] = List(Project, Gutenberg's, The, Adventures, of, Sherlock, Holmes,, by, Arthur, Conan, Doyle, "", This, eBook, is, for, the, use, of, anyone, anywhere, at, no, cost, and, with, almost, no, restrictions, whatsoever., "", You, may, copy, it,, give, it, away, or, re-use, it, under, the, terms, of, the, Project, Gutenberg, License, included, with, this, eBook, or, online, at, www.gutenberg.net, "", "", Title:, The, Adventures, of, Sherlock, Holmes, "", Author:, Arthur, Conan, Doyle, "", Posting, Date:, April, 18,, 2011, [EBook, #1661], First, Posted:, November, 29,, 2002, "", Language:, English, "", "", ***, START, OF, THIS, PROJECT, GUTENBERG, EBOOK, THE, ADVENTURES, OF, SHERLOCK, HOLMES, ***, "", "", "", "", Produced, by, an, anonymous, Project,...

But you should be a bit bothered by all this: wasn’t the idea here (in part) not to read all of the lines in at once? Indeed, with what we did above, as soon as we said toList on the Iterator, the whole file was read into memory. However, we can do without the toList step and just directly flatMap the Iterator and get a new Iterator over the tokens rather than the lines.

scala> val flatMappedWords = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" "))
flatMappedWords: Iterator[java.lang.String] = non-empty iterator

Now, if we want to count the words, we can convert that to a List and do the groupBy the mapValues trick we’ve seen already (output omitted).

scala> val counts = Source.fromFile("pg1661.txt").getLines.flatMap(x => x.split(" ")).toList.groupBy(x=>x).mapValues(x=>x.length)

Oops — that worked, but we once again brought the whole file into memory because the List that was created from toList has all lines for the file. We’ll see next how to use a mutable Map to get around this.

Word counting by streaming with an Iterator and using mutable Maps

In all of the tutorials so far, I’ve pretty much stuck to immutable data structures except when mutable ones show up due to context (like Arrays coming out of the toString method). It’s good to try to make use of immutable data structures where possible, but there are times when mutable ones are more convenient and perhaps more appropriate.

With the immutable Maps we saw in the previous tutorial, you could not change the assignment to a key, nor could you add a new key.

lettersToNumbers: scala.collection.immutable.Map[java.lang.String,Int] = Map(A -> 1, B -> 2, C -> 3)

1
scala> lettersToNumbers("A") = 4
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("A") = 4

scala> lettersToNumbers("D") = 5
<console>:9: error: value update is not a member of scala.collection.immutable.Map[java.lang.String,Int]
lettersToNumbers("D") = 5

There is another kind of Map, scala.collection.mutable.Map, that does allow this sort of behavior.

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val mutableLettersToNumbers = mutable.Map("A"->1, "B"->2, "C"->3)
mutableLettersToNumbers: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, B -> 2, A -> 1)

scala> mutableLettersToNumbers("A") = 4

scala> mutableLettersToNumbers("D") = 5

scala> mutableLettersToNumbers
res4: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 5, B -> 2, A -> 4)

It also has a handy way to increase the count associated with a key, using the += method.

scala> mutableLettersToNumbers("D") += 5

scala> mutableLettersToNumbers
res6: scala.collection.mutable.Map[java.lang.String,Int] = Map(C -> 3, D -> 10, B -> 2, A -> 4)

However, we can’t use that method with a key that doesn’t exist.

scala> mutableLettersToNumbers("E") += 1
java.util.NoSuchElementException: key not found: E
<...stacktrace...>

Fortunately, we can provide a default. Here’s an example of starting a new Map with a default of 0.

scala> val counts = mutable.Map[String,Int]().withDefault(x=>0)
counts: scala.collection.mutable.Map[String,Int] = Map()

scala> counts("Z") += 1

scala> counts("Y") += 1

scala> counts("Z") += 1

scala> counts
res11: scala.collection.mutable.Map[String,Int] = Map(Z -> 2, Y -> 1)

Note: when you start with some values already in a Map, Scala can infer the types of the keys and the values, but when initializing an empty Map, it is necessary to explicitly declare the key and value types.

With this in hand, here is how we can use flatMap plus a mutable Map to count words in a text without reading the entire text into memory.

import scala.collection.mutable
val counts = mutable.Map[String, Int]().withDefault(x=>0)
for (token <- scala.io.Source.fromFile("pg1661.txt").getLines.flatMap(x =>x.split("\\s+")))
counts(token) += 1

Having created the counts Map in this way, we can convert it to an immutable Map with the toMap method once we are done adding elements.

scala> val fixedCounts = counts.toMap
fixedCounts: scala.collection.immutable.Map[String,Int] = Map(wood-work, -> 1,
<...output truncated...>

Now we can’t modify the values on fixedCounts, which has advantages in many contexts, e.g. we can’t accidentally destroy values or add unwanted keys, and there are (positive) implications for parallel processing.

scala> fixedCounts("Holmes") = 0
<console>:13: error: value update is not a member of scala.collection.immutable.Map[String,Int]
fixedCounts("Holmes") = 0
^

Reading a file from a URL

As it turns out scala.io.Source can do a lot more than read from a file. Another example is to read from a URL to access a file on the internet, using the fromURL method.

val holmesUrl = """http://www.gutenberg.org/cache/epub/1661/pg1661.txt"""
for (line <- Source.fromURL(holmesUrl).getLines)
println(line)

If you are just going to analyze the same file again and again, this is probably not what you need — just download the file and use it locally. However, it can be quite useful in contexts where you are exploring links within pages (e.g. while processing Wikipedia or Twitter data) and need to read in content from URLs on the fly.

Use (up) the Source

A final note on the Iterators you get with Source.fromFile and Source.fromURL: you can only iterate through them once! This is part of what makes them more efficient — they aren’t holding all thattext in memory. So, don’t be surprised if you get the following behavior.

scala> val holmesIterator = Source.fromFile("pg1661.txt").getLines
 holmesIterator: Iterator[String] = non-empty iterator

scala> holmesIterator.foreach(println)

Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle

This eBook is for the use of anyone anywhere at no cost and with
 almost no restrictions whatsoever.  You may copy it, give it away or
 re-use it under the terms of the Project Gutenberg License included
 with this eBook or online at www.gutenberg.net

<...many lines of output...>

This Web site includes information about Project Gutenberg-tm,
 including how to make donations to the Project Gutenberg Literary
 Archive Foundation, how to help produce our new eBooks, and how to
 subscribe to our email newsletter to hear about new eBooks.

scala> holmesIterator.foreach(println)

<...nothing output!...>

So, the Iterator is used up! If you want to go through the file again, you’ll need to spin up a new Iterator just like you did the first time around. The neat thing about staying with the Iterators and not converting to Lists (and thus bringing everything into memory) is that each mapping operation we do on the Iterator applies only for the current item we are looking at, so we never need to read the whole file into memory.

Of course, if you have a reasonably small file to work with, you should feel absolutely free to toList it and work with it that way if you prefer — it will often be more convenient since you can do the groupBy and mapValue pattern.

Reference: First steps in Scala for beginning programmers, Part 8 from our JCG partner Jason Baldridge at the Bcomposes blog.

Related Articles :

Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

1 Comment
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
vinod
vinod
10 years ago

hi,

How to work with xml file to convert xml to csv using scala …

please help its urgent..

Back to top button