Scala Tutorial – iteration, for expressions, yield, map, filter, count

Preface

This is part 4 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for. Additionally you can find this and other tutorial series on the JCG Java Tutorials page.

This tutorial departs from the very beginner nature of the previous three, so this may be of more interest to readers who already have some programming experience in another language. (Though also, see the section on using matching in Scala in Part 3.)

Iteration, the Scala way(s)

Up to now, we have (mostly) accessed individual items on a list by using their indices. But one of the most natural things to do with a list is to repeat some action for each item on the list, for example: “For each word in the given list of words: print it”. Here is how to say this in Scala.

scala> val animals = List("newt", "armadillo", "cat", "guppy")
animals: List[java.lang.String] = List(newt, armadillo, cat, guppy)
 
scala> animals.foreach(println)
newt
armadillo
cat
guppy

This says to take each element of the list (indicated by foreach) and apply a function (in this case, println) to it, in order. There is some underspecification going on in that we aren’t providing a variable to name elements. This works in some cases, such as above, but won’t always be possible. Here is how it looks in full, with a variable naming the element.

scala> animals.foreach(animal => println(animal))
newt
armadillo
cat
guppy

This is useful when you need to do a bit more, such as concatenating a String element with another String.

scala> animals.foreach(animal => println("She turned me into a " + animal))
She turned me into a newt
She turned me into a armadillo
She turned me into a cat
She turned me into a guppy

Or, if you are performing a computation with it, like outputting the length of each element in a list of strings.

scala> animals.foreach(animal => println(animal.length))
4
9
3
5

We can obtain the same result as foreach using a for expression.

scala> for (animal <- animals) println(animal.length)
4
9
3
5

With what we have been doing so far, these two ways of iterating over the elements of a List are equivalent. However, they differ in an important way: a for expression can return a value (when used with yield), whereas foreach simply performs some action on every element of the list. This latter kind of use is said to be done for its side effect: by printing out each element, we are not creating new values, we are just performing an action on each element. With for expressions, we can yield values that create transformed Lists. For example, contrast using println with the following.

scala> val lengths = for (animal <- animals) yield animal.length
lengths: List[Int] = List(4, 9, 3, 5)

The result is a new list that contains the lengths (number of characters) of each of the elements of the animals list. (You can of course print its contents now by doing lengths.foreach(println), but typically we want to do other, usually more interesting, things with such values.)
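
To make the contrast concrete, here is a small sketch. The names printed and alsoPrinted are made up for illustration; the point is only that foreach and a for expression without yield give back Unit (no useful value), while adding yield gives back a new List.

```scala
val animals = List("newt", "armadillo", "cat", "guppy")

// foreach is used only for its side effect (printing); its result is Unit.
val printed: Unit = animals.foreach(animal => println(animal.length))

// A for expression without yield behaves the same way and also gives back Unit.
val alsoPrinted: Unit = for (animal <- animals) println(animal.length)

// Adding yield collects one result per element into a new List.
val lengths: List[Int] = for (animal <- animals) yield animal.length
```

Only the last of the three gives you something you can keep working with.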

What we just did was map the values of animals into a new set of values in a one-to-one manner, using the function length. Lists have another function called map that does this directly.

scala> val lengthsMapped = animals.map(animal => animal.length)
lengthsMapped: List[Int] = List(4, 9, 3, 5)

So, the for-yield expression and the map method achieve the same output, and in many cases they are pretty much equivalent. Using map, however, is often more convenient because you can easily chain a series of operations together. For example, let’s say you want to add 1 to a List of numbers and then get the square of that, so turning List(1,2,3) into List(2,3,4) into List(4,9,16). You can do that quite easily using map.

val nums = List(1, 2, 3)
nums.map(x=>x+1).map(x=>x*x)

Some readers will be puzzled by what was just done. Here it is more explicitly, using an intermediate variable nums2 to store the add-one list.

scala> val nums2 = nums.map(x=>x+1)
nums2: List[Int] = List(2, 3, 4)
 
scala> nums2.map(x=>x*x)
res9: List[Int] = List(4, 9, 16)

Since nums.map(x=>x+1) returns a List, we don’t have to assign it to a variable to use it; we can just use it immediately, including calling map on it again. (Of course, one could do this computation in a single go, e.g. nums.map(x => (x+1)*(x+1)), but often one is using a series of built-in functions, or functions one has already defined.)
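
Here is a sketch that puts the options side by side, assuming nums is List(1, 2, 3) as in the example above. The names chained, singlePass, addOne, square and composed are only for illustration. The last version uses andThen, a standard method on Scala functions that glues two functions into one.

```scala
val nums = List(1, 2, 3)

// Two chained maps: first add one, then square.
val chained = nums.map(x => x + 1).map(x => x * x)

// One map whose function does both steps at once.
val singlePass = nums.map(x => (x + 1) * (x + 1))

// The same two steps as named function values, combined with andThen.
val addOne: Int => Int = x => x + 1
val square: Int => Int = x => x * x
val composed = nums.map(addOne.andThen(square))
```

All three give the same List; which one reads best depends on whether the individual steps are worth naming.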

You can keep on mapping to your heart’s content, including mapping from Ints to Strings.

scala> nums.map(x=>x+1).map(x=>x*x).map(x=>x-1).map(x=>x*(-1)).map(x=>"The answer is: " + x)
res12: List[java.lang.String] = List(The answer is: -3, The answer is: -8, The answer is: -15)

Note: the use of x in all these cases is not important. They could have been named x, y, z and turlingdromes42 — any valid variable name.

Iterating through multiple lists

Sometimes you have two lists that are paired up and you need to do something to elements from each list simultaneously. For example, let’s say you have a list of word tokens and another list with their parts-of-speech. (See the previous tutorial for discussion of parts-of-speech.)

scala> val tokens = List("the", "program", "halted")
tokens: List[java.lang.String] = List(the, program, halted)
 
scala> val tags = List("DT","NN","VB")
tags: List[java.lang.String] = List(DT, NN, VB)

Now, let’s say we want to output these as the following string:

the/DT program/NN halted/VB

Initially, we’ll do it a step at a time, and then show how it can be done all in one line.

First, we use the zip function to bring two lists together and get a new list of pairs of elements from each list.

scala> val tokenTagPairs = tokens.zip(tags)
tokenTagPairs: List[(java.lang.String, java.lang.String)] = List((the,DT), (program,NN), (halted,VB))
 
Zipping two lists together in this way is a common pattern used for iterating over two lists.
 
Now that we have a list of token-tag pairs, we can use a for expression to turn it into a List of strings.
 
scala> val tokenTagSlashStrings = for ((token, tag) <- tokenTagPairs) yield token + "/" + tag
tokenTagSlashStrings: List[java.lang.String] = List(the/DT, program/NN, halted/VB)

Now we just need to turn that list of strings into a single string by concatenating all its elements with a space between each. The function mkString makes this easy.

scala> tokenTagSlashStrings.mkString(" ")
res19: String = the/DT program/NN halted/VB

Finally, here it all is in one step.

scala> (for ((token, tag) <- tokens.zip(tags)) yield token + "/" + tag).mkString(" ")
res23: String = the/DT program/NN halted/VB
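
The same result can also be written with map instead of a for expression. The following is a sketch rather than the one true way: it uses a pattern-matching function ({ case (token, tag) => ... }) to take each pair apart. It also shows one thing to watch out for with zip: it stops at the end of the shorter list. The names tagged and uneven, and the extra word "abruptly", are made up for this example.

```scala
val tokens = List("the", "program", "halted")
val tags = List("DT", "NN", "VB")

// map with a pattern-matching function does the same job as the for expression.
val tagged = tokens.zip(tags).map { case (token, tag) => token + "/" + tag }.mkString(" ")

// zip stops at the end of the shorter list, so the extra token is silently dropped.
val uneven = List("the", "program", "halted", "abruptly").zip(tags)
```

If your two lists might differ in length, it is worth checking that before zipping them.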

Ripping a string into a useful data structure

It is common in computational linguistics to need to convert string inputs into useful data structures. Consider the part-of-speech tagged sentence mentioned in the previous tutorial. Let’s begin by assigning it to the variable sentRaw.

val sentRaw = "The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./."

Now, let’s turn it into a List of Tuples, where each Tuple has the word as its first element and the postag as its second. We begin with the single line that does this so that you can see what the desired result is, and then we’ll examine each step in detail.

scala> val tokenTagPairs = sentRaw.split(" ").toList.map(x => x.split("/")).map(x => Tuple2(x(0), x(1)))
tokenTagPairs: List[(java.lang.String, java.lang.String)] = List((The,DT), (index,NN), (of,IN), (the,DT), (100,CD), (largest,JJS), (Nasdaq,NNP), (financial,JJ), (stocks,NNS), (rose,VBD), (modestly,RB), (as,IN), (well,RB), (.,.))

Let’s take each of these in turn. The first split cuts sentRaw at each space character, and returns an Array of Strings, where each element is the material between the spaces.

scala> sentRaw.split(" ")
res0: Array[java.lang.String] = Array(The/DT, index/NN, of/IN, the/DT, 100/CD, largest/JJS, Nasdaq/NNP, financial/JJ, stocks/NNS, rose/VBD, modestly/RB, as/IN, well/RB, ./.)

What’s an Array? It’s a kind of sequence, like List, but it has some different properties that we’ll discuss later. For now, let’s stick with Lists, which we can do by using the toList method. Additionally, let’s assign it to a variable so that the remaining operations are easier to focus on.

scala> val tokenTagSlashStrings = sentRaw.split(" ").toList
tokenTagSlashStrings: List[java.lang.String] = List(The/DT, index/NN, of/IN, the/DT, 100/CD, largest/JJS, Nasdaq/NNP, financial/JJ, stocks/NNS, rose/VBD, modestly/RB, as/IN, well/RB, ./.)

Now, we need to turn each of the elements in that list into pairs of token and tag. Let’s first consider a single element, turning something like “The/DT” into the pair (“The”,”DT”). The next lines show how to do this one step at a time, using intermediate variables.

scala> val first = "The/DT"
first: java.lang.String = The/DT
 
scala> val firstSplit = first.split("/")
firstSplit: Array[java.lang.String] = Array(The, DT)
 
scala> val firstPair = Tuple2(firstSplit(0), firstSplit(1))
firstPair: (java.lang.String, java.lang.String) = (The,DT)

So, firstPair is a tuple representing the information encoded in the string first. This involved two operations, splitting and then creating a tuple from the Array that resulted from the split. We can do this for all of the elements in tokenTagSlashStrings using map. Let’s first convert the Strings into Arrays.

scala> val tokenTagArrays = tokenTagSlashStrings.map(x => x.split("/"))
tokenTagArrays: List[Array[java.lang.String]] = List(Array(The, DT), Array(index, NN), Array(of, IN), Array(the, DT), Array(100, CD), Array(largest, JJS), Array(Nasdaq, NNP), Array(financial, JJ), Array(stocks, NNS), Array(rose, VBD), Array(modestly, RB), Array(as, IN), Array(well, RB), Array(., .))

And finally, we turn the Arrays into Tuple2s and get the result we obtained with the one-liner earlier.

scala> val tokenTagPairs = tokenTagArrays.map(x => Tuple2(x(0), x(1)))
tokenTagPairs: List[(java.lang.String, java.lang.String)] = List((The,DT), (index,NN), (of,IN), (the,DT), (100,CD), (largest,JJS), (Nasdaq,NNP), (financial,JJ), (stocks,NNS), (rose,VBD), (modestly,RB), (as,IN), (well,RB), (.,.))

Note: if you are comfortable with using one-liners that chain a bunch of operations together, then by all means use them. However, there is no shame in using several lines involving a bunch of intermediate variables if that helps you break apart the task and get the result you need.
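
One more variation, offered as a sketch: the split-then-index steps can be wrapped in a small named function that uses matching on the Array. The function name toPair and the fallback tag "UNK" are made up for this example, and sentRaw here is a shortened version of the sentence above. The benefit is that an item that does not have exactly one slash gets a fallback value instead of causing an error when indexing with x(1).

```scala
val sentRaw = "The/DT index/NN of/IN the/DT 100/CD"

// Split one "token/tag" string; anything not of that shape gets the made-up tag "UNK".
def toPair(item: String): (String, String) = item.split("/") match {
  case Array(token, tag) => (token, tag)
  case _                 => (item, "UNK")
}

// Apply the function to every item, just as with the anonymous functions above.
val pairs = sentRaw.split(" ").toList.map(toPair)
```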

One very useful thing about having a List of pairs (Tuple2s) is that the unzip function gives us back two Lists, one with all of the first elements and another with all of the second elements.

scala> val (tokens, tags) = tokenTagPairs.unzip
tokens: List[java.lang.String] = List(The, index, of, the, 100, largest, Nasdaq, financial, stocks, rose, modestly, as, well, .)
tags: List[java.lang.String] = List(DT, NN, IN, DT, CD, JJS, NNP, JJ, NNS, VBD, RB, IN, RB, .)

With this, we’ve come full circle. Having started with a raw string (such as we are likely to read in from a text file), we now have Lists that allow us to do useful computations, such as converting those tags into another form.
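
As a quick check that zip and unzip really are two directions of the same trip, here is a small sketch on a shortened list of pairs (the names pairs and rebuilt are just for illustration).

```scala
val pairs = List(("The", "DT"), ("index", "NN"), ("rose", "VBD"))

// unzip splits a List of pairs into a pair of Lists...
val (tokens, tags) = pairs.unzip

// ...and zip puts them back together.
val rebuilt = tokens.zip(tags)
```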

Providing a function you have defined to map

Let’s return to the postag simplification exercise we did in the previous tutorial. We’ll modify it a bit: rather than shortening the Penn Treebank parts-of-speech, let’s convert them to coarse parts-of-speech using the English words that most people are familiar with, like noun and verb. The following function turns Penn Treebank tags into these coarse tags, for more tags than we covered in the last tutorial (note: this is still incomplete, but serves to illustrate the point).

def coursePos (tag: String) = tag match {
  case "NN" | "NNS" | "NNP" | "NNPS"                       => "Noun"
  case "JJ" | "JJR" | "JJS"                                => "Adjective"
  case "VB" | "VBD" | "VBG" | "VBN" | "VBP" | "VBZ" | "MD" => "Verb"
  case "RB" | "RBR" | "RBS" | "WRB" | "EX"                 => "Adverb"
  case "PRP" | "PRP$" | "WP" | "WP$"                       => "Pronoun"
  case "DT" | "PDT" | "WDT"                                => "Article"
  case "CC"                                                => "Conjunction"
  case "IN" | "TO"                                         => "Preposition"
  case _                                                   => "Other"
}

We can now map this function over the parts of speech in the collection obtained previously.

scala> tags.map(coursePos)
res1: List[java.lang.String] = List(Article, Noun, Preposition, Article, Other, Adjective, Noun, Adjective, Noun, Verb, Adverb, Preposition, Adverb, Other)

Voila! If we want to convert the tags in this manner and then output them as a string like what we started with, it’s just a few steps. We’ll start from the beginning and recap. Try running the following for yourself.

val sentRaw = "The/DT index/NN of/IN the/DT 100/CD largest/JJS Nasdaq/NNP financial/JJ stocks/NNS rose/VBD modestly/RB as/IN well/RB ./."
 
val (tokens, tags) = sentRaw.split(" ").toList.map(x => x.split("/")).map(x => Tuple2(x(0), x(1))).unzip
 
tokens.zip(tags.map(coursePos)).map(x => x._1+"/"+x._2).mkString(" ")

A further point is that when you provide expressions like (x => x+1) to map, you are actually defining an anonymous function! Here is the same map operation with different levels of specification.

scala> val numbers = (1 to 5).toList
numbers: List[Int] = List(1, 2, 3, 4, 5)
 
scala> numbers.map(1+)
res11: List[Int] = List(2, 3, 4, 5, 6)
 
scala> numbers.map(_+1)
res12: List[Int] = List(2, 3, 4, 5, 6)
 
scala> numbers.map(x=>x+1)
res13: List[Int] = List(2, 3, 4, 5, 6)
 
scala> numbers.map((x: Int) => x+1)
res14: List[Int] = List(2, 3, 4, 5, 6)

So, it’s all consistent: whether you pass in a named function or an anonymous function, map will apply it to each element in the list.

Finally, note that you can use that final form to define a function.

scala> def addOne = (x: Int) => x + 1
addOne: (Int) => Int
 
scala> addOne(1)
res15: Int = 2

This is similar to defining functions as we had previously (e.g. def addOne (x: Int) = x+1), but it is more convenient in certain contexts, which we’ll get to later. For now, the thing to realize is that whenever you map, you are either using a function that already existed or creating one on the fly.
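
To see the three options next to each other, here is a sketch (all of the names are made up for illustration): a method defined with def, a function stored in a val, and an anonymous function written on the spot. map accepts any of them.

```scala
val numbers = (1 to 5).toList

// A method defined with def...
def addOneMethod(x: Int): Int = x + 1

// ...and a function value stored in a val.
val addOneFunction: Int => Int = x => x + 1

// map accepts either one, as well as a function created on the fly.
val viaMethod = numbers.map(addOneMethod)
val viaFunction = numbers.map(addOneFunction)
val viaAnonymous = numbers.map(x => x + 1)
```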

Filtering and counting

The map method is a convenient way of performing computations on each element of a List, effectively transforming a List from one set of values to a new List with a set of values computed from each corresponding element. There are yet more methods that have other actions, such as removing elements from a List (filter), counting the number of elements satisfying a given predicate (count), and computing an aggregate single result from all elements in a List (reduce and fold). Let’s consider a simple task: count how many tokens are not a noun or adjective in a tagged sentence. As a starting point, let’s take the list of mapped postags from before.

scala> val courseTags = tags.map(coursePos)
courseTags: List[java.lang.String] = List(Article, Noun, Preposition, Article, Other, Adjective, Noun, Adjective, Noun, Verb, Adverb, Preposition, Adverb, Other)

One way of doing this is to filter out all of the nouns and adjectives to obtain a list without them and then get its length.

scala> val noNouns = courseTags.filter(x => x != "Noun")
noNouns: List[java.lang.String] = List(Article, Preposition, Article, Other, Adjective, Adjective, Verb, Adverb, Preposition, Adverb, Other)
 
scala> val noNounsOrAdjectives = noNouns.filter(x => x != "Adjective")
noNounsOrAdjectives: List[java.lang.String] = List(Article, Preposition, Article, Other, Verb, Adverb, Preposition, Adverb, Other)
 
scala> noNounsOrAdjectives.length
res8: Int = 9

However, because filter takes a function that returns a Boolean value (a predicate), we can of course use Boolean conjunction and disjunction to simplify things. And we don’t need to save intermediate variables. Here’s the one-liner.

scala> courseTags.filter(x => x != "Noun" && x != "Adjective").length
res9: Int = 9

If all we want is the number of elements, we can instead just use count with the same predicate.

scala> courseTags.count(x => x != "Noun" && x != "Adjective")
res10: Int = 9
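
A predicate can also be given a name and reused, as in the following sketch. It works on a short made-up list rather than the full sentence, and the names isNounOrAdjective, rest and howMany are only for illustration. It also uses filterNot, a standard List method that keeps the elements for which the predicate is false.

```scala
val tags = List("Article", "Noun", "Verb", "Adjective", "Noun", "Other")

// A predicate is just a function that returns a Boolean.
def isNounOrAdjective(tag: String): Boolean = tag == "Noun" || tag == "Adjective"

// filterNot keeps the elements for which the predicate is false.
val rest = tags.filterNot(isNounOrAdjective)

// count gives the number of matches without building the intermediate List.
val howMany = tags.count(tag => !isNounOrAdjective(tag))
```

Both routes agree: the length of rest is the same number that count returns.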

As an exercise, try doing a one-liner that starts with sentRaw and provides the value “resX: Int = 9” (where X is whatever you get in your Scala REPL).

In the next tutorial, we’ll see how to use reduce and fold to compute aggregate results from a List.

Reference: First steps in Scala for beginning programmers, Part 4 from our JCG partner Jason Baldridge at the Bcomposes blog.
