Scala Tutorial – regular expressions, matching and substitutions with the scala.util.matching API

Preface

This is part 6 of tutorials for first-time programmers getting into Scala. Other posts are on this blog, and you can get links to those and other resources on the links page of the Computational Linguistics course I’m creating these for. Additionally you can find this and other tutorial series on the JCG Java Tutorials page.

This post is the second of two about regular expressions (regexes), which are essential for a wide range of programming tasks, and for computational linguistics tasks in particular. If you haven’t read it already, you might want to start with the first post about regexes. For what it’s worth, this post might actually be of some use to programmers who already are reasonably familiar with Scala but who haven’t used regular expressions much yet: it might save you some poking around to figure out how to do things you already know how to do quite well in other languages.

The use of regular expressions for capturing values for variable assignment and cases in match expressions is a very clean, well-thought-out and highly useful feature of the support for regular expressions in the Scala language. However, their use for more complex string matching and substitution is, frankly, much less straightforward than it is in languages with built-in support for regular expressions, such as Perl (which, speaking as one who has coded a lot in Perl, you do *not* want to use for general programming). Scala is fully capable, in that you can use regular expressions to their full extent, but you’ll need to do so via the Regex API. In other words, you need to use a number of commands, not all of which are as straightforward as they could be. (This is not a rant, though I do obviously wish regular expressions were supported more naturally in Scala.)

Though I’ll refer to what I’m doing below as using the Regex API, I’ll note first that this makes it sound like a bigger deal than it is. It just means you are directly using classes and objects from the scala.util.matching package rather than using some of the special syntax and integration with Scala pattern matching we saw in the previous post.

More extensive matching

First off, let’s do what we did with pattern matching in the previous post, but now using the Regex class and the methods available to it to achieve the same ends. We can then start working with finding multiple matches and performing substitutions.

To recap, recall the Name regular expression and how we can use it to initialize a group of variables based on matching a given string.

scala> val Name = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
Name: scala.util.matching.Regex = (Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)
 
scala> val smith = "Mr. John Smith"
smith: java.lang.String = Mr. John Smith
 
scala> val Name(title, first, last) = smith
title: String = Mr
first: String = John
last: String = Smith

Instead of doing it this way, let’s instead use the API methods. We start by using the regex to find the matches, if any. The method findAllIn of Regex does this for us.

scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator

The result is an iterator, which is an object that is like a list in that you can iterate over its elements with for expressions and foreach, use map to transform its values, and more.

scala> matchesFound.foreach(println)
Mr. John Smith

However, unlike Lists, you can only do this a single time. As the following shows, after you iterate through it once, its elements are used up.

scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator
 
scala> matchesFound.foreach(println)
Mr. John Smith
 
scala> matchesFound.foreach(println)

Another difference is that you cannot index into its elements directly.

scala> val matchesFound = Name.findAllIn(smith)
matchesFound: scala.util.matching.Regex.MatchIterator = non-empty iterator
 
scala> matchesFound(0)
<console>:11: error: scala.util.matching.Regex.MatchIterator does not take parameters
matchesFound(0)
^

If you wish to do that, you need to just call toList on the MatchIterator.

scala> val matchList = Name.findAllIn(smith).toList
matchList: List[String] = List(Mr. John Smith)
 
scala> matchList.foreach(println)
Mr. John Smith
 
scala> matchList.foreach(println)
Mr. John Smith

I’ll primarily work with the match results as a List for the remainder of this tutorial. However, note that when you are programming, you should consider whether you really need to do this; usually, the iterator will be sufficient, and it has the advantage of being more efficient.
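As a rough sketch of working with the iterator directly, the following counts the names in a string and upper-cases them, converting to a List only at the very end. The object name IteratorDirectDemo, the vals in it and the example text are made up for illustration. Each call to findAllIn produces a fresh iterator, so nothing here relies on reusing one that has already been consumed.

```scala
import scala.util.matching.Regex

object IteratorDirectDemo {
  val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
  val text = "Mr. John Smith met Ms. Jane Hill."

  // Count the matches by walking the iterator once; no List is built.
  val numNames: Int = Name.findAllIn(text).size

  // A second call to findAllIn gives a fresh iterator over the same matches.
  // map is applied lazily; toList at the end forces the result.
  val shouted: List[String] = Name.findAllIn(text).map(_.toUpperCase).toList

  def main(args: Array[String]): Unit = {
    println(numNames)
    shouted.foreach(println)
  }
}
```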

Note above that what we have is a List[String]. That means we can see which portions of a string matched, which could include multiple matches.

scala> val sentence = "Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown."
sentence: java.lang.String = Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown.
 
scala> val matchList = Name.findAllIn(sentence).toList
matchList: List[String] = List(Mr. John Smith, Ms. Jane Hill, Mr. Bill Brown)

This will be useful in many contexts, but it won’t allow us to access the match groups that were defined in the Regex. For that, we need to use the matchData method, which converts the MatchIterator (which offers Strings as its elements) into an Iterator[Match] (which offers Match objects as its elements).

scala> val matchList = Name.findAllIn(smith).matchData
matchList: java.lang.Object with Iterator[scala.util.matching.Regex.Match] = non-empty iterator

Let’s convert that to a List and then grab the first element.

scala> val matchList = Name.findAllIn(smith).matchData.toList
matchList: List[scala.util.matching.Regex.Match] = List(Mr. John Smith)
 
scala> val firstMatch = matchList(0)
firstMatch: scala.util.matching.Regex.Match = Mr. John Smith

This Match object contains captured groups that we can access with the group method. The first index, 0, returns the entire match, and the rest access the captured groups.

scala> firstMatch.group(0)
res8: String = Mr. John Smith
 
scala> val title = firstMatch.group(1)
title: String = Mr
 
scala> val first = firstMatch.group(2)
first: String = John
 
scala> val last = firstMatch.group(3)
last: String = Smith
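One thing to keep in mind is that indexing into the List with matchList(0) throws an exception when there are no matches at all. If you only care about the first match, the Regex method findFirstMatchIn returns an Option[Match] instead, so the no-match case is just None. The following is a small sketch of that; the object name FirstMatchDemo and the val names are invented for the example.

```scala
import scala.util.matching.Regex

object FirstMatchDemo {
  val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
  val sentence = "Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown."

  // Some(match) for the first name found, or None when nothing matches.
  val firstMatch: Option[Regex.Match] = Name.findFirstMatchIn(sentence)

  // Pull out the last name (group 3) if there was a match.
  val firstLastName: Option[String] = firstMatch.map(_.group(3))

  // A string with no names in it yields None rather than an exception.
  val noMatch: Option[Regex.Match] = Name.findFirstMatchIn("nobody here")

  def main(args: Array[String]): Unit = {
    println(firstLastName)
    println(noMatch)
  }
}
```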

We can get a bit closer to the original pattern matched variable assignment by packaging them up as a tuple.

scala> val (title, first, last) = (firstMatch.group(1), firstMatch.group(2), firstMatch.group(3))
title: String = Mr
first: String = John
last: String = Smith

Update: There is a more concise way to do this using the range 1 to 3 and map firstMatch.group over that range. This creates a Seq(uence), which we can pattern match on. (Thanks to @missingfaktor.)

val Seq(title, first, last) = 1 to 3 map firstMatch.group
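Another option, sketched here, is the subgroups method on a Match, which returns all of the captured groups (everything except group 0) as a List[String] that can be pattern matched on in the same way. The object name SubgroupsDemo is made up, and the call to get is only safe because we know this particular string matches.

```scala
import scala.util.matching.Regex

object SubgroupsDemo {
  val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

  // .get is fine for this demo because the string is known to match.
  val firstMatch: Regex.Match = Name.findFirstMatchIn("Mr. John Smith").get

  // subgroups holds groups 1 to 3 here; group 0 (the whole match) is left out.
  val List(title, first, last) = firstMatch.subgroups

  def main(args: Array[String]): Unit =
    println(title + " / " + first + " / " + last)
}
```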

This should demonstrate why Scala’s support for regexes in pattern matching is very nice for this. What you gain with the API is the ability to match multiple instances of a pattern in a string and then to perform computations with the Match results on the fly. For example, let’s return to the sentence with multiple names in it and use the Name regex to say hello to every name found in it.

scala> Name.findAllIn(sentence).matchData.foreach(m => println("Hello, " + m.group(0)))
Hello, Mr. John Smith
Hello, Ms. Jane Hill
Hello, Mr. Bill Brown

Of course, you can choose to print only subparts of the names, such as the title and the last name.

scala> Name.findAllIn(sentence).matchData.foreach(m => println("Hello, " + m.group(1) + ". " + m.group(3)))
Hello, Mr. Smith
Hello, Ms. Hill
Hello, Mr. Brown

Or you can filter the results, e.g. to only the Mr’s, and then print only the first names.

scala> Name.findAllIn(sentence).matchData.filter(m=>m.group(1) == "Mr").foreach(m => println("Hello, " + m.group(2)))
Hello, John
Hello, Bill

Notice that in the above lines, I didn’t convert the MatchIterator to a List since I was happy to just go through the list once and do some actions.
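The same filter-then-transform step can also be written as a for expression. The sketch below uses the Regex method findAllMatchIn, which returns an Iterator[Match] directly (the same kind of iterator you get from findAllIn followed by matchData), and collects the first names of the Mr’s rather than printing them. The object and val names are invented for illustration.

```scala
import scala.util.matching.Regex

object MisterFirstNames {
  val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r
  val sentence = "Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown."

  // Keep only the matches whose title is "Mr" and yield the first name.
  val misters: List[String] =
    (for (m <- Name.findAllMatchIn(sentence) if m.group(1) == "Mr")
      yield m.group(2)).toList

  def main(args: Array[String]): Unit =
    misters.foreach(name => println("Hello, " + name))
}
```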

Performing substitutions

The other thing you gain is the ability to use regular expressions for substituting one class of expressions with another. For example, let’s say that (for some odd reason) you would like to reverse everyone’s name so that “Mr. John Smith” becomes “Mr. Smith John”. This is accomplished by using the Regex method replaceAllIn, which takes two arguments: the first is the original string and the second is a function that takes a Match object and returns a String.

scala> val swapped = Name.replaceAllIn(sentence, m => m.group(1) + ". " + m.group(3) + " " + m.group(2))
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.

The variable m above is referring to each of the Match objects identified, in turn. That means we can access the groups as we did before. The thing that might feel strange at first is that the anonymous function m => m.group(1) + ". " + m.group(3) + " " + m.group(2) is an argument. It’s not very different from the following, where we first create a named function and then pass it as an argument.

scala> def swapFirstLast = (m: scala.util.matching.Regex.Match) => m.group(1) + ". " + m.group(3) + " " + m.group(2)
swapFirstLast: (util.matching.Regex.Match) => java.lang.String
 
scala> val swapped = Name.replaceAllIn(sentence, swapFirstLast)
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.

Note that now that we’ve defined it, we can use that same function to map the Matches returned by findAllIn to their swapped versions.

scala> val swappedNames = Name.findAllIn(sentence).matchData.map(swapFirstLast).toList
swappedNames: List[java.lang.String] = List(Mr. Smith John, Ms. Hill Jane, Mr. Brown Bill)

The difference is that using findAllIn gives us the Match results themselves, whereas replaceAllIn replaces them in the String in situ. Whether you need to do one or the other depends on your programming needs.

Determining whether an entire string matches using the Regex API

If you just want to know whether an entire given string matches a Regex, Scala unfortunately has a somewhat roundabout way for you to do this. First, here is the syntax, testing whether Name matches on the variables smith and sentence.

scala> Name.pattern.matcher(smith).matches
res21: Boolean = true
 
scala> Name.pattern.matcher(sentence).matches
res22: Boolean = false

So, sentence doesn’t match (despite having three names in it) because the entire string is not a single match to Name.

What is going on here is that we are actually using classes defined in Java for working with regular expressions. First, we get the java.util.regex.Pattern object associated with our scala.util.matching.Regex object.

scala> Name.pattern
res16: java.util.regex.Pattern = (Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)

Then we use that Pattern to get a java.util.regex.Matcher for the string.

scala> Name.pattern.matcher(smith)
res17: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+) region=0,14 lastmatch=]

The Matcher class has a matches method that tells us whether there was a match or not for that string.

scala> Name.pattern.matcher(smith).matches
res18: Boolean = true

So, long-winded, but you can do it.
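If you find yourself doing this often, one option is to hide the chain of calls behind a small helper of your own. The sketch below does just that; matchesWhole is a made-up name, not part of the Regex API, and it simply wraps the pattern, matcher and matches calls shown above.

```scala
import scala.util.matching.Regex

object WholeMatchDemo {
  val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

  // Hypothetical helper: true only if the entire string s matches regex.
  def matchesWhole(regex: Regex, s: String): Boolean =
    regex.pattern.matcher(s).matches

  val smith = "Mr. John Smith"
  val sentence = "Mr. John Smith said hello to Ms. Jane Hill and then to Mr. Bill Brown."

  def main(args: Array[String]): Unit = {
    println(matchesWhole(Name, smith))
    println(matchesWhole(Name, sentence))
  }
}
```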

Note: there is another way to do this using Scala’s standard pattern matching paradigm discussed in the previous post on regexes.

scala> smith match { case Name(_,_,_) => true; case _ => false }
res23: Boolean = true
 
scala> sentence match { case Name(_,_,_) => true; case _ => false }
res24: Boolean = false

However, this requires the extra work of specifying the capture groups, which are being thrown away anyway.
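One way to cut down on that extra work, sketched below, is the sequence wildcard _*, which stands in for all of the capture groups at once, so the pattern does not need to change if the number of groups in the regex does. The object name NameWildcardDemo and the method name isName are invented for the example.

```scala
import scala.util.matching.Regex

object NameWildcardDemo {
  val Name: Regex = """(Mr|Mrs|Ms)\. ([A-Z][a-z]+) ([A-Z][a-z]+)""".r

  // Name(_*) matches when the whole string fits the regex,
  // ignoring however many capture groups it defines.
  def isName(s: String): Boolean = s match {
    case Name(_*) => true
    case _        => false
  }

  def main(args: Array[String]): Unit = {
    println(isName("Mr. John Smith"))
    println(isName("Mr. John Smith said hello."))
  }
}
```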

Simple substitutions with a second regular expression

There is another replaceAllIn method that takes a String defining a (fairly) standard regular expression substitution as its second argument rather than a function from Matches to Strings. This argument defines a replacement similar to that used in standard s/// substitutions from the Perl programming language, e.g. the following, which turns strings like “xyzaaaabbb123” into “xyzbbbaaaa123”.

s/(a+)(b+)/\2\1/

Unlike the \1, \2 syntax above (which is the same as the syntax discussed in Jurafsky and Martin’s book), Scala uses $1, $2, etc. to refer to the captured groups. As an example, consider the first-last name swap we did before. Here it is repeated:

scala> val swapped = Name.replaceAllIn(sentence, m => m.group(1) + ". " + m.group(3) + " " + m.group(2))
swapped: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.

You can get the exact same effect somewhat more easily by constructing the replacement string with $n variables that refer to the groups.

scala> val swapped2 = Name.replaceAllIn(sentence, "$1. $3 $2")
swapped2: String = Mr. Smith John said hello to Ms. Hill Jane and then to Mr. Brown Bill.
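For comparison, here is a sketch of how the s/(a+)(b+)/\2\1/ substitution from earlier could be written in the same style, with $2 and $1 in place of \2 and \1. The names AsThenBs and flipped are made up for the example.

```scala
import scala.util.matching.Regex

object PerlStyleSwapDemo {
  // One or more a's followed by one or more b's, each run captured as a group.
  val AsThenBs: Regex = "(a+)(b+)".r

  // "$2$1" puts the run of b's before the run of a's.
  val flipped: String = AsThenBs.replaceAllIn("xyzaaaabbb123", "$2$1")

  def main(args: Array[String]): Unit =
    println(flipped)
}
```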

This is far more concise and readable than the m.group() style above, so it is preferable for cases like this. However, sometimes you’ll want to do some more interesting processing of the values in each group, such as changing the titles to another language and outputting only the first initial of the first name: e.g. “Mr. John Smith” would become “Sr. J. Smith” and “Mrs. Jane Hill” would become “Sra. J. Hill”. It isn’t clear to me how one could do this with the $n substitutions (if some reader is aware, please let me know). Doing it with the Match => String function is straightforward. First, let’s define a method that maps the titles from English to Spanish.

def engTitle2Esp (title: String) = title match {
  case "Mr" => "Sr"
  case "Mrs" => "Sra"
  case "Ms" => "Srta"
}

Then we pass m.group(1) through that function by using engTitle2Esp(m.group(1)), and get just the first character of group 2 by indexing into it as m.group(2)(0).

scala> val spanishized = Name.replaceAllIn(sentence, m => engTitle2Esp(m.group(1)) + ". " + m.group(2)(0) + ". " + m.group(3))
spanishized: String = Sr. J. Smith said hello to Srta. J. Hill and then to Sr. B. Brown.

This gives you considerable control over how to process the replacements.
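One caveat with the function version is worth a quick sketch: the String your function returns is still treated as a replacement string, so a $ followed by a digit in it is read as a group reference. If the text you are building may contain a literal dollar sign or backslash, you can pass it through Regex.quoteReplacement first. The regex, the example text and the object name below are made up for illustration.

```scala
import scala.util.matching.Regex

object LiteralDollarDemo {
  val Amount: Regex = """(\d+) dollars""".r
  val text = "The book costs 5 dollars and the pen costs 2 dollars."

  // The function builds strings such as "$5". Without quoteReplacement,
  // "$5" would be read as a reference to capture group 5.
  val withSigns: String =
    Amount.replaceAllIn(text, m => Regex.quoteReplacement("$" + m.group(1)))

  def main(args: Array[String]): Unit =
    println(withSigns)
}
```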

Reference: First steps in Scala for beginning programmers, Part 6 from our JCG partner Jason Baldridge at the Bcomposes blog.
