Clojure: All things regex

I’ve been doing some scrapping of web pages recently using Clojure and Enlive and as part of that I’ve had to write regular expressions to extract the data I’m interested in.

On my travels I’ve come across a few different functions and I’m never sure which is the right one to use so I thought I’d document what I’ve tried for future me.

Check if regex matches

The first regex I wrote was while scrapping the Champions League results from the Rec.Sport.Soccer Statistics Foundation and I wanted to determine which spans contained the match result and which didn’t.

A matching line would look like this:

Real Madrid-Juventus Turijn 2 - 1

And a non matching one like this:

53’Nedved 0-1, 66'Xavi Hernández 1-1, 114’Zalayeta 1-2

I wrote the following regex to detect match results:

[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]

I then wrote the following function using re-matches which would return true or false depending on the input:

(defn recognise-match? [row]
  (not (clojure.string/blank? (re-matches #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row))))
> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1")
true
> (recognise-match? "53’Nedved 0-1, 66'Xavi Hernández 1-1, 114’Zalayeta 1-2")
false

re-matches only returns matches if the whole string matches the pattern which means if we had a line with some spurious text after the score it wouldn’t match:

> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1 abc")
false

If we don’t mind that and we just want some part of the string to match our pattern then we can use re-find instead:

(defn recognise-match? [row]
  (not (clojure.string/blank? (re-find #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row))))
> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1 abc")
true

Extract capture groups

The next thing I wanted to do was to capture the teams and the score of the match which I initially did using re-seq:

> (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

I then extracted the various parts like so:

> (def result (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1")))

> result
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

> (nth result 1)
"FC Valencia"

> (nth result 2)
"Internazionale Milaan"

re-seq returns a list which contains consecutive matches of the regex. The list will either contain strings if we don’t specify capture groups or a vector containing the pattern matched and each of the capture groups.

For example if we now match only sequences of A-Z or spaces and remove the rest of the pattern from above we’d get the following results:

> (re-seq #"([a-zA-Z\s]+)" "FC Valencia-Internazionale Milaan 2 - 1")
(["FC Valencia" "FC Valencia"] ["Internazionale Milaan " "Internazionale Milaan "] [" " " "] [" " " "])

> (re-seq #"[a-zA-Z\s]+" "FC Valencia-Internazionale Milaan 2 - 1")
("FC Valencia" "Internazionale Milaan " " " " ")

In our case re-find or re-matches actually makes more sense since we only want to match the pattern once. If there are further matches after this those aren’t included in the results. e.g.

> (re-find #"[a-zA-Z\s]+" "FC Valencia-Internazionale Milaan 2 - 1")
"FC Valencia"

> (re-matches #"[a-zA-Z\s]*" "FC Valencia-Internazionale Milaan 2 - 1")
nil

re-matches returns nil here because there are characters in the string which don’t match the pattern i.e. the hyphen between the two scores.

If we tie that in with our capture groups we end up with the following:

> (def result 
    (re-find #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))

> result
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

> (nth result 1)
"FC Valencia"

> (nth result 2)
"Internazionale Milaan"

I also came across the re-pattern function which provides a more verbose way of creating a pattern and then evaluationg it with re-find:

> (re-find (re-pattern "([a-zA-Z\\s]+)-([a-zA-Z\\s]+) ([0-9])[\\s]?.[\\s]?([0-9])") "FC Valencia-Internazionale Milaan 2 - 1")
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

One difference here is that I had to escape the special sequence ‘\s’ otherwise I was getting the following exception:

RuntimeException Unsupported escape character: \s  clojure.lang.Util.runtimeException (Util.java:170)

I wanted to play around with re-groups as well but that seemed to throw an exception reasonably frequently when I expected it to work.

The last function I looked at was re-matcher which seemed to be a long-hand for the ‘#””‘ syntax used earlier in the post to define matchers:

> (re-find (re-matcher #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

In summary

So in summary I think most use cases are covered by re-find and re-matches and maybe re-seq on special occasions. I couldn’t see where I’d use the other functions but I’m happy to be proved wrong.
 

Reference: Clojure: All things regex from our JCG partner Mark Needham at the Mark Needham Blog blog.
Related Whitepaper:

Java Essential Training

Author David Gassner explores Java SE (Standard Edition), the language used to build mobile apps for Android devices, enterprise server applications, and more!

The course demonstrates how to install both Java and the Eclipse IDE and dives into the particulars of programming. The course also explains the fundamentals of Java, from creating simple variables, assigning values, and declaring methods to working with strings, arrays, and subclasses; reading and writing to text files; and implementing object oriented programming concepts. Exercise files are included with the course.

Get it Now!  

Leave a Reply


two × 9 =



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.

Sign up for our Newsletter

15,153 insiders are already enjoying weekly updates and complimentary whitepapers! Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

As an extra bonus, by joining you will get our brand new e-books, published by Java Code Geeks and their JCG partners for your reading pleasure! Enter your info and stay on top of things,

  • Fresh trends
  • Cases and examples
  • Research and insights
  • Two complimentary e-books
Get tutored by the Geeks! JCG Academy is a fact... Join Now
Hello. Add your message here.