Clojure: All things regex

I’ve been doing some scraping of web pages recently using Clojure and Enlive, and as part of that I’ve had to write regular expressions to extract the data I’m interested in.

On my travels I’ve come across a few different functions and I’m never sure which is the right one to use, so I thought I’d document what I’ve tried for future me.

Check if regex matches

I wrote the first regex while scraping the Champions League results from the Rec.Sport.Soccer Statistics Foundation, where I wanted to determine which spans contained the match result and which didn’t.

A matching line would look like this:

Real Madrid-Juventus Turijn 2 - 1

And a non-matching one like this:

53’Nedved 0-1, 66'Xavi Hernández 1-1, 114’Zalayeta 1-2

I wrote the following regex to detect match results:

[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]

I then wrote the following function using re-matches which would return true or false depending on the input:

(require 'clojure.string)

(defn recognise-match? [row]
  (not (clojure.string/blank? (re-matches #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row))))
> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1")
true
> (recognise-match? "53’Nedved 0-1, 66'Xavi Hernández 1-1, 114’Zalayeta 1-2")
false

re-matches only returns a match if the whole string matches the pattern, which means that a line with some spurious text after the score wouldn’t match:

> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1 abc")
false

If we don’t mind that and we just want some part of the string to match our pattern then we can use re-find instead:

(defn recognise-match? [row]
  (not (clojure.string/blank? (re-find #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]" row))))
> (recognise-match? "Real Madrid-Juventus Turijn 2 - 1 abc")
true
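
Since re-matches and re-find both return nil when nothing matches, the same check can be written without clojure.string/blank? by coercing the result to a boolean. This is only a sketch of that idea: the names match-pattern, whole-line-match? and contains-match? are mine and not from the original code. A side benefit is that it keeps working if capture groups are later added to the pattern, at which point both functions return a vector rather than a string.

```clojure
;; The pattern from the post, pulled out into a var (name is hypothetical).
(def match-pattern #"[a-zA-Z\s]+-[a-zA-Z\s]+ [0-9][\s]?.[\s]?[0-9]")

;; True only when the entire row matches the pattern.
(defn whole-line-match? [row]
  (boolean (re-matches match-pattern row)))

;; True when any part of the row matches the pattern.
(defn contains-match? [row]
  (boolean (re-find match-pattern row)))

(whole-line-match? "Real Madrid-Juventus Turijn 2 - 1")     ;; => true
(whole-line-match? "Real Madrid-Juventus Turijn 2 - 1 abc") ;; => false
(contains-match? "Real Madrid-Juventus Turijn 2 - 1 abc")   ;; => true
```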

Extract capture groups

The next thing I wanted to do was to capture the teams and the score of the match, which I initially did using re-seq:

> (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

I then extracted the various parts like so:

> (def result (first (re-seq #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1")))

> result
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

> (nth result 1)
"FC Valencia"

> (nth result 2)
"Internazionale Milaan"

re-seq returns a lazy sequence of the successive matches of the regex in the string. If the pattern has no capture groups each element is the matched string; if it does, each element is a vector containing the whole match followed by each of the capture groups.

For example, if we keep only the part of the pattern that matches runs of letters or whitespace, with and without a capture group, we get the following results:

> (re-seq #"([a-zA-Z\s]+)" "FC Valencia-Internazionale Milaan 2 - 1")
(["FC Valencia" "FC Valencia"] ["Internazionale Milaan " "Internazionale Milaan "] [" " " "] [" " " "])

> (re-seq #"[a-zA-Z\s]+" "FC Valencia-Internazionale Milaan 2 - 1")
("FC Valencia" "Internazionale Milaan " " " " ")
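
Where re-seq earns its keep is a line that contains the pattern several times. As a small illustration, here is the goal scorers line from earlier with the running scores pulled out. The var names are made up for this example, and I've simplified the line to plain ASCII.

```clojure
;; A simplified, ASCII-only version of the non-matching line from above.
(def goals-line "53'Nedved 0-1, 66'Xavi Hernandez 1-1, 114'Zalayeta 1-2")

;; With capture groups, every element is [whole-match group-1 group-2].
(def running-scores (re-seq #"(\d)-(\d)" goals-line))
;; => (["0-1" "0" "1"] ["1-1" "1" "1"] ["1-2" "1" "2"])

;; Keep just the whole match from each vector.
(def score-strings (map first running-scores))
;; => ("0-1" "1-1" "1-2")
```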

In our case re-find or re-matches actually makes more sense since we only want to match the pattern once. re-find returns only the first match, so any further matches after it aren’t included in the result, e.g.

> (re-find #"[a-zA-Z\s]+" "FC Valencia-Internazionale Milaan 2 - 1")
"FC Valencia"

> (re-matches #"[a-zA-Z\s]*" "FC Valencia-Internazionale Milaan 2 - 1")
nil

re-matches returns nil here because there are characters in the string which don’t match the pattern, i.e. the hyphens and the digits of the score.

If we tie that in with our capture groups we end up with the following:

> (def result 
    (re-find #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))

> result
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

> (nth result 1)
"FC Valencia"

> (nth result 2)
"Internazionale Milaan"

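Rather than picking the parts out with nth, the vector can be destructured directly. The following is a sketch rather than the code I actually used: result-pattern, parse-result and the map keys are names invented for the example, and the scores are converted with Long/parseLong on the assumption that we want them as numbers.

```clojure
(def result-pattern #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])")

;; Returns a map describing the match, or nil when the row has no result in it.
(defn parse-result [row]
  ;; The first element of the vector is the whole match, which we ignore with _.
  (when-let [[_ home away home-goals away-goals] (re-find result-pattern row)]
    {:home home
     :away away
     :home-goals (Long/parseLong home-goals)
     :away-goals (Long/parseLong away-goals)}))

(parse-result "FC Valencia-Internazionale Milaan 2 - 1")
;; => {:home "FC Valencia", :away "Internazionale Milaan", :home-goals 2, :away-goals 1}

(parse-result "no result here")
;; => nil
```
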
I also came across the re-pattern function, which builds a pattern from an ordinary string. It’s a more verbose way of creating a pattern, which we can then evaluate with re-find:

> (re-find (re-pattern "([a-zA-Z\\s]+)-([a-zA-Z\\s]+) ([0-9])[\\s]?.[\\s]?([0-9])") "FC Valencia-Internazionale Milaan 2 - 1")
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]

One difference here is that the pattern is now an ordinary string literal, so I had to escape the backslash in the special sequence ‘\s’ (writing ‘\\s’), otherwise I was getting the following exception:

RuntimeException Unsupported escape character: \s  clojure.lang.Util.runtimeException (Util.java:170)
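
One place where re-pattern does seem worth the extra escaping is when the pattern is assembled from smaller strings, which the literal regex syntax can't do. A rough sketch, with team, goals and composed-pattern being names I've made up here:

```clojure
;; Reusable fragments; backslashes are doubled because these are plain strings.
(def team "([a-zA-Z\\s]+)")
(def goals "([0-9])")

;; Glue the fragments together and compile the result into a pattern.
(def composed-pattern
  (re-pattern (str team "-" team " " goals "\\s?.\\s?" goals)))

(re-find composed-pattern "FC Valencia-Internazionale Milaan 2 - 1")
;; => ["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]
```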

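I wanted to play around with re-groups as well, but it threw an exception reasonably frequently when I expected it to work. The likely reason is that re-groups takes a matcher rather than a pattern, and only works after a successful re-find or re-matches call on that matcher; otherwise the underlying java.util.regex.Matcher throws an IllegalStateException.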
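
As far as I can tell, re-groups reads the groups of the most recent match from a java.util.regex.Matcher, so it only works once a match has been found on that matcher. The snippet below is a small experiment along those lines; score-matcher, before-find, found and groups are names chosen for the example.

```clojure
(def score-matcher
  (re-matcher #"(\d)[\s]?.[\s]?(\d)" "FC Valencia-Internazionale Milaan 2 - 1"))

;; Asking for groups before any match has been attempted throws.
(def before-find
  (try
    (re-groups score-matcher)
    (catch IllegalStateException _ :no-match-yet)))
;; => :no-match-yet

;; re-find advances the matcher to the first match...
(def found (re-find score-matcher))
;; => ["2 - 1" "2" "1"]

;; ...after which re-groups returns the groups of that same match.
(def groups (re-groups score-matcher))
;; => ["2 - 1" "2" "1"]
```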

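The last function I looked at was re-matcher, which I first took to be a long-hand for the #"" syntax used earlier in the post. In fact #"" creates a pattern, whereas re-matcher combines a pattern and a string into a stateful java.util.regex.Matcher, which can then be passed to re-find: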

> (re-find (re-matcher #"([a-zA-Z\s]+)-([a-zA-Z\s]+) ([0-9])[\s]?.[\s]?([0-9])" "FC Valencia-Internazionale Milaan 2 - 1"))
["FC Valencia-Internazionale Milaan 2 - 1" "FC Valencia" "Internazionale Milaan" "2" "1"]
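
Because the matcher remembers where it got to, calling re-find on it repeatedly walks through the matches one at a time and eventually returns nil. A quick sketch of that behaviour (the var names are only for illustration):

```clojure
(def digit-matcher
  (re-matcher #"[0-9]" "FC Valencia-Internazionale Milaan 2 - 1"))

;; Each call to re-find picks up where the previous one stopped.
(def first-digit (re-find digit-matcher))  ;; => "2"
(def second-digit (re-find digit-matcher)) ;; => "1"
(def no-more (re-find digit-matcher))      ;; => nil
```

For this kind of thing re-seq is usually simpler, since it gives back all the matches as a sequence without any mutable state to keep track of.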

In summary

So in summary I think most use cases are covered by re-find and re-matches and maybe re-seq on special occasions. I couldn’t see where I’d use the other functions but I’m happy to be proved wrong.
 

Reference: Clojure: All things regex from our JCG partner Mark Needham at the Mark Needham Blog blog.
