About Vlad Mihalcea

Vlad Mihalcea is a software architect passionate about software integration, high scalability and concurrency challenges.

The regex that broke a server

I never thought I would see a server become unresponsive because of a bad regex matcher, but that is exactly what happened to one of our services.

Let’s assume we parse some external dealer car info. We are trying to find all those cars with “no air conditioning” among various available input patterns (but without matching patterns such as “mono air conditioning”).

The regex that broke our service looks like this:
 
 
 

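String TEST_VALUE = "ABS, traction control, front and side airbags, Isofix child seat anchor points, no air conditioning, electric windows, \r\nelectrically operated door mirrors";
// System.nanoTime() returns a long, so keep the timestamps as long values
long start = System.nanoTime();
// The nested quantifiers in (?:.*?(?:\\s|,)+)* are what make this pattern so expensive
Pattern pattern = Pattern.compile("^(?:.*?(?:\\s|,)+)*no\\s+air\\s+conditioning.*$");
assertTrue(pattern.matcher(TEST_VALUE).matches());
long end = System.nanoTime();
// Divide by 1000.0 to convert nanoseconds to (fractional) microseconds
LOGGER.info("Took {} micros", (end - start) / 1000.0);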

After 2 minutes this test was still running and one CPU core was fully overloaded.


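First, the matches method always tests the entire input, so we don’t need the start (^) or end ($) anchors. Second, the input string contains new line characters, so we instruct our Pattern to operate in MULTILINE mode. Keep in mind that MULTILINE only changes where ^ and $ may match; for the dot (.) to match line terminators as well, Pattern.DOTALL is needed: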

Pattern pattern = Pattern.compile("(?:.*?(?:\\s|,)+)*no\\s+air\\s+conditioning.*?", Pattern.MULTILINE);
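How the two flags treat line terminators is easy to get wrong, so here is a small sketch that isolates it. The class name, the shortened input, and the simplified regex are assumptions made for illustration; only the Pattern flags themselves come from the JDK.

```java
import java.util.regex.Pattern;

class LineTerminatorFlags {
    // Shortened, hypothetical input: the phrase is followed by a second line.
    static final String INPUT =
            "no air conditioning, electric windows, \r\nelectrically operated door mirrors";

    // Simplified regex: the phrase followed by "anything".
    static final String REGEX = "no\\s+air\\s+conditioning.*";

    // matches() succeeds only if the whole input is consumed by the pattern.
    static boolean matchesWith(int flags) {
        return Pattern.compile(REGEX, flags).matcher(INPUT).matches();
    }

    public static void main(String[] args) {
        // By default the dot stops at the line terminator, so the full match fails.
        System.out.println("no flags:  " + matchesWith(0));                 // false
        // MULTILINE only affects ^ and $, which this regex does not use.
        System.out.println("MULTILINE: " + matchesWith(Pattern.MULTILINE)); // false
        // DOTALL lets the dot cross \r\n, so the whole input can be matched.
        System.out.println("DOTALL:    " + matchesWith(Pattern.DOTALL));    // true
    }
}
```

When only the presence of the phrase matters, find() sidesteps the question because it does not have to consume the rest of the input.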

Let’s see how multiple versions of this regex behave:

Regex: "(?:.*?(?:\\s|,)+)*no\\s+air\\s+conditioning.*?"
Duration [microseconds]: 35699.334
Observation: This is way too slow.

Regex: "(?:.*?(?:\\s|,)+)?no\\s+air\\s+conditioning.*?"
Duration [microseconds]: 108.686
Observation: The non-capturing group doesn’t need the zero-or-many (*) quantifier, so we can replace it with zero-or-one (?).

Regex: "(?:.*?\\b)?no\\s+air\\s+conditioning.*?"
Duration [microseconds]: 153.636
Observation: It works for more input data than the previous one, which only uses the space (\s) and the comma (,) to separate the matched pattern.

Regex: "\\bno\\s+air\\s+conditioning"
Duration [microseconds]: 78.831
Observation: Used with find() instead of matches(). Find is much faster than matches, and we are only interested in the first occurrence of this pattern.
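The last variant can be sketched as a small helper. The class and method names below are hypothetical and not taken from the original service; the regex is the one from the comparison above.

```java
import java.util.regex.Pattern;

class AirConditioningFinder {
    // Compile once and reuse: Pattern instances are immutable and thread-safe.
    private static final Pattern NO_AIR_CONDITIONING =
            Pattern.compile("\\bno\\s+air\\s+conditioning");

    // find() stops at the first occurrence and does not need to consume the whole input.
    static boolean hasNoAirConditioning(String dealerInfo) {
        return NO_AIR_CONDITIONING.matcher(dealerInfo).find();
    }

    public static void main(String[] args) {
        System.out.println(hasNoAirConditioning("ABS, no air conditioning, electric windows"));
        System.out.println(hasNoAirConditioning("ABS, mono air conditioning"));
    }
}
```

The word boundary (\b) is what rejects “mono air conditioning”, and \s+ accepts tabs or repeated spaces between the tokens. Note that the match is case sensitive as written.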

Why not use String.indexOf() instead?

While this would be much faster than using regex, we would still have to handle the start of the string, patterns such as “mono air conditioning”, and tabs or multiple space characters between our pattern tokens. Such custom implementations may be faster, but they are less flexible and take more time to implement.
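To make the trade-off concrete, here is a rough sketch of both a naive indexOf check and a hand-written scanner. All names are hypothetical, and the scanner only approximates the regex: it uses Character.isLetterOrDigit for the word boundary and Character.isWhitespace for the separators, which are close to but not identical with \b and \s.

```java
class ManualPhraseScanner {
    private static final String[] TOKENS = {"no", "air", "conditioning"};

    // Naive check: also accepts "mono air conditioning" and misses repeated spaces.
    static boolean naiveContains(String text) {
        return text.indexOf("no air conditioning") >= 0;
    }

    // Hand-written scan: try every occurrence of "no" that starts a word.
    static boolean containsNoAirConditioning(String text) {
        int from = 0;
        while (true) {
            int start = text.indexOf(TOKENS[0], from);
            if (start < 0) {
                return false;
            }
            if (startsAtWordBoundary(text, start) && matchesTokensAt(text, start)) {
                return true;
            }
            from = start + 1;
        }
    }

    // True at the start of the text or right after a non-alphanumeric character.
    private static boolean startsAtWordBoundary(String text, int index) {
        return index == 0 || !Character.isLetterOrDigit(text.charAt(index - 1));
    }

    // Checks the tokens in order, requiring at least one whitespace character between them.
    private static boolean matchesTokensAt(String text, int pos) {
        for (int i = 0; i < TOKENS.length; i++) {
            if (!text.startsWith(TOKENS[i], pos)) {
                return false;
            }
            pos += TOKENS[i].length();
            if (i < TOKENS.length - 1) {
                int afterWhitespace = pos;
                while (afterWhitespace < text.length()
                        && Character.isWhitespace(text.charAt(afterWhitespace))) {
                    afterWhitespace++;
                }
                if (afterWhitespace == pos) {
                    return false;
                }
                pos = afterWhitespace;
            }
        }
        return true;
    }
}
```

The scanner takes several dozen lines to cover what the one-line regex expresses, which is the flexibility cost mentioned above.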

Conclusion

Regex is a fine tool for pattern matching, but you must not take it for granted, since small changes may yield big differences. The first regex was counterproductive because of catastrophic backtracking, a phenomenon that every developer should be aware of before starting to write regular expressions.
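As a safeguard against a runaway match, one commonly used trick is to wrap the input in a CharSequence that checks a deadline on every character access. The sketch below is not part of the original fix; the class names, the exception type, and the time budget are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper: aborts matching once the deadline has passed.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence delegate;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence delegate, long deadlineNanos) {
        this.delegate = delegate;
        this.deadlineNanos = deadlineNanos;
    }

    @Override
    public char charAt(int index) {
        // The regex engine reads the input through charAt, so this runs on every step.
        if (System.nanoTime() - deadlineNanos > 0) {
            throw new IllegalStateException("Regex matching exceeded its time budget");
        }
        return delegate.charAt(index);
    }

    @Override
    public int length() {
        return delegate.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        // Sub-sequences share the same deadline.
        return new DeadlineCharSequence(delegate.subSequence(start, end), deadlineNanos);
    }

    @Override
    public String toString() {
        return delegate.toString();
    }
}

final class GuardedRegex {
    // Runs find() and throws IllegalStateException if the budget is exhausted.
    static boolean find(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).find();
    }
}
```

A call such as GuardedRegex.find(pattern, dealerInfo, 200) would then fail fast instead of pinning a CPU core. The extra System.nanoTime() call per character adds overhead, so this is a safety net and not a substitute for fixing the pattern.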
 

Reference: The regex that broke a server from our JCG partner Vlad Mihalcea at Vlad Mihalcea’s Blog.
3 Responses to "The regex that broke a server"

  1. Robby says:

    A great resource for regexp related things: http://swtch.com/~rsc/regexp/

    It specifically talks about the backtrack problem in the first article.

  2. It’s no wonder that regex took down a server.
