About Yifan Peng

Yifan is a senior CIS PhD student at the University of Delaware. His main research interests include relation extraction, sentence simplification, text mining, and natural language processing. He does not consider himself a Java geek, but all of his projects are written in Java.

Using regex to hanging indent a paragraph in Java

This post shows how to apply a hanging indent to a long paragraph using a regular expression. The method respects word boundaries, which means it will not break words when indenting. To illustrate the problem, consider the following example:

There has been an increasing effort in recent years to extract relations between entities from natural language text. In this dissertation, I will focus on various aspects of recognizing biomedical relations between entities reported in scientific articles.

The output should be:

There has been an increasing effort in recent years to extract relations between
  entities from natural language text. In this dissertation, I will focus on
  various aspects of recognizing biomedical relations between entities reported
  in scientific articles.

My method

We need a regular expression that breaks the paragraph into a sequence of lines of bounded length. Suppose the text width is 80 and the indent is 3. Then the first line holds up to 80 characters, and every following line holds up to 77 characters after its indent.

The algorithm has two main steps:

  1. Take the first line, which is up to 80 characters long
  2. For the remaining text, replace each splitting point with a newline followed by three spaces

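To find the splitting points, we use the regular expression (.{1,77})\s+. This regex matches a substring of at most 77 characters that is followed by one or more whitespace characters. Because the quantifier is greedy, each match takes as many characters as it can while still ending at a whitespace boundary. We then replace the whole match with three spaces, the captured group ($1), and a newline. Therefore, the Java code should look like this: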

String regex = "(.{1,77})\\s+";
String replacement = "   $1\n";
// Strings are immutable, so the result must be assigned back.
text = text.replaceAll(regex, replacement);
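
As a rough illustration of the snippet above, the following sketch applies the same pattern with a much smaller line width so that the result is easy to check by hand. The class name RegexWrapSketch, the helper wrapAll, and the sample string are made up for this illustration. The input deliberately ends with a space, because the pattern requires whitespace after every line.

```java
final class RegexWrapSketch {

    // Hypothetical helper: turns text into lines of at most lineWidth
    // characters, each prefixed with `indent` spaces.
    // Assumes the text ends with a whitespace character.
    static String wrapAll(String text, int lineWidth, int indent) {
        StringBuilder spaces = new StringBuilder();
        for (int i = 0; i < indent; i++) {
            spaces.append(' ');
        }
        String regex = "(.{1," + lineWidth + "})\\s+";
        String replacement = spaces + "$1\n";
        return text.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        // Prints three lines, each indented by three spaces:
        //    aaa bbb
        //    ccc ddd
        //    eee
        System.out.print(wrapAll("aaa bbb ccc ddd eee ", 7, 3));
    }
}
```

Note that every line, including the first, is indented here. Removing the indent from the first line is a separate step.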

This regex works perfectly except for the last line. If the given text doesn’t end with a whitespace character such as \n, the last line will not be handled correctly. Suppose the last line is

in scientific articles.

In the last search, the regex cannot find whitespace at the end of the line, so it backtracks to the space between “scientific” and “articles”. As a result, we get

...
   in scientific
articles.

To overcome this problem, I append a fake “\n” to the end of the paragraph and remove it after formatting.
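
The effect of the fake newline can be reproduced with a small sketch. The class SentinelSketch and its two methods are hypothetical names used only for this illustration, and the example assumes a single-line input without embedded line breaks.

```java
final class SentinelSketch {

    // Applies the wrapping regex as is. The last line is split in the
    // wrong place when the text does not end with whitespace.
    static String wrap(String text, int lineWidth, int indent) {
        StringBuilder spaces = new StringBuilder();
        for (int i = 0; i < indent; i++) {
            spaces.append(' ');
        }
        String regex = "(.{1," + lineWidth + "})\\s+";
        return text.replaceAll(regex, spaces + "$1\n");
    }

    // Appends a sentinel "\n" before wrapping and removes the trailing
    // newline afterwards.
    static String wrapWithSentinel(String text, int lineWidth, int indent) {
        String wrapped = wrap(text + "\n", lineWidth, indent);
        return wrapped.substring(0, wrapped.length() - 1);
    }

    public static void main(String[] args) {
        String lastLine = "in scientific articles.";
        // Without the sentinel, "articles." falls onto its own unindented line.
        System.out.println(wrap(lastLine, 77, 3));
        // With the sentinel, the line stays in one piece.
        System.out.println(wrapWithSentinel(lastLine, 77, 3));
    }
}
```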

The other parts of the code are trivial. Here is my source code. I use the Apache Commons Lang library to generate the indent spaces and to validate the arguments. For more recent code, you can check my GitHub.

import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.Validate;

  /**
   * Formats a paragraph so that all lines but the first are indented.
   * When considerSpace is true, no word may be longer than
   * width - hangIndent characters.
   *
   * @param text text to be formatted; a single paragraph without line breaks
   * @param hangIndent hanging indentation. hangIndent >= 0
   * @param width the width of the formatted paragraph
   * @param considerSpace true if lines are split only at white spaces
   * @return the paragraph with a hanging indent
   */
  public static String hangIndent(String text, int hangIndent, int width,
      boolean considerSpace) {
    Validate.isTrue(
        hangIndent >= 0,
        "hangIndent should not be negative: %d",
        hangIndent);
    Validate.isTrue(width >= 0, "text width should not be negative: %d",
        width);
    Validate.isTrue(
        hangIndent < width,
        "hangIndent should be less than width: "
        + "hangIndent=%d, width=%d",
        hangIndent,
        width);

    // Nothing to wrap if the text is not longer than the indent.
    if (text.length() <= hangIndent) {
      return text;
    }
    // Set the first hangIndent characters aside, so that the first line
    // can be wrapped like every other line and restored afterwards.
    StringBuilder sb = new StringBuilder(text.substring(0, hangIndent));
    text = text.substring(hangIndent);
    if (considerSpace) {
      // Sentinel needed to handle the last line correctly.
      // It is trimmed at the end.
      text += "\n";
    }
    // hang indent
    String spaces = StringUtils.repeat(' ', hangIndent);
    String replacement = spaces + "$1\n";
    String regex = "(.{1," + (width - hangIndent) + "})";
    if (considerSpace) {
      regex += "\\s+";
    }
    text = text.replaceAll(regex, replacement);
    // remove the indent of the first line and the last "\n"
    text = text.substring(hangIndent, text.length() - 1);
    return sb.append(text).toString();
  }

Related work

There are many other ways to implement hanging indentation. The simplest appears to be breaking the paragraph into words first, then using a counter to track the length of the current line. Whenever adding the next word would exceed the maximum length, we add a newline and the indent.
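
A minimal sketch of this word-by-word approach might look like the following. It is not taken from the original project: the class WordWrapSketch and the method hangIndentByWords are hypothetical names, and the sketch assumes that words are separated by whitespace and that a word longer than the width is simply left on a line of its own.

```java
final class WordWrapSketch {

    // Hypothetical helper: wraps text at word boundaries so that no line
    // exceeds `width` characters (unless a single word is longer than that),
    // and indents every line but the first by `hangIndent` spaces.
    static String hangIndentByWords(String text, int hangIndent, int width) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < hangIndent; i++) {
            indent.append(' ');
        }

        StringBuilder out = new StringBuilder();
        int lineLength = 0;          // characters on the current line so far
        boolean lineHasWord = false; // whether the current line holds a word

        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input produces a single empty token
            }
            // Start a new indented line if the word does not fit.
            if (lineHasWord && lineLength + 1 + word.length() > width) {
                out.append('\n').append(indent);
                lineLength = hangIndent;
                lineHasWord = false;
            }
            if (lineHasWord) {
                out.append(' ');
                lineLength++;
            }
            out.append(word);
            lineLength += word.length();
            lineHasWord = true;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "aaa bbb", then "ccc", "ddd" and "eee" on their own
        // lines, each indented by two spaces.
        System.out.println(hangIndentByWords("aaa bbb ccc ddd eee", 2, 7));
    }
}
```

Unlike the regex version, this variant needs no sentinel character, because the end of the input is handled by the loop itself.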

I am not sure which way is more efficient, but the non-regular-expression way is certainly easier to implement and maintain. So I guess the main point of this post is to learn something “new”.
 

Reference: Using regex to hanging indent a paragraph in Java from our JCG partner Yifan Peng at the PGuru blog.