About Yifan Peng

Yifan is a senior CIS PhD student in University of Delaware. His main researches include relation extraction, sentence simplification, text mining and natural language processing. He does not consider himself of a Java geek, but all his projects are written in Java.

How to read CSV files in Java – A case study of Iterator and Decorator

In this post, I will talk about how to read CSV (Comma-separated values) files using Apache Common CSV. From this case study, we will learn how to use Iterator and Decorator in context of design pattern to improve the reusability in different situations. But before we get started, I guess I have to answer two questions first.
 
 
 
 
 
 
 

  1. Why do I need a third party library if there are more than enough DIY posts talking about how to read CSV files?
    It is true that when you google “java csv parser”, you will get several related posts. But even if you are a beginner, you won’t be satisfied with these shallow methods. Of course using BufferedReader and String.split() will successfully parse a typical CSV file, but you won’t learn ANYTHING from it except making redundant. On the other hand, like what I will show below, using and studying Apache Common CSV will teach you several topics in Design Pattern, for instance iterator and decorator.
  2. Why Apache Common CSV, not others?
    As far as I know, there are several other libraries on Sourceforge or Google code. However, if you look into details of their code, forgive my criticism, none of them are flexible and manageable: some are too simple to meet users various requirements; others are too complicated and painful to use. Furthermore, most of them I’ve come across don’t have commercial-friendly licenses. You know, sometimes, it really scares users off.

Apache Common CSV is still in sandbox, which means there are currently no official download and stable release. But nightly builds may be available.

Using Iterator to hide underlying representation

Let me begin with a sample CSV file, where each record is located on a separate line, delimited by a line break. The first line is the header containing two names COL1 and COL2 corresponding to the fields in the file. The rest of the file contains three records with fields separated by commas.

COL1,COL2
a,b
c,d
e,f

The code using Apache Common CSV to read this file is:

public void test() throws FileNotFoundException, IOException {
  CSVParser parser = new CSVParser(
      new FileReader("test.csv"), 
      CSVFormat.DEFAULT.withHeader());
  for (CSVRecord record : parser) {
    System.out.printf("%s\t%s\n", 
      record.get("COL1"), 
      record.get("COL2"));
  }
  parser.close();
}

CSVParser is used to parse CSV files according to the specified format. Here I use the default CSVFormat together with setting withHeader() with no argument. This enables the parser to treat the first line of the CSV file as the header and to make the record.get("COL1") valid. CSVParser provides an iterative way of reading records. Here we meet the first design pattern Iterator. It provides a way to access the records of a CSV file sequentially without exposing its underlying representation, like how to skip over comment line and how to map the column name to the field value. For each record, we use CSVRecord.get(String name) to retrieve the field value by its name.

CSVRecord provides different ways to access the field value: by name or by index. If you are not sure the field has a value or is empty, CSVRecord.isSet(String name) can be called before. If you just want to check whether a name has been defined to the parser, call CSVRecord.isMapped(String name) instead.

Using Decorator to allow different behaviors

CSVFormat.DEFAULT or CSVFormat.RFC4180 follows the RFC4180 format. So fields enclosed in double quotes can be handled too, such as

"COL1","COL2"
"a","b"
"c","d"
"e","f"

In RFC4180, fields in a CSV file should be separated by commas. But in general, the library can handle arbitrary delimiter like TAB or space. To make the code reusable, the library provides a way to create your own CSVFormat,

CSVFormat format = CSVFormat.newFormat(',')
    .withQuoteChar('"')
    .withHeader();

The above format is same as the CSVFormat.DEFAULT. Here we encounter another design pattern Decorator, which allows behavior to be added to an individual object, either statically or dynamically, without affecting the behavior of other objects from the same class. In the case of CSVFormat, every withXXX() method returns a new CSVFormat that is equal to the calling one but with one attribute modified. The question here might be why not just return the self-reference this? I think it is because the later way will fail the following code

CSVFormat format = CSVFormat.newFormat(',');
CSVFormat format1 = format.withQuoteChar('"');
CSVFormat format2 = format.withHeader();

If we simply return this, format1 will be equal to format2, which is absolutely now what we are expecting.

CSVFormat provides quite flexible ways of specifying CSV format. Details can be found in its javadoc, which is well documented. We can set the delimiter character, the comment start marker, the quote character, etc. Therefore, for the following CSV file where fields are seperated by TAB and comments are started with #,

COL1    COL2
# comments
a       b
c       d
e       f

We can create a format

CSVFormat format = CSVFormat.newFormat('\t')
    .withCommentStart('#')
    .withIgnoreEmptyLines(true)
    .withNullString("")
    .withHeader();

In summary, Apache Common CSV was started to unify a common and simple interface for reading and writing CSV files under an ASL license. It is still in sandbox, but it is quite flexible to meet different requirements. At last, I would like to emphasize that reading sophisticated codes is really helpful to improve programming skills. Therefore, I would highly recommend you to read this project source code, which is very simple but powerful.
 

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

JPA Mini Book

Learn how to leverage the power of JPA in order to create robust and flexible Java applications. With this Mini Book, you will get introduced to JPA and smoothly transition to more advanced concepts.

JVM Troubleshooting Guide

The Java virtual machine is really the foundation of any Java EE platform. Learn how to master it with this advanced guide!

Given email address is already subscribed, thank you!
Oops. Something went wrong. Please try again later.
Please provide a valid email address.
Thank you, your sign-up request was successful! Please check your e-mail inbox.
Please complete the CAPTCHA.
Please fill in the required fields.

Leave a Reply


× four = 32



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
Do you want to know how to develop your skillset and become a ...
Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you two of our best selling eBooks for FREE!

Get ready to Rock!
You can download the complementary eBooks using the links below:
Close