Google Protocol Buffers in Java

Luis Atencio | June 25th, 2012 | Last Updated: October 22nd, 2012

Overview

Protocol buffers is an open source encoding mechanism for structured data. Developed at Google, it was designed to be language/platform neutral and extensible. In this post, my aim is to cover the basic use of protocol buffers in the context of the Java platform.

Protobuffs are faster and simpler than XML and more compact than JSON. Currently, Google supports C++, Java, and Python. Other platforms are supported by open source projects outside Google. I tried a PHP implementation, but it wasn’t fully developed, so I stopped using it; nonetheless, support is catching on. With Google announcing support for PHP in Google App Engine, I believe they will take this to the next level.

Basically, you define how you want your data to be structured once using a .proto specification file. This is analogous to an IDL file or a specification language to describe a software component. This file is consumed by the protocol buffer compiler (protoc) which will generate supporting methods so that you can write and read objects to and from a variety of streams.

The message format is very straightforward. Each message type has one or more uniquely numbered fields (we’ll see why later). Nested message types have their own set of uniquely numbered fields. Value types can be numbers, booleans, strings, bytes, collections, and enumerations (inspired by the Java enum). You can also nest other message types, allowing you to structure your data hierarchically in much the same way JSON allows you to.

Fields can be specified as optional, required, or repeated. Don’t let the type of the field (e.g., enum, int32, float, string) confuse you when implementing protocol buffers in Python. The types are just hints to protoc about how to serialize a field’s value and produce the encoded format of your message (more on this later). The encoded format looks like a flattened and compressed representation of your object. You would write this specification the exact same way whether you are using protocol buffers in Python, Java, or C++.

Protobuffs are extensible: you can update the structure of your objects at a later time without breaking programs that use the old format. If you wanted to send data over the network, you would encode the data using the Protocol Buffer API and then send the resulting bytes.

This notion of extensibility is a rather important one, since Java’s built-in serialization, and many other serialization mechanisms for that matter, can have issues with interoperability and backwards compatibility. With protocol buffers, you don’t have to worry about maintaining a serialVersionUID field in your code that represents the structure of an object. Maintaining this field is essential because Java’s serialization mechanism uses it as a quick version check when deserializing objects. As a result, once you have serialized your objects into some file system, or perhaps a blob store, it is risky to make drastic changes to your object structure at a later time. Protocol buffers suffer less from this. So long as you only add optional fields to your objects, you will be able to deserialize old types, at which point you will probably upgrade them.
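
For contrast, here is a minimal sketch of the built-in Java serialization that the paragraph above refers to. LegacyUser and JavaSerializationDemo are hypothetical names made up for this illustration; only the java.io classes are real. It shows where serialVersionUID lives and how the declared value can be read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical POJO for illustration only; it is not part of the protobuf example.
final class LegacyUser implements Serializable {
    // Written into every serialized stream and compared against the class on read.
    private static final long serialVersionUID = 1L;

    final int id;
    final String name;

    LegacyUser(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class JavaSerializationDemo {
    // Writes the object with Java's built-in serialization.
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads it back; a serialVersionUID mismatch would surface here as an
    // InvalidClassException (an IOException, so it is wrapped below).
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the version id that the class declares.
    static long versionOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        LegacyUser copy = (LegacyUser) deserialize(serialize(new LegacyUser(1, "luisat")));
        System.out.println(copy.id + " " + copy.name
                + " (serialVersionUID " + versionOf(LegacyUser.class) + ")");
    }
}
```

If the class changed shape and its serialVersionUID were bumped, previously stored bytes would be rejected on read, which is the maintenance burden that adding optional protobuf fields avoids.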

Furthermore, you can define a package name for the generated classes with the java_package option in your .proto file. This is a nice way to avoid name collisions in the generated code. Another alternative is to explicitly name the generated class with java_outer_classname, as I did in my example below. I prefixed my generated class name with “Proto” to indicate that it is a generated class.

Here’s a simple message specification describing a User with an embedded Address message (User.proto):

option java_outer_classname = "ProtoUser";

message User {
   required int32  id = 1;  // DB record ID
   required string name = 2;
   required string firstname = 3;
   required string lastname = 4;
   required string ssn = 5;

   // Embedded Address message spec
   message Address {
      required int32 id = 1;
      required string country = 2 [default = "US"];
      optional string state = 3;
      optional string city = 4;
      optional string street = 5;
      optional string zip = 6;

      enum Type {
         HOME = 0;
         WORK = 1;
      }

      optional Type addrType = 7 [default = HOME];
   }

   repeated Address addr = 16;
}

Let’s talk a bit about the tag numbers you see to the right of each property, since they are very important. These tags identify each field in the binary representation of a message, which is why they must be unique within a message type. Fields tagged 1 through 15 take 1 byte to encode the tag, whereas fields tagged 16 through 2047 take 2 bytes. The reason is that the tag is stored together with a 3-bit wire type as a variable-length integer (varint), and a single varint byte carries only 7 bits of payload, which leaves 4 bits for the tag. Google recommends you use tags 1 through 15 for very frequently occurring data and also reserve some tag values in this range for any future updates.
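
To see where the 1-byte and 2-byte boundaries come from, here is a minimal sketch that computes the encoded size of a field key. FieldKeySize and its method names are my own for illustration and are not part of the protobuf API; the key layout (field number shifted left by 3, combined with the wire type, written as a varint) follows the protobuf encoding.

```java
final class FieldKeySize {
    // Wire type 0 is the varint wire type, used by int32 fields.
    static final int WIRE_TYPE_VARINT = 0;

    // A field's key is (field number << 3) | wire type.
    static int key(int fieldNumber, int wireType) {
        return (fieldNumber << 3) | wireType;
    }

    // Number of bytes a non-negative value occupies as a base-128 varint:
    // each byte carries 7 bits of payload.
    static int varintSize(int value) {
        int size = 1;
        while ((value >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    // Encoded size of the key for a given field number.
    static int keySize(int fieldNumber) {
        return varintSize(key(fieldNumber, WIRE_TYPE_VARINT));
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 15, 16, 2047, 2048}) {
            System.out.println("field " + n + " -> " + keySize(n) + " byte(s)");
        }
    }
}
```

Fields 1 and 15 come out at 1 byte, 16 and 2047 at 2 bytes, and 2048 at 3 bytes.
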
Note: You cannot use numbers 19000 through 19999. They are reserved for the protobuff implementation. Also, you can define fields to be required, repeated, and optional. From the Google documentation:

required: a well-formed message must have exactly one of this field, i.e., trying to build a message with a required field uninitialized will throw a RuntimeException.
optional: a well-formed message can have zero or one of this field (but not more than one).
repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.

The documentation warns developers to be cautious about using required, as this type of field will cause problems if you ever decide to deprecate one. This is a classical backwards compatibility problem that all serialization mechanisms suffer from. Google engineers even recommend using optional for everything.

Furthermore, I specified a nested message specification, Address. I could have just as easily placed this definition outside the User object in the same proto file. For related message definitions it makes sense to have them all in the same .proto file. Even though the Address message type is not a very good example of this, I would go with a nested type if a message type does not make sense outside of its ‘parent’ object. For instance, if you wanted to serialize a Node of a LinkedList, then Node would be an embedded message definition. It’s up to you and your design.

Optional message properties take on default values when they are left out. If no explicit default is declared, a type-specific default value is used instead: for strings, the default value is the empty string; for bools, the default value is false; for numeric types, the default value is zero; for enums, the default value is the first value listed in the enum’s type definition (this is pretty cool but not so obvious).

Enumerations are pretty nice. They work cross-platform in much the same way as enum works in Java. An enum field can hold only one of the declared values. You can declare enumerations inside a message definition or outside, as if they were their own independent entities. If specified inside a message type, you can use an enumeration from another message type via [Message-name].[enum-name].

Protoc

When running the protocol buffer compiler against a .proto file, the compiler will generate code for the chosen language. It will convert your message types into augmented classes providing, among other things, getters and setters for your properties. The compiler also generates convenience methods to serialize messages to and from streams and strings.

In the case of an enum type, the generated code will have a corresponding enum for Java or C++, or a special EnumDescriptor class for Python that’s used to create a set of symbolic constants with integer values in the runtime-generated class.

For Java, the compiler will generate .java files with fluent Builder classes for each message type to streamline object creation and initialization. The message classes generated by the compiler are immutable; once built, they cannot be changed.
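
The builder idea can be sketched by hand without protoc. What follows is a simplified stand-in, not the generated code: SketchAddress is a hypothetical class, and it throws IllegalStateException where the generated builder would throw its own RuntimeException for a missing required field.

```java
// Hand-written stand-in for illustration; real generated classes are far larger.
final class SketchAddress {
    private final int id;
    private final String city;

    // Instances are created only through the builder and have no setters,
    // so they cannot change once built.
    private SketchAddress(Builder builder) {
        this.id = builder.id;
        this.city = builder.city;
    }

    int getId() { return id; }
    String getCity() { return city; }

    static Builder newBuilder() { return new Builder(); }

    static final class Builder {
        private int id;
        private boolean hasId;     // tracks whether the required field was set
        private String city = "";  // an optional string defaults to the empty string

        // Each setter returns the builder so calls can be chained fluently.
        Builder setId(int value) { id = value; hasId = true; return this; }
        Builder setCity(String value) { city = value; return this; }

        // True once every required field has a value.
        boolean isInitialized() { return hasId; }

        SketchAddress build() {
            if (!isInitialized()) {
                throw new IllegalStateException("missing required field: id");
            }
            return new SketchAddress(this);
        }
    }

    public static void main(String[] args) {
        SketchAddress address = SketchAddress.newBuilder().setId(1).setCity("Weston").build();
        System.out.println(address.getId() + " " + address.getCity());
    }
}
```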

You can read about other platforms (Python, C++) in the resources section, with details on field encodings here:

https://developers.google.com/protocol-buffers/docs/reference/overview

For our example, we will invoke protoc with the --java_out command line flag. This flag tells the compiler the output directory for the generated Java classes, with one Java class for each proto file.

API

The generated API provides support for the following convenience methods:

isInitialized()
toString()
mergeFrom(…)
clear()

For parsing and serialization:

byte[] toByteArray()
parseFrom()
writeTo(OutputStream): used in the sample code to encode
parseFrom(InputStream): used in the sample code to decode

Sample Code

Let’s set up a simple project. I like to follow the Maven default archetype:

protobuff-example/src/main/java/ [Application Code]
protobuff-example/src/main/java/gen [Generated Proto Classes]
protobuff-example/src/main/proto [Proto file definitions]

To generate the protocol buffer classes, I will execute the following command:

     #  protoc --proto_path=/home/user/workspace/eclipse/trunk/protobuff/ \
               --java_out=/home/user/workspace/eclipse/trunk/protobuff/src/main/java \
               /home/user/workspace/eclipse/trunk/protobuff/src/main/proto/User.proto

I will show some pieces of the generated code and speak about them briefly. The generated class is quite large, but it’s straightforward to understand. It will provide builders to create instances of User and Address.

public final class ProtoUser {

   public interface UserOrBuilder
         extends com.google.protobuf.MessageOrBuilder {
      ...
   }

   public interface AddressOrBuilder
         extends com.google.protobuf.MessageOrBuilder {
      ...
   }

   ...
}

The generated class contains Builder interfaces that make for really fluent object creation. These builder interfaces have getters and setters for each property specified in our proto file, such as:

public String getCountry() {
   java.lang.Object ref = country_;
   if (ref instanceof String) {
      return (String) ref;
   } else {
      com.google.protobuf.ByteString bs =
            (com.google.protobuf.ByteString) ref;
      String s = bs.toStringUtf8();
      if (com.google.protobuf.Internal.isValidUtf8(bs)) {
         country_ = s;
      }
      return s;
   }
}

Since this is a custom encoding mechanism, string fields get a custom byte wrapper. Our simple String field is held as a ByteString after parsing; the getter decodes it into a UTF-8 String on first access and, if the bytes are valid UTF-8, caches the decoded String in the field.
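
The same decode-on-first-access pattern can be sketched with plain JDK types. LazyStringField is a hypothetical class that uses byte[] in place of ByteString; unlike the generated getter, this sketch caches the decoded String unconditionally rather than only when the bytes are valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for the generated getter, using byte[] instead of ByteString.
final class LazyStringField {
    // Holds either the raw UTF-8 bytes (as parsed) or the decoded String (after first access).
    private Object country;

    LazyStringField(byte[] utf8Bytes) {
        this.country = utf8Bytes.clone();
    }

    String getCountry() {
        Object ref = country;
        if (ref instanceof String) {
            return (String) ref;       // already decoded on an earlier call
        }
        String decoded = new String((byte[]) ref, StandardCharsets.UTF_8);
        country = decoded;             // cache so later calls skip the decoding step
        return decoded;
    }

    // Exposed only so the caching can be observed in this sketch.
    boolean isDecoded() {
        return country instanceof String;
    }

    public static void main(String[] args) {
        LazyStringField field = new LazyStringField("US".getBytes(StandardCharsets.UTF_8));
        System.out.println(field.isDecoded());   // false: still raw bytes
        System.out.println(field.getCountry());  // US
        System.out.println(field.isDecoded());   // true: cached as a String
    }
}
```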

// required int32 id = 1;
public static final int ID_FIELD_NUMBER = 1;
private int id_;
public boolean hasId() {
   return ((bitField0_ & 0x00000001) == 0x00000001);
}

Here we see the tag number we spoke of at the beginning, kept as the constant ID_FIELD_NUMBER. The bit mask in hasId() is a separate thing: bitField0_ holds one bit per field and records whether that field has been set, so it tracks presence rather than where the data is located in the byte string. Next we see snippets of the write and read methods I mentioned earlier.
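
As a rough illustration of the presence bits, here is a small hand-written class that mimics the pattern. PresenceBits is a made-up name and the layout is simplified; it only shows how one int can record which fields have been assigned.

```java
// Simplified imitation of the has-bit pattern seen in the generated code.
final class PresenceBits {
    private static final int HAS_ID = 0x00000001;
    private static final int HAS_COUNTRY = 0x00000002;

    private int bitField0;           // one bit per field: set means "has a value"
    private int id;
    private String country = "US";   // declared default, returned while the bit is unset

    void setId(int value) {
        bitField0 |= HAS_ID;         // mark the field as present
        id = value;
    }

    void setCountry(String value) {
        bitField0 |= HAS_COUNTRY;
        country = value;
    }

    boolean hasId() { return (bitField0 & HAS_ID) == HAS_ID; }
    boolean hasCountry() { return (bitField0 & HAS_COUNTRY) == HAS_COUNTRY; }

    int getId() { return id; }
    String getCountry() { return country; }

    public static void main(String[] args) {
        PresenceBits address = new PresenceBits();
        address.setId(1);
        System.out.println(address.hasId());       // true
        System.out.println(address.hasCountry());  // false, getCountry() still yields the default
        System.out.println(address.getCountry());  // US
    }
}
```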

Writing an instance to an output stream:

public void writeTo(com.google.protobuf.CodedOutputStream output)
      throws java.io.IOException {
   getSerializedSize();
   if (((bitField0_ & 0x00000001) == 0x00000001)) {
      output.writeInt32(1, id_);
   }
   if (((bitField0_ & 0x00000002) == 0x00000002)) {
      output.writeBytes(2, getCountryBytes());
   }
   ....
}

Reading from an input stream:

public static ProtoUser.User parseFrom(java.io.InputStream input)
      throws java.io.IOException {
    return newBuilder().mergeFrom(input).buildParsed();
}

This class is about 2000 lines of code. There are other details such as how Enum types are mapped and how repeated types are stored. Hopefully, the snippets that I provided give you a high level idea of the structure of this class.
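
To make the write and read snippets above a little more concrete, here is a hand-rolled sketch of what happens at the byte level for two Address fields, id (tag 1) and zip (tag 6). WireSketch is a hypothetical class of my own, it handles only non-negative values and two wire types, and it is not how you would use the library in practice. It also shows why adding optional fields is safe: a reader that only knows about id skips the field it does not recognize.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class WireSketch {
    static final int WIRE_VARINT = 0;
    static final int WIRE_LENGTH_DELIMITED = 2;

    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVarint(ByteArrayInputStream in) {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                throw new IllegalStateException("truncated varint");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Writes id (field 1, varint) followed by zip (field 6, length-delimited UTF-8).
    static byte[] write(int id, String zip) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (1 << 3) | WIRE_VARINT);
        writeVarint(out, id);
        byte[] utf8 = zip.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, (6 << 3) | WIRE_LENGTH_DELIMITED);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // An "old" reader that only understands field 1 and skips everything else.
    static int readIdOnly(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int id = 0;
        while (in.available() > 0) {
            int key = readVarint(in);
            int fieldNumber = key >>> 3;
            int wireType = key & 0x7;
            if (fieldNumber == 1 && wireType == WIRE_VARINT) {
                id = readVarint(in);
            } else if (wireType == WIRE_VARINT) {
                readVarint(in);                 // unknown varint field: read and discard
            } else if (wireType == WIRE_LENGTH_DELIMITED) {
                in.skip(readVarint(in));        // unknown length-delimited field: skip payload
            } else {
                throw new IllegalStateException("wire type not handled in this sketch: " + wireType);
            }
        }
        return id;
    }

    public static void main(String[] args) {
        byte[] encoded = write(1, "90210");
        System.out.println(encoded.length + " bytes, id = " + readIdOnly(encoded));
    }
}
```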

Let’s take a look at some application level code for using the generated class. To persist the data, we can simply do:

// User and Address below are the generated classes nested in ProtoUser
// (ProtoUser.User and ProtoUser.User.Address); FileOutputStream is java.io.FileOutputStream.

// Create an instance of Address
Address addr = ProtoUser.User.Address.newBuilder()
      .setAddrType(Address.Type.HOME)
      .setCity("Weston")
      .setCountry("USA")
      .setId(1)
      .setState("FL")
      .setStreet("123 Lakeshore")
      .setZip("90210")
      .build();

// Create an instance of User
User user = ProtoUser.User.newBuilder()
      .setId(1)
      .setFirstname("Luis")
      .setLastname("Atencio")
      .setName("luisat")
      .setSsn("555-555-5555")
      .addAddr(addr)
      .build();

// Serialize the User to a file; the stream is closed automatically
try (FileOutputStream output = new FileOutputStream("target/user.ser")) {
   user.writeTo(output);
}

Once persisted, we can read as such:

// Read the message back from the file written above
try (FileInputStream input = new FileInputStream("target/user.ser")) {
   User user = User.parseFrom(input);
   System.out.println(user);
}

To run the sample code, use:

java -cp .:../lib/protobuf-java-2.4.1.jar app.Serialize ../target/user.ser

Protobuff vs XML

Google claims that protocol buffers are 20 to 100 times faster (in nanoseconds) than XML and 3 to 10 times smaller, even with whitespace removed from the XML. However, until there is support and adoption in all platforms (not just the aforementioned 3), XML will continue to be a very popular serialization mechanism. In addition, not everyone has the performance requirements and expectations that Google users have. An alternative to XML is JSON.

Protobuff vs JSON

I did some comparison testing to evaluate using Protocol buffers over JSON. The results were quite dramatic: a simple test reveals that the protobuff encoding takes less than half the storage of the JSON encoding. I created a simple POJO version of my User-Address classes and used the GSON library to encode an instance with the same state as the example above (I will omit implementation details; please check the gson project referenced below). Encoding the same user data, I got:

                    
-rw-rw-r-- 1 luisat luisat 206 May 30 09:47 json-user.ser 
-rw-rw-r-- 1 luisat luisat 85 May 30 09:42  user.ser

Which is remarkable. I also found this in another blog (see resources below):

It’s definitely worth a read.

Conclusion and Further Remarks

Protocol buffers can be a good solution to cross platform data encoding. With clients written in Java, Python, C++ and many others, storing/sending compressed data is really straightforward.

One tricky point to make is: “Remember REQUIRED is forever.” If you go crazy and make every single field of your .proto file required, then it will be extremely difficult to delete or edit those fields.

Also, as a bit of incentive: protobuffs are used across Google’s data stores. There are 48,162 different message types defined in the Google code tree across 12,183 .proto files.

Protocol Buffers promote good Object Oriented Design, since .proto files are basically dumb data holders (like structs in C++). According to Google documentation, if you want to add richer behavior to a generated class or you don’t have control over the design of the .proto file, the best way to do this is to wrap the generated protocol buffer class in an application-specific class.

Finally, remember you should never add behaviour to the generated classes by inheriting from them. This will break internal mechanisms and is not good object-oriented practice anyway.
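
The wrapping advice can be sketched as composition rather than inheritance. Both classes below are hypothetical: GeneratedUserStandIn stands in for the generated ProtoUser.User so the example compiles without protoc, and UserProfile is an application-specific wrapper invented for illustration.

```java
import java.util.Objects;

// Stand-in for the generated message class; in a real project this would be
// ProtoUser.User produced by protoc, which must not be subclassed.
final class GeneratedUserStandIn {
    private final String firstname;
    private final String lastname;

    GeneratedUserStandIn(String firstname, String lastname) {
        this.firstname = firstname;
        this.lastname = lastname;
    }

    String getFirstname() { return firstname; }
    String getLastname() { return lastname; }
}

// Application-specific wrapper: it holds the message and adds behaviour,
// instead of extending the generated class.
final class UserProfile {
    private final GeneratedUserStandIn message;

    UserProfile(GeneratedUserStandIn message) {
        this.message = Objects.requireNonNull(message, "message");
    }

    // Richer behaviour lives here, outside the generated code.
    String fullName() {
        return message.getFirstname() + " " + message.getLastname();
    }

    // Hands the underlying message back when it needs to be serialized.
    GeneratedUserStandIn toMessage() {
        return message;
    }

    public static void main(String[] args) {
        UserProfile profile = new UserProfile(new GeneratedUserStandIn("Luis", "Atencio"));
        System.out.println(profile.fullName());
    }
}
```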

A lot of the information presented here comes from personal experience, other resources, and, most importantly, Google’s developer documentation. Please check out the documentation in the resources section.

Resources

Reference: Java Protocol Buffers from our JCG partner Luis Atencio at the Reflective Thought blog.