Monday, 8 November 2010

Java Best Practices – Char to Byte and Byte to Char conversions


Continuing our series of articles on proposed practices for working with the Java programming language, we are going to talk about String performance tuning. In particular, we will focus on how to handle character to byte and byte to character conversions efficiently when the default encoding is used. This article concludes with a performance comparison between two proposed custom approaches and two classic ones (the "String.getBytes()" and the NIO ByteBuffer) for converting characters to bytes and vice versa.

All discussed topics are based on use cases derived from the development of mission critical, ultra high performance production systems for the telecommunication industry.

Before reading each section of this article, it is highly recommended that you consult the relevant Java API documentation for detailed information and code samples.

All tests were performed on a Sony Vaio with the following characteristics :
  • System : openSUSE 11.1 (x86_64)
  • Processor (CPU) : Intel(R) Core(TM)2 Duo CPU T6670 @ 2.20GHz
  • Processor Speed : 1,200.00 MHz
  • Total memory (RAM) : 2.8 GB
  • Java : OpenJDK 1.6.0_0 64-Bit

The following test configuration is applied :
  • Concurrent worker Threads : 1
  • Test repeats per worker Thread : 1000000
  • Overall test runs : 100

Char to Byte and Byte to Char conversions
Character to byte and byte to character conversions are common tasks for Java developers who program against a networking environment, manipulate streams of byte data, serialize String objects, implement communication protocols etc. For that reason Java provides a handful of utilities that enable a developer to convert a String (or a character array) to its byte array equivalent and vice versa.

The “getBytes(charsetName)” operation of the String class is probably the most commonly used method for converting a String into its byte array equivalent. Since every character can be represented differently according to the encoding scheme used, it is no surprise that the aforementioned operation requires a “charsetName” in order to correctly convert the String characters. If no “charsetName” is provided, the operation encodes the String into a sequence of bytes using the platform's default character set.
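
To illustrate how the chosen charset changes the result, here is a minimal sketch. It assumes Java 7 or later for the "StandardCharsets" constants (on Java 6 you would pass a charset name such as "UTF-8" instead), and the class name "CharsetLengths" is only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same String yields byte arrays of
// different length depending on the charset passed to getBytes.
class CharsetLengths {

    // Number of bytes the given text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Three ASCII letters followed by U+00E9 (e with acute accent).
        String text = "caf\u00E9";
        System.out.println("UTF-8:      " + encodedLength(text, StandardCharsets.UTF_8));      // 5
        System.out.println("UTF-16BE:   " + encodedLength(text, StandardCharsets.UTF_16BE));   // 8
        System.out.println("ISO-8859-1: " + encodedLength(text, StandardCharsets.ISO_8859_1)); // 4
        // The no-argument overload uses the platform default charset,
        // so this value can differ from one machine to another.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + text.getBytes().length);
    }
}
```

Only the last line depends on the machine the code runs on, which is why the explicit overloads are usually preferred when bytes leave the JVM.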

Another “classic” approach for converting a character array to its byte array equivalent is by using the ByteBuffer class of the NIO package. An example code snippet for the specific approach will be provided later on.

Both of the aforementioned approaches, although very popular, easy to use and straightforward, perform noticeably worse than more fine-grained methods. Keep in mind that we are not converting between character encodings. For converting between character encodings you should stick with the “classic” approaches using either the “String.getBytes(charsetName)” or the NIO framework methods and utilities.

When all characters to be converted are ASCII characters, a proposed conversion method is the one shown below :
public static byte[] stringToBytesASCII(String str) {
 char[] buffer = str.toCharArray();
 byte[] b = new byte[buffer.length];
 for (int i = 0; i < b.length; i++) {
  b[i] = (byte) buffer[i];
 }
 return b;
}
The resulting byte array is constructed by casting every character value to its byte equivalent. Since we know that all characters are in the ASCII range (0 – 127), each one fits in a single byte. A character outside that range would lose its upper bits in the cast.
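
As a small illustration of why the ASCII precondition matters, the sketch below repeats the cast-based conversion and adds a range check. The class name "AsciiCast" and the helper "isAscii" are hypothetical and not part of the tested methods.

```java
// Hypothetical demo class: the cast-based conversion plus a guard
// that detects characters outside the ASCII range.
class AsciiCast {

    // Same technique as stringToBytesASCII: keep only the low 8 bits.
    static byte[] castToBytes(String str) {
        byte[] b = new byte[str.length()];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) str.charAt(i);
        }
        return b;
    }

    // True when every char of the String is in the range 0..127.
    static boolean isAscii(String str) {
        for (int i = 0; i < str.length(); i++) {
            if (str.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("a test string")); // true
        // The euro sign is U+20AC; the cast keeps only 0xAC, so the
        // original character cannot be recovered from the byte.
        System.out.println(castToBytes("\u20AC")[0]); // -84
    }
}
```

Whether such a check is worth its cost depends on how much you trust the input; in a hot path it could be limited to debug builds or assertions.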

Using the resulting byte array we can convert back to the original String by utilizing the “classic” String constructor “new String(byte[])”.
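
A minimal round trip sketch follows. It names the charset explicitly ("US_ASCII", available through "StandardCharsets" since Java 7) instead of relying on the default one, which is our choice here and not what was measured in the tests. "AsciiRoundTrip" is a hypothetical class name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: ASCII String -> bytes -> String.
class AsciiRoundTrip {

    // Cast-based conversion, valid only for characters 0..127.
    static byte[] encode(String str) {
        byte[] b = new byte[str.length()];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) str.charAt(i);
        }
        return b;
    }

    // Decodes with an explicit charset so the result does not depend
    // on the platform default.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "a test string";
        String restored = decode(encode(original));
        System.out.println(original.equals(restored)); // true
    }
}
```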

For the default character encoding we can use the methods shown below to convert a String to a byte array and vice versa :
public static byte[] stringToBytesUTFCustom(String str) {
 char[] buffer = str.toCharArray();
 byte[] b = new byte[buffer.length << 1];
 for(int i = 0; i < buffer.length; i++) {
  int bpos = i << 1;
  b[bpos] = (byte) ((buffer[i]&0xFF00)>>8);
  b[bpos + 1] = (byte) (buffer[i]&0x00FF);
 }
 return b;
}
Every char in Java occupies 2 bytes. To convert a String to its byte array equivalent, we write every character of the String as its 2 byte representation, high byte first.
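
To make the byte layout concrete, the sketch below compares the two-bytes-per-char output with what "getBytes" produces for the UTF-16BE charset, which writes no byte order mark. The comparison is our own illustration (it assumes Java 7+ for "StandardCharsets"), and "TwoBytesPerChar" is a hypothetical class name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: shows the byte layout produced by splitting
// every char into a high byte and a low byte.
class TwoBytesPerChar {

    // Same masks and shifts as stringToBytesUTFCustom.
    static byte[] toBytes(String str) {
        byte[] b = new byte[str.length() << 1];
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            int bpos = i << 1;
            b[bpos] = (byte) ((c & 0xFF00) >> 8); // high byte first
            b[bpos + 1] = (byte) (c & 0x00FF);    // then low byte
        }
        return b;
    }

    public static void main(String[] args) {
        String text = "a test string";
        byte[] custom = toBytes(text);
        byte[] utf16be = text.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(toBytes("A")));  // [0, 65]
        System.out.println(Arrays.equals(custom, utf16be)); // true
    }
}
```

For well-formed Strings the two arrays match, so a receiver that expects big-endian UTF-16 without a byte order mark could read the output.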

Using the resulting byte array we can convert back to the original String by utilizing the method provided below :
public static String bytesToStringUTFCustom(byte[] bytes) {
 char[] buffer = new char[bytes.length >> 1];
 for(int i = 0; i < buffer.length; i++) {
  int bpos = i << 1;
  char c = (char)(((bytes[bpos]&0x00FF)<<8) + (bytes[bpos+1]&0x00FF));
  buffer[i] = c;
 }
 return new String(buffer);
}
We construct every String character from its 2 byte representation. Using the resulting character array we can convert back to the original String by utilizing the “classic” String constructor “new String(char[])”.
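
The following round trip sketch exercises both directions, including a character outside the Basic Multilingual Plane that Java stores as two chars (a surrogate pair). The class name "TwoByteRoundTrip" and the sample text are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: String -> 2 bytes per char -> String.
class TwoByteRoundTrip {

    static byte[] toBytes(String str) {
        byte[] b = new byte[str.length() << 1];
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            int bpos = i << 1;
            b[bpos] = (byte) ((c & 0xFF00) >> 8);
            b[bpos + 1] = (byte) (c & 0x00FF);
        }
        return b;
    }

    // Rebuilds each char from its high and low byte. A trailing odd
    // byte, if any, is ignored because bytes.length >> 1 rounds down.
    static String fromBytes(byte[] bytes) {
        char[] buffer = new char[bytes.length >> 1];
        for (int i = 0; i < buffer.length; i++) {
            int bpos = i << 1;
            buffer[i] = (char) (((bytes[bpos] & 0x00FF) << 8) + (bytes[bpos + 1] & 0x00FF));
        }
        return new String(buffer);
    }

    public static void main(String[] args) {
        // 'G' followed by U+1D11E (musical G clef), stored as two chars.
        String text = "G\uD834\uDD1E";
        byte[] bytes = toBytes(text);
        System.out.println(bytes.length);                   // 6
        System.out.println(fromBytes(bytes).equals(text));  // true
        // The standard decoder for UTF-16BE reads the same bytes.
        System.out.println(new String(bytes, StandardCharsets.UTF_16BE).equals(text)); // true
    }
}
```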

Last but not least we provide two example methods using the NIO package in order to convert a String to its byte array equivalent and vice versa :
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public static byte[] stringToBytesUTFNIO(String str) {
 char[] buffer = str.toCharArray();
 byte[] b = new byte[buffer.length << 1];
 CharBuffer cBuffer = ByteBuffer.wrap(b).asCharBuffer();
 for(int i = 0; i < buffer.length; i++)
  cBuffer.put(buffer[i]);
 return b;
}
public static String bytesToStringUTFNIO(byte[] bytes) {
 CharBuffer cBuffer = ByteBuffer.wrap(bytes).asCharBuffer();
 return cBuffer.toString();
}
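
One detail worth knowing about the NIO variant is that a ByteBuffer is big-endian unless told otherwise, and the CharBuffer view inherits that order. The sketch below makes the order explicit; "NioByteOrder" is a hypothetical class name and the little-endian case is shown only for contrast.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical demo class: the byte order of the ByteBuffer decides
// how each char is laid out in the backing array.
class NioByteOrder {

    static byte[] toBytes(String str, ByteOrder order) {
        byte[] b = new byte[str.length() << 1];
        // asCharBuffer() takes over the order set on the ByteBuffer,
        // and writes through to the wrapped array.
        ByteBuffer.wrap(b).order(order).asCharBuffer().put(str);
        return b;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toBytes("A", ByteOrder.BIG_ENDIAN)));    // [0, 65]
        System.out.println(Arrays.toString(toBytes("A", ByteOrder.LITTLE_ENDIAN))); // [65, 0]
    }
}
```

With the default order, the NIO methods and the custom methods produce the same byte layout, so their outputs are interchangeable.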

For the final part of this article we provide the performance comparison charts for the aforementioned String to byte array and byte array to String conversion approaches. We have tested all methods using the input string “a test string”.
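
The harness used for these tests is not listed in this article. As a rough sketch only, a TPS figure for one method could be measured as shown below; the names "TpsSketch" and "measureTps" are hypothetical, and a dedicated tool such as JMH would give more reliable numbers because it accounts for JIT warm-up.

```java
// Hypothetical demo class: a very simple throughput measurement.
class TpsSketch {

    static byte[] stringToBytesASCII(String str) {
        byte[] b = new byte[str.length()];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) str.charAt(i);
        }
        return b;
    }

    // Runs the conversion 'repeats' times and returns conversions per second.
    static double measureTps(String input, int repeats) {
        long sink = 0; // consume the results so the loop is not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < repeats; i++) {
            sink += stringToBytesASCII(input).length;
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        if (sink != (long) repeats * input.length()) {
            throw new IllegalStateException("unexpected conversion result");
        }
        return repeats * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        // A few runs of 1000000 repeats each; early runs include warm-up.
        for (int run = 0; run < 5; run++) {
            System.out.printf("run %d: %.0f TPS%n", run, measureTps("a test string", 1000000));
        }
    }
}
```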

First the String to byte array conversion performance comparison chart :


The horizontal axis represents the number of test runs and the vertical axis the average transactions per second (TPS) for each test run. Thus higher values are better. As expected, both the “String.getBytes()” and “stringToBytesUTFNIO(String)” approaches performed poorly compared to the suggested “stringToBytesASCII(String)” and “stringToBytesUTFCustom(String)” approaches. As you can see, our proposed methods achieve an almost 30% increase in TPS compared to the “classic” methods.


Lastly the byte array to String performance comparison chart :


The horizontal axis represents the number of test runs and the vertical axis the average transactions per second (TPS) for each test run. Thus higher values are better. As expected, both the “new String(byte[])” and “bytesToStringUTFNIO(byte[])” approaches performed poorly compared to the suggested “bytesToStringUTFCustom(byte[])” approach. As you can see, our proposed method achieved an almost 15% increase in TPS compared to the “new String(byte[])” method, and an almost 30% increase in TPS compared to the “bytesToStringUTFNIO(byte[])” method.

In conclusion, when you are dealing with character to byte or byte to character conversions and you do not intend to change the encoding used, you can achieve superior performance by utilizing custom, fine-grained methods rather than the “classic” ones provided by the String class and the NIO package. Our proposed approach achieved an overall 45% increase in performance compared to the “classic” approaches when converting the test String to its byte array equivalent and vice versa.


Happy coding


Justin


P.S.

After taking into consideration the proposition from several of our readers to utilize the "String.charAt(int)" operation instead of "String.toCharArray()" to convert the String characters into bytes, I altered our proposed methods and re-executed the tests. As expected, further performance gains were achieved. In particular, an extra 13% average increase in TPS was recorded for the “stringToBytesASCII(String)” method and an extra 2% average increase in TPS was recorded for the “stringToBytesUTFCustom(String)” method. So you should use the altered methods, as they perform even better than the original ones. The updated methods are shown below :
public static byte[] stringToBytesASCII(String str) {
 byte[] b = new byte[str.length()];
 for (int i = 0; i < b.length; i++) {
  b[i] = (byte) str.charAt(i);
 }
 return b;
}
public static byte[] stringToBytesUTFCustom(String str) {
 byte[] b = new byte[str.length() << 1];
 for(int i = 0; i < str.length(); i++) {
  char strChar = str.charAt(i);
  int bpos = i << 1;
  b[bpos] = (byte) ((strChar&0xFF00)>>8);
  b[bpos + 1] = (byte) (strChar&0x00FF); 
 }
 return b;
}
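
As a quick sanity check, which is our own addition and not part of the measured tests, the sketch below confirms that the "charAt" based conversion returns the same bytes as the "toCharArray" based one. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo class: both variants of the two-bytes-per-char
// conversion should return identical arrays.
class CharAtEquivalence {

    // Original variant: copies the chars into a temporary array first.
    static byte[] viaToCharArray(String str) {
        char[] buffer = str.toCharArray();
        byte[] b = new byte[buffer.length << 1];
        for (int i = 0; i < buffer.length; i++) {
            int bpos = i << 1;
            b[bpos] = (byte) ((buffer[i] & 0xFF00) >> 8);
            b[bpos + 1] = (byte) (buffer[i] & 0x00FF);
        }
        return b;
    }

    // Updated variant: reads each char directly, no temporary array.
    static byte[] viaCharAt(String str) {
        byte[] b = new byte[str.length() << 1];
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            int bpos = i << 1;
            b[bpos] = (byte) ((c & 0xFF00) >> 8);
            b[bpos + 1] = (byte) (c & 0x00FF);
        }
        return b;
    }

    public static void main(String[] args) {
        String text = "a test string";
        System.out.println(Arrays.equals(viaToCharArray(text), viaCharAt(text))); // true
    }
}
```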


20 comments:

  1. "When all characters to be converted are ASCII characters"

    Fail.

  2. Java uses UTF-16 strings, which means that any given codepoint can be represented in one or two char variables. ASCII only uses the first 7 bits of an octet (byte), so the valid range of values is 0 to 127. UTF-16 uses identical values for this range (they're just wider). So I do not see how the proposed method fails!

  3. "when the default encoding (UTF-16) is used"

    Whose default encoding? That's certainly not my platform's default encoding. Mine's UTF-8. If you did not check your encoding, then your code is obviously faster since, AFAIK, UTF-16 is used internally.

  4. sv,

    "default encoding" refers to JVM default encoding scheme. The proposed conversion approaches are applicable regardless of the default JVM character encoding. Performance gains follow the same pattern also. The default JVM character encoding scheme is mentioned in the article just as a hint.

    Derived from JDK javadoc : http://download.oracle.com/javase/1.5.0/docs/api/java/nio/charset/Charset.html
    ....
    The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system
    ....
    The native character encoding of the Java programming language is UTF-16. A charset in the Java platform therefore defines a mapping between sequences of sixteen-bit UTF-16 code units and sequences of bytes.

    BRs

  5. Thanks for repeating what I said. I'll restate my comment differently.

    I believe you are comparing String.getBytes(), which returns on my platform a UTF-8 array of bytes, to a method that returns a UTF-16 representation of the string. Thus, the standard jdk approach includes a conversion (unlike what you seem to be stating).

    The default encoding can be set using the system property file.encoding. By changing this property, you will modify the output of String.getBytes(). So if you want to compare apples with apples, please re-execute your test using -Dfile.encoding=UTF16.

    Unless you do that (or use the method getBytes("UTF16")), the JVM will convert the input strings, and ruin your test.

  6. sv, I did not repeat what you said; on the contrary, I clearly stated that "The proposed conversion approaches are applicable regardless of the default JVM character encoding. Performance gains follow the same pattern also". So let me restate my comment differently.

    String.getBytes() will return an array of bytes converted to the default JVM character encoding scheme. In your case this is UTF-8 (the hint given in this article is UTF-16 as this is the native character encoding of the platform).

    The proposed method DOES NOT return a UTF-16 representation of the String. In fact the method just converts (as clearly stated in the relevant section of the article) a character array (returned from the String.toCharArray() operation) to its byte array equivalent. In Java every character type occupies 2 bytes in size but this does not mean that the specific character represents a specific code point in a character encoding scheme. UTF-8 encoding scheme for example, uses up to 4 bytes to represent its code points. Thus in your case it is possible that the character array returned will contain up to two Java character types for a single UTF-8 code point. That's why we provided the custom "reverse" operation to construct the original string from the produced byte array.

    In other words the proposed conversion method just produces an array of bytes from an input string regardless of the string's actual encoding.

    That's the source for the resulted performance gain since String.getBytes() will trigger, in addition to many other internal checks, a character encoding conversion.

    Hope my explanation helped!

    BRs

  7. It's really questionable to call the proposed technique as "best practices".

    Method stringToBytesASCII() is indeed rather useless in large parts of the world, as Sebastien has mentioned correctly.

    And bytesToStringUTFCustom() is - as already noted above - nothing more than just a simple (and thus fast) conversion between 8-bit and 16-bit units, which is the old UCS-2 way and not a valid UTF-16 conversion. Things get tricky if you also want to handle Unicode characters outside the basic multilingual plane correctly (those little bastards from U+10000 up to U+10FFFF represented by surrogate pairs in an unassigned range of the 16-bit space). And last but not least, byte order marks and endianness also have to be considered carefully.

    It's absolutely OK to use the above simplifications in a few performance-critical hot-spots (non-BMP characters are indeed rare), but for the rest of us it's better not to twiddle around with character encodings this way. And finally it's also a best practice to avoid hard-coding the used character encoding.

  8. Ok, I understand what you did. I'm not sure I understand why this is a best practice. It might be a best practice in a very specific domain, and that should be emphasised; I saw the telecom spiel, but it's not very specific.

    One thing that is *not* a best practice is using the default encoding if you are going to convert string to byte for transmission/storage as another machine/jvm might use a different default.

  9. Thank you all for your comments,

    The specific domain cv is talking about, to which this article applies as a best practice, is the need to convert a String to a byte array and vice versa (to send it over the wire for example). That is clearly stated all over this article. What qualifies our proposition as a "best practice" (for the specific domain) is the fact that the aforementioned conversion can be achieved using the "classic" String.getBytes() or the NIO approaches, which have more complex implementations and thus perform worse than our proposed methods.

    This article has nothing to do with character encoding conversions. Our proposed methods do not perform any, nor hard code the used character encoding. As a matter of fact it is clearly stated that when character encoding conversions are to be made you should use the "classic" String.getBytes() method, since it's the safest way to properly convert between encodings.

    BRs

  10. ... and cv you are absolutely right, we should proceed with caution when we are converting a String to its byte array equivalent for transmission/storage to another machine/jvm because it might use a different default.

    Our proposed methods expect same default encodings between JVMs

    BRs

  11. If it's just for transmission/storage, then I see no reason to allocate a byte array for each and every string, for example as done by java.io.DataOutputStream.

  12. using the shifting is what made this difference I guess

  13. Hi there,

    I just skimmed through your article, well mainly the code that is. One question arose:

    Why are you constantly creating a char[] from string (char[] buffer = str.toCharArray())?

    It should be faster to iterate through a string's chars using str.charAt(i) instead of creating and copying an array. While the difference is almost always negligible, it's certainly worth considering in your "best practices".

    Cheers

  14. Have you done any analysis or tests to see how the extra bytes (most of which may be '\0') affect the overall performance?

    I mean, if the goal is:

    string --> bytes --> wire --> bytes --> string

    should we not measure the overall performance, since the extra wire overhead may more than offset any gains in conversion?

  15. Stefan, I will test your proposition and get back to you for the performance results.

    Haam, the proposed methods do not generate any extra bytes. The comparison test is between the byte array generated by the "classic" Java String and NIO methods and our proposed custom operations.

    BRs

  16. Hello all,

    Since the Java native encoding hint (UTF-16) confused several of our readers, I decided to remove it from this post.

    I believe things are more clear now!

    BRs

  17. Interesting article. It provides an alternative to String.getBytes("UTF-16"). Of course, the proposed method doesn't write the BOM, which makes it not a valid UTF-16 byte stream.
    I'm a little surprised that String.getBytes("UTF-16") doesn't perform better, because all it needs to do is System.arraycopy().

    Paul

  18. Stefan and all,

    I have updated the article with a revised implementation of the ASCII and UTF conversion methods so as to utilize the "String.charAt(int)" operation instead of the "String.toCharArray()".

    The conversion tests were re-executed. An average 13% increase in TPS was achieved for the ASCII and a 2% increase in TPS was achieved for the UTF methods.

    BRs

  19. Mehmet Nuri Deveci, Jan 4, 2012 06:19 AM

    Thanks for the article but I have a question.

    I have tested your last stringToBytesUTFCustom method. It works much faster than String.getBytes(Charset). But the problem is, I am using some unicode characters ('\u258E' and '\u0374') and it produces invalid data. For example:

    String.getBytes(UTF-8)    [-30, -106, -114, -51, -76]
    stringToBytesUTFCustom    [37, -114, 3, 116]

    Am I missing something here?

  20. byron kiourtzoglou, Jan 4, 2012 08:51 AM

    Hello Mehmet,

    Our custom method produces a byte array using the default character encoding in Java which utilizes a 2 bytes per character scheme. By using UTF-8 encoding a single character can occupy up to 4 bytes of data. That's why the two methods return a different number of bytes for the same two unicode characters. Nevertheless, if you examine the returned byte array for our stringToBytesUTFCustom() method you will see that the resulting bytes are correct!

    BRs
