Working with GZIP and compressed data

Abstract

We all know what it means to zip a file with zip or gzip.  But using zipped files in Java is not quite as straightforward as you might think, especially if you are working not directly with files but with compressed streaming data.  We’ll go through:

  • how to convert a String into a compressed / zipped byte array and vice versa
  • how to create utility functions for reading and writing files without having to know in advance whether the file or stream is gzip’d or not.

The basics

So why would you want to zip anything?  Quite simply because it is a great way to cut down the amount of data that you have to ship across a network or store to disk, and therefore to speed up the operation.  A typical text file or message can be reduced by a factor of 10 or more, depending on the nature of your document.  Of course you will have to factor in the cost of zipping and unzipping, but when you have a large amount of data these costs are unlikely to be significant.
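
As a rough illustration of the size reduction, the sketch below gzips a made-up, highly repetitive log-style message and prints the sizes before and after. The class name, the helper names and the sample text are assumptions chosen for this example; the ratio you see on real data depends entirely on how repetitive that data is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class CompressionRatioDemo {
    /** Returns the gzip-compressed form of the given bytes. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so close before reading bos.
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Builds a hypothetical, repetitive log-style message with the given number of lines. */
    static byte[] sampleMessage(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2024-01-01 INFO order accepted id=").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMessage(1000);
        byte[] zipped = gzip(raw);
        System.out.printf("raw=%d bytes, gzipped=%d bytes, ratio=%.1f%n",
                raw.length, zipped.length, (double) raw.length / zipped.length);
    }
}
```

Very short or already compressed inputs can come out larger than they went in, because the GZIP header and trailer add a fixed overhead.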

Does Java support this?

Yes, Java supports reading and writing gzip files in the java.util.zip package. It also supports zip files, as well as inflating and deflating data in the format of the popular ZLIB compression library.
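
For the zip-file side of the package, here is a minimal sketch that writes a one-entry zip archive into memory with ZipOutputStream and reads it back with ZipInputStream. The class name, method names and entry name are assumptions made for this example only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipArchiveDemo {
    /** Writes a zip archive holding a single text entry and returns it as bytes. */
    static byte[] zipSingleEntry(String entryName, String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(text.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Reads the first entry of a zip archive and returns it as "name=content". */
    static String readFirstEntry(byte[] archive) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry = zis.getNextEntry();
            if (entry == null) {
                throw new IllegalArgumentException("empty archive");
            }
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            // read() returns -1 at the end of the current entry
            for (int len; (len = zis.read(buffer)) != -1; ) {
                content.write(buffer, 0, len);
            }
            return entry.getName() + "=" + new String(content.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] archive = zipSingleEntry("hello.txt", "hello zip");
        System.out.println(readFirstEntry(archive));
    }
}
```

Unlike gzip, which compresses a single stream, a zip archive is a container that can hold many named entries, which is why the API works in terms of ZipEntry.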

How do I compress/uncompress a Java String?

Here are three ways to compress and decompress a String: two that use Java’s built-in Deflater (ZLIB) compressor and one that uses GZIP:

  1. Using the DeflaterOutputStream is the easiest way:
    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.*;

    enum StringCompressor {
        ; // no instances: the enum only holds static utility methods
        public static byte[] compress(String text) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            // Closing the stream finishes the deflater and flushes the remaining bytes.
            try (OutputStream out = new DeflaterOutputStream(baos)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new AssertionError(e); // in-memory streams should not fail
            }
            return baos.toByteArray();
        }

        public static String decompress(byte[] bytes) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes))) {
                byte[] buffer = new byte[8192];
                int len;
                while ((len = in.read(buffer)) > 0)
                    baos.write(buffer, 0, len);
                return new String(baos.toByteArray(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new AssertionError(e); // also reached if the input is not valid ZLIB data
            }
        }
    }
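  2. If you want to use the Deflater / Inflater directly:
    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.*;

    enum StringCompressor2 {
        ;
        public static byte[] compress(String text) {
            Deflater compresser = new Deflater();
            compresser.setInput(text.getBytes(StandardCharsets.UTF_8));
            compresser.finish();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            // Keep draining until all of the compressed output has been produced.
            while (!compresser.finished()) {
                int len = compresser.deflate(buffer);
                baos.write(buffer, 0, len);
            }
            compresser.end(); // release the native resources
            return baos.toByteArray();
        }

        public static String decompress(byte[] bytes) throws DataFormatException {
            Inflater decompresser = new Inflater();
            decompresser.setInput(bytes, 0, bytes.length);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            try {
                // Loop so the result is not limited to a fixed-size output array.
                while (!decompresser.finished()) {
                    int len = decompresser.inflate(buffer);
                    if (len == 0 && (decompresser.needsInput() || decompresser.needsDictionary()))
                        throw new DataFormatException("Truncated or unsupported input");
                    baos.write(buffer, 0, len);
                }
            } finally {
                decompresser.end();
            }
            // Decode the bytes into a String
            return new String(baos.toByteArray(), StandardCharsets.UTF_8);
        }
    }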
  3. Here’s how to do it using GZIP:
    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    enum StringGZipper {
        ;
        public static String ungzip(byte[] bytes) throws IOException {
            try (InputStreamReader isr = new InputStreamReader(
                    new GZIPInputStream(new ByteArrayInputStream(bytes)), StandardCharsets.UTF_8)) {
                StringWriter sw = new StringWriter();
                char[] chars = new char[1024];
                for (int len; (len = isr.read(chars)) > 0; ) {
                    sw.write(chars, 0, len);
                }
                return sw.toString();
            }
        }

        public static byte[] gzip(String s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            // Closing the writer also finishes the GZIP stream and writes its trailer.
            try (OutputStreamWriter osw = new OutputStreamWriter(
                    new GZIPOutputStream(bos), StandardCharsets.UTF_8)) {
                osw.write(s);
            }
            return bos.toByteArray();
        }
    }
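
One thing worth keeping in mind about the three listings above is that they produce two different formats: the Deflater-based versions write ZLIB data, while the last one writes GZIP data, and a reader for one will not accept the other. The sketch below illustrates this under the assumption of hypothetical helper names (zlib, gzip, canGunzip) invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class FormatMismatchDemo {
    /** Compresses to the ZLIB format, as DeflaterOutputStream does by default. */
    static byte[] zlib(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Compresses to the GZIP format. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Returns true if the data can be read to the end as a GZIP stream. */
    static boolean canGunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) != -1) {
                // drain the stream; we only care whether it is readable
            }
            return true;
        } catch (IOException e) {
            return false; // ZipException ("Not in GZIP format") is an IOException
        }
    }

    public static void main(String[] args) {
        byte[] raw = "the same text, compressed twice".getBytes(StandardCharsets.UTF_8);
        System.out.println("gzip data readable as GZIP: " + canGunzip(gzip(raw)));
        System.out.println("zlib data readable as GZIP: " + canGunzip(zlib(raw)));
    }
}
```

In practice this means both ends of a connection have to agree on which of the two formats they use, or the receiver has to inspect the data before choosing a decoder.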

How to decode a stream of bytes to allow for both GZip and normal streams:

The code below will turn an array of bytes into a String (dump) without having to know in advance whether the data was zipped or not.

String dump;
if (isGZIPStream(bytes)) {
    // Compressed: gunzip the bytes while decoding them as UTF-8
    InputStreamReader isr = new InputStreamReader(new GZIPInputStream(new ByteArrayInputStream(bytes)), StandardCharsets.UTF_8);
    StringWriter sw = new StringWriter();
    char[] chars = new char[1024];
    for (int len; (len = isr.read(chars)) > 0; ) {
        sw.write(chars, 0, len);
    }
    dump = sw.toString();
} else {
    // Not compressed: decode the bytes directly
    dump = new String(bytes, 0, bytes.length, StandardCharsets.UTF_8);
}

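This is the implementation of the isGZIPStream method. It compares the first two bytes of the data with GZIP_MAGIC (0x8b1f), the two-byte signature that starts every GZIP stream and is stored low byte first.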

public static boolean isGZIPStream(byte[] bytes) {
        // Guard against inputs too short to hold the two-byte header
        return bytes.length >= 2
         && bytes[0] == (byte) GZIPInputStream.GZIP_MAGIC
         && bytes[1] == (byte) (GZIPInputStream.GZIP_MAGIC >>> 8);
}
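
To make the magic-number check concrete, here is a small self-contained sketch that gzips a String, looks at the first two bytes of the result, and then decodes both the compressed and the plain form through the same method. The class and method names (GzipMagicDemo, isGzip, decode) are assumptions made for this example and mirror the logic above rather than reproduce it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipMagicDemo {
    /** Gzips the UTF-8 bytes of the given String. */
    static byte[] gzip(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bos)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** True if the data starts with the GZIP signature bytes 0x1f 0x8b. */
    static boolean isGzip(byte[] bytes) {
        return bytes.length >= 2
                && bytes[0] == (byte) GZIPInputStream.GZIP_MAGIC
                && bytes[1] == (byte) (GZIPInputStream.GZIP_MAGIC >>> 8);
    }

    /** Decodes UTF-8 text from data that may or may not be gzipped. */
    static String decode(byte[] bytes) {
        if (!isGzip(bytes)) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int len; (len = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, len);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] gz = gzip("hello");
        // The first two bytes of any GZIP stream are 1f 8b
        System.out.println(String.format("%02x %02x", gz[0] & 0xff, gz[1] & 0xff));
        System.out.println(decode(gz));
        System.out.println(decode("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The check is a heuristic: uncompressed binary data that happens to begin with the same two bytes would be misdetected, although this is very unlikely for ordinary text.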

This is a simple way to read a file without knowing in advance whether it was zipped or not (relying on the .gz extension).

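// Assumes the file is UTF-8 text. Close the returned Stream to release the file.
static Stream<String> getStream(String dir, @NotNull String fileName)
  throws IOException {
        File file = new File(dir, fileName);
        InputStream in;
        if (file.exists()) {
            in = new FileInputStream(file);
        } else {
            // Fall back to the gzipped version of the same file
            file = new File(dir, fileName + ".gz");
            in = new GZIPInputStream(new FileInputStream(file));
        }

        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        // lines() does not close the reader by itself, so register a close handler
        return reader.lines().onClose(() -> {
            try { reader.close(); } catch (IOException e) { throw new UncheckedIOException(e); }
        });
}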
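
When the file name cannot be trusted, the extension check and the magic-number check can be combined by peeking at the first two bytes of the stream itself. The following is a minimal sketch of that idea using BufferedInputStream's mark/reset; the class and method names (GzipSniffer, maybeGunzip, readText) are hypothetical, and the gzip helper exists only to build sample input.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSniffer {
    /** Returns a stream of uncompressed bytes whether or not the source is gzipped. */
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);               // remember the start so the peek can be undone
        int b0 = in.read();       // -1 if the stream is empty
        int b1 = in.read();
        in.reset();               // rewind to the first byte
        boolean gzipped = b0 == (GZIPInputStream.GZIP_MAGIC & 0xff)
                && b1 == (GZIPInputStream.GZIP_MAGIC >>> 8);
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Reads the whole stream as UTF-8 text, gunzipping it first if needed. */
    static String readText(InputStream raw) {
        try (InputStream in = maybeGunzip(raw)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int len; (len = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, len);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper used only to build gzipped sample input. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = "line one\nline two".getBytes(StandardCharsets.UTF_8);
        System.out.println(readText(new ByteArrayInputStream(plain)));
        System.out.println(readText(new ByteArrayInputStream(gzip(plain))));
    }
}
```

The same maybeGunzip wrapper could be placed around a FileInputStream, so that a reader works for both plain and gzipped files regardless of how they are named.
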
Reference: Working with GZIP and compressed data from our JCG partner Daniel Shaya at the Rational Java blog.

Daniel Shaya

Daniel has been programming in Java since it was in beta. Working predominantly in the finance industry he has created real time trading and margin risk applications. He is currently a director at OpenHFT where we are building next generation Java low latency products.