Apache Arrow on the JVM: Streaming Writes

In a previous post we created some schemas with Arrow. In this post we will have a look at writing through the streaming API.

Based on the previous post's schema, we shall create a DTO for our entries.

package com.gkatzioura.arrow;
 
import lombok.Builder;
import lombok.Data;
 
@Data
@Builder
public class DefaultArrowEntry {
 
    private String col1;
    private Integer col2;
 
}

Our goal is to transform those Java objects into a stream of Arrow bytes.
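
For instance, a list of entries to convert might be built as follows (the values are arbitrary sample data):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Build some sample DTOs: col1 holds text, col2 an integer.
List<DefaultArrowEntry> entries = IntStream.range(0, 1000)
        .mapToObj(i -> DefaultArrowEntry.builder()
                                        .col1("entry-" + i)
                                        .col2(i)
                                        .build())
        .collect(Collectors.toList());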

The allocator creates DirectByteBuffers.
Those buffers are off-heap, so the memory used needs to be freed explicitly; for the library user this is done by calling the close() operation on the allocator. In our case our class will implement the Closeable interface, which shall perform the allocator close operation.
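
As a minimal standalone sketch of that lifecycle (assuming a recent Arrow release where ArrowBuf lives under org.apache.arrow.memory):

import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.RootAllocator;

// try-with-resources closes the buffer and then the allocator,
// releasing the off-heap memory they hold.
try (RootAllocator allocator = new RootAllocator();
     ArrowBuf buffer = allocator.buffer(1024)) { // a 1 KiB off-heap buffer
    buffer.setInt(0, 42);                        // write directly to off-heap memory
    System.out.println(buffer.getInt(0));
}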

By using the streaming API, the data will be written in the Arrow format to the WritableByteChannel submitted.

package com.gkatzioura.arrow;
 
import java.io.Closeable;
import java.io.IOException;
import java.nio.channels.WritableByteChannel;
import java.util.List;
 
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.util.Text;
 
import static com.gkatzioura.arrow.SchemaFactory.DEFAULT_SCHEMA;
 
public class DefaultEntriesWriter implements Closeable {
 
    private final RootAllocator rootAllocator;
    private final VectorSchemaRoot vectorSchemaRoot;
 
    public DefaultEntriesWriter() {
        rootAllocator = new RootAllocator();
        vectorSchemaRoot = VectorSchemaRoot.create(DEFAULT_SCHEMA, rootAllocator);
    }
 
    public void write(List<DefaultArrowEntry> defaultArrowEntries, int batchSize, WritableByteChannel out) {
        if (batchSize <= 0) {
            batchSize = defaultArrowEntries.size();
        }
 
        DictionaryProvider.MapDictionaryProvider dictProvider = new DictionaryProvider.MapDictionaryProvider();
        try(ArrowStreamWriter writer = new ArrowStreamWriter(vectorSchemaRoot, dictProvider, out)) {
            writer.start();
 
            VarCharVector childVector1 = (VarCharVector) vectorSchemaRoot.getVector(0);
            IntVector childVector2 = (IntVector) vectorSchemaRoot.getVector(1);
            childVector1.reset();
            childVector2.reset();
 
            boolean exactBatches = defaultArrowEntries.size()%batchSize == 0;
            int batchCounter = 0;
 
            for(int i=0; i < defaultArrowEntries.size(); i++) {
                childVector1.setSafe(batchCounter, new Text(defaultArrowEntries.get(i).getCol1()));
                childVector2.setSafe(batchCounter, defaultArrowEntries.get(i).getCol2());
 
                batchCounter++;
 
                if(batchCounter == batchSize) {
                    vectorSchemaRoot.setRowCount(batchSize);
                    writer.writeBatch();
                    batchCounter = 0;
                }
            }
 
            if(!exactBatches) {
                vectorSchemaRoot.setRowCount(batchCounter);
                writer.writeBatch();
            }
 
            writer.end();
        } catch (IOException e) {
            throw new ArrowExampleException(e);
        }
    }
 
    @Override
    public void close() throws IOException {
        vectorSchemaRoot.close();
        rootAllocator.close();
    }
 
}
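
For reference, the DEFAULT_SCHEMA imported above is the schema created in the previous post. A minimal sketch consistent with the two vectors used here (the field names are assumptions) could look like the following:

package com.gkatzioura.arrow;

import java.util.Arrays;

import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class SchemaFactory {

    // A var-char column and a 32-bit signed int column,
    // matching the VarCharVector and IntVector used by the writer.
    public static final Schema DEFAULT_SCHEMA = new Schema(Arrays.asList(
            new Field("col1", FieldType.nullable(new ArrowType.Utf8()), null),
            new Field("col2", FieldType.nullable(new ArrowType.Int(32, true)), null)
    ));

}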

To demonstrate Arrow's support for batches, a simple batching algorithm has been implemented within the function. For our example, just take into account that the data will be written in batches of a given size.
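
Before diving in, here is a hypothetical caller of the class above; the file path and batch size are arbitrary choices for illustration:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.util.Arrays;
import java.util.List;

public class WriterExample {

    public static void main(String[] args) throws IOException {
        List<DefaultArrowEntry> entries = Arrays.asList(
                DefaultArrowEntry.builder().col1("first").col2(1).build(),
                DefaultArrowEntry.builder().col1("second").col2(2).build(),
                DefaultArrowEntry.builder().col1("third").col2(3).build());

        // The writer holds off-heap memory, so close it when done.
        try (DefaultEntriesWriter writer = new DefaultEntriesWriter();
             FileOutputStream fos = new FileOutputStream("/tmp/entries.arrow")) {
            writer.write(entries, 2, Channels.newChannel(fos));
        }
    }
}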

Let’s dive into the function.

The vector allocator discussed previously is created:

public DefaultEntriesWriter() {
    rootAllocator = new RootAllocator();
    vectorSchemaRoot = VectorSchemaRoot.create(DEFAULT_SCHEMA, rootAllocator);
}

Then, when writing to a stream, an Arrow stream writer is instantiated and started:

ArrowStreamWriter writer = new ArrowStreamWriter(vectorSchemaRoot, dictProvider, out);
writer.start();

We shall use the vectors in order to populate them with the data. We also reset them, but let the pre-allocated buffers remain:

VarCharVector childVector1 = (VarCharVector) vectorSchemaRoot.getVector(0);
IntVector childVector2 = (IntVector) vectorSchemaRoot.getVector(1);
childVector1.reset();
childVector2.reset();

We use the setSafe operation when writing data. This way, if more buffer space is needed, it shall be allocated. For this example the check happens on every write, but it can be avoided when the number of operations and the buffer size needed are taken into account up front, as sketched after the snippet below.

childVector1.setSafe(batchCounter, new Text(defaultArrowEntries.get(i).getCol1()));
childVector2.setSafe(batchCounter, defaultArrowEntries.get(i).getCol2());
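
A minimal sketch of that pre-allocation idea, assuming the row count and an estimate of the bytes per string value are known up front:

// Pre-allocate capacity so setSafe rarely needs to grow the buffers.
int rows = defaultArrowEntries.size();
childVector1.allocateNew(rows * 32L, rows); // assumes roughly 32 bytes per string value
childVector2.allocateNew(rows);             // fixed-width vector: a value count is enough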

Once a batch is full, we set the row count and write the batch to the stream:

vectorSchemaRoot.setRowCount(batchSize);
writer.writeBatch();

Last but not least, we implement the close operation, which releases the VectorSchemaRoot and the allocator:

@Override
public void close() throws IOException {
    vectorSchemaRoot.close();
    rootAllocator.close();
}

The next blog will focus on reading Arrow Data from a stream.

Published on Java Code Geeks with permission by Emmanouil Gkatziouras, partner at our JCG program. See the original article here: Apache Arrow on the JVM: Streaming Writes
