Apache Arrow on the JVM: Streaming Writes

Previously we created some schemas in Arrow. In this post we will look at writing through the streaming API. Based on the previous post’s schema, we shall create a DTO for our classes. package com.gkatzioura.arrow; import lombok.Builder; import lombok.Data; @Data @Builder public ...

Apache Arrow on the JVM: Get Started and Schemas

Arrow is a memory format for flat and hierarchical data. It is a popular format used by various big data tools, among them BigQuery. One of the benefits Arrow brings is that the data has the same byte representation in every supported language. So apart from the benefits of a columnar memory format, there are also the ...

Where is Apache Spark heading?

I watched (the COVID-19-era version of “attended”) the latest Spark Summit, and in one of the keynotes Reynold Xin from Databricks presented two images comparing Spark usage on their platform in 2013 vs. 2020. While Databricks’ platform is, of course, not the whole Spark community, I would wager that they have enough users to represent the overall trend. Incidentally, ...

Popular frameworks for big data processing in Java


The big data challenge: The concept of big data is understood differently across the domains where companies face the need to deal with increasing volumes of data. In most of these scenarios the system under consideration needs to be designed so that it can process that data without sacrificing throughput as data ...

Big data isn’t – well, almost

Back in ancient history (2004), Google’s Jeff Dean & Sanjay Ghemawat presented their innovative idea for dealing with huge data sets – a novel idea called MapReduce. Jeff and Sanjay reported that a typical cluster was made up of hundreds to a few thousand machines with 2 CPUs and 2-4 GB RAM each. They reported that in the whole of Aug ...

