Previously we had an introduction to the BigQuery Storage API, and we proceeded to read data using the Arrow format. In this tutorial we shall read data using the Avro format. What applied in the previous tutorial applies here too: we shall create a BigQuery Storage client, create a ReadSession using the Avro format, and iterate over the data on each stream. ...
Read More »
BigQuery Storage API: Arrow
Previously we had an introduction to the BigQuery Storage API. As explained, the BigQuery Storage API supports two formats. For this tutorial we will choose the Arrow format. First, let's import the dependencies. The BigQuery Storage API binary does not come with a library to parse Arrow. This way the consumer receives the binaries in Arrow format, and ...
Read More »
BigQuery Storage API: Get Started and Comparisons
BigQuery provides us with the Storage API for fast access using an RPC-based protocol. With this option you can receive the data in a binary serialized format. The alternative ways to retrieve BigQuery data are through the REST API and a bulk export. Bulk data export is a good solution for exporting big result sets; however, you are limited to ...
Read More »
Apache Arrow on the JVM: Streaming Writes
Previously we created some schemas with Arrow. In this blog post we will have a look at writing through the streaming API. Based on the previous post's schema we shall create a DTO for our classes. package com.gkatzioura.arrow; import lombok.Builder; import lombok.Data; @Data @Builder public ...
Read More »
Apache Arrow on the JVM: Get Started and Schemas
Arrow is a memory format for flat and hierarchical data. It is a popular format used by various big data tools, among them BigQuery. One of the benefits Arrow brings is that the data has the same byte representation across all supported languages. So apart from the benefits of a columnar memory format, there are also the ...
Read More »
Where is Apache Spark heading?
I watched (the COVID19-era version of “attended”) the latest Spark Summit, and in one of the keynotes Reynold Xin from Databricks presented the following two images comparing Spark usage on their platform in 2013 vs. 2020: While Databricks’ platform is, of course, not the whole Spark community, I would wager that they have enough users to represent the overall trend. Incidentally, ...
Read More »
Processing real-time data with Storm, Kafka and ElasticSearch – Part 1
This is an article about processing real-time data with Storm, Kafka and ElasticSearch. 1. Introduction How would you process a stream of real or near-real-time data? In the era of Big Data, there are a number of technologies available that can help you with this task. In this series of articles we shall see a real example scenario and ...
Read More »
Popular frameworks for big data processing in Java
The big data challenge The concept of big data is understood differently across the variety of domains where companies face the need to deal with increasing volumes of data. In most of these scenarios the system under consideration needs to be designed in such a way that it is capable of processing that data without sacrificing throughput as data ...
Read More »
Big data isn’t – well, almost
Back in ancient history (2004), Google’s Jeff Dean & Sanjay Ghemawat presented their innovative idea for dealing with huge data sets – a novel idea called MapReduce. Jeff and Sanjay reported that a typical cluster was made of hundreds to a few thousand machines with 2 CPUs and 2–4 GB of RAM each. They presented that in the whole of Aug ...
Read More »