In this week’s Whiteboard Walkthrough Parth Chandra, Chair of PMC for Apache Drill project and member of MapR engineering team, describes how the Apache Drill SQL query engine reads data in Parquet format and some of the best practices to get maximum performance from Parquet.
Additional Apache Drill resources:
- ”Overview Apache Drill’s Query Execution Capabilities” Whiteboard Walkthrough video
- “SQL Query on Mixed Schema Data Using Apache Drill” blog post
- Free download Apache Drill on MapR sandbox
Here is the video transcription:
Hello, I’m Part Chandra. I’m one of the developers in the Apache Drill Project, in fact I’m the Apache Drill PMC, and I’m a member of the MapR Engineering team. Today we are going to talk about the parquet column file format, and particularly about how Apache Drill reads the parquet file format and how we can use some very good best practices in order to get maximum performance out of reading parquet files.
Let’s starts by talking about what the parquet file looks like. A parquet file is columnar. What that means is that it organizes data in groups of columns instead of one record at a time. Here is a picture of what it looks like. A parquet file is divided into a logical unit called a row group. Within a row group which you can see here the parquet file will have all the data for a column together, then for the next column together, so there’s going to be a bunch of records. And the only thing that you know for sure is that the number of records in each of these groups is the same. Within the column the parquet file organizes data into pages. Pages can be of different types. One of the types of pages can be a dictionary page, but the thing to remember particularly for parquet is that a page is the minimum unit of granularity for a parquet file. What that means is that when parquet saves the data it takes an entire page, it encodes the data in the parquet format, then it compresses that data. What that means is that I cannot access an individual record within a page until I’ve read the entire page, I’ve decompressed it and I’ve decoded it. That’s an important thing to remember.
The other thing to remember is that a single file is divided into many row groups. One of the things that happens in a distributive file system is that a row group can spend multiple machines. The thing we try to do though is, we try to make sure that a single row group always rests on the same note. A single file, let’s suppose this is a file is now divided across multiple nodes, so it’s across node 1 and node 2, we try to make sure that row group 1 for this file is on node 1, and row group 2 on this file is on node 2. There’s a reason for that. One of the things we try to see is exactly how Drill reads parquet files, and what we’ve tried to see here, here’s a little pseudo-representation of what happens, is that Drill looks at all the parquet files that have to be read, and it looks inside the parquet file and it discovers how many row groups there are to read. It then asks the file system, tell me what node this row group rests on.
It finds out that for this file, file 1 let’s say, row group 1 is a node 1 and row group 2 is on node 2, and what I’ll do is it’ll start a distributed operation where it will start by reading row group 1 on node 1 and it’ll read row group 2 on node 2. That way there is no network traffic while reading data for row group 1 because the data has been read locally, and the same thing for row group 2 which has been read locally on node 2.
Once a particular process has started on a single note we’re reading a particular node group. Within that node Drill will start reading every column, and this is a parallel operation, so one of the things that columnar data bases allow you to do is, if you don’t need to read column B, you don’t read column B, you only read column A. In a row oriented format you end up reading data one row at a time, so you end up reading column A and column B together, but in case of a columnar data base you don’t need to do that.
Let’s suppose in this particular case we’re going to read both column A and column B, right. On node 1, a single reader will start reading row group 1 and within that it will read column A and column B because that’s what it needs to do, and as we mentioned the single unit of granularity is a page so we will read a page at a time, and sometimes as I mentioned the page could be a dictionary page which is really a very efficient way of storing words that occur very frequently in a particular column for instance. The process essentially reads columns, it will read a page, it’ll decompress the page, it’ll decode the page, and it will copy the page data that has been decoded into individual values into the internal memory in memories columnar format for Drill which is called a value vector. Of course, the value vector is a very efficient way of storing columnar data in memory because it allows random access but we’re not going to go into the details of value vectors today. It’s a good thing to remember that that’s one of the operations that happens as Drill is reading the parquet file.
What can we do to make sure that Drill is able to take maximum advantage of how to read a parquet file? One of the things I already mentioned was that we try to make sure that a particular row group rests always on the same node. We want to make sure that a file system which actually doesn’t know that it’s writing a parquet file always guarantees that a row group is always on the same node. One of the things you do here is you try to make sure that the size of a row group, and we like to make row groups large, we prefer to make them as large as a gigabyte, the default in Apache Drill is about half a gig, it’s 512 Megabytes, but a Gigabyte is a fairly good size, and we try to make sure that the HDFS block size, or if you’re on MFS the MFS chunk size is the same. What that does is make sure that the file system knows if it’s going to write a block of data, that 1 gigabyte of data will always rest on the same node. Because we’re writing a row group of the same size we sort of guarantee that the row group is now written on the same node. That eliminates, as I mentioned before, the network transfers.
The next thing we try to do is we try to use a much smaller page size because a page is the unit of granularity. It allows us to process a page in parallel as you’re reading columns, so we’re going to read a page, we’re going to decompress the page, we’re going to decode the page, and we’re going to copy that data into a value vector but we want to do it in large enough chunks of units of works, but we don’t want that to be too small, and we don’t want that to be too large. One of the things we noticed, and this again is a default, is that you try to have a 1-gigabyte page size. The reason I’m mentioning this is because even though these are the defaults in Drill, so if you created a parquet file using Drill you would automatically get this sufficiency, the same is not true for parquet files created using other tools. Other tools will allow you to change these 2 parameters, the actual row group size and the page size, and so you want to take advantage of the fact that Drill is aware of these sizes and can take maximum advantage out of this.
The last thing I wanted to mention was that if you store data which is compressed you minimize the amount of data that is on disc, and reading from disc typically being a slower operation is something you want to minimize. It is a good idea to store data as compressed, and again data is compressed at a level of a page, but there are multiple options for compression techniques. The 2 most commonly used are Snappy and GZIP, and the trade-off here is that Snappy is much much faster than GZIP when it comes to decompressing data but it does not compress data as much, so it will use a larger space on disc but when it comes to reading the data it can read it much faster than GZIP. If you want to constrain the amount of disc space being used by your data, you would prefer to use GZIP so you minimize the space. If you just want to read much much faster and you’re willing to trade off some disc space you want to use Snappy. Again, you may have to try this a few times depending if particularly on your kind of data but these are the 2 most common options.
That’s what I have for you guys today. Hopefully that was helpful. If you have any comments please feel free to leave some feedback, and thank you very much for watching.