Why (re)Chunk Data?#
If you are completely new to chunking, then you are probably interested in learning "what is data chunking?" and "why should I care?". The goal of this notebook is to answer these two basic questions and give you an understanding of what it means for data to be chunked and why you would want to do it.
What is chunking?#
Since modern computers were invented, there have been datasets too large to read fully into computer memory. These have come to be known as "larger-than-memory" datasets. While these datasets may be larger than memory, we still want to access them and perform analysis on the data. This is where chunking comes in. "Chunking" is the process of breaking down large amounts of data into smaller, more manageable pieces. Breaking the data down into "chunks" allows us to work with pieces of the larger overall dataset in a structured way without exceeding our machine's available memory. Additionally, proper chunking can allow for faster retrieval and analysis when we only need to work with part of the dataset.
Note
Chunks are not another dimension of your data, but merely a map of how the dataset is partitioned into more palatably sized units for manipulation in memory.
Why should I care?#
The simple reason you should care is that you are working with a dataset that is larger than memory. Such a dataset has to be divided in some way so that only the parts being actively worked on are loaded into memory at a given time; otherwise, your machine would crash. This also has benefits for parallel algorithms: if work can be performed on independent chunks, it is easy to set up your algorithm so that separate parallel workers each process a chunk of the data simultaneously. Therefore, proper chunking can allow for faster retrieval and analysis of the dataset. Even datasets that are small enough to fit into memory can still be chunked, and proper chunking of these datasets can potentially speed up retrieval and analysis. To help you understand this, let's begin with a simple example.
Example - First Principles#
In this example, we will illustrate two common memory organization strategies (analogous to chunking) that computers use when handling basic multidimensional data. To keep things simple, let's consider a small 10x10 array of integer values.
While this is easy for us humans to visualize, computer memory is not addressed in grids. Instead, it is organized as a linear address space. So, the 2D matrix has to be organized in memory such that it presents as 2D while being stored as 1D. Two common options are row-major order and column-major order:
Row-Major: A row of data occupies a contiguous block of memory. This implies that cells which are logically adjacent vertically are not physically near one another in memory. The "distance" from `r0c0` to `r0c1` (a one-cell logical move within the row) is short, while the "distance" to `r1c0` (a one-cell logical move within the column) is long.
Column-Major: A column of the array occupies a contiguous block of memory. This implies that cells which are adjacent horizontally are not near one another physically in memory.
In either mapping, `r3c5` (for example) still fetches the same value. For a single value, this is not a problem. The array is still indexed/addressed in the same way as far as the user is concerned, but the memory organization strategy determines how nearby an "adjacent" index is. This becomes important when trying to get a subsection of the data. For example, if the array is in row-major order and we select, say, `r0`, this is fast for the computer as all the data is adjacent. However, if we wanted `c0`, then the computer has to access every 10th value in memory, which, as you can imagine, is not as efficient.
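If you want to poke at this yourself, here is a minimal NumPy sketch of the two layouts (NumPy defaults to row-major, or "C", order; `np.asfortranarray` makes a column-major copy of the same values). The 10x10 array mirrors the example above:

```python
import numpy as np

# NumPy arrays are row-major ("C" order) by default, like the 10x10 example above.
arr_c = np.arange(100, dtype=np.int64).reshape(10, 10)

row = arr_c[0, :]  # r0: one contiguous block of memory, so a fast read
col = arr_c[:, 0]  # c0: every 10th element in memory (strided access)

# The same values in column-major ("F" order) flip the trade-off:
# columns become contiguous and rows become strided.
arr_f = np.asfortranarray(arr_c)

print(arr_c.strides)  # (80, 8): the next row is 80 bytes away, the next column 8 bytes
print(arr_f.strides)  # (8, 80): the opposite
```

Either layout returns identical values for any index; only the memory access pattern differs.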
Extend to Chunking#
The basic idea behind chunking is an extension of this memory organization principle. As the size of the array increases, the chunk shape becomes more relevant. Suppose the square array is now larger than memory and stored on disk such that only a single row or column can fit into memory at a time. If your data is chunked by row and you need to process the \(i^{th}\) column, you will have to read one row at a time into memory, skip to the \(i^{th}\) column value in each row, and extract that value. You can see why this type of analysis would be slow due to the massive amount of I/O, and why it would be better if the array were instead chunked by column. To make this clear: if your data were chunked by columns, all you would have to do is read the \(i^{th}\) column into memory, and you would be good to go. That is, you would need just a single read from disk rather than one read per row of your data. While handling chunks may seem like it would become complicated, array-handling libraries (numpy, xarray, pandas, dask, and others) handle all of the record-keeping to know which chunk holds what data within the dataset.
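Before moving to the timed example below, here is a rough sketch of that read pattern. This is not how any particular library stores data; plain in-memory lists simply stand in for chunks sitting in separate files on disk:

```python
import numpy as np

n = 1_000  # stand-in size; the real case is a larger-than-memory array
i = 42     # the column we want

# Pretend each list entry is a chunk stored in its own file on disk.
row_chunks = [np.ones((1, n)) for _ in range(n)]  # chunked by row
col_chunks = [np.ones((n, 1)) for _ in range(n)]  # chunked by column

# Row-chunked: every one of the n chunks must be "read" to pull out its i-th value.
col_from_row_chunks = np.concatenate([chunk[0, i:i + 1] for chunk in row_chunks])

# Column-chunked: a single chunk already holds the entire i-th column.
col_from_col_chunks = col_chunks[i][:, 0]
```

The difference is n reads versus a single read, which is exactly what the dask timings below make visible.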
Toy Example#
By now, we have hopefully answered both of the questions about “what is data chunking?” and “why should I care?”. To really drive home the idea, let’s apply the above theoretical example using dask. In this case, we will generate a square array of ones to test how different “chunk shapes” compare.
import dask.array as da
Chunk by Rows#
First, let’s start with the square array chunked by rows. We’ll do a 50625x50625 array as this is about 19 GiB, which is larger than the typical memory availability of a laptop. The nice thing about dask is that we can see how big our array and chunks are in the output.
vals = da.ones(shape=(50625, 50625), chunks=(1, 50625))
vals
Now, let’s see how long on average it takes to get the first column.
Note
We use the `.compute()` method on our slice to ensure its extraction is not performed lazily.
%%timeit
vals[:, 0].compute()
18.9 s ± 988 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Chunk by Columns#
Next, let's switch the array to being chunked by columns.
vals = da.ones(shape=(50625, 50625), chunks=(50625, 1))
vals
Time to see how much faster this is.
%%timeit
vals[:, 0].compute()
1.7 ms ± 7.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
As expected, the time difference is massive when properly chunked.
Balanced Chunks#
As a final example, let's check a square chunk shape that keeps the same number of elements per chunk as the pure row and column chunking (225 × 225 = 50,625).
vals = da.ones(shape=(50625, 50625), chunks=(225, 225))
vals
%%timeit
vals[:, 0].compute()
75.9 ms ± 5.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As we can see, accessing the first column with balanced chunks is slower than with pure column chunking, but still orders of magnitude faster than with row chunking. However, let's also time how long it takes to access a single row.
%%timeit
vals[0, :].compute()
75.5 ms ± 6.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
As expected, it is about the same as accessing a single column. However, that means it is drastically faster than the column chunking when accessing rows. Therefore, a chunk shape that balances the dimensions is more generally applicable when both dimensions are needed for analysis.
Pros & Cons to Chunking#
As a wrap-up, let's review some of the pros and cons of chunking. Some we have discussed explicitly, while others may be more subtle. The primary pro, as we hopefully conveyed with our previous example, is that well-chunked data substantially speeds up any analysis that favors that chunk shape. However, this becomes a con when you change your analysis to one that favors a different chunk shape. In other words, data that is well organized to optimize one kind of analysis may not suit another kind of analysis on the same data. While not a problem for our example here, changing the chunk shape (known as "rechunking") of an established dataset is time-consuming, and it produces a separate copy of the dataset, increasing storage requirements. The space commitment can be substantial if a complex dataset needs to be organized for many different analyses. If our example above had used unique values that we wanted to keep as we changed the chunking, then rather than having a single ~19 GiB dataset, we would have needed to keep all three, tripling our storage to almost 60 GiB. Therefore, selecting an appropriate chunk shape is critical when generating widely used datasets.
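For reference, dask exposes rechunking directly through the `rechunk` method. Here is a minimal sketch reusing the toy example's array; keep in mind that rechunking a dataset stored on disk additionally means writing out a full new copy:

```python
import dask.array as da

# Start from the row-chunked array used in the toy example above.
vals = da.ones(shape=(50625, 50625), chunks=(1, 50625))

# Rechunk to the balanced shape; this is lazy, and dask builds a task graph
# that shuffles data between the old and new chunks when computed.
balanced = vals.rechunk((225, 225))
```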