Basics of Chunk Shape and Size#
The goal of this notebook is to learn the basics of chunk shape and size. We will discuss several factors to consider when deciding on chunk shape and size for datasets being written to storage, as these choices affect how the data is later read from storage and, subsequently, how computations on it perform.
import xarray as xr
import fsspec
Accessing the Example Dataset#
In this notebook, we will use the monthly PRISM v2 dataset as an example for understanding the effects of chunk shape and size. Let’s go ahead and read in the file using xarray. To do this, we will use fsspec to get a mapper to the Zarr file on the HyTEST OSN.
Note
The xarray loader is “lazy”, meaning it will read just enough of the data to make decisions about its shape, structure, etc. It will pretend like the whole dataset is in memory (and we can treat it that way), but it will only load data as required.
fs = fsspec.filesystem(
's3',
anon=True, # anonymous = does not require credentials
client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org/'}
)
ds = xr.open_dataset(
fs.get_mapper('s3://mdmf/gdp/PRISM_v2.zarr/'),
engine='zarr'
)
ds
<xarray.Dataset> Size: 33GB
Dimensions:    (lat: 621, lon: 1405, time: 1555, tbnd: 2)
Coordinates:
  * lat        (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.19 24.15 24.1
  * lon        (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time       (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Dimensions without coordinates: tbnd
Data variables:
    crs        int64 8B ...
    ppt        (time, lat, lon) float64 11GB ...
    time_bnds  (time, tbnd) datetime64[ns] 25kB ...
    tmn        (time, lat, lon) float64 11GB ...
    tmx        (time, lat, lon) float64 11GB ...
Attributes: (12/24)
    Conventions:               CF-1.4
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgment:            PRISM Climate Group, Oregon State University, ...
    authors:                   PRISM Climate Group
    cdm_data_type:             Grid
    creator_email:             daley@nacse.org
    ...                        ...
    publisher_url:             http://prism.oregonstate.edu/
    summary:                   This dataset was created using the PRISM (Para...
    time_coverage_resolution:  Monthly
    title:                     Parameter-elevation Regressions on Independent...
    time_coverage_start:       1895-01-01T00:00
    time_coverage_end:         2024-07-01T00:00
Chunk Shape and Size#
Given what we know about this data, we can apply some storage principles to form a strategy for how best to chunk the data if we were to write it to storage (assuming it isn’t already). Broadly, we need to specify chunk shape and size.
Shape Considerations#
“Chunk shape” is the shape of a chunk, which specifies the number of elements in each dimension.
So, we will need to decide on the size of each of the dimensions of the chunks.
The preferred shape of each chunk will depend on the read pattern for future analyses.
Our goal is to chunk the data so that future reads will be performant, and that depends on whether the analyses favor one dimension or another.
For some datasets, this will be very apparent.
For example, streamflow gage data is very likely to be consumed along the time
dimension.
So, a collection of data from multiple gages is more likely to have the individual time series analyzed as opposed to analyzing all gages at a given time.
Therefore, we would want a chunk shape that is larger along the time dimension.
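For instance, a chunk plan for such a hypothetical gage dataset might look like the following; the dimension names and chunk lengths are made up purely for illustration:
# Hypothetical daily-streamflow dataset with dimensions 'time' and 'gage_id'.
# Long spans of time and few gages per chunk favor reading one gage's full record.
streamflow_chunk_plan = {'time': 15000, 'gage_id': 10}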
For datasets where there is no clear preference, we can try to chunk based on likely read patterns, but allow for other patterns without too much of a performance penalty.
Let’s see how we might do this for our example dataset. Since this dataset spans space and time, it will likely be used with one of two dominant read patterns:
Time series for a given location (or small spatial extent)
Special case: Is it likely that the time series will be subset by a logical unit (e.g., will this monthly data be consumed in blocks of 12 (i.e., yearly))?
Full spatial extent for a given point in time.
Special case: Are specific spatial regions more used than others?
Let’s look at a couple of options for space and time chunking:
Time Dimension#
As we can see above, the example dataset has 1555 monthly time steps. How many chunks would we have if we chunked in groups of twelve (i.e., a year at a time)?
print(f"Number of chunks: {len(ds.time) / 12:0.2f}")
Number of chunks: 129.58
In this case, a user could get a single year of monthly data as a single chunk.
It is important to note that we do not get a whole number of chunks. Having 129.58
time chunks means we will have 130 chunks in practice, but the last one is not full-sized. That last chunk is a “partial chunk” because we do not have a full year of data for 2024.
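To make that concrete, we can count the chunks and the size of the leftover directly (a quick check, not part of the chunking itself):
import math

# 1555 monthly time steps split into chunks of 12 (one year per chunk)
n_time = len(ds.time)
print(f"Chunks in practice: {math.ceil(n_time / 12)}")
print(f"Months in the final partial chunk: {n_time % 12}")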
So this is where the judgement call gets made: Which is the more likely read pattern for time: year-by-year, or the whole time series (or some sequence of a few years)? For PRISM, it is more likely that someone will want more than just one year of data. A happy medium for chunk shape along the time dimension could be 6 years of data per chunk.
time_chunk_shape = 12 * 6
print(f"Number of chunks: {len(ds.time) / time_chunk_shape:0.2f}; Chunk of shape: {time_chunk_shape}")
Number of chunks: 21.60; Chunk of shape: 72
This pattern means only 22 chunks (instead of the 130 chunks we were considering a moment ago) are needed to read a full time series at a given location.
Spatial Dimension#
As we can see in our example dataset, it technically contains two spatial dimensions: lat and lon.
So, we are really chunking both of these dimensions when we talk about chunking with respect to space.
While we will consider them both together here, it is important to point out that they can have separate chunk shapes.
This leads to the question of whether future users of this data will want strips of latitude or longitude, square “tiles” in space, or tiles proportional to the latitude and longitude extents.
That is, is it important that the North-South extent be broken into the same number of chunks as the East-West extent?
Let’s start by chunking this into square tiles.
Since there are more lon elements than lat elements, there will be more lon chunks than lat chunks.
nlon = len(ds.lon)
nlat = len(ds.lat)
space_chunk_size = nlat // 4 # split the smaller of the two dimensions into 4 chunks
print(f"Number of 'lon' chunks: {nlon / space_chunk_size:0.3f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlon % space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / space_chunk_size:0.3f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlat % space_chunk_size}")
Number of 'lon' chunks: 9.065; Chunk of shape 155; Size of last chunk: 10
Number of 'lat' chunks: 4.006; Chunk of shape 155; Size of last chunk: 1
Having 9.065 longitude chunks means we will have 10 chunks in practice, but the last one is not full-sized; in this case, it is an extremely thin sliver.
For the latitude chunks, the extra 0.006 of a chunk means that the last, fractional chunk (or “partial chunk”) is only one lat observation wide.
Tip
Ideally, we would want partial chunks to be at least half the size of the standard chunk. The bigger that “remainder” fraction, the better.
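One way to make this rule of thumb concrete is a small helper that reports the size of the final, partial chunk as a fraction of a full chunk; partial_fraction is just an illustrative name, not a library function. For the current 155-element square tiles, it confirms how tiny those remainders are:
def partial_fraction(dim_len, chunk_len):
    """Size of the final (partial) chunk as a fraction of a full chunk."""
    remainder = dim_len % chunk_len
    return 1.0 if remainder == 0 else remainder / chunk_len

# Rule of thumb: aim for a fraction of at least 0.5 in every dimension
print(f"lat: {partial_fraction(nlat, 155):.3f}, lon: {partial_fraction(nlon, 155):.3f}")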
Let’s adjust the chunk shape a little so that we don’t have that sliver. We’re still committed to square tiles, so let’s try a larger chunk shape to change the size of that last fraction. Increasing the chunk size a little should get us bigger “remainders”.
space_chunk_size = 157
print(f"Number of 'lon' chunks: {nlon / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlon % space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlat % space_chunk_size}")
Number of 'lon' chunks: 8.95; Chunk of shape 157; Size of last chunk: 149
Number of 'lat' chunks: 3.96; Chunk of shape 157; Size of last chunk: 150
With this pattern, the “remainder” latitude chunk will have a shape of 150 in the lat dimension, and the “remainder” longitude chunk will have a shape of 149 in the lon dimension.
All other chunks will be square, with 157 observations in both dimensions.
This amounts to a 9x4 chunk grid, with the last chunk in each dimension being partial.
The entire spatial extent for a single time step can be read in 36 chunks with this chunk shape. That seems a little high, given that this dataset will likely be taken at full spatial extent for a typical analysis. Let’s go a little bigger to see what that gets us:
space_chunk_size = 354  # a larger square tile, giving roughly a 4x2 chunk grid
print(f"Number of 'lon' chunks: {nlon / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlon % space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlat % space_chunk_size}")
Number of 'lon' chunks: 3.97; Chunk of shape 354; Size of last chunk: 343
Number of 'lat' chunks: 1.75; Chunk of shape 354; Size of last chunk: 267
This is just as good in terms of nearly full remainder chunks, and the whole extent can be read with only 8 chunks. The smallest remainder is still >75% of a full-sized square tile, which is acceptable.
Alternatively, we could drop the commitment to square tiles and split the spatial dimensions more evenly. For example, we could get as close to a 4x2 split as possible:
# Add one to round up, since neither dimension divides evenly
lon_space_chunk_size = nlon // 4 + 1
lat_space_chunk_size = nlat // 2 + 1
print(f"Number of 'lon' chunks: {nlon / lon_space_chunk_size:0.3f}; Chunk of shape {lon_space_chunk_size}; Size of last chunk: {nlon % lon_space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / lat_space_chunk_size:0.3f}; Chunk of shape {lat_space_chunk_size}; Size of last chunk: {nlat % lat_space_chunk_size}")
Number of 'lon' chunks: 3.991; Chunk of shape 352; Size of last chunk: 349
Number of 'lat' chunks: 1.997; Chunk of shape 311; Size of last chunk: 310
Or we could aim for a 3x3 split:
# Add one to round up, since neither dimension divides evenly
lon_space_chunk_size = nlon // 3 + 1
lat_space_chunk_size = nlat // 3 + 1
print(f"Number of 'lon' chunks: {nlon / lon_space_chunk_size:0.3f}; Chunk of shape {lon_space_chunk_size}; Size of last chunk: {nlon % lon_space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / lat_space_chunk_size:0.3f}; Chunk of shape {lat_space_chunk_size}; Size of last chunk: {nlat % lat_space_chunk_size}")
Number of 'lon' chunks: 2.996; Chunk of shape 469; Size of last chunk: 467
Number of 'lat' chunks: 2.986; Chunk of shape 208; Size of last chunk: 205
As you may be gathering, the exact proportion between latitude and longitude chunks is not critically important. What matters for basic chunk shape is the total number of chunks across the two dimensions and keeping the final, partial chunk in each dimension as close to a full chunk as possible.
Note
If we were really confident that most analyses want the full spatial extent, we might be better off putting each of the lat and lon dimensions into a single chunk. This would ensure (and require) that every read covers the entire spatial extent. However, our poor time-series analysis would then be stuck reading the entire dataset to get all time values for a single location.
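For reference, that single-spatial-chunk layout could be expressed like this (in xarray and Dask, a chunk length of -1 means “use the full extent of the dimension”):
# One chunk spanning the full spatial extent, with time still split into chunks
full_extent_chunks = {'time': 72, 'lat': -1, 'lon': -1}
# e.g., ds.chunk(full_extent_chunks) would apply this layout lazily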
Size Considerations#
Shape is only part of the equation. Total “chunk size” also matters. Size considerations come into play mostly in terms of how the chunks are stored on disk, since retrieval time is influenced by the size of each chunk. Here are some constraints:
Files Too Big: In a Zarr dataset, each chunk is stored as a separate binary file (and the entire Zarr dataset is a directory grouping these many “chunk” files). If we need data from a particular chunk, no matter how little or how much, that file gets opened, decompressed, and the whole thing read into memory. A large chunk size means that a lot of data may be transferred in situations when only a small subset of that chunk’s data is actually needed. It also means there might not be enough chunks to keep the Dask workers busy loading data in parallel.
Files Too Small: If the chunk size is too small, the time it takes to read and decompress each chunk becomes comparable to the latency of an S3 request (typically 10-100 ms). We want each read to take at least a second or so, so that latency is not a significant part of the overall timing (see the rough illustration below).
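To get a feel for this trade-off, here is a back-of-the-envelope illustration; the latency and throughput numbers are assumptions chosen only to show the effect, not measurements of any particular object store:
# Back-of-the-envelope illustration with assumed (not measured) numbers:
# ~50 ms of request latency and ~100 MB/s of effective throughput.
latency_s = 0.05          # assumed per-request latency in seconds
throughput_Bps = 100e6    # assumed throughput in bytes per second

for chunk_MB in [1, 10, 100]:
    transfer_s = chunk_MB * 1e6 / throughput_Bps
    total_s = latency_s + transfer_s
    print(
        f"{chunk_MB:>3} MB chunk: transfer {transfer_s:.2f} s, "
        f"latency is {latency_s / total_s:.0%} of the total read time"
    )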
Tip
As a general rule, aim for chunk sizes between 10 and 200 MiB, depending on the shape and the expected read pattern of a user.
Total Chunk Size#
To estimate the total chunk size, all we need is the expected chunk shape and the data type, which tells us how many bytes each value takes up.
As an example, let’s use a chunk shape of {'time': 72, 'lat': 354, 'lon': 354}.
This will tell us whether we have hit our target of between 10 and 200 MiB per chunk.
chunks = {'time': 72, 'lat': 354, 'lon': 354}
bytes_per_value = ds.tmn.dtype.itemsize
total_bytes = chunks['time'] * chunks['lat'] * chunks['lon'] * bytes_per_value
kiB = total_bytes / (2 ** 10)
MiB = kiB / (2 ** 10)
print(f"TMN chunk size: {total_bytes} ({kiB=:.2f}) ({MiB=:.2f})")
TMN chunk size: 72182016 (kiB=70490.25) (MiB=68.84)
We’re looking really good for size: about 69 MiB.
This may even be a bit low.
But we’re within the (admittedly broad) range of 10-200 MiB of uncompressed (i.e., in-memory) data per chunk.
Therefore, this seems like a reasonable chunk shape and size for our dataset.
If we were curious about other chunk shapes, like a non-square lat and lon chunk, we could repeat this computation to estimate its size and decide whether it is reasonable.
We won’t work through every option here, but the small helper sketched below makes it easy to try on your own.
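As a convenience, the arithmetic above can be wrapped in a short function; chunk_size_mib is our own illustrative name, not a library call. Here it is applied to the non-square 4x2 split from earlier:
def chunk_size_mib(chunk_shape, dtype):
    """Estimate the uncompressed size of a single chunk, in MiB."""
    n_values = 1
    for dim_len in chunk_shape.values():
        n_values *= dim_len
    return n_values * dtype.itemsize / 2**20

print(f"{chunk_size_mib({'time': 72, 'lat': 311, 'lon': 352}, ds.tmn.dtype):.2f} MiB")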
Review and Final Considerations#
Now that you have a general idea on how to pick chunk shape and size, let’s review and add a few final considerations.
Basic Chunking Recommendations#
When determining the basic chunk shape and size, the choice will depend on the expected read pattern and analysis. If this pattern is unknown, it is important to take a balanced chunking approach that does not favor one dimension over the others (i.e., does not make the chunks much larger along one dimension). Next, the chunk shape should be chosen to avoid partial chunks if possible; otherwise, partial chunks should be at least half the size of a standard chunk. Finally, the total chunk size should be between 10 and 200 MiB for optimal performance.
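To tie these recommendations together, here is a minimal sketch of how the chunk plan chosen above could be applied when writing the dataset to storage; the output path is hypothetical, and encoding/compression details are omitted:
# Lazily rechunk to the chosen plan and (optionally) write it out as Zarr
chunk_plan = {'time': 72, 'lat': 354, 'lon': 354}
ds_rechunked = ds.chunk(chunk_plan)
# ds_rechunked.to_zarr('PRISM_v2_rechunked.zarr', mode='w')  # hypothetical output path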
Final Considerations#
One final thing to consider is that these basic recommendations assume your chunked data will be static and never updated. However, some datasets, especially climate-related ones, are periodically extended along their time dimension, commonly at regular intervals (e.g., every year with the previous year’s data). This can change the choice of chunk shape, so that adding the next year’s worth of data does not require rechunking the whole dataset or produce small partial chunks. For our PRISM example, if we chose a temporal chunk shape of 72 (i.e., six years per chunk), adding a year’s worth of data would mean appending to the final partial chunk until it becomes full, after which further new data would start another partial chunk. This could be avoided by choosing a time chunk shape of 12 (i.e., one year per chunk); additional data would then only create new chunks rather than rewriting existing ones. Therefore, considering updates to the dataset when deciding on the chunking plan can save a lot of time when appending to the dataset in the future.
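As a sketch of why one-year time chunks make updates painless, xarray can append along an existing dimension with append_dim; ds_new_year and the store path here are hypothetical stand-ins for the next 12 monthly time steps and a store chunked with one year per time chunk:
# Hypothetical append: one new year of data arrives as exactly one full time chunk,
# so it is written as new chunk files without rewriting any existing chunks.
# ds_new_year = ...  # 12 new monthly time steps with lat/lon chunks matching the store
# ds_new_year.chunk({'time': 12, 'lat': 354, 'lon': 354}).to_zarr(
#     'PRISM_v2_yearly_chunks.zarr', append_dim='time'
# )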
Additionally, nothing covered here addresses true optimization of chunk shape and size. Proper optimization would attempt to select chunk sizes that are near powers of two (i.e., \(2^N\)) to facilitate optimal storage and disk retrieval. Details on this topic can be found in the advanced topic notebook of Choosing an Optimal Chunk Size and Shape.