Basics of Chunk Shape and Size#
The goal of this notebook is to learn the basics of chunk shape and size. We will discuss several factors to consider when deciding on chunk shape and size for datasets being written to storage, as these choices affect how the data is later read from storage and, subsequently, how computations on it perform.
import xarray as xr
import fsspec
Accessing the Example Dataset#
In this notebook, we will use the monthly PRISM v2 dataset as an example for understanding the effects of chunk shape and size. Let’s go ahead and read in the file using xarray. To do this, we will use fsspec to get a mapper to the Zarr file on the HyTEST OSN.
Note
The xarray loader is “lazy”, meaning it will read just enough of the data to make decisions about its shape, structure, etc. It will pretend like the whole dataset is in memory (and we can treat it that way), but it will only load data as required.
fs = fsspec.filesystem(
's3',
anon=True, # anonymous = does not require credentials
client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org/'}
)
ds = xr.open_dataset(
fs.get_mapper('s3://mdmf/gdp/PRISM_v2.zarr/'),
engine='zarr'
)
ds
<xarray.Dataset> Size: 33GB
Dimensions:    (lat: 621, lon: 1405, time: 1555, tbnd: 2)
Coordinates:
  * lat        (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.19 24.15 24.1
  * lon        (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time       (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Dimensions without coordinates: tbnd
Data variables:
    crs        int64 8B ...
    ppt        (time, lat, lon) float64 11GB ...
    time_bnds  (time, tbnd) datetime64[ns] 25kB ...
    tmn        (time, lat, lon) float64 11GB ...
    tmx        (time, lat, lon) float64 11GB ...
Attributes: (12/24)
    Conventions:               CF-1.4
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgment:            PRISM Climate Group, Oregon State University, ...
    authors:                   PRISM Climate Group
    cdm_data_type:             Grid
    creator_email:             daley@nacse.org
    ...                        ...
    publisher_url:             http://prism.oregonstate.edu/
    summary:                   This dataset was created using the PRISM (Para...
    time_coverage_resolution:  Monthly
    title:                     Parameter-elevation Regressions on Independent...
    time_coverage_start:       1895-01-01T00:00
    time_coverage_end:         2024-07-01T00:00
Chunk Shape and Size#
Given what we know about this data, we can apply some storage principles to form a strategy for how best to chunk the data if we were to write it to storage (assuming it isn’t already). Broadly, we need to specify chunk shape and size.
Shape Considerations#
“Chunk shape” is the shape of a chunk, which specifies the number of elements in each dimension.
So, we will need to decide on the size of each of the dimensions of the chunks.
The preferred shape of each chunk will depend on the read pattern for future analyses.
Our goal is to chunk the data so that future reads will be performant, and that depends on whether the analyses favor one dimension or another.
For some datasets, this will be very apparent.
For example, streamflow gage data is very likely to be consumed along the time
dimension.
So, a collection of data from multiple gages is more likely to have the individual time series analyzed as opposed to analyzing all gages at a given time.
Therefore, we would want a chunk shape that is larger along the time dimension.
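For instance, a chunk plan for such a hypothetical gage dataset might look like the following; the dimension names and chunk lengths are made up purely for illustration:
# Hypothetical daily-streamflow dataset with dimensions 'time' and 'gage_id'.
# Long spans of time and few gages per chunk favor reading one gage's full record.
streamflow_chunk_plan = {'time': 15000, 'gage_id': 10}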
For datasets where there is no clear preference, we can try to chunk based on likely read patterns, but allow for other patterns without too much of a performance penalty.
Let’s see how we might do this for our example dataset. Since this dataset spans space and time, it will likely be used with one of two dominant read patterns:
Time series for a given location (or small spatial extent)
Special case: Is it likely that the time series will be subset by a logical unit (e.g., will this monthly data be consumed in blocks of 12 (i.e., yearly))?
Full spatial extent for a given point in time.
Special case: Are specific spatial regions more used than others?
Let’s look at a couple of options for space and time chunking:
Time Dimension#
As we can see above, the example dataset has 1555 monthly time steps. How many chunks would we have if we chunked in groups of twelve (i.e., a year at a time)?
print(f"Number of chunks: {len(ds.time) / 12:0.2f}")
Number of chunks: 129.58
In this case, a user could get a single year of monthly data as a single chunk.
It is important to note that we do not get a whole number of chunks. Having 129.58
time chunks means we will have 130 chunks in practice, but the last one is not full-sized. That last chunk is a “partial chunk” because we do not have a full year of data for 2024.
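To make that concrete, we can count the chunks and the size of the leftover directly (a quick check, not part of the chunking itself):
import math

# 1555 monthly time steps split into chunks of 12 (one year per chunk)
n_time = len(ds.time)
print(f"Chunks in practice: {math.ceil(n_time / 12)}")
print(f"Months in the final partial chunk: {n_time % 12}")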
So this is where the judgement call gets made: Which is the more likely read pattern for time: year-by-year, or the whole time series (or some sequence of a few years)? For PRISM, it is more likely that someone will want more than just one year of data. A happy medium for chunk shape along the time dimension could be 6 years of data per chunk.
time_chunk_shape = 12 * 6
print(f"Number of chunks: {len(ds.time) / time_chunk_shape:0.2f}; Chunk of shape: {time_chunk_shape}")
Number of chunks: 21.60; Chunk of shape: 72
This pattern means only 22 chunks (instead of the 130 chunks we were considering a moment ago) are needed to read a full time series at a given location.
Spatial Dimension#
As we can see in our example dataset, it technically contains two spatial dimensions: lat and lon.
So, we are really chunking both of these dimensions when we talk about chunking with respect to space.
While we will consider them both together here, it is important to point out that they can have separate chunk shapes.
This leads to the question of whether future users of this data will want strips of latitude or longitude, square “tiles” in space, or tiles proportional to the latitude and longitude extents.
That is, is it important that the North-South extent be broken into the same number of chunks as the East-West extent?
Let’s start by chunking this into square tiles.
Since there are more lon elements than lat elements, there will be more lon chunks than lat chunks.
nlon = len(ds.lon)
nlat = len(ds.lat)
space_chunk_size = nlat // 4 # split the smaller of the two dimensions into 4 chunks
print(f"Number of 'lon' chunks: {nlon / space_chunk_size:0.3f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlon % space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / space_chunk_size:0.3f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlat % space_chunk_size}")
Number of 'lon' chunks: 9.065; Chunk of shape 155; Size of last chunk: 10
Number of 'lat' chunks: 4.006; Chunk of shape 155; Size of last chunk: 1
Having 9.065 longitude chunks means we will have 10 chunks in practice, but the last one is not full-sized; in this case, it is an extremely thin sliver.
For the latitude chunks, the extra 0.006 of a chunk means that the last, fractional chunk (or “partial chunk”) is only one lat observation wide.
Tip
Ideally, we would want partial chunks to be at least half the size of the standard chunk. The bigger that “remainder” fraction, the better.
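One way to make this rule of thumb concrete is a small helper that reports the size of the final, partial chunk as a fraction of a full chunk; partial_fraction is just an illustrative name, not a library function. For the current 155-element square tiles, it confirms how tiny those remainders are:
def partial_fraction(dim_len, chunk_len):
    """Size of the final (partial) chunk as a fraction of a full chunk."""
    remainder = dim_len % chunk_len
    return 1.0 if remainder == 0 else remainder / chunk_len

# Rule of thumb: aim for a fraction of at least 0.5 in every dimension
print(f"lat: {partial_fraction(nlat, 155):.3f}, lon: {partial_fraction(nlon, 155):.3f}")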
Let’s adjust the chunk shape a little so that we don’t have that sliver. We’re still committed to square tiles, so let’s try a larger chunk shape to change the size of that last fraction. Increasing the chunk size a little should get us bigger “remainders”.
space_chunk_size = 157
print(f"Number of 'lon' chunks: {nlon / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlon % space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlat % space_chunk_size}")
Number of 'lon' chunks: 8.95; Chunk of shape 157; Size of last chunk: 149
Number of 'lat' chunks: 3.96; Chunk of shape 157; Size of last chunk: 150
With this pattern, the “remainder” latitude chunk will have a shape of 150 in the lat dimension, and the “remainder” longitude chunk will have a shape of 149 in the lon dimension.
All other chunks will be square, with 157 observations in both dimensions.
This amounts to a 9x4 chunk grid, with the last chunk in each dimension being partial.
The entire spatial extent for a single time step can be read in 36 chunks with this chunk shape. That seems a little high, given that this dataset will likely be taken at full spatial extent for a typical analysis. Let’s go a little bigger to see what that gets us:
space_chunk_size = 354  # a larger square tile, giving roughly a 4x2 chunk grid
print(f"Number of 'lon' chunks: {nlon / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlon % space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / space_chunk_size:0.2f}; Chunk of shape {space_chunk_size}; Size of last chunk: {nlat % space_chunk_size}")
Number of 'lon' chunks: 3.97; Chunk of shape 354; Size of last chunk: 343
Number of 'lat' chunks: 1.75; Chunk of shape 354; Size of last chunk: 267
This is just as good in terms of nearly full remainder chunks, and the whole extent can be read with only 8 chunks. The smallest remainder is still >75% of a full-sized square tile, which is acceptable.
Alternatively, we could drop the commitment to square tiles and split the spatial dimensions more evenly. For example, we could get as close to a 4x2 split as possible:
# Add one to round up, since neither dimension divides evenly
lon_space_chunk_size = nlon // 4 + 1
lat_space_chunk_size = nlat // 2 + 1
print(f"Number of 'lon' chunks: {nlon / lon_space_chunk_size:0.3f}; Chunk of shape {lon_space_chunk_size}; Size of last chunk: {nlon % lon_space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / lat_space_chunk_size:0.3f}; Chunk of shape {lat_space_chunk_size}; Size of last chunk: {nlat % lat_space_chunk_size}")
Number of 'lon' chunks: 3.991; Chunk of shape 352; Size of last chunk: 349
Number of 'lat' chunks: 1.997; Chunk of shape 311; Size of last chunk: 310
Or we could aim for a 3x3 split:
# Add one to round up, since neither dimension divides evenly
lon_space_chunk_size = nlon // 3 + 1
lat_space_chunk_size = nlat // 3 + 1
print(f"Number of 'lon' chunks: {nlon / lon_space_chunk_size:0.3f}; Chunk of shape {lon_space_chunk_size}; Size of last chunk: {nlon % lon_space_chunk_size}")
print(f"Number of 'lat' chunks: {nlat / lat_space_chunk_size:0.3f}; Chunk of shape {lat_space_chunk_size}; Size of last chunk: {nlat % lat_space_chunk_size}")
Number of 'lon' chunks: 2.996; Chunk of shape 469; Size of last chunk: 467
Number of 'lat' chunks: 2.986; Chunk of shape 208; Size of last chunk: 205
As you may be gathering, the exact proportion between latitude and longitude chunks is not critically important. What matters for basic chunk shape is the total number of chunks across the two dimensions and keeping the final, partial chunk in each dimension as close to a full chunk as possible.
Note
If we were really confident that most analyses want the full spatial extent, we might be better off putting each of the lat and lon dimensions into a single chunk. This would ensure (and require) that every read covers the entire spatial extent. However, our poor time-series analysis would then be stuck reading the entire dataset to get all time values for a single location.
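For reference, that single-spatial-chunk layout could be expressed like this (in xarray and Dask, a chunk length of -1 means “use the full extent of the dimension”):
# One chunk spanning the full spatial extent, with time still split into chunks
full_extent_chunks = {'time': 72, 'lat': -1, 'lon': -1}
# e.g., ds.chunk(full_extent_chunks) would apply this layout lazily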
Size Considerations#
Shape is only part of the equation. Total “chunk size” also matters. Size considerations come into play mostly in terms of how the chunks are stored on disk, since retrieval time is influenced by the size of each chunk. Here are some constraints:
Files Too Big: In a Zarr dataset, each chunk is stored as a separate binary file (and the entire Zarr dataset is a directory grouping these many “chunk” files). If we need data from a particular chunk, no matter how little or how much, that file gets opened, decompressed, and the whole thing read into memory. A large chunk size means that a lot of data may be transferred in situations when only a small subset of that chunk’s data is actually needed. It also means there might not be enough chunks to keep the Dask workers busy loading data in parallel.
Files Too Small: If the chunk size is too small, the time it takes to read and decompress each chunk becomes comparable to the latency of an S3 request (typically 10-100 ms). We want each read to take at least a second or so, so that latency is not a significant part of the overall timing (see the rough illustration below).
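To get a feel for this trade-off, here is a back-of-the-envelope illustration; the latency and throughput numbers are assumptions chosen only to show the effect, not measurements of any particular object store:
# Back-of-the-envelope illustration with assumed (not measured) numbers:
# ~50 ms of request latency and ~100 MB/s of effective throughput.
latency_s = 0.05          # assumed per-request latency in seconds
throughput_Bps = 100e6    # assumed throughput in bytes per second

for chunk_MB in [1, 10, 100]:
    transfer_s = chunk_MB * 1e6 / throughput_Bps
    total_s = latency_s + transfer_s
    print(
        f"{chunk_MB:>3} MB chunk: transfer {transfer_s:.2f} s, "
        f"latency is {latency_s / total_s:.0%} of the total read time"
    )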
Tip
As a general rule, aim for chunk sizes between 10 and 200 MiB, depending on the shape and the expected read pattern of a user.
Total Chunk Size#
To estimate the total chunk size, all we need is the expected chunk shape and the data type, which tells us how many bytes each value takes up.
As an example, let’s use a chunk shape of {'time': 72, 'lat': 354, 'lon': 354}.
This will tell us whether we have hit our target of between 10 and 200 MiB per chunk.
chunks = {'time': 72, 'lat': 354, 'lon': 354}
bytes_per_value = ds.tmn.dtype.itemsize
total_bytes = chunks['time'] * chunks['lat'] * chunks['lon'] * bytes_per_value
kiB = total_bytes / (2 ** 10)
MiB = kiB / (2 ** 10)
print(f"TMN chunk size: {total_bytes} ({kiB=:.2f}) ({MiB=:.2f})")
TMN chunk size: 72182016 (kiB=70490.25) (MiB=68.84)
We’re looking really good for size: about 69 MiB.
This may even be a bit low.
But we’re within the (admittedly broad) range of 10-200 MiB of uncompressed (i.e., in-memory) data per chunk.
Therefore, this seems like a reasonable chunk shape and size for our dataset.
If we were curious about other chunk shapes, like a non-square lat and lon chunk, we could repeat this computation to estimate its size and decide whether it is reasonable.
We won’t work through every option here, but the small helper sketched below makes it easy to try on your own.
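As a convenience, the arithmetic above can be wrapped in a short function; chunk_size_mib is our own illustrative name, not a library call. Here it is applied to the non-square 4x2 split from earlier:
def chunk_size_mib(chunk_shape, dtype):
    """Estimate the uncompressed size of a single chunk, in MiB."""
    n_values = 1
    for dim_len in chunk_shape.values():
        n_values *= dim_len
    return n_values * dtype.itemsize / 2**20

print(f"{chunk_size_mib({'time': 72, 'lat': 311, 'lon': 352}, ds.tmn.dtype):.2f} MiB")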
Review and Final Considerations#
Now that you have a general idea on how to pick chunk shape and size, let’s review and add a few final considerations.
Basic Chunking Recommendations#
When determining the basic chunk shape and size, the choice will depend on the expected read pattern and analysis. If this pattern is unknown, it is important to take a balanced chunking approach that does not favor one dimension over the others (i.e., does not make the chunks much larger along one dimension). Next, the chunk shape should be chosen to avoid partial chunks if possible; otherwise, partial chunks should be at least half the size of a standard chunk. Finally, the total chunk size should be between 10 and 200 MiB for optimal performance.
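To tie these recommendations together, here is a minimal sketch of how the chunk plan chosen above could be applied when writing the dataset to storage; the output path is hypothetical, and encoding/compression details are omitted:
# Lazily rechunk to the chosen plan and (optionally) write it out as Zarr
chunk_plan = {'time': 72, 'lat': 354, 'lon': 354}
ds_rechunked = ds.chunk(chunk_plan)
# ds_rechunked.to_zarr('PRISM_v2_rechunked.zarr', mode='w')  # hypothetical output path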
Final Considerations#
One final thing to consider is that these basic recommendations assume your chunked data will be static and never updated. However, some datasets, especially climate-related ones, are periodically extended along their time dimension, commonly at regular intervals (e.g., every year with the previous year’s data). This can change the choice of chunk shape, so that adding the next year’s worth of data does not require rechunking the whole dataset or produce small partial chunks. For our PRISM example, if we chose a temporal chunk shape of 72 (i.e., six years per chunk), adding a year’s worth of data would mean appending to the final partial chunk until it becomes full, after which further new data would start another partial chunk. This could be avoided by choosing a time chunk shape of 12 (i.e., one year per chunk); additional data would then only create new chunks rather than rewriting existing ones. Therefore, considering updates to the dataset when deciding on the chunking plan can save a lot of time when appending to the dataset in the future.
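As a sketch of why one-year time chunks make updates painless, xarray can append along an existing dimension with append_dim; ds_new_year and the store path here are hypothetical stand-ins for the next 12 monthly time steps and a store chunked with one year per time chunk:
# Hypothetical append: one new year of data arrives as exactly one full time chunk,
# so it is written as new chunk files without rewriting any existing chunks.
# ds_new_year = ...  # 12 new monthly time steps with lat/lon chunks matching the store
# ds_new_year.chunk({'time': 12, 'lat': 354, 'lon': 354}).to_zarr(
#     'PRISM_v2_yearly_chunks.zarr', append_dim='time'
# )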
Additionally, nothing covered here addresses true optimization of chunk shape and size. Proper optimization would attempt to select chunk sizes that are near powers of two (i.e., \(2^N\)) to facilitate optimal storage and disk retrieval. Details on this topic can be found in the advanced topic notebook of Choosing an Optimal Chunk Size and Shape.