Choosing an Optimal Chunk Size and Shape#
The objective of this notebook is to discuss how to select an optimal chunk size and shape for a given dataset. In the Basics of Chunk Shape and Size notebook, we covered the general considerations for chunking, but not how to apply them when selecting a chunk shape and size. These considerations included:
chunk size between 10-200 MiB,
having partial final chunks at least half the size of the standard chunk,
performant future reads across all dimensions (or optimized for the dominant read pattern), and
chunk shape that is optimized for future data additions (if they occur).
Here, we build upon those considerations and present a basic algorithm for automatically selecting an “optimal” chunk shape and size, giving us a method for making this selection without much trial and error.
import fsspec
import xarray as xr
import numpy as np
import itertools
The Many Considerations for Optimal Chunk Shape and Size Selection#
As we just listed, there are several things to consider when selecting a chunk size. Creating an “optimal” chunk shape and size requires balancing all of these things to make our dataset perform as we want. For example, the “perfect” chunking would have:
chunk size around 10 MiB (to accommodate different hardware),
no partial final chunks (i.e., chunk shape is integer divisor of the full data shape),
performant future reads across all dimensions and groups of dimensions, and
a shape that is an integer divisor of future data additions (if they occur).
In addition to these, if we can select a chunk size that is optimal for our disk storage (e.g., the disk block size), we should further improve read times. However, in practice, there is no way to get the “perfect” chunk. We will almost always have to compromise on one or more of these criteria to stay within the constraints created by another. Which criteria we compromise on is up to us, which makes determining the “optimal” chunk relatively subjective. Therefore, chunk shape and size selection is just as much an art as a science: it rests on some firm rules, but it also depends on our preferences.
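For instance, the first two criteria are straightforward to check for any candidate chunk shape. Here is a minimal sketch (using a made-up variable shape, chunk shape, and 4-byte values) that computes the uncompressed chunk size and tests whether the chunk shape evenly divides the data shape:

var_shape = (1200, 600, 600)      # hypothetical variable shape (time, y, x)
chunk_shape = (150, 120, 120)     # hypothetical chunk shape to evaluate
itemsize = 4                      # bytes per value (e.g., float32)

# Criterion 1: uncompressed size of a single chunk, in MiB
chunk_mib = itemsize * np.prod(chunk_shape) / 2**20
# Criterion 2: no partial final chunks means the chunk shape evenly divides the data shape
divides_evenly = bool((np.remainder(var_shape, chunk_shape) == 0).all())
print(f'{chunk_mib:.2f} MiB per chunk; evenly divides the data shape: {divides_evenly}')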
Optimal Chunk Shape Algorithm#
As chunk shape and size selection is relatively subjective, let’s create an algorithm using some of our preferences and restrictions to automate the chunk selection process. To account for the first two considerations (size and partial chunks), this algorithm will focus on determining a chunk shape from the dataset shape given a maximum (uncompressed) chunk size and a minimum partial chunk fraction. It will return a chunk shape whose size is as close to the maximum size as possible, while ensuring any final chunks contain partial chunk fractions above the specified minimum. For reads across dimensions, we will limit our algorithm to 3D data only and try to balance 1D and 2D read times. For example, if we have a 3D spatiotemporal dataset, it would propose an “optimal” shape that allows for almost equal read times for time-series and spatial reads. Finally, we will assume the algorithm will only be applied to static datasets, so we do not need to worry about any future data additions.
So, let’s take a look at the algorithm, which we will write as a function.
def chunk_shape_3D(var_shape, chunk_size=4096, var_value_size=4, partial_chunk_frac=0.5, verbose=False):
"""
    Return an "optimal" shape for a 3D variable, assuming balanced 1D-to-2D access.
Parameters
----------
var_shape : tuple[int]
        Length-3 tuple giving the variable shape as (T, X, Y).
chunk_size : int
Maximum chunk size desired, in bytes (default = 4096).
var_value_size : int
Size of each variable data value, in bytes (default = 4).
partial_chunk_frac : float
The minimum fraction of data that the partial final chunk must
contain using the returned chunk shape (default = 0.5).
verbose : bool
If True, info on other candidate chunk shapes will be printed
(default = False).
Returns
-------
chunk_shape : tuple[int]
The optimal chunk shape that provides balanced access of 1D subsets
        and 2D subsets of a variable with shape (T, X, Y), where T is typically
the time dimension and X and Y are the spatial dimensions. An "optimal"
shape for chunks means that the number of chunks accessed to read
either a full time series (1D) or full spatial map (2D) is
approximately equal, and the size of each chunk (uncompressed) is no
more than `chunk_size`, which is often a disk block size.
"""
rank = len(var_shape)
# Get ideal chunk info using only chunk size and balanced read
ideal_num_vals_per_chunk = chunk_size / float(var_value_size)
ideal_num_chunks = np.prod(var_shape) / ideal_num_vals_per_chunk
if ideal_num_chunks < 1:
ideal_num_chunks = 1
ideal_1d_num_chunks = np.sqrt(ideal_num_chunks)
ideal_2d_num_chunks = np.sqrt(ideal_1d_num_chunks)
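    # A balanced split gives sqrt(N) chunks along the 1D (time) axis and N**0.25
    # chunks along each 2D (spatial) axis, since sqrt(N) * N**0.25 * N**0.25 = N.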
if verbose:
print(f'Ideal number of values in a chunk = {ideal_num_vals_per_chunk:0.1f}')
print(f'Ideal number of chunks = {ideal_num_chunks:0.1f}')
print(f'Ideal number of chunks along the 1D axis = {ideal_1d_num_chunks:0.1f}')
print(f'Ideal number of chunks along each 2D axis = {ideal_2d_num_chunks:0.1f}')
    # Get the optimal chunk length along the 1D (time) dimension
    # Check if the first dimension is smaller than the ideal number of chunks
    # If so, set its chunk length to 1 and adjust the ideal 2D chunk count
if var_shape[0] < ideal_1d_num_chunks:
chunk_dim = 1.0
ideal_2d_num_chunks = (ideal_2d_num_chunks
/ np.sqrt(var_shape[0] / ideal_1d_num_chunks))
else:
chunk_dim = var_shape[0] // ideal_1d_num_chunks
# Add chunk dim to optimal chunk shape list
optimal_chunk_shape = [chunk_dim]
# Get 2D optimal chunk shape along each dimension
    prod = 1.0  # factor to increase other dims if some chunk lengths must be set to 1
for i in range(1, rank):
if var_shape[i] < ideal_2d_num_chunks:
prod *= ideal_2d_num_chunks / var_shape[i]
for i in range(1, rank):
if var_shape[i] < ideal_2d_num_chunks:
chunk_dim = 1.0
else:
chunk_dim = (prod * var_shape[i]) // ideal_2d_num_chunks
optimal_chunk_shape.append(chunk_dim)
# Calculate the partial chunk fraction from the remainder
remainder_frac_per_dim = np.remainder(var_shape, optimal_chunk_shape) / optimal_chunk_shape
    # If the remainder fraction is 0, swap with 1 for multiplication of partial chunk fraction
optimal_chunk_frac = np.where(remainder_frac_per_dim == 0, 1, remainder_frac_per_dim).prod()
if verbose:
print(f'Ideal chunk shape = {tuple(map(int, optimal_chunk_shape))}')
print(f'Ideal chunk, partial chunk fraction = {optimal_chunk_frac}')
"""
Optimal_chunk_shape is typically too small, size(optimal_chunk_shape) < chunk_size
and may have small partial chunks. So, we adjust by adding 1 to some chunk shape
dimensions to get as close as possible to chunk_size without exceeding it and then
check if the shape is over the minimum partial chunk fraction. If it is acceptable,
that is our chunk shape. Otherwise, we continually subtract one from the 1D dimension
to get to the partial fraction minimum, while adding one to the 2D dimensions to
    maintain the size request. We then reverse this and subtract from the 2D dimensions
    and add to the 1D dimension. The optimal chunk is then the one that is the most
balanced of these two increment and decrement methods.
"""
    # Try adding 0 or 1 to each dimension of the optimal chunk shape
best_chunk_size = 0
best_chunk_shape = []
if verbose:
print('\n--- Candidates ---')
for dim_increment in itertools.product([0, 1], repeat=3):
candidate_chunk_shape = np.add(optimal_chunk_shape, dim_increment)
this_chunk_size = int(var_value_size * np.prod(candidate_chunk_shape))
remainder = np.remainder(var_shape, candidate_chunk_shape) / candidate_chunk_shape
this_chunk_frac = np.where(remainder == 0, 1, remainder).prod()
if verbose:
if (this_chunk_size <= chunk_size) and (this_chunk_frac >= partial_chunk_frac):
print(f'{tuple(map(int, candidate_chunk_shape))}; '
f'Total size per chunk (MB): {this_chunk_size/2**20:0.3f} '
f'(ratio: {np.prod(candidate_chunk_shape) / ideal_num_vals_per_chunk:0.3f}); '
f'Partial chunk fraction: {this_chunk_frac}')
# Only keep if closest to chunk size limit and above partial fraction limit
if (best_chunk_size < this_chunk_size <= chunk_size) and (this_chunk_frac >= partial_chunk_frac):
best_chunk_size = this_chunk_size
best_chunk_shape = list(candidate_chunk_shape) # make a copy of best candidate so far
# Return if a shape was found
if best_chunk_shape:
return list(map(int, best_chunk_shape))
# Increment and decrement 1D and 2D from optimal chunk shape to get a best shape
increments_decrements = [[[-1, 0, 0], [0, 1, 0], [0, 0, 1]],
[[1, 0, 0], [0, -1, 0], [0, 0, -1]]]
# Use Euclidean distance to estimate balanced shape
best_shape_balance = np.linalg.norm(np.array(optimal_chunk_shape) - np.array([0, 0, 0]))
for increment_decrement in increments_decrements:
best_chunk_frac = optimal_chunk_frac
candidate_chunk_shape = list(optimal_chunk_shape)
while best_chunk_frac < partial_chunk_frac:
# Quit if any candidate is too big or too small in a dimension
if ((np.array(candidate_chunk_shape) < 1).any()
or (candidate_chunk_shape > np.array(var_shape)).any()):
break
for dim_increment in increment_decrement:
candidate_chunk_shape = np.add(candidate_chunk_shape, dim_increment)
this_chunk_size = int(var_value_size * np.prod(candidate_chunk_shape))
remainder = np.remainder(var_shape, candidate_chunk_shape) / candidate_chunk_shape
this_chunk_frac = np.where(remainder == 0, 1, remainder).prod()
if (this_chunk_size <= chunk_size) and (this_chunk_frac >= partial_chunk_frac):
if verbose:
print(f'{tuple(map(int, candidate_chunk_shape))}; '
f'Total size per chunk (MB): {this_chunk_size/2**20:0.3f} '
f'(ratio: {np.prod(candidate_chunk_shape) / ideal_num_vals_per_chunk:0.3f}); '
f'Partial chunk fraction: {this_chunk_frac}')
best_chunk_frac = this_chunk_frac
shape_balance = np.linalg.norm(np.array(optimal_chunk_shape) - candidate_chunk_shape)
# Only save candidate if it is more balanced than previous one
if shape_balance < best_shape_balance:
best_shape_balance = shape_balance
best_chunk_shape = list(candidate_chunk_shape)
return tuple(map(int, best_chunk_shape))
Examples#
Toy Example#
Okay, now that we have our algorithm, let’s give it a test using some made-up data.
We will use a variable of shape (365, 240, 150), aim for a chunk size of 1 MiB (2**20 bytes), and require a partial chunk fraction of at least 0.5 (about 0.8 in each dimension, since \(0.8^3 \approx 0.5\)).
var_shape = (365, 240, 150)
chunk_shape = chunk_shape_3D(
var_shape, chunk_size=2**20, partial_chunk_frac=0.5, verbose=True
)
print(f'\n"Optimal" chunk shape: {chunk_shape}')
Ideal number of values in a chunk = 262144.0
Ideal number of chunks = 50.1
Ideal number of chunks along the 1D axis = 7.1
Ideal number of chunks along each 2D axis = 2.7
Ideal chunk shape = (51, 90, 56)
Ideal chunk, partial chunk fraction = 0.07096171802054155
--- Candidates ---
(16, 124, 90); Total size per chunk (MB): 0.681 (ratio: 0.681); Partial chunk fraction: 0.5067204301075268
(53, 88, 54); Total size per chunk (MB): 0.961 (ratio: 0.961); Partial chunk fraction: 0.5016199733180866
"Optimal" chunk shape: (53, 88, 54)
Nice!
From these results, we can see that the “ideal” chunk shape given only our chunk size constraint would have been (51, 90, 56).
However, that shape would have produced very small partial final chunks (a partial chunk fraction of only about 0.07), so it would not have been a good choice given the additional constraint.
Therefore, the algorithm adjusted the chunk shape to (53, 88, 54), which meets the partial chunk constraint while remaining a reasonably close alternative in terms of both dimension balance and chunk size.
This is great, exactly what we wanted!
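As a quick sanity check, we can reuse var_shape and chunk_shape from above (with the default 4-byte value size) to confirm that the returned shape stays under the 1 MiB limit while keeping the partial final chunks above half of a full chunk:

# Uncompressed chunk size in MiB (4 bytes per value, the function default)
print(f'Chunk size: {4 * np.prod(chunk_shape) / 2**20:0.3f} MiB')
# Partial final chunk fraction across all dimensions
remainder = np.remainder(var_shape, chunk_shape) / np.array(chunk_shape)
print(f'Partial chunk fraction: {np.where(remainder == 0, 1, remainder).prod():0.3f}')

This should report a chunk of roughly 0.96 MiB and a partial chunk fraction just above 0.5, matching the verbose candidate output above.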
PRISM Example#
As a real world example, let’s apply the algorithm to the PRISM data from the Basics of Chunk Shape and Size notebook. First, we need to read in the data.
fs = fsspec.filesystem(
's3',
anon=True, # anonymous = does not require credentials
client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org/'}
)
ds = xr.open_dataset(
fs.get_mapper('s3://mdmf/gdp/PRISM_v2.zarr/'),
engine='zarr',
chunks={}
)
ds
<xarray.Dataset> Size: 33GB
Dimensions:    (lat: 621, lon: 1405, time: 1555, tbnd: 2)
Coordinates:
  * lat        (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.19 24.15 24.1
  * lon        (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time       (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Dimensions without coordinates: tbnd
Data variables:
    crs        int64 8B ...
    ppt        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
    time_bnds  (time, tbnd) datetime64[ns] 25kB dask.array<chunksize=(68, 2), meta=np.ndarray>
    tmn        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
    tmx        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
Attributes: (12/24)
    Conventions:               CF-1.4
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgment:            PRISM Climate Group, Oregon State University, ...
    authors:                   PRISM Climate Group
    cdm_data_type:             Grid
    creator_email:             daley@nacse.org
    ...                        ...
    publisher_url:             http://prism.oregonstate.edu/
    summary:                   This dataset was created using the PRISM (Para...
    time_coverage_resolution:  Monthly
    title:                     Parameter-elevation Regressions on Independent...
    time_coverage_start:       1895-01-01T00:00
    time_coverage_end:         2024-07-01T00:00
Now, we can estimate the “optimal” chunk shape using our algorithm. We will only do this for the precipitation data variable, as the others all have the same shape and data type. If you had additional variables with different data types, you would want to run this algorithm on a variable of each type to find its optimal shape. Also, we will use a maximum chunk size that matches the dataset’s current chunk size to see how our algorithm compares to the chosen chunking.
current_chunk_size = np.prod([chunksize[0] for chunksize in ds['ppt'].chunksizes.values()])
current_chunk_size *= ds['ppt'].dtype.itemsize
chunk_shape = chunk_shape_3D(
ds['ppt'].shape,
chunk_size=current_chunk_size,
var_value_size=ds['ppt'].dtype.itemsize,
verbose=True
)
print(f'\n"Optimal" chunk shape: {chunk_shape}')
Ideal number of values in a chunk = 2618952.0
Ideal number of chunks = 518.0
Ideal number of chunks along the 1D axis = 22.8
Ideal number of chunks along each 2D axis = 4.8
Ideal chunk shape = (68, 130, 294)
Ideal chunk, partial chunk fraction = 0.5250604087788961
--- Candidates ---
(68, 130, 294); Total size per chunk (MB): 19.828 (ratio: 0.992); Partial chunk fraction: 0.5250604087788961
(68, 130, 295); Total size per chunk (MB): 19.896 (ratio: 0.996); Partial chunk fraction: 0.5141402714932127
(68, 131, 294); Total size per chunk (MB): 19.981 (ratio: 1.000); Partial chunk fraction: 0.5004165788452786
"Optimal" chunk shape: [68, 131, 294]
As we can see, our algorithm arrives at a chunk shape of (68, 131, 294), which exactly matches the current chunking of the dataset.
The unconstrained “ideal” shape of (68, 130, 294) was already close; the algorithm simply increased the second dimension by one to get as close as possible to the chunk size limit while keeping the partial chunk fraction at or above the 0.5 minimum.
In other words, given the same chunk size target, our balanced-read heuristic reproduces the chunk shape that was chosen for this dataset.
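For a quick confirmation, we can compare the algorithm’s result directly against the chunk shape stored for the ppt variable (exposed through its Dask chunking):

current_shape = tuple(sizes[0] for sizes in ds['ppt'].chunksizes.values())
print(f'Current chunk shape:   {current_shape}')
print(f'Algorithm chunk shape: {tuple(chunk_shape)}')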
Further Considerations#
While our algorithm worked well for the PRISM data, it covers a very narrow case of chunk shape selection. It assumes you want even read times for the temporal and spatial dimensions, and it only allows for 3D data. What if we had more dimensions, or fewer? What if we wanted a different, unbalanced read pattern? What if we wanted no partial chunks at all and only chunk shapes that are divisors of the variable shape? This algorithm is therefore not general by any means, but it does give us an idea of how to formulate more general algorithms in the future. For example, one such algorithm can be found in this repo on dynamic rechunking, which provides methods that allow the user to specify how the dimensions are balanced along with a chunk size limit. As algorithms like these mature, selecting an optimal chunk shape and size should become easier. However, the subjective component will never go away: someone will always need to decide which criteria influencing chunk shape and size matter most.
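To make the idea of user-specified dimension balancing concrete, here is a rough, hypothetical sketch (not the API of any existing package) of how per-dimension weights and a target number of values per chunk could drive an N-dimensional chunk shape:

def weighted_chunk_shape(var_shape, target_nvals, weights):
    # Hypothetical helper: distribute a target number of values per chunk across
    # dimensions using user-chosen weights (a larger weight gives a longer chunk
    # along that dimension).
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the exponents sum to 1
    # Each dimension gets ~target_nvals**w_i values, capped at the dimension length;
    # capping can make the resulting chunk smaller than the target.
    return tuple(int(min(s, round(target_nvals ** wi))) for s, wi in zip(var_shape, w))

# Equal weights give a balanced chunk; weighting time more heavily favors time-series reads
print(weighted_chunk_shape((365, 240, 150), 2**20 // 4, [1, 1, 1]))
print(weighted_chunk_shape((365, 240, 150), 2**20 // 4, [2, 1, 1]))

Equal weights reproduce a balanced cube of values, while up-weighting the first (time) dimension trades spatial read performance for faster time-series reads. A production-quality algorithm would also need to handle chunk size limits and partial final chunks, as discussed above.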