How to Examine a Stored Dataset’s Chunk Shape#

The objective of this notebook is to learn how to examine a stored dataset and determine whether it is chunked and, if so, what its “chunk shape” is. To do this, we will use an existing dataset from the HyTEST OSN, take a guided tour of the data, and show how to figure out its chunk shape.

import xarray as xr
import fsspec

Accessing the Dataset#

Before we can open the dataset, we must first get a mapper that allows xarray to open it easily. To do this, we will use fsspec to perform an anonymous read from an endpoint outside of AWS S3 that uses the S3 API (i.e., the HyTEST OSN). This requires us to set up an S3 file system and give it the endpoint URL. We can then point the file system to our dataset (in this case, the PRISM V2 Zarr store) and get a mapper to the file.

fs = fsspec.filesystem(
    's3',
    anon=True,   # anonymous = does not require credentials
    client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org/'}
)
file = fs.get_mapper('s3://mdmf/gdp/PRISM_v2.zarr/')

Now that we have our file mapper, we can open the dataset using xarray.open_dataset() with zarr specified as our engine.

Note

The xarray loader is “lazy”, meaning it reads just enough of the data to make decisions about its shape, structure, etc. It presents the whole dataset as if it were in memory (and we can treat it that way), but it only loads data as required.

ds = xr.open_dataset(file, engine='zarr')
ds
<xarray.Dataset> Size: 33GB
Dimensions:    (lat: 621, lon: 1405, time: 1555, tbnd: 2)
Coordinates:
  * lat        (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.19 24.15 24.1
  * lon        (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time       (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Dimensions without coordinates: tbnd
Data variables:
    crs        int64 8B ...
    ppt        (time, lat, lon) float64 11GB ...
    time_bnds  (time, tbnd) datetime64[ns] 25kB ...
    tmn        (time, lat, lon) float64 11GB ...
    tmx        (time, lat, lon) float64 11GB ...
Attributes: (12/24)
    Conventions:               CF-1.4
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgment:            PRISM Climate Group, Oregon State University, ...
    authors:                   PRISM Climate Group
    cdm_data_type:             Grid
    creator_email:             daley@nacse.org
    ...                        ...
    publisher_url:             http://prism.oregonstate.edu/
    summary:                   This dataset was created using the PRISM (Para...
    time_coverage_resolution:  Monthly
    title:                     Parameter-elevation Regressions on Independent...
    time_coverage_start:       1895-01-01T00:00
    time_coverage_end:         2024-07-01T00:00

The HTML output for the xarray.Dataset includes a lot of information, some of which is hidden behind toggles. Click on the icons to the right to expand and see all the metadata available for the dataset. The page icon will display attributes attached to the data, while the database icon will display information about the dataset.

Notable observations:

  • Dimensions: This dataset is 3D, with data being indexed by lat, lon, and time (setting aside tbnd for the moment; it is a special case). Looking at the “Dimensions” line, you can see that each of these dimensions is quantified (i.e., the size of each dimension).

    • lat = 621

    • lon = 1405

    • time = 1555

  • Coordinates: These are the convenient handles by which dimensions can be referenced. In this dataset, a coordinate can be used to pick out a particular cell of the array. Selecting cells where, say, lat=49.9 is possible because the coordinates map meaningful latitude values to the behind-the-scenes cell index needed to fetch the value (see the short example after this list).

  • Data Variables: The primary variables are ppt, tmn, and tmx, which each have three dimensions (time, lat, lon) by which data values are located in space and time.

  • Indexes: This is an internal data structure to help xarray quickly find items in the array.

  • Attributes: Arbitrary metadata that has been given to the dataset.
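
Here is a minimal sketch of the coordinate-based selection described in the list above (the specific latitude value is chosen purely for illustration):

# Label-based selection: pick the grid cells nearest to a latitude of interest
ds.sel(lat=49.9, method='nearest')

# Index-based selection: the same kind of lookup using the raw cell index instead
ds.isel(lat=1)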

Let’s look at one of the data variables to learn more about it.

Variable = xarray.DataArray#

Each data variable is its own N-dimensional array (in this case, 3-dimensional, indexed by lat, lon, and time). We can look at an individual variable by examining its array separately from the dataset:

ds.tmn
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
[1356745275 values with dtype=float64]
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

Note from the top line that this variable is indexed as a tuple in (time, lat, lon). So, behind the scenes, there is an array whose first index (for time) is a value between 0 and 1554. But how do we know the time value of index 0 (or any index, really)? The “Coordinates” serve as the lookup table that says what “real” time value is associated with each index.
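
For example, here is a small sketch of that index-to-coordinate lookup, using the time range shown in the output above:

# The coordinate array maps the integer index to the "real" time value...
ds.tmn.time.values[0]            # first time step -> 1895-01-01
# ...which lets us select by label instead of by raw index
ds.tmn.sel(time='1895-01-01')    # equivalent to ds.tmn.isel(time=0)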

You’ll notice the data description in this case is merely “1356745275 values with dtype=float64” with no indication as to how it is chunked. Assuming our 3-D array is fully populated, this value makes sense:

# time  lat  lon
1555 * 621 * 1405
1356745275

In terms of chunking, this is where it gets interesting. If you thoroughly examined the HTML output, you may have noticed that there is no reference to chunking anywhere. Therefore, we need to directly access the data in a way that returns the true chunk shape of the stored dataset.

To do this, we can simply check a variable’s “encoding”. This returns metadata that was used by xarray when reading the data.

ds.tmn.encoding
{'chunks': (68, 131, 294),
 'preferred_chunks': {'time': 68, 'lat': 131, 'lon': 294},
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 '_FillValue': np.int16(-9999),
 'scale_factor': 0.01,
 'add_offset': 0.0,
 'dtype': dtype('int16')}

From here we can see two keys dealing with chunks: 'chunks' and 'preferred_chunks'; in this case, both contain the same information about how the data is chunked. The difference between the two is that 'chunks' is the chunk shape of the chunks stored on disk (commonly termed “stored chunks”), while 'preferred_chunks' is the chunk shape the engine chose to open the dataset with. Generally these are the same, but they may differ if the engine has not been set to equate them or if a different chunk shape is specified when opening the dataset. Therefore, our data has a stored chunk shape of {'time': 68, 'lat': 131, 'lon': 294}.
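
As a quick sanity check, we could loop over all data variables and print the stored chunk shape recorded in each one's encoding:

# Print the stored chunk shape recorded in each data variable's encoding
for name, var in ds.data_vars.items():
    print(name, var.encoding.get('chunks'))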

Getting the Chunking When Reading Data#

While checking the “encoding” of the variable can tell you what the dataset’s stored chunk shape is, it is typically easier to do this in one step when you open the dataset. To do this, all we need is to add another keyword when we open the dataset with xarray: chunks={}. As per the xarray.open_dataset documentation:

chunks={} loads the data with dask using the engine’s preferred chunk size, generally identical to the format’s chunk size.

In other words, using chunks={} will load the data with chunk shape equal to 'preferred_chunks'. Let’s check this out and see how our data looks when we include this keyword when opening.

ds = xr.open_dataset(file, engine='zarr', chunks={})
ds
<xarray.Dataset> Size: 33GB
Dimensions:    (lat: 621, lon: 1405, time: 1555, tbnd: 2)
Coordinates:
  * lat        (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.19 24.15 24.1
  * lon        (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time       (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Dimensions without coordinates: tbnd
Data variables:
    crs        int64 8B ...
    ppt        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
    time_bnds  (time, tbnd) datetime64[ns] 25kB dask.array<chunksize=(68, 2), meta=np.ndarray>
    tmn        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
    tmx        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
Attributes: (12/24)
    Conventions:               CF-1.4
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgment:            PRISM Climate Group, Oregon State University, ...
    authors:                   PRISM Climate Group
    cdm_data_type:             Grid
    creator_email:             daley@nacse.org
    ...                        ...
    publisher_url:             http://prism.oregonstate.edu/
    summary:                   This dataset was created using the PRISM (Para...
    time_coverage_resolution:  Monthly
    title:                     Parameter-elevation Regressions on Independent...
    time_coverage_start:       1895-01-01T00:00
    time_coverage_end:         2024-07-01T00:00

Now when we inspect the data variables’ metadata, we can see that the data is read in as dask arrays. Let’s look at the tmn variable again to simplify this.

ds.tmn
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<open_dataset-tmn, shape=(1555, 621, 1405), dtype=float64, chunksize=(68, 131, 294), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

As we can see, the data is chunked into chunks of shape (68, 131, 294) with a chunk size of ~20 MiB. This is exactly what we saw when looking at the encoding. So, this additional keyword worked as expected and gives us a standard way to open chunked datasets using the stored chunk shape as our chunk shape!
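
We can verify that ~20 MiB figure with a quick back-of-the-envelope calculation (each float64 value takes 8 bytes):

# values per stored chunk * 8 bytes per float64 value, converted to MiB
68 * 131 * 294 * 8 / 2**20   # ~20 MiB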

Note that the coordinate variables themselves (lat, lon, and time) are stored as single unchunked arrays of data. Recall that these are used to translate a coordinate value into the index of the corresponding array. Therefore, these coordinate arrays will always be needed in their entirety, so they are read in full whenever a chunk is read, and they do not affect how the data variables themselves are chunked.
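
A quick, optional way to confirm this is to check that the coordinate arrays carry no dask chunks after loading; each should report None, meaning it is a plain in-memory array:

# Dimension coordinates are loaded eagerly as plain numpy-backed arrays, so they have no dask chunking
print(ds.lat.chunks, ds.lon.chunks, ds.time.chunks)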

Changing the Chunk Shape and Size#

Now we can identify the stored chunk shape and size, but these settings may not always be ideal for performing analysis. For example, Zarr recommends a stored chunk size of at least 1 MB uncompressed, as this gives better performance. However, dask recommends chunk sizes between 10 MB and 1 GB for computations, depending on the availability of RAM and the duration of computations. Therefore, our stored chunk size may not be large enough for optimal computations. Thankfully, stored chunks do not need to be the same size as those we use for our computations. In other words, we can group multiple smaller stored chunks together when performing our computations. Xarray makes this easy by allowing us to adjust the chunk shape and size, either as we load the data or afterwards.

Let’s show how this works by increasing the chunks of the minimum monthly temperature to a size of ~500 MiB. To do so when reading in the data, all we need to do is specify the chunk shape with the chunks argument. For our example, let’s use chunks of shape {'time': 150, 'lat': 310, 'lon': 1405}.
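
As a quick check that this shape lands near our ~500 MiB target:

# time * lat * lon values per chunk, 8 bytes each, converted to MiB
150 * 310 * 1405 * 8 / 2**20   # ~498 MiB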

# Note we drop the other variables and select tmn when reading the data
ds_tmn = xr.open_dataset(file, engine='zarr',
                         chunks={'time': 150, 'lat': 310, 'lon': 1405},
                         drop_variables=['ppt', 'time_bnds', 'tmx', 'crs']).tmn
ds_tmn
/tmp/ipykernel_4639/2712997614.py:2: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 150. This could degrade performance. Instead, consider rechunking after loading.
  ds_tmn = xr.open_dataset(file, engine='zarr',
/tmp/ipykernel_4639/2712997614.py:2: UserWarning: The specified chunks separate the stored chunks along dimension "lat" starting at index 310. This could degrade performance. Instead, consider rechunking after loading.
  ds_tmn = xr.open_dataset(file, engine='zarr',
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<open_dataset-tmn, shape=(1555, 621, 1405), dtype=float64, chunksize=(150, 310, 1405), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

Nice! As we can see, the DataArray description now displays the chunk shape we requested. However, we did get a warning indicating that:

UserWarning: The specified chunks separate the stored chunks along dimension X starting at index i. This could degrade performance. Instead, consider rechunking after loading.

Important

This warning is telling us that the chunk shape we have chosen is not a multiple (or grouping) of the stored chunks, and that if we really want this chunk shape, we should rechunk the data after loading.

Oops. Since we are not attached to this chunk shape and do not want to rechunk the data (see the Why (re)Chunk Data? notebook for reasons why you might), we need to select a chunk shape that is a multiple of the stored chunks. This time, let’s try {'time': 68*3, 'lat': 131*3, 'lon': 294*3}. This should increase our original chunk size (~20 MiB) by a factor of 27 (\(3^3 = 27\)), getting us close to the ~500 MiB we want.
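
A quick size check of this grouping (scaling the ~20 MiB stored chunk by a factor of 27):

# stored chunk size in MiB, scaled by 3 along each of the three dimensions
(68 * 131 * 294 * 8 / 2**20) * 3**3   # ~539 MiB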

ds_tmn = xr.open_dataset(file, engine='zarr',
                         chunks={'time': 68 * 3, 'lat': 131 * 3, 'lon': 294 * 3},
                         drop_variables=['ppt', 'time_bnds', 'tmx', 'crs']).tmn
ds_tmn
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<open_dataset-tmn, shape=(1555, 621, 1405), dtype=float64, chunksize=(204, 393, 882), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

Look at that, no warning and close to the chunk size we wanted!

As a final note, we selected our chunk shape while reading in the data. However, we could also change it after loading using xarray.Dataset.chunk() (or xarray.DataArray.chunk() for a single variable, as shown below).

ds.tmn.chunk({'time': 68 * 4, 'lat': 131 * 4, 'lon': 294 * 4})
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<rechunk-merge, shape=(1555, 621, 1405), dtype=float64, chunksize=(272, 524, 1176), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs