How to Examine a Stored Dataset’s Chunk Shape#

The objective of this notebook is to learn how to examine a stored dataset and determine whether it is chunked and, if so, what its “chunk shape” is. To do this, we will use an existing dataset from the HyTEST OSN, take a guided tour of the data, and show how to figure out its chunk shape.

import xarray as xr
import fsspec

Accessing the Dataset#

Before we can open the dataset, we must first get a mapper that allows xarray to open it easily. To do this, we will use fsspec to perform an anonymous read from an endpoint outside of AWS S3 that uses the S3 API (i.e., the HyTEST OSN). This requires us to set up an S3 file system and give it the endpoint URL. We can then point the file system to our dataset (in this case, the PRISM V2 Zarr store) and get a mapper to the file.

fs = fsspec.filesystem(
    's3',
    anon=True,   # anonymous = does not require credentials
    client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org/'}
)
file = fs.get_mapper('s3://mdmf/gdp/PRISM_v2.zarr/')

Now that we have our file mapper, we can open the dataset using xarray.open_dataset() with zarr specified as our engine.

Note

The xarray loader is “lazy”, meaning it reads just enough of the data to make decisions about its shape, structure, etc. It presents the whole dataset as if it were in memory (and we can treat it that way), but it only loads data as required.

ds = xr.open_dataset(file, engine='zarr')
ds
<xarray.Dataset> Size: 33GB
Dimensions:    (lat: 621, lon: 1405, time: 1555, tbnd: 2)
Coordinates:
  * lat        (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.19 24.15 24.1
  * lon        (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time       (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Dimensions without coordinates: tbnd
Data variables:
    crs        int64 8B ...
    ppt        (time, lat, lon) float64 11GB ...
    time_bnds  (time, tbnd) datetime64[ns] 25kB ...
    tmn        (time, lat, lon) float64 11GB ...
    tmx        (time, lat, lon) float64 11GB ...
Attributes: (12/24)
    Conventions:               CF-1.4
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgment:            PRISM Climate Group, Oregon State University, ...
    authors:                   PRISM Climate Group
    cdm_data_type:             Grid
    creator_email:             daley@nacse.org
    ...                        ...
    publisher_url:             http://prism.oregonstate.edu/
    summary:                   This dataset was created using the PRISM (Para...
    time_coverage_resolution:  Monthly
    title:                     Parameter-elevation Regressions on Independent...
    time_coverage_start:       1895-01-01T00:00
    time_coverage_end:         2024-07-01T00:00

The HTML output for the xarray.Dataset includes a lot of information, some of which is hidden behind toggles. Click on the icons to the right to expand and see all the metadata available for the dataset. The page icon will display attributes attached to the data, while the database icon will display information about the dataset.

Notable observations:

  • Dimensions: This dataset is 3D, with data being indexed by lat, lon, and time (setting aside tbnd for the moment; it is a special case). Looking at the “Dimensions” line, you can see that each of these dimensions is quantified (i.e., the size of each dimension).

    • lat = 621

    • lon = 1405

    • time = 1555

  • Coordinates: These are the convenient handles by which dimensions can be referenced. In this dataset, a coordinate can be used to pick out a particular cell of the array. Selecting cells where, say, lat=49.9 is possible because the coordinates map meaningful latitude values to the behind-the-scenes cell index needed to fetch the value (see the short example after this list).

  • Data Variables: The primary variables are ppt, tmn, and tmx, which each have three dimensions (time, lat, lon) by which data values are located in space and time.

  • Indexes: This is an internal data structure to help xarray quickly find items in the array.

  • Attributes: Arbitrary metadata that has been given to the dataset.
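
Here is a minimal sketch of the coordinate-based selection described in the list above (the specific latitude value is chosen purely for illustration):

# Label-based selection: pick the grid cells nearest to a latitude of interest
ds.sel(lat=49.9, method='nearest')

# Index-based selection: the same kind of lookup using the raw cell index instead
ds.isel(lat=1)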

Let’s look at one of the data variables to learn more about it.

Variable = xarray.DataArray#

Each data variable is its own N-dimensional array (in this case, 3-dimensional, indexed by lat, lon, and time). We can look at an individual variable by examining its array separately from the dataset:

ds.tmn
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
[1356745275 values with dtype=float64]
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

Note from the top line that this variable is indexed as a tuple in (time, lat, lon). So, behind the scenes, there is an array whose first index (for time) is a value between 0 and 1554. But how do we know the time value of index 0 (or any index, really)? The “Coordinates” serve as the lookup table that says what “real” time value is associated with each index.
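
For example, here is a small sketch of that index-to-coordinate lookup, using the time range shown in the output above:

# The coordinate array maps the integer index to the "real" time value...
ds.tmn.time.values[0]            # first time step -> 1895-01-01
# ...which lets us select by label instead of by raw index
ds.tmn.sel(time='1895-01-01')    # equivalent to ds.tmn.isel(time=0)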

You’ll notice the data description in this case is merely “1356745275 values with dtype=float64” with no indication as to how it is chunked. Assuming our 3-D array is fully populated, this value makes sense:

# time  lat  lon
1555 * 621 * 1405
1356745275

In terms of chunking, this is where it gets interesting. If you thoroughly examined the HTML output, you may have noticed that there is no reference to chunking anywhere. Therefore, we need to directly access the data in a way that returns the true chunk shape of the stored dataset.

To do this, we can simply check a variable’s “encoding”. This returns metadata that was used by xarray when reading the data.

ds.tmn.encoding
{'chunks': (68, 131, 294),
 'preferred_chunks': {'time': 68, 'lat': 131, 'lon': 294},
 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0),
 'filters': None,
 '_FillValue': np.int16(-9999),
 'scale_factor': 0.01,
 'add_offset': 0.0,
 'dtype': dtype('int16')}

From here we can see two keys dealing with chunks: 'chunks' and 'preferred_chunks'; in this case, both contain the same information about how the data is chunked. The difference between the two is that 'chunks' is the chunk shape of the chunks stored on disk (commonly termed “stored chunks”), while 'preferred_chunks' is the chunk shape the engine chose to open the dataset with. Generally these are the same, but they may differ if the engine has not been set to equate them or if a different chunk shape is specified when opening the dataset. Therefore, our data has a stored chunk shape of {'time': 68, 'lat': 131, 'lon': 294}.
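
As a quick sanity check, we could loop over all data variables and print the stored chunk shape recorded in each one's encoding:

# Print the stored chunk shape recorded in each data variable's encoding
for name, var in ds.data_vars.items():
    print(name, var.encoding.get('chunks'))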

Getting the Chunking When Reading Data#

While checking the “encoding” of the variable can tell you what the dataset’s stored chunk shape is, it is typically easier to do this in one step when you open the dataset. To do this, all we need is to add another keyword when we open the dataset with xarray: chunks={}. As per the xarray.open_dataset documentation:

chunks={} loads the data with dask using the engine’s preferred chunk size, generally identical to the format’s chunk size.

In other words, using chunks={} will load the data with chunk shape equal to 'preferred_chunks'. Let’s check this out and see how our data looks when we include this keyword when opening.

ds = xr.open_dataset(file, engine='zarr', chunks={})
ds
<xarray.Dataset> Size: 33GB
Dimensions:    (lat: 621, lon: 1405, time: 1555, tbnd: 2)
Coordinates:
  * lat        (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.19 24.15 24.1
  * lon        (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time       (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Dimensions without coordinates: tbnd
Data variables:
    crs        int64 8B ...
    ppt        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
    time_bnds  (time, tbnd) datetime64[ns] 25kB dask.array<chunksize=(68, 2), meta=np.ndarray>
    tmn        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
    tmx        (time, lat, lon) float64 11GB dask.array<chunksize=(68, 131, 294), meta=np.ndarray>
Attributes: (12/24)
    Conventions:               CF-1.4
    Metadata_Conventions:      Unidata Dataset Discovery v1.0
    acknowledgment:            PRISM Climate Group, Oregon State University, ...
    authors:                   PRISM Climate Group
    cdm_data_type:             Grid
    creator_email:             daley@nacse.org
    ...                        ...
    publisher_url:             http://prism.oregonstate.edu/
    summary:                   This dataset was created using the PRISM (Para...
    time_coverage_resolution:  Monthly
    title:                     Parameter-elevation Regressions on Independent...
    time_coverage_start:       1895-01-01T00:00
    time_coverage_end:         2024-07-01T00:00

Now when we inspect the data variables’ metadata, we can see that the data is read in as dask arrays. Let’s look at the tmn variable again to simplify this.

ds.tmn
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<open_dataset-tmn, shape=(1555, 621, 1405), dtype=float64, chunksize=(68, 131, 294), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

As we can see, the data is chunked into chunks of shape (68, 131, 294) with a chunk size of ~20 MiB. This is exactly what we saw when looking at the encoding. So, this additional keyword worked as expected and gives us a standard way to open chunked datasets using the stored chunk shape as our chunk shape!
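
We can verify that ~20 MiB figure with a quick back-of-the-envelope calculation (each float64 value takes 8 bytes):

# values per stored chunk * 8 bytes per float64 value, converted to MiB
68 * 131 * 294 * 8 / 2**20   # ~20 MiB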

Note that the coordinate variables themselves (lat, lon, and time) are stored as single unchunked arrays of data. Recall that these are used to translate a coordinate value into the index of the corresponding array. Therefore, these coordinate arrays will always be needed in their entirety, so they are read in full whenever a chunk is read, and they do not affect how the data variables themselves are chunked.
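
A quick, optional way to confirm this is to check that the coordinate arrays carry no dask chunks after loading; each should report None, meaning it is a plain in-memory array:

# Dimension coordinates are loaded eagerly as plain numpy-backed arrays, so they have no dask chunking
print(ds.lat.chunks, ds.lon.chunks, ds.time.chunks)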

Changing the Chunk Shape and Size#

Now we can identify the stored chunk shape and size, but these settings may not always be ideal for performing analysis. For example, Zarr recommends a stored chunk size of at least 1 MB uncompressed, as this gives better performance. However, dask recommends chunk sizes between 10 MB and 1 GB for computations, depending on the availability of RAM and the duration of computations. Therefore, our stored chunk size may not be large enough for optimal computations. Thankfully, stored chunks do not need to be the same size as those we use for our computations. In other words, we can group multiple smaller stored chunks together when performing our computations. Xarray makes this easy by allowing us to adjust the chunk shape and size, either as we load the data or afterwards.

Let’s show how this works by increasing the chunks of the minimum monthly temperature to a size of ~500 MiB. To do so when reading in the data, all we need to do is specify the chunk shape with the chunks argument. For our example, let’s use chunks of shape {'time': 150, 'lat': 310, 'lon': 1405}.
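
As a quick check that this shape lands near our ~500 MiB target:

# time * lat * lon values per chunk, 8 bytes each, converted to MiB
150 * 310 * 1405 * 8 / 2**20   # ~498 MiB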

# Note we drop the other variables and select tmn when reading the data
ds_tmn = xr.open_dataset(file, engine='zarr',
                         chunks={'time': 150, 'lat': 310, 'lon': 1405},
                         drop_variables=['ppt', 'time_bnds', 'tmx', 'crs']).tmn
ds_tmn
/tmp/ipykernel_4639/2712997614.py:2: UserWarning: The specified chunks separate the stored chunks along dimension "time" starting at index 150. This could degrade performance. Instead, consider rechunking after loading.
  ds_tmn = xr.open_dataset(file, engine='zarr',
/tmp/ipykernel_4639/2712997614.py:2: UserWarning: The specified chunks separate the stored chunks along dimension "lat" starting at index 310. This could degrade performance. Instead, consider rechunking after loading.
  ds_tmn = xr.open_dataset(file, engine='zarr',
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<open_dataset-tmn, shape=(1555, 621, 1405), dtype=float64, chunksize=(150, 310, 1405), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

Nice! As we can see, the DataArray description now displays the chunk shape we requested. However, we did get a warning indicating that:

UserWarning: The specified chunks separate the stored chunks along dimension X starting at index i. This could degrade performance. Instead, consider rechunking after loading.

Important

This warning is telling us that the chunk shape we have chosen is not a multiple (or grouping) of the stored chunks, and that if we really want this chunk shape, we should rechunk the data after loading.

Oops. Since we are not attached to this chunk shape and do not want to rechunk the data (see the Why (re)Chunk Data? notebook for reasons why you might), we need to select a chunk shape that is a multiple of the stored chunks. This time, let’s try {'time': 68*3, 'lat': 131*3, 'lon': 294*3}. This should increase our original chunk size (~20 MiB) by a factor of 27 (\(3^3 = 27\)), getting us close to the ~500 MiB we want.
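
A quick size check of this grouping (scaling the ~20 MiB stored chunk by a factor of 27):

# stored chunk size in MiB, scaled by 3 along each of the three dimensions
(68 * 131 * 294 * 8 / 2**20) * 3**3   # ~539 MiB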

ds_tmn = xr.open_dataset(file, engine='zarr',
                         chunks={'time': 68 * 3, 'lat': 131 * 3, 'lon': 294 * 3},
                         drop_variables=['ppt', 'time_bnds', 'tmx', 'crs']).tmn
ds_tmn
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<open_dataset-tmn, shape=(1555, 621, 1405), dtype=float64, chunksize=(204, 393, 882), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs

Look at that, no warning and close to the chunk size we wanted!

As a final note, we selected our chunk shape while reading in the data. However, we could also change it after loading using xarray.Dataset.chunk() (or xarray.DataArray.chunk() for a single variable, as shown below).

ds.tmn.chunk({'time': 68 * 4, 'lat': 131 * 4, 'lon': 294 * 4})
<xarray.DataArray 'tmn' (time: 1555, lat: 621, lon: 1405)> Size: 11GB
dask.array<rechunk-merge, shape=(1555, 621, 1405), dtype=float64, chunksize=(272, 524, 1176), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 2kB 49.94 49.9 49.85 49.81 ... 24.23 24.19 24.15 24.1
  * lon      (lon) float32 6kB -125.0 -125.0 -124.9 ... -66.6 -66.56 -66.52
  * time     (time) datetime64[ns] 12kB 1895-01-01 1895-02-01 ... 2024-07-01
Attributes:
    units:         degC
    long_name:     Minimum monthly temperature
    grid_mapping:  crs