Generating a Virtual Zarr Store

The objective of this notebook is to learn how to create a virtual Zarr store for a collection of NetCDF files that together make up a complete dataset. To do this, we will use Kerchunk and VirtualiZarr. Since these two packages both create virtual Zarr stores but do so in different ways, we will use them side by side, in combination with Dask for parallel execution, to show how they compare.

import os
# Needed when boto3 >= 1.36.0 or the rechunking process will fail
# This needs to be set before the boto3 library gets loaded
# See: https://github.com/aws/aws-cli/issues/9214#issuecomment-2606619168
os.environ['AWS_REQUEST_CHECKSUM_CALCULATION'] = 'when_required'
os.environ['AWS_RESPONSE_CHECKSUM_VALIDATION'] = 'when_required'
import fsspec
import xarray as xr
import ujson
import time
import kerchunk.hdf
import kerchunk.combine
from virtualizarr import open_virtual_dataset
import logging
import dask

Kerchunk vs VirtualiZarr

To begin, let’s explain what a virtual Zarr store even is. A “virtual Zarr store” is a virtual representation of a Zarr store generated by mapping any number of real datasets in individual files (e.g., NetCDF/HDF5, GRIB2, TIFF) together into a single, sliceable dataset via an interface layer. This interface layer, which Kerchunk and VirtualiZarr generate, contains information about the original files (e.g., chunking, compression, data byte location, etc.) needed to efficiently access the data. While this could be done with xarray.open_mfdataset, we don’t want to run this command every time we open the dataset as it can be a slow and expensive process. The reason for this is that xarray.open_mfdataset performs many consistency checks as it runs, and it requires partially opening all of the datasets to get general metadata information on each of the individual files. Therefore, for numerous files, this can have significant overhead, and it would be preferable to just cache these checks and metadata for more performant future reads. This cache (specifically in Zarr format) is what a virtual Zarr store is. Once we have the virtual Zarr store, we can open the combined xarray dataset using xarray.open_dataset for an almost instantaneous read.
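
To make this concrete, below is a small, purely illustrative sketch of what such an interface layer looks like in Kerchunk's reference format; the variable name, file path, chunking, offsets, and lengths are made up for illustration.

# Purely illustrative sketch of a Kerchunk-style reference dictionary:
# small metadata is stored inline as JSON strings, while each data chunk
# is a [file, byte offset, byte length] pointer into the original file.
example_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "precip/.zarray": '{"chunks": [61, 98, 231], "dtype": "<f4"}',  # hypothetical chunking
        "precip/0.0.0": ["s3://bucket/some_file.nc", 8192, 1048576],    # hypothetical chunk pointer
    },
}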

Now that we know what a virtual Zarr store is, let’s discuss the differences between Kerchunk and VirtualiZarr and their virtual Zarr stores. At a top level, VirtualiZarr provides almost all of the same features as Kerchunk. The primary difference is that Kerchunk supports virtual formats beyond Zarr, while VirtualiZarr is focused specifically on the Zarr format. Additionally, Kerchunk creates the virtual Zarr store and represents it in memory using JSON (the format used for Zarr metadata), whereas VirtualiZarr represents the store as array-level abstractions (which can be converted to the JSON format). These abstractions can be cleanly wrapped by xarray, making it easy to combine virtual Zarr stores with the xarray.concat and xarray.merge commands. A nice table comparing the two packages can be found in the VirtualiZarr FAQs, which shows how the two packages represent virtual Zarr stores and their comparative syntax.

Spin up Dask Cluster

To run the virtual Zarr creation in parallel, we need to spin up a Dask cluster to schedule the various workers.

%run ../../../../environment_set_up/Start_Dask_Cluster_Nebari.ipynb
## If this notebook is not being run on Nebari, replace the above 
## path name with a helper appropriate to your compute environment.  Examples:
# %run ../../../../environment_set_up/Start_Dask_Cluster_Denali.ipynb
# %run ../../../../environment_set_up/Start_Dask_Cluster_Tallgrass.ipynb
# %run ../../../../environment_set_up/Start_Dask_Cluster_Desktop.ipynb
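
If none of the helper notebooks fit your environment, a minimal local cluster can be started directly with dask.distributed; the worker counts below are placeholders to adjust for your machine.

# Minimal fallback: start a local Dask cluster directly
# (adjust n_workers/threads to your machine; only run this if you did
# not already create a client/cluster with one of the helpers above)
# from dask.distributed import Client, LocalCluster
# cluster = LocalCluster(n_workers=4, threads_per_worker=1)
# client = Client(cluster)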

Example Comparison

With our Dask cluster ready, let’s see how Kerchunk and VirtualiZarr can be utilized to generate a virtual Zarr store. For this example, we will use the same daily gridMET NetCDF data as used in the Writing Chunked File tutorial. Only this time, we will use all of the variables, not just precipitation. These include:

  • precipitation,

  • maximum relative humidity,

  • minimum relative humidity,

  • specific humidity,

  • downward shortwave radiation,

  • minimum air temperature,

  • maximum air temperature,

  • wind direction, and

  • wind speed.

The data is currently hosted on the HyTEST OSN as a collection of NetCDF files. To access the data with both Kerchunk and VirtualiZarr, we will use fsspec to get the list of files that we want to combine into a virtual Zarr store.

First, we need to create the file system for accessing the files, and a second one for outputting the virtual Zarr stores.

# These reader options will be needed for VirtualiZarr
# We created them here to show how they fold into fsspec
reader_options = {
    'storage_options': {
        'anon': True, 
        'client_kwargs': {
            'endpoint_url': 'https://usgs.osn.mghpcc.org/'
        }
    }
}

fs = fsspec.filesystem(
    protocol='s3',
    **reader_options['storage_options']
)

fs_local = fsspec.filesystem('')
# Make directories to save the virtual zarr stores
fs_local.mkdirs('virtual_zarr/kerchunk', exist_ok=True)
fs_local.mkdirs('virtual_zarr/virtualizarr', exist_ok=True)

# List the NetCDF files to combine, dropping any 2019/2020 files
# (the 2019 data is held back so we can demonstrate appending to the store later)
file_glob = fs.glob('s3://mdmf/gdp/netcdf/gridmet/gridmet/*198*.nc')
file_glob = [file for file in file_glob if (('2020' not in file) and ('2019' not in file))]

Now, we are ready to generate the virtual Zarr stores. For both Kerchunk and VirtualiZarr (for now), this consists of two steps:

  1. Convert single original data files into individual virtual Zarr stores,

  2. Combine the individual virtual Zarr stores into a single combined virtual Zarr store.

We will show these two steps separately and how they are done with each package.

Generate Individual Virtual Zarr Stores

Kerchunk

To generate the individual virtual Zarr stores with Kerchunk, we will use kerchunk.hdf.SingleHdf5ToZarr, which translates the content of one HDF5 file into Zarr metadata. Other translators exist in Kerchunk that can convert GeoTIFFs and NetCDF3 files. However, as we are looking at NetCDF4 files (which are HDF5 files under the hood), we will use the HDF5 translator. As this only translates one file at a time, we can make a collection of dask.delayed objects that wrap the SingleHdf5ToZarr call to run it for all files in parallel.

# Make a function to run in parallel with dask
@dask.delayed
def generate_single_virtual_zarr(file):
    with fs.open(file) as hdf:
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(hdf, file, inline_threshold=0)
        return h5chunks.translate()

# Time the duration for later comparison
t0 = time.time()

# Generate Dask Delayed objects
tasks = [generate_single_virtual_zarr(file) for file in file_glob]
# Compute the delayed object
single_virtual_zarrs = dask.compute(*tasks)

kerchunk_time = time.time() - t0

single_virtual_zarrs[0]

Notice that the output of virtualizing a single NetCDF file is a JSON-style dictionary, where the coordinate data is kept directly in the dictionary, while each data chunk is represented by a pointer to the original file and a byte range.
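
For example, assuming the standard Kerchunk version-1 layout (a top-level dictionary with "version" and "refs" keys), we can peek at a few of the reference entries:

# Peek at a few reference keys from the first virtualized file
# (assumes the standard Kerchunk v1 layout: {"version": 1, "refs": {...}})
refs = single_virtual_zarrs[0]['refs']
print(list(refs.keys())[:10])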

VirtualiZarr

To generate the individual virtual Zarr stores with VirtualiZarr, we will use virtualizarr.open_virtual_dataset, which can infer the type of file we are reading rather than requiring us to specify it. Like Kerchunk, this only translates one file at a time. So, we can make a collection of dask.delayed objects that wrap open_virtual_dataset to run it for all files in parallel.

t0 = time.time()

tasks = [
    dask.delayed(open_virtual_dataset)(
        f's3://{file}',
        indexes={},
        loadable_variables=['day', 'lat', 'lon', 'crs'],
        decode_times=True,
        reader_options=reader_options
    )
    for file in file_glob
]

virtual_datasets = dask.compute(*tasks)

virtualizarr_time = time.time() - t0

virtual_datasets[0]

Notice that the output for a virtualization of a single NetCDF is now an xarray.Dataset, where the data is a ManifestArray object. This ManifestArray contains ChunkManifest objects that hold the same information as the Kerchunk JSON format (i.e., a file pointer and the byte range for each chunk), but allow it to be cleanly wrapped by xarray.
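
As a rough sketch of how to inspect this (the attribute names below follow the VirtualiZarr documentation at the time of writing and may differ between versions), we can look at the chunk manifest of one of the data variables:

# Inspect the chunk manifest of the first data variable
# (attribute names may vary between VirtualiZarr versions)
var = next(iter(virtual_datasets[0].data_vars))
manifest_array = virtual_datasets[0][var].data
print(type(manifest_array))
# Each manifest entry maps a chunk key to a file path, byte offset, and length
print(list(manifest_array.manifest.dict().items())[:2])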

Combine Individual Virtual Zarr Stores

Kerchunk

To combine the individual virtual Zarr stores into one virtual Zarr store with Kerchunk, we will use kerchunk.combine.MultiZarrToZarr, which combines the content of multiple virtual Zarr stores into a single virtual Zarr store. This call requires feeding MultiZarrToZarr the remote access information we used for our file system, along with the dimension along which to concatenate.

t0 = time.time()

mzz = kerchunk.combine.MultiZarrToZarr(
    single_virtual_zarrs,
    remote_protocol='s3',
    remote_options=reader_options['storage_options'],
    concat_dims=["day"]
)

out = mzz.translate()

# Save the virtual Zarr store, serialized as json
with fs_local.open('virtual_zarr/kerchunk/gridmet.json', 'wb') as f:
    f.write(ujson.dumps(out).encode())

kerchunk_time += time.time() - t0

out

Again, notice that the output is in the JSON format, with the coordinates in the dictionary and the data chunks as pointers, but this time all of the chunks are in a single dictionary.
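
As an aside, the in-memory reference dictionary can also be opened directly through fsspec's "reference" filesystem, without writing it to disk first. A minimal sketch, reusing the same storage options as above:

# Optional: open the in-memory reference dictionary directly via fsspec's
# "reference" filesystem, without first serializing it to JSON
fs_ref = fsspec.filesystem(
    'reference',
    fo=out,
    remote_protocol='s3',
    remote_options=reader_options['storage_options']
)
ds_check = xr.open_dataset(
    fs_ref.get_mapper(''),
    engine='zarr',
    consolidated=False,
    chunks={}
)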

VirtualiZarr

To combine the virtual datasets from VirtualiZarr, we can simply use xarray.combine_by_coords, which will auto-magically combine the virtual datasets for us.

t0 = time.time()

virtual_ds = xr.combine_by_coords(virtual_datasets, coords='minimal', compat='override', combine_attrs='override')

# Save the virtual Zarr store, serialized as json
virtual_ds.virtualize.to_kerchunk('virtual_zarr/virtualizarr/gridmet.json', format='json')

virtualizarr_time += time.time() - t0

virtual_ds

Notice that when saving the virtual dataset, we converted it to the Kerchunk JSON format.

Opening the Virtual Zarr Stores

To open the virtual Zarr stores, we can use the same method for both, since we converted the VirtualiZarr store to the Kerchunk format when saving it.

Kerchunk

t0 = time.time()

ds = xr.open_dataset(
    'virtual_zarr/kerchunk/gridmet.json',
    chunks={},
    engine="kerchunk",
    backend_kwargs={
        "storage_options": {
            "remote_protocol": "s3",
            "remote_options": reader_options['storage_options']
        },
    }
)

kerchunk_read_time = time.time() - t0

ds

VirtualiZarr

t0 = time.time()

ds = xr.open_dataset(
    'virtual_zarr/virtualizarr/gridmet.json',
    chunks={},
    engine="kerchunk",
    backend_kwargs={
        "storage_options": {
            "remote_protocol": "s3",
            "remote_options": reader_options['storage_options']
        },
    }
)

virtualizarr_read_time = time.time() - t0

ds

Reading with xarray.open_mfdataset

As a comparison of read times, let’s also assemble the dataset using xarray.open_mfdataset in parallel with Dask. This way, we can see how much time we save in the future by having the precomputed virtual Zarr store for faster reads.

%%time 
t0 = time.time()

ds = xr.open_mfdataset(
    [fs.open(file) for file in file_glob],
    chunks={},
    parallel=True,
    engine='h5netcdf'
)

open_mfdataset_time = time.time() - t0

ds

Now, let’s compare the computational times!

print("Kerchunk virtual Zarr creation time: "
      f"{kerchunk_time:.0f}s ({kerchunk_time/60:.1f} min)")
print("VirtualiZarr virtual Zarr creation time: "
      f"{virtualizarr_time:.0f}s ({virtualizarr_time/60:.1f} min)")
print("open_mfdataset dataset creation time: "
      f"{open_mfdataset_time:.0f}s ({open_mfdataset_time/60:.1f} min)")
print(f"Time ratio: Kerchunk to open_mfdataset = {kerchunk_time/open_mfdataset_time:.2f}\n"
      f"            VirtualiZarr to open_mfdataset = {virtualizarr_time/open_mfdataset_time:.2f}\n"
      f"            Kerchunk to VirtualiZarr = {kerchunk_time/virtualizarr_time:.2f}")

As we can see, Kerchunk and VirtualiZarr take about the same amount of time to create a virtual Zarr store. open_mfdataset can create the Dataset on the fly in slightly less time than it takes to create the virtual Zarr stores (about 30 seconds less, in this case). However, if you use open_mfdataset, you will have to recreate the dataset every time you run your code, whereas Kerchunk and VirtualiZarr create the dataset once and then read it much faster on future uses. Looking at read speed after the virtual Zarr store creation:

print("Kerchunk virtual Zarr read time: "
      f"{kerchunk_read_time:.2f}s")
print("VirtualiZarr virtual Zarr read time: "
      f"{virtualizarr_read_time:.2f}s")
print("open_mfdataset dataset read/creation time: "
      f"{open_mfdataset_time:.0f}s ({open_mfdataset_time/60:.1f} min)")
print(f"Time ratio: Kerchunk to open_mfdataset = {kerchunk_read_time/open_mfdataset_time:.3f}\n"
      f"            VirtualiZarr to open_mfdataset = {virtualizarr_read_time/open_mfdataset_time:.3f}\n"
      f"            Kerchunk to VirtualiZarr = {kerchunk_read_time/virtualizarr_read_time:.3f}")

From this, it is very clear that performing more than one read using either the Kerchunk or VirtualiZarr virtual Zarr store is more efficient than reading with open_mfdataset. Additionally, the difference in read times between Kerchunk and VirtualiZarr, while it may appear drastic, is unlikely to be significant in any workflow.

Appending to Existing Virtual Zarr Store

As noted when introducing the gridMET data, we did not utilize the 2019 data so that we could show how to append it to a virtual Zarr store. The ability to append more data to a virtual Zarr store is highly convenient, as plenty of datasets are continuously updated as new data becomes available. So, let’s append some data to the virtual Zarr stores we just made.

First, we create the 2019 file glob.

file_glob_2019 = fs.glob('s3://mdmf/gdp/netcdf/gridmet/gridmet/*_2019.nc')

Create New Virtual Zarr for New File

Next, we need to get our 2019 NetCDFs into a virtual Zarr store.

Kerchunk

We will do this for Kerchunk the same way we did before, by using kerchunk.hdf.SingleHdf5ToZarr, which translates the content of one HDF5 (NetCDF4) file into Zarr metadata.

tasks = [generate_single_virtual_zarr(file) for file in file_glob_2019]
single_virtual_zarrs_2019 = dask.compute(*tasks)

VirtualiZarr

And for VirtualiZarr, we will use virtualizarr.open_virtual_dataset.

tasks = [
    dask.delayed(open_virtual_dataset)(
        f's3://{file}',
        indexes={},
        loadable_variables=['day', 'lat', 'lon', 'crs'],
        decode_times=True,
        reader_options=reader_options
    )
    for file in file_glob_2019
]

virtual_datasets_2019 = dask.compute(*tasks)

Append to Existing Store

Now, we can append the virtualized NetCDFs to our existing stores.

Kerchunk

For Kerchunk, we will still use kerchunk.combine.MultiZarrToZarr. However, this time we will need to use the append method to add our new data.

# Append to the existing reference file
mzz = kerchunk.combine.MultiZarrToZarr.append(
    single_virtual_zarrs_2019,
    original_refs=out,
    concat_dims=["day"],
    remote_protocol='s3',
    remote_options=reader_options['storage_options'],
)

out_2019 = mzz.translate()

# Save the virtual Zarr store, serialized as json
with fs_local.open('virtual_zarr/kerchunk/gridmet_appended.json', 'wb') as f:
    f.write(ujson.dumps(out_2019).encode())

VirtualiZarr

For VirtualiZarr, we can just use xarray.concat and xarray.merge like we would to combine any xarray.Dataset.

virtual_ds_2019 = xr.merge(virtual_datasets_2019, compat='override', combine_attrs='override')
virtual_ds = xr.concat([virtual_ds, virtual_ds_2019], dim='day', coords='minimal', compat='override', combine_attrs='override')
virtual_ds

This simple use of xarray.merge and xarray.concat is the major advantage of VirtualiZarr. Rather than having to learn Kerchunk’s syntax and commands, we can keep using xarray as we already do. Therefore, any increase in the time it takes to create the virtual Zarr store compared to Kerchunk is likely worth it due to its native compatibility with xarray.
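
If you also want to keep the appended VirtualiZarr store, it can be serialized to the Kerchunk JSON format the same way as before (the output file name here is just a suggestion, mirroring the Kerchunk one above):

# Save the appended virtual Zarr store, serialized as json
# (file name is just a suggestion, mirroring the Kerchunk output above)
virtual_ds.virtualize.to_kerchunk('virtual_zarr/virtualizarr/gridmet_appended.json', format='json')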

Double Check New Stores

Finally, let’s read in the appended stores to make sure that we correctly appended the 2019 data.

ds = xr.open_dataset(
    'virtual_zarr/kerchunk/gridmet_appended.json',
    engine="kerchunk",
    chunks={},
    backend_kwargs={
        "storage_options": {
            "remote_protocol": "s3",
            "remote_options": reader_options['storage_options']
        },
    }
)
ds

Nice! The 2019 data is now appended and showing on the day coordinate.
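
As a quick sanity check (a minimal sketch; the exact timestamp formatting depends on how the times were decoded), the last value of the day coordinate should now fall in 2019:

# Quick sanity check: the final time step should now fall in 2019
print(ds.day.values[-1])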

Clean Up

Rather than deleting the virtual Zarr stores that we created, we will actually keep them for use in future tutorials. However, we do want to conform with best practices and close our Dask client and cluster.

client.close()
cluster.close()