Adding Coordinate Reference Systems (CRS) to Zarr Datasets

Adding Coordinate Reference Systems (CRS) to Zarr Datasets#

The goal of this notebook is to learn how to add and verify the addition of coordinate reference systems (CRS) to Zarr datasets. We will be utilizing the Climate and Forecast (CF) Metadata Conventions for our CRS format and to determine where we include the CRS in the Zarr store. While these conventions were originally designed for NetCDF files, Zarr has become the cloud optimized alternative to NetCDF, and it retains the same general metadata and data representation standards. Therefore, we can easily apply the CF conventions to promote the standardization and interoperability of climate data even within Zarr stores.

import fsspec
import ujson
import xarray as xr

Example Dataset#

In this notebook, we will use the virtual Zarr store of the daily gridMET dataset that we created in the notebook on Generating a Virtual Zarr Store. Note that this means you will need to run that notebook to run this one. However, the contents of this example can easily be used for any Zarr store, virtual or not. The only difference will be how the data is read in.

On that note, let’s read in the data using xarray.open_dataset and Kerchunk as the engine (i.e., engine='kerchunk').

ds = xr.open_dataset(
    'virtual_zarr/kerchunk/gridmet.json',
    engine='kerchunk',
    chunks={},
    backend_kwargs={
        'storage_options': {
            'remote_protocol': 's3',
            'remote_options': {
                'anon': True,
                'endpoint_url': 'https://usgs.osn.mghpcc.org/'}
        },
    }
)
ds
<xarray.Dataset> Size: 166GB
Dimensions:                                    (day: 3653, lat: 585, lon: 1386,
                                                crs: 1)
Coordinates:
  * crs                                        (crs) uint16 2B 3
  * day                                        (day) datetime64[ns] 29kB 1980...
  * lat                                        (lat) float64 5kB 49.4 ... 25.07
  * lon                                        (lon) float64 11kB -124.8 ... ...
Data variables:
    air_temperature                            (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    precipitation_amount                       (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    relative_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    specific_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    surface_downwelling_shortwave_flux_in_air  (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_from_direction                        (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_speed                                 (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
Attributes: (12/19)
    Conventions:                CF-1.6
    author:                     John Abatzoglou - University of Idaho, jabatz...
    coordinate_system:          EPSG:4326
    date:                       02 July 2019
    geospatial_bounds:          POLYGON((-124.7666666333333 49.40000000000000...
    geospatial_bounds_crs:      EPSG:4326
    ...                         ...
    geospatial_lon_units:       decimal_degrees east
    note1:                      The projection information for this file is: ...
    note2:                      Citation: Abatzoglou, J.T., 2013, Development...
    note3:                      Data in slices after last_permanent_slice (1-...
    note4:                      Data in slices after last_provisional_slice (...
    note5:                      Days correspond approximately to calendar day...

Identify the CRS#

A “coordinate reference system” (CRS) is a framework used to precisely measure locations on the surface of Earth as coordinates (wikipedia.org, 2024). To be CF-Compliant, a dataset must contain a “grid mapping” variable that is used to explicitly declare the CRS for any variables with spatial dimensions. Specifically, the grid mapping variable provides the description of the CRS via a collection of attached attributes. Common grid mapping variable names include:

  • crs,

  • polar_stereographic,

  • albers_conical_equal_area,

  • rcm_map.

Spatial variables that need the CRS to define their positions are then linked to the grid mapping variable by having a grid_mapping attribute, which is set to the name of the grid mapping variable.

To locate the grid mapping variable, the first thing is to check for a variable (or coordinate) with one of the common grid mapping variable names. If one exists, it can be confirmed to be the grid mapping by checking a spatial variable for the grid_mapping attribute and making sure it points to the grid mapping variable name. If so, you are good to go. If no obvious grid mapping variable is present, you can still check a spatial variable for its grid_mapping attribute to see what grid mapping variable it points to. If there is a grid_mapping attribute, but no variable present with the name it points to, or if there is no grid_mapping attribute at all, you will need to create the CRS given what you know about the data. However, creating a CRS is beyond this notebook and we refer readers to this notebook by Kieran Bartels and Sarah Jordan on how to add a missing CRS.

Okay, now that we know how to identify the CRS, let’s look at our data again and check for it.

ds
<xarray.Dataset> Size: 166GB
Dimensions:                                    (day: 3653, lat: 585, lon: 1386,
                                                crs: 1)
Coordinates:
  * crs                                        (crs) uint16 2B 3
  * day                                        (day) datetime64[ns] 29kB 1980...
  * lat                                        (lat) float64 5kB 49.4 ... 25.07
  * lon                                        (lon) float64 11kB -124.8 ... ...
Data variables:
    air_temperature                            (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    precipitation_amount                       (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    relative_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    specific_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    surface_downwelling_shortwave_flux_in_air  (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_from_direction                        (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_speed                                 (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
Attributes: (12/19)
    Conventions:                CF-1.6
    author:                     John Abatzoglou - University of Idaho, jabatz...
    coordinate_system:          EPSG:4326
    date:                       02 July 2019
    geospatial_bounds:          POLYGON((-124.7666666333333 49.40000000000000...
    geospatial_bounds_crs:      EPSG:4326
    ...                         ...
    geospatial_lon_units:       decimal_degrees east
    note1:                      The projection information for this file is: ...
    note2:                      Citation: Abatzoglou, J.T., 2013, Development...
    note3:                      Data in slices after last_permanent_slice (1-...
    note4:                      Data in slices after last_provisional_slice (...
    note5:                      Days correspond approximately to calendar day...

Right away, we can see that the dataset has a crs coordinate, and a check if a spatial variable, say precipitation_amount, shows a grid_mapping attribute of crs. So, this dataset already has the CRS info, and it is stored in the crs coordinate with a dimension of crs.

Making the CRS CF compliant#

While we have the CRS info, to be CF compliant, the CRS needs to be a dimensionless data variable. Since it is currently a coordinate with a dimension, we can convert it to a variable by squeezing (squeeze) out the crs dimension and reseting the crs coordinate (reset_coords).

ds.squeeze().reset_coords('crs')
<xarray.Dataset> Size: 166GB
Dimensions:                                    (day: 3653, lat: 585, lon: 1386)
Coordinates:
  * day                                        (day) datetime64[ns] 29kB 1980...
  * lat                                        (lat) float64 5kB 49.4 ... 25.07
  * lon                                        (lon) float64 11kB -124.8 ... ...
Data variables:
    air_temperature                            (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    crs                                        uint16 2B 3
    precipitation_amount                       (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    relative_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    specific_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    surface_downwelling_shortwave_flux_in_air  (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_from_direction                        (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_speed                                 (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
Attributes: (12/19)
    Conventions:                CF-1.6
    author:                     John Abatzoglou - University of Idaho, jabatz...
    coordinate_system:          EPSG:4326
    date:                       02 July 2019
    geospatial_bounds:          POLYGON((-124.7666666333333 49.40000000000000...
    geospatial_bounds_crs:      EPSG:4326
    ...                         ...
    geospatial_lon_units:       decimal_degrees east
    note1:                      The projection information for this file is: ...
    note2:                      Citation: Abatzoglou, J.T., 2013, Development...
    note3:                      Data in slices after last_permanent_slice (1-...
    note4:                      Data in slices after last_provisional_slice (...
    note5:                      Days correspond approximately to calendar day...

Nice, that did it! However, this actually gets us a CF compliant CRS for the in-memory dataset, and it does not actually adjust the Zarr store to be CF compliant. As we don’t want to rewrite the whole Zarr store, we can fix this by directly adjusting some of the keywords in the json serialized Zarr store.

Note: The following code is for our virtual Zarr store, which is a single json file with nested keys that make the Zarr store. See below for how to adjust the CRS in a regular, non-virtual Zarr store.

fs_local = fsspec.filesystem('')

virtual_zarr_store = ujson.load(fs_local.open('virtual_zarr/kerchunk/gridmet.json'))

zattrs = ujson.loads(virtual_zarr_store['refs']['crs/.zattrs'])
zattrs['_ARRAY_DIMENSIONS'] = []
virtual_zarr_store['refs']['crs/.zattrs'] = ujson.dumps(zattrs)

zarray = ujson.loads(virtual_zarr_store['refs']['crs/.zarray'])
zarray['chunks'] = []
zarray['shape'] = []
virtual_zarr_store['refs']['crs/.zarray'] = ujson.dumps(zarray)

with fs_local.open('virtual_zarr/kerchunk/gridmet_cf_crs.json', 'wb') as f:
    f.write(ujson.dumps(virtual_zarr_store).encode())

Now that we have updated our virtual Zarr store, let’s read it in to make sure everything looks right.

ds = xr.open_dataset(
    'virtual_zarr/kerchunk/gridmet_cf_crs.json',
    engine='kerchunk',
    chunks={},
    backend_kwargs={
        'storage_options': {
            'remote_protocol': 's3',
            'remote_options': {
                'anon': True,
                'endpoint_url': 'https://usgs.osn.mghpcc.org/'}
        },
    }
)
ds
<xarray.Dataset> Size: 166GB
Dimensions:                                    (day: 3653, lat: 585, lon: 1386)
Coordinates:
  * day                                        (day) datetime64[ns] 29kB 1980...
  * lat                                        (lat) float64 5kB 49.4 ... 25.07
  * lon                                        (lon) float64 11kB -124.8 ... ...
Data variables:
    air_temperature                            (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    crs                                        uint16 2B ...
    precipitation_amount                       (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    relative_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    specific_humidity                          (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    surface_downwelling_shortwave_flux_in_air  (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_from_direction                        (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
    wind_speed                                 (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
Attributes: (12/19)
    Conventions:                CF-1.6
    author:                     John Abatzoglou - University of Idaho, jabatz...
    coordinate_system:          EPSG:4326
    date:                       02 July 2019
    geospatial_bounds:          POLYGON((-124.7666666333333 49.40000000000000...
    geospatial_bounds_crs:      EPSG:4326
    ...                         ...
    geospatial_lon_units:       decimal_degrees east
    note1:                      The projection information for this file is: ...
    note2:                      Citation: Abatzoglou, J.T., 2013, Development...
    note3:                      Data in slices after last_permanent_slice (1-...
    note4:                      Data in slices after last_provisional_slice (...
    note5:                      Days correspond approximately to calendar day...

Look at that, easy as can be, and we now have a CF compliant virtual Zarr store!

Adding CRS to a Regular Zarr Store#

As you may want to apply this process to a regular Zarr store, we have included some code that does the same adjustments as above, but for a regular Zarr store.

zarr_store_path = 'gridment.zarr'

with fs_local.open(f'{zarr_store_path}/crs/.zattrs', 'wb') as f:
    orig_metadata = ujson.load(f)        
    orig_metadata['_ARRAY_DIMENSIONS'] = []
    f.write(ujson.dumps(orig_metadata).encode())

with fs_local.open(f'{zarr_store_path}/crs/.zarray', 'wb') as f:
    orig_metadata = ujson.load(f)        
    orig_metadata['shape'] = []
    orig_metadata['chunks'] = []
    f.write(ujson.dumps(orig_metadata).encode())