Adding Coordinate Reference Systems (CRS) to Zarr Datasets
The goal of this notebook is to learn how to add and verify the addition of coordinate reference systems (CRS) to Zarr datasets.
We will be utilizing the Climate and Forecast (CF) Metadata Conventions for our CRS format and to determine where we include the CRS in the Zarr store.
While these conventions were originally designed for NetCDF files, Zarr has become the cloud-optimized alternative to NetCDF, and it retains the same general metadata and data representation standards.
Therefore, we can easily apply the CF conventions to promote the standardization and interoperability of climate data within Zarr stores as well.
In this notebook, we will use the virtual Zarr store of the daily gridMET dataset that we created in the notebook on Generating a Virtual Zarr Store.
Note that this means you will need to run that notebook before running this one.
However, the contents of this example can easily be used for any Zarr store, virtual or not.
The only difference will be how the data is read in.
On that note, let’s read in the data using xarray.open_dataset and Kerchunk as the engine (i.e., engine='kerchunk').
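A minimal sketch of that call follows. The reference-file path is an assumption (a hypothetical name for the JSON file written by the earlier notebook); adjust it to your own output.

```python
from pathlib import Path

import xarray as xr

# Hypothetical path to the reference JSON produced by the
# "Generating a Virtual Zarr Store" notebook -- adjust to your own output.
ref_file = Path("virtual_zarr_store.json")

if ref_file.exists():
    # The kerchunk backend reads the JSON references and exposes the
    # underlying chunks as one lazy, dask-backed xarray Dataset.
    ds = xr.open_dataset(str(ref_file), engine="kerchunk", chunks={})
```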
note1: The projection information for this file is: GCS WGS 1984.
note2: Citation: Abatzoglou, J.T., 2013, Development of gridded surface meteorological data for ecological applications and modeling, International Journal of Climatology, DOI: 10.1002/joc.3413
note3: Data in slices after last_permanent_slice (1-based) are considered provisional and subject to change with subsequent updates
note4: Data in slices after last_provisional_slice (1-based) are considered early and subject to change with subsequent updates
note5: Days correspond approximately to calendar days ending at midnight, Mountain Standard Time (7 UTC the next calendar day)
A “coordinate reference system” (CRS) is a framework used to precisely measure locations on the surface of Earth as coordinates (wikipedia.org, 2024).
To be CF-compliant, a dataset must contain a “grid mapping” variable that explicitly declares the CRS for any variables with spatial dimensions.
Specifically, the grid mapping variable provides the description of the CRS via a collection of attached attributes.
Common grid mapping variable names include:
crs,
polar_stereographic,
albers_conical_equal_area,
rcm_map.
Spatial variables that need the CRS to define their positions are then linked to the grid mapping variable by having a grid_mapping attribute, which is set to the name of the grid mapping variable.
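As a concrete sketch, here is what such a pair could look like; the values are illustrative, and the attribute names (grid_mapping_name, ellipsoid parameters) follow the CF conventions for a latitude/longitude CRS on WGS 84.

```python
import xarray as xr

# A dimensionless grid mapping variable describing a WGS 84
# latitude/longitude CRS through CF attributes (values are illustrative).
crs = xr.DataArray(
    0,
    attrs={
        "grid_mapping_name": "latitude_longitude",
        "semi_major_axis": 6378137.0,
        "inverse_flattening": 298.257223563,
    },
)

# A spatial variable is linked to it via its grid_mapping attribute,
# which holds the *name* of the grid mapping variable.
temp = xr.DataArray(
    [[273.0, 274.0], [275.0, 276.0]],
    dims=("lat", "lon"),
    attrs={"grid_mapping": "crs"},
)
```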
To locate the grid mapping variable, first check for a variable (or coordinate) with one of the common grid mapping variable names.
If one exists, confirm that it is the grid mapping by checking that a spatial variable's grid_mapping attribute points to its name.
If so, you are good to go.
If no obvious grid mapping variable is present, you can still check a spatial variable for its grid_mapping attribute to see what grid mapping variable it points to.
If there is a grid_mapping attribute, but no variable present with the name it points to, or if there is no grid_mapping attribute at all, you will need to create the CRS given what you know about the data.
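The lookup described above can be sketched as follows; a tiny stand-in dataset is built here so the sketch is self-contained, but in practice you would run the two checks on your own already-open dataset.

```python
import xarray as xr

# A tiny hypothetical dataset standing in for a real one.
ds = xr.Dataset(
    {
        "precipitation_amount": (
            ("lat", "lon"),
            [[0.0, 0.0], [0.0, 0.0]],
            {"grid_mapping": "crs"},
        ),
        "crs": ((), 0, {"grid_mapping_name": "latitude_longitude"}),
    }
)

# Step 1: does a spatial variable point at a grid mapping variable?
gm_name = ds["precipitation_amount"].attrs.get("grid_mapping")

# Step 2: does a variable with that name actually exist in the dataset?
has_crs = gm_name is not None and gm_name in ds.variables
```

If step 1 yields nothing, or step 2 fails, the CRS must be created from what you know about the data.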
However, creating a CRS is beyond the scope of this notebook, and we refer readers to this notebook by Kieran Bartels and Sarah Jordan on how to add a missing CRS.
Okay, now that we know how to identify the CRS, let’s look at our data again and check for it.
ds
<xarray.Dataset> Size: 166GB
Dimensions: (day: 3653, lat: 585, lon: 1386,
crs: 1)
Coordinates:
* crs (crs) uint16 2B 3
* day (day) datetime64[ns] 29kB 1980...
* lat (lat) float64 5kB 49.4 ... 25.07
* lon (lon) float64 11kB -124.8 ... ...
Data variables:
air_temperature (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
precipitation_amount (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
relative_humidity (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
specific_humidity (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
surface_downwelling_shortwave_flux_in_air (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
wind_from_direction (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
wind_speed (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
Attributes: (12/19)
Conventions: CF-1.6
author: John Abatzoglou - University of Idaho, jabatz...
coordinate_system: EPSG:4326
date: 02 July 2019
geospatial_bounds: POLYGON((-124.7666666333333 49.40000000000000...
geospatial_bounds_crs: EPSG:4326
... ...
geospatial_lon_units: decimal_degrees east
note1: The projection information for this file is: ...
note2: Citation: Abatzoglou, J.T., 2013, Development...
note3: Data in slices after last_permanent_slice (1-...
note4: Data in slices after last_provisional_slice (...
note5: Days correspond approximately to calendar day...
Right away, we can see that the dataset has a crs coordinate, and a check of a spatial variable, say precipitation_amount, shows a grid_mapping attribute of crs.
So, this dataset already has the CRS info, and it is stored in the crs coordinate with a dimension of crs.
While we have the CRS info, to be CF compliant, the CRS needs to be a dimensionless data variable.
Since it is currently a coordinate with a dimension, we can convert it to a variable by squeezing (squeeze) out the crs dimension and resetting the crs coordinate (reset_coords).
ds.squeeze().reset_coords('crs')
<xarray.Dataset> Size: 166GB
Dimensions: (day: 3653, lat: 585, lon: 1386)
Coordinates:
* day (day) datetime64[ns] 29kB 1980...
* lat (lat) float64 5kB 49.4 ... 25.07
* lon (lon) float64 11kB -124.8 ... ...
Data variables:
air_temperature (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
crs uint16 2B 3
precipitation_amount (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
relative_humidity (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
specific_humidity (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
surface_downwelling_shortwave_flux_in_air (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
wind_from_direction (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
wind_speed (day, lat, lon) float64 24GB dask.array<chunksize=(61, 98, 231), meta=np.ndarray>
Attributes: (12/19)
Conventions: CF-1.6
author: John Abatzoglou - University of Idaho, jabatz...
coordinate_system: EPSG:4326
date: 02 July 2019
geospatial_bounds: POLYGON((-124.7666666333333 49.40000000000000...
geospatial_bounds_crs: EPSG:4326
... ...
geospatial_lon_units: decimal_degrees east
note1: The projection information for this file is: ...
note2: Citation: Abatzoglou, J.T., 2013, Development...
note3: Data in slices after last_permanent_slice (1-...
note4: Data in slices after last_provisional_slice (...
note5: Days correspond approximately to calendar day...
Nice, that did it!
However, this only makes the in-memory dataset CF compliant; it does not actually adjust the Zarr store itself.
As we don’t want to rewrite the whole Zarr store, we can fix this by directly adjusting some of the keys in the JSON-serialized Zarr store.
Note: The following code is for our virtual Zarr store, which is a single JSON file of nested keys that make up the Zarr store.
See below for how to adjust the CRS in a regular, non-virtual Zarr store.
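Since the reference file is just nested JSON, the adjustment can be sketched like this. A tiny in-memory stand-in for the reference dict is used so the sketch is self-contained; for the real store, you would json.load the file, apply the same edits, and write it back.

```python
import json

# Minimal stand-in for a kerchunk reference file: a "refs" mapping whose
# values for .zarray/.zattrs keys are JSON-encoded strings.
refs = {
    "version": 1,
    "refs": {
        "crs/.zarray": json.dumps({
            "shape": [1], "chunks": [1], "dtype": "<u2", "compressor": None,
            "fill_value": None, "filters": None, "order": "C",
            "zarr_format": 2,
        }),
        "crs/.zattrs": json.dumps({
            "_ARRAY_DIMENSIONS": ["crs"],
            "grid_mapping_name": "latitude_longitude",
        }),
    },
}

# Make `crs` a dimensionless (scalar) variable, matching the
# squeeze/reset_coords result above: empty its shape and chunks...
zarray = json.loads(refs["refs"]["crs/.zarray"])
zarray["shape"], zarray["chunks"] = [], []
refs["refs"]["crs/.zarray"] = json.dumps(zarray)

# ...and drop its dimension name from the xarray dimension attribute.
zattrs = json.loads(refs["refs"]["crs/.zattrs"])
zattrs["_ARRAY_DIMENSIONS"] = []
refs["refs"]["crs/.zattrs"] = json.dumps(zattrs)
```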
Look at that, easy as can be, and we now have a CF compliant virtual Zarr store!
As you may want to apply this process to a regular Zarr store, we have included some code that makes the same adjustments as above, but for a regular Zarr store.
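For a regular, on-disk Zarr (v2) store, each array's metadata lives in plain JSON files (crs/.zarray and crs/.zattrs), so the same edits can be applied directly to those files. The store built below is a minimal hypothetical stand-in; in practice, point `store` at your existing Zarr store directory and skip the setup.

```python
import json
import tempfile
from pathlib import Path

# Build a minimal stand-in store (replace with your real store path).
store = Path(tempfile.mkdtemp()) / "example.zarr"
(store / "crs").mkdir(parents=True)
(store / "crs" / ".zarray").write_text(json.dumps({
    "shape": [1], "chunks": [1], "dtype": "<u2", "compressor": None,
    "fill_value": None, "filters": None, "order": "C", "zarr_format": 2,
}))
(store / "crs" / ".zattrs").write_text(json.dumps({
    "_ARRAY_DIMENSIONS": ["crs"],
    "grid_mapping_name": "latitude_longitude",
}))

# The same adjustment as for the virtual store: make `crs` a scalar by
# emptying its shape, chunks, and dimension list.
zarray = json.loads((store / "crs" / ".zarray").read_text())
zarray["shape"], zarray["chunks"] = [], []
(store / "crs" / ".zarray").write_text(json.dumps(zarray))

zattrs = json.loads((store / "crs" / ".zattrs").read_text())
zattrs["_ARRAY_DIMENSIONS"] = []
(store / "crs" / ".zattrs").write_text(json.dumps(zattrs))
```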