Explore CONUS404 Dataset#

This dataset was created by extracting specified variables from a collection of wrf2d model output files, rechunking the data to better support extraction for a variety of use cases, and adding CF conventions to allow easier analysis, visualization, and data extraction with Xarray and HoloViz.

import os
os.environ['USE_PYGEOS'] = '0'  # tell geopandas-based dependencies to use Shapely 2.0 rather than the deprecated PyGEOS backend

import fsspec
import xarray as xr
import hvplot.xarray
import intake
import metpy
import cartopy.crs as ccrs

1) Select the Dataset from HyTEST’s Intake Catalog#

# open the hytest data intake catalog
hytest_cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")
list(hytest_cat)
['conus404-catalog',
 'benchmarks-catalog',
 'conus404-drb-eval-tutorial-catalog',
 'nhm-v1.0-daymet-catalog',
 'nhm-v1.1-c404-bc-catalog',
 'nhm-v1.1-gridmet-catalog',
 'trends-and-drivers-catalog',
 'nhm-prms-v1.1-gridmet-format-testing-catalog',
 'nwis-streamflow-usgs-gages-onprem',
 'nwis-streamflow-usgs-gages-osn',
 'nwm21-streamflow-usgs-gages-onprem',
 'nwm21-streamflow-usgs-gages-osn',
 'nwm21-streamflow-cloud',
 'geofabric_v1_1-zip-osn',
 'geofabric_v1_1_POIs_v1_1-osn',
 'geofabric_v1_1_TBtoGFv1_POIs-osn',
 'geofabric_v1_1_nhru_v1_1-osn',
 'geofabric_v1_1_nhru_v1_1_simp-osn',
 'geofabric_v1_1_nsegment_v1_1-osn',
 'gages2_nndar-osn',
 'wbd-zip-osn',
 'huc12-geoparquet-osn',
 'huc12-gpkg-osn',
 'nwm21-scores',
 'lcmap-cloud',
 'rechunking-tutorial-osn',
 'pointsample-tutorial-sites-osn',
 'pointsample-tutorial-output-osn']
# open the conus404 sub-catalog
cat = hytest_cat['conus404-catalog']
list(cat)
['conus404-hourly-onprem-hw',
 'conus404-hourly-cloud',
 'conus404-hourly-osn',
 'conus404-daily-diagnostic-onprem-hw',
 'conus404-daily-diagnostic-cloud',
 'conus404-daily-diagnostic-osn',
 'conus404-daily-onprem-hw',
 'conus404-daily-cloud',
 'conus404-daily-osn',
 'conus404-monthly-onprem-hw',
 'conus404-monthly-cloud',
 'conus404-monthly-osn',
 'conus404-hourly-ba-onprem-hw',
 'conus404-hourly-ba-osn',
 'conus404-daily-ba-onprem',
 'conus404-daily-ba-osn',
 'conus404-pgw-hourly-onprem-hw',
 'conus404-pgw-hourly-osn',
 'conus404-pgw-daily-diagnostic-onprem-hw',
 'conus404-pgw-daily-diagnostic-osn']

Select a dataset: If you are unsure which copy of a particular dataset to use (e.g., one of the conus404-hourly-* options), please review the HyTEST JupyterBook.
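You can also compare the options directly from the sub-catalog by printing each entry's description (a minimal sketch; it assumes each intake entry exposes a description attribute):

# print a short description of each CONUS404 catalog entry to help choose one
for name in cat:
    print(f"{name}:\n  {cat[name].description}\n")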

## Select the dataset you want to read into your notebook and preview its metadata
dataset = 'conus404-hourly-osn' 
cat[dataset]
conus404-hourly-osn:
  args:
    consolidated: true
    storage_options:
      anon: true
      client_kwargs:
        endpoint_url: https://usgs.osn.mghpcc.org/
      requester_pays: false
    urlpath: s3://hytest/conus404/conus404_hourly.zarr
  description: "CONUS404 Hydro Variable subset, hourly values. These files were created\
    \ wrfout model output files (see ScienceBase data release for more details: https://doi.org/10.5066/P9PHPK4F).\
    \ This data is stored on HyTEST\u2019s Open Storage Network (OSN) pod. This data\
    \ can be read with the S3 API and is free to work with in any computing environment\
    \ (there are no egress fees)."
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/subcatalogs

2) Set Up AWS Credentials (Optional)#

This notebook reads data from the OSN pod. The OSN pod is an object store on a high-speed internet connection, with free access from any computing environment. If you change this notebook to use one of the CONUS404 datasets stored on S3 (options ending in -cloud), you will be pulling data from a requester-pays S3 bucket. This means you have to set up your AWS credentials before you are able to load the data. Please note that reading the -cloud data from S3 may incur charges if you are reading data outside of AWS’s us-west-2 region or running the notebook outside of the cloud altogether. If you would like to access one of the -cloud options, uncomment and run the following code snippet to set up your AWS credentials. You can find more info about this AWS helper function here.

# uncomment the lines below to read in your AWS credentials if you want to access data from a requester-pays bucket (-cloud)
# os.environ['AWS_PROFILE'] = 'default'
# %run ../environment_set_up/Help_AWS_Credentials.ipynb
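The load() calls later in this notebook are dask-aware, and the final "Stop cluster" section assumes you may have started a dask cluster. If you want the reads to run in parallel, you can optionally start one here (a minimal sketch using dask.distributed's LocalCluster; adapt the worker settings to your own computing environment).

# uncomment the lines below to start a local dask cluster so that .load() calls run in parallel
# from dask.distributed import LocalCluster, Client
# cluster = LocalCluster()
# client = Client(cluster)
# client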

4) Explore the dataset#

# read in the dataset and use metpy to parse the crs information on the dataset
print(f"Reading {dataset} metadata...", end='')
ds = cat[dataset].to_dask().metpy.parse_cf()
ds
Reading conus404-hourly-osn metadata...
/home/conda/global/16102bfe-1731002172-4-pangeo/lib/python3.11/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  'dims': dict(self._ds.dims),
<xarray.Dataset> Size: 222TB
Dimensions:         (y: 1015, x: 1367, time: 376945, bottom_top_stag: 51,
                     bottom_top: 50, soil_layers_stag: 4, x_stag: 1368,
                     y_stag: 1016, snow_layers_stag: 3, snso_layers_stag: 7)
Coordinates:
    lat             (y, x) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
    lon             (y, x) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
  * time            (time) datetime64[ns] 3MB 1979-10-01 ... 2022-10-01
  * x               (x) float64 11kB -2.732e+06 -2.728e+06 ... 2.732e+06
  * y               (y) float64 8kB -2.028e+06 -2.024e+06 ... 2.028e+06
    metpy_crs       object 8B Projection: lambert_conformal_conic
    lat_u           (y, x_stag) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
    lon_u           (y, x_stag) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
    lat_v           (y_stag, x) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
    lon_v           (y_stag, x) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
Dimensions without coordinates: bottom_top_stag, bottom_top, soil_layers_stag,
                                x_stag, y_stag, snow_layers_stag,
                                snso_layers_stag
Data variables: (12/153)
    ACDEWC          (time, y, x) float32 2TB dask.array<chunksize=(144, 175, 175), meta=np.ndarray>
    ACDRIPR         (time, y, x) float32 2TB dask.array<chunksize=(144, 175, 175), meta=np.ndarray>
    ACDRIPS         (time, y, x) float32 2TB dask.array<chunksize=(144, 175, 175), meta=np.ndarray>
    ACECAN          (time, y, x) float32 2TB dask.array<chunksize=(144, 175, 175), meta=np.ndarray>
    ACEDIR          (time, y, x) float32 2TB dask.array<chunksize=(144, 175, 175), meta=np.ndarray>
    ACETLSM         (time, y, x) float32 2TB dask.array<chunksize=(144, 175, 175), meta=np.ndarray>
    ...              ...
    ZNU             (bottom_top) float32 200B dask.array<chunksize=(50,), meta=np.ndarray>
    ZNW             (bottom_top_stag) float32 204B dask.array<chunksize=(51,), meta=np.ndarray>
    ZS              (soil_layers_stag) float32 16B dask.array<chunksize=(4,), meta=np.ndarray>
    ZSNSO           (time, snso_layers_stag, y, x) float32 15TB dask.array<chunksize=(144, 7, 175, 175), meta=np.ndarray>
    ZWT             (time, y, x) float32 2TB dask.array<chunksize=(144, 175, 175), meta=np.ndarray>
    crs             int64 8B ...
Attributes: (12/148)
    AER_ANGEXP_OPT:                  1
    AER_ANGEXP_VAL:                  1.2999999523162842
    AER_AOD550_OPT:                  1
    AER_AOD550_VAL:                  0.11999999731779099
    AER_ASY_OPT:                     1
    AER_ASY_VAL:                     0.8999999761581421
    ...                              ...
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1
    W_DAMPING:                       1
    YSU_TOPDOWN_PBLMIX:              0
    history:                         Tue Mar 29 16:35:22 2022: ncrcat -A -vW ...
    history_of_appended_files:       Tue Mar 29 16:35:22 2022: Appended file ...
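The repr above shows that the full dataset is on the order of hundreds of terabytes, so subsetting works best when selections align with the chunk layout. One way to check a variable's chunking programmatically (a sketch; chunksize is a property of the underlying dask array):

# inspect the (time, y, x) chunk sizes of one variable; reads are most
# efficient when selections align with these chunks
print(ds.SNOW.data.chunksize)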
# Examine the grid data structure for SNOW: 
ds.SNOW
<xarray.DataArray 'SNOW' (time: 376945, y: 1015, x: 1367)> Size: 2TB
dask.array<open_dataset-SNOW, shape=(376945, 1015, 1367), dtype=float32, chunksize=(144, 175, 175), chunktype=numpy.ndarray>
Coordinates:
    lat        (y, x) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
    lon        (y, x) float32 6MB dask.array<chunksize=(175, 175), meta=np.ndarray>
  * time       (time) datetime64[ns] 3MB 1979-10-01 ... 2022-10-01
  * x          (x) float64 11kB -2.732e+06 -2.728e+06 ... 2.728e+06 2.732e+06
  * y          (y) float64 8kB -2.028e+06 -2.024e+06 ... 2.024e+06 2.028e+06
    metpy_crs  object 8B Projection: lambert_conformal_conic
Attributes:
    description:   SNOW WATER EQUIVALENT
    grid_mapping:  crs
    long_name:     Snow water equivalent
    units:         kg m-2

This dataset is organized along three dimensions (x, y, and time), and we have used the metpy package to parse the CRS information into the metpy_crs coordinate:

crs = ds['SNOW'].metpy.cartopy_crs
crs
<cartopy.crs.LambertConformal object at 0x7f19cf57a350>
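If you want to see the projection parameters themselves, the cartopy CRS object can report them (a sketch; proj4_params is a standard cartopy CRS property):

# print the Lambert Conformal Conic projection parameters parsed from the dataset
print(crs.proj4_params)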

Example A: Load the entire spatial domain for a variable at a specific time step#

%%time
da = ds.SNOW_ACC_NC.sel(time='2009-12-24 00:00').load()
### NOTE: the `load()` is dask-aware, so it will operate in parallel if
### a dask cluster has been started.
CPU times: user 3 s, sys: 356 ms, total: 3.36 s
Wall time: 6.88 s
da.hvplot.quadmesh(x='lon', y='lat', rasterize=True, geo=True, tiles='OSM', cmap='viridis').opts('Image', alpha=0.5)
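Because the CRS has already been parsed, you can also plot the same snapshot in the dataset's native projected coordinates rather than lat/lon (a sketch; it assumes hvplot's geographic support accepts the parsed cartopy CRS via its crs argument):

# plot on the native Lambert Conformal Conic grid using the parsed CRS
da.hvplot.quadmesh(x='x', y='y', crs=crs, rasterize=True, cmap='viridis')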

Example B: Load a time series for a variable at a specific grid cell for a specified time range#

We will identify a point that we want to pull data for using lat/lon coordinates.

The CONUS404 data is in a Lambert Conformal Conic projection, so we need to transform our lat/lon point into the projection's (x, y) coordinates using the CRS we parsed earlier.

lat,lon = 39.978322,-105.2772194    
x, y = crs.transform_point(lon, lat, src_crs=ccrs.PlateCarree())   
print(x, y)  # these values are in the projection's (x, y) coordinates
-618215.7570892773 121899.89692719541
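If you need to look up several sites at once, cartopy can transform arrays of coordinates in a single call (a sketch using transform_points; the second site below is a hypothetical example point):

import numpy as np

# transform multiple lat/lon points to projected (x, y) coordinates in one call
lons = np.array([-105.2772194, -96.0])   # second longitude is a hypothetical example
lats = np.array([39.978322, 41.0])       # second latitude is a hypothetical example
pts = crs.transform_points(ccrs.PlateCarree(), lons, lats)
xs, ys = pts[:, 0], pts[:, 1]
print(xs, ys)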
%%time
# pull out a particular time slice at the specified coordinates
da = ds.PREC_ACC_NC.sel(x=x, y=y, method='nearest').sel(time=slice('2013-01-01 00:00','2013-12-31 00:00')).load()
CPU times: user 1.45 s, sys: 211 ms, total: 1.66 s
Wall time: 3.59 s
# plot your time series
da.hvplot(x='time', grid=True)
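If PREC_ACC_NC represents precipitation accumulated over each hourly time step (check the variable's description and units attributes to confirm), you can aggregate the loaded series to daily totals with xarray's resample (a sketch):

# aggregate the hourly series to daily totals and plot
daily = da.resample(time='1D').sum()
daily.hvplot(x='time', grid=True)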

Stop cluster#

If you started a dask cluster, uncomment the line below to shut it down.

#client.close(); cluster.shutdown()