Create a rectilinear grid (1D lon/lat coordinates) for a specific region, then extract a spatial and temporal subset of the regridded data to a netCDF file. (Extraction to netCDF can also be done on the original curvilinear grid.)
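For orientation: on a rectilinear grid each coordinate is a 1D array, while on a curvilinear grid like CONUS404's the lon/lat coordinates are 2D fields over the model's x/y dimensions. A toy illustration of the difference (a standalone sketch, not part of the workflow below; the coordinate values are made up):

import numpy as np
import xarray as xr

# Rectilinear: 1D lon and lat; every grid row/column follows a constant lat/lon
rect = xr.Dataset(coords={'lon': np.linspace(-76, -74, 5),
                          'lat': np.linspace(39, 42, 4)})

# Curvilinear: 2D lon and lat defined over x/y; rows and columns curve in lon/lat space
lon2d, lat2d = np.meshgrid(np.linspace(-76, -74, 5), np.linspace(39, 42, 4))
curv = xr.Dataset(coords={'lon': (['y', 'x'], lon2d + 0.05 * (lat2d - 39)),  # made-up skew
                          'lat': (['y', 'x'], lat2d)})
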
%%time
import os
# Needed when boto3 >= 1.36.0 or the rechunking process will fail
# This needs to be set before the boto3 library gets loaded
# See: https://github.com/aws/aws-cli/issues/9214#issuecomment-2606619168
os.environ['AWS_REQUEST_CHECKSUM_CALCULATION'] = 'when_required'
os.environ['AWS_RESPONSE_CHECKSUM_VALIDATION'] = 'when_required'
import xarray as xr
import xesmf as xe
import numpy as np
import fsspec
import hvplot.xarray
import geoviews as gv
from matplotlib import path
import pystac
from packaging.version import Version
import zarr

Open dataset from WMA STAC Catalog¶
def get_children(catalog, collection_id=None):
    """
    Retrieve a specified collection from a STAC catalog/collection and print key metadata
    for exploring/accessing the datasets contained within it.

    If no collection ID is provided, the collections in the top level of the catalog are printed.
    If a collection ID is provided, the collection with that ID is retrieved from the input catalog/collection.
    If the collection ID points to a dataset, the assets available for the dataset are printed.
    If the collection ID points to another collection, the child collections it contains are listed.

    Args:
        catalog (pystac.Catalog | pystac.Collection): The STAC catalog/collection object.
        collection_id (str): The ID of the collection or dataset to retrieve from the catalog.

    Returns:
        collection (pystac.Catalog | pystac.Collection): The collection object corresponding to the provided ID,
            or the top-level catalog if no ID is provided.
    """
    dataset = False
    if collection_id:
        collection = catalog.get_child(collection_id)
        if collection.assets:
            dataset = True
            print(f"{collection_id} is a dataset. Please review the assets below and select one to open.")
        else:
            print(f"{collection_id} is a collection. Please review the child items and select one to open in the next cell.")
    else:
        collection = catalog
    if dataset:
        # List the assets
        for asset in collection.assets:
            print(f"Asset ID: {asset}")
            print(f" Title: {collection.assets[asset].title}")
            print(f" Description: {collection.assets[asset].description}")
    else:
        collections = list(collection.get_collections())
        print(f"Number of collections: {len(collections)}")
        print("Collection IDs:")
        for child_collection in collections:
            id = child_collection.id
            cite_as = "Not available"
            for link in child_collection.links:
                if link.rel == "cite-as":
                    cite_as = link.target
            print(f"- {id}, Source: {cite_as}")
    return collection

# url for the WMA STAC Catalog
catalog_url = "https://api.water.usgs.gov/gdp/pygeoapi/stac/stac-collection/"
# use pystac to read the catalog
catalog = pystac.Catalog.from_file(catalog_url)
# list the collections in the catalog
catalog = get_children(catalog)

# select a collection from the catalog; replace the collection ID with the one you want to use:
collection = get_children(catalog, collection_id="conus404")

# select a child collection; replace the collection ID with the one you want to use:
collection = get_children(collection, collection_id="conus404_hourly")

# replace with the asset ID you want to use:
selected_asset_id = "zarr-s3-osn"
# read the asset metadata
asset = collection.assets[selected_asset_id]

2) Set Up AWS Credentials (Optional)¶
This notebook reads data from the OSN pod by default, which is object-store data on a high-speed internet connection that is free to access from any environment. If you change this notebook to use one of the CONUS404 datasets stored on S3 (options ending in -cloud), you will be pulling data from a requester-pays S3 bucket. This means you must set up your AWS credentials, or the data will not load. Please note that reading the -cloud data from S3 may incur charges if you are reading data outside of the us-west-2 region or running the notebook outside of the cloud altogether. If you would like to access one of the -cloud options, uncomment and run the following code snippet to set up your AWS credentials. You can find more info about this AWS helper function here.
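Requester-pays access can also be configured directly on the filesystem object; for example (a commented sketch assuming the s3fs package that backs fsspec's 's3' protocol; the bucket path is hypothetical):

# Sketch: requester-pays access via fsspec/s3fs (hypothetical bucket path)
# fs = fsspec.filesystem('s3', requester_pays=True, profile='default')
# fs.ls('s3://hypothetical-requester-pays-bucket/conus404/')
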
# uncomment the lines below to read in your AWS credentials if you want to access data from a requester-pays bucket (-cloud)
# os.environ['AWS_PROFILE'] = 'default'
# %run ../../environment_set_up/Help_AWS_Credentials.ipynb

Parallelize with Dask¶
Some of the steps we will take are aware of parallel clustered compute environments
using dask. We’re going to start a cluster now so that future steps can take advantage
of this ability.
This step is optional, but it speeds up data loading significantly, especially when accessing data from the cloud.
We have documentation on how to start a Dask Cluster in different computing environments here.
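If none of the helper notebooks fit your environment, a generic local cluster can also be started directly with dask.distributed (a minimal sketch, left commented out because the helper notebook below is the default path; cluster sizing is left at the defaults):

# Minimal alternative sketch: start a local Dask cluster directly
# from dask.distributed import Client, LocalCluster
# cluster = LocalCluster()
# client = Client(cluster)
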
%run ../../environment_set_up/Start_Dask_Cluster_Nebari.ipynb
## If this notebook is not being run on Nebari, replace the above
## path name with a helper appropriate to your compute environment. Examples:
# %run ../../environment_set_up/Start_Dask_Cluster_Denali.ipynb
# %run ../../environment_set_up/Start_Dask_Cluster_Tallgrass.ipynb
# %run ../../environment_set_up/Start_Dask_Cluster_Desktop.ipynb

if Version(zarr.__version__) < Version("3.0.0"):
    ds = xr.open_dataset(
        asset.href,
        storage_options=asset.extra_fields['xarray:storage_options'],
        **asset.extra_fields['xarray:open_kwargs']
    )
else:
    # zarr>=3 can read this zarr v2 store, but the format must be requested explicitly
    ds = xr.open_dataset(
        asset.href,
        storage_options=asset.extra_fields['xarray:storage_options'],
        **asset.extra_fields['xarray:open_kwargs'],
        zarr_format=2
    )

ds

nc_outfile = 'CONUS404_DRB_rectilinear.nc'
bbox = [-75.9, -74.45, 38.7, 42.55]
dx = dy = 3./111.   # ~3 km grid spacing in degrees (~111 km per degree)
vars_out = ['T2', 'SNOW']
start = '2017-04-01 00:00'
stop = '2017-05-01 00:00'

Use xESMF to regrid¶
xESMF is an xarray-enabled interface to the ESMF regridder from NCAR. ESMF has options for regridding between curvilinear, rectilinear, and unstructured grids, with conservative regridding options, and much more.
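For instance, switching from the bilinear method used below to conservative regridding changes only the method string passed to xe.Regridder, but conservative weights additionally require cell-corner ("bounds") coordinates named lon_b/lat_b on both grids. A commented sketch with hypothetical corner arrays (not part of this workflow):

# Sketch: destination grid with cell corners for conservative regridding.
# Corner arrays must have one more point than the center arrays.
lon_c = np.arange(-76.0, -74.0, 0.25)   # hypothetical cell centers
lat_c = np.arange(38.5, 43.0, 0.25)
grid_cons = xr.Dataset({'lon': (['lon'], lon_c),
                        'lat': (['lat'], lat_c),
                        'lon_b': (['lon_b'], np.arange(-76.125, -73.875, 0.25)),
                        'lat_b': (['lat_b'], np.arange(38.375, 43.125, 0.25))})
# regridder = xe.Regridder(src, grid_cons, 'conservative')  # src would also need lon_b/lat_b
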
def bbox2ij(lon, lat, bbox=[-160., -155., 18., 23.]):
    """Return indices i,j that completely cover the specified bounding box.
    i0,i1,j0,j1 = bbox2ij(lon,lat,bbox)
    lon,lat = 2D arrays that are the target of the subset
    bbox = list containing the bounding box: [lon_min, lon_max, lat_min, lat_max]

    Example
    -------
    >>> i0,i1,j0,j1 = bbox2ij(lon_rho, lat_rho, [-71, -63., 39., 46])
    >>> h_subset = nc.variables['h'][j0:j1,i0:i1]
    """
    bbox = np.array(bbox)
    mypath = np.array([bbox[[0,1,1,0]], bbox[[2,2,3,3]]]).T
    p = path.Path(mypath)
    points = np.vstack((lon.ravel(), lat.ravel())).T
    n, m = np.shape(lon)
    inside = p.contains_points(points).reshape((n, m))
    ii, jj = np.meshgrid(range(m), range(n))
    return min(ii[inside]), max(ii[inside]), min(jj[inside]), max(jj[inside])

Before we regrid to rectilinear, let's subset a region that covers our area of interest. Because lon and lat are 2D arrays, we can't simply use xarray to slice these coordinate variables. So we use the routine above, which finds the i,j locations of a specified bounding box, and then slice on those indices.
i0,i1,j0,j1 = bbox2ij(ds['lon'].values, ds['lat'].values, bbox=bbox)
print(i0,i1,j0,j1)

ds_subset = ds.isel(x=slice(i0-1,i1+1), y=slice(j0-1,j1+1))

ds_subset = ds_subset.sel(time=slice(start,stop))

ds_subset

ds_subset.nbytes/1e9

da = ds_subset.T2.sel(time='2017-04-25 00:00', method='nearest')
viz = da.hvplot.quadmesh(x='lon', y='lat', geo=True, rasterize=True, cmap='turbo')
base = gv.tile_sources.OSM
base * viz.opts(alpha=0.5)

ds_subset.nbytes/1e9

%%time
ds_subset = ds_subset.chunk({'x':-1, 'y':-1, 'time':24})   # single chunk in space, 24 hourly steps per time chunk

%%time
ds_out = xr.Dataset({'lon': (['lon'], np.arange(bbox[0], bbox[1], dx)),
                     'lat': (['lat'], np.arange(bbox[2], bbox[3], dy))})
regridder = xe.Regridder(ds_subset, ds_out, 'bilinear')
regridder

%%time
ds_out = regridder(ds_subset[vars_out])
print(ds_out)

ds_out['SNOW']

list(ds_out.variables)

list(ds_out.data_vars)

ds_out['T2'].encoding

ds_out.time

encoding = {}
for var in ds_out.variables:
    encoding[var] = dict(zlib=True, complevel=2,
                         fletcher32=False, shuffle=True,
                         _FillValue=None)

# you may need to update the filepath (nc_outfile, set above) before saving out your data.
ds_out.load().to_netcdf(nc_outfile, encoding=encoding, mode='w')

ds_nc = xr.open_dataset(nc_outfile)

ds_nc

(ds_nc['T2']-273.15).hvplot(x='lon', y='lat', geo=True,
                            rasterize=True, cmap='turbo',
                            tiles='OSM', clim=(2,15))

ds_outcl = ds_subset[vars_out]

list(ds_outcl.data_vars)

encoding = {}
for var in ds_outcl.variables:
    encoding[var] = dict(zlib=True, complevel=2,
                         fletcher32=False, shuffle=True,
                         _FillValue=None)

# you will need to update the filepath and uncomment the following line to save out your data.
# ds_outcl.load().to_netcdf('CONUS404_DRB_curvilinear.nc', encoding=encoding, mode='w')

client.close(); cluster.shutdown()