Pangeo CONUS404 Spatial Aggregation over DRB-extent HUC12s#

In this notebook, we show how to aggregate gridded data to polygons. The method aggregates the gridded data conservatively, i.e., by exactly partitioning each grid cell across the precise region boundaries. It makes use of two key packages: xarray and geopandas. Our implementation is based on this Pangeo Discourse post, which we have updated to use more streamlined methods.

The overall approach consists of:

  • Represent both the original gridded data and target polygons as geopandas.GeoSeries objects (with vector geometries).

  • Compute their area overlay and turn it into a sparse matrix of cell weights.

  • Perform weighted aggregation using xarray.Dataset.weighted along the spatial dimensions.

It is quite fast and transparent.
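As a compact sketch of the whole workflow (the names ds, grid_polygons, regions, and region_id below are placeholders, and the rest of the notebook builds each piece step by step), the aggregation boils down to roughly:

import pandas as pd
import xarray as xr

def aggregate_to_regions(ds, grid_polygons, regions, region_col="region_id"):
    """Sketch: conservatively aggregate ds (dims time, y, x) onto regions.

    grid_polygons is a GeoDataFrame of grid-cell boxes with 'y'/'x' columns
    holding each cell's center; regions is a GeoDataFrame of target polygons.
    Both are assumed to already be in the same equal-area CRS.
    """
    # 1. Exact area overlay between grid cells and region polygons
    overlap = grid_polygons.overlay(regions, keep_geom_type=True)
    # 2. Fraction of each region's area falling in each grid cell
    frac = overlap.geometry.area.groupby(overlap[region_col]).transform(lambda a: a / a.sum())
    # 3. Turn the fractions into a sparse (y, x, region) weights array
    idx = overlap.set_index(["y", "x", region_col]).index
    weights = (xr.Dataset(pd.DataFrame({"weights": frac.values}, index=idx))
               .unstack(sparse=True, fill_value=0.0)
               .weights)
    # 4. Weighted aggregation over the spatial dimensions
    return ds.weighted(weights).sum(dim=["x", "y"])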

The spatial polygons used in this notebook come from the NHDPlusV2 snapshot of the Watershed Boundary Dataset HUC12 boundaries provided through the PyGeoHydro python package.

We use the HyTest intake catalog to access CONUS404 from the OSN pod. This notebook provides a relatively simple and efficient workflow that can be easily run on a local computer.

%xmode minimal
import os
import xarray as xr
import geopandas as gp
import pandas as pd
import numpy as np
import sparse

import hvplot.pandas
import hvplot.xarray
import dask
import cf_xarray
import rioxarray  # registers the .rio accessor used for reprojection later in this notebook

from pynhd import NLDI, WaterData
from pygeohydro import watershed
import intake
import cartopy.crs as ccrs
from shapely.geometry import Polygon
Exception reporting mode: Minimal

Open Dataset from Intake Catalog#

First, let’s begin by loading the CONUS404 daily data.

hytest_cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")
catalog = hytest_cat['conus404-catalog']
list(catalog)
['conus404-hourly-onprem-hw',
 'conus404-hourly-cloud',
 'conus404-hourly-osn',
 'conus404-daily-diagnostic-onprem-hw',
 'conus404-daily-diagnostic-cloud',
 'conus404-daily-diagnostic-osn',
 'conus404-daily-onprem-hw',
 'conus404-daily-cloud',
 'conus404-daily-osn',
 'conus404-monthly-onprem-hw',
 'conus404-monthly-cloud',
 'conus404-monthly-osn',
 'conus404-hourly-ba-onprem-hw',
 'conus404-hourly-ba-osn',
 'conus404-daily-ba-onprem',
 'conus404-daily-ba-osn',
 'conus404-pgw-hourly-onprem-hw',
 'conus404-pgw-hourly-osn',
 'conus404-pgw-daily-diagnostic-onprem-hw',
 'conus404-pgw-daily-diagnostic-osn',
 'conus404-pgw-daily-onprem-hw',
 'conus404-pgw-daily-osn']

As we can see, there are three different locations for the conus404-daily dataset: (1) -onprem-hw, meaning it is stored on the USGS Hovenweep HPC; (2) -cloud, meaning it is stored in an S3 bucket; and (3) -osn, meaning the data is on the USGS open storage network (OSN). As the OSN is free to access from any environment, we will use it for this example, but the location can easily be changed depending on your needs. We have a writeup of the different storage locations used in the intake catalog here.

If you change this notebook to use the CONUS404 dataset stored on S3 (options ending in -cloud), you will be pulling data from a requester-pays S3 bucket. This means you must set up your AWS credentials, or the data will not load. Please note that reading the -cloud data from S3 may incur charges if you are reading data outside of the us-west-2 region or running the notebook outside of the cloud altogether. If you would like to access one of the -cloud options, uncomment and run the following code cell to set up your AWS credentials. You can find more info about this AWS helper function here.

# uncomment the lines below to read in your AWS credentials if you want to access data from a requester-pays bucket (-cloud)
# os.environ['AWS_PROFILE'] = 'default'
# %run ../environment_set_up/Help_AWS_Credentials.ipynb

Finally, read in the daily CONUS404 data set.

dataset = 'conus404-daily-osn'
conus404 = catalog[dataset].to_dask()
conus404
<xarray.Dataset> Size: 9TB
Dimensions:         (time: 15707, y: 1015, x: 1367, bottom_top_stag: 51,
                     bottom_top: 50, soil_layers_stag: 4, x_stag: 1368,
                     y_stag: 1016, snow_layers_stag: 3, snso_layers_stag: 7)
Coordinates:
    lat             (y, x) float32 6MB dask.array<chunksize=(350, 350), meta=np.ndarray>
    lat_u           (y, x_stag) float32 6MB dask.array<chunksize=(350, 175), meta=np.ndarray>
    lat_v           (y_stag, x) float32 6MB dask.array<chunksize=(175, 350), meta=np.ndarray>
    lon             (y, x) float32 6MB dask.array<chunksize=(350, 350), meta=np.ndarray>
    lon_u           (y, x_stag) float32 6MB dask.array<chunksize=(350, 175), meta=np.ndarray>
    lon_v           (y_stag, x) float32 6MB dask.array<chunksize=(175, 350), meta=np.ndarray>
  * time            (time) datetime64[ns] 126kB 1979-10-01 ... 2022-10-01
  * x               (x) float64 11kB -2.732e+06 -2.728e+06 ... 2.732e+06
  * y               (y) float64 8kB -2.028e+06 -2.024e+06 ... 2.028e+06
Dimensions without coordinates: bottom_top_stag, bottom_top, soil_layers_stag,
                                x_stag, y_stag, snow_layers_stag,
                                snso_layers_stag
Data variables: (12/153)
    ACDEWC          (time, y, x) float32 87GB dask.array<chunksize=(36, 350, 350), meta=np.ndarray>
    ACDRIPR         (time, y, x) float32 87GB dask.array<chunksize=(36, 350, 350), meta=np.ndarray>
    ACDRIPS         (time, y, x) float32 87GB dask.array<chunksize=(36, 350, 350), meta=np.ndarray>
    ACECAN          (time, y, x) float32 87GB dask.array<chunksize=(36, 350, 350), meta=np.ndarray>
    ACEDIR          (time, y, x) float32 87GB dask.array<chunksize=(36, 350, 350), meta=np.ndarray>
    ACETLSM         (time, y, x) float32 87GB dask.array<chunksize=(36, 350, 350), meta=np.ndarray>
    ...              ...
    ZNU             (bottom_top) float32 200B dask.array<chunksize=(50,), meta=np.ndarray>
    ZNW             (bottom_top_stag) float32 204B dask.array<chunksize=(51,), meta=np.ndarray>
    ZS              (soil_layers_stag) float32 16B dask.array<chunksize=(4,), meta=np.ndarray>
    ZSNSO           (time, snso_layers_stag, y, x) float32 610GB dask.array<chunksize=(36, 7, 350, 350), meta=np.ndarray>
    ZWT             (time, y, x) float32 87GB dask.array<chunksize=(36, 350, 350), meta=np.ndarray>
    crs             int64 8B ...
Attributes: (12/148)
    AER_ANGEXP_OPT:                  1
    AER_ANGEXP_VAL:                  1.2999999523162842
    AER_AOD550_OPT:                  1
    AER_AOD550_VAL:                  0.11999999731779099
    AER_ASY_OPT:                     1
    AER_ASY_VAL:                     0.8999999761581421
    ...                              ...
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1
    W_DAMPING:                       1
    YSU_TOPDOWN_PBLMIX:              0
    history:                         Tue Mar 29 16:35:22 2022: ncrcat -A -vW ...
    history_of_appended_files:       Tue Mar 29 16:35:22 2022: Appended file ...

Parallelize with Dask (optional)#

Some of the steps we will take can make use of parallel, clustered compute environments via dask. We can start a cluster now so that later steps take advantage of this ability. This step is optional, but it speeds up data loading significantly, especially when accessing data from the cloud.

We have documentation on how to start a Dask Cluster in different computing environments here. Uncomment the cluster start up that works for your compute environment.

%run ../environment_set_up/Start_Dask_Cluster_Nebari.ipynb
## If this notebook is not being run on Nebari/ESIP, replace the above 
## path name with a helper appropriate to your compute environment.  Examples:
# %run ../environment_set_up/Start_Dask_Cluster_Denali.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Tallgrass.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Desktop.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_PangeoCHS.ipynb
The 'cluster' object can be used to adjust cluster behavior.  i.e. 'cluster.adapt(minimum=10)'
The 'client' object can be used to directly interact with the cluster.  i.e. 'client.submit(func)' 
The link to view the client dashboard is:
>  https://hytestnebari.dev-wma.chs.usgs.gov/gateway/clusters/dev.c676ff22d4694540af65a7193eaf2d4d/status
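If none of the helper notebooks match your compute environment, a basic local Dask cluster can also be started directly; a minimal sketch (uncomment to use):

# from dask.distributed import Client
# client = Client()  # starts a LocalCluster on this machine and connects a client to it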

Load the Feature Polygons#

Now that we have read in the CONUS404 data, we need to read in the polygons we will aggregate to. For this example, we use the HUC12 basins within the Delaware River Basin. To get these polygons, we use pygeohydro.watershed to query the Watershed Boundary Dataset (WBD) for the HUC4 boundary of the Delaware River Basin (0204) and pynhd.WaterData to pull the HUC12 basins within it. All we then need are the HUC12 IDs; for the Delaware River Basin, those are the ones that start with 020401 or 020402.

%%time
wbd = watershed.WBD("huc4")
delaware_basin = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData('wbd12').bygeom(delaware_basin.iloc[0].geometry)
huc12_basins = huc12_basins[huc12_basins['huc12'].str.startswith(('020401', '020402'))]
huc12_basins
CPU times: user 3.04 s, sys: 204 ms, total: 3.24 s
Wall time: 13.4 s
geometry tnmid metasourceid sourcedatadesc sourceoriginator sourcefeatureid loaddate gnis_id areaacres areasqkm states huc12 name hutype humod tohuc noncontributingareaacres noncontributingareasqkm globalid
0 MULTIPOLYGON (((-75.02421 39.44397, -75.02353 ... {001055CD-7B5E-4F96-ABB1-22923FD2B454} None None None None 2013-01-18T07:08:41Z None 16178.10 65.47 NJ 020402060505 White Marsh Run-Maurice River S NM 020402060507 0 0 {A70A97FA-E29C-11E2-8094-0021280458E6}
1 MULTIPOLYGON (((-74.62113 41.00565, -74.62139 ... {009DA440-393A-46A0-BBA3-C38139823414} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:11:06Z None 29107.93 117.80 NJ 020401050502 Lubbers Run-Musconetcong River S NM 020401050503 0 0 {B830B4F8-E29C-11E2-8094-0021280458E6}
2 MULTIPOLYGON (((-75.75805 40.65996, -75.75783 ... {00F73634-FA00-464F-93F0-F0BCCB464567} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:10:58Z None 35229.84 142.57 PA 020402030305 Upper Maiden Creek S NM 020402030307 0 0 {BAF5E77A-E29C-11E2-8094-0021280458E6}
3 MULTIPOLYGON (((-75.13279 39.88601, -75.13439 ... {017B8A58-E959-4600-B212-8EB921F0F9F7} {E9F5C988-2313-440E-A05E-C5871E2773A6} None None None 2017-10-03T20:11:04Z None 24920.90 100.85 NJ,PA 020402020507 Woodbury Creek-Delaware River S TF 020402020607 0 0 {B122D6B7-E29C-11E2-8094-0021280458E6}
4 MULTIPOLYGON (((-74.92817 40.55256, -74.92833 ... {01B4598B-0623-407F-9133-3040D99A8D1A} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:10:52Z None 15086.80 61.05 NJ 020401050905 Lockatong Creek S NM 020401050908 0 0 {A6AB04BE-E29C-11E2-8094-0021280458E6}
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
656 MULTIPOLYGON (((-75.12376 40.12389, -75.12455 ... {FE1811B2-B495-4065-8624-913739AF724E} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:11:07Z None 23146.21 93.67 PA 020402030902 Lower Wissahickon Creek S NM 020402031007 0 0 {BB87766E-E29C-11E2-8094-0021280458E6}
659 MULTIPOLYGON (((-75.75805 40.65996, -75.75893 ... {FF44E71D-0E10-4CEF-ACB5-A148FCC1D201} None None None None 2013-01-18T07:08:53Z None 18124.40 73.35 PA 020402030301 Ontelaunee Creek S NM 020402030305 0 0 {BAD53EAD-E29C-11E2-8094-0021280458E6}
660 MULTIPOLYGON (((-75.19691 40.5796, -75.19719 4... {FF630558-25A7-462F-9E12-F4166478902C} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:10:58Z None 18929.21 76.60 PA 020401050604 Cooks Creek S NM 020401050605 0 0 {BA28799D-E29C-11E2-8094-0021280458E6}
661 MULTIPOLYGON (((-75.53503 39.11019, -75.53441 ... {FF6AB0DD-4B75-40D8-81D9-7AC2F0ACE65E} None None None None 2013-01-18T07:08:10Z None 6290.20 25.46 DE 020402070303 Tidbury Creek S NM 020402070304 0 0 {963110E8-E29C-11E2-8094-0021280458E6}
662 MULTIPOLYGON (((-74.85144 41.01532, -74.8516 4... {FF9C7C31-4BE0-426E-9CEA-FC14A0F526C7} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:10:56Z None 21760.30 88.06 NJ 020401050103 Middle Paulins Kill S NM 020401050104 0 0 {B82FC756-E29C-11E2-8094-0021280458E6}

427 rows × 19 columns

Let’s plot the HUC12 basins to see how they look.

huc12_basins.hvplot(
    c='huc12', title="Delaware River HUC12 basins",
    coastline='50m', geo=True,
    aspect='equal', legend=False, frame_width=300
)
/home/conda/global/75e44cfd-1739550182-21-pangeo/lib/python3.11/site-packages/cartopy/io/__init__.py:241: DownloadWarning: Downloading: https://naturalearth.s3.amazonaws.com/50m_physical/ne_50m_coastline.zip
  warnings.warn(f'Downloading: {url}', DownloadWarning)

An important thing to note is that all GeoDataFrames should have a coordinate reference system (CRS). Let's check that ours does.

huc12_basins.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
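Our basins do have a CRS (WGS 84). If a GeoDataFrame were ever missing one, it should be assigned before any reprojection; a small sketch, assuming we know the coordinates are WGS 84 longitude/latitude:

# Only needed if the CRS is missing; assumes the geometries are EPSG:4326 lon/lat
if huc12_basins.crs is None:
    huc12_basins = huc12_basins.set_crs("EPSG:4326")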

Limit CONUS404 Spatial Range#

With the HUC12 basins read in, we only need the CONUS404 data that covers these polygons, as they are the regions we will be aggregating. So, let's limit the CONUS404 spatial range to the extent of the basins. This will save on memory and computation. Note that doing this is mainly useful when the regions' footprint is much smaller than the footprint of the gridded model.

To limit the spatial range, we first need to convert the CRS of the basins to that of CONUS404. Then extract the bounding box of the basins.

huc12_basins_conus404_crs = huc12_basins.to_crs(conus404.crs.crs_wkt)
bbox = huc12_basins_conus404_crs.total_bounds
bbox
array([1768709.12577942,  198277.21996997, 1952739.24390951,
        614126.29932052])

Then, select the CONUS404 data within the bounding box. When we do this, we extend the bounds outward by 5% of their range to ensure all of our basins fall within the spatially limited data. We do this because CRS reprojection can introduce slight distortions that would otherwise leave polygons near the edges partially outside the data.

bbox_x_range = bbox[2] - bbox[0]
bbox_y_range = bbox[3] - bbox[1]
x_range = slice(bbox[0] - bbox_x_range * 0.05,
                bbox[2] + bbox_x_range * 0.05)
y_range = slice(bbox[1] - bbox_y_range * 0.05,
                bbox[3] + bbox_y_range * 0.05)

conus404 = conus404.sel(x=x_range, y=y_range)
conus404
<xarray.Dataset> Size: 51GB
Dimensions:         (time: 15707, y: 114, x: 51, bottom_top_stag: 51,
                     bottom_top: 50, soil_layers_stag: 4, x_stag: 1368,
                     y_stag: 1016, snow_layers_stag: 3, snso_layers_stag: 7)
Coordinates:
    lat             (y, x) float32 23kB dask.array<chunksize=(114, 51), meta=np.ndarray>
    lat_u           (y, x_stag) float32 624kB dask.array<chunksize=(114, 175), meta=np.ndarray>
    lat_v           (y_stag, x) float32 207kB dask.array<chunksize=(175, 51), meta=np.ndarray>
    lon             (y, x) float32 23kB dask.array<chunksize=(114, 51), meta=np.ndarray>
    lon_u           (y, x_stag) float32 624kB dask.array<chunksize=(114, 175), meta=np.ndarray>
    lon_v           (y_stag, x) float32 207kB dask.array<chunksize=(175, 51), meta=np.ndarray>
  * time            (time) datetime64[ns] 126kB 1979-10-01 ... 2022-10-01
  * x               (x) float64 408B 1.76e+06 1.764e+06 ... 1.956e+06 1.96e+06
  * y               (y) float64 912B 1.8e+05 1.84e+05 ... 6.28e+05 6.32e+05
Dimensions without coordinates: bottom_top_stag, bottom_top, soil_layers_stag,
                                x_stag, y_stag, snow_layers_stag,
                                snso_layers_stag
Data variables: (12/153)
    ACDEWC          (time, y, x) float32 365MB dask.array<chunksize=(36, 114, 51), meta=np.ndarray>
    ACDRIPR         (time, y, x) float32 365MB dask.array<chunksize=(36, 114, 51), meta=np.ndarray>
    ACDRIPS         (time, y, x) float32 365MB dask.array<chunksize=(36, 114, 51), meta=np.ndarray>
    ACECAN          (time, y, x) float32 365MB dask.array<chunksize=(36, 114, 51), meta=np.ndarray>
    ACEDIR          (time, y, x) float32 365MB dask.array<chunksize=(36, 114, 51), meta=np.ndarray>
    ACETLSM         (time, y, x) float32 365MB dask.array<chunksize=(36, 114, 51), meta=np.ndarray>
    ...              ...
    ZNU             (bottom_top) float32 200B dask.array<chunksize=(50,), meta=np.ndarray>
    ZNW             (bottom_top_stag) float32 204B dask.array<chunksize=(51,), meta=np.ndarray>
    ZS              (soil_layers_stag) float32 16B dask.array<chunksize=(4,), meta=np.ndarray>
    ZSNSO           (time, snso_layers_stag, y, x) float32 3GB dask.array<chunksize=(36, 7, 114, 51), meta=np.ndarray>
    ZWT             (time, y, x) float32 365MB dask.array<chunksize=(36, 114, 51), meta=np.ndarray>
    crs             int64 8B ...
Attributes: (12/148)
    AER_ANGEXP_OPT:                  1
    AER_ANGEXP_VAL:                  1.2999999523162842
    AER_AOD550_OPT:                  1
    AER_AOD550_VAL:                  0.11999999731779099
    AER_ASY_OPT:                     1
    AER_ASY_VAL:                     0.8999999761581421
    ...                              ...
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1
    W_DAMPING:                       1
    YSU_TOPDOWN_PBLMIX:              0
    history:                         Tue Mar 29 16:35:22 2022: ncrcat -A -vW ...
    history_of_appended_files:       Tue Mar 29 16:35:22 2022: Appended file ...

To make sure this worked as intended, let’s plot the full basin over the extracted CONUS404 data footprint.

# Get the footprint of the grid
cutout = xr.ones_like(conus404.isel(time=0).drop_vars(['lat', 'lon'])['ACDEWC'])
# We need to write the CRS to the CONUS404 dataset and
# reproject to the crs of the HUC12 basins dataframe for clean plotting with hvplot
cutout = cutout.rio.write_crs(conus404.crs.crs_wkt).rio.reproject('EPSG:4326')

cutout_plt = cutout.hvplot(
    coastline='50m', geo=True,
    aspect='equal', frame_width=300, colorbar=False
)
huc12_plt = huc12_basins.hvplot(
    geo=True, c='r'
)
cutout_plt * huc12_plt

Looks good!

Aggregate CONUS404 to HUC12 Polygons#

Now that we have our dataset and basin polygons prepared, we are ready to aggregate.

Create Grid Polygons#

The first step here is to extract the CONUS404 grid information, which consists of getting the grid center points and the grid bounds.

grid = conus404[['x', 'y']].drop_vars(['lat', 'lon']).reset_coords()
grid = grid.cf.add_bounds(['x', 'y'])
grid
<xarray.Dataset> Size: 4kB
Dimensions:   (x: 51, y: 114, bounds: 2)
Coordinates:
  * x         (x) float64 408B 1.76e+06 1.764e+06 ... 1.956e+06 1.96e+06
  * y         (y) float64 912B 1.8e+05 1.84e+05 1.88e+05 ... 6.28e+05 6.32e+05
    y_bounds  (y, bounds) float64 2kB 1.78e+05 1.82e+05 ... 6.3e+05 6.34e+05
    x_bounds  (x, bounds) float64 816B 1.758e+06 1.762e+06 ... 1.962e+06
Dimensions without coordinates: bounds
Data variables:
    *empty*
Attributes: (12/148)
    AER_ANGEXP_OPT:                  1
    AER_ANGEXP_VAL:                  1.2999999523162842
    AER_AOD550_OPT:                  1
    AER_AOD550_VAL:                  0.11999999731779099
    AER_ASY_OPT:                     1
    AER_ASY_VAL:                     0.8999999761581421
    ...                              ...
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1
    W_DAMPING:                       1
    YSU_TOPDOWN_PBLMIX:              0
    history:                         Tue Mar 29 16:35:22 2022: ncrcat -A -vW ...
    history_of_appended_files:       Tue Mar 29 16:35:22 2022: Appended file ...

Then, we “stack” the data into a single 1D array. This creates an index of (y, x) center-point pairs that links to the bounds, which makes generating polygons of the grid cells from the bounds much simpler than any manual looping.

points = grid.stack(point=('y', 'x'))
points
<xarray.Dataset> Size: 326kB
Dimensions:   (bounds: 2, point: 5814)
Coordinates:
    y_bounds  (bounds, point) float64 93kB 1.78e+05 1.78e+05 ... 6.34e+05
    x_bounds  (bounds, point) float64 93kB 1.758e+06 1.762e+06 ... 1.962e+06
  * point     (point) object 47kB MultiIndex
  * y         (point) float64 47kB 1.8e+05 1.8e+05 1.8e+05 ... 6.32e+05 6.32e+05
  * x         (point) float64 47kB 1.76e+06 1.764e+06 ... 1.956e+06 1.96e+06
Dimensions without coordinates: bounds
Data variables:
    *empty*
Attributes: (12/148)
    AER_ANGEXP_OPT:                  1
    AER_ANGEXP_VAL:                  1.2999999523162842
    AER_AOD550_OPT:                  1
    AER_AOD550_VAL:                  0.11999999731779099
    AER_ASY_OPT:                     1
    AER_ASY_VAL:                     0.8999999761581421
    ...                              ...
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1
    W_DAMPING:                       1
    YSU_TOPDOWN_PBLMIX:              0
    history:                         Tue Mar 29 16:35:22 2022: ncrcat -A -vW ...
    history_of_appended_files:       Tue Mar 29 16:35:22 2022: Appended file ...

Next, we can use the point pairs we just created to make polygons from the bounds. To do this, we will write a simple function that takes the x and y bounds of a grid cell and generates its polygon, and apply it over all cells using xarray.apply_ufunc. Note that this step will get slower as the grid size increases beyond our limited range; a vectorized alternative using shapely is sketched after the output below.

%%time
def bounds_to_poly(x_bounds, y_bounds):
    return Polygon([
        (x_bounds[0], y_bounds[0]),
        (x_bounds[0], y_bounds[1]),
        (x_bounds[1], y_bounds[1]),
        (x_bounds[1], y_bounds[0])
    ])
    
boxes = xr.apply_ufunc(
    bounds_to_poly,
    points.x_bounds,
    points.y_bounds,
    input_core_dims=[("bounds",),  ("bounds",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)
boxes
CPU times: user 123 ms, sys: 3.78 ms, total: 127 ms
Wall time: 127 ms
<xarray.DataArray (point: 5814)> Size: 47kB
array([<POLYGON ((1758000 178000, 1758000 182000, 1762000 182000, 1762000 178000, 1...>,
       <POLYGON ((1762000 178000, 1762000 182000, 1766000 182000, 1766000 178000, 1...>,
       <POLYGON ((1766000 178000, 1766000 182000, 1770000 182000, 1770000 178000, 1...>,
       ...,
       <POLYGON ((1950000 630000, 1950000 634000, 1954000 634000, 1954000 630000, 1...>,
       <POLYGON ((1954000 630000, 1954000 634000, 1958000 634000, 1958000 630000, 1...>,
       <POLYGON ((1958000 630000, 1958000 634000, 1962000 634000, 1962000 630000, 1...>],
      dtype=object)
Coordinates:
  * point    (point) object 47kB MultiIndex
  * y        (point) float64 47kB 1.8e+05 1.8e+05 1.8e+05 ... 6.32e+05 6.32e+05
  * x        (point) float64 47kB 1.76e+06 1.764e+06 ... 1.956e+06 1.96e+06
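As an aside, the per-cell Python loop above could likely be avoided with Shapely 2.x's vectorized box constructor; a hedged sketch, assuming the same points object as above and that the first entry along the bounds dimension is the lower edge:

import shapely  # Shapely >= 2.0 provides array-based geometry constructors

boxes_vectorized = shapely.box(
    points.x_bounds.isel(bounds=0).values,  # xmin for every cell
    points.y_bounds.isel(bounds=0).values,  # ymin for every cell
    points.x_bounds.isel(bounds=1).values,  # xmax for every cell
    points.y_bounds.isel(bounds=1).values,  # ymax for every cell
)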

Finally, we convert this xarray.DataArray to a geopandas.GeoDataFrame, specifying the projected CRS to be the same as the CONUS404 dataset.

grid_polygons = gp.GeoDataFrame(
    data={"geometry": boxes.values, "y": boxes['y'], "x": boxes['x']},
    index=boxes.indexes["point"],
    crs=conus404.crs.crs_wkt
)
grid_polygons
geometry y x
y x
180000.0 1760000.0 POLYGON ((1758000 178000, 1758000 182000, 1762... 180000.0 1760000.0
1764000.0 POLYGON ((1762000 178000, 1762000 182000, 1766... 180000.0 1764000.0
1768000.0 POLYGON ((1766000 178000, 1766000 182000, 1770... 180000.0 1768000.0
1772000.0 POLYGON ((1770000 178000, 1770000 182000, 1774... 180000.0 1772000.0
1776000.0 POLYGON ((1774000 178000, 1774000 182000, 1778... 180000.0 1776000.0
... ... ... ... ...
632000.0 1944000.0 POLYGON ((1942000 630000, 1942000 634000, 1946... 632000.0 1944000.0
1948000.0 POLYGON ((1946000 630000, 1946000 634000, 1950... 632000.0 1948000.0
1952000.0 POLYGON ((1950000 630000, 1950000 634000, 1954... 632000.0 1952000.0
1956000.0 POLYGON ((1954000 630000, 1954000 634000, 1958... 632000.0 1956000.0
1960000.0 POLYGON ((1958000 630000, 1958000 634000, 1962... 632000.0 1960000.0

5814 rows × 3 columns

Key Step: Overlay the Two Geometries#

We are finally ready for the magic of this method: the weight generation using geopandas. To calculate the weights, we use the overlay method in geopandas, which calculates the area of overlap between the polygons in two different GeoDataFrames, i.e., the grid-cell polygons and the HUC12 polygons. An important thing to note is that the weights must be generated in an equal-area projection (i.e., an equal-area CRS). So, before we overlay, we convert both GeoDataFrames to an area-preserving projection; here we use the NSIDC EASE-Grid 2.0 North projection (EPSG:6931).

As long as the feature polygons cover only a few hundred grid cells, this is an extremely fast operation. However, if thousands of grid cells are covered, it can be wasteful, as the exact overlay is really only needed for grid cells that are partially covered; fully covered cells could be handled with a faster method. We will leave that more complex topic for another notebook.

%%time
crs_area = "EPSG:6931" # Good for northern hemisphere
# crs_area = "EPSG:5070" # Good for CONUS

huc12_basins_area = huc12_basins.to_crs(crs_area)
grid_polygons = grid_polygons.to_crs(crs_area)

# overlay the polygons
overlay = grid_polygons.overlay(huc12_basins_area, keep_geom_type=True)
CPU times: user 1.08 s, sys: 12 ms, total: 1.09 s
Wall time: 1.09 s

This is essentially already a sparse matrix, mapping one grid space to the other. How sparse? Let’s check.

sparsity = len(overlay) / (len(grid_polygons) * len(huc12_basins_area))
sparsity
0.002265789836210584

Let’s explore these overlays a little bit more. Mainly, we can verify that each basin’s area is preserved in the overlay operation.

# calculate areas of HUC12s from overlay and original polygons
overlay_area = overlay.geometry.area.groupby(overlay['huc12']).sum()
huc12_area = huc12_basins_area.geometry.area.groupby(huc12_basins_area['huc12']).sum()
# find the max fractional difference
(np.abs(overlay_area - huc12_area) / huc12_area).max()
1.6621210868136367e-14

Nice! It worked, and the differences are within machine precision.

Calculate the Weights (i.e., Area Fraction for each Region)#

Now that we have the overlay of the grid polygons with the HUC12 polygons, we just need to transform it into weights for each grid cell. The transform tells us what fraction of a HUC12 polygon's total area falls within each grid cell, which is accurate because we used an area-preserving CRS. Calculating this fractional area is again simple with geopandas.

grid_cell_fraction = overlay.geometry.area.groupby(overlay['huc12']).transform(lambda x: x / x.sum())
grid_cell_fraction
0       0.123877
1       0.069204
2       0.062925
3       0.000998
4       0.004910
          ...   
5620    0.122028
5621    0.007074
5622    0.088679
5623    0.029890
5624    0.000531
Length: 5625, dtype: float64

We can verify that these all sum up to one.

grid_cell_fraction.groupby(overlay['huc12']).sum().unique()
array([1., 1., 1., 1., 1.])
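The five values displayed above are actually distinct floating-point numbers that all round to 1; a tolerance-based check (a small sketch) makes the result explicit:

# Confirm every HUC12's weights sum to 1 within floating-point tolerance
assert np.allclose(grid_cell_fraction.groupby(overlay['huc12']).sum(), 1.0)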

However, in their current Series form, the weights aren't very useful. We need to convert them to a sparse array within xarray. Thankfully, xarray can do this easily if we add a MultiIndex of the grid-cell center points and HUC12 IDs to the cell-fraction DataFrame.

multi_index = overlay.set_index(['y', 'x', 'huc12']).index
df_weights = pd.DataFrame({"weights": grid_cell_fraction.values}, index=multi_index)
df_weights
weights
y x huc12
200000.0 1904000.0 020402070601 0.123877
1908000.0 020402070601 0.069204
1912000.0 020402070603 0.062925
020402070601 0.000998
204000.0 1900000.0 020402070601 0.004910
... ... ... ...
608000.0 1864000.0 020401010101 0.122028
1868000.0 020401010101 0.007074
612000.0 1860000.0 020401010101 0.088679
1864000.0 020401010101 0.029890
616000.0 1860000.0 020401010101 0.000531

5625 rows × 1 columns

We can bring this directly into xarray as a 1D Dataset and then unstack it into a sparse array.

ds_weights = xr.Dataset(df_weights)
weights_sparse = ds_weights.unstack(sparse=True, fill_value=0.).weights
weights_sparse
<xarray.DataArray 'weights' (y: 105, x: 47, huc12: 427)> Size: 79kB
<COO: shape=(105, 47, 427), dtype=float64, nnz=5625, fill_value=0.0>
Coordinates:
  * y        (y) float64 840B 2e+05 2.04e+05 2.08e+05 ... 6.12e+05 6.16e+05
  * x        (x) float64 376B 1.768e+06 1.772e+06 ... 1.948e+06 1.952e+06
  * huc12    (huc12) object 3kB '020401010101' '020401010102' ... '020402070605'

Again, we can clearly see from its density that this is a sparse matrix. This is now all we need to do our aggregation.

Perform the Aggregation#

Unlike deriving the weights, performing the aggregation itself is a single line of code, because we use xarray.Dataset.weighted to do the weighted calculation. It will happily take a sparse array as weights and compute the aggregation.

However, rather than aggregating all variables, let’s only aggregate two in order to reduce the computational time.

%%time
# Note the .compute() at the end to actually do the computation here vs lazy computing
huc12_aggregation = conus404[['T2', 'PREC_ACC_NC']].weighted(weights_sparse).sum(dim=['x', 'y']).compute()
huc12_aggregation
CPU times: user 5.09 s, sys: 569 ms, total: 5.66 s
Wall time: 1min 41s
<xarray.Dataset> Size: 289MB
Dimensions:      (time: 15707, huc12: 427)
Coordinates:
  * time         (time) datetime64[ns] 126kB 1979-10-01 ... 2022-10-01
  * huc12        (huc12) object 3kB '020401010101' ... '020402070605'
Data variables:
    T2           (time, huc12) float64 161MB <COO: nnz=6706462, fill_value=0.0>
    PREC_ACC_NC  (time, huc12) float64 127MB <COO: nnz=5311162, fill_value=0.0>

Note that our aggregations are still sparse arrays, which is the cost of using a sparse array as our weights. However, these aggregated arrays are now quite dense, so we want to convert them back to dense arrays. To do this, we can use apply_ufunc with a lambda function that densifies the data.

huc12_aggregation = xr.apply_ufunc(lambda data: data.todense(), huc12_aggregation)

It is important to note that we used a sum aggregation above rather than a mean. In theory, the two are equivalent: a weighted sum equals a weighted mean when the weights sum to one (which they do in our case):

\[\textrm{weighted mean:\ }\frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}\]
\[\textrm{weighted sum:\ } \sum_{i=1}^n w_i x_i\]
\[\frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i} = \sum_{i=1}^n w_i x_i \quad \textrm{if } \sum_{i=1}^n w_i = 1\]

However, in practice, the choice matters if your data contain NaN values. With a sum, NaNs are effectively treated as zeros, so an aggregation over all-NaN data returns 0. With a mean, NaNs are ignored in the calculation, but an aggregation over all-NaN data returns NaN. Therefore, choose based on how you want NaNs handled: use a mean if you want all-NaN aggregations to return NaN, and a sum if you want them to return 0.
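A tiny illustration of this behavior with hypothetical data:

# Hypothetical example: a region whose data are all NaN
nan_data = xr.DataArray([np.nan, np.nan], dims='cell')
w = xr.DataArray([0.5, 0.5], dims='cell')
print(nan_data.weighted(w).sum(dim='cell').item())   # 0.0 -> NaNs treated as zeros
print(nan_data.weighted(w).mean(dim='cell').item())  # nan -> all-NaN mean stays NaN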

Explore the Aggregation#

Now that we have the aggregated data and converted it to dense form, let’s make some plots!

Mean of Variable by HUC12#

First, we will calculate and plot the mean value over all time steps for every HUC12.

df_mean = huc12_aggregation.mean(dim='time').to_dataframe()
df_mean
T2 PREC_ACC_NC
huc12
020401010101 278.641501 3.321954
020401010102 278.726672 3.249965
020401010103 278.759801 3.348091
020401010104 278.762431 3.313887
020401010105 278.791650 3.425962
... ... ...
020402070601 286.098301 3.526245
020402070602 285.965278 3.521383
020402070603 285.991546 3.529709
020402070604 286.081634 3.549834
020402070605 285.935665 3.400082

427 rows × 2 columns

We need to merge this with the HUC12 basin GeoDataFrame to get the geometry info.

df_mean = pd.merge(huc12_basins, df_mean, left_on='huc12', right_on='huc12')
df_mean.head(3)
geometry tnmid metasourceid sourcedatadesc sourceoriginator sourcefeatureid loaddate gnis_id areaacres areasqkm ... huc12 name hutype humod tohuc noncontributingareaacres noncontributingareasqkm globalid T2 PREC_ACC_NC
0 MULTIPOLYGON (((-75.02421 39.44397, -75.02353 ... {001055CD-7B5E-4F96-ABB1-22923FD2B454} None None None None 2013-01-18T07:08:41Z None 16178.10 65.47 ... 020402060505 White Marsh Run-Maurice River S NM 020402060507 0 0 {A70A97FA-E29C-11E2-8094-0021280458E6} 285.346164 3.352189
1 MULTIPOLYGON (((-74.62113 41.00565, -74.62139 ... {009DA440-393A-46A0-BBA3-C38139823414} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:11:06Z None 29107.93 117.80 ... 020401050502 Lubbers Run-Musconetcong River S NM 020401050503 0 0 {B830B4F8-E29C-11E2-8094-0021280458E6} 282.001939 3.555950
2 MULTIPOLYGON (((-75.75805 40.65996, -75.75783 ... {00F73634-FA00-464F-93F0-F0BCCB464567} {ED602145-9201-4827-9CE1-05D252484579} None None None 2017-10-03T20:10:58Z None 35229.84 142.57 ... 020402030305 Upper Maiden Creek S NM 020402030307 0 0 {BAF5E77A-E29C-11E2-8094-0021280458E6} 282.713378 3.603080

3 rows × 21 columns

Time to plot our two example variables!

temp_plt = df_mean.hvplot(
    c='T2', geo=True, coastline='50m', cmap='viridis',
    title='Mean Temperature at 2m [K]', frame_width=300, aspect='equal'
)
precip_plt = df_mean.hvplot(
    c='PREC_ACC_NC', geo=True, coastline='50m', cmap='viridis',
    title='Mean 24hr Accumulated Precipitation [mm]', frame_width=300,
    aspect='equal'
)

temp_plt + precip_plt

Mean Monthly Time Series#

Finally, let’s plot the mean monthly time series for each HUC12.

monthly_timeseries = huc12_aggregation.resample(time="MS").mean()
monthly_timeseries
<xarray.Dataset> Size: 4MB
Dimensions:      (time: 517, huc12: 427)
Coordinates:
  * huc12        (huc12) object 3kB '020401010101' ... '020402070605'
  * time         (time) datetime64[ns] 4kB 1979-10-01 1979-11-01 ... 2022-10-01
Data variables:
    T2           (time, huc12) float64 2MB 279.5 279.7 279.6 ... 0.0 0.0 0.0
    PREC_ACC_NC  (time, huc12) float64 2MB 2.242 2.108 2.198 ... 0.0 0.0 0.0
monthly_timeseries.hvplot(x='time', grid=True)

Shut Down the Dask Client#

If we started a Dask client, we should now shut it down.

client.close()
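If the cluster startup helper also created a cluster object, it can optionally be shut down as well:

# Also stop the cluster itself, if one was created by the startup helper
cluster.close()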