CONUS404 Spatial Aggregation#

Using Xarray, GeoPandas and Sparse

Goal: Spatially aggregate a model data variable conservatively, i.e., by exactly partitioning each grid cell among the region polygons it overlaps.

Approach:

  • Represent both the original model grid and target grid as GeoPandas GeoSeries objects (with vector geometries)

  • Compute their area overlay and turn it into a sparse matrix

  • Perform matrix multiplication on the full Xarray dataset (with a time dimension)

This approach is quite fast and transparent; the core idea is sketched below.
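Conceptually, the aggregation is just a weighted sum: each region's value is the area-fraction-weighted combination of the grid cells that overlap it. A schematic sketch of the idea (illustrative names only, not objects defined in this notebook):

# For a single time step:
#   aggregated[region] = sum over cells of area_fraction[cell, region] * value[cell]
# For all time steps at once, this is a matrix product:
#   aggregated = values @ weights    # (ntime, ncell) @ (ncell, nregion) -> (ntime, nregion)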

%xmode minimal
import os
import xarray as xr
import geopandas as gp
import pandas as pd
import numpy as np
import sparse

import hvplot.pandas
import hvplot.xarray
import dask
import cf_xarray

from pynhd import NLDI, WaterData
import intake
import cartopy.crs as ccrs
from shapely.geometry import Polygon
Exception reporting mode: Minimal

Open dataset from Intake Catalog#

  • Select on-prem dataset from /caldera if running on prem (Denali/Tallgrass)

  • Select cloud/osn object store data if running elsewhere

# open the hytest data intake catalog
hytest_cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")
list(hytest_cat)
['conus404-catalog',
 'benchmarks-catalog',
 'conus404-drb-eval-tutorial-catalog',
 'nhm-v1.0-daymet-catalog',
 'nhm-v1.1-c404-bc-catalog',
 'nhm-v1.1-gridmet-catalog',
 'trends-and-drivers-catalog',
 'nhm-prms-v1.1-gridmet-format-testing-catalog',
 'nwis-streamflow-usgs-gages-onprem',
 'nwis-streamflow-usgs-gages-osn',
 'nwm21-streamflow-usgs-gages-onprem',
 'nwm21-streamflow-usgs-gages-osn',
 'nwm21-streamflow-cloud',
 'geofabric_v1_1-zip-osn',
 'geofabric_v1_1_POIs_v1_1-osn',
 'geofabric_v1_1_TBtoGFv1_POIs-osn',
 'geofabric_v1_1_nhru_v1_1-osn',
 'geofabric_v1_1_nhru_v1_1_simp-osn',
 'geofabric_v1_1_nsegment_v1_1-osn',
 'gages2_nndar-osn',
 'wbd-zip-osn',
 'huc12-geoparquet-osn',
 'huc12-gpkg-osn',
 'nwm21-scores',
 'lcmap-cloud',
 'rechunking-tutorial-osn',
 'pointsample-tutorial-sites-osn',
 'pointsample-tutorial-output-osn']
# open the conus404 sub-catalog
cat = hytest_cat['conus404-catalog']
list(cat)
['conus404-hourly-onprem-hw',
 'conus404-hourly-cloud',
 'conus404-hourly-osn',
 'conus404-daily-diagnostic-onprem-hw',
 'conus404-daily-diagnostic-cloud',
 'conus404-daily-diagnostic-osn',
 'conus404-daily-onprem-hw',
 'conus404-daily-cloud',
 'conus404-daily-osn',
 'conus404-monthly-onprem-hw',
 'conus404-monthly-cloud',
 'conus404-monthly-osn',
 'conus404-hourly-ba-onprem-hw',
 'conus404-hourly-ba-osn',
 'conus404-daily-ba-onprem',
 'conus404-daily-ba-osn',
 'conus404-pgw-hourly-onprem-hw',
 'conus404-pgw-hourly-osn',
 'conus404-pgw-daily-diagnostic-onprem-hw',
 'conus404-pgw-daily-diagnostic-osn']
## Select the dataset you want to read into your notebook and preview its metadata
dataset = 'conus404-daily-osn' 
cat[dataset]
conus404-daily-osn:
  args:
    consolidated: true
    storage_options:
      anon: true
      client_kwargs:
        endpoint_url: https://usgs.osn.mghpcc.org/
      requester_pays: false
    urlpath: s3://hytest/conus404/conus404_daily.zarr
  description: "CONUS404 daily values for subset of model output variables derived\
    \ from hourly values. This data is stored on HyTEST\u2019s Open Storage Network\
    \ (OSN) pod. This data can be read with the S3 API and is free to work with in\
    \ any computing environment (there are no egress fees)."
  driver: intake_xarray.xzarr.ZarrSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/subcatalogs

Set Up AWS Credentials (Optional)#

This notebook reads data from the OSN pod by default, which is object-store data served over a high-speed internet connection and free to access from any environment. If you change this notebook to use one of the CONUS404 datasets stored on S3 (options ending in -cloud), you will be pulling data from a requester-pays S3 bucket. This means you must set up your AWS credentials, or the data will not load. Please note that reading the -cloud data from S3 may incur charges if you are reading data outside of the us-west-2 region or running the notebook outside of the cloud altogether. If you would like to access one of the -cloud options, uncomment and run the following code snippet to set up your AWS credentials. You can find more info about this AWS helper function here.

# uncomment the lines below to read in your AWS credentials if you want to access data from a requester-pays bucket (-cloud)
# os.environ['AWS_PROFILE'] = 'default'
# %run ../environment_set_up/Help_AWS_Credentials.ipynb

Parallelize with Dask#

Some of the steps we will take can make use of a parallel Dask cluster. We will start a cluster now so that future steps can take advantage of it.

This is an optional step, but it speeds up data loading significantly, especially when accessing data from the cloud.

We have documentation on how to start a Dask Cluster in different computing environments here.

%run ../environment_set_up/Start_Dask_Cluster_Nebari.ipynb
## If this notebook is not being run on Nebari/ESIP, replace the above 
## path name with a helper appropriate to your compute environment.  Examples:
# %run ../environment_set_up/Start_Dask_Cluster_Denali.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Tallgrass.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_Desktop.ipynb
# %run ../environment_set_up/Start_Dask_Cluster_PangeoCHS.ipynb
The 'cluster' object can be used to adjust cluster behavior.  i.e. 'cluster.adapt(minimum=10)'
The 'client' object can be used to directly interact with the cluster.  i.e. 'client.submit(func)' 
The link to view the client dashboard is:
>  https://hytestnebari.dev-wma.chs.usgs.gov/gateway/clusters/dev.7ab65ed9c5f240b69f922d167603f16c/status
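If none of the helper notebooks above match your environment, a minimal local cluster can be started directly with dask.distributed (a sketch, assuming dask.distributed is installed; not part of the original workflow):

# from dask.distributed import Client, LocalCluster
# cluster = LocalCluster()   # start a cluster on the local machine
# client = Client(cluster)   # connect a client to it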
ds = cat[dataset].to_dask()
/home/conda/global/16102bfe-1731002172-4-pangeo/lib/python3.11/site-packages/intake_xarray/base.py:21: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
  'dims': dict(self._ds.dims),

Load the feature polygons (here, HUC12 catchment basins)#

Load with geopandas:

%%time
# USGS gage 01482100 Delaware River at Del Mem Bridge at Wilmington De
gage_id = '01482100'
nldi = NLDI()
del_basins = nldi.get_basins(gage_id)
huc12_basins = WaterData('wbd12').bygeom(del_basins.geometry.iloc[0])
CPU times: user 1.67 s, sys: 82.3 ms, total: 1.76 s
Wall time: 3.25 s
regions_df = huc12_basins
region_name = 'name'

Check it:

regions_df.plot(column=region_name, figsize=(10,4))
<Axes: >
../_images/0cb6ed2382606189d3fb6866d341cbf2456004e5fc18066dc6022570aad747ca.png

All GeoDataFrames should have a coordinate reference system (CRS). This is important (and sometimes unfamiliar to users coming from the global climate world).

crs_orig = f'EPSG:{regions_df.crs.to_epsg()}'
crs_orig
'EPSG:4326'
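If a GeoDataFrame ever arrives without a CRS (regions_df.crs is None), it should be assigned explicitly before any reprojection. A sketch, assuming the source data are geographic WGS 84:

# regions_df = regions_df.set_crs('EPSG:4326')  # declares the CRS; does not reproject the geometries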

Aggregate CONUS404 to feature polygons#

x = 'x'  # projected x coordinate name
y = 'y'  # projected y coordinate name
ds.crs
<xarray.DataArray 'crs' ()> Size: 8B
[1 values with dtype=int64]
Attributes: (12/16)
    crs_wkt:                        PROJCRS["unknown",BASEGEOGCRS["unknown",D...
    false_easting:                  0.0
    false_northing:                 0.0
    geographic_crs_name:            unknown
    grid_mapping_name:              lambert_conformal_conic
    horizontal_datum_name:          unknown
    ...                             ...
    prime_meridian_name:            Greenwich
    projected_crs_name:             unknown
    reference_ellipsoid_name:       unknown
    semi_major_axis:                6370000.0
    semi_minor_axis:                6370000.0
    standard_parallel:              [30.0, 50.0]
crs_info = ds.crs
xx = ds.x.values
yy = ds.y.values
globe = ccrs.Globe(ellipse='sphere', semimajor_axis=6370000, semiminor_axis=6370000)
lcc = ccrs.LambertConformal(globe=globe,
                            central_longitude=crs_info.longitude_of_central_meridian, 
                            central_latitude=crs_info.latitude_of_projection_origin,
                            standard_parallels=crs_info.standard_parallel)
lcc_wkt = lcc.to_wkt()
regions_df = regions_df.to_crs(lcc_wkt)
bbox = tuple(regions_df.total_bounds)
bbox
(1751263.2763795916, 214981.84499807426, 1964829.8118029477, 619548.96593117)

Subset the gridded model results to the bounds of the spatial dataframe to save memory and computation. This is most useful when regions_df covers a much smaller area than the footprint of the gridded model.

ds = ds.sel(x=slice(bbox[0],bbox[2]), y=slice(bbox[1],bbox[3]))

Now we extract just the horizontal grid information (the projected x and y coordinates), from which we will compute the cell bounds needed to create the CONUS404 grid-cell polygons.

# var = 'T2'          # 2 m temperature (alternative variable)
var = 'PREC_ACC_NC'   # precipitation
grid = ds[[var]].drop_vars(['time', 'lon', 'lat', var]).reset_coords().load()
grid
<xarray.Dataset> Size: 1kB
Dimensions:  (x: 54, y: 101)
Coordinates:
  * x        (x) float64 432B 1.752e+06 1.756e+06 ... 1.96e+06 1.964e+06
  * y        (y) float64 808B 2.16e+05 2.2e+05 2.24e+05 ... 6.12e+05 6.16e+05
Data variables:
    *empty*
Attributes: (12/148)
    AER_ANGEXP_OPT:                  1
    AER_ANGEXP_VAL:                  1.2999999523162842
    AER_AOD550_OPT:                  1
    AER_AOD550_VAL:                  0.11999999731779099
    AER_ASY_OPT:                     1
    AER_ASY_VAL:                     0.8999999761581421
    ...                              ...
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1
    W_DAMPING:                       1
    YSU_TOPDOWN_PBLMIX:              0
    history:                         Tue Mar 29 16:35:22 2022: ncrcat -A -vW ...
    history_of_appended_files:       Tue Mar 29 16:35:22 2022: Appended file ...

Now we add cell bounds with cf_xarray and “stack” the grid into a single 1D array of points. This is the first step towards transitioning to pandas.

grid = grid.cf.add_bounds([x, y])
points = grid.stack(point=(y,x))
points
<xarray.Dataset> Size: 305kB
Dimensions:   (bounds: 2, point: 5454)
Coordinates:
    x_bounds  (bounds, point) float64 87kB 1.75e+06 1.754e+06 ... 1.966e+06
    y_bounds  (bounds, point) float64 87kB 2.14e+05 2.14e+05 ... 6.18e+05
  * point     (point) object 44kB MultiIndex
  * y         (point) float64 44kB 2.16e+05 2.16e+05 ... 6.16e+05 6.16e+05
  * x         (point) float64 44kB 1.752e+06 1.756e+06 ... 1.96e+06 1.964e+06
Dimensions without coordinates: bounds
Data variables:
    *empty*
Attributes: (12/148)
    AER_ANGEXP_OPT:                  1
    AER_ANGEXP_VAL:                  1.2999999523162842
    AER_AOD550_OPT:                  1
    AER_AOD550_VAL:                  0.11999999731779099
    AER_ASY_OPT:                     1
    AER_ASY_VAL:                     0.8999999761581421
    ...                              ...
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1
    W_DAMPING:                       1
    YSU_TOPDOWN_PBLMIX:              0
    history:                         Tue Mar 29 16:35:22 2022: ncrcat -A -vW ...
    history_of_appended_files:       Tue Mar 29 16:35:22 2022: Appended file ...

This function creates the polygon for a single pair of cell bounds. It is not fast, but it is fast enough here. Perhaps it could be vectorized using pygeos (now part of Shapely 2.0); one possible approach is sketched below.

def bounds_to_poly(x_bounds, y_bounds):
    return Polygon([
        (x_bounds[0], y_bounds[0]),
        (x_bounds[0], y_bounds[1]),
        (x_bounds[1], y_bounds[1]),
        (x_bounds[1], y_bounds[0])
    ])

We apply this function to each grid cell.

%%time
boxes = xr.apply_ufunc(
    bounds_to_poly,
    points.x_bounds,
    points.y_bounds,
    input_core_dims=[("bounds",),  ("bounds",)],
    output_dtypes=[np.dtype('O')],
    vectorize=True
)
CPU times: user 140 ms, sys: 0 ns, total: 140 ms
Wall time: 139 ms
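As an aside, Shapely 2.0 (which absorbed pygeos) can build all of the cell polygons at once without a Python-level loop. A sketch, assuming Shapely >= 2.0 is installed (not used in the rest of this notebook):

import shapely  # requires Shapely >= 2.0 for vectorized geometry creation

xb = points.x_bounds.values  # shape (2, npoints): the two x edges of each cell
yb = points.y_bounds.values  # shape (2, npoints): the two y edges of each cell
# shapely.box accepts arrays and returns an array of Polygon objects
boxes_vec = shapely.box(xb.min(axis=0), yb.min(axis=0), xb.max(axis=0), yb.max(axis=0))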

Finally, we convert to a GeoDataFrame, specifying the projected CRS.

grid_df = gp.GeoDataFrame(
    data={"geometry": boxes.values, "y": boxes[y], "x": boxes[x]},
    index=boxes.indexes["point"],
    crs=lcc_wkt
)

We will now transform to an area-preserving projection. This is important because we want to do area-weighted regridding. Here we use the NSIDC EASE-Grid 2.0 projection for the Northern Hemisphere (EPSG:6931).

crs_area = "EPSG:6931"

regions_df = regions_df.to_crs(crs_area)
grid_df = grid_df.to_crs(crs_area)

grid_df.crs
<Projected CRS: EPSG:6931>
Name: WGS 84 / NSIDC EASE-Grid 2.0 North
Axis Info [cartesian]:
- X[south]: Easting (metre)
- Y[south]: Northing (metre)
Area of Use:
- name: Northern hemisphere.
- bounds: (-180.0, 0.0, 180.0, 90.0)
Coordinate Operation:
- name: US NSIDC EASE-Grid 2.0 North
- method: Lambert Azimuthal Equal Area
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Key Step: Overlay the two geometries#

This is the magic of GeoPandas; it can calculate the overlap between the original grid and the regions. It is expensive because it has to compare each of the 5,454 grid-cell polygons with each of the 460 HUC12 basin polygons.

In this dataframe, the x and y values are from the grid, while all the other columns are from the regions.

overlay = grid_df.overlay(regions_df, keep_geom_type=True)

This is essentially already a sparse matrix mapping one grid space to the other. How sparse?

sparsity = len(overlay) / (len(grid_df) * len(regions_df))
sparsity
0.002482422155259004

Let’s explore these overlays a little bit

We can verify that each basin’s area is preserved in the overlay operation.

overlay.geometry.area.groupby(overlay[region_name]).sum().nlargest(10)/1e6  # km2
name
Delaware River-Delaware Bay               1073.499824
Ashokan Reservoir-Esopus Creek             205.804534
Black Creek                                204.864521
Trout Creek                                186.163950
Lake Wallenpaupack-Wallenpaupack Creek     179.034924
Bear Creek                                 176.716685
Upper Mahanoy Creek                        170.467855
Headwaters Paulins Kill                    166.996307
Snitz Creek-Quittapahilla Creek            159.771219
Stony Creek-Lehigh River                   159.503079
dtype: float64
regions_df.geometry.area.groupby(regions_df[region_name]).sum().nlargest(10)
name
Delaware River-Delaware Bay               1.073500e+09
Ashokan Reservoir-Esopus Creek            2.058045e+08
Black Creek                               2.048645e+08
Trout Creek                               1.861640e+08
Lake Wallenpaupack-Wallenpaupack Creek    1.790349e+08
Bear Creek                                1.767167e+08
Upper Mahanoy Creek                       1.704679e+08
Headwaters Paulins Kill                   1.669963e+08
Snitz Creek-Quittapahilla Creek           1.597712e+08
Stony Creek-Lehigh River                  1.595031e+08
dtype: float64
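The same comparison can be made programmatically by taking the ratio of the overlaid area to the original basin area for each name (a sketch; values near 1 indicate the overlay preserved each basin's area):

overlay_area = overlay.geometry.area.groupby(overlay[region_name]).sum()
region_area = regions_df.geometry.area.groupby(regions_df[region_name]).sum()
(overlay_area / region_area).describe()  # basins with no grid overlap would appear as NaN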

Calculate the area fraction for each region#

This is another key step. This transform tells us what fraction of each basin’s total area comes from each of the grid cells. This is accurate because we used an area-preserving CRS.

grid_cell_fraction = overlay.geometry.area.groupby(overlay[region_name]).transform(lambda x: x / x.sum())
grid_cell_fraction
0       0.006247
1       0.006877
2       0.000807
3       0.004617
4       0.014839
          ...   
6223    0.003519
6224    0.012164
6225    0.186799
6226    0.183510
6227    0.091903
Length: 6228, dtype: float64

We can verify that these all sum up to one.

grid_cell_fraction.groupby(overlay[region_name]).sum()
name
Alexauken Creek                      1.0
Allegheny Creek-Delaware River       1.0
Angelica Creek-Schuylkill River      1.0
Antietam Creek                       1.0
Appenzell Creek                      1.0
                                    ... 
Woodland Creek-Esopus Creek          1.0
Wrangle Brook                        1.0
Wright Creek-Lehigh River            1.0
Wyomissing Creek                     1.0
Yaleville Brook-Susquehanna River    1.0
Length: 451, dtype: float64
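An equivalent programmatic check (a sketch; it should hold by construction, since each group was normalized by its own sum):

# every per-basin sum of fractions should be numerically 1
assert np.allclose(grid_cell_fraction.groupby(overlay[region_name]).sum(), 1.0)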

Turn this into a sparse Xarray DataArray#

The first step is making a MultiIndex.

multi_index = overlay.set_index([y, x, region_name]).index
df_weights = pd.DataFrame({"weights": grid_cell_fraction.values}, index=multi_index)
df_weights
weights
y x name
216000.0 1920000.0 Delaware River-Delaware Bay 0.006247
1924000.0 Delaware River-Delaware Bay 0.006877
220000.0 1912000.0 Delaware River-Delaware Bay 0.000807
1916000.0 Delaware River-Delaware Bay 0.004617
1920000.0 Delaware River-Delaware Bay 0.014839
... ... ... ...
616000.0 1860000.0 Mine Kill 0.003519
1864000.0 Middle Brook 0.012164
Mine Kill 0.186799
1868000.0 Mine Kill 0.183510
1872000.0 Mine Kill 0.091903

6228 rows × 1 columns

We can bring this directly into xarray as a 1D Dataset.

ds_weights = xr.Dataset(df_weights)

Now we unstack it into a sparse array.

weights_sparse = ds_weights.unstack(sparse=True, fill_value=0.).weights

Here we can clearly see that this is a sparse matrix, mapping the input space (y, x) to the output space (name).
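One quick way to confirm this (a sketch, assuming the backend is a sparse.COO array, as produced by unstack(sparse=True)) is to inspect the underlying array:

weights_sparse.data.density  # fraction of stored non-zero entries
weights_sparse.data.nnz      # number of non-zero entries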

Perform Matrix Multiplication#

To regrid the data, we just have to multiply the original precip dataset by this matrix.

#regridded = xr.dot(ds[var], weights_sparse)
#regridded = sparse.einsum('ij,jk->ik', ds[var].data, weights_sparse.data)

Unfortunately, that doesn’t work out of the box, because sparse doesn’t implement einsum (see https://github.com/pydata/sparse/issues/31).

# regridded[0].compute()  # fails

Sparse does implement matmul, so we can use that. But we have to do some reshaping to make it work with our data.

def apply_weights_matmul_sparse(weights, data):

    assert isinstance(weights, sparse.SparseArray)
    assert isinstance(data, np.ndarray)
    data = sparse.COO.from_numpy(data)
    data_shape = data.shape
    # k = nlat * nlon
    n, k = data_shape[0], data_shape[1] * data_shape[2]
    data = data.reshape((n, k))
    weights_shape = weights.shape
    k_, m = weights_shape[0] * weights_shape[1], weights_shape[2]
    assert k == k_
    weights_data = weights.reshape((k, m))

    regridded = sparse.matmul(data, weights_data)
    assert regridded.shape == (n, m)
    return regridded.todense()

Variables at the same location on the grid can use the same weights#

#var = 'T2'  # 2m Temp, grid cell centers
#var = 'PREC_ACC_NC' # precip, grid cell centers
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    var_regridded = xr.apply_ufunc(
        apply_weights_matmul_sparse,
        weights_sparse,
        ds[var],
        join="left",
        input_core_dims=[[y, x, region_name], [y, x]],
        output_core_dims=[[region_name]],
        dask="parallelized",
        dask_gufunc_kwargs=dict(meta=[np.ndarray((0,))])
    )
    

var_regridded.compute()
CPU times: user 2.8 s, sys: 221 ms, total: 3.02 s
Wall time: 1min 37s
<xarray.DataArray (time: 15707, name: 451)> Size: 57MB
array([[4.23796069e+01, 2.88045196e+01, 1.00145840e+01, ...,
        1.92430777e+00, 6.47324551e+00, 2.08002196e+00],
       [1.90177290e-02, 1.21817353e-01, 7.74821945e-01, ...,
        1.82437763e+00, 2.55103870e+00, 1.39471052e-01],
       [3.62181539e+01, 3.68524476e+01, 1.23457911e+01, ...,
        4.18607148e+01, 4.03074628e+00, 1.61067397e+01],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        3.73734659e-06, 0.00000000e+00, 5.24199928e-06],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.77499443e-03],
       [           nan,            nan,            nan, ...,
                   nan,            nan,            nan]])
Coordinates:
  * name     (name) object 4kB 'Alexauken Creek' ... 'Yaleville Brook-Susqueh...
  * time     (time) datetime64[ns] 126kB 1979-10-01 1979-10-02 ... 2022-10-01
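Because the weights depend only on the grid geometry, they can be reused for any other variable located at the grid-cell centers, e.g. T2. A sketch (not executed here):

# t2_regridded = xr.apply_ufunc(
#     apply_weights_matmul_sparse, weights_sparse, ds['T2'],
#     join="left",
#     input_core_dims=[[y, x, region_name], [y, x]],
#     output_core_dims=[[region_name]],
#     dask="parallelized",
#     dask_gufunc_kwargs=dict(meta=[np.ndarray((0,))]),
# )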

Explore time series data by region#

Plot monthly-average time series for two selected HUC12s.

ds_var = var_regridded.sel(name=['Mallory Brook-West Branch Delaware River', 'Yaleville Brook-Susquehanna River']).resample(time="MS").mean().to_dataset(region_name)
ds_var.hvplot(x='time', grid=True, frame_width=1000)

Explore the time-averaged variable by region#

Calculate the average value over all time steps for every HUC12.

df_mean = var_regridded.to_pandas().mean()
df_mean.name = var
df_mean = pd.DataFrame(df_mean).reset_index()
merged = pd.merge(regions_df, df_mean)

Convert back to geographic coordinates for plotting:

crs_geo = 'EPSG:4326'
merged_geo = merged.to_crs(crs_geo)

Holoviz:

merged_geo.hvplot(c=var, geo=True, cmap='viridis_r', frame_width=600, tiles='EsriTerrain', 
               title='CONUS404', alpha=0.7)

Be a good citizen and shut down the cluster#

… even though it would shut down automatically after 15 minutes of inactivity.

client.close(); cluster.shutdown()