gdptools CONUS404 Spatial Aggregation over DRB-extent HUC12s#

This tutorial demonstrates the use of gdptools, a Python package for area-weighted interpolation of source gridded datasets, such as CONUS404, to target polygonal geospatial fabrics. The source can be any gridded dataset that can be opened in XArray. Note, however, that gdptools generally operates on XArray Datasets or DataArrays with dimensions of (Y, X, Time); climate datasets with an ensemble dimension will therefore require subsetting by ensemble member to obtain a dataset with the proper dimensions. The target can be any polygonal dataset that can be read by GeoPandas. gdptools can also interpolate gridded data to lines, but our focus here is interpolating to polygons.
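
For example, if a source dataset carries an ensemble (member) dimension, a single member can be selected so that each variable is reduced to the (time, y, x) layout gdptools expects. The file and dimension names below are purely illustrative:

import xarray as xr

# Hypothetical dataset with dimensions (member, time, y, x); names are illustrative only.
ds_ens = xr.open_dataset("ensemble_climate_data.nc")

# Select a single ensemble member (by position) so each variable is left with
# dimensions (time, y, x), the layout gdptools operates on.
ds_single = ds_ens.isel(member=0)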

In this workflow, CONUS404 is aggregated to Delaware River Basin (DRB) HUC12s. The spatial polygons used in this notebook come from the NHDPlusV2 snapshot of the Watershed Boundary Dataset HUC12 boundaries provided through the PyGeoHydro Python package.

We use the HyTest intake catalog to access CONUS404 from the OSN pod. This notebook provides a relatively simple and efficient workflow that can be easily run on a local computer.

# Common python packages
import xarray as xr
import hvplot.xarray
import hvplot.pandas
import hvplot.dask
import intake
import warnings
import intake_xarray
import datetime
import holoviews as hv
import geoviews as gv
from holoviews import opts
import cartopy.crs as ccrs
import panel as pn
import numpy as np
import pandas as pd
import geopandas as gpd

# HyRiver packages
from pynhd import NLDI, WaterData
import pygeohydro as gh
# GDPTools packages
from gdptools import AggGen, UserCatData, WeightGen
import os
# Until gdptools updates its numpy dependency to v2, the environment variable below is required
os.environ["HYRIVER_CACHE_DISABLE"] = "true"

hv.extension("bokeh")
pn.extension()

warnings.filterwarnings('ignore')

Here we set up a variable that defines our compute context: working on the HPC or working locally on your desktop. This only changes the access point for the conus404 data, using the Hovenweep on-premises access for "HPC" and the OSN pod access for "Desktop".

t_sys = "Desktop"  # "HPC"  # or "Desktop"

We can use a subset of the HyRiver Python packages to access HUC12 geometries representing the Delaware River Basin. The process involves several steps:

  1. Select the HUC4 Mid-Atlantic Region:

    • This region encompasses the Delaware River Basin (HUC4 code: 0204).

  2. Retrieve HUC12 Basins within the Selected HUC4:

    • Obtain all HUC12 basins that fall within the HUC4 Mid-Atlantic region.

  3. Filter HUC12 Basins:

    • Focus on the HUC12 basins within the two HUC6 regions whose drainages terminate in the Delaware River Basin (DRB).

    • Exclude basins with drainages that terminate at the coast.

We used Science in Your Watershed to help identify the HUC6 regions that drain directly into the DRB.

wbd = gh.WBD("huc4")
del_huc4 = wbd.byids(field="huc4", fids="0204")
huc12_basins = WaterData('wbd12').bygeom(del_huc4.geometry[0])
filtered_gdf = huc12_basins[huc12_basins['huc12'].str.startswith(('020401', '020402'))]
filtered_gdf
(GeoDataFrame preview truncated: 427 rows × 19 columns, with columns geometry, tnmid, metasourceid, sourcedatadesc, sourceoriginator, sourcefeatureid, loaddate, gnis_id, areaacres, areasqkm, states, huc12, name, hutype, humod, tohuc, noncontributingareaacres, noncontributingareasqkm, globalid.)

from holoviews.element.tiles import OSM
drb = filtered_gdf.hvplot(
    geo=True, coastline='50m', alpha=0.2,  frame_width=300,
    xlabel="longitude", ylabel="latitude",
    title="Delaware River HUC12 basins", aspect='equal'
)
OSM() * drb

Access conus404 via the HyTest intake catalog.

# open the hytest data intake catalog
hytest_cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")
list(hytest_cat)
['conus404-catalog',
 'benchmarks-catalog',
 'conus404-drb-eval-tutorial-catalog',
 'nhm-v1.0-daymet-catalog',
 'nhm-v1.1-c404-bc-catalog',
 'nhm-v1.1-gridmet-catalog',
 'trends-and-drivers-catalog',
 'nhm-prms-v1.1-gridmet-format-testing-catalog',
 'nwis-streamflow-usgs-gages-onprem',
 'nwis-streamflow-usgs-gages-osn',
 'nwm21-streamflow-usgs-gages-onprem',
 'nwm21-streamflow-usgs-gages-osn',
 'nwm21-streamflow-cloud',
 'geofabric_v1_1-zip-osn',
 'geofabric_v1_1_POIs_v1_1-osn',
 'geofabric_v1_1_TBtoGFv1_POIs-osn',
 'geofabric_v1_1_nhru_v1_1-osn',
 'geofabric_v1_1_nhru_v1_1_simp-osn',
 'geofabric_v1_1_nsegment_v1_1-osn',
 'gages2_nndar-osn',
 'wbd-zip-osn',
 'huc12-geoparquet-osn',
 'huc12-gpkg-osn',
 'nwm21-scores',
 'lcmap-cloud',
 'rechunking-tutorial-osn',
 'pointsample-tutorial-sites-osn',
 'pointsample-tutorial-output-osn']
# open the conus404 sub-catalog
cat = hytest_cat['conus404-catalog']
list(cat)
['conus404-hourly-onprem-hw',
 'conus404-hourly-cloud',
 'conus404-hourly-osn',
 'conus404-daily-diagnostic-onprem-hw',
 'conus404-daily-diagnostic-cloud',
 'conus404-daily-diagnostic-osn',
 'conus404-daily-onprem-hw',
 'conus404-daily-cloud',
 'conus404-daily-osn',
 'conus404-monthly-onprem-hw',
 'conus404-monthly-cloud',
 'conus404-monthly-osn',
 'conus404-hourly-ba-onprem-hw',
 'conus404-hourly-ba-osn',
 'conus404-daily-ba-onprem',
 'conus404-daily-ba-osn',
 'conus404-pgw-hourly-onprem-hw',
 'conus404-pgw-hourly-osn',
 'conus404-pgw-daily-diagnostic-onprem-hw',
 'conus404-pgw-daily-diagnostic-osn',
 'conus404-pgw-daily-onprem-hw',
 'conus404-pgw-daily-osn']

There are a couple of options for accessing conus404:

  1. HPC Setting (t_sys = HPC):

    • Assumption: The notebook is run on the USGS HPC Hovenweep.

    • Access Method: Utilizes the on-premises version of the data.

    • Benefits:

      • Workflow Association: The workflow is directly linked to the data.

      • Speed: Eliminates the need to download data, significantly reducing access and processing time.

  2. Desktop Setting (t_sys = Desktop):

    • Use Case: Suitable for workflows that do not require HPC resources or for developing workflows locally before deploying them to the HPC.

    • Access Method: Connects to the conus404 data via the OSN pod.

    • Benefits:

      • Flexibility: Allows for local development and testing.

      • Performance: Provides a fast connection to the data.

## Select the dataset you want to read into your notebook and preview its metadata
if t_sys == "HPC":
    dataset = 'conus404-daily-diagnostic-onprem-hw'
elif t_sys == "Desktop":
    dataset = 'conus404-daily-diagnostic-osn' 
else:
    print("Please set the variable t_sys above to one of 'HPC' or 'Desktop'")        
cat[dataset]
<intake_xarray.xzarr.ZarrSource at 0x7f954c6a1b50>
# read in the dataset and use metpy to parse the crs information on the dataset
print(f"Reading {dataset} metadata...", end='')
ds = cat[dataset].to_dask().metpy.parse_cf()
ds
Reading conus404-daily-diagnostic-osn metadata...
<xarray.Dataset> Size: 3TB
Dimensions:       (y: 1015, x: 1367, time: 15707)
Coordinates:
    lat           (y, x) float32 6MB dask.array<chunksize=(350, 350), meta=np.ndarray>
    lon           (y, x) float32 6MB dask.array<chunksize=(350, 350), meta=np.ndarray>
  * x             (x) float64 11kB -2.732e+06 -2.728e+06 ... 2.728e+06 2.732e+06
  * y             (y) float64 8kB -2.028e+06 -2.024e+06 ... 2.024e+06 2.028e+06
  * time          (time) datetime64[ns] 126kB 1979-10-01 ... 2022-10-01
Data variables: (12/38)
    LANDMASK      (y, x) float32 6MB dask.array<chunksize=(350, 350), meta=np.ndarray>
    Q2MAX         (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    Q2MEAN        (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    Q2MIN         (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    Q2STD         (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    RAINCVMAX     (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    ...            ...
    U10MEAN       (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    U10STD        (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    V10MAX        (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    V10MEAN       (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    V10STD        (time, y, x) float32 87GB dask.array<chunksize=(24, 350, 350), meta=np.ndarray>
    crs           int64 8B ...
Attributes: (12/80)
    BL_PBL_PHYSICS:                  1
    BOTTOM-TOP_GRID_DIMENSION:       51
    BOTTOM-TOP_PATCH_END_STAG:       51
    BOTTOM-TOP_PATCH_END_UNSTAG:     50
    BOTTOM-TOP_PATCH_START_STAG:     1
    BOTTOM-TOP_PATCH_START_UNSTAG:   1
    ...                              ...
    USE_THETA_M:                     0
    WEST-EAST_GRID_DIMENSION:        1368
    WEST-EAST_PATCH_END_STAG:        1368
    WEST-EAST_PATCH_END_UNSTAG:      1367
    WEST-EAST_PATCH_START_STAG:      1
    WEST-EAST_PATCH_START_UNSTAG:    1

GDPTools Background#

In this section, we utilize three data classes from the gdptools package: UserCatData, WeightGen, and AggGen.

  • UserCatData:
    Serves as a data container for both the source and target datasets, along with their associated metadata. The instantiated object user_data is employed by both the WeightGen and AggGen classes.

  • WeightGen:
    Responsible for calculating the intersected areas between the source and target datasets. It generates normalized area-weights, which are subsequently used by the AggGen class to compute interpolated values between the datasets.

  • AggGen:
    Facilitates areal interpolation of the source gridded data to the target polygons, using the areal weights calculated by WeightGen. This process is conducted over the time period specified in the UserCatData object.

Instantiation of the UserCatData class.#

# Coordinate Reference System (CRS) of the conus404 dataset
source_crs = ds.crs.crs_wkt

# Coordinate names of the conus404 dataset
x_coord = "x"
y_coord = "y"
t_coord = "time"

# Time period of interest for areal interpolation of conus404 to DRB HUC12s
# using the AggGen class below. Note: The dates follow the same format as the
# time values in the conus404 dataset.
sdate = "1979-10-01T00:00:00.000000000"
edate = "2022-10-01T00:00:00.000000000"

# Variables from the conus404 dataset used for areal interpolation
variables = ["T2MIN", "T2MAX", "RAINNCVMEAN"]

# CRS of the DRB HUC12 polygons
target_crs = 5070

# Column name for the unique identifier associated with target polygons.
# This ID is used in both the generated weights file and the areal interpolated output.
target_poly_idx = "huc12"

# Common equal-area CRS for reprojecting both source and target data.
# This CRS is used for calculating areal weights in the WeightGen class.
weight_gen_crs = 5070

# Instantiate the UserCatData class, which serves as a container for both
# source and target datasets, along with associated metadata. The UserCatData
# object provides methods used by the WeightGen and AggGen classes to subset
# and reproject the data.
user_data = UserCatData(
    ds=ds,
    proj_ds=source_crs,
    x_coord=x_coord,
    y_coord=y_coord,
    t_coord=t_coord,
    var=variables,
    f_feature=filtered_gdf,
    proj_feature=target_crs,
    id_feature=target_poly_idx,
    period=[sdate, edate],
)
bounds:  <class 'numpy.ndarray'> [1760709.12577942  190277.21996997 1960739.24390951  622126.29932052]

Weight Generation with WeightGen#

In this section, we utilize the WeightGen class from the gdptools package to calculate the normalized areal weights necessary for interpolating the source gridded data (conus404) to the target polygonal boundaries (DRB HUC12s). The areal weights represent the proportion of each grid cell that overlaps with each polygon, facilitating accurate areal interpolation of the data. These weights are calculated using the calculate_weights() method.

Weight Calculation Process:

  1. Subset Source Data: The source data is subset to the bounds of the target data, with a small additional buffer to ensure the target polygons are fully covered.

  2. Create Cell-Boundary GeoDataFrame: A GeoDataFrame of grid-cell boundary polygons is created from the node coordinates of the subsetted source data, enabling spatial operations.

  3. Validate Geometries: The target file is checked for invalid geometries, which can occur due to various reasons such as topology errors. Invalid geometries are fixed using Shapely’s make_valid() method to prevent failures during intersection calculations.

  4. Calculate and Normalize Areas: For each polygon, gdptools calculates the area of each intersecting grid cell and normalizes it by the total area of the target polygon. This ensures that the weights for each polygon sum to 1, provided the polygon is entirely covered by the source data.

    • Validation: A quick check on the weights can be performed by grouping the resulting weights by the target_poly_idx and calculating the sum. For all polygons completely covered by the source data, the weights will sum to 1.

Note: The method parameter can be set to one of "serial", "parallel", or "dask". Given the scale of the gridded conus404 data (4 km × 4 km) and the number and spatial footprint of the DRB HUC12s, using "serial" in this case is the most efficient method. In subsequent sections, we will explore how the "parallel" and "dask" methods can provide speed-ups in the areal interpolation process, as well as in the computation of weights for broader CONUS-wide targets.

%%time
wght_gen = WeightGen(
    user_data=user_data,
    method="serial",
    output_file="wghts_drb_ser_c404daily.csv",
    weight_gen_crs=6931
)

wdf = wght_gen.calculate_weights()
Using serial engine
Generating grid-cell polygons finished in 0.88 second(s)
Data preparation finished in 1.0052 seconds
     - validating target polygons
     - fixing 0 invalid polygons.
     - validating source polygons
     - fixing 0 invalid polygons.
Validate polygons finished in 0.0429 seconds
     - reprojecting and validating source polygons
     - checking the source polygons for invalid polygons
     - checking source for empty polygons
     - reprojecting and validating target polygons
     - checking the target polygons for invalid polygons
     - checking target for empty polygons
Reprojecting to: EPSG:6931 and validating polygons finished in 0.13 seconds
Intersections finished in 0.7916 seconds
Weight gen finished in 0.7931 seconds
CPU times: user 1.99 s, sys: 15.5 ms, total: 2 s
Wall time: 2.03 s
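
As described in the validation step above, a quick check on the result is to group the weights by the polygon identifier and sum them; polygons fully covered by conus404 should sum to 1. The weight column name ("wght") is assumed here from the gdptools weights-file format; inspect wdf.columns if it differs in your version:

# Sanity check on the generated weights: for each huc12 the normalized areal
# weights should sum to ~1 when the polygon is completely covered by conus404.
# The "wght" column name is an assumption; adjust to match your weights file.
wght_sums = wdf.groupby("huc12")["wght"].sum()
print(wght_sums.describe())
print("Polygons not summing to ~1:", int((abs(wght_sums - 1.0) > 1e-6).sum()))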

Areal Interpolation with the AggGen Class#

In this section, we demonstrate the use of the AggGen class and its calculate_agg() method from the gdptools package to perform areal interpolation. We will explore all three agg_engine options: "serial", "parallel", and "dask". See the gdptools documentation for detailed descriptions of the available parameter options.

When using AggGen and the calculate_agg() method, it is important to consider the overlap between the source and target data when selecting the stat_method parameter value. Every statistical method has a masked variant in addition to the standard method, for example "mean" and "masked_mean". Where the source data only partially overlaps a target polygon, the "mean" method returns a missing value for that polygon, whereas the "masked_mean" method calculates the statistic from the available overlapping source cells. These considerations help users decide whether a masked statistic is desirable or whether a missing value is preferred, allowing missing values to be post-processed (e.g., with nearest-neighbor or other approaches to handle the lack of overlap). In our case, conus404 completely covers the footprint of the DRB HUC12s, so the "mean" method is sufficient.
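
The snippet below is a small illustrative calculation (not gdptools internals) showing why the two variants differ for a polygon that is only partially covered by the grid:

import numpy as np

# Three grid cells intersect a hypothetical polygon; one lies outside the source data (NaN).
vals = np.array([10.0, 12.0, np.nan])    # cell values for one time step
wghts = np.array([0.5, 0.3, 0.2])        # normalized areal weights

# "mean"-style statistic: any missing cell makes the result missing
plain = np.sum(vals * wghts)             # -> nan

# "masked_mean"-style statistic: renormalize over the valid cells only
valid = ~np.isnan(vals)
masked = np.sum(vals[valid] * wghts[valid]) / np.sum(wghts[valid])  # -> 10.75

print(plain, masked)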

%%time
agg_gen = AggGen(
    user_data=user_data,
    stat_method="mean",
    agg_engine="serial",
    agg_writer="netcdf",
    weights='wghts_drb_ser_c404daily.csv',
    out_path='.',
    file_prefix="serial_weights",
    precision=8
)
ngdf, ds_out = agg_gen.calculate_agg()

Output#

The calculate_agg() method returns two objects: ngdf and ds_out.

  • ngdf: A GeoDataFrame derived from the target GeoDataFrame (filtered_gdf) specified in the UserCatData container. This GeoDataFrame has been both sorted and dissolved based on the identifiers in the "huc12" column, as defined by the target_poly_idx parameter.

  • ds_out: The areally weighted interpolation output as an XArray Dataset. The Dataset consists of dimensions time and huc12, and the data variables are T2MIN, T2MAX, and RAINNCVMEAN.

A preview of the ngdf GeoDataFrame below shows that it is sorted by "huc12". In this case there are no duplicate "huc12" values, so the original and output GeoDataFrames have the same number of rows. Some target datasets, such as the GFv1.1, will result in many dissolved geometries.
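
A rough GeoPandas equivalent of that sort-and-dissolve step is shown below; the actual gdptools implementation may differ:

# Approximate the target preparation performed by gdptools: merge any features
# sharing a "huc12" identifier into a single geometry, then sort by that identifier.
dissolved = filtered_gdf.dissolve(by="huc12").sort_index().reset_index()
print(len(filtered_gdf), len(dissolved))  # equal here, since no huc12 values are duplicated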

ngdf.head()
(ngdf.head() preview truncated: 5 rows × 20 columns — huc12 values 020401010101 through 020401010105 with dissolved geometries and the original WBD attribute columns such as tnmid, loaddate, areaacres, areasqkm, states, name, hutype, humod, and tohuc.)
ds_out
<xarray.Dataset> Size: 161MB
Dimensions:      (time: 15707, huc12: 427)
Coordinates:
  * time         (time) datetime64[ns] 126kB 1979-10-01 ... 2022-10-01
  * huc12        (huc12) object 3kB '020401010101' ... '020402070605'
Data variables:
    T2MIN        (time, huc12) float64 54MB 0.0 0.0 0.0 ... 288.7 289.1 290.0
    T2MAX        (time, huc12) float64 54MB 0.0 0.0 0.0 ... 292.9 292.9 292.3
    RAINNCVMEAN  (time, huc12) float64 54MB 0.0 0.0 0.0 ... 7.12e-06 4.23e-06
Attributes:
    Conventions:  CF-1.8
    featureType:  timeSeries
    history:      2024_10_24_13_33_11 Original filec created  by gdptools pac...

Plot the results as a quick sanity check#

Here we plot the results along with the corresponding conus404 values. To make the plot a little more interesting, we choose the time step with the most precipitation. This provides a quick qualitative sanity check. If you are interested in a more detailed graphical presentation of a target polygon, overlaid on the intersecting grid cells with the grid-cell values and weights shown for each intersection, see the tail end of the ClimateR-Catalog - Terraclime data notebook for example code that gives a more quantitative expression of the result.

Format the interpolated results for plotting

# Convert processed xarray Dataset to DataFrame
df = ds_out.to_dataframe().reset_index()
# Pivot the DataFrame to have variables as separate columns
df_pivot = df.pivot_table(index=['time', 'huc12'], values=['T2MAX', 'T2MIN', 'RAINNCVMEAN']).reset_index()

# Merge GeoDataFrame with DataFrame
merged_gdf = ngdf.to_crs(5070).merge(df_pivot, on='huc12')

# Convert RAINNCVMEAN from kg/m²/s to mm/day
merged_gdf['RAINNCVMEAN_mm_day'] = merged_gdf['RAINNCVMEAN'] * 86400
# Convert T2MAX and T2MIN from Kelvin to Celsius
merged_gdf['T2MAX_C'] = merged_gdf['T2MAX'] - 273.15
merged_gdf['T2MIN_C'] = merged_gdf['T2MIN'] - 273.15

# Calculate total precipitation for each time step
rain_sum = merged_gdf.groupby('time')['RAINNCVMEAN_mm_day'].sum()

# Identify the time step with the maximum total precipitation
max_rain_time = rain_sum.idxmax()
print(f"Time step with maximum total precipitation: {max_rain_time}")

# Subset the GeoDataFrame for the selected time step
subset = merged_gdf[merged_gdf['time'] == max_rain_time]
Time step with maximum total precipitation: 1999-09-17 00:00:00

Process the sub-setted conus404 data for plotting

# Use our agg_gen object to retrieve the subsetted raw conus404 cubes used in the
# areal interpolation, select the wettest time step, and convert units
da_t2max = agg_gen.agg_data.get("T2MAX").da.sel(time=max_rain_time) - 273.15        # K -> °C
da_t2min = agg_gen.agg_data.get("T2MIN").da.sel(time=max_rain_time) - 273.15        # K -> °C
da_rain = agg_gen.agg_data.get("RAINNCVMEAN").da.sel(time=max_rain_time) * 86400    # kg/m²/s -> mm/day

Generate the plot

  • Here we use GeoViews and HoloViews

# Define Cartopy CRS using EPSG code 5070 (NAD83 / Conus Albers)
crs_cartopy = ccrs.epsg(5070)

# Define color maps for each variable
color_maps = {
    'T2MAX_C': 'Reds',
    'T2MIN_C': 'Blues',
    'RAINNCVMEAN_mm_day': 'Greens'
}
# Define color maps for raw data
color_maps_raw = {
    'T2MAX': 'Reds',
    'T2MIN': 'Blues',
    'RAINNCVMEAN': 'Greens'
}
# Create Polygons for T2MAX_C
map_T2MAX = gv.Polygons(
    subset.to_crs(5070), 
    vdims=['T2MAX_C'],
    crs=crs_cartopy  # Use Cartopy CRS here
).opts(
    cmap=color_maps['T2MAX_C'],
    colorbar=True,
    tools=['hover'],
    title='T2MAX (°C)',  # Included units
    alpha=0.7,
    frame_width=200,
    aspect='equal',
    padding=0,
)

# Create Polygons for T2MIN_C
map_T2MIN = gv.Polygons(
    subset.to_crs(5070), 
    vdims=['T2MIN_C'],
    crs=crs_cartopy
).opts(
    cmap=color_maps['T2MIN_C'],
    colorbar=True,
    tools=['hover'],
    title='T2MIN (°C)',  # Included units
    alpha=0.7,
    frame_width=200,
    aspect='equal',
    padding=0,
)

# Create Polygons for RAINNCVMEAN_mm_day
map_RAINNCVMEAN = gv.Polygons(
    subset.to_crs(5070), 
    vdims=['RAINNCVMEAN_mm_day'],
    crs=crs_cartopy
).opts(
    cmap=color_maps['RAINNCVMEAN_mm_day'],
    colorbar=True,
    tools=['hover'],
    title='RAINNCVMEAN (mm/day)',  # Included units
    alpha=0.7,
    frame_width=200,
    aspect='equal',
    padding=0,
)

# Create images of the raw sub-setted data
# Coordinate names of the subsetted conus404 DataArrays (projected x/y coordinates)
coord_x = 'x'
coord_y = 'y'

# Create HoloViews Image for T2MAX (Raw)
image_T2MAX_raw = hv.Image(
    da_t2max, 
    kdims=[coord_x, coord_y],
    vdims=['T2MAX']
).opts(
    cmap=color_maps_raw['T2MAX'],
    colorbar=True,
    tools=['hover'],
    title='T2MAX Raw (°C)',
    alpha=0.7,
    frame_width=200,
    aspect='equal'
)

# Create HoloViews Image for T2MIN (Raw)
image_T2MIN_raw = hv.Image(
    da_t2min, 
    kdims=[coord_x, coord_y],
    vdims=['T2MIN']
).opts(
    cmap=color_maps_raw['T2MIN'],
    colorbar=True,
    tools=['hover'],
    title='T2MIN Raw (°C)',
    alpha=0.7,
    frame_width=200,
    aspect='equal'
)

# Create HoloViews Image for RAINNCVMEAN (Raw)
image_RAIN_raw = hv.Image(
    da_rain, 
    kdims=[coord_x, coord_y],
    vdims=['RAINNCVMEAN']
).opts(
    cmap=color_maps_raw['RAINNCVMEAN'],
    colorbar=True,
    tools=['hover'],
    title='RAINNCVMEAN Raw (mm/day)',
    alpha=0.7,
    frame_width=200,
    aspect='equal'
)

# Create a GridSpec with 2 rows and 3 columns
grid = pn.GridSpec(ncols=3, nrows=2, sizing_mode='fixed', width=900, height=1200)

# Create a Markdown title pane (a version of this title is reused in the final layout below)
title = pn.pane.Markdown(
    f"<h2 style='text-align: center;'>Areal Interpolation for {max_rain_time.strftime('%Y-%m-%d')}</h2>",
    width=900,
    height=20
)

# Add raw data plots in the first row
grid[0, 0] = image_T2MAX_raw
grid[0, 1] = image_T2MIN_raw
grid[0, 2] = image_RAIN_raw

# Add processed data plots in the second row
grid[1, 0] = map_T2MAX
grid[1, 1] = map_T2MIN
grid[1, 2] = map_RAINNCVMEAN


# Display the grid
# final_layout
grid


# Combine the three maps into a single layout arranged horizontally
maped_layout = pn.Row(
    map_T2MAX,
    pn.Spacer(sizing_mode="stretch_width"),
    map_T2MIN,
    pn.Spacer(sizing_mode="stretch_width"),
    map_RAINNCVMEAN,
    width=800
)

# Combine raw data plots horizontally
raw_layout = pn.Row(
    image_T2MAX_raw,
    pn.Spacer(sizing_mode="stretch_width"),
    image_T2MIN_raw,
    pn.Spacer(sizing_mode="stretch_width"),
    image_RAIN_raw,
    width=800
)

# Create a main title using HTML for center alignment within Markdown
title = pn.pane.Markdown(
    f"<h3 style='text-align: center;'>Areal Interpolation for {max_rain_time.strftime('%Y-%m-%d')}</h3>",
    # width=1800  # Adjust width as needed
)

# Combine both rows vertically
combined_layout = pn.Column(
    raw_layout,
    maped_layout
)
# Combine the title and the layout vertically
final_layout = pn.Column(
    title,
    combined_layout,
    width=850
)

final_layout

A clear gradient of increasing temperature from north to south is visible in the interpolated results.

Parallel and Dask Methods#

The domain of this workflow is small enough that using either the parallel or dask engine is not necessary; however, there is a speedup that we illustrate. The parallel and dask engines used in the AggGen object operate in a similar manner, using multiprocessing and Dask bags respectively. Using the jobs parameter, the user can specify the number of processes to run. The target data is chunked by the number of processes, and each process receives a chunked GeoDataFrame along with a copy of the subsetted source data. This creates an overhead that can determine how efficiently the parallel processing runs.

The tradeoff in using parallel processing lies in the balance between the number of processors and the overhead of copying data. While increasing the number of processors can significantly reduce computation time by dividing the workload, it also increases the amount of memory used for duplicate datasets and the coordination time between processes. There is a 'sweet spot' where the number of processors maximizes performance, but beyond this point additional processors may slow down the operation due to the overhead of managing more processes. The optimal number of processors depends on the size of the data, available memory, and system architecture, and can typically be found through experimentation.

Importantly, most of the processing time here is dominated by downloading the data, so the speedup is relatively small. For larger domains the processing will be a larger percentage of the total time, and the speedup should be more pronounced. We'll explore that in the CONUS-scale processing of conus404 on Hovenweep.
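
A minimal sketch of that experimentation, reusing the user_data object, weights file, and AggGen parameters from this notebook, might look like the following (timings will vary by machine, and this cell is not run here):

import time

# Time the parallel engine with a few different job counts to look for the
# "sweet spot" described above. agg_writer="none" skips writing output files.
for n_jobs in (2, 4, 8):
    t0 = time.perf_counter()
    AggGen(
        user_data=user_data,
        stat_method="masked_mean",
        agg_engine="parallel",
        agg_writer="none",
        weights="wghts_drb_ser_c404daily.csv",
        out_path=".",
        file_prefix=f"drb_jobs_{n_jobs}",
        jobs=n_jobs,
    ).calculate_agg()
    print(f"jobs={n_jobs}: {time.perf_counter() - t0:0.1f} s")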

%%time
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="parallel",
    agg_writer="netcdf",
    weights='wghts_drb_ser_c404daily.csv',
    out_path='.',
    file_prefix="testing_p2",
    precision=8,
    jobs=4
)
ngdf, ds_out = agg_gen.calculate_agg()
Processing: T2MIN
    Data prepped for aggregation in 0.0041 seconds
    Data aggregated in 72.7608 seconds
Processing: T2MAX
    Data prepped for aggregation in 0.0031 seconds
    Data aggregated in 74.7672 seconds
Processing: RAINNCVMEAN
    Data prepped for aggregation in 0.0030 seconds
    Data aggregated in 90.6005 seconds
Saving netcdf file to testing_p2.nc
CPU times: user 2min 18s, sys: 29.1 s, total: 2min 48s
Wall time: 3min 58s
%%time
agg_gen = AggGen(
    user_data=user_data,
    stat_method="masked_mean",
    agg_engine="dask",
    agg_writer="netcdf",
    weights='wghts_drb_ser_c404daily.csv',
    out_path='.',
    file_prefix="testing_p3",
    precision=8,
    jobs=4
)
ngdf, ds_out = agg_gen.calculate_agg()
Processing: T2MIN
    Data prepped for aggregation in 0.0041 seconds
    Data aggregated in 77.8485 seconds
Processing: T2MAX
    Data prepped for aggregation in 0.0032 seconds
    Data aggregated in 76.2073 seconds
Processing: RAINNCVMEAN
    Data prepped for aggregation in 0.0031 seconds
    Data aggregated in 139.9508 seconds
Saving netcdf file to testing_p3.nc
CPU times: user 2min 41s, sys: 32 s, total: 3min 13s
Wall time: 4min 54s
# Read the netcdf file written by the parallel run back in as a quick check
ds_agg = xr.open_dataset("testing_p2.nc")
ds_agg
ds_agg.T2MAX.values
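
As a last, purely illustrative check, we could plot the aggregated time series for a single HUC12 from the file we just read back (variable and coordinate names as produced above):

# Plot the aggregated T2MAX time series for the first HUC12 in the saved file.
# hvplot.xarray (imported at the top of the notebook) provides the .hvplot accessor.
first_huc = str(ds_agg["huc12"].values[0])
ds_agg["T2MAX"].sel(huc12=first_huc).hvplot(x="time", title=f"T2MAX for HUC12 {first_huc}")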