gdptools CONUS404 Spatial Aggregation over CONUS-extent HUC12s#
This tutorial demonstrates the use of gdptools, a Python package for area-weighted interpolation of source gridded datasets, such as CONUS404, to target polygonal geospatial fabrics. The source can be any gridded dataset that can be opened with XArray. However, it is important to note that gdptools generally operates on XArray Datasets or DataArrays with dimensions of (Y, X, Time). As such, climate datasets that carry an ensemble dimension will require subsetting by ensemble member to obtain a dataset with the proper dimensions. The target dataset can be any polygonal dataset that can be read by GeoPandas. gdptools can also interpolate gridded data to lines, but our focus here is interpolating to polygons.
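For example, if a source dataset carried an ensemble dimension, one would first select a single member so the data reduce to (Y, X, Time) before handing it to gdptools. The snippet below is a minimal sketch assuming a hypothetical dataset with a dimension named member; the file and dimension names are illustrative only.
# Hypothetical example: collapse an ensemble dimension before using gdptools.
# "ensemble_data.nc" and the "member" dimension are illustrative placeholders.
import xarray as xr

ens_ds = xr.open_dataset("ensemble_data.nc")
single_member = ens_ds.isel(member=0)  # dimensions are now (time, y, x) only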
In this workflow, CONUS404 is aggregated to the NHDPlusV2 snapshot of the Watershed Boundary Dataset HUC12 boundaries, a CONUS-scale spatial fabric with roughly 102,000 polygons. Access to this dataset is provided through a copy of the data release stored on the OSN pod ('huc12-geoparquet-osn').
We use the HyTest intake catalog to access the conus404-daily-diagnostic-onprem-hw version of CONUS404 on Hovenweep so that the workflow can also be run there, co-located with the data. However, a user could adapt this workflow to run in other computing environments by using the version of CONUS404 on the OSN pod instead.
Compared to the gdptools CONUS404 Spatial Aggregation over DRB-extent HUC12s tutorial, the main difference is that, to manage file size and memory overhead, we process CONUS404 by year, generating 43 annual NetCDF files of the interpolated data.
# Common python packages
import xarray as xr
import hvplot.xarray
import hvplot.pandas
import hvplot.dask
import intake
import warnings
import intake_xarray
import intake_parquet
import intake_geopandas
import datetime
import holoviews as hv
import numpy as np
import pandas as pd
import geopandas as gpd
# HyRiver packages
from pynhd import NLDI, WaterData
import pygeohydro as gh
# GDPTools packages
from gdptools import AggGen, UserCatData, WeightGen
import os
os.environ["HYRIVER_CACHE_DISABLE"] = "true"
hv.extension("bokeh")
warnings.filterwarnings('ignore')
Here we set up a variable that sets our computing context: working on the HPC or working locally on your desktop. This just modifies the access point of the CONUS404 data, using the Hovenweep access for HPC and the OSN pod access for the Desktop.
t_sys = "HPC" # "HPC" or "Desktop"
Access data with the HyTest intake catalog#
- Use the huc12-geoparquet-osn to read the NHDPlusV2 snapshot of the Watershed Boundary Dataset HUC12 boundaries
- Use the conus404-daily-diagnostic-onprem-hw to read CONUS404
# open the hytest data intake catalog
# hytest_cat = intake.open_catalog("../dataset_catalog/hytest_intake_catalog.yml")
hytest_cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")
list(hytest_cat)
We need a column to use as our identifier. Printing huc12.columns below to view all the possible columns, we choose the HUC12 column as our identifier.
# open the huc12-geoparquet-osn
huc12_access = hytest_cat['huc12-geoparquet-osn']
huc12 = huc12_access.read()
print(huc12.columns)
huc12
Load the conus404 dataset using the HyTest catalog#
In this case we are running this notebook on Hovenweep.
# open the hytest data intake catalog
hytest_cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/hytest_intake_catalog.yml")
list(hytest_cat)
# open the conus404 sub-catalog
cat = hytest_cat['conus404-catalog']
list(cat)
There are a couple of options for accessing conus404:

HPC Setting (t_sys = "HPC"):
- Assumption: The notebook is run on the USGS HPC Hovenweep.
- Access Method: Utilizes the on-premises version of the data.
- Benefits:
  - Workflow Association: The workflow is directly linked to the data.
  - Speed: Eliminates the need to download data, significantly reducing access and processing time.

Desktop Setting (t_sys = "Desktop"):
- Use Case: Suitable for workflows that do not require HPC resources or for developing workflows locally before deploying them to the HPC.
- Access Method: Connects to the conus404 data via the OSN pod.
- Benefits:
  - Flexibility: Allows for local development and testing.
  - Performance: Provides a fast connection to the data.
## Select the dataset you want to read into your notebook and preview its metadata
if t_sys == "HPC":
    dataset = 'conus404-daily-diagnostic-onprem-hw'
elif t_sys == "Desktop":
    dataset = 'conus404-daily-diagnostic-osn'
else:
    print("Please set the variable t_sys above to one of 'HPC' or 'Desktop'")
cat[dataset]
# read in the dataset and use metpy to parse the crs information on the dataset
print(f"Reading {dataset} metadata...", end='')
ds = cat[dataset].to_dask().metpy.parse_cf()
ds
GDPTools Background#
In this section, we utilize three data classes from the gdptools package: UserCatData, WeightGen, and AggGen.

- UserCatData: Serves as a data container for both the source and target datasets, along with their associated metadata. The instantiated object user_data is employed by both the WeightGen and AggGen classes.
- WeightGen: Responsible for calculating the intersected areas between the source and target datasets. It generates normalized areal weights, which are subsequently used by the AggGen class to compute interpolated values between the datasets.
- AggGen: Facilitates the interpolation of the source data to the target polygons using the areal weights calculated by WeightGen. This process is conducted over the time period specified in the UserCatData object.
Instantiation of the UserCatData class#
# Coordinate Reference System (CRS) of the conus404 dataset
source_crs = ds.crs.crs_wkt
# Coordinate names of the conus404 dataset
x_coord = "x"
y_coord = "y"
t_coord = "time"
# Time period of interest for areal interpolation of conus404 to the CONUS HUC12s
# using the AggGen class below. Note: The dates follow the same format as the
# time values in the conus404 dataset.
sdate = "1979-10-01T00:00:00.000000000"
edate = "2022-10-01T00:00:00.000000000"
# Variables from the conus404 dataset used for areal interpolation
variables = ["T2MIN", "T2MAX", "RAINNCVMEAN"]
# CRS of the CONUS HUC12 polygons
target_crs = 5070
# Column name for the unique identifier associated with target polygons.
# This ID is used in both the generated weights file and the areal interpolated output.
target_poly_idx = "HUC12"
# Common equal-area CRS for reprojecting both source and target data.
# This CRS is used for calculating areal weights in the WeightGen class.
weight_gen_crs = 5070
# Instantiate the UserCatData class, which serves as a container for both
# source and target datasets, along with associated metadata. The UserCatData
# object provides methods used by the WeightGen and AggGen classes to subset
# and reproject the data.
user_data = UserCatData(
    ds=ds,  # conus404 read from the intake catalog
    proj_ds=source_crs,
    x_coord=x_coord,
    y_coord=y_coord,
    t_coord=t_coord,
    var=variables,
    f_feature=huc12,  # huc12 GeoDataFrame read above from the intake catalog
    proj_feature=target_crs,
    id_feature=target_poly_idx,
    period=[sdate, edate],
)
Weight Generation with WeightGen#
In this section, we utilize the WeightGen class from the gdptools package to calculate the normalized areal weights necessary for interpolating the gridded data (conus404) to the polygonal boundaries (CONUS HUC12s). The areal weights represent the proportion of each grid cell that overlaps with each polygon, facilitating accurate areal interpolation of the data. These weights are calculated using the calculate_weights() method.
Weight Calculation Process:

1. Subset Source Data: The source data is subset based on the bounds of the target data, with an additional small buffer to ensure coverage.
2. Create Cell Boundary GeoDataFrame: A GeoDataFrame of the cell boundaries is created for each node in the subsetted source data, enabling spatial operations.
3. Validate Geometries: The target file is checked for invalid geometries, which can occur for various reasons such as topology errors. Invalid geometries are fixed using Shapely's make_valid() method to prevent failures during intersection calculations.
4. Calculate and Normalize Areas: For each polygon, gdptools calculates the area of each intersecting grid cell and normalizes it by the total area of the target polygon. This ensures that the weights for each polygon sum to 1, provided the polygon is entirely covered by the source data.
5. Validation: A quick check on the weights can be performed by grouping the resulting weights by the target_poly_idx and calculating the sum. For all polygons completely covered by the source data, the weights will sum to 1 (see the check after the weight calculation below).
Note: The method parameter in calculate_weights() can be set to one of "serial", "parallel", or "dask". Given the scale of the gridded conus404 data (4 km × 4 km) and the spatial footprint of the CONUS HUC12s, using "parallel" or "dask" is the most efficient method in this case.
Parallel and Dask Methods#
The domain in this workflow is large, as defined by the number of polygons, the polygon complexity, and the relatively small scale of the conus404 cell geometries. We can take advantage of the parallel methods to improve performance in both the weight calculation and the interpolation. The parallel and dask engines used in the WeightGen class operate in a similar manner, utilizing Python's multiprocessing module and dask.bag, respectively.

Using the jobs parameter, users can specify the number of processes to run. The target data is divided into chunks based on the number of processes, and each processor receives a chunked GeoDataFrame along with a copy of the subsetted source data. This setup introduces overhead that can affect how efficiently the parallel processing runs.
Trade-offs in Parallel Processing:

The use of parallel processing involves balancing the number of processors with the overhead of copying data:

- Benefits: Increasing the number of processors can reduce computation time by dividing the workload.
- Costs: More processors increase memory usage due to duplicate datasets and add coordination overhead between processes.
- Optimal Performance: There is a 'sweet spot' where the number of processors maximizes performance. Beyond this point, additional processors may slow down the operation due to overhead.

The optimal number of processors depends on factors such as data size, available memory, and system architecture. It often requires experimentation to determine the most efficient configuration.
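As a rough starting point (an assumption, not a gdptools recommendation), one could derive a candidate value for the jobs parameter from the number of available cores and then tune it after timing a test run:
import os

# Start with roughly half the available cores (at least 1) and adjust after
# timing a single run and watching memory usage.
jobs_guess = max(1, (os.cpu_count() or 2) // 2)
print(f"Starting guess for jobs: {jobs_guess}")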
%%time
wght_gen = WeightGen(
    user_data=user_data,
    method="parallel",
    output_file="wghts_huc12_c404daily_p.csv",
    weight_gen_crs=weight_gen_crs,
    jobs=4
)
wdf = wght_gen.calculate_weights()
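As the quick validation check described above, we can group the generated weights by the target polygon identifier and confirm the sums are close to 1 for polygons fully covered by CONUS404. The column name wght is assumed here; inspect wdf.columns if your gdptools version names the columns differently.
# Sum the normalized weights for each HUC12; fully covered polygons should sum to ~1.
wght_sums = wdf.groupby(target_poly_idx)["wght"].sum()
print(wght_sums.describe())
print("Polygons with weight sums below 0.999:", int((wght_sums < 0.999).sum()))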
Compute the areal weighted spatial interpolation#
Because the result will be rather large, we process by year to manage the file size and memory requirements. Additionally, the conus404 data starts and ends on water-year dates, so we chose to process by water year in this case. The code below generates a list of start dates, end dates, and years that we iterate over to process the data by year.
t_start_series = pd.date_range(pd.to_datetime("1979-10-01"), periods=43, freq="YS-OCT")
t_end_series = pd.date_range(pd.to_datetime("1980-09-30"), periods=43, freq="Y-SEP")
f_time_series = pd.date_range(pd.to_datetime("1980"), periods=43, freq="Y")
time_start = [t.strftime("%Y-%m-%dT%H:%M:%S.%f") for t in t_start_series]
time_end = [t.strftime("%Y-%m-%dT%H:%M:%S.%f") for t in t_end_series]
file_time = [t.strftime("%Y") for t in f_time_series]
time_start[:4], time_end[:4]
Areal Interpolation with the AggGen Class#
In this section, we demonstrate the use of the AggGen class and its calculate_agg() method from the gdptools package to perform areal interpolation. The agg_engine parameter can be set to one of "serial", "parallel", or "dask"; here we use "parallel". The gdptools documentation provides detailed descriptions of the available parameter options.
When using AggGen and the calculate_agg() method, it is important to consider the overlap between the source and target data when selecting the stat_method parameter value. All statistical methods have a masked variant in addition to the standard method; for example, "mean" and "masked_mean". In cases where the source data has only partial overlap with a target polygon, the "mean" method will return a missing value for the polygon, whereas the "masked_mean" method will calculate the statistic based on the available overlapping source cells. These considerations help users determine whether a masked statistic is desirable or whether a missing value would be preferred, allowing for post-processing of missing values (e.g., using nearest-neighbor or other approaches to handle the lack of overlap). In the case here, conus404 completely covers the footprint of the CONUS HUC12s, so the "mean" method is sufficient.
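If partial coverage were a concern (for example, polygons along the edge of the grid), the masked variant could be used instead. The configuration below is only a sketch of that alternative and is not run in this workflow:
# Sketch of an alternative configuration using the masked statistic
# (not executed here; this workflow uses stat_method="mean" below).
agg_gen_masked = AggGen(
    user_data=user_data,
    stat_method="masked_mean",  # computes the mean from whatever cells overlap
    agg_engine="parallel",
    agg_writer="netcdf",
    weights='wghts_huc12_c404daily_p.csv',
    out_path='.',
    file_prefix="huc12_c404_daily_diagnostic_masked",
    jobs=4
)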
Because we are processing by year, we have to create a new UserCatData object for each year processed.
%%time
for index, _ts in enumerate(time_start):
    sdate = time_start[index]
    edate = time_end[index]
    print(sdate, edate)
    user_data = UserCatData(
        ds=ds,  # conus404 read from the intake catalog
        proj_ds=source_crs,
        x_coord=x_coord,
        y_coord=y_coord,
        t_coord=t_coord,
        var=variables,
        f_feature=huc12,  # huc12 GeoDataFrame read above from the intake catalog
        proj_feature=target_crs,
        id_feature=target_poly_idx,
        period=[sdate, edate],
    )
    agg_gen = AggGen(
        user_data=user_data,
        stat_method="mean",
        agg_engine="parallel",
        agg_writer="netcdf",
        weights='wghts_huc12_c404daily_p.csv',
        out_path='.',
        file_prefix=f"{file_time[index]}_huc12_c404_daily_diagnostic",
        jobs=4
    )
    ngdf, ds_out = agg_gen.calculate_agg()
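After the loop completes, each water year is written to its own NetCDF file. As a quick check, one of the annual files can be opened and inspected. The filename below assumes the netcdf writer appends a .nc extension to the file prefix and that the output variables keep their source names; adjust if your output differs.
# Open the first water year of output and check for missing values.
# Filename and variable names are assumed from the file_prefix and source variables above.
check_ds = xr.open_dataset(f"{file_time[0]}_huc12_c404_daily_diagnostic.nc")
print(check_ds)
print("Fraction of missing T2MIN values:", float(check_ds["T2MIN"].isnull().mean()))
check_ds.close()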