Authors: Sam Lamont, Matt Denno, Katie van Werkhoven – RTI Center for Water Resources
Title: Benchmarking open-source technologies for storing and querying time series data at scale
Abstract: Tools for Exploratory Evaluation in Hydrologic Research (TEEHR) is a Python tool set for loading, storing, processing, and visualizing hydrologic data in order to explore and evaluate datasets and assess model skill and performance. TEEHR enables model evaluation by calculating a suite of metrics (e.g., RMSE, KGE, NSE) that describe the performance of simulated (“secondary”) data relative to observed (“primary”) data at multiple locations and over variable time periods. Key principles guiding TEEHR’s development include performance and efficiency (users should get results quickly), interoperability (it should be easy to use and should complement existing tools), open access (tools, results, and data should be easy to access and share), and scalability (it should take advantage of cloud computing resources to enable analysis of larger-than-memory datasets). Satisfying all of these principles with a single approach can be challenging. However, many open-source technologies exist with varying approaches to compute and data storage. To help understand the trade-offs among these technologies, we present a set of benchmark tests that use a common dataset and a cloud-based compute environment. The base dataset for the tests consists of three years of hourly National Water Model (NWM) retrospective streamflow simulations (v2.0 and v2.1) and corresponding United States Geological Survey (USGS) gage observations at approximately 7500 locations. Several approaches based on open-source software packages (DuckDB, Dask, Spark) and data formats (DuckDB, Parquet, Zarr) are used to calculate a subset of evaluation metrics (including RMSE, KGE, NSE, and bias). We compare the computational efficiency, memory requirements, interoperability, and scalability of each approach and discuss the pros and cons of each.
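For readers unfamiliar with the metrics named above, the following is a minimal NumPy sketch of how they might be computed for one location's paired time series. It is not TEEHR's implementation or API: the function name `evaluation_metrics` and the sample values are hypothetical, bias is assumed here to mean the mean error (secondary minus primary), and KGE is assumed to follow the commonly used 2009 formulation based on correlation, variability ratio, and mean ratio.

```python
import numpy as np


def evaluation_metrics(primary, secondary):
    """Compare simulated ("secondary") values to observed ("primary") values.

    Hypothetical helper for illustration only; not part of TEEHR.
    """
    p = np.asarray(primary, dtype=float)
    s = np.asarray(secondary, dtype=float)
    error = s - p

    # Bias: assumed here to be the mean error (other definitions exist).
    bias = error.mean()

    # Root mean square error.
    rmse = np.sqrt(np.mean(error ** 2))

    # Nash-Sutcliffe efficiency: 1 minus the ratio of error variance
    # to the variance of the observations.
    nse = 1.0 - np.sum(error ** 2) / np.sum((p - p.mean()) ** 2)

    # Kling-Gupta efficiency (assumed 2009 form): combines linear
    # correlation, variability ratio, and mean ratio.
    r = np.corrcoef(p, s)[0, 1]
    alpha = s.std() / p.std()
    beta = s.mean() / p.mean()
    kge = 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

    return {
        "bias": float(bias),
        "rmse": float(rmse),
        "nse": float(nse),
        "kge": float(kge),
    }


# Made-up values: a simulation that overestimates every observation by 1.
observed = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
simulated = observed + 1.0
metrics = evaluation_metrics(observed, simulated)
print(metrics)
```

In a benchmark such as the one described, the same arithmetic would be applied per location over millions of rows, which is where the choice of engine and storage format is expected to matter.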