Running a Hydrometeorological Ensemble Verification Pipeline with veriflow#
Forecast verification helps us evaluate the quality and behaviour of forecasts. This can include comparing forecasts with observations, assessing skill across lead times, examining forecast uncertainty, or checking whether forecast distributions are realistic under different hydrological conditions.
veriflow is a Python-based framework for running forecast verification workflows in a consistent and reproducible way. It uses a configuration-driven approach to load data, define verification experiments, compute verification metrics and statistical criteria, and return results. It can be used to verify both deterministic and probabilistic ensemble forecasts.
1. What this notebook does#
This notebook focuses on the workflow, with an initial visualization of verification results:
Define a verification configuration
Run the veriflow pipeline
Inspect the returned results
Explore the available data and diagnostics
Visualize the results with reusable plotting functions
The aim of this notebook is to demonstrate usage of veriflow, not to provide a complete overview of forecast verification theory.
A second notebook focuses on the interpretation of the data, verification results, and forecast quality.
2. Dataset and case study#
This example uses hydrometeorological data for the River Rhine basin. We focus on discharge and compare three discharge forecast datasets. Each dataset consists of a 5-member ensemble reforecast. The discharge forecasts were generated by forcing the HBV hydrological model with raw or post-processed ECMWF ensemble reforecasts of precipitation and temperature.
The forecast datasets are:
``raw_raw``: baseline streamflow forecast driven by raw meteorological ensemble forecasts
``lin_log``: streamflow forecast driven by meteorological ensembles post-processed using a linear-log transformation
``qqt_qqt``: streamflow forecast driven by meteorological ensembles post-processed using a quantile-to-quantile transformation
In this notebook, all forecast datasets are evaluated against the same observation dataset. We focus on a key location, Lobith.
The data are read directly from a remote Zarr store, so no local download is required. For more information, see the References section.
3. Imports and environment setup#
We start by importing the required Python libraries and veriflow components.
The imports include:
Required Python tools
veriflow components, including configuration objects, the Zarr data source for reading the example data remotely, and the pipeline runner
Local plotting helpers from verification_plots.py for visualization
[ ]:
# Add automatic reloading of modules in case of changes
%load_ext autoreload
%autoreload 2
from datetime import datetime, timezone
import logging
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from pathlib import Path
from veriflow.configuration import Config, GeneralInfoConfig
from veriflow.configuration.utils import (
    LeadTimes,
    Range,
    TimeUnits,
    VerificationPair,
    VerificationPeriod,
)
from veriflow.constants import DataType
from veriflow.datasources import ZarrConfig
from veriflow.pipeline import run_pipeline
from veriflow.scores import CrpsForEnsembleConfig, RankHistogramConfig
import sys
sys.path.append(str(Path("..").resolve()))
from verification_plots import (
    crps_plot,
    find_crps_variable,
    rank_histogram_3d_plot,
    rank_histogram_plot,
    scatter_plot,
)
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
4. Defining the verification configuration#
veriflow is a configuration-driven workflow. The configuration tells the pipeline:
which data sources to read
which forecasts and observations to compare
which lead times to evaluate
which verification metrics and statistical criteria to compute
We define these parts below and then run the full workflow with a single pipeline call.
4.1. General verification settings#
The general settings define the structure of the verification experiment:
the verification period
the forecast lead times
the verification pairs
Each verification pair represents a single comparison between a forecast dataset and the observations.
[3]:
general = GeneralInfoConfig(
    # Evaluate forecasts issued during this period.
    verification_period=VerificationPeriod(
        start=datetime(1988, 1, 1, tzinfo=timezone.utc),
        end=datetime(2008, 12, 31, tzinfo=timezone.utc),
        dimension="forecast_reference_time",
    ),
    # Define which forecast datasets are compared against observations.
    verification_pairs=[
        VerificationPair(id="raw_raw", obs="obs", sim="raw_raw", variable="Q"),
        VerificationPair(id="lin_log", obs="obs", sim="lin_log", variable="Q"),
        VerificationPair(id="qqt_qqt", obs="obs", sim="qqt_qqt", variable="Q"),
    ],
    # Evaluate lead times from 0 to 10 days.
    lead_times=LeadTimes(
        unit=TimeUnits.day,
        values=Range(start=0, end=10, step=1),
    ),
)
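As a rough illustration of what these settings describe, the sketch below expands a day-based lead-time range into `timedelta` values and derives the valid time of each forecast step from a single forecast reference time. It uses only the standard library and does not call veriflow. The helper name `expand_lead_times` is hypothetical, and treating `end` as inclusive is an assumption rather than confirmed veriflow behaviour.

```python
from datetime import datetime, timedelta, timezone


def expand_lead_times(start: int, end: int, step: int) -> list[timedelta]:
    """Expand a day-based range into lead times (end assumed inclusive)."""
    return [timedelta(days=d) for d in range(start, end + 1, step)]


lead_time_grid = expand_lead_times(start=0, end=10, step=1)

# One forecast issued at the start of the verification period.
reference_time = datetime(1988, 1, 1, tzinfo=timezone.utc)

# Valid time = forecast reference time + lead time.
valid_times = [reference_time + lead for lead in lead_time_grid]

print(len(lead_time_grid))  # number of lead times in the range
print(valid_times[-1])      # valid time of the longest lead time
```

The verification period above filters on `forecast_reference_time`, so a forecast issued near the end of the period can have valid times that fall after the period end.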
4.2. Data sources#
The example data is stored in remote Zarr datasets. Zarr is well suited for cloud-based, multidimensional data access.
Here we define one observation source and three forecast sources. The source names must match the names used in the verification pairs above.
[4]:
datasources = [
    ZarrConfig(
        general=general,
        import_adapter="zarr",
        source="obs",
        data_type=DataType.observed_historical,
        consolidated=True,
        path="https://s3.deltares.nl/deltares-verification-assets/rhine_dataset_verkade_2013/obs_Q.zarr",
    ),
    ZarrConfig(
        general=general,
        import_adapter="zarr",
        source="raw_raw",
        data_type=DataType.simulated_forecast_ensemble,
        consolidated=True,
        path="https://s3.deltares.nl/deltares-verification-assets/rhine_dataset_verkade_2013/case-raw-raw_Q.zarr",
    ),
    ZarrConfig(
        general=general,
        import_adapter="zarr",
        source="lin_log",
        data_type=DataType.simulated_forecast_ensemble,
        consolidated=True,
        path="https://s3.deltares.nl/deltares-verification-assets/rhine_dataset_verkade_2013/case-lin-log_Q.zarr",
    ),
    ZarrConfig(
        general=general,
        import_adapter="zarr",
        source="qqt_qqt",
        data_type=DataType.simulated_forecast_ensemble,
        consolidated=True,
        path="https://s3.deltares.nl/deltares-verification-assets/rhine_dataset_verkade_2013/case-qqt-qqt_Q.zarr",
    ),
]
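Because the source names must match the names used in the verification pairs, a quick consistency check before running the pipeline can catch typos early. The sketch below is a minimal illustration in plain Python: the tuples and the set are hypothetical stand-ins for the configuration objects, and `find_missing_sources` is not part of veriflow.

```python
# Hypothetical plain-Python stand-ins for the configuration above:
# (pair id, observation source, forecast source)
pairs = [
    ("raw_raw", "obs", "raw_raw"),
    ("lin_log", "obs", "lin_log"),
    ("qqt_qqt", "obs", "qqt_qqt"),
]
source_names = {"obs", "raw_raw", "lin_log", "qqt_qqt"}


def find_missing_sources(pairs, source_names):
    """Return the source names referenced by a pair but not defined."""
    referenced = {name for _, obs, sim in pairs for name in (obs, sim)}
    return sorted(referenced - source_names)


missing = find_missing_sources(pairs, source_names)
if missing:
    raise ValueError(f"No data source defined for: {missing}")
print("All verification pairs refer to a defined source.")
```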
4.3. Verification metrics and statistical criteria#
In this notebook, we compute two commonly used criteria for ensemble forecasts:
Continuous Ranked Probability Score (CRPS) assesses how well an ensemble forecast matches the observed outcome.
A CRPS of 0 indicates a perfect forecast.
Lower values indicate better probabilistic forecast quality: the ensemble is centred on the observation and is no wider than necessary.
CRPS increases when the forecast distribution is biased, too wide, or too narrow relative to the observation.
Rank Histogram assesses statistical reliability by showing where observations fall within the ensemble distribution.
A roughly uniform histogram indicates a well-calibrated ensemble.
Systematic patterns can reveal biases or issues with ensemble spread.
The plotting functions later in this notebook can be used to explore both the resulting CRPS and the rank histograms.
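To make the CRPS properties listed above concrete, here is a minimal pure-Python sketch of the standard empirical CRPS estimator for one ensemble forecast and one observation: the mean absolute error of the members minus half the mean absolute difference between member pairs. Whether veriflow's `crps_for_ensemble` uses exactly this variant (rather than, for example, a "fair" variant) is not confirmed here. The function name and the toy values are illustrative only.

```python
def crps_ensemble(members: list[float], observation: float) -> float:
    """Empirical CRPS of one ensemble forecast against one observation."""
    m = len(members)
    # Mean absolute error of the members against the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Mean absolute difference over all member pairs (ensemble spread).
    spread = sum(abs(xi - xj) for xi in members for xj in members) / (m * m)
    return accuracy - 0.5 * spread


observed = 10.0
perfect = crps_ensemble([10.0] * 5, observed)                    # all members hit
sharp = crps_ensemble([9.0, 9.5, 10.0, 10.5, 11.0], observed)   # centred, narrow
wide = crps_ensemble([6.0, 8.0, 10.0, 12.0, 14.0], observed)    # centred, wide
biased = crps_ensemble([12.0] * 5, observed)                     # narrow, off target

print(round(perfect, 3), round(sharp, 3), round(wide, 3), round(biased, 3))
# 0.0 0.2 0.8 2.0
```

In this toy case the score grows as the ensemble becomes wider than necessary, and grows further when a confident ensemble misses the observation.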
[5]:
scores = [
    # CRPS
    CrpsForEnsembleConfig(
        score_adapter="crps_for_ensemble",
        general=general,
        reduce_dims=[],
    ),
    # Rank histogram
    RankHistogramConfig(
        score_adapter="rank_histogram",
        general=general,
        reduce_dims=["forecast_reference_time"],
    ),
]
config = Config(
    fileversion="0.1.0",
    general=general,
    datasources=datasources,
    scores=scores,
)
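The rank histogram configured above collapses the `forecast_reference_time` dimension into counts per rank. The sketch below shows the underlying idea in plain Python: for each forecast, find the position of the observation among the sorted members, then count how often each position occurs. It is an illustration, not veriflow's implementation. The function names and toy data are hypothetical, and ties between members and observation are simply ignored.

```python
from collections import Counter


def observation_rank(members: list[float], observation: float) -> int:
    """Rank of the observation among the members (1 = below all members)."""
    return 1 + sum(1 for x in members if x < observation)


def rank_histogram(forecasts, observations, n_members: int) -> list[int]:
    """Count observation ranks; an m-member ensemble has m + 1 ranks."""
    counts = Counter(
        observation_rank(members, obs)
        for members, obs in zip(forecasts, observations)
    )
    return [counts.get(rank, 0) for rank in range(1, n_members + 2)]


# Six toy forecasts that all use the same 5-member ensemble.
toy_forecasts = [[1.0, 2.0, 3.0, 4.0, 5.0]] * 6
toy_observations = [0.5, 2.5, 2.6, 5.5, 6.0, 7.0]

histogram = rank_histogram(toy_forecasts, toy_observations, n_members=5)
print(histogram)  # [1, 0, 2, 0, 0, 3]
```

With five members there are six possible ranks. The pile-up in the last bin of this toy example is the kind of systematic pattern that points to an ensemble that tends to sit below the observations.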
5. Running the verification pipeline#
With the configuration in place, we can run the veriflow pipeline, which combines the following steps into a workflow:
Load the input datasets
Pair forecasts and observations
Compute the requested verification metrics and statistical criteria
Return everything in an ``OutputDataset``
[6]:
import logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.StreamHandler()],
)
output_dataset = run_pipeline(config)
2026-06-17 20:40:24,863 INFO veriflow.pipeline: Successfully initialized the configuration.
verification_period_start = 1988-01-01 00:00:00
verification_period_end = 2008-12-31 00:00:00
2026-06-17 20:40:24,865 INFO veriflow.pipeline: Start getting data from Zarr.
2026-06-17 20:40:26,498 INFO veriflow.pipeline: Successfully got data from Zarr.
2026-06-17 20:40:26,499 INFO veriflow.pipeline: Start getting data from Zarr.
2026-06-17 20:40:27,018 INFO veriflow.pipeline: Successfully got data from Zarr.
2026-06-17 20:40:27,019 INFO veriflow.pipeline: Start getting data from Zarr.
2026-06-17 20:40:27,483 INFO veriflow.pipeline: Successfully got data from Zarr.
2026-06-17 20:40:27,485 INFO veriflow.pipeline: Start getting data from Zarr.
2026-06-17 20:40:27,956 INFO veriflow.pipeline: Successfully got data from Zarr.
2026-06-17 20:40:27,960 INFO veriflow.pipeline: Successfully loaded all data from sources.
2026-06-17 20:40:29,248 INFO veriflow.pipeline: Successfully computed CrpsForEnsemble for verification pair raw_raw.
2026-06-17 20:40:30,363 INFO veriflow.pipeline: Successfully computed CrpsForEnsemble for verification pair lin_log.
2026-06-17 20:40:31,397 INFO veriflow.pipeline: Successfully computed CrpsForEnsemble for verification pair qqt_qqt.
2026-06-17 20:40:34,965 INFO veriflow.pipeline: Successfully computed RankHistogram for verification pair raw_raw.
2026-06-17 20:40:37,746 INFO veriflow.pipeline: Successfully computed RankHistogram for verification pair lin_log.
2026-06-17 20:40:40,612 INFO veriflow.pipeline: Successfully computed RankHistogram for verification pair qqt_qqt.
2026-06-17 20:40:40,614 INFO veriflow.pipeline: Verification pipeline completed successfully.
6. Inspecting the OutputDataset#
The pipeline returns an ``OutputDataset``, which provides access to the verification results. Results are organized by verification pair, where each pair defines a comparison between an observation dataset and a forecast dataset (for example, raw_raw versus observations). Each verification pair contains the input data together with the computed verification metrics.
We start by inspecting the output structure and then extract the results for a single verification pair.
[7]:
type(output_dataset)
[7]:
veriflow.datamodel.main.OutputDataset
[8]:
output_dataset.verification_pairs
[8]:
[VerificationPair(id='raw_raw', obs='obs', sim='raw_raw', variable='Q'),
VerificationPair(id='lin_log', obs='obs', sim='lin_log', variable='Q'),
VerificationPair(id='qqt_qqt', obs='obs', sim='qqt_qqt', variable='Q')]
The output contains three verification pairs, corresponding to the three forecast datasets defined in the configuration.
We can now select one verification pair and inspect its dataset. The returned object is an xarray.Dataset containing:
the forecast data
the corresponding observations
the computed verification metrics and statistical criteria (in this case, CRPS and rank histogram)
metadata and coordinates (e.g., station, lead time, and forecast reference time)
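Because CRPS was configured with `reduce_dims=[]`, the dataset keeps one CRPS value per forecast reference time, lead time, and station. A common follow-up is to average over forecast reference time to obtain one mean CRPS per lead time. The sketch below shows that reduction on a small hypothetical table in plain Python. On the real `xarray.Dataset`, a call along the lines of `ds["crps_for_ensemble"].mean("forecast_reference_time")` should give the equivalent result.

```python
from statistics import fmean

# Hypothetical CRPS values for one station:
# rows = forecast reference times, columns = lead times (1, 2, 3 days).
crps_table = [
    [1.0, 2.0, 4.0],
    [3.0, 2.0, 6.0],
    [2.0, 5.0, 5.0],
]

# Average over the forecast reference time axis (the rows),
# leaving one mean CRPS value per lead time (the columns).
mean_crps_per_lead_time = [fmean(column) for column in zip(*crps_table)]
print(mean_crps_per_lead_time)  # [2.0, 3.0, 5.0]
```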
[9]:
output_dataset.get(output_dataset.verification_pairs[0])
[9]:
<xarray.Dataset> Size: 144MB
Dimensions: (station: 88, forecast_reference_time: 2924,
lead_time: 10, realization: 5, rank: 6)
Coordinates:
* station (station) <U11 4kB 'H-RN-0001' ... 'H-RN-WURZ'
lat (station) float64 704B 0.0 0.0 0.0 ... 0.0 0.0 0.0
lon (station) float64 704B 0.0 0.0 0.0 ... 0.0 0.0 0.0
* forecast_reference_time (forecast_reference_time) datetime64[ns] 23kB 19...
* lead_time (lead_time) timedelta64[ns] 80B 1 days ... 10 days
time (forecast_reference_time, lead_time) datetime64[ns] 234kB ...
* realization (realization) int64 40B 0 1 2 3 4
* rank (rank) float64 48B 1.0 2.0 3.0 4.0 5.0 6.0
Data variables:
obs (station, forecast_reference_time, lead_time) float64 21MB ...
raw_raw (forecast_reference_time, lead_time, station, realization) float64 103MB ...
crps_for_ensemble (forecast_reference_time, lead_time, station) float64 21MB ...
histogram_rank (station, lead_time, rank) float64 42kB 1.142e+0...
Attributes:
data_type: observed_historical
units: m^3/s
7. Visualization of verification results#
Now that we have an overview of the observed and ensemble forecast time series and the verification results through the ``OutputDataset``, we can visualize the results using a small collection of reusable plotting functions in verification_plots.py.
Each function returns a Plotly figure for a pre-sliced dataset. There is no interactive layer: selections that used to be interactive controls - verification pair, station, and lead time - are now made explicitly by slicing the xarray.Dataset before calling a plot function. This makes the plots easy to reuse in notebooks, scripts, or reports.
The available functions are:
scatter_plot - observation vs. simulation scatter for a single station and lead time
crps_plot - CRPS across lead times for one or more pre-selected stations
rank_histogram_plot - rank histogram for a single station and lead time
rank_histogram_3d_plot - 3D rank histogram showing all ranks across lead times for a single station
The interpretation of these plots and diagnostics is covered in the second notebook.
[10]:
first_ds = output_dataset.get(output_dataset.verification_pairs[0])
stations = first_ds.coords["station"].values
lead_times = first_ds.coords["lead_time"].values
print("Stations:", stations)
Stations: ['H-RN-0001' 'H-RN-0024' 'H-RN-0026' 'H-RN-0028' 'H-RN-0029' 'H-RN-0031'
'H-RN-0036' 'H-RN-0038' 'H-RN-0039' 'H-RN-0052' 'H-RN-0053' 'H-RN-0627'
'H-RN-0659' 'H-RN-0668' 'H-RN-0689' 'H-RN-0693' 'H-RN-0808' 'H-RN-0847'
'H-RN-0888' 'H-RN-0900' 'H-RN-0908' 'H-RN-0913' 'H-RN-0943' 'H-RN-0947'
'H-RN-0950' 'H-RN-0957' 'H-RN-0984' 'H-RN-1025' 'H-RN-1026' 'H-RN-1027'
'H-RN-2289' 'H-RN-BFG001' 'H-RN-BFG002' 'H-RN-BFG003' 'H-RN-BFG005'
'H-RN-BFG007' 'H-RN-BFG008' 'H-RN-BFG009' 'H-RN-BFG014' 'H-RN-BFG015'
'H-RN-BFG017' 'H-RN-BFG018' 'H-RN-BFG019' 'H-RN-BFG021' 'H-RN-BFG023'
'H-RN-BFG025' 'H-RN-BFG026' 'H-RN-BFG028' 'H-RN-BFG029' 'H-RN-BFG031'
'H-RN-BFG033' 'H-RN-BFG035' 'H-RN-BFG036' 'H-RN-BFG037' 'H-RN-BFG039'
'H-RN-BFG040' 'H-RN-BFG041' 'H-RN-BFG042' 'H-RN-BFG043' 'H-RN-BFG044'
'H-RN-BFG045' 'H-RN-BFG046' 'H-RN-BFG048' 'H-RN-BFG049' 'H-RN-BFG050'
'H-RN-BFG052' 'H-RN-BFG058' 'H-RN-BFG059' 'H-RN-BFG060' 'H-RN-BFG061'
'H-RN-BFG063' 'H-RN-BFG066' 'H-RN-BFG068' 'H-RN-BFG069' 'H-RN-BFG071'
'H-RN-BFG072' 'H-RN-BFG073' 'H-RN-BFG074' 'H-RN-BFG075' 'H-RN-BFG076'
'H-RN-BFG077' 'H-RN-BFG078' 'H-RN-BFG079' 'H-RN-BOLL' 'H-RN-DIET'
'H-RN-PLOC' 'H-RN-TRIE' 'H-RN-WURZ']
[11]:
# Scatter: pre-slice to a single station and a single lead time.
scatter_plot(output_dataset, station=stations[0], lead_time=lead_times[5]).show()
[12]:
crps_plot(output_dataset, stations=["H-RN-0001"]).show()
[14]:
# Rank histogram: pre-slice to a single station and a single lead time.
rank_histogram_plot(output_dataset, station="H-RN-0001", lead_time=lead_times[5]).show()
[ ]:
# 3D rank histogram: all ranks across lead times for a single station.
rank_histogram_3d_plot(output_dataset, station="H-RN-0001").show()
8. Next step#
You have now configured and run a complete veriflow verification workflow, inspected the returned OutputDataset, and visualized the results with the plotting functions in verification_plots.py.
The second notebook, Interpreting Hydrological Ensemble Verification Results, builds directly on this output. It focuses on how to inspect the aligned data, interpret scatter plots, CRPS curves, and rank histograms, and draw conclusions about forecast performance, uncertainty, and ensemble reliability.
Optional: YAML-based configuration#
This notebook uses Python objects to define the configuration because it is convenient for interactive work.
veriflow can also support YAML-based configuration files, which are useful for reproducible and shareable workflows. A YAML example can be added here once the preferred release configuration format is finalized.
References#
The dataset used in this notebook was originally used in:
Verkade et al. (2013), Post-processing ECMWF precipitation and temperature ensemble reforecasts for operational hydrologic forecasting at various spatial scales, Journal of Hydrology. https://doi.org/10.1016/j.jhydrol.2013.07.039