Preparing a Data Catalog

Steps in brief:

  1. Have your (local) dataset ready in one of the supported raster, vector or geospatial time-series formats.

  2. Create your own yaml file with a reference to your prepared dataset, following the HydroMT data conventions (see the examples below).

A detailed description of the yaml file is given below. For more information, see from_yml() and the examples per data type.

Data catalog yaml file

Each data source is added to a data catalog yaml file with a user-defined name.

A blueprint for a data catalog with a single dataset, here called era5, is shown below. The uri, data_type and driver options are required, and the metadata option with the shown keys is highly recommended. The rename, nodata, unit_add and unit_mult options are set per variable (or per attribute table column in case of a GeoDataFrame).

  meta:
    roots:
      - /linux/path/to/data_root/
      - C:\Windows\path\to\data_root
      - .
    version: version
    name: data_catalog_name

  era5:
    data_type: RasterDataset
    variants:
    - provider: netcdf
      uri: meteo/era5_daily/nc_merged/era5_{year}_daily.nc
      driver:
        name: raster_xarray
        options:
          chunks:
            latitude: 250
            longitude: 240
            time: 30
          combine: by_coords
          decode_times: true
          parallel: true
    - provider: zarr
      uri: meteo/era5_daily.zarr
      driver:
        name: raster_xarray
        options:
          chunks: auto
    metadata:
      category: meteo
      notes: Extracted from Copernicus Climate Data Store; resampled by Deltares to
        daily frequency
      paper_doi: 10.1002/qj.3803
      paper_ref: Hersbach et al. (2019)
      url: https://doi.org/10.24381/cds.bd0915c6
      version: ERA5 daily data on pressure levels
      license: https://cds.climate.copernicus.eu/cdsapp/#!/terms/licence-to-use-copernicus-products
      crs: 4326
      temporal_extent:
        start: '1950-01-02'
        end: '2023-11-30'
      spatial_extent:
        West: -0.125
        South: -90.125
        East: 359.875
        North: 90.125
    data_adapter:
      unit_add:
        temp: -273.15
        temp_dew: -273.15
        temp_max: -273.15
        temp_min: -273.15
      unit_mult:
        kin: 0.000277778
        kout: 0.000277778
        ssr: 0.000277778
        press_msl: 0.01
      rename:
        d2m: temp_dew
        msl: press_msl
        ssrd: kin
        t2m: temp
        tisr: kout
        tmax: temp_max
        tmin: temp_min
        tp: precip
        u10: wind10_u
        v10: wind10_v
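
To make the effect of the rename, unit_mult and unit_add options concrete, the sketch below applies them to a plain dictionary of values. It is an illustration only, not HydroMT code: the helper name harmonize is hypothetical, and the order of operations (rename first, then value * unit_mult + unit_add) is an assumption. The one thing taken from the example above is that unit_add and unit_mult are keyed by the renamed variables.

```python
def harmonize(values, rename=None, unit_mult=None, unit_add=None):
    """Rename variables, then rescale them (hypothetical helper, not HydroMT API)."""
    rename = rename or {}
    unit_mult = unit_mult or {}
    unit_add = unit_add or {}
    out = {}
    for name, value in values.items():
        new_name = rename.get(name, name)
        # Assumed order: multiply first, then add the offset.
        out[new_name] = value * unit_mult.get(new_name, 1) + unit_add.get(new_name, 0)
    return out


# Two ERA5-style variables: temperature in K and pressure in Pa.
raw = {"t2m": 293.15, "msl": 101325.0}
clean = harmonize(
    raw,
    rename={"t2m": "temp", "msl": "press_msl"},
    unit_mult={"press_msl": 0.01},   # Pa -> hPa
    unit_add={"temp": -273.15},      # K -> degrees Celsius
)
print(clean)
```

With these settings the temperature comes out at roughly 20 degrees Celsius and the pressure at roughly 1013.25 hPa, both under the new variable names.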

The yaml file has an optional global metadata section, given under the meta key, with the following options:

  • roots (optional): root folders for all the data sources in the yaml file. If not provided, the folder in which the yaml file is located is used as root. The root is combined with each data source's uri argument to avoid repetition. The roots are checked in the order in which they are listed, and the first one that exists is used as the actual root. Use multiple roots only for cross-platform and cross-machine compatibility, as in the example above. Because only one of the roots is used in the end, all data should still be located in the same folder tree.

  • version (recommended): data catalog version

  • hydromt_version (recommended): range of hydromt versions that can read this catalog. The format should be according to PEP 440.

  • category (optional): used if all data sources in the catalog belong to the same category. Usual categories within HydroMT are geography, topography, hydrography, meteo, landuse, ocean, socio-economic and observed data, but users are free to define their own categories.
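
The way roots are resolved can be sketched in a few lines of Python. This is a minimal illustration of "the first root that exists wins", not HydroMT's implementation: the helper resolve_root is hypothetical, and falling back to the yaml file's folder when none of the listed roots exists is an assumption.

```python
import tempfile
from pathlib import Path


def resolve_root(roots, yaml_dir):
    """Return the first existing root, else the catalog folder (hypothetical helper)."""
    for root in roots:
        if Path(root).exists():
            return Path(root)
    # Assumed fallback: the folder that contains the yaml file.
    return Path(yaml_dir)


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "data_root"
    existing.mkdir()
    missing = Path(tmp) / "not_mounted"

    # The missing root is skipped; the second, existing root is chosen.
    chosen = resolve_root([missing, existing], yaml_dir=tmp)
    fallback = resolve_root([missing], yaml_dir=tmp)

    found_second = chosen == existing
    used_fallback = fallback == Path(tmp)

print(found_second, used_fallback)
```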

The following are required data source arguments:

  • data_type: type of input data. Either RasterDataset, GeoDataset, Dataset, GeoDataFrame or DataFrame.

  • driver: data_type specific Driver to read a dataset. If the default settings of a driver are sufficient, then a string with the name of the driver is enough. Otherwise, a dictionary with the driver class properties can be used. Refer to the Driver documentation to see which options are available.

  • uri: URI pointing to where the data can be queried. Relative paths are combined with the global root option of the yaml file (if available) or the directory of the yaml file itself. To read multiple files in a single dataset (if supported by the driver) a string glob in the form of "path/to/my/files/*.nc" can be used. The filenames can be further specified with {variable}, {year} and {month} keys to limit which files are being read based on the get_data request in the form of "path/to/my/files/{variable}_{year}_{month}.nc". Note that month is by default not zero-padded (e.g. January 2012 is stored as "path/to/my/files/{variable}_2012_1.nc"). Users can optionally add a formatting string to define how the key should be read. For example, in a path written as "path/to/my/files/{variable}_{year}_{month:02d}.nc", the month always has two digits and is zero-padded for Jan-Sep (e.g. January 2012 is stored as "path/to/my/files/{variable}_2012_01.nc").
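
The {month} and {month:02d} keys behave like Python format fields, so the difference between the two can be shown with str.format. The snippet only illustrates the resulting file names; the variable name precip is an arbitrary choice, and the snippet says nothing about how HydroMT resolves the files internally.

```python
uri = "path/to/my/files/{variable}_{year}_{month}.nc"
uri_padded = "path/to/my/files/{variable}_{year}_{month:02d}.nc"

# Without a format specifier the month is written as-is.
unpadded = uri.format(variable="precip", year=2012, month=1)
# With ":02d" the month is zero-padded to two digits.
padded = uri_padded.format(variable="precip", year=2012, month=1)

print(unpadded)  # path/to/my/files/precip_2012_1.nc
print(padded)    # path/to/my/files/precip_2012_01.nc
```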

A full list of optional data source arguments is given below:

  • version (recommended): data source version

  • provider (recommended): data source provider

  • metadata (recommended): additional information on the dataset. Many metadata options are available in SourceMetaData. Some metadata properties, such as crs, nodata, temporal_extent and spatial_extent, can help HydroMT read the data more efficiently. Good metadata includes a source_url, source_license, source_version, paper_ref, paper_doi, category, etc. These are added to the data attributes. Usual categories within HydroMT are geography, topography, hydrography, meteo, landuse, ocean, socio-economic and observed data, but users are free to define their own categories.

  • data_adapter: the data adapter harmonizes the data so that, within HydroMT, it follows strong conventions on, for example, variable naming and recognized dimension names. Several parameters are available for each DataAdapter.

  • placeholder (optional): this argument can be used to generate multiple sources with a single entry in the data catalog file. If different files follow a logical nomenclature, multiple data sources can be defined by iterating through all possible combinations of the placeholders. The placeholder names should be given in the source name and the path, and their values listed under the placeholder argument.

  • variants (optional): This argument can be used to generate multiple sources with the same name, but from different providers or versions. Any keys here are essentially used to extend/overwrite the base arguments.
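
The placeholder option amounts to taking every combination of the listed values and filling them into the source name and path. The sketch below shows that idea with itertools.product. The helper expand_placeholders, the source name my_data_{scenario}_{year} and the path are all hypothetical and chosen for illustration; this is not HydroMT's own code.

```python
from itertools import product


def expand_placeholders(name, uri, placeholders):
    """Yield (source_name, uri) for every placeholder combination (illustrative only)."""
    keys = list(placeholders)
    for combo in product(*(placeholders[k] for k in keys)):
        mapping = dict(zip(keys, combo))
        yield name.format(**mapping), uri.format(**mapping)


sources = dict(
    expand_placeholders(
        name="my_data_{scenario}_{year}",
        uri="scenarios/{scenario}/data_{year}.tif",
        placeholders={"scenario": ["low", "high"], "year": [2030, 2050]},
    )
)
for source_name, source_uri in sources.items():
    print(source_name, "->", source_uri)
```

Two placeholders with two values each give four sources from a single catalog entry.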

Data variants

Data variants are used to define multiple data sources with the same name, but from different providers or versions. Below, we show an example of a data catalog for a RasterDataset with multiple variants of the same data source (esa_worldcover), but this works identically for other data types. Here, metadata, data_type and driver are common arguments used for all variants. The variant arguments are used to extend and/or overwrite the common arguments, creating new sources.

esa_worldcover:
  metadata:
    crs: 4326
  data_type: RasterDataset
  driver:
    name: raster
    filesystem: local
  variants:
    - provider: local
      version: 2021
      uri: landuse/esa_worldcover_2021/esa-worldcover.vrt
    - provider: local
      version: 2020
      uri: landuse/esa_worldcover/esa-worldcover.vrt
    - provider: aws
      version: 2020
      uri: s3://esa-worldcover/v100/2020/ESA_WorldCover_10m_2020_v100_Map_AWS.vrt
      driver:
        name: raster
        filesystem: s3

To request a specific variant, the variant arguments can be used as keyword arguments to the DataCatalog.get_rasterdataset method, as shown in the code below. By default, the newest version from the last provider is returned when requesting a data source without a specific version or provider. Requesting a specific version from a HydroMT configuration file is also possible.

from hydromt import DataCatalog
dc = DataCatalog().from_yml("data_catalog.yml")
# get the default version. This will return the latest (2020) version from the last
# provider (aws)
ds = dc.get_rasterdataset("esa_worldcover")
# get a 2020 version. This will return the 2020 version from the last provider (aws)
ds = dc.get_rasterdataset("esa_worldcover", version=2020)
# get a 2021 version. This will return the 2021 version from the local provider, as
# this version is not available from aws.
ds = dc.get_rasterdataset("esa_worldcover", version=2021)
# get the 2020 version from the local provider
ds = dc.get_rasterdataset("esa_worldcover", version=2020, provider="local")
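
The selection rule described above (filter on the requested provider and version, then take the newest version from the last remaining provider) can be written out as a small stand-alone function. This is a sketch that reproduces the documented outcomes for the esa_worldcover example; the function select_variant is hypothetical and is not how HydroMT implements the lookup.

```python
variants = [
    {"provider": "local", "version": 2021},
    {"provider": "local", "version": 2020},
    {"provider": "aws", "version": 2020},
]


def select_variant(variants, provider=None, version=None):
    """Pick the newest version from the last matching provider (illustrative only)."""
    matches = [
        v
        for v in variants
        if (provider is None or v["provider"] == provider)
        and (version is None or v["version"] == version)
    ]
    if not matches:
        raise KeyError(f"no variant for provider={provider!r}, version={version!r}")
    # Catalog order decides which provider counts as the "last" one.
    last_provider = matches[-1]["provider"]
    return max(
        (v for v in matches if v["provider"] == last_provider),
        key=lambda v: v["version"],
    )


print(select_variant(variants))
print(select_variant(variants, version=2021))
print(select_variant(variants, version=2020, provider="local"))
```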