Preparing a Data Catalog

Steps in brief:

Have your (local) dataset ready in one of the supported raster, vector or geospatial time-series formats.
Create your own YAML file with a reference to your prepared dataset, following the HydroMT data conventions; see the examples below.

A detailed description of the YAML file is given below. For more information, see from_yml() and the examples per data type.

Data catalog YAML file

Each data source is added to a data catalog YAML file under a user-defined name.

A blueprint for a dataset called my_dataset is shown below. The path, data_type and driver options are required, and the meta option with the keys shown is highly recommended. The rename, nodata, unit_add and unit_mult options are set per variable (or per attribute table column in the case of a GeoDataFrame). kwargs contains any additional options that are passed to the driver.

meta:
  version: version
  root: /path/to/data_root/
my_dataset:
  path: /absolute_path/to/my_dataset.extension OR relative_path/to_my_dataset.extension
  data_type: RasterDataset/GeoDataset/GeoDataFrame
  driver: raster/raster_tindex/netcdf/zarr/vector/vector_table
  crs: EPSG/WKT
  kwargs:
    key: value
  rename:
    old_variable_name: new_variable_name
  nodata:
    new_variable_name: value
  unit_add:
    new_variable_name: value
  unit_mult:
    new_variable_name: value
  meta:
    source_url: zenodo.org/my_dataset
    source_license: CC-BY-3.0
    source_version: vX.X
    paper_ref: Author et al. (year)
    paper_doi: doi
    category: category
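
The required options can be checked before a catalog is used. The following is a minimal sketch and not part of HydroMT: check_source is a hypothetical helper, the entry is assumed to be already parsed from YAML into a Python dict, and the allowed values are simply those listed in the blueprint above, which may not be exhaustive.

```python
# Hypothetical helper, not part of the HydroMT API.
REQUIRED_KEYS = ("path", "data_type", "driver")
# Values copied from the blueprint above; the real lists may be longer.
DATA_TYPES = {"RasterDataset", "GeoDataset", "GeoDataFrame"}
DRIVERS = {"raster", "raster_tindex", "netcdf", "zarr", "vector", "vector_table"}


def check_source(name, source):
    """Return a list of problems found in one data source entry."""
    problems = [
        f"{name}: missing required option '{key}'"
        for key in REQUIRED_KEYS
        if key not in source
    ]
    if "data_type" in source and source["data_type"] not in DATA_TYPES:
        problems.append(f"{name}: unknown data_type '{source['data_type']}'")
    if "driver" in source and source["driver"] not in DRIVERS:
        problems.append(f"{name}: unknown driver '{source['driver']}'")
    return problems


# A data source as it might look after parsing the YAML file into a dict.
my_dataset = {
    "path": "relative_path/to_my_dataset.tif",
    "data_type": "RasterDataset",
    "driver": "raster",
}
print(check_source("my_dataset", my_dataset))
print(check_source("broken", {"path": "x.tif", "driver": "rastr"}))
```

Entries that only define an alias inherit these options from the aliased source, so they would need to be resolved before such a check.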

The YAML file has a global metadata section:

version (recommended): data catalog version; we recommend calendar versioning (https://calver.org/).
root (optional): root folder for all the data sources in the YAML file. If not provided, the folder where the YAML file is located is used as root. The root is combined with the path argument of each data source to avoid repetition.
category (optional): used if all data sources in the catalog belong to the same category. Usual categories within HydroMT are geography, topography, hydrography, meteo, landuse, ocean, socio-economic and observed data, but users are free to define their own categories.
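
How root and path combine can be sketched as follows. This illustrates the rule described above and is not HydroMT's actual implementation: resolve_path is a hypothetical helper, the file locations are made up, and the handling of absolute paths (used as given, ignoring root) is an assumption.

```python
from pathlib import PurePosixPath


def resolve_path(source_path, yaml_file, root=None):
    """Combine a source path with the catalog root (hypothetical helper)."""
    path = PurePosixPath(source_path)
    if path.is_absolute():
        # Assumption: absolute paths are used as given and ignore the root.
        return path
    # Fall back to the folder that holds the YAML file when no root is set.
    if root is not None:
        base = PurePosixPath(root)
    else:
        base = PurePosixPath(yaml_file).parent
    return base / path


yaml_file = "/home/user/catalog.yml"  # made-up location of the catalog file
print(resolve_path("topo/dem.tif", yaml_file, root="/path/to/data_root/"))
print(resolve_path("topo/dem.tif", yaml_file))
print(resolve_path("/data/topo/dem.tif", yaml_file, root="/path/to/data_root/"))
```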

A full list of data source options is given below.

Placeholder and alias

There are two convenience options to limit repetition between data sources in data catalog files:

The placeholder argument can be used to generate multiple sources from a single entry in the data catalog file. If different files follow a logical naming scheme, multiple data sources can be defined by iterating through all possible combinations of the placeholder values. The placeholder names should appear in both the source name and the path, with their values listed under the placeholder argument; see the example below with the epoch and epsg placeholders.
The alias argument can be used to define a data source under a second, shorter name, or to avoid repeating large sections with the same metadata. If an alias is provided, all information from the alias source is used to read the data, except for the information that is overwritten by the current data source. The alias source should also be defined in the same file. Note that overwriting only works at the first level of arguments: if, for example, the rename option is used in the current data source, it replaces all rename entries of the alias data source. In the example below, ghs_pop is short for a specific version (epoch=2015; epsg=54009) of that dataset.
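
The first-level overwrite rule for alias can be illustrated with plain dictionaries. This is a sketch of the behaviour described above, not HydroMT's own code: resolve_alias and the source names base and short are hypothetical, and chained aliases are not handled.

```python
def resolve_alias(name, catalog):
    """Return the options of `name` with its alias source filled in (sketch)."""
    source = dict(catalog[name])
    alias = source.pop("alias", None)
    if alias is None:
        return source
    if alias not in catalog:
        raise KeyError(f"alias source '{alias}' must be defined in the same file")
    # First-level merge: a key set in the current source replaces the whole
    # value from the alias source; nested dictionaries are not combined.
    return {**catalog[alias], **source}


catalog = {
    "base": {
        "path": "base.tif",
        "data_type": "RasterDataset",
        "driver": "raster",
        "rename": {"a": "x", "b": "y"},
    },
    "short": {"alias": "base", "rename": {"a": "z"}},
}
resolved = resolve_alias("short", catalog)
print(resolved["path"])    # inherited from the alias source
print(resolved["rename"])  # only the entry from "short"; "b" is not inherited
```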

Note

Alias is deprecated and will be removed soon; see the GitHub issue for more information.

ghs_pop:
  alias: ghs_pop_2015_54009
ghs_pop_{epoch}_{epsg}:
  data_type: RasterDataset
  driver: raster
  kwargs:
    chunks: {x: 3600, y: 3600}
  placeholder:
    epoch: [2015, 2020]
    epsg: [54009, 4326]
  meta:
    category: socio-economic
    paper_doi: 10.2905/0C6B9751-A71F-4062-830B-43C9F432370F
    paper_ref: Schiavina et al (2019)
    source_author: JRC-ISPRA EC
    source_license: https://data.jrc.ec.europa.eu/licence/com_reuse
    source_url: https://data.jrc.ec.europa.eu/dataset/0c6b9751-a71f-4062-830b-43c9f432370f
    source_version: R2019A_v1.0
  path: socio_economic/ghs/GHS_POP_E{epoch}_GLOBE_R2019A_{epsg}.tif
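
The placeholder expansion in the example above can be sketched in a few lines of Python. This is an illustration of the idea and not HydroMT's implementation: expand_placeholders is a hypothetical helper, and the entry is assumed to be already parsed into a dict (only a subset of its keys is repeated here).

```python
from itertools import product


def expand_placeholders(name, source):
    """Yield (name, options) for every placeholder combination (sketch)."""
    options = dict(source)
    placeholders = options.pop("placeholder", {})
    keys = list(placeholders)
    # Iterate over all combinations of the listed placeholder values.
    for values in product(*(placeholders[key] for key in keys)):
        mapping = dict(zip(keys, values))
        expanded = dict(options)
        # Fill the placeholders in both the source name and the path.
        expanded["path"] = options["path"].format(**mapping)
        yield name.format(**mapping), expanded


source = {
    "data_type": "RasterDataset",
    "driver": "raster",
    "placeholder": {"epoch": [2015, 2020], "epsg": [54009, 4326]},
    "path": "socio_economic/ghs/GHS_POP_E{epoch}_GLOBE_R2019A_{epsg}.tif",
}
expanded_sources = dict(expand_placeholders("ghs_pop_{epoch}_{epsg}", source))
for new_name, options in expanded_sources.items():
    print(new_name, "->", options["path"])
```

With two values for each of the two placeholders, the single entry yields four data sources, including the ghs_pop_2015_54009 source that the alias refers to.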