Preparing a Data Catalog

Steps in brief:

Have your (local) dataset ready in one of the supported raster, vector or geospatial time-series formats.
Create your own yaml file with a reference to your prepared dataset following the HydroMT data conventions; see the examples below.

A detailed description of the yaml file is given below. For more information, see from_yml() and the examples per data type.

Data catalog yaml file

Each data source is added to a data catalog yaml file with a user-defined name.

A blueprint for a dataset called my_dataset is shown below. The path, data_type and driver options are required, and the meta option with the shown keys is highly recommended. The rename, nodata, unit_add and unit_mult options are set per variable (or per attribute table column in the case of a GeoDataFrame). kwargs contains any options passed on to the different drivers.

meta:
  root: /path/to/data_root/
  version: version
my_dataset:
  crs: EPSG/WKT
  data_type: RasterDataset/GeoDataset/GeoDataFrame
  driver: raster/raster_tindex/netcdf/zarr/vector/vector_table
  filesystem: local/gcs/s3
  kwargs:
    key: value
  meta:
    source_url: zenodo.org/my_dataset
    source_license: CC-BY-3.0
    source_version: vX.X
    paper_ref: Author et al. (year)
    paper_doi: doi
    category: category
  nodata:
    new_variable_name: value
  path: /absolute_path/to/my_dataset.extension OR relative_path/to_my_dataset.extension
  placeholders:
    [placeholder_key: [placeholder_values]]
  zoom_levels:
    [zoom_level: zoom_resolution]
  rename:
    old_variable_name: new_variable_name
  unit_add:
    new_variable_name: value
  unit_mult:
    new_variable_name: value
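
As a quick sanity check before loading a catalog, the three required options (path, data_type and driver) can be verified on each entry. The sketch below is not part of HydroMT: it assumes the yaml file has already been parsed into a plain Python dictionary, and the helper name missing_required and the entry incomplete_dataset are hypothetical.

```python
# Hypothetical helper for illustration; this is not a HydroMT function.
REQUIRED_KEYS = ("path", "data_type", "driver")


def missing_required(catalog: dict) -> dict:
    """Return {source_name: [missing keys]} for every incomplete data source."""
    problems = {}
    for name, source in catalog.items():
        if name == "meta":  # global meta section, not a data source
            continue
        missing = [key for key in REQUIRED_KEYS if key not in source]
        if missing:
            problems[name] = missing
    return problems


# A parsed catalog, assumed to mirror the blueprint structure.
catalog = {
    "meta": {"root": "/path/to/data_root/", "version": "version"},
    "my_dataset": {
        "data_type": "RasterDataset",
        "driver": "raster",
        "path": "relative_path/to_my_dataset.extension",
    },
    "incomplete_dataset": {"data_type": "GeoDataFrame"},
}

report = missing_required(catalog)
print(report)
```

Only incomplete_dataset is reported here, because it lacks both path and driver.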

The yaml file has an optional global metadata section (meta):

root (optional): root folder for all the data sources in the yaml file. If not provided, the folder in which the yaml file is located is used as root. The root is combined with the path argument of each data source to avoid repetition.
version (recommended): data catalog version; we recommend calendar versioning <https://calver.org/>
category (optional): used if all data sources in the catalog belong to the same category. Usual categories within HydroMT are geography, topography, hydrography, meteo, landuse, ocean, socio-economic and observed data, but users are free to define their own categories.
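
The effect of the root option on a data source path can be illustrated with a few lines of Python. This is a minimal sketch of the rule described above, not HydroMT's own implementation: the helper resolve_source_path, the file names, and the choice to leave absolute paths untouched are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def resolve_source_path(source_path, yaml_file, root=None):
    """Combine a data source path with the catalog root (hypothetical helper)."""
    path = PurePosixPath(source_path)
    if path.is_absolute():
        return path  # assumption: absolute paths are used as-is
    # Fall back to the folder of the yaml file when no root is given.
    base = PurePosixPath(root) if root is not None else PurePosixPath(yaml_file).parent
    return base / path


yaml_file = "/home/user/catalogs/my_catalog.yml"  # hypothetical location

with_root = resolve_source_path("dem/elevation.tif", yaml_file, root="/path/to/data_root/")
without_root = resolve_source_path("dem/elevation.tif", yaml_file)

print(with_root)     # /path/to/data_root/dem/elevation.tif
print(without_root)  # /home/user/catalogs/dem/elevation.tif
```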

The following are required data source arguments:

data_type: type of input data. Either RasterDataset, GeoDataset or GeoDataFrame.
driver: data_type specific driver to read a dataset, see overview below.
path: path to the data file. Relative paths are combined with the global root option of the yaml file (if available) or the directory of the yaml file itself. To read multiple files in a single dataset (if supported by the driver) a string glob in the form of "path/to/my/files/*.nc" can be used. The filenames can be further specified with {variable}, {year} and {month} keys to limit which files are being read based on the get_data request in the form of "path/to/my/files/{variable}_{year}_{month}.nc". Note that month is by default not zero-padded (e.g. January 2012 is stored as "path/to/my/files/{variable}_2012_1.nc"). Users can optionally add a formatting string to define how the key should be read. For example, in a path written as "path/to/my/files/{variable}_{year}_{month:02d}.nc", the month always has two digits and is zero-padded for Jan-Sep (e.g. January 2012 is stored as "path/to/my/files/{variable}_2012_01.nc").
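
The difference between {month} and {month:02d} follows Python's format-string syntax, so it can be checked with str.format. The sketch below only illustrates how the keys expand into file names. It assumes HydroMT fills the keys in the same way, and the variable name precip is hypothetical.

```python
# Path templates as they would appear in the path argument of a data source.
template = "path/to/my/files/{variable}_{year}_{month}.nc"
padded_template = "path/to/my/files/{variable}_{year}_{month:02d}.nc"

# Hypothetical request: variable "precip" for January 2012.
keys = {"variable": "precip", "year": 2012, "month": 1}

default_name = template.format(**keys)
padded_name = padded_template.format(**keys)

print(default_name)  # path/to/my/files/precip_2012_1.nc
print(padded_name)   # path/to/my/files/precip_2012_01.nc
```

If the files on disk are named with zero-padded months, the template without a format specifier would point to a file name that does not exist, so the template should match the naming of the stored files.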

A full list of optional data source arguments is given below.

Note

The alias argument will be deprecated and should no longer be used; see the GitHub issue for more information.

Warning

Using cloud data is still experimental and only supported for DataFrame, RasterDataset and GeoDataset with zarr. RasterDataset with the raster driver is also possible, but in case of multiple files (mosaic) we strongly recommend using a vrt file for speed and computational efficiency.