{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Example: Preparing a data catalog" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "This example illustrates how to prepare your own HydroMT [DataCatalog](../_generated/hydromt.data_catalog.DataCatalog.rst) to reference your own data sources and start using them within HydroMT, see the [user guide](../guides/advanced_user/data_prepare_cat.rst)." ] },
{ "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "# import python libraries\n", "import os\n", "from pprint import pprint\n", "\n", "import geopandas as gpd\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import rioxarray # noqa\n", "import xarray as xr\n", "\n", "import hydromt" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "In brief, the steps to use your own data within HydroMT are:\n", "\n", " 1) **Have your (local) dataset ready** in one of the supported [raster](../guides/advanced_user/data_types.rst#raster-data-rasterdataset) (tif, ascii, netcdf, zarr...), \n", " [vector](../guides/advanced_user/data_types.rst#vector-data-geodataframe) (shp, geojson, gpkg...) or [geospatial time-series](../guides/advanced_user/data_types.rst#geospatial-point-time-series-geodataset) (netcdf, csv...) formats.\n", " 2) **Create your own [yaml file](../guides/advanced_user/data_prepare_cat.rst#data-catalog-yaml-file)** with a reference to your prepared dataset and its properties (uri, data_type, driver, etc.) following the HydroMT [data conventions](../guides/user_guide/data_conventions.rst#data-conventions). For this step, you can also start from an existing pre-defined catalog or use it as inspiration.\n", "\n", "The existing pre-defined catalogs are:" ] },
{ "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "# this downloads the artifact_data archive v1.0.0\n", "data_catalog = hydromt.DataCatalog(data_libs=[\"artifact_data=v1.0.0\"])\n", "pprint(data_catalog.predefined_catalogs)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "In this notebook, we will see how to create a data catalog for several types of input data. For this we have prepared several types of data to catalogue. Let's see which data are available:" ] },
{ "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "# the artifact data is stored in the following location\n", "root = os.path.join(data_catalog._cache_dir, \"artifact_data\", \"v1.0.0\")\n", "# let's print some of the files that are there\n", "for item in os.listdir(root)[-10:]:\n", "    print(item)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## RasterDataset from a raster file" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "The first file we will use is a 'simple' raster file in tif format: **vito.tif**. This file contains a landuse classification raster. The first thing to do before adding a new file to a data catalog is to get to know what is inside our file, mainly:\n", "\n", " - **location of the file**: `uri`.\n", " - **type of data**: `data_type`. `RasterDataset` for gridded data, `GeoDataFrame` for vector data, `GeoDataset` for point timeseries and `DataFrame` for tabular data.\n", " - **file format**: `driver`. ",
"The file format determines the driver, i.e. the Python function that will be used to open the data. Examples used in this notebook are `rasterio` (tif and other GDAL rasters), `raster_xarray` (netcdf/zarr rasters), `pyogrio` (vector files) and `geodataset_xarray`/`geodataset_vector` (geospatial point time-series).\n", " - **crs**: `crs`. Coordinate reference system of the data. Optional, as it is usually encoded in the data itself.\n", " - **variables and their properties**: `rename`, `unit_mult`, `unit_add` (under `data_adapter`). Check the variable names and units in the input data so that we can convert them to the [HydroMT data conventions](../guides/user_guide/data_conventions.rst).\n", "\n", "There are more arguments or properties, which are explained in more detail in the [documentation](../guides/advanced_user/data_prepare_cat.rst). To discover our data we can either use GIS software such as QGIS or GDAL, or simply use Python to open the data." ] },
{ "attachments": {}, "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "Let's open our vito.tif file with xarray and rioxarray:" ] },
{ "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "da = xr.open_dataarray(os.path.join(root, \"vito.tif\"))\n", "pprint(da)\n", "print(f\"CRS: {da.raster.crs}\")\n", "da.plot()" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "What we see is that we have a simple raster with landuse data in CRS EPSG:4326. Let's translate what we know into a data catalog entry." ] },
{ "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "yml_str = f\"\"\"\n", "meta:\n", "  roots:\n", "    - {root}\n", "\n", "vito:\n", "  uri: vito.tif\n", "  data_type: RasterDataset\n", "  driver:\n", "    name: rasterio\n", "  metadata:\n", "    crs: 4326\n", "\"\"\"\n", "os.makedirs(\"tmpdir\", exist_ok=True)  # make sure the output folder exists\n", "yaml_path = \"tmpdir/vito.yml\"\n", "with open(yaml_path, mode=\"w\") as f:\n", "    f.write(yml_str)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "And let's now see if HydroMT can properly read the file from the data catalog we prepared:" ] },
{ "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "data_catalog = hydromt.DataCatalog(data_libs=[yaml_path])\n", "da = data_catalog.get_rasterdataset(\"vito\")\n", "da" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "## RasterDataset from several raster files" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "The second dataset we will add is **merit_hydro**, which consists of elevation and elevation-derived variables, each stored in a separate tif file. Let's see what their names are." ] },
{ "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "folder_name = os.path.join(root, \"merit_hydro\")\n", "# let's see which files are there\n", "for path, _, files in os.walk(folder_name):\n", "    print(path)\n", "    for name in files:\n", "        print(f\" - {name}\")" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "We have 9 files here. When reading tif files, the file name is used as the variable name. HydroMT uses data conventions to ensure that certain variables have consistent names and units so they can be used automatically in the workflows. For example, elevation data should be called *elevtn* with units in [m asl]. ",
"Check the [data conventions](../guides/user_guide/data_conventions.rst) and see if you need to ``rename`` or change units with ``unit_add`` and ``unit_mult`` for this dataset in the data catalog.\n", "\n", "Here all names and units are correct, so we just show an example where we rename the *hnd* variable." ] },
{ "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "yml_str = f\"\"\"\n", "meta:\n", "  roots:\n", "    - {root}\n", "\n", "merit_hydro:\n", "  data_type: RasterDataset\n", "  driver:\n", "    name: rasterio\n", "    options:\n", "      chunks:\n", "        x: 6000\n", "        y: 6000\n", "  data_adapter:\n", "    rename:\n", "      hnd: height_above_nearest_drain\n", "  uri: merit_hydro/*.tif\n", "\"\"\"\n", "# overwrite data catalog\n", "data_lib = \"tmpdir/merit_hydro.yml\"\n", "with open(data_lib, mode=\"w\") as f:\n", "    f.write(yml_str)" ] },
{ "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "data_catalog.from_yml(data_lib)  # add a yaml file to the data catalog\n", "print(data_catalog.sources.keys())\n", "ds = data_catalog.get_rasterdataset(\"merit_hydro\")\n", "ds" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "21", "metadata": {}, "source": [ "In the ``uri``, the filenames can be further specified with *{variable}*, *{year}* and *{month}* keys to limit which files are read for a data request, in the form of *\"path/to/my/files/{variable}_{year}_{month}.nc\"*.\n", "\n", "Let's see how this works:" ] },
{ "cell_type": "code", "execution_count": null, "id": "22", "metadata": {}, "outputs": [], "source": [ "# NOTE: the double curly brackets will be printed as single brackets in the text file\n", "yml_str = f\"\"\"\n", "meta:\n", "  roots:\n", "    - {root}\n", "\n", "merit_hydro:\n", "  data_type: RasterDataset\n", "  driver:\n", "    name: rasterio\n", "    options:\n", "      chunks:\n", "        x: 6000\n", "        y: 6000\n", "  data_adapter:\n", "    rename:\n", "      hnd: height_above_nearest_drain\n", "  uri: merit_hydro/{{variable}}.tif\n", "\"\"\"\n", "# overwrite data catalog\n", "data_lib = \"tmpdir/merit_hydro.yml\"\n", "with open(data_lib, mode=\"w\") as f:\n", "    f.write(yml_str)" ] },
{ "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "data_catalog.from_yml(data_lib)  # add a yaml file to the data catalog\n", "print(data_catalog.sources.keys())\n", "ds = data_catalog.get_rasterdataset(\n", "    \"merit_hydro\", variables=[\"height_above_nearest_drain\", \"elevtn\"]\n", ")\n", "ds" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "## RasterDataset from a netCDF file" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "25", "metadata": {}, "source": [ "The last RasterDataset we will add is **era5.nc**, which contains climate variables stored in a netCDF file."
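, "\n", "\n", "Before writing the catalog entry, it helps to first check which variables the file contains and in which units, since that determines whether we need ``rename``, ``unit_add`` or ``unit_mult``. Below is a minimal sketch that only prints the variable names and their ``units`` attribute (this attribute is optional and may be missing for some variables):" ] },
{ "cell_type": "code", "execution_count": null, "id": "25b", "metadata": {}, "outputs": [], "source": [ "# quick check of the variable names and their units attribute (if present)\n", "with xr.open_dataset(os.path.join(root, \"era5.nc\")) as ds_check:\n", "    for name, var in ds_check.data_vars.items():\n", "        print(f\"{name}: {var.attrs.get('units', 'no units attribute')}\")" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "25c", "metadata": {}, "source": [ "Now let's open the full file with xarray:"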
] }, { "cell_type": "code", "execution_count": null, "id": "26", "metadata": {}, "outputs": [], "source": [ "ds = xr.open_dataset(os.path.join(root, \"era5.nc\"))\n", "pprint(ds)" ] },
{ "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "# Select first timestep\n", "ds1 = ds.sel(time=ds.time[0])\n", "# Plot\n", "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))\n", "ds1[\"precip\"].plot(ax=axes[0])\n", "axes[0].set_title(\"precip\")\n", "ds1[\"temp\"].plot(ax=axes[1])\n", "axes[1].set_title(\"temp\")\n", "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))\n", "ds1[\"kin\"].plot(ax=axes[0])\n", "axes[0].set_title(\"kin\")\n", "ds1[\"press_msl\"].plot(ax=axes[1])\n", "axes[1].set_title(\"press_msl\")" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "Checking the [data conventions](../guides/user_guide/data_conventions.rst), we see that all variables already have the right names but the units should be changed:\n", "\n", " - precip from m to mm\n", " - temp, temp_min, temp_max from K to C\n", " - kin, kout from J.m-2 to W.m-2\n", " - press_msl from Pa to hPa\n", "\n", "Let's change the units using ``unit_mult`` and ``unit_add``:" ] },
{ "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "yml_str = f\"\"\"\n", "meta:\n", "  roots:\n", "    - {root}\n", "\n", "era5:\n", "  metadata:\n", "    crs: 4326\n", "  data_type: RasterDataset\n", "  driver:\n", "    name: raster_xarray\n", "  data_adapter:\n", "    unit_add:\n", "      temp: -273.15\n", "      temp_max: -273.15\n", "      temp_min: -273.15\n", "      time: 86400\n", "    unit_mult:\n", "      kin: 0.000277778\n", "      kout: 0.000277778\n", "      precip: 1000\n", "      press_msl: 0.01\n", "  uri: era5.nc\n", "\"\"\"\n", "# overwrite data catalog\n", "data_lib = \"tmpdir/era5.yml\"\n", "with open(data_lib, mode=\"w\") as f:\n", "    f.write(yml_str)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "And now let's open our dataset and check that the units have been converted." ] },
{ "cell_type": "code", "execution_count": null, "id": "31", "metadata": {}, "outputs": [], "source": [ "data_catalog.from_yml(data_lib)  # add a yaml file to the data catalog\n", "print(data_catalog.sources.keys())\n", "ds = data_catalog.get_rasterdataset(\"era5\")\n", "ds" ] },
{ "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "# Select first timestep\n", "ds1 = ds.sel(time=ds.time[0])\n", "# Plot\n", "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))\n", "ds1[\"precip\"].plot(ax=axes[0])\n", "axes[0].set_title(\"precip\")\n", "ds1[\"temp\"].plot(ax=axes[1])\n", "axes[1].set_title(\"temp\")\n", "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))\n", "ds1[\"kin\"].plot(ax=axes[0])\n", "axes[0].set_title(\"kin\")\n", "ds1[\"press_msl\"].plot(ax=axes[1])\n", "axes[1].set_title(\"press_msl\")" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "## GeoDataFrame from a vector file" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "34", "metadata": {}, "source": [ "Now we will see how to add vector data to the data catalogue based on **rivers_lin2019_v1.gpkg**. Vector files can be opened in Python with geopandas (or you can use QGIS) to inspect the data."
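, "\n", "\n", "A GeoPackage can contain several layers, so before writing the catalog entry it can be useful to list what the file holds. Below is a small sketch using pyogrio (assumed to be available here, since it is also the driver we use in the catalog entry further down):" ] },
{ "cell_type": "code", "execution_count": null, "id": "34b", "metadata": {}, "outputs": [], "source": [ "# list the layers (name, geometry type) contained in the GeoPackage\n", "import pyogrio\n", "\n", "print(pyogrio.list_layers(os.path.join(root, \"rivers_lin2019_v1.gpkg\")))" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "34c", "metadata": {}, "source": [ "Let's read the file with geopandas and inspect it:"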
] }, { "cell_type": "code", "execution_count": null, "id": "35", "metadata": {}, "outputs": [], "source": [ "gdf = gpd.read_file(os.path.join(root, \"rivers_lin2019_v1.gpkg\"))\n", "pprint(gdf.head())\n", "print(f\"Variables: {gdf.columns}\")\n", "print(f\"CRS: {gdf.crs}\")\n", "gdf.plot()" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "This data source contains river lines, including attributes such as river width, average discharge or bankfull discharge that can be useful to set up models. Here it's not needed, but feel free to try out some renaming or unit conversion. The minimal data catalog entry would be:" ] },
{ "cell_type": "code", "execution_count": null, "id": "37", "metadata": {}, "outputs": [], "source": [ "yml_str = f\"\"\"\n", "meta:\n", "  roots:\n", "    - {root}\n", "\n", "rivers_lin:\n", "  data_type: GeoDataFrame\n", "  driver:\n", "    name: pyogrio\n", "  uri: rivers_lin2019_v1.gpkg\n", "\"\"\"\n", "# overwrite data catalog\n", "data_lib = \"tmpdir/rivers.yml\"\n", "with open(data_lib, mode=\"w\") as f:\n", "    f.write(yml_str)" ] },
{ "cell_type": "code", "execution_count": null, "id": "38", "metadata": {}, "outputs": [], "source": [ "data_catalog.from_yml(data_lib)  # add a yaml file to the data catalog\n", "print(data_catalog.sources.keys())\n", "gdf = data_catalog.get_geodataframe(\"rivers_lin\")" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "39", "metadata": {}, "source": [ "## GeoDataset from a netCDF file" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "40", "metadata": {}, "source": [ "Now we will see how to add GeoDataset data to the data catalogue based on **gtsmv3_eu_era5.nc**. This file contains ocean water level timeseries at specific station locations in netCDF format and can be opened in Python with xarray. In HydroMT we use a specific wrapper around xarray called GeoDataset to mark that this file contains geospatial timeseries, in this case point timeseries. But for now we can inspect it with xarray.\n", "\n", "To learn more about the GeoDataset type, you can check the [reading geodataset example](reading_point_data.ipynb)."
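, "\n", "\n", "What makes a netCDF file usable as a point GeoDataset is that, next to the time dimension, it has an index dimension for the stations with associated x/y (or lon/lat) coordinates. The minimal sketch below only prints the dimensions and coordinate names so we can verify this before writing the catalog entry:" ] },
{ "cell_type": "code", "execution_count": null, "id": "40b", "metadata": {}, "outputs": [], "source": [ "# check the dimensions and coordinate names of the file\n", "with xr.open_dataset(os.path.join(root, \"gtsmv3_eu_era5.nc\")) as ds_check:\n", "    print(\"dimensions :\", dict(ds_check.sizes))\n", "    print(\"coordinates:\", list(ds_check.coords))" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "40c", "metadata": {}, "source": [ "Now let's open the full dataset with xarray:"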
] }, { "cell_type": "code", "execution_count": null, "id": "41", "metadata": {}, "outputs": [], "source": [ "ds = xr.open_dataset(os.path.join(root, \"gtsmv3_eu_era5.nc\"))\n", "ds" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "42", "metadata": {}, "source": [ "This is quite a standard file, so the data catalog entry is straightforward:" ] },
{ "cell_type": "code", "execution_count": null, "id": "43", "metadata": {}, "outputs": [], "source": [ "yml_str = f\"\"\"\n", "meta:\n", "  roots:\n", "    - {root}\n", "\n", "gtsm:\n", "  data_type: GeoDataset\n", "  driver:\n", "    name: geodataset_xarray\n", "  metadata:\n", "    crs: 4326\n", "    category: ocean\n", "    source_version: GTSM v3.0\n", "  uri: gtsmv3_eu_era5.nc\n", "\"\"\"\n", "# overwrite data catalog\n", "data_lib = \"tmpdir/gtsm.yml\"\n", "with open(data_lib, mode=\"w\") as f:\n", "    f.write(yml_str)" ] },
{ "cell_type": "code", "execution_count": null, "id": "44", "metadata": {}, "outputs": [], "source": [ "data_catalog.from_yml(data_lib)  # add a yaml file to the data catalog\n", "print(data_catalog.sources.keys())\n", "ds = data_catalog.get_geodataset(\"gtsm\")\n", "ds" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "45", "metadata": {}, "source": [ "## GeoDataset from vector files" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "46", "metadata": {}, "source": [ "For a GeoDataset, you can also use the ``geodataset_vector`` driver to combine two files, one for the locations and one for the timeseries, into a single GeoDataset. We have a custom example available in the data folder of our example notebooks, using the files *stations.csv* for the locations and *stations_data.csv* for the timeseries:" ] },
{ "cell_type": "code", "execution_count": null, "id": "47", "metadata": {}, "outputs": [], "source": [ "# example data folder\n", "root_data = \"data\"\n", "# let's print the files that are there\n", "for item in os.listdir(root_data):\n", "    print(item)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "48", "metadata": {}, "source": [ "For this driver to work, the format of the timeseries table is quite strict (see [docs](../guides/advanced_user/data_types.rst#csv-point-time-series-data)). ",
"Let's inspect the two files using pandas in Python:" ] },
{ "cell_type": "code", "execution_count": null, "id": "49", "metadata": {}, "outputs": [], "source": [ "df_locs = pd.read_csv(\"data/stations.csv\")\n", "pprint(df_locs)" ] },
{ "cell_type": "code", "execution_count": null, "id": "50", "metadata": {}, "outputs": [], "source": [ "df_data = pd.read_csv(\"data/stations_data.csv\")\n", "pprint(df_data)" ] },
{ "attachments": {}, "cell_type": "markdown", "id": "51", "metadata": {}, "source": [ "And see what the data catalog entry would look like:" ] },
{ "cell_type": "code", "execution_count": null, "id": "52", "metadata": {}, "outputs": [], "source": [ "yml_str = f\"\"\"\n", "meta:\n", "  roots:\n", "    - {os.path.join(os.getcwd(), 'data')}\n", "\n", "waterlevel_csv:\n", "  metadata:\n", "    crs: 4326\n", "  data_type: GeoDataset\n", "  driver:\n", "    name: geodataset_vector\n", "    options:\n", "      data_path: stations_data.csv\n", "  uri: stations.csv\n", "  data_adapter:\n", "    rename:\n", "      stations_data: waterlevel\n", "\"\"\"\n", "# overwrite data catalog\n", "data_lib = \"tmpdir/waterlevel.yml\"\n", "with open(data_lib, mode=\"w\") as f:\n", "    f.write(yml_str)" ] },
{ "cell_type": "code", "execution_count": null, "id": "53", "metadata": {}, "outputs": [], "source": [ "data_catalog.from_yml(data_lib)  # add a yaml file to the data catalog\n", "print(data_catalog.sources.keys())\n", "ds = data_catalog.get_geodataset(\"waterlevel_csv\")\n", "ds" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" }, "vscode": { "interpreter": { "hash": "b2e5bbf73edd41d9b4fc28909f8e07f802c9a01daacb9a903fccd95763f42821" } } }, "nbformat": 4, "nbformat_minor": 5 }