Tip

For an interactive online version click here: Binder badge

Example: Reading geospatial point time-series#

This example illustrates the how to read geospatial point time-series data here called GeoDataset, using the HydroMT DataCatalog with the vector and netcdf or zarr drivers.

[1]:
from pprint import pprint

import numpy as np

import hydromt
[2]:
# Download artifacts for the Piave basin
data_catalog = hydromt.DataCatalog(data_libs=["artifact_data=v1.0.0"])

Xarray Driver#

To read geospatial point time-series data and parse it into a xarray Dataset or DataArray we use the open_mfdataset() or the open_zarr() method. All options in the data catalog yaml file will be passed to to these methods.

As an example we will use the GTSM data dataset which is stored in Netcdf format.

[3]:
# inspect data source entry in data catalog yaml file
data_catalog.get_source("gtsmv3_eu_era5")
[3]:
GeoDatasetSource(name='gtsmv3_eu_era5', uri='gtsmv3_eu_era5.nc', data_adapter=GeoDatasetAdapter(unit_add={}, unit_mult={}, rename={}), driver=GeoDatasetXarrayDriver(filesystem=<fsspec.implementations.local.LocalFileSystem object at 0x7718187f62b0>, options={}), uri_resolver=ConventionResolver(filesystem=<fsspec.implementations.local.LocalFileSystem object at 0x7718187f62b0>, options={}), root='/home/sam/.hydromt/artifact_data/v1.0.0', version='3.0', provider=None, metadata=SourceMetadata(crs=<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
, unit=None, extent={}, nodata=None, attrs={}, category='ocean', paper_doi='10.24381/cds.8c59054f', paper_ref='Copernicus Climate Change Service 2019', url='https://cds.climate.copernicus.eu/cdsapp#!/dataset/10.24381/cds.8c59054f?tab=overview', license='https://cds.climate.copernicus.eu/cdsapp/#!/terms/licence-to-use-copernicus-products'))

We can load any geospatial point time-series data using DataCatalog.get_geodataset(). Note that if we don’t provide any arguments it returns the full dataset with all data variables and for the full spatial domain. The result is per default returned as DataArray as the dataset consists of a single variable. To return a dataset use the single_var_as_array=False argument. Only the data coordinates and the time are actually read, the data variables are still lazy Dask arrays.

[4]:
ds = data_catalog.get_geodataset("gtsmv3_eu_era5", single_var_as_array=False)
ds
object: GeoDatasetXarrayDriver does not use kwarg predicate with value intersects.
[4]:
<xarray.Dataset> Size: 323kB
Dimensions:      (stations: 19, time: 2016)
Coordinates:
    lon          (stations) float64 152B dask.array<chunksize=(19,), meta=np.ndarray>
  * time         (time) datetime64[ns] 16kB 2010-02-01 ... 2010-02-14T23:50:00
    lat          (stations) float64 152B dask.array<chunksize=(19,), meta=np.ndarray>
    spatial_ref  int32 4B ...
  * stations     (stations) int32 76B 13670 2798 2799 13775 ... 2791 2790 2789
Data variables:
    waterlevel   (time, stations) float64 306kB dask.array<chunksize=(2016, 19), meta=np.ndarray>
Attributes:
    crs:        4326
    category:   ocean
    paper_doi:  10.24381/cds.8c59054f
    paper_ref:  Copernicus Climate Change Service 2019
    url:        https://cds.climate.copernicus.eu/cdsapp#!/dataset/10.24381/c...
    license:    https://cds.climate.copernicus.eu/cdsapp/#!/terms/licence-to-...

The data can be visualized with the DataArray.plot() xarray method. We show the evolution of the water level over time for a specific point location (station).

[5]:
ds.sel(stations=2791)["waterlevel"].plot()
[5]:
[<matplotlib.lines.Line2D at 0x771814ac7580>]
../_images/_examples_reading_point_data_10_1.png

We can request a (spatial) subset data by providing additional variables and bbox / geom arguments. Note that these return less stations and less variables. In this example only spatial arguments are applied as only a single variable is available. The variables argument is especially useful if each variable of the dataset is saved in a separate file and the {variable} key is used in the path argument of the data source to limit which files are actually read. If a single variable is requested a DataArray instead of a Dataset is returned unless the single_var_as_array argument is set to False (True by default).

[6]:
bbox = [12.50, 45.20, 12.80, 45.40]
ds_bbox = data_catalog.get_geodataset("gtsmv3_eu_era5", bbox=bbox)
ds_bbox
object: GeoDatasetXarrayDriver does not use kwarg predicate with value intersects.
[6]:
<xarray.DataArray 'waterlevel' (time: 2016, stations: 2)> Size: 32kB
dask.array<getitem, shape=(2016, 2), dtype=float64, chunksize=(2016, 2), chunktype=numpy.ndarray>
Coordinates:
    lon          (stations) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
  * time         (time) datetime64[ns] 16kB 2010-02-01 ... 2010-02-14T23:50:00
    lat          (stations) float64 16B dask.array<chunksize=(2,), meta=np.ndarray>
    spatial_ref  int32 4B ...
  * stations     (stations) int32 8B 13775 13723
Attributes:
    long_name:       sea_surface_height_above_mean_sea_level
    units:           m
    CDI_grid_type:   unstructured
    short_name:      waterlevel
    description:     Total water level resulting from the combination of baro...
    category:        ocean
    paper_doi:       10.24381/cds.8c59054f
    paper_ref:       Copernicus Climate Change Service 2019
    source_license:  https://cds.climate.copernicus.eu/cdsapp/#!/terms/licenc...
    source_url:      https://cds.climate.copernicus.eu/cdsapp#!/dataset/10.24...
    source_version:  GTSM v3.0

With a Geodataset, you can also directly access the associated point geometries using its to_gdf() method. Multi-dimensional data (e.g. the time series) can be reduced using a statistical method such as the max.

[7]:
pprint(ds_bbox.vector.to_gdf(reducer=np.nanmax))
                           geometry  waterlevel
stations
13775     POINT (12.75879 45.24902)    0.703679
13723     POINT (12.50244 45.25635)    0.721433
[8]:
# Plot
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

proj = ccrs.PlateCarree()

fig = plt.figure(figsize=(6, 10))
ax = plt.subplot(projection=proj)

bbox = gpd.GeoDataFrame(geometry=[box(12.50, 45.20, 12.80, 45.40)], crs=4326)

ax.add_image(cimgt.QuadtreeTiles(), 12)
# Plot the points
ds.vector.to_gdf().plot(ax=ax, markersize=30, c="blue", zorder=2)
ds_bbox.vector.to_gdf().plot(ax=ax, markersize=40, c="red", zorder=2)
# Plot the bounding box
bbox.boundary.plot(ax=ax, color="red", linewidth=0.8)
[8]:
<GeoAxes: >
../_images/_examples_reading_point_data_15_1.png

Vector driver#

To read vector data and parse it into a xarray.Dataset object we use the open_geodataset() method. Combined point locations (e.g. CSV or GeoJSON) data as well as text delimited time series (e.g. CSV) data are supported as file formats (see DataCatalog documentation). Both formats must contain an index and a crs has to be indicated in the .yml file. For demonstration we use dummy example data from the examples/data folder.

First load the data catalog of the corresponding example data geodataset_catalog.yml:

[9]:
geodata_catalog = hydromt.DataCatalog("data/geodataset_catalog.yml")
geodata_catalog.get_source("waterlevels_txt")
[9]:
GeoDatasetSource(name='waterlevels_txt', uri='stations.csv', data_adapter=GeoDatasetAdapter(unit_add={}, unit_mult={}, rename={'stations_data': 'waterlevel'}), driver=GeoDatasetVectorDriver(filesystem=<fsspec.implementations.local.LocalFileSystem object at 0x7718187f62b0>, options={'data_path': 'stations_data.csv'}), uri_resolver=ConventionResolver(filesystem=<fsspec.implementations.local.LocalFileSystem object at 0x7718187f62b0>, options={}), root='data', version=None, provider=None, metadata=SourceMetadata(crs=<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
, unit=None, extent={}, nodata=None, attrs={}, category=None))

We see here that our locations are defined in a data/stations.csv file and a data/stations_data.csv. Let’s check the content of these files before loading them with the get_geodataset method:

[10]:
import pandas as pd

print("Stations locations:")
df_stations = pd.read_csv("data/stations.csv")
pprint(df_stations)
Stations locations:
   stations         x         y
0      1001  12.50244  45.25635
1      1002  12.75879  45.24902
[11]:
print("Stations data:")
df_stations_data = pd.read_csv("data/stations_data.csv")
pprint(df_stations_data.head(3))
Stations data:
         time      1001      1002
0  2016-01-01  0.590860  0.591380
1  2016-01-02  0.565552  0.571342
2  2016-01-03  0.538679  0.549770
[12]:
ds = geodata_catalog.get_geodataset("waterlevels_txt", single_var_as_array=False)
ds
object: GeoDatasetVectorDriver does not use kwarg metadata with value crs=<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
 unit=None extent={} nodata=None attrs={} category=None.
[12]:
<xarray.Dataset> Size: 48kB
Dimensions:      (time: 2016, stations: 2)
Coordinates:
  * time         (time) datetime64[ns] 16kB 2016-01-01 2016-01-02 ... 2021-07-08
  * stations     (stations) int64 16B 1001 1002
    geometry     (stations) object 16B dask.array<chunksize=(2,), meta=np.ndarray>
    spatial_ref  int64 8B 0
Data variables:
    waterlevel   (stations, time) float64 32kB dask.array<chunksize=(2, 2016), meta=np.ndarray>
Attributes:
    crs:      4326