.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "user-guide\06-lazy-evaluation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_user-guide_06-lazy-evaluation.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_user-guide_06-lazy-evaluation.py:

Lazy evaluation
===============

A powerful feature of xarray is `lazy evaluation`_. Lazy evaluation means that
expressions are delayed until the results are needed. In a nutshell, lazy
evaluation has the following advantages for us:

* Lazy evaluation frees us from having to load all data into memory in one go.
  We can work with larger-than-memory datasets without having to manually
  divide them into smaller, manageable pieces.
* Lazy evaluation allows us to easily parallelize computations. This can speed
  up computations considerably, depending on the number of available cores.

Lazy evaluation can be contrasted with eager evaluation. In eager evaluation
with xarray, all data exists in memory at all times and computations are
performed immediately. Typical cases in which results are needed are when the
data has to be visualized or saved to disk.

We will explain these concepts with a number of examples.

.. GENERATED FROM PYTHON SOURCE LINES 25-34

.. code-block:: Python

    import dask.array
    import numpy as np
    import xarray as xr

    import imod

    lazy_da = xr.DataArray(dask.array.ones(5), dims=["x"])
    lazy_result = lazy_da + 1

.. GENERATED FROM PYTHON SOURCE LINES 35-43

When we run the lines above, we evaluate the expression
``lazy_result = lazy_da + 1``. In eager evaluation, a value of 1 is added to
every element immediately; in delayed evaluation, the expression is instead
stored in a graph of tasks. In xarray, delayed evaluation is taken care of by
dask, an open source Python library for (parallel) scheduling and computing.

We can identify the DataArray as lazy, since its ``.data`` attribute shows a
dask array rather than a numpy array.

.. GENERATED FROM PYTHON SOURCE LINES 43-46

.. code-block:: Python

    lazy_result.data
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                   Array      Chunk
    Bytes          40 B       40 B
    Shape          (5,)       (5,)
    Dask graph     1 chunks in 2 graph layers
    Data type      float64 numpy.ndarray


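As a small aside (a sketch, not part of the generated example), chaining
further operations keeps the result lazy: each operation merely adds tasks to
the graph, and nothing is evaluated until we explicitly ask for values, for
instance via ``.compute()``. The name ``chained`` below is only illustrative.

.. code-block:: Python

    # Each operation only extends the task graph; no values are computed yet.
    chained = (lazy_result * 2) - 0.5
    print(type(chained.data))        # dask.array.core.Array

    # Explicitly requesting the values executes the graph and returns numpy data.
    print(chained.compute().values)  # array([3.5, 3.5, 3.5, 3.5, 3.5])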
.. GENERATED FROM PYTHON SOURCE LINES 47-48

We can take a look at the graph of tasks as follows:

.. GENERATED FROM PYTHON SOURCE LINES 48-51

.. code-block:: Python

    lazy_result.data.visualize()

.. GENERATED FROM PYTHON SOURCE LINES 52-57

.. note::

    The ``show`` function here is required for the online documentation. If
    you're running the examples interactively, the line above will show you a
    picture of the graph, and defining and calling ``show()`` can be omitted.

.. GENERATED FROM PYTHON SOURCE LINES 57-74

.. code-block:: Python

    def show():
        # Show the latest png generated by dask.visualize with matplotlib. This is
        # required because the sphinx image scraper currently only supports
        # matplotlib and mayavi and will not capture the output of
        # dask.visualize().
        import matplotlib.image as mpimg
        import matplotlib.pyplot as plt

        _, ax = plt.subplots()
        ax.set_axis_off()
        ax.imshow(mpimg.imread("mydask.png"))

    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_001.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 75-82

We can see a very simple graph with two tasks:

* create the array of ones;
* add a value of 1 to every element.

Let's contrast this with eager evaluation. Rather than creating the data as a
dask array, we will create it as an ordinary numpy array.

.. GENERATED FROM PYTHON SOURCE LINES 82-87

.. code-block:: Python

    eager_da = xr.DataArray(np.ones(5), dims=["x"])
    eager_result = eager_da + 1
    eager_result.data

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([2., 2., 2., 2., 2.])

.. GENERATED FROM PYTHON SOURCE LINES 88-102

The data is a numpy array. Since the actual values of the elements of the
array are in memory, we also get a preview of the element values.

See also the xarray documentation on `parallel computing with dask`_.

Delayed computation and (i)MODFLOW files
----------------------------------------

The input and output functions in the imod package have been written to work
with xarray's delayed evaluation. This means that opening IDF or MODFLOW 6
output files will not load the data from the files into memory until the data
are needed.

Let's grab some example results, store them in a temporary directory, and
open them:

.. GENERATED FROM PYTHON SOURCE LINES 102-107

.. code-block:: Python

    result_dir = imod.util.temporary_directory()
    imod.data.twri_output(result_dir)
    head = imod.mf6.open_hds(result_dir / "twri.hds", result_dir / "twri.grb")

.. GENERATED FROM PYTHON SOURCE LINES 108-110

All time steps are available, but not loaded. Instead, a graph of tasks has
been defined. Let's take a look at the tasks:

.. GENERATED FROM PYTHON SOURCE LINES 110-114

.. code-block:: Python

    head.data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_002.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 115-117

We might want to perform a computation for every time step. Let's assume we
want to compute the mean head for every time step:

.. GENERATED FROM PYTHON SOURCE LINES 117-120

.. code-block:: Python

    mean_head = head.mean(["y", "x"])

.. GENERATED FROM PYTHON SOURCE LINES 121-123

We can take a look at the graph, and see that the dask tasks are fully
independent of each other:

.. GENERATED FROM PYTHON SOURCE LINES 123-127

.. code-block:: Python

    mean_head.data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_003.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_003.png
   :class: sphx-glr-single-img
.. GENERATED FROM PYTHON SOURCE LINES 128-141

Because every task is independent of the others, the computation is
`embarrassingly parallel`_: each task can be computed by an individual CPU
core of your computer. Dask will automatically distribute the tasks over your
CPU cores and manage them so that memory usage does not exceed the available
RAM. This goes well most of the time, but not always; refer to the
`Pitfalls`_ section below.

Chunk size and memory layout
----------------------------

Computer memory and data storage are best thought of as one-dimensional
arrays of bits (even though the hardware might actually be
`multi-dimensional`_ behind the scenes).

.. GENERATED FROM PYTHON SOURCE LINES 143-154

Pitfalls
--------

There are a number of pitfalls to be aware of when working with dask and lazy
evaluation. A primary pitfall is not keeping the chunk (and memory) layout in
mind. When working with MODFLOW models, this typically concerns operations
**over** the time dimension: because of the file formats, MODFLOW data is
chunked along the time dimension. Concretely, this means that the following
code can be very slow for large datasets:

.. GENERATED FROM PYTHON SOURCE LINES 154-161

.. code-block:: Python

    points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    samples = [head.sel(x=x, y=y, method="nearest") for (x, y) in points]

    samples[0].data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_004.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 162-166

The reason is that in the lines above, a single x-y point is extracted at a
time. To extract this single point, the entire dataset must be loaded into
memory! Generally, the solution is to use a vectorized approach, selecting
all the points "at once".

.. GENERATED FROM PYTHON SOURCE LINES 166-171

.. code-block:: Python

    x = [0.0, 1.0, 2.0]
    y = [0.0, 1.0, 2.0]
    samples = head.sel(x=x, y=y, method="nearest")

.. GENERATED FROM PYTHON SOURCE LINES 172-177

In these lines, a chunk is loaded, all point values for a single time step
are extracted and stored; then the next chunk is loaded, and so on. Because
the call is vectorized, dask can optimize the selection and figure out that
it should extract all values while loading each chunk only once. We can
verify this by taking a look at the task graph again:

.. GENERATED FROM PYTHON SOURCE LINES 177-181

.. code-block:: Python

    samples.data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_005.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 182-186

There are cases where the data cannot be loaded in a vectorized manner, or
cases where the scheduling goes haywire. In such cases, a straightforward
solution is to fall back to eager evaluation (with numpy arrays instead of
dask arrays). This can be done at any time by simply calling ``.compute()``:

.. GENERATED FROM PYTHON SOURCE LINES 186-189

.. code-block:: Python

    eager_head = head.compute()

.. GENERATED FROM PYTHON SOURCE LINES 190-194

This has the obvious downside that the entire dataset needs to fit in memory.
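Before deciding to call ``.compute()`` on a large result, it can help to
check how much memory the fully realized array would need. A small sketch
(``nbytes`` is derived from the shape and dtype, so it does not trigger any
computation):

.. code-block:: Python

    # nbytes reports the size of the fully realized array, even while it is lazy.
    size_in_gb = head.nbytes / 1e9
    print(f"Realizing `head` would require roughly {size_in_gb:.3f} GB of RAM.")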
Once the data fit, however, the following lines are relatively fast again,
since the entire dataset has already been loaded into memory this time
around:

.. GENERATED FROM PYTHON SOURCE LINES 194-198

.. code-block:: Python

    points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    samples = [eager_head.sel(x=x, y=y, method="nearest") for (x, y) in points]

.. GENERATED FROM PYTHON SOURCE LINES 199-210

Annoyingly, the VSCode interactive Python interpreter does not cooperate well
with every `dask scheduler`_. If you're working in VSCode and either
performance seems slow or memory usage is very high, try running your script
outside of VSCode.

.. _lazy evaluation: https://en.wikipedia.org/wiki/Lazy_evaluation
.. _parallel computing with dask: https://xarray.pydata.org/en/stable/user-guide/dask.html
.. _multi-dimensional: https://cs.stackexchange.com/questions/123565/why-is-memory-one-dimensional
.. _embarrassingly parallel: https://en.wikipedia.org/wiki/Embarrassingly_parallel
.. _chunking: https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
.. _dask scheduler: https://github.com/dask/dask/issues/5525

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.241 seconds)

.. _sphx_glr_download_user-guide_06-lazy-evaluation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 06-lazy-evaluation.ipynb <06-lazy-evaluation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 06-lazy-evaluation.py <06-lazy-evaluation.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 06-lazy-evaluation.zip <06-lazy-evaluation.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_