.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "user-guide\06-lazy-evaluation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_user-guide_06-lazy-evaluation.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_user-guide_06-lazy-evaluation.py:

Lazy evaluation
===============

A powerful feature of xarray is `lazy evaluation`_. Lazy evaluation means that
expressions are delayed until the results are needed. In a nutshell, lazy
evaluation has the following advantages for us:

* Lazy evaluation frees us from having to load all data into memory in one go.
  We can work with larger-than-memory datasets without having to manually
  divide them into smaller, manageable pieces.
* Lazy evaluation allows us to easily parallelize computations. This can speed
  up computations considerably, depending on the number of available cores.

Lazy evaluation can be contrasted with eager evaluation. In eager evaluation
with xarray, all data exists in memory at all times and computations are
performed immediately. Typical cases in which results are needed are when the
data has to be visualized or saved to disk.

We will explain these concepts with a number of examples.

.. GENERATED FROM PYTHON SOURCE LINES 25-34

.. code-block:: Python

    import dask.array
    import numpy as np
    import xarray as xr

    import imod

    lazy_da = xr.DataArray(dask.array.ones(5), dims=["x"])
    lazy_result = lazy_da + 1

.. GENERATED FROM PYTHON SOURCE LINES 35-43

When we run the lines above, we evaluate the expression
``lazy_result = lazy_da + 1``. In eager evaluation, a value of 1 is added to
every element immediately; in delayed evaluation, the expression is instead
stored in a graph of tasks. In xarray, delayed evaluation is taken care of by
dask, an open source Python library for (parallel) scheduling and computing.

We can identify the DataArray as lazy, since its ``.data`` attribute shows a
dask array rather than a numpy array.

.. GENERATED FROM PYTHON SOURCE LINES 43-46

.. code-block:: Python

    lazy_result.data
.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                   Array      Chunk
    Bytes          40 B       40 B
    Shape          (5,)       (5,)
    Dask graph     1 chunks in 2 graph layers
    Data type      float64 numpy.ndarray


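As a small aside (a sketch, not part of the generated example), chaining
further operations keeps the result lazy: each operation merely adds tasks to
the graph, and nothing is evaluated until we explicitly ask for values, for
instance via ``.compute()``. The name ``chained`` below is only illustrative.

.. code-block:: Python

    # Each operation only extends the task graph; no values are computed yet.
    chained = (lazy_result * 2) - 0.5
    print(type(chained.data))        # dask.array.core.Array

    # Explicitly requesting the values executes the graph and returns numpy data.
    print(chained.compute().values)  # array([3.5, 3.5, 3.5, 3.5, 3.5])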
.. GENERATED FROM PYTHON SOURCE LINES 47-48

We can take a look at the graph of tasks as follows:

.. GENERATED FROM PYTHON SOURCE LINES 48-51

.. code-block:: Python

    lazy_result.data.visualize()

.. GENERATED FROM PYTHON SOURCE LINES 52-57

.. note::

    The ``show`` function here is required for the online documentation. If
    you're running the examples interactively, the line above will show you a
    picture of the graph, and defining and calling ``show()`` can be omitted.

.. GENERATED FROM PYTHON SOURCE LINES 57-74

.. code-block:: Python

    def show():
        # Show the latest png generated by dask.visualize with matplotlib. This is
        # required because the sphinx image scraper currently only supports
        # matplotlib and mayavi and will not capture the output of
        # dask.visualize().
        import matplotlib.image as mpimg
        import matplotlib.pyplot as plt

        _, ax = plt.subplots()
        ax.set_axis_off()
        ax.imshow(mpimg.imread("mydask.png"))

    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_001.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 75-82

We can see a very simple graph with two tasks:

* create the array of ones;
* add a value of 1 to every element.

Let's contrast this with eager evaluation. Rather than creating the data as a
dask array, we will create it as an ordinary numpy array.

.. GENERATED FROM PYTHON SOURCE LINES 82-87

.. code-block:: Python

    eager_da = xr.DataArray(np.ones(5), dims=["x"])
    eager_result = eager_da + 1
    eager_result.data

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([2., 2., 2., 2., 2.])

.. GENERATED FROM PYTHON SOURCE LINES 88-102

The data is a numpy array. Since the actual values of the elements of the
array are in memory, we also get a preview of the element values.

See also the xarray documentation on `parallel computing with dask`_.

Delayed computation and (i)MODFLOW files
----------------------------------------

The input and output functions in the imod package have been written to work
with xarray's delayed evaluation. This means that opening IDF or MODFLOW 6
output files will not load the data from the files into memory until the data
are needed.

Let's grab some example results, store them in a temporary directory, and
open them:

.. GENERATED FROM PYTHON SOURCE LINES 102-107

.. code-block:: Python

    result_dir = imod.util.temporary_directory()
    imod.data.twri_output(result_dir)
    head = imod.mf6.open_hds(result_dir / "twri.hds", result_dir / "twri.grb")

.. GENERATED FROM PYTHON SOURCE LINES 108-110

All time steps are available, but not loaded. Instead, a graph of tasks has
been defined. Let's take a look at the tasks:

.. GENERATED FROM PYTHON SOURCE LINES 110-114

.. code-block:: Python

    head.data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_002.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 115-117

We might want to perform a computation for every time step. Let's assume we
want to compute the mean head for every time step:

.. GENERATED FROM PYTHON SOURCE LINES 117-120

.. code-block:: Python

    mean_head = head.mean(["y", "x"])

.. GENERATED FROM PYTHON SOURCE LINES 121-123

We can take a look at the graph, and see that the dask tasks are fully
independent of each other:

.. GENERATED FROM PYTHON SOURCE LINES 123-127

.. code-block:: Python

    mean_head.data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_003.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_003.png
   :class: sphx-glr-single-img
.. GENERATED FROM PYTHON SOURCE LINES 128-141

Because every task is independent of the others, the computation is
`embarrassingly parallel`_: each task can be computed by an individual CPU
core of your computer. Dask will automatically distribute the tasks over your
CPU cores and manage them so that memory usage does not exceed the available
RAM. This goes well most of the time, but not always; refer to the
`Pitfalls`_ section below.

Chunk size and memory layout
----------------------------

Computer memory and data storage are best thought of as one-dimensional
arrays of bits (even though the hardware might actually be
`multi-dimensional`_ behind the scenes).

.. GENERATED FROM PYTHON SOURCE LINES 143-154

Pitfalls
--------

There are a number of pitfalls to be aware of when working with dask and lazy
evaluation. A primary pitfall is not keeping the chunk (and memory) layout in
mind. When working with MODFLOW models, this typically concerns operations
**over** the time dimension: because of the file formats, MODFLOW data is
chunked along the time dimension. Concretely, this means that the following
code can be very slow for large datasets:

.. GENERATED FROM PYTHON SOURCE LINES 154-161

.. code-block:: Python

    points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    samples = [head.sel(x=x, y=y, method="nearest") for (x, y) in points]

    samples[0].data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_004.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 162-166

The reason is that in the lines above, a single x-y point is extracted at a
time. To extract this single point, the entire dataset must be loaded into
memory! Generally, the solution is to use a vectorized approach, selecting
all the points "at once".

.. GENERATED FROM PYTHON SOURCE LINES 166-171

.. code-block:: Python

    x = [0.0, 1.0, 2.0]
    y = [0.0, 1.0, 2.0]
    samples = head.sel(x=x, y=y, method="nearest")

.. GENERATED FROM PYTHON SOURCE LINES 172-177

In these lines, a chunk is loaded, all point values for a single time step
are extracted and stored; then the next chunk is loaded, and so on. Because
the call is vectorized, dask can optimize the selection and figure out that
it should extract all values while loading each chunk only once. We can
verify this by taking a look at the task graph again:

.. GENERATED FROM PYTHON SOURCE LINES 177-181

.. code-block:: Python

    samples.data.visualize()
    show()

.. image-sg:: /user-guide/images/sphx_glr_06-lazy-evaluation_005.png
   :alt: 06 lazy evaluation
   :srcset: /user-guide/images/sphx_glr_06-lazy-evaluation_005.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 182-186

There are cases where the data cannot be loaded in a vectorized manner, or
cases where the scheduling goes haywire. In such cases, a straightforward
solution is to fall back to eager evaluation (with numpy arrays instead of
dask arrays). This can be done at any time by simply calling ``.compute()``:

.. GENERATED FROM PYTHON SOURCE LINES 186-189

.. code-block:: Python

    eager_head = head.compute()

.. GENERATED FROM PYTHON SOURCE LINES 190-194

This has the obvious downside that the entire dataset needs to fit in memory.
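Before deciding to call ``.compute()`` on a large result, it can help to
check how much memory the fully realized array would need. A small sketch
(``nbytes`` is derived from the shape and dtype, so it does not trigger any
computation):

.. code-block:: Python

    # nbytes reports the size of the fully realized array, even while it is lazy.
    size_in_gb = head.nbytes / 1e9
    print(f"Realizing `head` would require roughly {size_in_gb:.3f} GB of RAM.")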
Once the data fit, however, the following lines are relatively fast again,
since the entire dataset has already been loaded into memory this time
around:

.. GENERATED FROM PYTHON SOURCE LINES 194-198

.. code-block:: Python

    points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    samples = [eager_head.sel(x=x, y=y, method="nearest") for (x, y) in points]

.. GENERATED FROM PYTHON SOURCE LINES 199-210

Annoyingly, the VSCode interactive Python interpreter does not cooperate well
with every `dask scheduler`_. If you're working in VSCode and either
performance seems slow or memory usage is very high, try running your script
outside of VSCode.

.. _lazy evaluation: https://en.wikipedia.org/wiki/Lazy_evaluation
.. _parallel computing with dask: https://xarray.pydata.org/en/stable/user-guide/dask.html
.. _multi-dimensional: https://cs.stackexchange.com/questions/123565/why-is-memory-one-dimensional
.. _embarrassingly parallel: https://en.wikipedia.org/wiki/Embarrassingly_parallel
.. _chunking: https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
.. _dask scheduler: https://github.com/dask/dask/issues/5525

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.241 seconds)

.. _sphx_glr_download_user-guide_06-lazy-evaluation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 06-lazy-evaluation.ipynb <06-lazy-evaluation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 06-lazy-evaluation.py <06-lazy-evaluation.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 06-lazy-evaluation.zip <06-lazy-evaluation.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_