Version control your projects
Introduction
As scripting becomes increasingly important in hydrological projects, the rules of software development also start to apply to us. The downside is that we, hydrologists, have to invest extra effort in learning new tools; the upside is that the software development world already offers a wealth of well-tested tools and documentation. Reproducible code is one of these important practices, and version control systems have been developed to support it. Using these systems in hydrological projects provides the following advantages:
- You keep track of the history of a project. In this way you keep a journal of decisions made in a project.
- If you mess up something, you can always revert to a previous state.
- It allows for collaborative development, where individuals can create their own branch to safely work on new developments and later merge their changes.
This document describes how to use Git and DVC. If you run into issues, Stack Overflow has a lot of information available (>145,000 questions as of 02-03-2023), since Git is so commonly used today.
Git
Today (03-2023), Git is the most commonly used version control system for code. Code development is usually shared on platforms such as GitHub or GitLab.
One of the main differences between Git and older systems such as Subversion is that Git is a distributed version control system, whereas Subversion is a centralized one. In Subversion, the history of changes is stored only on the central server, not on the user's machine, so the user has to be connected to a server to do any version control. With Git, a repository on a server (called a remote) is first cloned to the user's local machine, and this clone includes the full history of changes. Users can then make their changes and push these back to the remote, but they do not have to. This looser coupling, together with the availability of widely used platforms such as GitHub, allows for smooth development between different organizations as well as open-source code development. Furthermore, you can develop code offline.
Because it was designed for software engineering, Git is only useful for versioning scripts and text files, and not for large (binary) files. For versioning larger files we use DVC.
DVC
DVC stands for Data Version Control and it does exactly that. It is built on top of Git and so it uses very similar commands, which can be both easy and confusing at the same time. DVC ensures your large files themselves are not tracked by Git. Instead, it keeps the large files in its own cache and writes small text files that store a unique "version code" (a hash) each time you commit to DVC. These text files then have to be committed to Git. This may seem a bit cumbersome, as you first have to commit data to DVC and then its version code to Git, but it allows you to utilize the full potential of both GitHub and cloud storage (such as Amazon S3, Google Cloud, Microsoft Azure etc.). The version history is all stored in Git and can be pushed to GitHub or GitLab, while the data itself can be stored on platforms more specialized in handling large files.
Installation
First install Git from this link. If installing on Windows it is useful to make sure Git is added to the %PATH% variable during installation. To test if the installation works, open a command line or shell window and test if the following command works:
If this works, install DVC from this link.
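Once both tools are installed, a quick check that they can be found from your shell could look like the sketch below. It assumes a POSIX-style shell such as Git Bash, and check_tool is a hypothetical helper name used only here. It only tests whether the executables are on the PATH, not whether they work correctly.

```shell
# Report whether a command can be found on the PATH.
# check_tool is a hypothetical helper name used only in this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

check_tool git
check_tool dvc
```

If both are found, git --version and dvc version print the installed versions.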
Software to use Git and DVC
You can run git from the regular Windows command line, but the Windows installation also comes with MinGW (Minimalist GNU for Windows) which can be useful.
MinGW
MinGW allows you to use Linux commands, as well as other cool stuff (e.g. compiling Fortran 90 with gfortran for free). Note that in it you have to specify paths with Linux-style forward slashes / instead of Windows-style backward slashes \.
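If you have copied a path from Windows Explorer, you can convert the slashes in the shell itself. The following is a minimal sketch; the path is only an example and does not refer to a real folder.

```shell
# Convert a Windows-style path to forward slashes with tr.
# The path below is only an example.
win_path='C:\Users\me\project\data'
unix_path=$(printf '%s\n' "$win_path" | tr '\\' '/')
echo "$unix_path"
```

The single quotes around the Windows path matter: they stop the shell from treating the backslashes as escape characters.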
We recommend defining a button for MinGW in Total Commander by clicking:
Configuration > Button Bar > Add
And then specify the path to git-bash.exe under "Command", e.g. C:\Program Files\Git\Git\git-bash.exe
Depending on your system configuration, you might need to add under "Parameters": --login -i
VSCode
VSCode has integrated Git support as well. This is very useful when you are mainly coding, but less so when you are changing model data, since the integrated Git support does not handle DVC. This documentation describes how to use it.
Tutorial
Several tutorials on Git and DVC are already available online. For Git, this is a useful no-nonsense guide. For DVC, there is this slightly more involved tutorial. However, because we want this document to be a bit more than a pointer to other tutorials, here's a quick guide that describes your typical workflow.
Starting a repository
You can start your own repository by first changing to the project directory and then initializing Git. Open your terminal of choice (e.g. PowerShell, cmd, bash) and run the following commands:
> cd path/to/folder
> git init
This will create a hidden .git folder.
To then initialize dvc:
> dvc init
This will create a hidden .dvc folder as well as a .dvc/.gitignore file, which tells Git to automatically ignore parts of the .dvc folder, for example the .dvc/cache folder, where DVC stores all versions of data files. We want Git to ignore these, as we do not want Git to keep track of gigabytes of data!
To show hidden files and folders (very useful), you can configure Total Commander by clicking:
Configuration > Options > Display > "Show hidden files"
For later convenience, you can add the file extensions of files you want to keep in DVC to the .gitignore file, e.g. *.idf. Or you could add your data folder (e.g. data) to this file to exclude everything in it by specifying data/.
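To see what such ignore rules do, you can try them in a throwaway repository. The sketch below assumes Git is installed. The file names are made up for illustration, and git check-ignore prints the paths that match an ignore rule.

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q

# Ignore all .idf files and everything under data/.
printf '%s\n' '*.idf' 'data/' > .gitignore

mkdir -p data
touch heads.idf data/rivers.nc script.py

# git check-ignore prints the paths that match an ignore rule.
ignored=$(git check-ignore heads.idf data/rivers.nc script.py)
echo "$ignored"
```

Only heads.idf and data/rivers.nc are listed; script.py is not ignored and would show up as untracked in git status.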
This recipe allows you to kickstart your project, creating already a smart folder structure, a readme and a .gitignore file.
Checking the status
One of the first things you should do once you have started your repository, is checking its status.
Change directories into the repo and call:
> git status
This shows in red the files that are untracked by Git or have changes that are not staged yet, and in green the files that are staged but have not been committed yet.
In order to check if any data files changed in DVC, in a similar way you call:
> dvc status
Checking differences
If you have modified one or multiple files, you can view lines you have changed per file with:
> git diff
This shows in green the lines that are new or different in the new version of the file, and in red the lines as they were in the old version or that have been removed.
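A small experiment in a throwaway repository shows what this output looks like. The sketch assumes Git is installed; the file name, its contents and the user name and e-mail are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
# Identity for this throwaway repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

echo "threshold = 1.0" > settings.txt
git add settings.txt
git commit -q -m "Added settings."

# Modify the file without staging it, then ask Git what changed.
echo "threshold = 2.0" > settings.txt
changes=$(git diff)
echo "$changes"
```

The old line is printed with a leading - and the new line with a leading +. Note that git diff only shows changes that are not staged yet; git diff --staged shows the changes you have already staged.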
Adding textfiles
So you have checked the status and will probably have seen that there is a .dvc/.gitignore file that is already staged (this was added by DVC). What DVC has done for us is the following:
> git add .dvc/.gitignore
This stages the .dvc/.gitignore file, which means you have selected a change to be included in the next commit. If you want to include a script or other text file in version control, you have to stage it yourself by calling the git add command.
Staging a script does not mean you have added it to version control yet! A file is only added to version control after you have committed your changes (see next section).
Committing changes
Before your files are added to version control, you have to commit them first. A commit can be seen as a "snapshot" of a project that is added to the version control and is therefore "safe".
Each commit should be accompanied by a message that specifies what you have changed. It is very important that these messages are short and informative, as they form your "journal entries" and will make searching your version history a lot easier.
In our example, you can commit the added .gitignore file to your version control by calling:
> git commit -m "Added file extensions to be ignored by Git."
The -m option allows you to specify a string that will be your commit message.
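The difference between staging and committing can be checked in a throwaway repository. This is a minimal sketch, assuming Git is installed; the script name and the user name and e-mail are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
# Identity for this throwaway repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

echo "print('hello')" > script.py

git add script.py
# "A " in the short status means: staged, but not yet committed.
staged=$(git status --porcelain)
echo "after add:    $staged"

git commit -q -m "Added example script."
# An empty short status means there is nothing left to commit.
after_commit=$(git status --porcelain)
n_commits=$(git rev-list --count HEAD)
echo "after commit: '$after_commit' ($n_commits commit)"
```

Only after the commit does the file appear in the history that git rev-list counts.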
But what if you accidentally made an error in your message? You can then change it using:
> git commit --amend
This will open up a text editor where you can alter your commit message.
The Windows installation of Git comes with a common Linux text editor called Vim. It is possible your Git is configured to automatically open Vim. Vim is very powerful, but has a steep learning curve. Closing Vim is especially confusing for beginners.
So to close Vim: first press ESC, then type a colon :. To save your changes, type x, or to discard them type q!. Finish by pressing Enter.
After committing files, do not naively rename them or move them around. Instead, you should use Git for that. See the next section, Moving/renaming scripts.
Moving/renaming scripts
Be careful when moving scripts around after they have been committed to Git. Even if the filename stays the same, Git does not necessarily understand that the file was moved: it sees one file being deleted and another file being created.
So in order to safely move or rename committed files, don't copy and paste them in Windows Explorer, but instead let Git do it for you:
> git mv script.py src/script.py
The same holds for renaming:
> git mv script.py renamed_script.py
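The effect of git mv can be verified in a throwaway repository. The sketch below assumes Git is installed and uses placeholder names. The src folder is created first, since git mv expects the destination folder to exist.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
# Identity for this throwaway repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

echo "print('hello')" > script.py
git add script.py
git commit -q -m "Added example script."

mkdir src
git mv script.py src/script.py
# The move is staged as a rename ("R") rather than a delete plus an add.
move_status=$(git status --porcelain)
echo "$move_status"
git commit -q -m "Moved script to src folder."

# --follow traces the history of the file across the rename.
history=$(git log --follow --format=%s -- src/script.py)
echo "$history"
```

Both commit messages are listed for src/script.py, so the history from before the move is still attached to the file.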
Adding and committing data files
We now know how to add text files, but not how to add large binary files: your data. For this we use DVC. To add data to data version control, e.g. all files in the folder data/rivers, you can add the folder:
> dvc add data/rivers
This creates a unique version code (hash) for all files in the folder and adds the data to the cache in the .dvc directory. The hash of the data is stored in a text file named data/rivers.dvc.
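The idea behind such a hash can be illustrated without DVC. The sketch below uses md5sum as a stand-in and assumes it is available, as it is on most Linux systems. It does not reproduce DVC's cache layout or the contents of a .dvc file, and the file names and values are made up.

```shell
demo=$(mktemp -d)
cd "$demo"

printf 'stage,discharge\n1.0,10.5\n' > rivers.csv
hash_v1=$(md5sum rivers.csv | cut -d ' ' -f 1)

# Changing the content changes the hash.
printf '1.2,14.0\n' >> rivers.csv
hash_v2=$(md5sum rivers.csv | cut -d ' ' -f 1)

# Identical content gives an identical hash, whatever the file name.
cp rivers.csv copy.csv
hash_copy=$(md5sum copy.csv | cut -d ' ' -f 1)

echo "v1:   $hash_v1"
echo "v2:   $hash_v2"
echo "copy: $hash_copy"
```

Because the hash depends only on the content, a short text file holding the hash is enough to pin down exactly which version of a large file belongs to a commit.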
We have however not included the data in the version log yet... DVC lets Git take care of keeping the version log, so that both code changes as well as data changes are in the same log.
It does this by creating .dvc files with the hashes, which we are expected to add to our Git repository. We therefore have to run the following commands to fully include the added data in our version control:
> git add -f data/rivers.dvc
> git commit -m "Added river data"
We added the -f option to the add command, just in case you added the data folder to your .gitignore file. This forces Git to still add the rivers.dvc file.
To summarize, the data backups are stored in the .dvc folder and the log of changes in the .git folder. This allows us to send the data separately to cloud storage geared towards large data files (Google Cloud, Amazon S3, Azure etc.), and to send the .git folder to GitHub or GitLab to have a nicely browsable version history.
Updating data
The commands to add changes made in the data are the same as to add new data:
> dvc add data/rivers
> git add -f data/rivers.dvc
> git commit -m "Changed river data"
Moving data
You can safely move data to a different location in your repo with DVC by calling:
> dvc move heads.nc data/heads.nc
> git add data/heads.nc.dvc
> git commit -m "Moved heads to data folder"
Viewing the version history
To view the results of your hard work of carefully maintaining version control, you can call:
> git log
This shows the version history in text form. This is useful for a quick lookup of the last commits, but becomes cumbersome if you want to go back a lot further. There are quite a few graphical user interfaces available to view the version history, one of which is already included with your Windows installation. You can start it by calling:
> gitk
This is a very lightweight interface, and fancier ones exist. Furthermore, when pushing to a remote (see the section Pushing to remote), GitHub and GitLab have very good interfaces to view the commit history as well.
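If you prefer to stay in the terminal, git log also has options for a more compact view. The sketch below builds a small throwaway history to try them on. It assumes Git is installed; the file name, commit messages and user details are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
# Identity for this throwaway repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

for i in 1 2 3; do
    echo "step $i" > notes.txt
    git add notes.txt
    git commit -q -m "Step $i"
done

# One line per commit, newest first: abbreviated hash plus message.
git log --oneline

# Only the messages of the two most recent commits.
recent=$(git log -n 2 --format=%s)
echo "$recent"
```

Short, informative commit messages pay off here, since they are all you see in the compact view.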
Reverting and resetting commits
Say you added a commit that was not going to be used, and want to go back. Git provides multiple options to do this safely, and some to do this not so safely.
Say you want to undo a commit, but want to keep a record of that commit in the version history. You can revert the last commit you made with:
> git revert HEAD
HEAD refers to the last commit you made on the current branch. Every commit is identified by a hash, which you can look up by calling git log. Say the hash of the last commit currently is j1c13377c6d4adfhcc69c6ac7b51e919b15a65c4. We could then do the same by calling:
> git revert j1c13377c6d4adfhcc69c6ac7b51e919b15a65c4
If you now call git log again, you can see that the revert of this specific commit is also stored as a new commit. You can always revert the revert as well, if you suddenly decide you actually needed those changes.
Sometimes you commit something which in hindsight is so stupid, you neither want to bother other people with it, nor yourself again. In this case you can reset the repository to a previous state. Say you want to reset to the state before your last commit (basically undoing the last commit), you can call:
> git reset --hard HEAD~1
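HEAD~1 refers to the second-to-last commit (the parent of HEAD). In this way all commits after the second-to-last commit are removed from the branch, and any uncommitted changes to tracked files are discarded as well, so be careful.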
Note that for git revert we refer to the commit we want to undo, whereas for git reset we refer to the commit before this, because git reset wants to know the state we want to reset to.
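The difference between the two commands is easiest to see by counting commits in a throwaway repository. This is a minimal sketch, assuming Git is installed; the file name, contents and user details are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
# Identity for this throwaway repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "First version."
echo "version 2" > notes.txt
git commit -q -am "Second version."

# Revert: adds a third commit that undoes the second one.
git revert --no-edit HEAD >/dev/null
after_revert=$(git rev-list --count HEAD)
content_after_revert=$(cat notes.txt)

# Reset: drops the revert commit from the branch; two commits remain.
git reset -q --hard HEAD~1
after_reset=$(git rev-list --count HEAD)
content_after_reset=$(cat notes.txt)

echo "after revert: $after_revert commits, file says '$content_after_revert'"
echo "after reset:  $after_reset commits, file says '$content_after_reset'"
```

The revert makes the history longer, while the reset makes it shorter. That is why a reset is the riskier choice once commits have been shared with others.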
If you are also resetting/reverting data changes added to DVC, the data referred to in the last git commit still has to be put in your workspace. To do this call:
> dvc checkout
This will make DVC check all data referred to in the .dvc files again, recalculating hashes, which is a slow process. If you know which files are changed (e.g. 'heads.nc'), you can speed up things considerably by referring to the specific files:
> dvc checkout heads.nc
Pushing to remote
Up to now you have learned how to safely store scripts and data locally, but what if your hard drive crashes? Or if you accidentally shift+delete your whole repository? In those cases you will wish you had also stored your repository somewhere else! That's why Git, as well as DVC, allows you to push your repository to a remote.
This has two benefits:
- Your repository is safely stored somewhere else
- Your colleagues can also access your repository and collaborate on Gitlab/Github.
We first have to configure Git to know the location of your remote and give it a name. A commonly used name for a remote is "origin".
> git remote add origin https://github.com/username/repo.git
You can list the remotes you have added with:
> git remote -v
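After you have added your remote, you can push your code to it. If Git reports that the current branch has no upstream branch, run git push -u origin <branch-name> once; after that you can push by calling: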
> git push
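You can practise this workflow without a GitHub or GitLab account. In the sketch below, a bare repository on your own disk plays the role of the server. This is an assumption made for illustration only; a real remote would be a URL like the one above. Git must be installed, and the user details are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"

# A bare repository on disk plays the role of the server.
git init -q --bare remote.git

git init -q work
cd work
# Identity for this throwaway repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"
echo "print('hello')" > script.py
git add script.py
git commit -q -m "Added example script."

git remote add origin "$demo/remote.git"
# -u records origin as the upstream, so later a plain "git push" suffices.
git push -q -u origin HEAD

local_head=$(git rev-parse HEAD)
remote_head=$(git ls-remote --heads origin | cut -f 1)
echo "local:  $local_head"
echo "remote: $remote_head"
```

After the push, the hash of the last commit is the same locally and on the remote, which is a simple way to confirm that the remote is up to date.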
DVC behaves very similarly, except that the location of the DVC remote is stored in a file named .dvc/config that has to be added to Git. Say we want to add a Google Drive folder as a remote named "google-drive":
> dvc remote add google-drive gdrive://some-hash-number-here
> git add .dvc/config
> git commit -m "Added google-drive as dvc remote"
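You can then push to google-drive by calling the command below. The -r option selects the remote; you can leave it out if you have set a default remote, e.g. by adding the remote with dvc remote add -d.
> dvc push -r google-drive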
Cloning an existing repository
It is also possible to easily create a local copy of an existing repository. You can tell Git to clone a remote repository into a new folder inside the current directory with:
> git clone <http-address>
For example, to clone the California model repository, you can call:
> git clone https://gitlab.com/deltares/imod/california_model.git
This might require a GitLab account with two-factor authentication.
Your organization might have strict security settings and you might run into: SSL certificate problem: self signed certificate in certificate chain github. You can follow these instructions to fix this.
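Cloning can also be tried offline, because git clone accepts a path to a repository on disk as well as a web address. The sketch below first builds a small source repository and then clones it. It assumes Git is installed, and all names are placeholders.

```shell
demo=$(mktemp -d)
cd "$demo"

# Build a small source repository to clone from.
git init -q source
git -C source config user.name "Example User"
git -C source config user.email "user@example.com"
echo "model notes" > source/README.txt
git -C source add README.txt
git -C source commit -q -m "Added readme."

# Clone it; the copy contains the files and the full history.
git clone -q source copy
n_commits=$(git -C copy rev-list --count HEAD)
origin_url=$(git -C copy remote get-url origin)
echo "$n_commits commit(s), origin = $origin_url"
```

The clone automatically gets a remote named origin that points back to where it was cloned from, so you do not have to run git remote add yourself.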
Further reading
For Git commands, you can also take a look at this useful cheat sheet.