Introduction
This Code-Along was made for a talk on Datalad in the HPC-NRW seminar. It consists of two parts: the first covers some basic usage, while the second deals with the usage of Datalad on a cluster. The exercises shown here are self-made; the setup for the second example is based on this paper.
Furthermore, the author would like to acknowledge other helpful resources for working with Datalad, in particular the Datalad handbook and the cheatsheet, as well as the documentation for git-annex.
Docker setup
If you want to follow along in an encapsulated Docker environment, feel free to use the following configuration:
FROM fedora:40
WORKDIR /home/root
RUN dnf update -y
RUN dnf install -y wget datalad pip vim
ARG user=dataladuser
ARG group=dataladuser
ARG uid=1000
ARG gid=1000
RUN groupadd -g ${gid} ${group}
RUN useradd -u ${uid} -g ${group} -m ${user} # <--- the '-m' creates a user home directory
# Switch to user
USER ${uid}:${gid}
WORKDIR /home/dataladuser
RUN git config --global user.email "j_kuhl19@uni-muenster.de"
RUN git config --global user.name "Justus Kuhlmann"
RUN pip3 install numpy scipy matplotlib
Copy this into a Dockerfile (be sure to change the git config parameters), then use
docker build -t datalad .
docker run -it datalad
to build and run the container.
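If you additionally want your work to survive the container, you can mount a host directory into it; the directory name work used here is just an example:
docker run -it -v "$(pwd)/work:/home/dataladuser/work" datalad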
Part 1: Datalad basics
Install datalad
Install Datalad in whichever way works for you. One option is the installer package:
pip3 install datalad-installer
datalad-installer
which helps with installing Datalad and its dependencies. On Fedora, you can use
dnf install datalad
On Ubuntu, you can use
apt install git git-annex
pip3 install datalad
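Whichever way you choose, you can check that the installation worked with
datalad --version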
Create a new dataset
We have data that someone produced (maybe it was us, maybe one of our collaborators) and we only need to analyse it. Practically, this data is already published as a Datalad dataset that we will download in just a moment. To speed things up a little, I have already prepared a Python script for you to use for the analysis.
For our analysis, we want to make a new dataset in which we import the raw data and analyse it with the Python script.
datalad create analysis
cd analysis
Add content
Imagine you already have some kind of script that you have written. An already prepared example can be downloaded from here:
wget -O script.py https://raw.githubusercontent.com/jkuhl-uni/datalad-code-along/master/src/files/script.py
By downloading it, we have changed something in a DataLad-tracked folder. Analogous to a git repository, we commit the changes:
datalad save -m "add analysis script"
The analysis script is to be used as
python3 ./script.py <data-file> <output-file>
Add another dataset
Next, we want to download the data that we want to analyse:
datalad install https://github.com/jkuhl-uni/test-data.git
datalad save -m "add data as sub-dataset"
Run the analysis script
Let's try to run it with the first data file:
python3 ./script.py test-data/data1.out ./plot1.pdf
We find that this does not work. This is because, so far, our installation of the data dataset only contains placeholder files, not the actual file contents.
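A quick way to see this (on most setups, where annexed files are represented as symlinks) is to list the file; the placeholder shows up as a dangling symlink into the annex:
ls -l test-data/data1.out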
We run
datalad get test-data/data1.out
to retrieve the data. Then we run the script again:
python3 ./script.py test-data/data1.out ./plot1.pdf
Now it works!
We can have a look at the plot. By creating it, we have changed the analysis dataset, so Datalad should also record these changes:
datalad save -m "add a first plot"
Cool, we have a new file in the dataset. But people won't know how we created it.
Use datalad run and datalad rerun
Instead, we can use datalad run to make this more reproducible.
datalad run -i script.py -i test-data/data2.out -o plot2.pdf -m "add a second plot with datalad run" "python3 ./script.py test-data/data2.out plot2.pdf"
Nice, this looks pretty easy.
We can have a look at the history of the repository as usual with git log [--oneline].
Alright. Now imagine we just had a stroke and forgot how the heck we got the second plot. Luckily, we ran this with datalad run. To redo the changes applied with this commit, we run
datalad rerun <last-commit-id>
to re-do the plot we just made.
Part 2: Example for cluster usage of Datalad
This part of the code-along is largely based on the structure found in this paper and has been adapted for this tutorial.
In this part, we try to emulate the usage of Datalad on a cluster on our own computer. We keep working with the analysis dataset that we just created. First, inside the analysis dataset, we create a RIA store as a storage sibling and push to it; then we clone the dataset to a new location, the sink:
datalad create-sibling-ria -s ria --new-store-ok --storage-sibling only ria+file://../ria
datalad push --to ria
cd ..
datalad clone analysis analysis_sink
After the processing of our data, we want the new dataset to hold both the original data and the processed data. In our case, the processed data (the PDFs) is not that big; however, we can imagine situations where each of the added files is multiple gigabytes of data. This is why we opened a ria store above: a Datalad store that only holds the data and is not tracked by git.
Now, if we look at the known siblings, we find that Datalad is aware of the ria store. Next, we imagine that we want to process the rest of the data in parallel. We can parallelise this "naively", i.e. we simply run the processing as separate jobs. In this example, we will use a wrapper script that can later serve as the basis for a SLURM job script (a sketch of such a script is shown at the end of this part).
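To check this yourself, you can list the siblings; the -d option points Datalad at the analysis dataset, since we are currently one directory above it:
datalad siblings -d analysis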
cd analysis
vim jobscript.sh
The heart of the job script will be the datalad run command from earlier.
The overall script looks something like this:
set -e -u -x
MYSOURCE=</absolute/path/to/the/dataset>
WORKDIR=/tmp/
MYSINK=</absolute/path/to/the/sink>
LOCK=~/dataset.lock
number=${1} # this can be replaced by something like ${SLURM_ARRAY_TASK_ID} on a SLURM managed cluster
datalad clone ${MYSOURCE} ${WORKDIR}/ds_${number}
cd ${WORKDIR}/ds_${number}
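# add the sink dataset as a git remote and work on a job-specific branch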
git remote add sink ${MYSINK}
git checkout -b calc_${number}
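# install the subdataset(s) and enable the sibling that holds the actual data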
datalad install -r .
datalad siblings -d test-data enable -s uniS3
datalad run -i script.py -i test-data/data${number}.out -o plot${number}.pdf -m "add a second plot with a wrapper script" "python3 ./script.py test-data/data${number}.out plot${number}.pdf"
# publish
datalad push --to ria
flock ${LOCK} git push sink
Let's go through that line by line. To make our lives a little easier, we first define some bash variables.
MYSOURCE=</absolute/path/to/the/dataset>
WORKDIR=/tmp/
MYSINK=</absolute/path/to/the/sink>
LOCK=~/dataset.lock
number=${1}
They hold the most basic things that we will need later in the execution of our code: MYSOURCE gives the location of the original dataset, WORKDIR the temporary directory that we do our calculations in, MYSINK the location of the sink dataset, and LOCK the lock file we will use when pushing. number is just a placeholder for some command-line argument.
Now the script actually starts.
First, we clone the original dataset to a temporary location and change into it:
datalad clone ${MYSOURCE} ${WORKDIR}/ds_${number}
cd ${WORKDIR}/ds_${number}
Next, we add the sink as a git remote (this is the corresponding line from the script above):
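git remote add sink ${MYSINK}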
Now comes the part that all of this is about.
With git checkout -b calc_${number}, we switch to a new git branch.
Then, the datalad run command from above is executed.
Before the end of the script, we push the data to the ria store and make the new branch known to the sink dataset. Since git does not like it if we push multiple commits at the exact same time, we use flock to make sure we are not interfering with another push.
In principle, if the temporary directory is not wiped automatically, we also need to take care of cleaning up behind ourselves. This would be done with
datalad drop *
cd ${WORKDIR}
rm -rf ${WORKDIR}/ds_${number}
Once we have fought our way out of vim, we save the changes to the dataset:
datalad save -m "added jobscript"
Cool, we have added the jobscript to the dataset. Now, we also make the sink aware of this change:
cd ../analysis_sink
datalad update --how=merge
We change back to the original repository and execute our script:
cd ../analysis
bash jobscript.sh 3
Okay, if this looks good, we can change to our sink again and check whether we indeed have a new branch tracked by git:
cd ../analysis_sink
git branch
To incorporate the change into our dataset, we have to do two things: first, we merge the new branch into the master branch, and then we delete the branch after merging:
git merge calc_3
git branch -d calc_3
We can check whether the new plot is in the sink with ls.
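For example (on most setups, the -l makes it visible that the plot is still only a symlink into the annex at this point):
ls -l plot3.pdf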
If we try to retrieve the data with datalad get plot3.pdf, we see that this is not possible yet.
This is because the git annex here is not yet aware that we pushed the data to our ria store. To change that, we use
git annex fsck -f ria
and try to get the data again using
datalad get plot3.pdf
Et voilà, there it is. Now, we can also execute our wrapper "in parallel" in our shell:
cd ../analysis
for i in {4..7}; do ((bash jobscript.sh $i) &) ; done
and collect the different branches with
git merge calc_4
git merge calc_5
git merge calc_6
git merge calc_7
Finally, if we do not want to do this for each branch separately, we can use
flock ../dataset.lock git merge -m "Merge results from calcs" $(git branch -l | grep "calc" | tr -d ' ')
git annex fsck -f ria
flock ../dataset.lock git branch -d $(git branch -l | grep "calc" | tr -d ' ')
where the flock is only necessary if our cluster might still be working on some jobs.
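As an outlook: the wrapper was written so that it can serve as the basis for a SLURM job script. A minimal sketch of such a script is shown below; the job name, partition, time limit and array range are placeholders and will depend on your cluster:
#!/bin/bash
#SBATCH --job-name=datalad-analysis
#SBATCH --partition=<your-partition>  # placeholder, depends on your cluster
#SBATCH --time=00:10:00
#SBATCH --array=4-7                   # one array task per data file
# each array task runs one instance of the wrapper from above
bash jobscript.sh ${SLURM_ARRAY_TASK_ID}
Submitted with sbatch, each array task then processes one data file, and the resulting branches are collected in the sink exactly as shown above.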