3. Challenge: Provenance Capture

You can always get help

To list the available DataLad commands, run datalad --help. To learn more about a specific command, run datalad <subcommand> --help.

3.1. Challenge 1:

Create a DataLad dataset called iyoda, applying a specific post-creation routine called yoda (-c yoda).

Okidoki!

Creating the dataset

$ datalad create -c yoda iyoda
[INFO] Running procedure cfg_yoda
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-103-provenance/iyoda (dataset) [VIRTUALENV/bin/python /home...]
create(ok): /home/me/challenges/102-103-provenance/iyoda (dataset)

Run the command gitk:

  • What did the “YODA” setup actually do?

  • How do we know that the data module should go into inputs/?

Add the following dataset as a subdataset called inputs: https://github.com/datalad-handbook/iris_data. Inspect its history with gitk. When was it made, what does it contain?

Show me the solution

$ cd iyoda
$ datalad clone -d . https://github.com/datalad-handbook/iris_data.git inputs
[INFO] Remote origin not usable by git-annex; setting annex-ignore
install(ok): inputs (dataset)
add(ok): inputs (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)

Hint: To investigate the subdataset's history, enter the subdataset first.

Inside of iyoda, create a script code/extract.py with the following content:

from os.path import join as opj
import csv
with open(opj('inputs', 'iris.csv')) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['variety'] != 'Setosa':
            continue
        print(row['petal.length'])

On a Unix-like shell, one way to create the file is a here-document:

$ cat << EOT > code/extract.py

from os.path import join as opj
import csv
with open(opj('inputs', 'iris.csv')) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        if row['variety'] != 'Setosa':
            continue
        print(row['petal.length'])

EOT

This Python script prints the petal length of every row whose variety is Setosa. Be careful: Python requires consistent indentation, with either tabs or spaces, but not a mix of both!
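The same filter can be wrapped in a function and tried on a few rows before touching the real data. This is only a minimal sketch: the function name setosa_petal_lengths and the three sample rows are made up for illustration; only the column names variety and petal.length come from the script above.

```python
import csv
import io


def setosa_petal_lengths(csvfile):
    """Return the petal.length value (as a string) of every Setosa row."""
    reader = csv.DictReader(csvfile)
    return [row['petal.length'] for row in reader if row['variety'] == 'Setosa']


# Made-up sample rows (not taken from iris.csv), reduced to the two
# columns the script actually reads.
sample = io.StringIO(
    'petal.length,variety\n'
    '1.4,Setosa\n'
    '4.7,Versicolor\n'
    '1.3,Setosa\n'
)
lengths = setosa_petal_lengths(sample)
print(lengths)
```

Because csv.DictReader returns every field as text, the values stay strings unless they are converted explicitly.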

You can test your script with $ python code/extract.py

Beware of path semantics on Windows

On Windows, test the script with:

$ python code\extract.py
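Only the command line needs a different separator here. The script itself builds its path with os.path.join, which uses the separator of the platform it runs on. The sketch below makes that visible by calling the two platform-specific implementations directly (posixpath and ntpath, both from the standard library); the variable names are chosen for illustration.

```python
import ntpath      # what os.path is on Windows
import posixpath   # what os.path is on Linux and macOS

parts = ('inputs', 'iris.csv')

posix_path = posixpath.join(*parts)
windows_path = ntpath.join(*parts)

print(posix_path)
print(windows_path)
```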

If there are no errors, save the script.

Tell me how!

$ datalad save -m "Save data extraction script"
add(ok): code/extract.py (file)
save(ok): . (dataset)

Try to figure out why there was no output when running the script.

[ … ]

space for a dramatic pause

[ … ]

Retry running the script after getting content from the subdataset.

Right, let’s go!

To retrieve contents from the subdataset run:

$ datalad get inputs
get(ok): inputs/iris.csv (file) [from web...]
action summary:
  get (notneeded: 1, ok: 1)
$ python code/extract.py
1.4
1.4
1.3
1.5
1.4
1.7
1.4
1.5
1.4
1.5
1.5
1.6
1.4
1.1
1.2
1.5
1.3
1.4
1.7
1.5
1.7
1.5
1
1.7
1.9
1.6
1.6
1.5
1.4
1.6
1.6
1.5
1.5
1.4
1.5
1.2
1.3
1.4
1.3
1.5
1.3
1.3
1.3
1.6
1.9
1.4
1.6
1.4
1.5
1.4
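Why did datalad get make the difference? On file systems that support symbolic links, git-annex typically represents an annexed file as a symlink into the dataset's annex, and that link dangles until the content has been retrieved. A check along those lines might look like the sketch below. The helper name content_is_missing and the temporary paths are made up, and the check does not apply to datasets in adjusted mode (common on Windows), where no symlinks are used.

```python
import os
import tempfile


def content_is_missing(path):
    """True if path is a symlink whose target does not exist (yet)."""
    return os.path.islink(path) and not os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for an annexed file and its content location.
    target = os.path.join(tmp, 'annexed-content')
    link = os.path.join(tmp, 'iris.csv')
    os.symlink(target, link)

    before_get = content_is_missing(link)   # link dangles: no content yet

    # Simulate the content arriving, as it would after `datalad get`.
    with open(target, 'w') as handle:
        handle.write('placeholder\n')

    after_get = content_is_missing(link)

print(before_get, after_get)
```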

Now that the script works as intended, run it and redirect its output into a file for further processing:

$ python code/extract.py > outputs.dat

Beware of path semantics on Windows

On Windows, run the script with:

$ python code\extract.py > outputs.dat

Check the dataset state and save the modification. Inspect the change record:

  • What information is captured?

  • Imagine yourself in a year. What information would you be missing?

Let’s take a look

$ datalad status
untracked: outputs.dat (file)

and save:

$ datalad save -m "Create the desired setosa variety petal length data file"
add(ok): outputs.dat (file)
save(ok): . (dataset)

Run the script again, but through DataLad, and declare inputs and outputs. This time, save the output file as plength.txt. Use gitk to inspect the change record. What is different now?

Let’s take a look

$ datalad run -i inputs/iris.csv -o plength.txt "python code/extract.py > {outputs}"
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-103-provenance/iyoda (dataset) [python code/extract.py > plength.txt]
add(ok): plength.txt (file)
save(ok): . (dataset)

But beware on Windows!

Paths strike again!

$ datalad run -i inputs\iris.csv -o plength.txt "python code\extract.py > {outputs}"
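When inspecting the change record, note that datalad run stores a machine-readable run record in the commit message, which is what makes a later re-execution possible. The sketch below parses a hand-written sample modeled on that layout and expands the {outputs} placeholder the way the run(ok) line above shows. The sample message, the all-zero dsid, and the helper name extract_run_record are illustrative assumptions; the exact fields can differ between DataLad versions, and DataLad's own placeholder handling is more elaborate than a plain str.format call.

```python
import json

# Hand-written sample in the style of a `datalad run` commit message.
# The dsid is a placeholder, not a real dataset ID.
message = """[DATALAD RUNCMD] python code/extract.py > plength.txt

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python code/extract.py > {outputs}",
 "dsid": "00000000-0000-0000-0000-000000000000",
 "exit": 0,
 "extra_inputs": [],
 "inputs": ["inputs/iris.csv"],
 "outputs": ["plength.txt"],
 "pwd": "."
}
^^^ Do not change lines above ^^^
"""

START = '=== Do not change lines below ==='
END = '^^^ Do not change lines above ^^^'


def extract_run_record(commit_message):
    """Return the JSON run record found between the two marker lines."""
    body = commit_message.split(START, 1)[1].split(END, 1)[0]
    return json.loads(body)


record = extract_run_record(message)

# Simplified placeholder expansion, for illustration only.
expanded = record['cmd'].format(outputs=' '.join(record['outputs']))

print(record['inputs'], record['outputs'])
print(expanded)
```

Compared with the plain datalad save earlier, the command, its inputs, and its outputs now travel with the commit instead of living only in someone's memory.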

Finally, force DataLad to lose the file plength.txt. Re-execute the provenance record. Afterwards, check what has changed.

Final stretch now

First, drop recklessly:

$ datalad drop --reckless availability plength.txt
drop(ok): plength.txt (file)

Then, rerun:

$ datalad rerun HEAD
[INFO] run commit 3f4aa8f; (python code/extra...)
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run.remove(ok): plength.txt (file) [Removed file]
run(ok): /home/me/challenges/102-103-provenance/iyoda (dataset) [python code/extract.py > plength.txt]
add(ok): plength.txt (file)
action summary:
  add (ok: 1)
  get (notneeded: 3)
  run (ok: 1)
  run.remove (ok: 1)
  save (notneeded: 2)

What changed?

$ datalad status
nothing to save, working tree clean
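One plausible reading of the clean status: the rerun wrote exactly the same bytes as before, and because git-annex identifies file content by a checksum-based key, identical bytes lead to an identical key and nothing new to record. The sketch below illustrates only that idea, using MD5 (the hash behind DataLad's default MD5E backend). The byte strings and the helper name content_digest are made up, and real annex keys also carry size and file-extension information.

```python
import hashlib


def content_digest(data):
    """Hex digest of the given bytes; same bytes always give the same digest."""
    return hashlib.md5(data).hexdigest()


# Made-up file contents standing in for plength.txt.
first_run = b'1.4\n1.4\n1.3\n'
rerun = b'1.4\n1.4\n1.3\n'
changed = b'1.4\n1.4\n1.5\n'

same = content_digest(first_run) == content_digest(rerun)
different = content_digest(first_run) != content_digest(changed)

print(same, different)
```

Had the script or the input data changed in between, the digest would differ and datalad rerun would have saved a new version of plength.txt.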