1. Challenge: DataLad Datasets

You can always get help

To learn about available DataLad commands, run datalad --help. To learn more about a specific command, run datalad <subcommand> --help.

1.1. Challenge 1

Create a dataset called my-dataset on your computer. Inside of the dataset, run the command gitk and explore the history it displays.

Can you find:

  • the dataset identifier?

  • the version label?

  • the dataset creator?

  • the dataset creation date?

Afterwards, run the command gitk --all. What is the difference from before?

Show me how to do it

To create a new dataset, run:

$ datalad create my-dataset
create(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset)
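
As for the difference gitk --all makes: plain gitk shows only the history of the checked-out branch, while --all also displays other branches, including the git-annex branch that DataLad's annex uses for availability metadata. The same distinction can be sketched in a throwaway plain-Git repository (the branch name below is made up for illustration):

```shell
# Throwaway repo: one commit on the initial branch, one on a second branch
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
git config user.email you@example.net && git config user.name You
git commit -q --allow-empty -m "on the initial branch"
git checkout -q -b demo-branch     # stands in for, e.g., the git-annex branch
git commit -q --allow-empty -m "on demo-branch"
git checkout -q -                  # back to the initial branch
git log --oneline                  # current branch only: one commit
git log --oneline --all            # all branches: both commits (what gitk --all shows)
```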

Finally, remove the dataset.

How do I do that?

To remove it, run datalad drop (see its manual page). Importantly, this command needs to run from outside of the dataset.

$ datalad drop --what all -d my-dataset --reckless availability
uninstall(ok): . (dataset)

1.2. Challenge 2

Text files are digital files containing plain text. Take a minute to think:

  • Why is it often useful to keep text files out of git-annex?

  • On the other hand, what could be a reason to annex text files?

Tell me!

Why is it useful to keep text files out of git-annex?

  • To make editing easier (no need to unlock)

  • To have a nicer Git history (commits can show differences between file revisions)

  • To distribute the file automatically with every clone (unlike annexed files, files kept in Git have their content readily available in shared dataset clones)

What could be a reason to annex text files?

  • To keep file contents private/secret (annexing files allows access control)

  • An unusually large text file (at least dozens of MB)

Create a DataLad dataset called text2gitdataset and configure it to never annex text files (there are several ways to do this!).

Ok, show me the ways!

1. Right at dataset creation

$ datalad create -c text2git text2gitdataset
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/text2gitdataset (dataset) [VIRTUALENV/bin/python /home...]
create(ok): /home/me/challenges/102-101-dataset/text2gitdataset (dataset)

2. After dataset creation, with the datalad run-procedure command (see its manual page)

$ datalad create text2gitdataset-2
create(ok): /home/me/challenges/102-101-dataset/text2gitdataset-2 (dataset)
$ cd text2gitdataset-2
$ datalad run-procedure cfg_text2git
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/text2gitdataset-2 (dataset) [VIRTUALENV/bin/python /home...]

3. By editing .gitattributes by hand

$ datalad create text2gitdataset-3
create(ok): /home/me/challenges/102-101-dataset/text2gitdataset-3 (dataset)
$ cd text2gitdataset-3
$ echo "* annex.largefiles=((mimeencoding=binary)and(largerthan=0))" >> .gitattributes
$ datalad save -m "configure dataset to keep text files in Git"
add(ok): .gitattributes (file)
save(ok): . (dataset)
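
You can check which annex.largefiles rule Git would apply to a given path with git check-attr. A quick sketch in a fresh throwaway repository (README.md is just an example path; no commit is needed for the attribute lookup):

```shell
# Sketch: query the annex.largefiles attribute for an arbitrary path
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
echo '* annex.largefiles=((mimeencoding=binary)and(largerthan=0))' > .gitattributes
git check-attr annex.largefiles -- README.md
# prints: README.md: annex.largefiles: ((mimeencoding=binary)and(largerthan=0))
```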

In the end, remove the datasets.

Can you show me again?

Clean-up:

$ datalad drop -d text2gitdataset --what all --reckless availability
uninstall(ok): . (dataset)
$ datalad drop -d text2gitdataset-2 --what all --reckless availability
uninstall(ok): . (dataset)
$ datalad drop -d text2gitdataset-3 --what all --reckless availability
uninstall(ok): . (dataset)

1.3. Challenge 3

Version controlling a file means recording its changes over time, associating those changes with an author, date, and identifier, creating a lineage of file content, and being able to revert changes or restore previous file versions. DataLad datasets can version control their contents, regardless of size.

Create a new dataset my-dataset that is configured to store text files in Git (see the previous challenge) and add a README.md file with some content. Make sure to save it with a helpful commit message, and inspect your dataset's revision history.

Let’s go!

Create the dataset and cd into it:

$ datalad create -c text2git my-dataset
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset) [VIRTUALENV/bin/python /home...]
create(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset)
$ cd my-dataset

Create a text file and save it (you can also create the file with an editor of your choice, e.g., vim):

$ echo "# Example Dataset" > README.md
$ datalad status
untracked: README.md (file)
$ datalad save -m "add a README to the dataset"
add(ok): README.md (file)
save(ok): . (dataset)

Check the dataset’s history:

$ git log
commit 2b505f91✂SHA1
Author: Elena Piscopia <elena@example.net>
Date:   Tue Jun 18 16:13:00 2019 +0200

    add a README to the dataset

commit 5ec4dab1✂SHA1
Author: Elena Piscopia <elena@example.net>
Date:   Tue Jun 18 16:13:00 2019 +0200

    Instruct annex to add text files to Git

commit 1e658827✂SHA1
Author: Elena Piscopia <elena@example.net>
Date:   Tue Jun 18 16:13:00 2019 +0200

    [DATALAD] new dataset

Run gitk again. Can you find the dataset modification date?

Finally, edit the README and save it again.

Let’s go!

$ echo "This is my example dataset" >> README.md
$ datalad save -m "Add redundant explanation"
add(ok): README.md (file)
save(ok): . (dataset)
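
Plain Git can show exactly what each save changed. The following sketch replays the two README commits in a throwaway repository; in the real dataset only the last two commands are needed:

```shell
# Throwaway repo reproducing the two README commits from this challenge
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
git config user.email elena@example.net && git config user.name Elena
echo "# Example Dataset" > README.md
git add README.md && git commit -q -m "add a README to the dataset"
echo "This is my example dataset" >> README.md
git add README.md && git commit -q -m "Add redundant explanation"
git log --oneline -- README.md      # both commits that touched the file
git diff HEAD~1 HEAD -- README.md   # the line added by the second save
```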

1.4. Challenge 4

Download and save the following set of penguin images, available at the URLs below, into a dataset:

  • https://hub.datalad.org/edu/penguins/media/branch/main/examples/adelie.jpg

  • https://hub.datalad.org/edu/penguins/media/branch/main/examples/chinstrap.jpg

You can reuse the dataset from the previous challenge, or create a new one. Can you do the download while recording provenance?

Give me a hint about provenance

Try using datalad download-url or datalad-next’s download command, combined with datalad run (see their manual pages).

Show me the entire solution

You can download a file and save it manually:

$ wget -q -O adelie.jpg "https://hub.datalad.org/edu/penguins/media/branch/main/examples/adelie.jpg"
$ datalad save -m "Add manually downloaded images"
add(ok): adelie.jpg (file)
save(ok): . (dataset)

Or download it recording its origin as provenance:

$ datalad run -m "Add image from the web" "datalad download 'https://hub.datalad.org/edu/penguins/media/branch/main/examples/chinstrap.jpg'"
[INFO] == Command start (output follows) =====
download(ok): chinstrap.jpg
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset) [datalad download 'https://hub.datalad.o...]
add(ok): chinstrap.jpg (file)
save(ok): . (dataset)

Run gitk in the dataset. Can you find the file identifier of any of the newly downloaded files?

1.5. Challenge 5

Other than creating datasets on your own, DataLad also allows you to clone existing datasets. Clone and explore the dataset from the following publication:

> Wittkuhn, L., Schuck, N.W. Dynamics of fMRI patterns reflect sub-second activation sequences and reveal replay in human visual cortex. Nat Commun 12, 1795 (2021). https://doi.org/10.1038/s41467-021-21970-2

You can find it at https://github.com/lnnrtwttkhn/highspeed-analysis.

Show me how to clone it

$ datalad clone https://github.com/lnnrtwttkhn/highspeed-analysis.git
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] access to 1 dataset sibling keeper not auto-enabled, enable with:
| 		datalad siblings -d "/home/me/challenges/102-101-dataset/highspeed-analysis" enable -s keeper
install(ok): /home/me/challenges/102-101-dataset/highspeed-analysis (dataset)

Explore the dataset:

  • When was it created?

  • When was it last updated?

  • How many contributors does it have?

  • How much annexed file content does it contain?

  • How many subdatasets are there?

Let’s compare explorations

When was it created?

$ cd highspeed-analysis
# first commit
$ git log $(git rev-list --max-parents=0 HEAD)
commit 34eeffcc✂SHA1
Author: Lennart Wittkuhn <wittkuhn@mpib-berlin.mpg.de>
Date:   Thu Nov 5 08:45:43 2020 +0100

    [DATALAD] new dataset
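
The git rev-list --max-parents=0 HEAD trick works because the root commit is the only commit without a parent. A quick sketch in a throwaway repository:

```shell
# Throwaway repo with two commits; --max-parents=0 picks out the parentless root
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
git config user.email you@example.net && git config user.name You
git commit -q --allow-empty -m "root commit"
git commit -q --allow-empty -m "a later commit"
git rev-list --max-parents=0 HEAD                              # hash of the root
git log -1 --format=%s "$(git rev-list --max-parents=0 HEAD)"  # prints: root commit
```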

When was it last updated?

# most recent commit
$ git show
commit 5ed32fa7✂SHA1
Author: Lennart Wittkuhn <wittkuhn@mpib-berlin.mpg.de>
Date:   Mon Aug 23 10:24:36 2021 +0200

    update state of input subdatasets in /data

diff --git a/data/bids b/data/bids
index 5dd8eb8..4b59024 160000
--- a/data/bids
+++ b/data/bids
@@ -1 +1 @@
-Subproject commit 5dd8eb86✂SHA1
+Subproject commit 4b59024c✂SHA1
diff --git a/data/decoding b/data/decoding
index 332d94a..a05eb37 160000
--- a/data/decoding
+++ b/data/decoding
@@ -1 +1 @@
-Subproject commit 332d94a2✂SHA1
+Subproject commit a05eb37d✂SHA1

How many contributors does it have?

# contributions by contributor
$ git shortlog -s
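
Since git shortlog -s prints one line per author with their commit count, piping it through wc -l yields the number of contributors. A sketch with two made-up authors in a throwaway repository:

```shell
# Throwaway repo with commits by two fake authors
tmp=$(mktemp -d) && cd "$tmp" && git init -q .
git -c user.name=Alice -c user.email=alice@example.net commit -q --allow-empty -m one
git -c user.name=Bob -c user.email=bob@example.net commit -q --allow-empty -m two
git shortlog -sn HEAD            # commits per contributor, most active first
git shortlog -sn HEAD | wc -l    # number of contributors
```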

How much annexed file content does it contain?

$ datalad status --annex all
1 annex'd file (0.0 B/7.5 MB present/total size)
nothing to save, working tree clean

How many subdatasets are there?

$ datalad subdatasets
subdataset(ok): code/raincloud-plots (dataset)
subdataset(ok): data/bids (dataset)
subdataset(ok): data/decoding (dataset)

Finally, get the annexed file content and drop it afterwards.

Yeah, data!

Get it…

$ datalad get .
get(ok): data/tmp/dt_pred_conc_slope.Rdata (file) [from gin...]

Drop it!

$ datalad drop .
drop(ok): data/tmp/dt_pred_conc_slope.Rdata (file)
drop(ok): . (directory)