1. Challenge: DataLad Datasets¶
You can always get help
To see all available DataLad commands, run datalad --help. To learn more about a specific command, run datalad <subcommand> --help.
1.1. Challenge 1¶
Create a dataset called my-dataset on your computer.
Inside of the dataset, run the command gitk and explore it.
Can you find:
the dataset identifier?
the version label?
the dataset creator?
the dataset creation date?
Afterwards, run the command gitk --all. What is the difference from before?
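As a hint for the last question: gitk without arguments visualizes only the history of the currently checked-out branch, while gitk --all adds every ref in the repository, which in a DataLad dataset includes the git-annex branch where annex metadata lives. The same distinction can be seen with plain git log in a throwaway repository (a synthetic sketch; repository and branch names are illustrative):

```shell
# Sketch: why `gitk --all` shows more than `gitk` (illustrative names)
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.name=Demo -c user.email=d@example.net commit -q --allow-empty -m "first commit"
git branch extra            # a second ref, playing the role of e.g. git-annex
git -c user.name=Demo -c user.email=d@example.net commit -q --allow-empty -m "second commit"
git checkout -q extra
# like `gitk`: only the current branch's history (one commit)
n_branch=$(git log --oneline | wc -l | tr -d ' ')
# like `gitk --all`: every ref (both commits)
n_all=$(git log --oneline --all | wc -l | tr -d ' ')
echo "current branch: $n_branch commit(s); all refs: $n_all commit(s)"
```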
Show me how to do it
To create a new dataset, run:
$ datalad create my-dataset
create(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset)
Finally, remove the dataset.
How do I do that?
To remove it, run datalad drop (manual). Importantly, this command needs to run outside of the dataset.
$ datalad drop --what all -d my-dataset --reckless availability
uninstall(ok): . (dataset)
1.2. Challenge 2¶
Text files are digital files containing plain text. Take a minute to think:
Why is it often useful to keep text files out of git-annex?
On the other hand, what could be a reason to annex text files?
Tell me!
Why is it useful to keep text files out of git-annex?
To make editing easier (no need to unlock)
To have a nicer Git history (commits can show differences between file revisions)
To distribute the file automatically with every clone (unlike annexed content, the content of files kept in Git is readily available in every dataset clone)
What could be a reason to annex text files?
To keep file contents private/secret (annexing files allows access control)
An unusually large text file (at least dozens of MB)
Create a DataLad dataset called text2gitdataset and configure it to never annex text files (there are several ways to do this!).
Ok, show me the ways!
1. Right at dataset creation
$ datalad create -c text2git text2gitdataset
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/text2gitdataset (dataset) [VIRTUALENV/bin/python /home...]
create(ok): /home/me/challenges/102-101-dataset/text2gitdataset (dataset)
2. After dataset creation with a datalad run-procedure (manual)
$ datalad create text2gitdataset-2
$ cd text2gitdataset-2
$ datalad run-procedure cfg_text2git
create(ok): /home/me/challenges/102-101-dataset/text2gitdataset-2 (dataset)
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/text2gitdataset-2 (dataset) [VIRTUALENV/bin/python /home...]
3. By editing .gitattributes by hand
$ datalad create text2gitdataset-3
$ cd text2gitdataset-3
$ echo "* annex.largefiles=((mimeencoding=binary)and(largerthan=0))" >> .gitattributes
$ datalad save -m "configure Dataset to keep text files in Git"
create(ok): /home/me/challenges/102-101-dataset/text2gitdataset-3 (dataset)
add(ok): .gitattributes (file)
save(ok): . (dataset)
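You can confirm that the hand-edited rule is picked up using plain Git: git check-attr reports the attribute value that git-annex will consult when deciding whether to annex a file. A sketch in a throwaway repository (repository and file names are illustrative):

```shell
# Sketch: verify the .gitattributes rule with git check-attr
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q attrdemo && cd attrdemo
echo "* annex.largefiles=((mimeencoding=binary)and(largerthan=0))" > .gitattributes
# check-attr prints "<path>: <attribute>: <value>"; extract the value
rule=$(git check-attr annex.largefiles -- README.md | awk -F': ' '{print $3}')
echo "$rule"
```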
In the end, remove the datasets.
Can you show me again?
Clean-up:
$ datalad drop -d text2gitdataset --what all --reckless availability
$ datalad drop -d text2gitdataset-2 --what all --reckless availability
$ datalad drop -d text2gitdataset-3 --what all --reckless availability
uninstall(ok): . (dataset)
uninstall(ok): . (dataset)
uninstall(ok): . (dataset)
1.3. Challenge 3¶
Version controlling a file means recording its changes over time, associating those changes with an author, date, and identifier, creating a lineage of file content, and being able to revert changes or restore previous file versions. DataLad datasets can version control their contents, regardless of size.
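The ideas above can be sketched with plain Git alone: record two versions of a file, inspect the lineage (author, date, identifier), then restore the earlier content. A minimal, synthetic example (names are illustrative):

```shell
# Sketch: version control basics with plain Git
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q vcdemo && cd vcdemo
git config user.name "Demo" && git config user.email "demo@example.net"
echo "version 1" > notes.txt
git add notes.txt && git commit -q -m "add notes"
echo "version 2" > notes.txt
git commit -q -am "revise notes"
# every change carries an author, date, and identifier ...
git log --format="%h %an %ad %s" -- notes.txt
# ... and earlier content can be restored
git checkout -q HEAD~1 -- notes.txt
restored=$(cat notes.txt)
```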
Create a new dataset my-dataset that is configured to store text files in Git (see previous challenge) and add a README.md file with some content into it.
Make sure to save it with a helpful commit message, and inspect your dataset's revision history.
Let’s go!
Create the dataset and cd into it:
$ datalad create -c text2git my-dataset
$ cd my-dataset
[INFO] Running procedure cfg_text2git
[INFO] == Command start (output follows) =====
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset) [VIRTUALENV/bin/python /home...]
create(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset)
Create a text file and save it (you can also create it with an editor of your choice, e.g., vim):
$ echo "# Example Dataset" > README.md
$ datalad status
untracked: README.md (file)
$ datalad save -m "add a README to the dataset"
add(ok): README.md (file)
save(ok): . (dataset)
Check the dataset’s history:
$ git log
commit 2b505f91✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0200
add a README to the dataset
commit 5ec4dab1✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0200
Instruct annex to add text files to Git
commit 1e658827✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0200
[DATALAD] new dataset
Run gitk again. Can you find the dataset modification date?
Finally, edit the README and save it again.
Let’s go!
$ echo "This is my example dataset" >> README.md
$ datalad save -m "Add redundant explanation"
add(ok): README.md (file)
save(ok): . (dataset)
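Because datalad save creates an ordinary Git commit, you can review exactly what the last save changed with git diff. A sketch in a synthetic repository that mirrors the two commits above (names and content are illustrative):

```shell
# Sketch: inspect what the most recent save changed
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q diffdemo && cd diffdemo
git config user.name "Demo" && git config user.email "demo@example.net"
echo "# Example Dataset" > README.md
git add README.md && git commit -q -m "add a README to the dataset"
echo "This is my example dataset" >> README.md
git commit -q -am "Add redundant explanation"
# the added line shows up with a leading "+" in the diff
change=$(git diff HEAD~1 HEAD -- README.md | grep '^+This')
echo "$change"
```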
1.4. Challenge 4¶
Download the following penguin images from the URLs below and save them into a dataset:
chinstrap_01.jpg: https://hub.datalad.org/edu/penguins/media/branch/main/examples/adelie.jpg
chinstrap_02.jpg: https://hub.datalad.org/edu/penguins/media/branch/main/examples/chinstrap.jpg
You can reuse the dataset from the previous challenge, or create a new one. Can you do the download while recording provenance?
Give me a hint about provenance
Try using datalad download-url (manual) or datalad-next’s “download” command combined with datalad run (manual).
Show me the entire solution
You can download a file and save it manually:
$ wget -q -O chinstrap_01.jpg "https://hub.datalad.org/edu/penguins/media/branch/main/examples/adelie.jpg"
$ datalad save -m "Add manually downloaded images"
add(ok): chinstrap_01.jpg (file)
save(ok): . (dataset)
Or download it recording its origin as provenance:
$ datalad run -m "Add image from the web" " datalad download 'https://hub.datalad.org/edu/penguins/media/branch/main/examples/chinstrap.jpg'"
[INFO] == Command start (output follows) =====
download(ok): chinstrap.jpg
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-101-dataset/my-dataset (dataset) [ datalad download 'https://hub.datalad.o...]
add(ok): chinstrap.jpg (file)
save(ok): . (dataset)
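What makes the datalad run variant provenance-aware is that it stores a machine-readable record of the executed command in the commit message, between marker lines. A sketch of pulling such a record back out with sed; the commit message here is a synthetic stand-in for a real run record, and the command string is a placeholder:

```shell
# Sketch: extract a run record from a commit message (synthetic example)
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q rundemo && cd rundemo
git config user.name "Demo" && git config user.email "demo@example.net"
# synthetic stand-in for a real `datalad run` commit message
msg='[DATALAD RUNCMD] Add image from the web

=== Do not change lines below ===
{"cmd": "datalad download <URL>"}
^^^ Do not change lines above ^^^'
git commit -q --allow-empty -m "$msg"
# print the JSON between the marker lines, dropping the markers themselves
record=$(git log -1 --format=%B \
  | sed -n '/^=== Do not change/,/^\^\^\^ Do not change/p' | sed '1d;$d')
echo "$record"
```

This machine-readable record is what allows DataLad to re-execute the command later (e.g., with datalad rerun).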
Run gitk in the dataset. Can you find the file identifier of any of the newly downloaded files?
1.5. Challenge 5¶
Besides creating datasets on your own, DataLad also allows you to clone existing datasets. Clone and explore the dataset from the following publication:
> Wittkuhn, L., Schuck, N.W. Dynamics of fMRI patterns reflect sub-second activation sequences and reveal replay in human visual cortex. Nat Commun 12, 1795 (2021). https://doi.org/10.1038/s41467-021-21970-2
You can find it at https://github.com/lnnrtwttkhn/highspeed-analysis.
Show me how to clone it
$ datalad clone https://github.com/lnnrtwttkhn/highspeed-analysis.git
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] access to 1 dataset sibling keeper not auto-enabled, enable with:
| datalad siblings -d "/home/me/challenges/102-101-dataset/highspeed-analysis" enable -s keeper
install(ok): /home/me/challenges/102-101-dataset/highspeed-analysis (dataset)
Explore the dataset:
When was it created?
When was it last updated?
How many contributors does it have?
How much annexed file content does it contain?
How many subdatasets are there?
Let’s compare explorations
When was it created?
$ cd highspeed-analysis
# first commit
$ git log $(git rev-list --max-parents=0 HEAD)
commit 34eeffcc✂SHA1
Author: Lennart Wittkuhn <wittkuhn@mpib-berlin.mpg.de>
Date: Thu Nov 5 08:45:43 2020 +0100
[DATALAD] new dataset
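The trick here is that git rev-list --max-parents=0 HEAD names the root commit(s), i.e. commits with no parent, which is how you find a repository's creation date. A sketch in a throwaway repository (names are illustrative):

```shell
# Sketch: find the root commit of a repository
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q rootdemo && cd rootdemo
git config user.name "Demo" && git config user.email "demo@example.net"
git commit -q --allow-empty -m "the very first commit"
first=$(git rev-parse HEAD)
git commit -q --allow-empty -m "a later commit"
# rev-list with --max-parents=0 reports only parentless (root) commits
root=$(git rev-list --max-parents=0 HEAD)
echo "$root"
```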
When was it last updated?
# most recent commit
$ git show
commit 5ed32fa7✂SHA1
Author: Lennart Wittkuhn <wittkuhn@mpib-berlin.mpg.de>
Date: Mon Aug 23 10:24:36 2021 +0200
update state of input subdatasets in /data
diff --git a/data/bids b/data/bids
index 5dd8eb8..4b59024 160000
--- a/data/bids
+++ b/data/bids
@@ -1 +1 @@
-Subproject commit 5dd8eb86✂SHA1
+Subproject commit 4b59024c✂SHA1
diff --git a/data/decoding b/data/decoding
index 332d94a..a05eb37 160000
--- a/data/decoding
+++ b/data/decoding
@@ -1 +1 @@
-Subproject commit 332d94a2✂SHA1
+Subproject commit a05eb37d✂SHA1
How many contributors does it have?
# contributions by contributor
$ git shortlog -s
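Since git shortlog -s prints one line per author, piping it through wc -l yields the number of distinct contributors. A sketch in a throwaway repository with two authors (names are illustrative):

```shell
# Sketch: count distinct contributors with shortlog
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q authdemo && cd authdemo
git -c user.name="Alice" -c user.email="alice@example.net" commit -q --allow-empty -m "one"
git -c user.name="Bob" -c user.email="bob@example.net" commit -q --allow-empty -m "two"
git -c user.name="Alice" -c user.email="alice@example.net" commit -q --allow-empty -m "three"
# one line per author, sorted by commit count
n_contributors=$(git shortlog -sn HEAD | wc -l | tr -d ' ')
echo "$n_contributors"
```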
How much annexed file content does it contain?
$ datalad status --annex all
1 annex'd file (0.0 B/7.5 MB present/total size)
nothing to save, working tree clean
How many subdatasets are there?
$ datalad subdatasets
subdataset(ok): code/raincloud-plots (dataset)
subdataset(ok): data/bids (dataset)
subdataset(ok): data/decoding (dataset)
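Under the hood, DataLad subdatasets are Git submodules, so plain Git can list them too. A sketch with throwaway repositories (names are illustrative; newer Git versions require protocol.file.allow for local-path submodule adds, which is set explicitly here):

```shell
# Sketch: subdatasets are Git submodules
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q inner
(cd inner && git -c user.name=Demo -c user.email=d@example.net commit -q --allow-empty -m "init")
git init -q outer && cd outer
# register the inner repository as a submodule (= subdataset)
git -c protocol.file.allow=always submodule add "$tmp/inner" data/inner
git -c user.name=Demo -c user.email=d@example.net commit -q -m "register subdataset"
# one line per registered submodule
subs=$(git submodule status | wc -l | tr -d ' ')
echo "$subs"
```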
Finally, get the annexed file content and drop it afterwards.
Yeah, data!
Get it…
$ datalad get .
get(ok): data/tmp/dt_pred_conc_slope.Rdata (file) [from gin...]
Drop it!
$ datalad drop .
drop(ok): data/tmp/dt_pred_conc_slope.Rdata (file)
drop(ok): . (directory)