2. Challenge: DataLad Subdatasets
You can always get help
In order to learn about available DataLad commands, use datalad --help. In order to learn more about a specific command, use datalad <subcommand> --help.
2.1. Challenge 1
Datasets can have subdatasets.
Let’s build a nested dataset from scratch.
Start by creating a dataset called penguin-report, and inside of it, create a subdataset called inputs:
penguin-report
└── inputs
Show me how to do it
Create a new dataset as the superdataset:
$ datalad create penguin-report
create(ok): /home/me/challenges/102-102-subdataset/penguin-report (dataset)
$ cd penguin-report
Next, create the subdataset inputs and register it in the superdataset by passing -d (--dataset):
$ datalad create -d . inputs
add(ok): inputs (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
create(ok): inputs (dataset)
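A DataLad subdataset is registered in its superdataset as a Git submodule. The sketch below reproduces that registration with plain Git in a throwaway directory, so the two resulting records can be inspected: the entry in .gitmodules and the "gitlink" entry (mode 160000) in the index. It is a stand-in for illustration only, not the exact steps datalad create -d . inputs performs; the temporary directory and the demo identity are assumptions.

```shell
set -eu
# Work in a throwaway directory (assumption: git is installed)
demo=$(mktemp -d)
cd "$demo"

# Stand-ins for the superdataset and the subdataset: two plain Git repositories
git init -q penguin-report
cd penguin-report
git init -q inputs
# Hypothetical demo identity so the commit works without global Git configuration
git -C inputs -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "Initial commit"

# Register the existing inner repository as a submodule of the outer one
git submodule --quiet add ./inputs inputs

# 1) The registration is written to .gitmodules
cat .gitmodules

# 2) The index holds a "gitlink" (mode 160000) that pins one commit of inputs
entry_mode=$(git ls-files --stage inputs | cut -d' ' -f1)
echo "index mode of inputs: $entry_mode"
```

In a real DataLad dataset, the same two inspection commands (cat .gitmodules and git ls-files --stage inputs) can be run directly inside penguin-report.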
2.2. Challenge 2
Let’s populate the subdataset with contents.
Download the following set of CSV files into the inputs dataset and save them:
adelie.csv: https://pasta.lternet.edu/package/data/eml/knb-lter-pal/219/5/002f3893385f710df69eeebe893144ff
gentoo.csv: https://pasta.lternet.edu/package/data/eml/knb-lter-pal/220/7/e03b43c924f226486f2f0ab6709d2381
chinstrap.csv: https://pasta.lternet.edu/package/data/eml/knb-lter-pal/221/8/fe853aa8f7a59aa84cdd3197619ef462
Downloading first:
There are several ways to accomplish this. The solution below uses the download command from the datalad-next extension, wrapped in datalad run (manual) so that the download command is recorded, and executes it inside of the subdataset.
$ cd inputs
$ datalad run -m "Download penguin data" "datalad download 'https://pasta.lternet.edu/package/data/eml/knb-lter-pal/219/5/002f3893385f710df69eeebe893144ff adelie.csv' 'https://pasta.lternet.edu/package/data/eml/knb-lter-pal/220/7/e03b43c924f226486f2f0ab6709d2381 gentoo.csv' 'https://pasta.lternet.edu/package/data/eml/knb-lter-pal/221/8/fe853aa8f7a59aa84cdd3197619ef462 chinstrap.csv'"
[INFO] == Command start (output follows) =====
download(ok): adelie.csv
download(ok): gentoo.csv
download(ok): chinstrap.csv
[INFO] == Command exit (modification check follows) =====
run(ok): /home/me/challenges/102-102-subdataset/penguin-report/inputs (dataset) [datalad download 'https://pasta.lternet....]
add(ok): adelie.csv (file)
add(ok): chinstrap.csv (file)
add(ok): gentoo.csv (file)
save(ok): . (dataset)
Afterwards, record the new subdataset state in the superdataset.
Saving the updated subdataset state
datalad status in the superdataset will show that the subdataset changed:
$ cd ..
$ datalad status
modified: inputs (dataset)
To save this most recent state, use datalad save (manual) with -d:
# still inside the penguin-report superdataset
$ datalad save -d . -m "Save updated subdataset version"
add(ok): inputs (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
2.3. Challenge 3
Where can you find out about the subdataset version?
Tell me!
The version is recorded in the superdataset, not in the subdataset itself: each superdataset commit that touches the subdataset stores the subdataset’s commit hash. Take a look at the so-called “subproject commit”:
$ git show inputs
commit bbfebc66✂SHA1
Author: Elena Piscopia <elena@example.net>
Date: Tue Jun 18 16:13:00 2019 +0200
Save updated subdataset version
diff --git a/inputs b/inputs
index 7dab0f4..f672de9 160000
--- a/inputs
+++ b/inputs
@@ -1 +1 @@
-Subproject commit 7dab0f49✂SHA1
+Subproject commit f672de93✂SHA1
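The pinned commit can also be queried directly instead of being read from a diff. The following sketch uses plain Git repositories as a stand-in for the two datasets (an assumption made so that it runs without DataLad; the demo identity and temporary paths are hypothetical). It shows that git rev-parse HEAD:inputs returns the subdataset commit recorded by the superdataset, and that this record only changes once the superdataset is saved again.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
# Hypothetical demo identity so commits work without global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Nested setup: outer repository with one registered inner repository
git init -q penguin-report
cd penguin-report
git init -q inputs
git -C inputs commit -q --allow-empty -m "Initial commit"
git submodule --quiet add ./inputs inputs
git commit -q -m "Register inputs"

# The outer repository pins exactly one commit of the inner one
pinned_before=$(git rev-parse HEAD:inputs)

# New work in the inner repository moves its HEAD, but not the pinned commit
git -C inputs commit -q --allow-empty -m "Add data"
sub_head=$(git -C inputs rev-parse HEAD)
git status --short    # reports inputs as modified

# Committing in the outer repository updates the pin
git add inputs
git commit -q -m "Save updated subdataset version"
pinned_after=$(git rev-parse HEAD:inputs)

echo "before: $pinned_before"
echo "after:  $pinned_after"
```

In the penguin-report dataset from the challenge, comparing git rev-parse HEAD:inputs with git -C inputs rev-parse HEAD is a quick way to check whether the superdataset has recorded the latest subdataset state.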
2.4. Challenge 4
Clone the following dataset: https://github.com/psychoinformatics-de/studyforrest-data. Try to list the available subdatasets.
I’m excited!
Start with cloning:
$ datalad clone https://github.com/psychoinformatics-de/studyforrest-data.git
[INFO] Remote origin not usable by git-annex; setting annex-ignore
[INFO] RIA store unavailable. -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to establish a new session 1 times. -caused by- HTTPConnectionPool(host='studyforrest.ds.inm7.de', port=80): Max retries exceeded with url: /ria-layout-version (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at ✂MEMORYADDRESS✂: Failed to resolve 'studyforrest.ds.inm7.de' ([Errno -2] Name or service not known)"))
install(ok): /home/me/challenges/102-102-subdataset/studyforrest-data (dataset)
Find out about subdatasets afterwards:
$ cd studyforrest-data
$ datalad subdatasets
subdataset(ok): artifact/3T_movie_eyetracking (dataset)
subdataset(ok): artifact/3T_structural_mri (dataset)
subdataset(ok): artifact/3T_visuallocalizer (dataset)
subdataset(ok): artifact/7T_audiomovie (dataset)
subdataset(ok): artifact/7T_musicperception (dataset)
subdataset(ok): artifact/media (dataset)
subdataset(ok): artifact/movie_eyetracking (dataset)
subdataset(ok): code/conversion_qa (dataset)
subdataset(ok): derivative/aggregate_fmri_timeseries (dataset)
subdataset(ok): derivative/aligned_mri (dataset)
subdataset(ok): derivative/cortical_surfaces_freesurfer (dataset)
subdataset(ok): derivative/image_space_transformations (dataset)
subdataset(ok): derivative/retinotopic_maps (dataset)
subdataset(ok): derivative/visual_areas (dataset)
subdataset(ok): original/3T_multiresolution_fmri (dataset)
subdataset(ok): original/3T_structural_mri (dataset)
subdataset(ok): original/7T_multiresolution_fmri (dataset)
subdataset(ok): original/phase2 (dataset)
subdataset(ok): stimulus/computational_annotations (dataset)
subdataset(ok): stimulus/curated_annotations (dataset)
Take a look at any of the subdatasets’ directories. Why do they appear to be empty?
What do you need to do to retrieve availability information about a dataset, but not download its content? Try with the subdataset original/phase2.
Okidoki, I’m ready.
$ datalad get -n original/phase2
[INFO] RIA store unavailable. -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to access http://studyforrest.ds.inm7.de/ria-layout-version -caused by- Failed to establish a new session 1 times. -caused by- HTTPConnectionPool(host='studyforrest.ds.inm7.de', port=80): Max retries exceeded with url: /ria-layout-version (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at ✂MEMORYADDRESS✂: Failed to resolve 'studyforrest.ds.inm7.de' ([Errno -2] Name or service not known)"))
[WARNING] Failed to fetch type=git special remote mddatasrc: datalad.runner.exception.CommandError(CommandError: 'git -c diff.ignoreSubmodules=none -c core.quotepath=false fetch --verbose --progress mddatasrc' failed with exitcode 128 under /home/me/challenges/102-102-subdataset/studyforrest-data/original/phase2 [err: 'fatal: unable to access 'http://psydata.ovgu.de/studyforrest/phase2/.git/': Failed to connect to psydata.ovgu.de port 80 after 134902 ms: Could not connect to server'])
[WARNING] mddatasrc was marked by git-annex as annex-ignore.Edit .git/config to reset if you think that was done by mistake due to absent connection etc.
[INFO] Reconfigured inm7-storage for ria+https://datapub.fz-juelich.de/studyforrest/studyforrest.ria
[INFO] Configure additional publication dependency on "inm7-storage"
install(ok): /home/me/challenges/102-102-subdataset/studyforrest-data/original/phase2 (dataset) [Installed subdataset in order to get /home/me/challenges/102-102-subdataset/studyforrest-data/original/phase2]
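The reason the subdataset directories look empty after cloning can be reproduced with plain Git, which is used here as a stand-in for DataLad. In this sketch, cloning an outer repository leaves the nested one as an empty, uninitialized directory until it is installed explicitly. The repository names, the placeholder file, and the demo identity are assumptions for illustration. Plain Git has no notion of annexed file content, so the last step mirrors only the "install" part of datalad get -n.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"
# Hypothetical demo identity so commits work without global Git configuration
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a small nested repository to clone from
git init -q origin-super
git init -q origin-super/inputs
echo "placeholder" > origin-super/inputs/adelie.csv
git -C origin-super/inputs add adelie.csv
git -C origin-super/inputs commit -q -m "Add data"
git -C origin-super submodule --quiet add ./inputs inputs
git -C origin-super commit -q -m "Register inputs"

# Cloning the outer repository does not fetch the nested one
git clone -q origin-super clone
cd clone
files_before=$(ls -A inputs)
# A leading "-" in the status line marks a submodule that is not initialized
status_before=$(git submodule status inputs)
echo "after clone:   '$files_before' / $status_before"

# Install the nested repository on demand; the protocol setting permits
# the local file transport, which recent Git versions block for submodules
git -c protocol.file.allow=always submodule --quiet update --init inputs
files_after=$(ls -A inputs)
echo "after install: $files_after"
```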
Beware of Windows path semantics
On Windows, make sure to adjust the path to the subdataset:
$ datalad get -n original\phase2
Where can you find out about the origin location of a dataset’s subdatasets?
Let’s see!
The information is stored in the superdataset’s .gitmodules file:
$ cat .gitmodules
[submodule "artifact/7T_audiomovie"]
path = artifact/7T_audiomovie
url = http://studyforrest.ds.inm7.de/d47/59300-5563-467d-be5f-e5b164fb3060
datalad-id = d4759300-5563-467d-be5f-e5b164fb3060
datalad-url = "ria+http://studyforrest.ds.inm7.de#d4759300-5563-467d-be5f-e5b164fb3060"
datalad-recursiveinstall = skip
[submodule "artifact/3T_movie_eyetracking"]
path = artifact/3T_movie_eyetracking
url = http://studyforrest.ds.inm7.de/d5d/d3da0-a631-4c0c-a4a9-de55dfc4620f
datalad-id = d5dd3da0-a631-4c0c-a4a9-de55dfc4620f
datalad-url = "ria+http://studyforrest.ds.inm7.de#d5dd3da0-a631-4c0c-a4a9-de55dfc4620f"
datalad-recursiveinstall = skip
[submodule "artifact/3T_visuallocalizer"]
path = artifact/3T_visuallocalizer
url = http://studyforrest.ds.inm7.de/607/5c0fa-ab72-4bab-9888-3b597f0e63b1
datalad-id = 6075c0fa-ab72-4bab-9888-3b597f0e63b1
datalad-url = "ria+http://studyforrest.ds.inm7.de#6075c0fa-ab72-4bab-9888-3b597f0e63b1"
datalad-recursiveinstall = skip
[submodule "derivative/aggregate_fmri_timeseries"]
path = derivative/aggregate_fmri_timeseries
url = https://github.com/psychoinformatics-de/studyforrest-data-aggregate.git
datalad-id = 7fcd8812-d0fe-11e7-8db2-a0369f7c647e
datalad-url = "ria+http://studyforrest.ds.inm7.de#7fcd8812-d0fe-11e7-8db2-a0369f7c647e"
[submodule "derivative/aligned_mri"]
path = derivative/aligned_mri
url = https://github.com/psychoinformatics-de/studyforrest-data-aligned.git
datalad-id = c8ec2919-493b-4af5-9271-cbe9ebd08c43
datalad-url = "ria+http://studyforrest.ds.inm7.de#c8ec2919-493b-4af5-9271-cbe9ebd08c43"
[submodule "artifact/3T_structural_mri"]
path = artifact/3T_structural_mri
url = http://studyforrest.ds.inm7.de/ad9/b6c66-4413-4b4f-b6da-b7f25d0d6397
datalad-id = ad9b6c66-4413-4b4f-b6da-b7f25d0d6397
datalad-url = "ria+http://studyforrest.ds.inm7.de#ad9b6c66-4413-4b4f-b6da-b7f25d0d6397"
datalad-recursiveinstall = skip
[submodule "artifact/movie_eyetracking"]
path = artifact/movie_eyetracking
url = http://studyforrest.ds.inm7.de/126/cd950-377c-4600-a921-045cf408bd9f
datalad-id = 126cd950-377c-4600-a921-045cf408bd9f
datalad-url = "ria+http://studyforrest.ds.inm7.de#126cd950-377c-4600-a921-045cf408bd9f"
datalad-recursiveinstall = skip
[submodule "artifact/media"]
path = artifact/media
url = http://studyforrest.ds.inm7.de/da1/5d84c-9c8b-11e9-a3fb-f0d5bf7b5561
datalad-id = da15d84c-9c8b-11e9-a3fb-f0d5bf7b5561
datalad-url = "ria+http://studyforrest.ds.inm7.de#da15d84c-9c8b-11e9-a3fb-f0d5bf7b5561"
datalad-recursiveinstall = skip
[submodule "artifact/7T_musicperception"]
path = artifact/7T_musicperception
url = http://studyforrest.ds.inm7.de/c08/af312-e05b-43b3-b499-db0d2ad46bf6
datalad-id = c08af312-e05b-43b3-b499-db0d2ad46bf6
datalad-url = "ria+http://studyforrest.ds.inm7.de#c08af312-e05b-43b3-b499-db0d2ad46bf6"
datalad-recursiveinstall = skip
[submodule "original/7T_multiresolution_fmri"]
path = original/7T_multiresolution_fmri
url = https://github.com/psychoinformatics-de/studyforrest-data-multires7t.git
datalad-id = 3a8648b3-7df8-413f-8efb-4d39040ac174
datalad-url = "ria+http://studyforrest.ds.inm7.de#3a8648b3-7df8-413f-8efb-4d39040ac174"
[submodule "original/3T_multiresolution_fmri"]
path = original/3T_multiresolution_fmri
url = https://github.com/psychoinformatics-de/studyforrest-data-multires3t.git
datalad-id = 5b1081d6-84d7-11e8-b00a-a0369fb55db0
datalad-url = "ria+http://studyforrest.ds.inm7.de#5b1081d6-84d7-11e8-b00a-a0369fb55db0"
[submodule "stimulus/curated_annotations"]
path = stimulus/curated_annotations
url = https://github.com/psychoinformatics-de/studyforrest-data-annotations.git
datalad-id = 45b9ab26-07fc-11e8-8c71-f0d5bf7b5561
datalad-url = "ria+http://studyforrest.ds.inm7.de#45b9ab26-07fc-11e8-8c71-f0d5bf7b5561"
[submodule "stimulus/computational_annotations"]
path = stimulus/computational_annotations
url = http://studyforrest.ds.inm7.de/4c5/36c4a-ec61-11e6-9440-00b56d060aa7
datalad-id = 4c536c4a-ec61-11e6-9440-00b56d060aa7
datalad-url = "ria+http://studyforrest.ds.inm7.de#4c536c4a-ec61-11e6-9440-00b56d060aa7"
datalad-recursiveinstall = skip
[submodule "derivative/cortical_surfaces_freesurfer"]
path = derivative/cortical_surfaces_freesurfer
url = https://github.com/psychoinformatics-de/studyforrest-data-freesurfer.git
datalad-id = 3304e775-5f5f-435a-b68e-d98c9f5fb72a
datalad-url = "ria+http://studyforrest.ds.inm7.de#3304e775-5f5f-435a-b68e-d98c9f5fb72a"
[submodule "code/conversion_qa"]
path = code/conversion_qa
url = https://github.com/mih/gumpdata.git
datalad-id = 0f66b1ba-e9a9-46fd-b9d9-2e64fe94d307
datalad-url = "ria+http://studyforrest.ds.inm7.de#0f66b1ba-e9a9-46fd-b9d9-2e64fe94d307"
[submodule "original/phase2"]
path = original/phase2
url = https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
datalad-id = 5eaff716-54eb-11e8-803d-a0369f7c647e
datalad-url = "ria+http://studyforrest.ds.inm7.de#5eaff716-54eb-11e8-803d-a0369f7c647e"
[submodule "derivative/retinotopic_maps"]
path = derivative/retinotopic_maps
url = https://github.com/psychoinformatics-de/studyforrest-data-retinotopy.git
datalad-id = 2d05f277-94b0-470b-8e11-4e56691d5b89
datalad-url = "ria+http://studyforrest.ds.inm7.de#2d05f277-94b0-470b-8e11-4e56691d5b89"
[submodule "original/3T_structural_mri"]
path = original/3T_structural_mri
url = https://github.com/psychoinformatics-de/studyforrest-data-structural.git
datalad-id = 1882e2e6-fbbf-4ade-a65f-3a1615235f51
datalad-url = "ria+http://studyforrest.ds.inm7.de#1882e2e6-fbbf-4ade-a65f-3a1615235f51"
[submodule "derivative/image_space_transformations"]
path = derivative/image_space_transformations
url = https://github.com/psychoinformatics-de/studyforrest-data-templatetransforms.git
datalad-id = ceb007ac-ef05-4392-98d2-35c02a774a21
datalad-url = "ria+http://studyforrest.ds.inm7.de#ceb007ac-ef05-4392-98d2-35c02a774a21"
[submodule "derivative/visual_areas"]
path = derivative/visual_areas
url = https://github.com/psychoinformatics-de/studyforrest-data-visualrois.git
datalad-id = 92e65958-4a5a-4c34-a4f4-ee070f7a123b
datalad-url = "ria+http://studyforrest.ds.inm7.de#92e65958-4a5a-4c34-a4f4-ee070f7a123b"
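Because .gitmodules uses Git's configuration file format, individual values can be queried with git config -f rather than read from the full listing. The sketch below is self-contained: it writes two of the entries shown above into a temporary file (the temporary directory is an assumption) and looks up their origin URLs. Inside the real clone, the same git config calls work on the existing .gitmodules file.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

# Two entries copied from the .gitmodules listing of studyforrest-data
cat > .gitmodules <<'EOF'
[submodule "artifact/media"]
    path = artifact/media
    url = http://studyforrest.ds.inm7.de/da1/5d84c-9c8b-11e9-a3fb-f0d5bf7b5561
    datalad-id = da15d84c-9c8b-11e9-a3fb-f0d5bf7b5561
    datalad-url = "ria+http://studyforrest.ds.inm7.de#da15d84c-9c8b-11e9-a3fb-f0d5bf7b5561"
    datalad-recursiveinstall = skip
[submodule "original/phase2"]
    path = original/phase2
    url = https://github.com/psychoinformatics-de/studyforrest-data-phase2.git
    datalad-id = 5eaff716-54eb-11e8-803d-a0369f7c647e
    datalad-url = "ria+http://studyforrest.ds.inm7.de#5eaff716-54eb-11e8-803d-a0369f7c647e"
EOF

# Origin URL of a single subdataset
phase2_url=$(git config -f .gitmodules --get submodule.original/phase2.url)
echo "$phase2_url"

# All registered origin URLs, one "key value" pair per line
git config -f .gitmodules --get-regexp '^submodule\..*\.url$'
```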
Navigate into the newly installed subdataset original/phase2.
Run gitk and explore its files to find out what this dataset is all about.