3.2. Native High Performance Computing integration for SLURM
For high-performance computing, DataLad’s reproducibility approach needs a special flavor. This section sketches a solution for HPC systems running the SLURM job scheduler.
3.2.1. Why datalad run/rerun conflicts with HPC batch processing
A typical workflow in HPC batch processing with SLURM involves a “job”, e.g., a script performing a computation, and a batch script that schedules this job.
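For concreteness, a minimal batch script (here slurm.sh; the resource values and the payload command are illustrative) might look like this:

#!/bin/bash
#SBATCH --job-name=compute       # job name shown in the queue
#SBATCH --nodes=1                # number of nodes to allocate
#SBATCH --ntasks=4               # number of parallel tasks
#SBATCH --time=02:00:00          # wall-clock limit

srun python compute.py           # the actual computation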
Using datalad run on the outside, i.e., as a prefix command when submitting a SLURM job with sbatch slurm.sh, would not provide a record of the computational job or its results: the job typically has not even started running when the sbatch call returns.
On the other hand, using datalad run inside a SLURM batch job would cause three problems:
Critical conflict: A Git repository must not be accessed concurrently
Parallel I/O inside a job is fine, but concurrent Git calls are race conditions that cause undefined behavior and errors; imagine conflicting git checkout calls in the same clone.
Critical inefficiency: Sequential Git / DataLad calls inside a (highly) parallel SLURM job waste allocated compute time
datalad status would run as a sequential section and may take many seconds or minutes
git annex get can take a long time, which is unwelcome inside a job, even if one parallelizes it with -J
Breaking machine-actionable reproducibility: Should the SLURM job script be part of the rerun record?
Yes, it should, because it contains the resource definitions for the job and the actual commands.
It is actually needed for a rerun (even though SLURM job scripts are not very portable between clusters).
If the datalad run ... call were inside the SLURM job script, it would need to be modified for a rerun, from datalad run <command> to datalad rerun <commithash>. Thus it cannot be reused unmodified, as the sketch below illustrates.
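To make the last point concrete, here is a deliberately broken sketch of a batch script that embeds the provenance command (resource values and the payload command are illustrative):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=01:00:00

# Recording provenance from inside the job hard-codes the command:
datalad run srun python compute.py

# To reproduce the result, the line above would have to be edited to
#   datalad rerun <commithash>
# so the job script captured in the run record no longer matches the
# script that is actually executed.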
3.2.2. The DataLad SLURM extension
The DataLad SLURM extension introduces alternatives to the datalad run and datalad rerun commands.
The datalad slurm-schedule command schedules a SLURM job. It is a prefix command to the usual sbatch slurm.sh ... command, with its own command line options. It requires specifying all output files of the job, or output directories that will contain them. While the job runs, no DataLad activity happens.
Some time after the job (or a set of jobs) finishes, datalad slurm-finish needs to be called. It checks each job’s status and commits the job’s outputs to the repository. This happens outside of any job and handles jobs one after the other; thus, it avoids the problems mentioned above.
To reproduce some job’s result, simply execute datalad slurm-reschedule <commithash>, similar to datalad rerun. Rescheduled jobs also need to be finished with datalad slurm-finish after they are done.
For more information, including installation instructions, check out the GitHub page.
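Like other DataLad extensions, it can typically be installed from PyPI; this assumes the package is published under the name datalad-slurm (check the GitHub page for the authoritative instructions):

$ pip install datalad-slurm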
3.2.3. Example usage
To schedule a SLURM batch script slurm.sh:
$ datalad slurm-schedule \
--output=<output_files_or_dir> \
sbatch slurm.sh [optional arguments]
where <output_files_or_dir> are the expected outputs from the job. Further optional command line arguments can be found in the documentation.
Multiple jobs (including array jobs) can be scheduled one after the other; they are tracked in an SQLite database. Note that open jobs must not have outputs that conflict with those of previously scheduled jobs, so that the outputs of each SLURM run can be traced back to the SLURM job that generated them (see the sketch below).
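For example (a hypothetical sketch; the paths and script names are made up), two open jobs writing into the same output directory would be refused:

$ datalad slurm-schedule -o results/run1/ sbatch job_a.slurm
$ datalad slurm-schedule -o results/run1/ sbatch job_b.slurm
# refused: results/run1/ conflicts with the still-open first job;
# finish that job first or pick a disjoint output location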
To finish these jobs once they are complete, simply run:
$ datalad slurm-finish
Alternatively, to finish a particular scheduled job, run:
$ datalad slurm-finish <slurm_job_id>
This will create a [DATALAD SLURM RUN] entry in the git log, analogous to a datalad run command.
datalad slurm-finish will flag an error for any jobs that could not be handled, either because they are still running or because they failed. These are not committed to the repository, but they are also not automatically cleared from the SQLite database.
Instead, the user needs to decide how to handle failed jobs: either use the --accept-failed-jobs flag to treat them like successful jobs, or --close-failed-jobs to discard them. This can be done per job or for all remaining failed jobs, as sketched below.
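A minimal sketch of the two options (assuming that, as with plain datalad slurm-finish, a job ID can be passed to restrict the operation to a single job):

$ datalad slurm-finish --accept-failed-jobs <slurm_job_id>   # commit this failed job's outputs anyway
$ datalad slurm-finish --close-failed-jobs                   # discard all remaining failed jobs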
To list the current status of all open jobs without saving anything in Git yet, run:
$ datalad slurm-finish --list-open-jobs
To reschedule a previously scheduled job:
$ datalad slurm-reschedule <schedule_commit_hash>
where <schedule_commit_hash> is the commit hash of the previously scheduled job, which must already have been properly finalized. Such a reproduced job also needs a subsequent datalad slurm-finish call.
In the lingo of the original DataLad package, the combination datalad slurm-schedule + datalad slurm-finish is similar to datalad run, and datalad slurm-reschedule + datalad slurm-finish is similar to datalad rerun.
An example workflow could look like this (constructed deliberately to have some failed jobs):
$ datalad slurm-schedule \
-o models/abrupt/gold/ sbatch submit_gold.slurm
$ datalad slurm-schedule \
-o models/abrupt/silver/ sbatch submit_silver.slurm
$ datalad slurm-schedule \
-o models/abrupt/bronze/ sbatch submit_bronze.slurm
$ datalad slurm-schedule \
-o models/abrupt/platinum/ sbatch submit_array_platinum.slurm
Checking the job statuses at some point while they are running:
$ datalad slurm-finish --list-open-jobs
The following jobs are open:
slurm-job-id slurm-job-status
10524442 COMPLETED
10524535 RUNNING
10524556 FAILED
10524620 PENDING
Later, once all the jobs have finished running:
$ datalad slurm-finish
add(ok): models/abrupt/gold/05_02/slurm-10524442.out (file)
add(ok): models/abrupt/gold/05_02/slurm-job-10524442.env.json (file)
add(ok): models/abrupt/gold/05_02/model_0.model.gz (file)
save(ok): . (dataset)
add(ok): models/abrupt/silver/05_02/slurm-10524535.out (file)
add(ok): models/abrupt/silver/05_02/slurm-job-10524535.env.json (file)
add(ok): models/abrupt/silver/05_02/model_0.model.gz (file)
add(ok): models/abrupt/silver/05_02/model.scaler.gz (file)
save(ok): . (dataset)
finish(impossible): [Slurm job(s) for job 10524556 are not complete.Statuses: 10524556: FAILED]
finish(impossible): [Slurm job(s) for job 10524620 are not complete.Statuses: 10524620_0: COMPLETED, 10524620_1: COMPLETED, 10524620_2: TIMEOUT]
action summary:
add (ok: 7)
finish (impossible: 2)
save (ok: 2)
To close the failed jobs:
$ datalad slurm-finish --close-failed-jobs
finish(ok): [Closing failed / cancelled jobs. Statuses: 10524556: FAILED]
finish(ok): [Closing failed / cancelled jobs. Statuses: 10524620_0: COMPLETED, 10524620_1: COMPLETED, 10524620_2: TIMEOUT]
action summary:
finish (ok: 2)
Note that if any sub-job of an array job fails, the whole job is treated as a failed job. The user always has the option to manually commit the successful outputs if desired, as sketched below.
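For instance, after closing the failed platinum array job above, the outputs of its successful sub-jobs could still be committed with plain DataLad (a sketch reusing the output path from the example):

$ datalad save -m "keep outputs of successful platinum sub-jobs" \
    models/abrupt/platinum/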
The Git history would then appear like so:
$ git log --oneline
a8e4aa6 (HEAD -> master) [DATALAD SLURM RUN] Slurm job 10524535: Completed
25067fe [DATALAD SLURM RUN] Slurm job 10524442: Completed
With one particular entry looking like:
commit a8e4aa62519db3b5f63243cc925ee918984bf506 (HEAD -> master)
Author: Tim Callow <tim@notmyrealemail.com>
Date: Tue Feb 18 09:31:47 2025 +0100
[DATALAD SLURM RUN] Slurm job 10524535: Completed
=== Do not change lines below ===
{
"chain": [],
"cmd": "sbatch submit_silver.slurm",
"commit_id": null,
"dsid": "61576cad-ea4f-4425-8f35-16b9955c9926",
"extra_inputs": [],
"inputs": [],
"outputs": [
"models/abrupt/silver",
"models/abrupt/silver/05_02/slurm-10524535.out",
"models/abrupt/silver/05_02/slurm-job-10524535.env.json"
],
"pwd": ".",
"slurm_job_id": 10524535,
"slurm_outputs": [
"models/abrupt/silver/05_02/slurm-10524535.out",
"models/abrupt/silver/05_02/slurm-job-10524535.env.json"
]
}
^^^ Do not change lines above ^^^
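To reproduce this particular result, pass the commit hash from the log above to datalad slurm-reschedule and, once the rescheduled job is done, finalize it again:

$ datalad slurm-reschedule a8e4aa6
$ datalad slurm-finish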