5.2. Modular data science: DataLad compared to Kedro

Modern data science and data engineering projects face challenges similar to those of machine learning analyses: they involve complex pipelines with multiple stages, require reproducible environments, and benefit from modular organization that enables reuse. This has led to the emergence of multiple frameworks with similar goals but different philosophies and implementations.

One notable framework is Kedro, an open-source Python framework for creating reproducible, maintainable, and modular data science code. Originally developed at McKinsey’s QuantumBlack and now hosted by the LF AI & Data Foundation, Kedro has become widely adopted in the data engineering community.

This section compares DataLad with its YODA principles to Kedro, highlighting their similarities, differences, and potential for complementary use. While both tools aim to solve problems of reproducibility, version control, and modularity, they approach these challenges from different angles and serve somewhat different use cases.

5.2.1. Philosophy and focus

Both DataLad/YODA and Kedro emerged from practical experience with data-intensive projects and independently arrived at similar organizational principles. However, their focus and scope differ significantly.

Kedro is a Python-specific framework designed for data engineering and data science pipelines. It emphasizes:

  • Software engineering best practices for data science code (Kedro docs)

  • Modular pipelines that can be tested, documented, and reused (modular pipelines)

  • Abstracted data access through a Data Catalog

  • Built-in visualization of pipeline structure (Kedro-Viz)

  • Integration with ML tools like MLflow and experiment trackers

DataLad with YODA principles is a language-agnostic approach to research data management. It emphasizes:

  • Version control for data, code, and containers at any scale

  • Federated composition through nested subdatasets

  • Provenance tracking for computational results

  • Decentralized data sharing via multiple storage backends

  • Domain-agnostic applicability across research fields

5.2.2. Setup

To illustrate the differences, let’s set up equivalent project structures in both tools.

5.2.2.1. Kedro setup

Kedro projects are created using the Kedro CLI and follow a standardized template:

### Kedro
# Create a new Kedro project
$ pip install kedro
$ kedro new --starter=spaceflights-pandas --name=my-analysis

The resulting project has this structure:

my-analysis/
├── conf/                  # Configuration files
│   ├── base/
│   │   ├── catalog.yml    # Data Catalog definitions
│   │   └── parameters.yml # Pipeline parameters
│   └── local/
│       └── credentials.yml
├── data/                  # Data directory (layered)
│   ├── 01_raw/
│   ├── 02_intermediate/
│   ├── 03_primary/
│   └── ...
├── notebooks/             # Jupyter notebooks
├── src/
│   └── my_analysis/
│       ├── pipelines/     # Modular pipeline definitions
│       │   ├── data_processing/
│       │   │   ├── nodes.py
│       │   │   └── pipeline.py
│       │   └── data_science/
│       │       ├── nodes.py
│       │       └── pipeline.py
│       └── pipeline_registry.py
└── tests/

This structure is designed around Kedro’s core concepts: nodes (Python functions), pipelines (collections of nodes), and the Data Catalog (a registry mapping dataset names to storage locations).
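These concepts can be sketched in a few lines of plain Python. The following is not Kedro's actual API, only an illustration of how a pipeline runner can derive execution order from named inputs and outputs registered in a catalog:

```python
def run_pipeline(nodes, catalog):
    """Execute each node once all of its named inputs are in the catalog,
    resolving execution order by data availability."""
    pending = list(nodes)
    while pending:
        runnable = next(
            (n for n in pending if all(i in catalog for i in n["inputs"])), None
        )
        if runnable is None:
            raise RuntimeError("pipeline has unsatisfiable inputs")
        args = [catalog[i] for i in runnable["inputs"]]
        catalog[runnable["output"]] = runnable["func"](*args)
        pending.remove(runnable)
    return catalog

# Two toy nodes, deliberately listed out of order:
# drop missing values, then sum the cleaned rows
nodes = [
    {"func": sum, "inputs": ["clean"], "output": "total"},
    {"func": lambda rows: [r for r in rows if r is not None],
     "inputs": ["raw"], "output": "clean"},
]
result = run_pipeline(nodes, {"raw": [1, None, 2, 3]})
print(result["total"])  # 6
```

Even though the summing node is listed first, it runs second, because its input only becomes available after the cleaning node has produced it. Kedro performs this kind of dependency resolution for every run.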

5.2.2.2. DataLad setup

A DataLad dataset following YODA principles can be created with:

### DataLad
$ datalad create -c yoda -c text2git my-analysis
$ cd my-analysis
$ mkdir -p data/{raw,intermediate,processed} model metrics

The layout mirrors Kedro's data layers here for easier comparison, but in principle any file-tree convention would do, such as a BIDS study dataset. The resulting structure:

my-analysis/
├── .datalad/              # DataLad configuration
├── .gitattributes
├── code/                  # Analysis scripts (any language)
│   └── README.md
├── data/                  # Data directory
│   ├── raw/
│   ├── intermediate/
│   └── processed/
├── model/                 # Trained models
├── metrics/               # Evaluation results
└── CHANGELOG.md

Unlike Kedro, DataLad does not prescribe a specific code organization – it focuses on data and provenance management while remaining agnostic about how analysis code is structured[1].

5.2.3. Version controlling data

Both tools address the fundamental challenge of managing data that is too large for Git, but they use different mechanisms.

5.2.3.1. Kedro’s Data Catalog and versioning

Kedro manages data access through a Data Catalog defined in conf/base/catalog.yml:

### Kedro catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

model_input_table:
  type: pandas.ParquetDataset
  filepath: data/03_primary/model_input_table.parquet
  versioned: true

regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  versioned: true

When versioned: true is set, Kedro automatically creates timestamped subdirectories for each save operation:

### Kedro versioned dataset structure
data/06_models/regressor.pickle/
├── 2024-01-15T10.30.45.123Z/
│   └── regressor.pickle
└── 2024-01-16T14.22.33.456Z/
    └── regressor.pickle

The Data Catalog abstracts storage locations, allowing the same pipeline code to work with local files, S3, GCS, or other storage backends simply by changing configuration.

5.2.3.2. DataLad’s git-annex versioning

DataLad uses git-annex to handle large files while keeping them under full Git version control. Data is tracked at the file level, with content stored in the annex and only lightweight symlinks in the working tree[2]:

### DataLad
$ datalad download-url \
    --message "Download raw data" \
    https://example.com/companies.csv \
    -O data/raw/

$ datalad status
$ datalad save -m "Process data" data/processed/

Every version of every file is tracked in the Git history, and the full provenance is available through standard Git commands:

### DataLad
$ git log --oneline data/processed/model_input.parquet
a1b2c3d Process data: filter outliers
d4e5f6g Initial data processing

How does versioning differ between the tools?

Kedro versioning:

  • Timestamp-based directory structure

  • Automatic versioning on each pipeline run

  • Versions stored as separate files (no deduplication)

  • Loading specific versions requires explicit specification

  • Works within single project scope

DataLad/git-annex versioning:

  • Content-addressed storage with deduplication

  • Full Git history for all changes

  • Can retrieve any historical version via Git checkout

  • Distributed storage across multiple remotes

  • Works across federated dataset hierarchies
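The deduplication follows from content addressing: git-annex derives a file's key from its content, so identical files map to the same annex object regardless of their path. A simplified illustration in the spirit of the SHA256E backend (real keys involve further details):

```python
import hashlib
import os

def annex_style_key(path):
    """Content-addressed key in the spirit of git-annex's SHA256E backend:
    derived from the file's bytes, so identical content yields the same key."""
    data = open(path, "rb").read()
    ext = os.path.splitext(path)[1]
    return f"SHA256E-s{len(data)}--{hashlib.sha256(data).hexdigest()}{ext}"
```

Two files with identical content produce identical keys and are therefore stored only once in the annex, while Kedro's timestamped directories would hold two full copies.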

5.2.4. Sharing and reuse

One of the most significant differences between Kedro and DataLad lies in how they handle modularity and reuse.

5.2.4.1. Kedro modular pipelines

Kedro’s approach to modularity is modular pipelines – self-contained pipeline components that can be reused within a project or packaged for use in other projects:

### Kedro: Create a modular pipeline
$ kedro pipeline create data_processing

This creates a new pipeline with its own nodes.py, pipeline.py, and test structure. Pipelines can be composed:

### Kedro pipeline_registry.py
from kedro.pipeline import Pipeline
from my_analysis.pipelines import data_processing, data_science

def register_pipelines() -> dict[str, Pipeline]:
    return {
        "data_processing": data_processing.create_pipeline(),
        "data_science": data_science.create_pipeline(),
        "__default__": (
            data_processing.create_pipeline() +
            data_science.create_pipeline()
        ),
    }

Kedro pipelines can be packaged and shared via Python packages, though they remain within the Python ecosystem and typically within a single project’s codebase.

5.2.4.2. YODA subdatasets

DataLad’s approach to modularity is subdatasets – nested Git repositories that can be independently versioned, shared, and composed.

For example, using a publicly available dataset as a data dependency:

### DataLad: Add data as a subdataset
$ datalad clone -d . \
    https://github.com/datalad-handbook/iris_data \
    data/raw/iris

This clones the iris flower dataset as a subdataset, recording its exact version. For code dependencies, you can similarly clone analysis libraries:

### DataLad: Add shared analysis code as subdataset
$ datalad clone -d . \
    https://github.com/datalad/datalad \
    code/datalad-lib

The key difference is that subdatasets are truly independent – they can live in different repositories, be shared separately, and be reused across completely different projects. A DataLad superdataset tracks the exact version of each subdataset, enabling perfect reproducibility:

### DataLad: Check subdataset versions
$ datalad subdatasets

### Anyone can recreate the exact state
$ datalad get -r .

This federated approach scales from single files to thousands of datasets, as demonstrated by datasets.datalad.org with over 8,000 subdatasets in a deep hierarchy.
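On disk, each dataset in such a hierarchy is recognizable by its .datalad/ directory. As a toy illustration of how these hierarchies nest (note that datalad subdatasets reads Git's submodule records rather than scanning the filesystem):

```python
import os

def find_datasets(root):
    """Walk a directory tree and list every DataLad dataset in it,
    identified by the presence of a .datalad/ directory."""
    hits = []
    for dirpath, dirnames, _files in os.walk(root):
        if ".datalad" in dirnames:
            hits.append(os.path.relpath(dirpath, root))
    return sorted(hits)
```

Running this over a superdataset would list the superdataset itself (".") plus every installed subdataset, at any nesting depth.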

Modular pipelines vs. subdatasets

The modularity approaches serve different purposes:

Kedro modular pipelines:

  • Python code modularity

  • Single project scope

  • Shared via Python packages

  • Focus on code organization

  • Runtime composition

YODA subdatasets:

  • Data and code modularity

  • Federated, multi-project scope

  • Shared via Git repositories

  • Focus on data provenance

  • Version-locked composition

5.2.5. Pipeline execution and provenance

Both tools support executing analyses and tracking what was done, but with different approaches.

5.2.5.1. Kedro pipeline execution

Kedro provides a complete pipeline execution framework with dependency resolution:

### Kedro: Run the full pipeline
$ kedro run

### Kedro: Run specific pipeline
$ kedro run --pipeline data_processing

### Kedro: Run with specific parameters
$ kedro run --params model_options.test_size=0.3

Kedro automatically resolves node dependencies and executes them in the correct order. The pipeline structure can be visualized with Kedro-Viz:

### Kedro: Visualize pipeline
$ pip install kedro-viz
$ kedro viz run

This opens an interactive web interface showing nodes, datasets, and their connections.

5.2.5.2. DataLad run

DataLad does not include a workflow manager of its own; instead, its datalad run command records any command execution with its inputs, outputs, and exact parameters[3]:

### DataLad: Record a computation
$ datalad run \
    --message "Train classifier" \
    --input data/processed/train.csv \
    --output model/classifier.pkl \
    python code/train.py

The command, its inputs, and outputs are recorded in the Git history as machine-readable provenance. This can be re-executed later:

### DataLad: Rerun a computation
$ datalad rerun <commit-hash>

For complex pipelines, DataLad integrates with workflow managers like Snakemake or Nextflow, while still capturing provenance for each step.

Workflow management comparison

Kedro:

  • Built-in workflow management

  • Node-level dependency resolution

  • Parallel execution support

  • Interactive visualization (Kedro-Viz)

  • Deployment to Airflow, Prefect, etc.

DataLad:

  • Command-level provenance tracking

  • Integrates with external workflow managers

  • Full rerun capability

  • CLI-focused interface

  • Provenance embedded in Git history

5.2.6. Configuration management

Both tools separate configuration from code, but in different ways.

5.2.6.1. Kedro configuration

Kedro uses a hierarchical configuration system with conf/base/ for shared settings and conf/local/ for environment-specific overrides:

### conf/base/parameters.yml
model_options:
  test_size: 0.2
  random_state: 42
  features:
    - feature_a
    - feature_b

Parameters are accessed in nodes via dependency injection:

### Kedro node
def train_model(data, params: dict):
    test_size = params["model_options"]["test_size"]
    # ...
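Conceptually, values from conf/local/ are overlaid onto conf/base/, so a local override replaces only the keys it specifies. A simplified sketch of such layered configuration (Kedro's own config loader has additional rules, e.g. for file patterns and environments):

```python
def merge_config(base, override):
    """Recursively overlay environment-specific settings onto shared ones
    (simplified; local values win, untouched base values are kept)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"model_options": {"test_size": 0.2, "random_state": 42}}
local = {"model_options": {"test_size": 0.3}}
print(merge_config(base, local))
# {'model_options': {'test_size': 0.3, 'random_state': 42}}
```

Here the local environment changes test_size while random_state continues to come from the shared base configuration.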

5.2.6.2. DataLad configuration

DataLad itself doesn’t prescribe a configuration structure, but the YODA principles recommend separating configuration into clearly identifiable files. Configuration can be tracked in the dataset alongside code:

### DataLad: Track configuration
$ datalad save -m "Update model parameters" code/config.yml

For container-based workflows, configuration is often baked into the container definition or passed as command-line arguments, both of which are captured by datalad run.

5.2.7. Summary

DataLad and Kedro share similar goals of enabling reproducible, modular data science, but they approach the problem from different angles. The choice between them depends on your specific needs, existing infrastructure, and team preferences.

Table 5.1 Comparison of DataLad/YODA and Kedro

Feature             | Kedro                      | DataLad/YODA
--------------------+----------------------------+--------------------------------
Language            | Python-specific            | Language-agnostic
Primary focus       | Data engineering pipelines | Research data management
Modularity          | Modular pipelines (Python) | Subdatasets (Git)
Scope               | Single project             | Federated multi-project
Data versioning     | Timestamp directories      | git-annex (content-addressed)
Sharing model       | Python packages            | Git repositories
Workflow management | Built-in                   | External (Snakemake, etc.)
Visualization       | Kedro-Viz (web UI)         | CLI / external tools
Storage backends    | Data Catalog abstraction   | git-annex special remotes
Provenance          | Pipeline structure         | Git history + run records
Learning curve      | Medium (Python patterns)   | Medium (Git/git-annex concepts)

5.2.8. When to use which tool

Consider Kedro if you:

  • Work primarily in Python

  • Need built-in workflow management and visualization

  • Want standardized project templates for data science teams

  • Focus on single-project data pipelines

  • Value integration with ML experiment tracking tools

  • Prefer configuration-driven data access abstraction

Consider DataLad/YODA if you:

  • Work across multiple programming languages

  • Need to manage very large datasets (terabytes+)

  • Want federated composition across multiple projects

  • Require decentralized data sharing

  • Need fine-grained provenance for every file

  • Work in research environments with diverse tools

5.2.9. Using them together

Kedro and DataLad can be complementary. A powerful pattern is to use Kedro for pipeline execution within a DataLad dataset for version control and sharing.

Data versioning guidance

When combining Kedro and DataLad, DataLad handles all data versioning via git-annex. Do not enable Kedro’s versioned: true in the Data Catalog – Kedro’s timestamp-based versioning creates duplicate copies of each file version, which conflicts with DataLad’s content-addressed storage and deduplication. Conveniently, Kedro’s Data Catalog is designed to make this easy: versioning is off by default, so simply omit the versioned flag and let DataLad track file versions through the Git history.

The following walkthrough demonstrates this combination using a minimal Kedro project.

5.2.9.1. Step 1: Create a Kedro project

First, install Kedro and create a new project using kedro new:

### Install Kedro (in your environment)
$ pip install kedro pandas

### Create a new Kedro project (non-interactive)
$ kedro new --name=kedro-datalad-demo --tools=none --example=no --telemetry=no
$ cd kedro-datalad-demo

This generates the standard Kedro project structure including pyproject.toml, settings.py, pipeline_registry.py, catalog.yml, .gitignore, and other scaffolding files.

5.2.9.2. Step 2: Initialize DataLad inside the Kedro project

Turn the Kedro project into a DataLad dataset. We use --force because the directory already exists, and text2git to store text files directly in Git:

### Initialize DataLad in the existing Kedro project
$ datalad create --force -c text2git .

Note

We skip the -c yoda configuration here because Kedro’s kedro new already provides a project structure with sensible defaults.

5.2.9.3. Step 3: Add demo pipeline, save, and run with provenance

Add a simple demo pipeline by modifying the generated pipeline_registry.py:

$ cat > src/kedro_datalad_demo/pipeline_registry.py << 'EOF'
from kedro.pipeline import Pipeline, node

def greet(name: str) -> str:
    greeting = f"Hello, {name}!"
    # Write output for provenance tracking
    with open("output.txt", "w") as f:
        f.write(greeting)
    return greeting

def register_pipelines():
    return {
        "__default__": Pipeline([
            node(greet, inputs="params:name", outputs="greeting")
        ])
    }
EOF

Set the pipeline parameter (kedro new generates an empty parameters.yml):

$ cat > conf/base/parameters.yml << 'EOF'
name: DataLad
EOF

Save the Kedro project with the demo pipeline to the DataLad dataset:

### Track Kedro project structure
$ datalad save -m "Initialize Kedro project with demo pipeline"

Run the Kedro pipeline with DataLad provenance tracking:

### Optional: Disable telemetry for cleaner output
$ export KEDRO_DISABLE_TELEMETRY=true

### Run Kedro pipeline with DataLad provenance
$ datalad run \
    --message "Execute Kedro demo pipeline" \
    --output "output.txt" \
    kedro run

The pipeline execution is now recorded in the Git history with full provenance. You can verify that the output was created and tracked:

$ cat output.txt
Hello, DataLad!

$ git log --oneline -1
a1b2c3d Execute Kedro demo pipeline

5.2.9.4. Step 4: Add data dependencies as subdatasets

For data dependencies, you can use DataLad subdatasets while configuring Kedro’s Data Catalog to point to those locations:

### Add shared data as subdataset
$ datalad clone -d . \
    https://github.com/datalad-handbook/iris_data \
    data/01_raw/iris

Then in conf/base/catalog.yml:

iris_data:
  type: pandas.CSVDataset
  filepath: data/01_raw/iris/iris.csv
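Keep in mind that Kedro can only read this file once its content is locally present; until datalad get has run, an annexed file is a symlink whose target does not yet exist. A plain-Python check for this state (no DataLad required):

```python
import os
import tempfile

def content_missing(path):
    """True if path looks like an annexed file whose content has not been
    retrieved: git-annex represents such files as unresolved symlinks."""
    return os.path.islink(path) and not os.path.exists(path)

# Demo: a dangling symlink stands in for a not-yet-retrieved annexed file
workdir = tempfile.mkdtemp()
dangling = os.path.join(workdir, "iris.csv")
os.symlink(os.path.join(workdir, "annex-object-not-here"), dangling)
print(content_missing(dangling))  # True
```

In a combined setup, running datalad get on the catalog's filepath entries before kedro run ensures all inputs resolve.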

5.2.9.5. Benefits of this combination

This approach provides:

  • Kedro’s workflow management and visualization

  • DataLad’s version control and provenance for every pipeline run

  • YODA’s federated composition for data dependencies

  • Git-based sharing of the complete reproducible analysis

  • Ability to retrieve any historical state with datalad get

Share the complete project:

### Share to a sibling (after creating one)
$ datalad push --to origin

Anyone cloning this dataset will get the exact versions of all subdatasets and can reproduce the analysis.

Testing these examples

A test script that validates all commands in this section is available at docs/beyond_basics/kedro-examples-test.sh in the handbook repository. Run it with ./kedro-examples-test.sh --cleanup after installing prerequisites (pip install datalad kedro pandas).

Footnotes