5.2. Modular data science: DataLad compared to Kedro¶
Modern data science and data engineering projects face challenges similar to those of machine learning analyses: they involve complex pipelines with multiple stages, require reproducible environments, and benefit from modular organization that enables reuse. This has led to the emergence of multiple frameworks with similar goals but different philosophies and implementations.
One notable framework is Kedro, an open-source Python framework for creating reproducible, maintainable, and modular data science code. Originally developed at McKinsey’s QuantumBlack and now hosted by the LF AI & Data Foundation, Kedro has become widely adopted in the data engineering community.
This section compares DataLad with its YODA principles to Kedro, highlighting their similarities, differences, and potential for complementary use. While both tools aim to solve problems of reproducibility, version control, and modularity, they approach these challenges from different angles and serve somewhat different use cases.
5.2.1. Philosophy and focus¶
Both DataLad/YODA and Kedro emerged from practical experience with data-intensive projects and independently arrived at similar organizational principles. However, their focus and scope differ significantly.
Kedro is a Python-specific framework designed for data engineering and data science pipelines. It emphasizes:
Software engineering best practices for data science code (Kedro docs)
Modular pipelines that can be tested, documented, and reused (modular pipelines)
Abstracted data access through a Data Catalog
Built-in visualization of pipeline structure (Kedro-Viz)
Integration with ML tools like MLflow and experiment trackers
DataLad with YODA principles is a language-agnostic approach to research data management. It emphasizes:
Version control for data, code, and containers at any scale
Federated composition through nested subdatasets
Provenance tracking for computational results
Decentralized data sharing via multiple storage backends
Domain-agnostic applicability across research fields
5.2.2. Setup¶
To illustrate the differences, let’s set up equivalent project structures in both tools.
5.2.2.1. Kedro setup¶
Kedro projects are created using the Kedro CLI and follow a standardized template:
### Kedro
# Create a new Kedro project
$ pip install kedro
$ kedro new --starter=spaceflights-pandas --name=my-analysis
The resulting project has this structure:
my-analysis/
├── conf/                 # Configuration files
│   ├── base/
│   │   ├── catalog.yml      # Data Catalog definitions
│   │   └── parameters.yml   # Pipeline parameters
│   └── local/
│       └── credentials.yml
├── data/                 # Data directory (layered)
│   ├── 01_raw/
│   ├── 02_intermediate/
│   ├── 03_primary/
│   └── ...
├── notebooks/            # Jupyter notebooks
├── src/
│   └── my_analysis/
│       ├── pipelines/       # Modular pipeline definitions
│       │   ├── data_processing/
│       │   │   ├── nodes.py
│       │   │   └── pipeline.py
│       │   └── data_science/
│       │       ├── nodes.py
│       │       └── pipeline.py
│       └── pipeline_registry.py
└── tests/
This structure is designed around Kedro’s core concepts: nodes (Python functions), pipelines (collections of nodes), and the Data Catalog (a registry mapping dataset names to storage locations).
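These concepts can be illustrated with a small standard-library sketch – this is not Kedro's actual implementation, only the underlying idea that a pipeline is a set of named functions whose execution order is resolved from their declared inputs and outputs:

```python
# Conceptual sketch of Kedro's node/pipeline model, stdlib only.
# A "node" is a function with named inputs and an output; a "pipeline"
# resolves execution order from those names (NOT Kedro's real API).

def preprocess(raw):
    return [x * 2 for x in raw]

def train(features):
    return {"weights": sum(features)}

# (function, input names, output name) -- analogous to node(...) in Kedro
nodes = [
    (train, ["features"], "model"),       # deliberately listed first
    (preprocess, ["raw"], "features"),
]

def run_pipeline(nodes, catalog):
    """Execute nodes in dependency order, like a tiny pipeline runner."""
    pending = list(nodes)
    while pending:
        for entry in pending:
            func, inputs, output = entry
            if all(name in catalog for name in inputs):
                catalog[output] = func(*[catalog[n] for n in inputs])
                pending.remove(entry)
                break
        else:
            raise RuntimeError("unresolvable dependencies")
    return catalog

catalog = run_pipeline(nodes, {"raw": [1, 2, 3]})
print(catalog["model"])  # {'weights': 12}
```

Even though `train` is listed first, it only runs after `preprocess` has produced `features` – the same dependency resolution Kedro performs over its real Data Catalog.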
5.2.2.2. DataLad setup¶
A DataLad dataset following YODA principles can be created with:
### DataLad
$ datalad create -c yoda -c text2git my-analysis
$ cd my-analysis
$ mkdir -p data/{raw,intermediate,processed} model metrics
This layout is just one option – in principle, any file-tree convention, such as a BIDS study dataset, would work equally well. The resulting structure:
my-analysis/
├── .datalad/            # DataLad configuration
├── .gitattributes
├── code/                # Analysis scripts (any language)
│   └── README.md
├── data/                # Data directory
│   ├── raw/
│   ├── intermediate/
│   └── processed/
└── CHANGELOG.md
Unlike Kedro, DataLad does not prescribe a specific code organization – it focuses on data and provenance management while remaining agnostic about how analysis code is structured[1].
5.2.3. Version controlling data¶
Both tools address the fundamental challenge of managing data that is too large for Git, but they use different mechanisms.
5.2.3.1. Kedro’s Data Catalog and versioning¶
Kedro manages data access through a Data Catalog defined in conf/base/catalog.yml:
### Kedro catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

model_input_table:
  type: pandas.ParquetDataset
  filepath: data/03_primary/model_input_table.parquet
  versioned: true

regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  versioned: true
When versioned: true is set, Kedro automatically creates timestamped subdirectories for each save operation:
### Kedro versioned dataset structure
data/06_models/regressor.pickle/
├── 2024-01-15T10.30.45.123Z/
│   └── regressor.pickle
└── 2024-01-16T14.22.33.456Z/
    └── regressor.pickle
The Data Catalog abstracts storage locations, allowing the same pipeline code to work with local files, S3, GCS, or other storage backends simply by changing configuration.
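For example, retargeting a dataset from local disk to S3 is purely a configuration change – the pipeline code that loads `model_input_table` is untouched (bucket name and credentials key below are placeholders):

```yaml
model_input_table:
  type: pandas.ParquetDataset
  # hypothetical bucket; node code stays exactly the same
  filepath: s3://my-bucket/data/03_primary/model_input_table.parquet
  credentials: dev_s3
```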
5.2.3.2. DataLad’s git-annex versioning¶
DataLad uses git-annex to handle large files while keeping them under full Git version control. Data is tracked at the file level, with content stored in the annex and only lightweight symlinks in the working tree[2]:
### DataLad
$ datalad download-url \
  --message "Download raw data" \
  https://example.com/companies.csv \
  -O 'data/raw/'
$ datalad status
$ datalad save -m "Process data" data/processed/
Every version of every file is tracked in the Git history, and the full provenance is available through standard Git commands:
### DataLad
$ git log --oneline data/processed/model_input.parquet
a1b2c3d Process data: filter outliers
d4e5f6g Initial data processing
How does versioning differ between the tools?
Kedro versioning:
Timestamp-based directory structure
Automatic versioning on each pipeline run
Versions stored as separate files (no deduplication)
Loading specific versions requires explicit specification
Works within single project scope
DataLad/git-annex versioning:
Content-addressed storage with deduplication
Full Git history for all changes
Can retrieve any historical version via Git checkout
Distributed storage across multiple remotes
Works across federated dataset hierarchies
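The practical consequence of the deduplication difference can be sketched with the standard library alone – timestamp-based versioning stores a copy for every save, while content-addressed storage keys each object by a hash of its content, so identical saves collapse into one object:

```python
# Illustration only: timestamp vs. content-addressed versioning.
import hashlib

saves = [b"model v1", b"model v1", b"model v2"]  # second save is unchanged

# Timestamp-style versioning (like Kedro's versioned datasets):
# every save gets its own directory, even for identical content.
timestamped = {f"2024-01-{15 + i}T10.00.00Z": blob
               for i, blob in enumerate(saves)}

# Content-addressed storage (like git-annex): the key is derived from
# the content itself, so identical content is stored only once.
annex = {hashlib.sha256(blob).hexdigest(): blob for blob in saves}

print(len(timestamped))  # 3 stored copies
print(len(annex))        # 2 unique objects
```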
5.2.5. Pipeline execution and provenance¶
Both tools support executing analyses and tracking what was done, but with different approaches.
5.2.5.1. Kedro pipeline execution¶
Kedro provides a complete pipeline execution framework with dependency resolution:
### Kedro: Run the full pipeline
$ kedro run
### Kedro: Run specific pipeline
$ kedro run --pipeline data_processing
### Kedro: Run with specific parameters
$ kedro run --params model_options.test_size=0.3
Kedro automatically resolves node dependencies and executes them in the correct order. The pipeline structure can be visualized with Kedro-Viz:
### Kedro: Visualize pipeline
$ pip install kedro-viz
$ kedro viz run
This opens an interactive web interface showing nodes, datasets, and their connections.
5.2.5.2. DataLad run¶
DataLad does not include a workflow manager, but provides datalad run (manual) to record any command execution with its inputs, outputs, and exact parameters[3]:
### DataLad: Record a computation
$ datalad run \
--message "Train classifier" \
--input data/processed/train.csv \
--output model/classifier.pkl \
python code/train.py
The command, its inputs, and outputs are recorded in the Git history as machine-readable provenance. This can be re-executed later:
### DataLad: Rerun a computation
$ datalad rerun <commit-hash>
For complex pipelines, DataLad integrates with workflow managers like Snakemake or Nextflow, while still capturing provenance for each step.
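Conceptually, each `datalad run` call captures a machine-readable record of the command, its declared inputs and outputs, and its exit status, and embeds it in the commit – that record is what makes `datalad rerun` possible. The sketch below mimics the idea with the standard library (this is a simplification, not DataLad's actual record schema):

```python
# Simplified sketch of command-level provenance capture, in the
# spirit of `datalad run` -- NOT DataLad's real record format.
import json
import subprocess
import sys

cmd = [sys.executable, "-c", "print('training...')"]
result = subprocess.run(cmd, capture_output=True, text=True)

record = {
    "cmd": cmd,
    "inputs": ["data/processed/train.csv"],   # declared by the user
    "outputs": ["model/classifier.pkl"],
    "exit": result.returncode,
}
# DataLad stores such a record in the commit message itself,
# tying the provenance to the exact state of the dataset.
print(json.dumps(record, indent=2))
```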
Workflow management comparison
Kedro:
Built-in workflow management
Node-level dependency resolution
Parallel execution support
Interactive visualization (Kedro-Viz)
Deployment to Airflow, Prefect, etc.
DataLad:
Command-level provenance tracking
Integrates with external workflow managers
Full rerun capability
CLI-focused interface
Provenance embedded in Git history
5.2.6. Configuration management¶
Both tools separate configuration from code, but in different ways.
5.2.6.1. Kedro configuration¶
Kedro uses a hierarchical configuration system with conf/base/ for shared settings and conf/local/ for environment-specific overrides:
### conf/base/parameters.yml
model_options:
  test_size: 0.2
  random_state: 42
  features:
    - feature_a
    - feature_b
Parameters are accessed in nodes via dependency injection:
### Kedro node
def train_model(data, params: dict):
    test_size = params["model_options"]["test_size"]
    # ...
5.2.6.2. DataLad configuration¶
DataLad itself doesn’t prescribe a configuration structure, but the YODA principles recommend separating configuration into clearly identifiable files. Configuration can be tracked in the dataset alongside code:
### DataLad: Track configuration
$ datalad save -m "Update model parameters" code/config.yml
For container-based workflows, configuration is often baked into the container definition or passed as command-line arguments, both of which are captured by datalad run.
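Because the configuration file is an ordinary tracked file, analysis scripts can read it directly, and declaring it as an `--input` to `datalad run` pins the exact version used in the provenance record. A minimal stdlib sketch (file name and keys are hypothetical):

```python
# Read parameters from a version-controlled config file.
# The file is created here only to keep the sketch self-contained.
import json
from pathlib import Path

cfg_path = Path("config.json")
cfg_path.write_text(json.dumps({"test_size": 0.2, "random_state": 42}))

params = json.loads(cfg_path.read_text())
print(params["test_size"])  # 0.2
```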
5.2.7. Summary¶
DataLad and Kedro share similar goals of enabling reproducible, modular data science, but they approach the problem from different angles. The choice between them depends on your specific needs, existing infrastructure, and team preferences.
| Feature | Kedro | DataLad/YODA |
|---|---|---|
| Language | Python-specific | Language-agnostic |
| Primary focus | Data engineering pipelines | Research data management |
| Modularity | Modular pipelines (Python) | Subdatasets (Git) |
| Scope | Single project | Federated multi-project |
| Data versioning | Timestamp directories | git-annex (content-addressed) |
| Sharing model | Python packages | Git repositories |
| Workflow management | Built-in | External (Snakemake, etc.) |
| Visualization | Kedro-Viz (web UI) | CLI / external tools |
| Storage backends | Data Catalog abstraction | git-annex special remotes |
| Provenance | Pipeline structure | Git history + run records |
| Learning curve | Medium (Python patterns) | Medium (Git/git-annex concepts) |
5.2.8. When to use which tool¶
Consider Kedro if you:
Work primarily in Python
Need built-in workflow management and visualization
Want standardized project templates for data science teams
Focus on single-project data pipelines
Value integration with ML experiment tracking tools
Prefer configuration-driven data access abstraction
Consider DataLad/YODA if you:
Work across multiple programming languages
Need to manage very large datasets (terabytes+)
Want federated composition across multiple projects
Require decentralized data sharing
Need fine-grained provenance for every file
Work in research environments with diverse tools
5.2.9. Using them together¶
Kedro and DataLad can be complementary. A powerful pattern is to use Kedro for pipeline execution within a DataLad dataset for version control and sharing.
Data versioning guidance
When combining Kedro and DataLad, DataLad handles all data versioning via git-annex.
Do not enable Kedro’s versioned: true in the Data Catalog – Kedro’s timestamp-based
versioning creates duplicate copies of each file version, which conflicts with DataLad’s
content-addressed storage and deduplication.
Conveniently, Kedro’s Data Catalog is designed to make this easy: versioning is off by
default, so simply omit the versioned flag and let DataLad track file versions
through the Git history.
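A catalog entry in the combined setup therefore looks like any unversioned entry:

```yaml
regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  # no `versioned: true` -- DataLad/git-annex tracks versions instead
```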
The following walkthrough demonstrates this combination using a minimal Kedro project.
5.2.9.1. Step 1: Create a Kedro project¶
First, install Kedro and create a new project using kedro new:
### Install Kedro (in your environment)
$ pip install kedro pandas
### Create a new Kedro project (non-interactive)
$ kedro new --name=kedro-datalad-demo --tools=none --example=no --telemetry=no
$ cd kedro-datalad-demo
This generates the standard Kedro project structure including pyproject.toml,
settings.py, pipeline_registry.py, catalog.yml, .gitignore, and other
scaffolding files.
5.2.9.2. Step 2: Initialize DataLad inside the Kedro project¶
Turn the Kedro project into a DataLad dataset.
We use --force because the directory already exists, and text2git to store
text files directly in Git:
### Initialize DataLad in the existing Kedro project
$ datalad create --force -c text2git .
Note
We skip the -c yoda configuration here because Kedro’s kedro new
already provides a project structure with sensible defaults.
5.2.9.3. Step 3: Add demo pipeline, save, and run with provenance¶
Add a simple demo pipeline by modifying the generated pipeline_registry.py:
$ cat > src/kedro_datalad_demo/pipeline_registry.py << 'EOF'
from kedro.pipeline import Pipeline, node


def greet(name: str) -> str:
    greeting = f"Hello, {name}!"
    # Write output for provenance tracking
    with open("output.txt", "w") as f:
        f.write(greeting)
    return greeting


def register_pipelines():
    return {
        "__default__": Pipeline([
            node(greet, inputs="params:name", outputs="greeting")
        ])
    }
EOF
Set the pipeline parameter (kedro new generates an empty parameters.yml):
$ cat > conf/base/parameters.yml << 'EOF'
name: DataLad
EOF
Save the Kedro project with the demo pipeline to the DataLad dataset:
### Track Kedro project structure
$ datalad save -m "Initialize Kedro project with demo pipeline"
Run the Kedro pipeline with DataLad provenance tracking:
### Optional: Disable telemetry for cleaner output
$ export KEDRO_DISABLE_TELEMETRY=true
### Run Kedro pipeline with DataLad provenance
$ datalad run \
--message "Execute Kedro demo pipeline" \
--output "output.txt" \
kedro run
The pipeline execution is now recorded in the Git history with full provenance. You can verify that the output was created and tracked:
$ cat output.txt
Hello, DataLad!
$ git log --oneline -1
a1b2c3d Execute Kedro demo pipeline
5.2.9.4. Step 4: Add data dependencies as subdatasets¶
For data dependencies, you can use DataLad subdatasets while configuring Kedro’s Data Catalog to point to those locations:
### Add shared data as subdataset
$ datalad clone -d . \
https://github.com/datalad-handbook/iris_data \
data/01_raw/iris
Then in conf/base/catalog.yml:
iris_data:
  type: pandas.CSVDataset
  filepath: data/01_raw/iris/iris.csv

Note that the file content must be retrieved (e.g., via datalad get data/01_raw/iris/iris.csv) before Kedro can read it.
5.2.9.5. Benefits of this combination¶
This approach provides:
Kedro’s workflow management and visualization
DataLad’s version control and provenance for every pipeline run
YODA’s federated composition for data dependencies
Git-based sharing of the complete reproducible analysis
Ability to retrieve any historical state with datalad get
Share the complete project:
### Share to a sibling (after creating one)
$ datalad push --to origin
Anyone cloning this dataset will get the exact versions of all subdatasets and can reproduce the analysis.
Testing these examples
A test script that validates all commands in this section is available at
docs/beyond_basics/kedro-examples-test.sh in the handbook repository.
Run it with ./kedro-examples-test.sh --cleanup after installing prerequisites
(pip install datalad kedro pandas).
Footnotes