5.2. Modular data science: DataLad compared to Kedro¶
Modern data science and data engineering projects face challenges similar to those of machine learning analyses: they involve complex pipelines with multiple stages, require reproducible environments, and benefit from modular organization that enables reuse. This has led to the emergence of multiple frameworks with similar goals but different philosophies and implementations.
One notable framework is Kedro, an open-source Python framework for creating reproducible, maintainable, and modular data science code. Originally developed at McKinsey’s QuantumBlack and now hosted by the LF AI & Data Foundation, Kedro has become widely adopted in the data engineering community.
This section compares DataLad with its YODA principles to Kedro, highlighting their similarities, differences, and potential for complementary use. While both tools aim to solve problems of reproducibility, version control, and modularity, they approach these challenges from different angles and serve somewhat different use cases.
5.2.1. Philosophy and focus¶
Both DataLad/YODA and Kedro emerged from practical experience with data-intensive projects and independently arrived at similar organizational principles. However, their focus and scope differ significantly.
Kedro is a Python-specific framework designed for data engineering and data science pipelines. It emphasizes:
Software engineering best practices for data science code (Kedro docs)
Modular pipelines that can be tested, documented, and reused (modular pipelines)
Abstracted data access through a Data Catalog
Built-in visualization of pipeline structure (Kedro-Viz)
Integration with ML tools like MLflow and experiment trackers
DataLad with YODA principles is a language-agnostic approach to research data management. It emphasizes:
Version control for data, code, and containers at any scale
Federated composition through nested subdatasets
Provenance tracking for computational results
Decentralized data sharing via multiple storage backends
Domain-agnostic applicability across research fields
5.2.2. Setup¶
To illustrate the differences, let’s set up equivalent project structures in both tools.
5.2.2.1. Kedro setup¶
Kedro projects are created using the Kedro CLI and follow a standardized template:
### Kedro
# Create a new Kedro project
$ pip install kedro
$ kedro new --starter=spaceflights-pandas --name=my-analysis
The resulting project has this structure:
my-analysis/
├── conf/                 # Configuration files
│   ├── base/
│   │   ├── catalog.yml      # Data Catalog definitions
│   │   └── parameters.yml   # Pipeline parameters
│   └── local/
│       └── credentials.yml
├── data/                 # Data directory (layered)
│   ├── 01_raw/
│   ├── 02_intermediate/
│   ├── 03_primary/
│   └── ...
├── notebooks/            # Jupyter notebooks
├── src/
│   └── my_analysis/
│       ├── pipelines/       # Modular pipeline definitions
│       │   ├── data_processing/
│       │   │   ├── nodes.py
│       │   │   └── pipeline.py
│       │   └── data_science/
│       │       ├── nodes.py
│       │       └── pipeline.py
│       └── pipeline_registry.py
└── tests/
This structure is designed around Kedro’s core concepts: nodes (Python functions), pipelines (collections of nodes), and the Data Catalog (a registry mapping dataset names to storage locations).
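These concepts can be illustrated with a small standard-library sketch – this is not Kedro's actual implementation, only the underlying idea that a pipeline is a set of named functions whose execution order is resolved from their declared inputs and outputs:

```python
# Conceptual sketch of Kedro's node/pipeline model, stdlib only.
# A "node" is a function with named inputs and an output; a "pipeline"
# resolves execution order from those names (NOT Kedro's real API).

def preprocess(raw):
    return [x * 2 for x in raw]

def train(features):
    return {"weights": sum(features)}

# (function, input names, output name) -- analogous to node(...) in Kedro
nodes = [
    (train, ["features"], "model"),       # deliberately listed first
    (preprocess, ["raw"], "features"),
]

def run_pipeline(nodes, catalog):
    """Execute nodes in dependency order, like a tiny pipeline runner."""
    pending = list(nodes)
    while pending:
        for entry in pending:
            func, inputs, output = entry
            if all(name in catalog for name in inputs):
                catalog[output] = func(*[catalog[n] for n in inputs])
                pending.remove(entry)
                break
        else:
            raise RuntimeError("unresolvable dependencies")
    return catalog

catalog = run_pipeline(nodes, {"raw": [1, 2, 3]})
print(catalog["model"])  # {'weights': 12}
```

Even though `train` is listed first, it only runs after `preprocess` has produced `features` – the same dependency resolution Kedro performs over its real Data Catalog.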
5.2.2.2. DataLad setup¶
A DataLad dataset following YODA principles can be created with:
### DataLad
$ datalad create -c yoda -c text2git my-analysis
$ cd my-analysis
$ mkdir -p data/{raw,intermediate,processed} model metrics
This layout is just one option – in principle, any file-tree convention, such as a BIDS study dataset, would work equally well. The resulting structure:
my-analysis/
├── .datalad/            # DataLad configuration
├── .gitattributes
├── code/                # Analysis scripts (any language)
│   └── README.md
├── data/                # Data directory
│   ├── raw/
│   ├── intermediate/
│   └── processed/
└── CHANGELOG.md
Unlike Kedro, DataLad does not prescribe a specific code organization – it focuses on data and provenance management while remaining agnostic about how analysis code is structured[1].
5.2.3. Version controlling data¶
Both tools address the fundamental challenge of managing data that is too large for Git, but they use different mechanisms.
5.2.3.1. Kedro’s Data Catalog and versioning¶
Kedro manages data access through a Data Catalog defined in conf/base/catalog.yml:
### Kedro catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

model_input_table:
  type: pandas.ParquetDataset
  filepath: data/03_primary/model_input_table.parquet
  versioned: true

regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  versioned: true
When versioned: true is set, Kedro automatically creates timestamped subdirectories for each save operation:
### Kedro versioned dataset structure
data/06_models/regressor.pickle/
├── 2024-01-15T10.30.45.123Z/
│   └── regressor.pickle
└── 2024-01-16T14.22.33.456Z/
    └── regressor.pickle
The Data Catalog abstracts storage locations, allowing the same pipeline code to work with local files, S3, GCS, or other storage backends simply by changing configuration.
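For example, retargeting a dataset from local disk to S3 is purely a configuration change – the pipeline code that loads `model_input_table` is untouched (bucket name and credentials key below are placeholders):

```yaml
model_input_table:
  type: pandas.ParquetDataset
  # hypothetical bucket; node code stays exactly the same
  filepath: s3://my-bucket/data/03_primary/model_input_table.parquet
  credentials: dev_s3
```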
5.2.3.2. DataLad’s git-annex versioning¶
DataLad uses git-annex to handle large files while keeping them under full Git version control. Data is tracked at the file level, with content stored in the annex and only lightweight symlinks in the working tree[2]:
### DataLad
$ datalad download-url \
  --message "Download raw data" \
  https://example.com/companies.csv \
  -O 'data/raw/'
$ datalad status
$ datalad save -m "Process data" data/processed/
Every version of every file is tracked in the Git history, and the full provenance is available through standard Git commands:
### DataLad
$ git log --oneline data/processed/model_input.parquet
a1b2c3d Process data: filter outliers
d4e5f6g Initial data processing
How does versioning differ between the tools?
Kedro versioning:
Timestamp-based directory structure
Automatic versioning on each pipeline run
Versions stored as separate files (no deduplication)
Loading specific versions requires explicit specification
Works within single project scope
DataLad/git-annex versioning:
Content-addressed storage with deduplication
Full Git history for all changes
Can retrieve any historical version via Git checkout
Distributed storage across multiple remotes
Works across federated dataset hierarchies
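The practical consequence of the deduplication difference can be sketched with the standard library alone – timestamp-based versioning stores a copy for every save, while content-addressed storage keys each object by a hash of its content, so identical saves collapse into one object:

```python
# Illustration only: timestamp vs. content-addressed versioning.
import hashlib

saves = [b"model v1", b"model v1", b"model v2"]  # second save is unchanged

# Timestamp-style versioning (like Kedro's versioned datasets):
# every save gets its own directory, even for identical content.
timestamped = {f"2024-01-{15 + i}T10.00.00Z": blob
               for i, blob in enumerate(saves)}

# Content-addressed storage (like git-annex): the key is derived from
# the content itself, so identical content is stored only once.
annex = {hashlib.sha256(blob).hexdigest(): blob for blob in saves}

print(len(timestamped))  # 3 stored copies
print(len(annex))        # 2 unique objects
```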
5.2.5. Pipeline execution and provenance¶
Both tools support executing analyses and tracking what was done, but with different approaches.
5.2.5.1. Kedro pipeline execution¶
Kedro provides a complete pipeline execution framework with dependency resolution:
### Kedro: Run the full pipeline
$ kedro run
### Kedro: Run specific pipeline
$ kedro run --pipeline data_processing
### Kedro: Run with specific parameters
$ kedro run --params model_options.test_size=0.3
Kedro automatically resolves node dependencies and executes them in the correct order. The pipeline structure can be visualized with Kedro-Viz:
### Kedro: Visualize pipeline
$ pip install kedro-viz
$ kedro viz run
This opens an interactive web interface showing nodes, datasets, and their connections.
5.2.5.2. DataLad run¶
DataLad does not include a workflow manager, but provides datalad run (manual) to record any command execution with its inputs, outputs, and exact parameters[3]:
### DataLad: Record a computation
$ datalad run \
--message "Train classifier" \
--input data/processed/train.csv \
--output model/classifier.pkl \
python code/train.py
The command, its inputs, and outputs are recorded in the Git history as machine-readable provenance. This can be re-executed later:
### DataLad: Rerun a computation
$ datalad rerun <commit-hash>
For complex pipelines, DataLad integrates with workflow managers like Snakemake or Nextflow, while still capturing provenance for each step.
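Conceptually, each `datalad run` call captures a machine-readable record of the command, its declared inputs and outputs, and its exit status, and embeds it in the commit – that record is what makes `datalad rerun` possible. The sketch below mimics the idea with the standard library (this is a simplification, not DataLad's actual record schema):

```python
# Simplified sketch of command-level provenance capture, in the
# spirit of `datalad run` -- NOT DataLad's real record format.
import json
import subprocess
import sys

cmd = [sys.executable, "-c", "print('training...')"]
result = subprocess.run(cmd, capture_output=True, text=True)

record = {
    "cmd": cmd,
    "inputs": ["data/processed/train.csv"],   # declared by the user
    "outputs": ["model/classifier.pkl"],
    "exit": result.returncode,
}
# DataLad stores such a record in the commit message itself,
# tying the provenance to the exact state of the dataset.
print(json.dumps(record, indent=2))
```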
Workflow management comparison
Kedro:
Built-in workflow management
Node-level dependency resolution
Parallel execution support
Interactive visualization (Kedro-Viz)
Deployment to Airflow, Prefect, etc.
DataLad:
Command-level provenance tracking
Integrates with external workflow managers
Full rerun capability
CLI-focused interface
Provenance embedded in Git history
5.2.6. Configuration management¶
Both tools separate configuration from code, but in different ways.
5.2.6.1. Kedro configuration¶
Kedro uses a hierarchical configuration system with conf/base/ for shared settings and conf/local/ for environment-specific overrides:
### conf/base/parameters.yml
model_options:
  test_size: 0.2
  random_state: 42
  features:
    - feature_a
    - feature_b
Parameters are accessed in nodes via dependency injection:
### Kedro node
def train_model(data, params: dict):
    test_size = params["model_options"]["test_size"]
    # ...
5.2.6.2. DataLad configuration¶
DataLad itself doesn’t prescribe a configuration structure, but the YODA principles recommend separating configuration into clearly identifiable files. Configuration can be tracked in the dataset alongside code:
### DataLad: Track configuration
$ datalad save -m "Update model parameters" code/config.yml
For container-based workflows, configuration is often baked into the container definition or passed as command-line arguments, both of which are captured by datalad run.
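Because the configuration file is an ordinary tracked file, analysis scripts can read it directly, and declaring it as an `--input` to `datalad run` pins the exact version used in the provenance record. A minimal stdlib sketch (file name and keys are hypothetical):

```python
# Read parameters from a version-controlled config file.
# The file is created here only to keep the sketch self-contained.
import json
from pathlib import Path

cfg_path = Path("config.json")
cfg_path.write_text(json.dumps({"test_size": 0.2, "random_state": 42}))

params = json.loads(cfg_path.read_text())
print(params["test_size"])  # 0.2
```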
5.2.7. Summary¶
DataLad and Kedro share similar goals of enabling reproducible, modular data science, but they approach the problem from different angles. The choice between them depends on your specific needs, existing infrastructure, and team preferences.
| Feature | Kedro | DataLad/YODA |
|---|---|---|
| Language | Python-specific | Language-agnostic |
| Primary focus | Data engineering pipelines | Research data management |
| Modularity | Modular pipelines (Python) | Subdatasets (Git) |
| Scope | Single project | Federated multi-project |
| Data versioning | Timestamp directories | git-annex (content-addressed) |
| Sharing model | Python packages | Git repositories |
| Workflow management | Built-in | External (Snakemake, etc.) |
| Visualization | Kedro-Viz (web UI) | CLI / external tools |
| Storage backends | Data Catalog abstraction | git-annex special remotes |
| Provenance | Pipeline structure | Git history + run records |
| Learning curve | Medium (Python patterns) | Medium (Git/git-annex concepts) |
5.2.8. When to use which tool¶
Consider Kedro if you:
Work primarily in Python
Need built-in workflow management and visualization
Want standardized project templates for data science teams
Focus on single-project data pipelines
Value integration with ML experiment tracking tools
Prefer configuration-driven data access abstraction
Consider DataLad/YODA if you:
Work across multiple programming languages
Need to manage very large datasets (terabytes+)
Want federated composition across multiple projects
Require decentralized data sharing
Need fine-grained provenance for every file
Work in research environments with diverse tools
5.2.9. Using them together¶
Kedro and DataLad can be complementary. A powerful pattern is to use Kedro for pipeline execution within a DataLad dataset for version control and sharing.
Data versioning guidance
When combining Kedro and DataLad, DataLad handles all data versioning via git-annex.
Do not enable Kedro’s versioned: true in the Data Catalog – Kedro’s timestamp-based
versioning creates duplicate copies of each file version, which conflicts with DataLad’s
content-addressed storage and deduplication.
Conveniently, Kedro’s Data Catalog is designed to make this easy: versioning is off by
default, so simply omit the versioned flag and let DataLad track file versions
through the Git history.
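A catalog entry in the combined setup therefore looks like any unversioned entry:

```yaml
regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  # no `versioned: true` -- DataLad/git-annex tracks versions instead
```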
The following walkthrough demonstrates this combination using a minimal Kedro project.
5.2.9.1. Step 1: Create a Kedro project¶
First, install Kedro and create a new project using kedro new:
### Install Kedro (in your environment)
$ pip install kedro pandas
### Create a new Kedro project (non-interactive)
$ kedro new --name=kedro-datalad-demo --tools=none --example=no --telemetry=no
$ cd kedro-datalad-demo
This generates the standard Kedro project structure including pyproject.toml,
settings.py, pipeline_registry.py, catalog.yml, .gitignore, and other
scaffolding files.
5.2.9.2. Step 2: Initialize DataLad inside the Kedro project¶
Turn the Kedro project into a DataLad dataset.
We use --force because the directory already exists, and text2git to store
text files directly in Git:
### Initialize DataLad in the existing Kedro project
$ datalad create --force -c text2git .
Note
We skip the -c yoda configuration here because Kedro’s kedro new
already provides a project structure with sensible defaults.
5.2.9.3. Step 3: Add demo pipeline, save, and run with provenance¶
Add a simple demo pipeline by modifying the generated pipeline_registry.py:
$ cat > src/kedro_datalad_demo/pipeline_registry.py << 'EOF'
from kedro.pipeline import Pipeline, node


def greet(name: str) -> str:
    greeting = f"Hello, {name}!"
    # Write output for provenance tracking
    with open("output.txt", "w") as f:
        f.write(greeting)
    return greeting


def register_pipelines():
    return {
        "__default__": Pipeline([
            node(greet, inputs="params:name", outputs="greeting")
        ])
    }
EOF
Set the pipeline parameter (kedro new generates an empty parameters.yml):
$ cat > conf/base/parameters.yml << 'EOF'
name: DataLad
EOF
Save the Kedro project with the demo pipeline to the DataLad dataset:
### Track Kedro project structure
$ datalad save -m "Initialize Kedro project with demo pipeline"
Run the Kedro pipeline with DataLad provenance tracking:
### Optional: Disable telemetry for cleaner output
$ export KEDRO_DISABLE_TELEMETRY=true
### Run Kedro pipeline with DataLad provenance
$ datalad run \
--message "Execute Kedro demo pipeline" \
--output "output.txt" \
kedro run
The pipeline execution is now recorded in the Git history with full provenance. You can verify that the output was created and tracked:
$ cat output.txt
Hello, DataLad!
$ git log --oneline -1
a1b2c3d Execute Kedro demo pipeline
5.2.9.4. Step 4: Add data dependencies as subdatasets¶
For data dependencies, you can use DataLad subdatasets while configuring Kedro’s Data Catalog to point to those locations:
### Add shared data as subdataset
$ datalad clone -d . \
https://github.com/datalad-handbook/iris_data \
data/01_raw/iris
Then in conf/base/catalog.yml:
iris_data:
  type: pandas.CSVDataset
  filepath: data/01_raw/iris/iris.csv

Note that the file content must be retrieved (e.g., via datalad get data/01_raw/iris/iris.csv) before Kedro can read it.
5.2.9.5. Benefits of this combination¶
This approach provides:
Kedro’s workflow management and visualization
DataLad’s version control and provenance for every pipeline run
YODA’s federated composition for data dependencies
Git-based sharing of the complete reproducible analysis
Ability to retrieve any historical state with datalad get
Share the complete project:
### Share to a sibling (after creating one)
$ datalad push --to origin
Anyone cloning this dataset will get the exact versions of all subdatasets and can reproduce the analysis.
Testing these examples
A test script that validates all commands in this section is available at
docs/beyond_basics/kedro-examples-test.sh in the handbook repository.
Run it with ./kedro-examples-test.sh --cleanup after installing prerequisites
(pip install datalad kedro pandas).
Footnotes