Started working on pipeline implementation

This commit is contained in:
Juraj Novosad
2025-07-07 12:15:26 +02:00
parent 82abe6fb6c
commit b326db6b5d
2 changed files with 70 additions and 0 deletions

Experiment pipeline.md Normal file

@@ -0,0 +1,70 @@
This document explains the implementation pipeline for conducting experiments.
It describes the architectural decisions made during the development of the experiment pipeline.
The implementation will be in the Python programming language, backed by PyTorch for machine learning. The development environment will be managed with the `uv` package manager.
## How to run and configure each experiment
From the user's point of view, the software should be controlled by some form of configuration that is easy to read, deterministic, and makes experiments easy to reproduce. All too often, machine learning experiments are simply hard-coded in scripts, which makes it hard to keep track of the algorithms and constants used, since they tend to be scattered all over the code.
In my approach, all algorithms and constants will be specified in a single configuration file. This specification file describes each step of the pipeline in sequence, i.e. what will be done with the data at each stage.
The architectural decision was to use the `yaml` format for the specification file. This format is easy to read and supports nesting and comments.
Each action on the data will be defined as an object in the `stages` list of the `yaml` configuration; from now on I will refer to it as a `step`. A step can load data, select channels, train a classifier, or apply and evaluate a classifier. Each `step` will have common parameters telling the main program where to find its module, as well as its own specific parameters according to the module's needs.
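As a rough illustration, the common step fields could map onto a small data structure on the runner side. The class name, field names, and defaults below are only a sketch, not part of the configuration format itself:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StepConfig:
    """Common parameters shared by every pipeline step (illustrative sketch)."""
    name: str                                # Human-readable step name
    module_path: str                         # Script implementing the step
    type: str = "inference"                  # "train" or "inference"
    input_stream: Optional[str] = None       # Name of the stream consumed by the step
    output_stream: Optional[str] = None      # Name of the stream produced by the step
    module_params: dict[str, Any] = field(default_factory=dict)  # Module-specific parameters
```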
The `general` object specifies pipeline-wide settings.
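For example, the `general` block could be mapped onto Python's standard `logging` module roughly like this; the helper name and defaults are assumptions for illustration, not a final design:

```python
import logging
from pathlib import Path


def configure_logging(general: dict) -> None:
    """Translate the `general` section into Python standard logging (sketch)."""
    handlers: list[logging.Handler] = []
    target = general.get("logging", "console")
    if target in ("console", "both"):
        handlers.append(logging.StreamHandler())
    if target in ("file", "both"):
        log_file = Path(general.get("log_file", "logs/pipeline.log"))
        log_file.parent.mkdir(parents=True, exist_ok=True)  # Make sure the log directory exists
        handlers.append(logging.FileHandler(log_file))
    logging.basicConfig(
        level=getattr(logging, general.get("log_level", "INFO")),
        handlers=handlers,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
```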
Another important design feature is the ability to control the flow of data between steps: each step names the stream it reads from (`input_stream`) and the stream it writes to (`output_stream`), so the same intermediate result can be reused by several later steps.
Let's look at an example:
```yaml
# Example configuration file for a data processing pipeline
# General configuration for running the pipeline
general:
  logging: "console"            # Options: "console", "file", "both"
  log_level: "INFO"             # Options: "DEBUG", "INFO", "WARNING"
  log_file: "logs/pipeline.log" # Path to the log file if logging to a file
stages:
  - name: "Load_datasets"                 # Step named Load_datasets
    output_stream: "input_dataset"        # Streams transfer data between steps, something like artifacts in GitLab CI
    module_path: "EEG_preprocessing_modules/data_loader.py" # Where to find the module
    module_params:                        # Module-specific parameters
      datasets:
        - path: "data/dataset1.csv"
          name: "dataset1"
        - path: "data/dataset2.csv"
          name: "dataset2"
      action: "merge"
  - name: "Process_data"
    input_stream: "input_dataset"
    output_stream: "processed_dataset"
    module_path: "EEG_preprocessing_modules/preprocessing.py"
    module_params:
      select_channels:
        - "channel1"
        - "channel2"
      filter_frequency: 0.5
      resample_rate: 100
  - name: "Train_augment_model"
    type: "train"
    input_stream: "processed_dataset"
    module_path: "models_augment/GAN/main.py"
    module_params:
      noise_level: 0.01
      save_path: "models_augment/GAN/model.pth"
  - name: "Augment_data"
    type: "inference"                     # Inference is the default, but can be specified explicitly
    input_stream: "processed_dataset"
    output_stream: "augmented_dataset"
    module_path: "models_augment/GAN/main.py"
    module_params:                        # Module-specific parameters, passed to the module script as a dictionary
      noise_level: 0.01
      model_weights: "models_augment/GAN/model.pth"
```
The example shows a simple setup with the following steps:
* Load the datasets and merge them into a single data stream
* Select a few channels, filter them, and resample them to the configured rate
* Train a GAN-based augmentation model on the processed data
* Use the trained model to augment the processed data
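To make the intended execution model concrete, here is a minimal sketch of how a runner might consume such a configuration: parse the YAML, apply the `general` settings, then execute each stage in order, loading the module dynamically from `module_path` and moving data between steps through named streams. The assumed per-module `run(data, params, mode)` contract and the overall structure are illustrative assumptions, not a final design:

```python
import importlib.util
import logging
from typing import Any

import yaml  # PyYAML


def load_module(path: str):
    """Dynamically import a step module from its file path."""
    spec = importlib.util.spec_from_file_location("pipeline_step", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_pipeline(config_path: str) -> None:
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    # Pipeline-wide settings come from the `general` object
    # (full logging setup as sketched earlier is omitted here).
    logger = logging.getLogger("pipeline")
    logger.setLevel(config.get("general", {}).get("log_level", "INFO"))

    # Streams hold intermediate results and transfer data between steps.
    streams: dict[str, Any] = {}

    for stage in config["stages"]:
        logger.info("Running stage: %s", stage["name"])
        module = load_module(stage["module_path"])

        data = streams.get(stage.get("input_stream"))
        # Assumed contract: each step module exposes a run(data, params, mode) function.
        result = module.run(
            data,
            stage.get("module_params", {}),
            mode=stage.get("type", "inference"),
        )

        if "output_stream" in stage:
            streams[stage["output_stream"]] = result


if __name__ == "__main__":
    run_pipeline("config.yaml")
```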
