Started working on pipeline implementation

This commit is contained in:
Juraj Novosad
2025-07-07 12:15:26 +02:00
parent 82abe6fb6c
commit b326db6b5d
2 changed files with 70 additions and 0 deletions

Experiment pipeline.md Normal file

@@ -0,0 +1,70 @@
This document explains the implementation pipeline for conducting experiments.
It describes the architectural decisions made during the development of the experiment pipeline.
The implementation will be in the Python programming language, backed by PyTorch for machine learning. The development environment will be managed with the `uv` package manager.
## How to run and configure each experiment
From the user's point of view, the software should be controlled by some form of configuration that is easy to read, deterministic, and makes experiments easy to reproduce. All too often, machine learning experiments are simply hard-coded in scripts, which makes it hard to keep track of the algorithms and constants used, since they tend to be scattered all over the code.
In my approach, all algorithms and constants will be specified in a single configuration file. This specification file describes each step of the pipeline in sequence, i.e. what will be done with the data at each stage.
The architectural decision was to use the `yaml` format for the specification file. This format is easy to read and supports nesting and comments.
Each action on the data will be defined as an object in the `stages` list of the `yaml` configuration; from now on I will refer to it as a `step`. A step can load data, select channels, train a classifier, or apply and evaluate a classifier. Each `step` will have common parameters telling the main program where to find its module, as well as its own specific parameters according to the module's needs.
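As a rough illustration, the common step fields could map onto a small data structure on the runner side. The class name, field names, and defaults below are only a sketch, not part of the configuration format itself:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StepConfig:
    """Common parameters shared by every pipeline step (illustrative sketch)."""
    name: str                                # Human-readable step name
    module_path: str                         # Script implementing the step
    type: str = "inference"                  # "train" or "inference"
    input_stream: Optional[str] = None       # Name of the stream consumed by the step
    output_stream: Optional[str] = None      # Name of the stream produced by the step
    module_params: dict[str, Any] = field(default_factory=dict)  # Module-specific parameters
```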
The `general` object specifies pipeline-wide settings.
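For example, the `general` block could be mapped onto Python's standard `logging` module roughly like this; the helper name and defaults are assumptions for illustration, not a final design:

```python
import logging
from pathlib import Path


def configure_logging(general: dict) -> None:
    """Translate the `general` section into Python standard logging (sketch)."""
    handlers: list[logging.Handler] = []
    target = general.get("logging", "console")
    if target in ("console", "both"):
        handlers.append(logging.StreamHandler())
    if target in ("file", "both"):
        log_file = Path(general.get("log_file", "logs/pipeline.log"))
        log_file.parent.mkdir(parents=True, exist_ok=True)  # Make sure the log directory exists
        handlers.append(logging.FileHandler(log_file))
    logging.basicConfig(
        level=getattr(logging, general.get("log_level", "INFO")),
        handlers=handlers,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
```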
Another important design feature is the ability to control the flow of data between steps: each step names the stream it reads from (`input_stream`) and the stream it writes to (`output_stream`), so the same intermediate result can be reused by several later steps.
Let's look at an example:
```yaml
# Example configuration file for a data processing pipeline
# General configuration for running the pipeline
general:
  logging: "console"            # Options: "console", "file", "both"
  log_level: "INFO"             # Options: "DEBUG", "INFO", "WARNING"
  log_file: "logs/pipeline.log" # Path to the log file if logging to a file
stages:
  - name: "Load_datasets"                 # Step named Load_datasets
    output_stream: "input_dataset"        # Streams transfer data between steps, something like artifacts in GitLab CI
    module_path: "EEG_preprocessing_modules/data_loader.py" # Where to find the module
    module_params:                        # Module-specific parameters
      datasets:
        - path: "data/dataset1.csv"
          name: "dataset1"
        - path: "data/dataset2.csv"
          name: "dataset2"
      action: "merge"
  - name: "Process_data"
    input_stream: "input_dataset"
    output_stream: "processed_dataset"
    module_path: "EEG_preprocessing_modules/preprocessing.py"
    module_params:
      select_channels:
        - "channel1"
        - "channel2"
      filter_frequency: 0.5
      resample_rate: 100
  - name: "Train_augment_model"
    type: "train"
    input_stream: "processed_dataset"
    module_path: "models_augment/GAN/main.py"
    module_params:
      noise_level: 0.01
      save_path: "models_augment/GAN/model.pth"
  - name: "Augment_data"
    type: "inference"                     # Inference is the default, but can be specified explicitly
    input_stream: "processed_dataset"
    output_stream: "augmented_dataset"
    module_path: "models_augment/GAN/main.py"
    module_params:                        # Module-specific parameters, passed to the module script as a dictionary
      noise_level: 0.01
      model_weights: "models_augment/GAN/model.pth"
```
The example shows a simple setup with the following steps:
* Load the datasets and merge them into a single data stream
* Select a few channels, filter them, and resample them to the configured rate
* Train a GAN-based augmentation model on the processed data
* Use the trained model to augment the processed data
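To make the intended execution model concrete, here is a minimal sketch of how a runner might consume such a configuration: parse the YAML, apply the `general` settings, then execute each stage in order, loading the module dynamically from `module_path` and moving data between steps through named streams. The assumed per-module `run(data, params, mode)` contract and the overall structure are illustrative assumptions, not a final design:

```python
import importlib.util
import logging
from typing import Any

import yaml  # PyYAML


def load_module(path: str):
    """Dynamically import a step module from its file path."""
    spec = importlib.util.spec_from_file_location("pipeline_step", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_pipeline(config_path: str) -> None:
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)

    # Pipeline-wide settings come from the `general` object
    # (full logging setup as sketched earlier is omitted here).
    logger = logging.getLogger("pipeline")
    logger.setLevel(config.get("general", {}).get("log_level", "INFO"))

    # Streams hold intermediate results and transfer data between steps.
    streams: dict[str, Any] = {}

    for stage in config["stages"]:
        logger.info("Running stage: %s", stage["name"])
        module = load_module(stage["module_path"])

        data = streams.get(stage.get("input_stream"))
        # Assumed contract: each step module exposes a run(data, params, mode) function.
        result = module.run(
            data,
            stage.get("module_params", {}),
            mode=stage.get("type", "inference"),
        )

        if "output_stream" in stage:
            streams[stage["output_stream"]] = result


if __name__ == "__main__":
    run_pipeline("config.yaml")
```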
