This document describes how to work with datasets. The basic idea is to survey how others handle their data and implement the same steps in our pipeline.

Our approach is as follows: train from scratch on as large a mix of datasets as possible, then fine-tune on a benchmark dataset and evaluate on it.

The sections below cover the datasets used for training and evaluation and how each is used.

## Unify datasets

Many decisions remain to be made.

Convert each dataset from its native format to `mne.io.Raw`, resample to a common frequency, and select only the relevant channels.

Common frequency: TODO.

Selected common channels: TODO. Missing channels will be filled with 0.

Normalize values to a common interval.

Bandpass filtering: what should the parameters be?
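
A minimal sketch of these unification steps on a plain array (skipping the `mne.io.Raw` conversion; in MNE the same steps map roughly onto `raw.resample` and `raw.filter`). The 250 Hz frequency, the three-channel list, the 1–40 Hz band, and the [-1, 1] interval are placeholders, since those choices are still TODO:

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

# Placeholder choices: the common frequency, channel list, band edges and
# normalization interval are all still open questions (TODO above).
COMMON_SFREQ = 250.0
COMMON_CHANNELS = ["C3", "Cz", "C4"]

def unify(data, ch_names, sfreq, band=(1.0, 40.0)):
    """Bring one recording (n_channels x n_samples) into the common format."""
    # 1. Resample to the common frequency.
    data = resample_poly(data, int(COMMON_SFREQ), int(sfreq), axis=1)
    # 2. Keep only the common channels; missing channels are filled with 0.
    out = np.zeros((len(COMMON_CHANNELS), data.shape[1]))
    for i, ch in enumerate(COMMON_CHANNELS):
        if ch in ch_names:
            out[i] = data[ch_names.index(ch)]
    # 3. Bandpass filter (zero-phase Butterworth).
    b, a = butter(4, band, btype="bandpass", fs=COMMON_SFREQ)
    out = filtfilt(b, a, out, axis=1)
    # 4. Scale each channel into [-1, 1]; flat (all-zero) channels stay 0.
    peak = np.abs(out).max(axis=1, keepdims=True)
    return out / np.where(peak == 0.0, 1.0, peak)
```

Zero-filling missing channels before filtering is harmless here because the filter maps zeros to zeros.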

## How others do it

### [[Augmentation methods#EEG Data Augmentation Method for Identity Recognition Based on Spatial–Temporal Generating Adversarial Network]]

This paper uses a GAN to augment EEG data and trains an identity-recognition model on it.

More info on that in [[Augmentation methods]].

They used BCI Competition IV dataset 2a.

This dataset records EEG data during motor imagery tasks involving left hand, right hand, both feet, and tongue movements performed by 9 subjects. Each subject performed 72 trials of each of the 4 tasks during a single experiment, and each motor imagery trial lasted for 3 s. The EEG data were recorded using 22 Ag/AgCl electrodes at a sampling frequency of 250 Hz and were bandpass filtered between 0.5 and 100 Hz.

Furthermore, the authors applied a 50 Hz notch filter to suppress line noise and excluded the three channels recording eye movements.

For each individual’s EEG data, a third-order Butterworth IIR filter was applied in the 4–40 Hz frequency band to reduce the influence of eye movements.

Subsequently, the data were min-max normalized to the range [0, 1].
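
Their preprocessing chain is simple enough to sketch directly (the function name is mine; the paper does not say whether filtering was forward-only, so the zero-phase `filtfilt` is an assumption):

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

FS = 250.0  # sampling rate of BCI Competition IV dataset 2a

def preprocess(data):
    """data: (n_channels, n_samples) EEG array sampled at 250 Hz."""
    # 50 Hz notch filter to suppress power-line noise.
    b, a = iirnotch(50.0, 30.0, fs=FS)
    data = filtfilt(b, a, data, axis=1)
    # Third-order Butterworth bandpass, 4-40 Hz.
    b, a = butter(3, (4.0, 40.0), btype="bandpass", fs=FS)
    data = filtfilt(b, a, data, axis=1)
    # Min-max normalize each channel to [0, 1].
    lo = data.min(axis=1, keepdims=True)
    hi = data.max(axis=1, keepdims=True)
    span = np.where(hi - lo == 0.0, 1.0, hi - lo)
    return (data - lo) / span
```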

**The dataset was divided into training and testing sets in a 4:1 ratio, with each individual’s training set consisting of 864 samples.**

### [[Augmentation methods#Generative Adversarial Networks-Based Data Augmentation for Brain–Computer Interface(2020)]]

**Evaluation using their own dataset**:

Leave-one-subject-out: train on all subjects except one, then test on that one.

Adaptive training: train on all subjects plus half of one subject's data, then test on the second half of that subject's data.
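
Both schemes are just index bookkeeping; a sketch with hypothetical function names:

```python
def loso_splits(subjects):
    """Leave-one-subject-out: train on everyone but the held-out subject."""
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

def adaptive_split(other_subjects_data, subject_trials):
    """Adaptive training: all other subjects' data plus the first half of the
    target subject's trials; the second half is the test set."""
    half = len(subject_trials) // 2
    return other_subjects_data + subject_trials[:half], subject_trials[half:]
```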

**Evaluation on BCI Competition III dataset IVa**:

Down-sampled to 100 Hz. Used only to test generalizability, via the adaptive-training scheme with and without augmented data.

### [[Augmentation methods#Augmenting The Size of EEG datasets Using Generative Adversarial Networks (2018)]]

* Evaluation using 5-fold cross-validation on the PhysioNet dataset against autoencoders and VAEs, with reconstruction error as the metric.
* Assessing the impact of the RGAN with different classification models: evaluating classification accuracy with a deep feed-forward NN, an SVM, and a random forest.
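
The 5-fold evaluation above amounts to partitioning trial indices; a minimal sketch (the function name is mine, and in practice `sklearn.model_selection.KFold` does the same thing):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]
```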

### [[Augmentation methods#Data augmentation strategies for EEG-based motor imagery decoding (2022)]]

Used datasets:

* https://academic.oup.com/gigascience/article/6/7/gix034/3796323
* https://www.nature.com/articles/sdata2018211

For now I don't know where to get the raw data for these datasets.

Data processing:

* Bandpass filter 1-40 Hz
* Baseline correction using the first 200 ms pre-cue: subtract the average of the EEG signal before the cue.
* Artifact correction for ocular (EOG) and muscular (EMG) artifacts, with slightly different parameters for each dataset.
* Re-referencing to the common average to improve the signal-to-noise ratio: the signal at each channel is re-referenced to the average signal across all electrodes.
* Used [[Papers#Autoreject Automated artifact rejection for MEG and EEG data]]
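
Two of the listed steps, baseline correction and average re-referencing, reduce to mean subtractions; a sketch assuming a 250 Hz rate (the two datasets actually differ, so this is a placeholder):

```python
import numpy as np

FS = 250.0  # assumed sampling rate; the two datasets differ

def baseline_and_rereference(epoch, pre_cue_ms=200):
    """epoch: (n_channels, n_samples), cue at sample int(FS * pre_cue_ms / 1000)."""
    n_pre = int(FS * pre_cue_ms / 1000)
    # Baseline correction: subtract each channel's pre-cue average.
    epoch = epoch - epoch[:, :n_pre].mean(axis=1, keepdims=True)
    # Average re-reference: subtract the mean across channels at every sample.
    return epoch - epoch.mean(axis=0, keepdims=True)
```

Note the order: re-referencing after baseline correction still leaves each channel's pre-cue mean at zero, since the subtracted cross-channel average itself has zero pre-cue mean.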

The dataset was split 70:12:18 into train:validation:test.
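
The split itself is simple index arithmetic; whether the paper shuffled or stratified per subject is not stated, so the seeded shuffle here is an assumption:

```python
import numpy as np

def split_indices(n, ratios=(0.70, 0.12, 0.18), seed=0):
    """Shuffle n sample indices and cut them into train/val/test parts."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))
    # The test set takes whatever remains, so the three parts cover everything.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```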