
I am currently data wrangling on a very new project, and it is proving a challenge.

I have EEG data that has been preprocessed in eeglab in MATLAB, and I would like to load it into python to use it to train a classifier. I also have a .csv file with the subject IDs of each individual, along with a number (1, 2 or 3) corresponding to which third of the sample they are in.

Currently, I have the data saved as .mat files, one for each individual (104 in total), each containing an array shaped 64x2000x700 (64 channels, 2000 data points per 2 second segment (sampling frequency of 1000Hz), 700 segments). I would like to load each participant's data into the dataframe alongside their subject ID and classification score.
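
For scale, here is a rough back-of-the-envelope check of the memory footprint (a minimal sketch, assuming the arrays are stored as single-precision floats, which I have not verified):

n_subjects, n_channels, n_samples, n_segments = 104, 64, 2000, 700
bytes_per_file = n_channels * n_samples * n_segments * 4          # float32 = 4 bytes
print(f"per file:  {bytes_per_file / 1e6:.0f} MB")                # ~358 MB
print(f"all files: {n_subjects * bytes_per_file / 1e9:.1f} GB")   # ~37 GB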

I tried this:

import glob
import os

import pandas as pd
from scipy.io import loadmat

all_files = glob.glob(os.path.join(path, "*.mat"))

lang_class = pd.read_csv("TestLangLabels.csv")

df_dict = {}

for file in all_files:
    # Use the file name (without extension) as the key for each subject
    file_name = os.path.splitext(os.path.basename(file))[0]
    df_dict[file_name] = loadmat(file, appendmat=False)
But the files are so large that this maxes out my memory and doesn't complete.

Then I attempted to build one dataframe by looping over the files with pandas:


main_dataframe = pd.DataFrame(loadmat(all_files[0]))

for i in range(1, len(all_files)):
    data = loadmat(all_files[i])
    df = pd.DataFrame(data)
    main_dataframe = pd.concat([main_dataframe, df], axis=1)

At which point I got the error: ValueError: Data must be 1-dimensional

Is there a way of doing this that I am overlooking, or will downsampling be inevitable?

subjectID   Data          Class
AA123       64x2000x700   2

I believe that something like this could then be used as a test/train dataset for my model, but welcome any and all advice!
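
To make that concrete, here is a minimal sketch of the structure I am picturing, with each subject's array held as a single object in one row and merged with the labels CSV. I am assuming the .mat files are named after the subject IDs and that the EEG array sits under a key such as "data" (the real key depends on how eeglab exported it); note that this still keeps every array in RAM, which is exactly the limit I am hitting:

import glob
import os

import pandas as pd
from scipy.io import loadmat

lang_class = pd.read_csv("TestLangLabels.csv")   # columns: subjectID, Class

rows = []
for file in glob.glob(os.path.join(path, "*.mat")):
    subject_id = os.path.splitext(os.path.basename(file))[0]
    mat = loadmat(file, appendmat=False)
    rows.append({"subjectID": subject_id, "Data": mat["data"]})  # key "data" is a guess

df = pd.DataFrame(rows).merge(lang_class, on="subjectID")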

Thank you in advance.

M.A.
  • If your data is 64x2000x700 then something is off, because that's just 600 MB per file. Obviously 104 of these is quite a lot; what is your goal? Many classifiers (e.g. deep learning) don't need all the data loaded in one go, they are trained in batches – Ander Biguri May 31 '22 at 11:23
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community May 31 '22 at 12:33
  • @AnderBiguri Hi, thank you for your question. The participants are ranked based on their performance in a given task. My goal is to train a classifier (firstly I'd like to implement XGBoost, a gradient-boosted decision tree algorithm, and later down the line I would like to try a combination of CNN/SVM) to predict, based on EEG data, whether the participant would score in the top, middle, or bottom third of all participants. The files are ~300,000 KB each, likely because the original sampling frequency of 1000 Hz has been kept. I am very new to Python, so sorry for anything that is unclear! – M.A. May 31 '22 at 12:40
  • And does your classifier need all the data in RAM to train? Can't you just load it in chunks and update it with some gradient descent? CNNs are like that: Google does not train their classifiers on billions of images by loading them all at once; instead the data is loaded "on demand", when the algorithm needs it (see the loading sketch after these comments). I've never trained decision trees, so I'm not sure if they need all the data in one go, but I'd be surprised if they do. – Ander Biguri May 31 '22 at 12:52
  • @AnderBiguri it is quite possible that they do not need all the data, but I am very new to the field. Are there any resources you could point me towards with a tutorial on how to do that? Thank you for your reply! – M.A. May 31 '22 at 12:59
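
A minimal sketch of the on-demand loading idea from the comments above, yielding one subject at a time so that only a single file's array is ever in memory (again assuming the subject ID is the file name and that the array is stored under a key such as "data", which is a guess):

import glob
import os

import pandas as pd
from scipy.io import loadmat

labels = pd.read_csv("TestLangLabels.csv").set_index("subjectID")["Class"]

def iter_subjects(path):
    # Yield (subject_id, eeg_array, class_label) one subject at a time
    for file in sorted(glob.glob(os.path.join(path, "*.mat"))):
        subject_id = os.path.splitext(os.path.basename(file))[0]
        eeg = loadmat(file, appendmat=False)["data"]   # 64x2000x700; key is a guess
        yield subject_id, eeg, labels[subject_id]

for subject_id, eeg, label in iter_subjects(path):
    print(subject_id, eeg.shape, label)   # feed each subject to an incremental / batch trainer here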

1 Answer


Is there a reason you have such a high sampling rate? I don't believe I've heard a compelling reason to go over 512 Hz, and I normally take it down to 256 Hz. I don't know if it matters for ML, but most other approaches really don't need that. Going from 1000 Hz to 500 Hz or even 250 Hz might help.
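
A minimal sketch of that downsampling step, assuming the channels x samples x segments layout from the question and using scipy.signal.decimate, which applies an anti-aliasing low-pass filter before keeping every n-th sample:

import numpy as np
from scipy.signal import decimate

eeg = np.random.randn(64, 2000, 700)     # placeholder for one subject's 1000 Hz array

factor = 4                               # 1000 Hz -> 250 Hz
eeg_ds = decimate(eeg, factor, axis=1)   # filter + downsample along the time axis
print(eeg_ds.shape)                      # (64, 500, 700), a quarter of the memory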

Douwe