
I have a Jupyter notebook in which I read files from my desktop and then run my analysis on them, resulting in a feature matrix (a pandas DataFrame). As my files are heavy, my laptop crashes if I read all the files (10+ GB) at once and run the analysis. Is there a way to perform my analysis in batches? For example, I want my Jupyter notebook to 1) read the first 10 files, 2) perform my analysis with the rest of the notebook, 3) save the matrix, and then re-run steps 1)-3) on the next 10 files, and so on.
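
The core step I think I am missing is splitting the list of file names into groups of 10 before reading anything; a minimal sketch of that idea (chunk_list is just an illustrative name, not something from my notebook):

def chunk_list(items, size=10):
    # Yield successive groups of `size` items from a list of file names
    for i in range(0, len(items), size):
        yield items[i:i + size]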

I cannot provide an exact MWE, but here is a rough sketch of what I want:

import os

# Set the directory paths for folder A
path_1_A = "/directory/for/path/1A/"
path_2_A = "/directory/for/path/2A/"
path_3_A = "/directory/for/path/3A/"

# Create empty lists to store data
list_1_A = [] 
list_2_A = [] 
list_3_A = [] 

# Iterate through files in the directories in batches of 10
for csv_file_1 in os.listdir(path_1_A):
    .
    .
    for csv_file_2 in os.listdir(path_2_A):
        # Get corresponding data for list_1_A, list_2_A, list_3_A
        .
        .
        .


# Set the directory paths for folder B
path_1_B = "/directory/for/path/1B/"
path_2_B = "/directory/for/path/2B/"
path_3_B = "/directory/for/path/3B/"

# Create empty lists to store data
list_1_B = [] 
list_2_B = [] 
list_3_B = [] 

# Iterate through files in the directories in batches of 10
for csv_file_1 in os.listdir(path_1_B):
    .
    .
    for csv_file_2 in os.listdir(path_2_B):
        # Store corresponding data in list_1_B, list_2_B, list_3_B
        .
        .
        .
# Perform analysis on the lists above for this batch
.
.  # 100+ cells of the notebook
.

print(feature_matrix)  # Store this feature_matrix as dataframe_batch_1

# Repeat the process above for the next 10 files from each folder, and so on.
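
Putting it together, the overall loop I imagine looks roughly like this (only a sketch under assumptions: analyze_batch is a placeholder for the 100+ analysis cells wrapped into a function, chunk_list is the helper above, and only path_1_A is shown; the other folders would be handled the same way):

import os
import pandas as pd

csv_files_A = sorted(os.listdir(path_1_A))

for batch_number, batch_files in enumerate(chunk_list(csv_files_A, 10), start=1):
    # 1) Read only the 10 files belonging to this batch
    dfs = [pd.read_csv(os.path.join(path_1_A, name)) for name in batch_files]
    # 2) Run the analysis on this batch (analyze_batch is a placeholder for my notebook cells)
    feature_matrix = analyze_batch(dfs)
    # 3) Save the batch result so memory can be freed before the next batch
    feature_matrix.to_csv(f"dataframe_batch_{batch_number}.csv", index=False)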

I will ultimately combine all the batch dataframes into one big dataframe.
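
For that last step I assume something like pandas.concat over the saved per-batch files would do (again only a sketch, with the file pattern matching the names used above):

import glob
import pandas as pd

# Stack every saved per-batch feature matrix into one big DataFrame
batch_frames = [pd.read_csv(f) for f in sorted(glob.glob("dataframe_batch_*.csv"))]
big_feature_matrix = pd.concat(batch_frames, ignore_index=True)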
