I have a Jupyter notebook in which I read files from my desktop and then run my analysis on them, resulting in a feature matrix (a pandas dataframe). Because the files are large (10+ GB in total), my laptop crashes if I read them all at once and run the analysis. Is there a way to perform my analysis in batches? For example, I want my Jupyter notebook to 1) read the first 10 files, 2) run my analysis from the rest of the notebook, 3) save the matrix, and then re-run steps 1)-3) on the next 10 files, and so on.
I can't provide an exact MWE, but here is roughly a sketch of what I want:
import os
import pandas as pd
# Set the directory paths for folder A
path_1_A = "/directory/for/path/1A/"
path_2_A = "/directory/for/path/2A/"
path_3_A = "/directory/for/path/3A/"
# Create empty lists to store data
list_1_A = []
list_2_A = []
list_3_A = []
# Iterate through files in the directories in batches of 10
for csv_file in os.listdir(path_1_A):
    # ...
for csv_file in os.listdir(path_2_A):
    # Get corresponding data for list_1_A, list_2_A, list_3_A
    # ...
# Set the directory paths for folder B
path_1_B = "/directory/for/path/1B/"
path_2_B = "/directory/for/path/2B/"
path_3_B = "/directory/for/path/3B/"
# Create empty lists to store data
list_1_B = []
list_2_B = []
list_3_B = []
# Iterate through files in the directories
for csv_file in os.listdir(path_1_B):
    # ...
for csv_file in os.listdir(path_2_B):
    # Store corresponding data in list_1_B, list_2_B, list_3_B
    # ...
# Perform analysis on the lists above for this batch
# ... (100+ cells of the notebook)
print(feature_matrix)  # Store this feature_matrix as dataframe_batch_1
# Repeat the process above for the next 10 files from each folder, and so on.
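In other words, I think what I am after is a structure roughly like the sketch below, where the whole analysis is wrapped inside a loop over batches of 10 files. This is only a sketch: only folder 1A is shown, build_feature_matrix is a hypothetical stand-in for the 100+ analysis cells, and the output file names are made up.

import os
import pandas as pd

def build_feature_matrix(dfs):
    # Hypothetical stand-in for the actual analysis (the 100+ notebook cells)
    return pd.concat(dfs, ignore_index=True)

path_1_A = "/directory/for/path/1A/"  # same path as above
batch_size = 10
csv_files = sorted(f for f in os.listdir(path_1_A) if f.endswith(".csv"))

for batch_num, start in enumerate(range(0, len(csv_files), batch_size), start=1):
    batch_files = csv_files[start:start + batch_size]
    # 1) read only this batch of files
    dfs = [pd.read_csv(os.path.join(path_1_A, name)) for name in batch_files]
    # 2) run the analysis on this batch only
    feature_matrix = build_feature_matrix(dfs)
    # 3) save the batch result and free memory before the next batch
    feature_matrix.to_csv(f"dataframe_batch_{batch_num}.csv", index=False)
    del dfs, feature_matrix

The idea (as I understand it) is that saving each batch's matrix to disk and deleting the in-memory objects should keep the memory footprint to roughly one batch at a time.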
I will ultimately combine all the batch dataframes into one big dataframe.
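For that final step, I am picturing something like the sketch below, assuming each batch was written out as dataframe_batch_<n>.csv as in the sketch above (the file pattern is hypothetical):

import glob
import pandas as pd

# Combine all saved per-batch feature matrices into one big dataframe
batch_paths = sorted(glob.glob("dataframe_batch_*.csv"))
big_feature_matrix = pd.concat((pd.read_csv(p) for p in batch_paths), ignore_index=True)
print(big_feature_matrix.shape)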