I'm training a binary classification model on a huge dataset stored in Parquet format. The data is far too large to fit into memory all at once. Currently I'm doing something like the code below, but I keep running into an out-of-memory error.
import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = sorted(glob.glob('data/*.parquet'))

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # this materializes the whole dataset as one pandas DataFrame in memory
X = df.drop(['label'], axis=1)
y = df['label']
# Split the data into training and testing sets
Is there a better way to do this without running out of memory?
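For example, would training incrementally, one Parquet file at a time, with a model that supports partial_fit (such as scikit-learn's SGDClassifier) be a reasonable direction? Below is a rough sketch of what I mean; the 0/1 class values are just an assumption about my 'label' column, and I'm not sure whether this or something like a Dask-based approach is the right way to go.

import glob

from fastparquet import ParquetFile
from sklearn.linear_model import SGDClassifier

files = sorted(glob.glob('data/*.parquet'))

# SGDClassifier is a linear classifier that supports partial_fit,
# so it can learn from one chunk at a time instead of the full dataset
model = SGDClassifier()
classes = [0, 1]  # assumption: the 'label' column holds 0/1 values

for path in files:
    chunk = ParquetFile(path).to_pandas()  # only one file in memory at a time
    X = chunk.drop(['label'], axis=1)
    y = chunk['label']
    # classes must be provided at least on the first partial_fit call
    model.partial_fit(X, y, classes=classes)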