
I'm training a binary classification model on a huge dataset stored in Parquet format. The data is too large to fit into memory. Currently I am doing the following, but I'm running into an out-of-memory problem.

import glob

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile

files = sorted(glob.glob('data/*.parquet'))

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df = df.compute()  # this materializes the entire dataset in memory

X = df.drop(['label'], axis=1)
y = df['label']
# Split the data into training and testing sets

What is the best way to do this without running out of memory?

dgks0n

1 Answer


You don't need to load all of the data at once. Whether incremental training is possible depends on the classification algorithm you are using. In scikit-learn, every estimator that implements the partial_fit API is a candidate, for example SGDClassifier. If you are using TensorFlow, you can use tfio.experimental.IODataset to stream your Parquet data to the DNN you are training.
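
For the scikit-learn route, here is a minimal sketch: it reads the Parquet files one at a time with pandas.read_parquet and feeds each chunk to partial_fit, so only one file is ever in memory. It assumes the same data/*.parquet layout and label column as in your question, 0/1 labels, and uses a plain SGDClassifier as a stand-in for whichever incremental estimator you actually choose.

import glob

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

files = sorted(glob.glob('data/*.parquet'))
classes = np.array([0, 1])  # full label set, assumed binary 0/1

clf = SGDClassifier()  # any estimator with partial_fit works the same way

for path in files:
    # Load a single file so only one chunk is resident in memory at a time
    chunk = pd.read_parquet(path)
    X = chunk.drop(columns=['label'])
    y = chunk['label']
    # partial_fit updates the model incrementally; classes is supplied so
    # the first call already knows every possible label
    clf.partial_fit(X, y, classes=classes)

To evaluate without loading everything either, you could hold out some of the files as a test set and score the fitted model on them chunk by chunk in the same way.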