
I have a CSV file with a feature column (string) and multiple label columns (multi-label classification). The dataset is too big to fit into memory, so I have to load it with make_csv_dataset and a specified batch size. The issue is that I don't know how to tokenize the feature column without overloading memory. Can I implement a DataProvider that tokenizes the data on the fly during training? Then I could tokenize it batch by batch, so memory would not be an issue.
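
For reference, this is roughly the kind of pipeline I have in mind (the file path and column names below are placeholders, not my real schema); I am not sure whether mapping a `TextVectorization` layer over the batched dataset like this is the right way to tokenize on the fly:

```python
import tensorflow as tf

# Placeholder path and column names -- substitute the real CSV schema.
CSV_PATH = "data.csv"
FEATURE_COL = "text"
LABEL_COLS = ["label_a", "label_b", "label_c"]

# Stream the CSV in batches; the whole file is never loaded into memory.
raw_ds = tf.data.experimental.make_csv_dataset(
    CSV_PATH,
    batch_size=32,
    select_columns=[FEATURE_COL] + LABEL_COLS,
    num_epochs=1,
    shuffle=True,
)

# Build a vocabulary with one streaming pass over the text column only.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20_000, output_sequence_length=128
)
vectorizer.adapt(raw_ds.map(lambda batch: batch[FEATURE_COL]))

# Tokenize each batch as it is pulled during training.
def to_inputs_and_labels(batch):
    tokens = vectorizer(batch[FEATURE_COL])
    labels = tf.stack([batch[c] for c in LABEL_COLS], axis=-1)
    return tokens, tf.cast(labels, tf.float32)

train_ds = raw_ds.map(to_inputs_and_labels,
                      num_parallel_calls=tf.data.AUTOTUNE)
```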

pbartkow
  • Probably [this](https://stackoverflow.com/questions/31784011/scikit-learn-fitting-data-into-chunks-vs-fitting-it-all-at-once) can help. – meti Feb 02 '22 at 17:34
  • Hi pbartkow, could you please specify the problem you are facing with make_csv_dataset? –  Feb 16 '22 at 09:08

1 Answer


You can tokenize on the fly using PySpark (in Python) or sparklyr (in R). Please refer to this issue and this link to solve your problem.
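
For what it's worth, here is a minimal PySpark sketch of the idea (the file path and column name are placeholder assumptions, not from the question). Spark reads the CSV lazily, partition by partition, and the `Tokenizer` transform is applied as each partition is processed, so the whole dataset never has to fit in memory:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("tokenize-csv").getOrCreate()

# Assumed path and column name -- replace with the real ones.
# The read is lazy; rows are pulled in partitions as needed.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Tokenizer splits the string column into a list of lowercase tokens.
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokenized = tokenizer.transform(df)

tokenized.select("text", "tokens").show(5, truncate=False)
```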