I have a CSV file with a string feature column and multiple label columns (multi-label classification). The dataset is too big to fit into memory, so I have to load it with `make_csv_dataset` and a specified batch size. The issue is that I don't know how to tokenize the feature column without exhausting memory. Can I implement a DataProvider that tokenizes the data on the fly while training? Then it would be tokenized batch by batch and memory would not be an issue.
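
What I have in mind is roughly the sketch below: stream the CSV with `make_csv_dataset` and tokenize each batch inside a `map` call, for example with a `TextVectorization` layer adapted on a small sample. The column names (`text`, `label_a`, ...), vocabulary size, and sequence length are placeholders, not my real schema.

```python
import tensorflow as tf

# Placeholder label column names -- my real CSV has different ones.
LABEL_COLUMNS = ["label_a", "label_b", "label_c"]

# Stream the CSV in batches; nothing is loaded into memory up front.
dataset = tf.data.experimental.make_csv_dataset(
    "data.csv",
    batch_size=32,
    num_epochs=1,
    shuffle=True,
)

# Build a vocabulary from a sample of batches (or use a fixed vocabulary).
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_sequence_length=128,
)
sample_text = dataset.map(lambda row: row["text"]).take(100)
vectorizer.adapt(sample_text)

def split_and_tokenize(row):
    # Tokenize the string feature and stack the label columns into one tensor.
    tokens = vectorizer(row["text"])
    labels = tf.stack(
        [tf.cast(row[c], tf.float32) for c in LABEL_COLUMNS], axis=-1
    )
    return tokens, labels

# Tokenization now happens batch by batch as the model consumes the data.
train_ds = (
    dataset
    .map(split_and_tokenize, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
```

Is something like this the right approach, or is there a better way to do on-the-fly tokenization with `make_csv_dataset`?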
- Probably [this](https://stackoverflow.com/questions/31784011/scikit-learn-fitting-data-into-chunks-vs-fitting-it-all-at-once) can help. – meti Feb 02 '22 at 17:34
- Hi pbartkow, could you please specify the problem you are facing with `make_csv_dataset`? – Feb 16 '22 at 09:08