
I have a CSV file with a feature column (string) and multiple label columns (multi-label classification). The dataset is too big to fit into memory, so I have to load it with make_csv_dataset and a specified batch size. The issue is that I don't know how to tokenize the feature column without overloading memory. Can I implement a DataProvider that tokenizes the data on the fly during training? Then I could tokenize it batch by batch, so memory would not be an issue.
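
For reference, this is roughly the kind of pipeline I have in mind (the file path and column names below are placeholders, not my real schema); I am not sure whether mapping a `TextVectorization` layer over the batched dataset like this is the right way to tokenize on the fly:

```python
import tensorflow as tf

# Placeholder path and column names -- substitute the real CSV schema.
CSV_PATH = "data.csv"
FEATURE_COL = "text"
LABEL_COLS = ["label_a", "label_b", "label_c"]

# Stream the CSV in batches; the whole file is never loaded into memory.
raw_ds = tf.data.experimental.make_csv_dataset(
    CSV_PATH,
    batch_size=32,
    select_columns=[FEATURE_COL] + LABEL_COLS,
    num_epochs=1,
    shuffle=True,
)

# Build a vocabulary with one streaming pass over the text column only.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=20_000, output_sequence_length=128
)
vectorizer.adapt(raw_ds.map(lambda batch: batch[FEATURE_COL]))

# Tokenize each batch as it is pulled during training.
def to_inputs_and_labels(batch):
    tokens = vectorizer(batch[FEATURE_COL])
    labels = tf.stack([batch[c] for c in LABEL_COLS], axis=-1)
    return tokens, tf.cast(labels, tf.float32)

train_ds = raw_ds.map(to_inputs_and_labels,
                      num_parallel_calls=tf.data.AUTOTUNE)
```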

pbartkow
  • Probably [this](https://stackoverflow.com/questions/31784011/scikit-learn-fitting-data-into-chunks-vs-fitting-it-all-at-once) can help. – meti Feb 02 '22 at 17:34
  • Hi pbartkow, could you please specify the problem you are facing with make_csv_dataset? –  Feb 16 '22 at 09:08

1 Answer


You can tokenize on the fly using PySpark (in Python) or sparklyr (in R). Please refer to this issue and this link to solve your problem.
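
For what it's worth, here is a minimal PySpark sketch of the idea (the file path and column name are placeholder assumptions, not from the question). Spark reads the CSV lazily, partition by partition, and the `Tokenizer` transform is applied as each partition is processed, so the whole dataset never has to fit in memory:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.appName("tokenize-csv").getOrCreate()

# Assumed path and column name -- replace with the real ones.
# The read is lazy; rows are pulled in partitions as needed.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Tokenizer splits the string column into a list of lowercase tokens.
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
tokenized = tokenizer.transform(df)

tokenized.select("text", "tokens").show(5, truncate=False)
```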