I have a dataset with 45 million rows. I have three GPUs with 6 GB of RAM each, and I am trying to train a language model on this data.

For that, I am trying to load the data as a fastai DataBunch, but this step always fails with an out-of-memory error:

data_lm = TextLMDataBunch.from_df('./', train_df=df_trn,
                                  valid_df=df_val, bs=10)

How do I handle this issue?

1 Answer


When you use this function, your whole DataFrame is loaded into memory. Since your DataFrame is very large, this is what causes the memory error. Fastai handles tokenization in chunks, so you should still be able to tokenize your text.

Here are two things you should try:

  • Add a chunksize argument (the default value is 10,000) to your TextLMDataBunch.from_df call, so that the tokenization process needs less memory.

  • If that is not enough, I would suggest not loading your whole DataFrame into memory. Unfortunately, even TextLMDataBunch.from_folder loads the full DataFrame and passes it to TextLMDataBunch.from_df, so you might have to create your own DataBunch constructor. Feel free to comment if you need help with that.
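As a rough illustration of why chunking helps (a minimal sketch in plain Python, not fastai's actual implementation), reading a file a fixed number of rows at a time keeps only one chunk in memory, which is essentially the idea behind the chunksize argument:

```python
import csv
import io

def iter_chunks(file_obj, chunksize):
    """Yield lists of up to `chunksize` rows, so only one chunk
    lives in memory at a time (mirrors the idea behind fastai's
    chunksize argument to TextLMDataBunch.from_df)."""
    reader = csv.reader(file_obj)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:  # leftover rows smaller than a full chunk
        yield chunk

# Toy data standing in for the 45M-row corpus.
data = io.StringIO("\n".join(f"text {i}" for i in range(25)))
sizes = [len(c) for c in iter_chunks(data, 10)]
print(sizes)  # two full chunks of 10, then the 5 leftover rows
```

Peak memory here is one chunk of rows rather than the whole file, which is why lowering chunksize reduces memory pressure during tokenization.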

Statistic Dean
  • Yes. I need help on that. Do you have a reference? – Jerry George Mar 04 '19 at 15:58
  • Modifying the chunksize was not enough? Could you monitor your memory usage while you're doing this? Do you run out of memory while reading the csv, or during the processing of the csv? – Statistic Dean Mar 04 '19 at 16:00
  • Another solution that I can think of (I don't know if it is possible for you) is to create the databunch on another machine (with more memory), to save it using its save method, and to load it back on your actual machine. – Statistic Dean Mar 04 '19 at 16:02
  • I am having the same issue. I have been considering trying to implement a pytorch iterable dataset somehow, but I am having issues working this out. Any help would be greatly appreciated. For a corpus that doesn't fit into memory, I am wondering how one can manage to split the data randomly into training and test. – user3225087 Mar 14 '20 at 05:02
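The save-elsewhere/load-here workflow suggested in the comments can be sketched generically (a hedged illustration using pickle on a stand-in object, since fastai's own DataBunch save/load helpers vary between versions): preprocess once on the machine with more memory, serialize the result, and deserialize it on the training machine instead of re-tokenizing.

```python
import os
import pickle
import tempfile

# On the large-memory machine: pretend this dict is the fully
# processed data (with fastai you would use the DataBunch's save
# method at this point, as suggested above).
processed = {"vocab": ["xxunk", "the", "model"], "ids": [[1, 2], [2, 1]]}

path = os.path.join(tempfile.mkdtemp(), "data_lm.pkl")
with open(path, "wb") as f:
    pickle.dump(processed, f)

# On the training machine: load the already-processed artifact
# instead of rebuilding it from the 45M-row DataFrame.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["vocab"])  # ['xxunk', 'the', 'model']
```

The expensive step (tokenization/numericalization) happens once where memory is plentiful; the training machine only pays the cost of deserializing the finished artifact.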