
I have a file of more than 7 GB that contains almost 70 million lines. I want to read the file line by line, convert each line to a list of tokens, append that list to a previously defined list, and finally save the whole list to a file. Here is what I have written:

from nltk.tokenize import WordPunctTokenizer  # assuming NLTK's tokenizer, given the variable name
import pickle

word_punctuation_tokenizer = WordPunctTokenizer()

corpus = []
with open('file.txt') as f:
    for line in f:
        new = line.strip()
        new = word_punctuation_tokenizer.tokenize(new)
        corpus.append(new)

with open("newfile.txt", "wb") as fp:   # pickle the whole list at once
    pickle.dump(corpus, fp)

However, the list gets very large, and after reading about 5 million lines the program stops responding. What should I do?

  • Why not write to the file line by line, rather than trying to keep all the lines in memory? – Shimon Cohen Dec 11 '20 at 13:37
  • Process the data effectively; do not try to load the entire dataset into memory. Your RAM will hardly be 16 GB. – sameera sy Dec 11 '20 at 13:41
  • What do you need to do with the corpus afterwards? – Jasmijn Dec 11 '20 at 13:50
  • Take a look at the filedict/sqldict code example from this [video](https://www.youtube.com/watch?v=S_ipdVNSFlo); it might interest you as an easy way to move your data between memory and disk. – Copperfield Dec 11 '20 at 14:25

2 Answers


> What should I do?

Change the order in which you do things. There's no reason to load the entire file, process all of it, and only then start writing, as long as later parts of your processing don't depend on earlier ones. Just read, process, and write one line at a time.
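
For example, a minimal sketch of that streaming approach, assuming NLTK's WordPunctTokenizer (suggested by the variable name in the question) and an output file newfile.pkl (just a placeholder name), with each line's tokens written as its own pickled record:

import pickle
from nltk.tokenize import WordPunctTokenizer  # assumption: any per-line tokenizer would work here

tokenizer = WordPunctTokenizer()

# Write one pickled record per input line, so nothing large stays in memory.
with open('file.txt', encoding='utf-8') as src, open('newfile.pkl', 'wb') as dst:
    for line in src:
        tokens = tokenizer.tokenize(line.strip())
        pickle.dump(tokens, dst)

# Later, read the records back one at a time with repeated pickle.load calls.
with open('newfile.pkl', 'rb') as fp:
    while True:
        try:
            tokens = pickle.load(fp)
        except EOFError:
            break
        # ... work with one line's tokens here ...

Memory use then stays roughly constant regardless of the input size, because only one line's tokens are held at a time.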

Cubic
  • Are you sure you can write pickled data in chunks? – Matthias Dec 11 '20 at 13:39
  • @Matthias Sure you can, you just need to read it like that too. – Cubic Dec 11 '20 at 13:40
  • @Matthias: yes, see e.g. https://stackoverflow.com/a/12762056/1204143 – nneonneo Dec 11 '20 at 13:40
  • The problem is not with writing to the file, but with creating the list. After almost 5 million items, the list gets so big that the program stops responding. –  Dec 11 '20 at 13:41
  • I stopped working with `pickle` a long time ago. Seems I have to reread the documentation. – Matthias Dec 11 '20 at 13:41
  • @BNoor: instead of writing a single big list, try writing the items individually (or in chunks). Of course, this will depend a bit on what you're doing with the data afterwards. – nneonneo Dec 11 '20 at 13:42
  • I need to have all items in one list and save it so that I can use it later for another task. –  Dec 11 '20 at 13:44
  • @BNoor Why not use sqlite for this? Pickling seems horribly sub-optimal. – ekhumoro Dec 11 '20 at 13:45 (a sketch of this appears after the comments)
  • If you need all the data in a list then the simple answer is: you can't. Not with the available memory. – Matthias Dec 11 '20 at 13:45
  • @BNoor If you need all the items in a list either buy more RAM or come up with an approach that doesn't need you to keep the items in the list. – Cubic Dec 11 '20 at 13:59
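
Following up on the sqlite suggestion in the comments above: a minimal sketch, assuming a corpus.db file and a single-table layout invented for illustration (tokens are space-joined only to keep the example short):

import sqlite3
from nltk.tokenize import WordPunctTokenizer  # same tokenizer assumption as above

tokenizer = WordPunctTokenizer()

# Store one row per input line instead of one giant in-memory list.
con = sqlite3.connect('corpus.db')
con.execute('CREATE TABLE IF NOT EXISTS corpus (line_no INTEGER PRIMARY KEY, tokens TEXT)')

with open('file.txt', encoding='utf-8') as src:
    for i, line in enumerate(src):
        tokens = tokenizer.tokenize(line.strip())
        con.execute('INSERT INTO corpus VALUES (?, ?)', (i, ' '.join(tokens)))
con.commit()
con.close()

# Later, rows can be streamed back without loading everything at once.
con = sqlite3.connect('corpus.db')
for line_no, tokens in con.execute('SELECT line_no, tokens FROM corpus ORDER BY line_no'):
    pass  # process one row's tokens at a time
con.close()

The database then acts as the "list" that later tasks read from, without the whole corpus ever sitting in RAM.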

Your RAM will hardly be more than 16 GB. Process the data effectively and do not try to load the entire data set into memory; you are only overloading your processor and operating system. The data set should be processed at run time, not all at once. Try breaking your file down into multiple subfiles and processing them individually.
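
As a rough sketch of that splitting step (the chunk size and the part_*.txt names are arbitrary placeholders):

# Split the large input into smaller subfiles of LINES_PER_CHUNK lines each.
LINES_PER_CHUNK = 1_000_000

with open('file.txt', encoding='utf-8') as src:
    chunk_idx = 0
    out = open(f'part_{chunk_idx:04d}.txt', 'w', encoding='utf-8')
    for i, line in enumerate(src):
        if i and i % LINES_PER_CHUNK == 0:
            out.close()
            chunk_idx += 1
            out = open(f'part_{chunk_idx:04d}.txt', 'w', encoding='utf-8')
        out.write(line)
    out.close()

Each subfile can then be tokenized and processed on its own, without ever holding the full 70 million lines in memory.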

sameera sy
  • I have broken down the files, but I need all items to be stored in one list, and the program seems unable to process them. –  Dec 11 '20 at 13:46
  • @BNoor Even if you managed to create that gigantic pickle, how are you going to work with it? Does your system have sufficient resources to load it all into memory? – ekhumoro Dec 11 '20 at 13:54