I have a handful of text files, each ranging from 1 to 5 GB. The content consists of simple, unique one-liners.
I would like to:

1. mine the text (find patterns, word frequencies, clustering, etc.)
2. compare text patterns against another large file to find similarities (see the sketch below)
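
To make task 2 concrete, this is the naive version I would write for normal-sized files (paths are placeholders); it obviously loads one whole file into memory as a set:

```python
def common_lines(path_a, path_b):
    # Naive approach: hold every line of the first file in a set,
    # then stream the second file and check membership.
    with open(path_a, encoding="utf-8") as fa:
        seen = {line.strip() for line in fa}   # whole file ends up in RAM
    matches = []
    with open(path_b, encoding="utf-8") as fb:
        for line in fb:
            if line.strip() in seen:
                matches.append(line.strip())
    return matches
```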
Problem:
Memory runs out and the IDE can't cope, even when using generators.
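
For reference, this is roughly the generator-based pass I have been trying for word frequencies. The file itself is read lazily, but the Counter of unique tokens keeps growing, so memory still fills up (simplified sketch, path is a placeholder):

```python
from collections import Counter

def token_stream(path):
    # Lazily yield tokens one line at a time instead of reading the file into RAM.
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            for token in line.split():
                yield token.lower()

counts = Counter()
for token in token_stream("big_file.txt"):   # placeholder path
    counts[token] += 1                       # the Counter itself keeps growing

print(counts.most_common(20))
```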
Question:
What is the best approach to work with such large files?
Batching? Map/reduce? Hadoop? Using a database instead of Python? What I don't want is to write a function to find a pattern and then wait an hour for it to process (there are many patterns to write, let alone responses to wait for). Obviously, the conventional way of working with normal-sized files doesn't apply here.
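
To illustrate what I mean by batching or using a database: am I supposed to hand-roll something like the sketch below, which reads the file in bounded chunks and pushes partial counts into SQLite so memory stays flat? (Hypothetical sketch; the file name, table name, and batch size are made up, and the upsert syntax needs SQLite 3.24+.)

```python
import sqlite3
from collections import Counter
from itertools import islice

def count_in_batches(path, db_path="counts.db", batch_lines=500_000):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS freq (token TEXT PRIMARY KEY, n INTEGER)")
    with open(path, encoding="utf-8") as fh:
        while True:
            batch = list(islice(fh, batch_lines))   # read a bounded chunk of lines
            if not batch:
                break
            # Count tokens only for this chunk, then merge into the database.
            partial = Counter(tok.lower() for line in batch for tok in line.split())
            con.executemany(
                "INSERT INTO freq (token, n) VALUES (?, ?) "
                "ON CONFLICT(token) DO UPDATE SET n = n + excluded.n",
                partial.items(),
            )
            con.commit()
    con.close()
```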