
I have a handful of text files, ranging from 1 to 5 GB each. The content is simple: unique one-liners.

I would like to:

1. mine the text (find patterns, word frequency, clustering, etc.); a sketch of the kind of pass I mean follows this list.
2. compare text patterns against another large file to find similarities.
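
To make the word-frequency task concrete, this is the kind of streaming pass I mean (a sketch only; big_file.txt is a placeholder):

from collections import Counter

def lines(path):
    # Yield one stripped line at a time, so the whole file never sits in memory
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line.rstrip("\n")

freq = Counter()
for line in lines("big_file.txt"):
    freq.update(line.split())

print(freq.most_common(20))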

Problem:

Memory runs out and the IDE can't cope, even when using generators.

Question:

What is the best approach to work with such large files?

Batching? Map/reduce? Hadoop? Using a database instead of Python? For instance, is a batched map/reduce in pure Python, along the lines of the sketch below, viable? What I don't want is to write a function to find a pattern and then wait an hour for it to run; there is a lot to write, let alone wait for. Obviously, the conventional way of working with normal-sized files doesn't apply here.
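
A sketch of what I mean by batching (the pattern, path, and batch size are placeholders):

from itertools import islice
from multiprocessing import Pool

def count_matches(batch):
    # Map step: count the lines in one batch that contain the pattern
    return sum(1 for line in batch if "ERROR" in line)

def batches(path, size=100_000):
    # Yield fixed-size lists of lines, so the whole file is never loaded at once
    with open(path, encoding="utf-8", errors="replace") as f:
        while True:
            batch = list(islice(f, size))
            if not batch:
                return
            yield batch

if __name__ == "__main__":
    with Pool() as pool:
        # Reduce step: sum the per-batch counts
        total = sum(pool.imap_unordered(count_matches, batches("big_file.txt")))
    print(total)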

user1552294
  • As written, this question is way too broad (and is doomed to be closed). It would definitely help if you told us what sort of patterns you are looking for, how you want to cluster, etc. – inspectorG4dget Jan 21 '15 at 07:42
  • @InspectorG4dget: Think any complex text processing. The problem is not really how to process, but how to keep even the simplest concatenation from taking 10 minutes. – user1552294 Jan 21 '15 at 07:54

1 Answer


I would recommend Apache Spark, which can be used from Python. From the project's own description:

Apache Spark™ is a fast and general engine for large-scale data processing.

Write applications quickly in Java, Scala or Python.

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.

from pyspark import SparkContext

sc = SparkContext(appName="large-file-mining")

# Load the file as a distributed collection of lines
lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: "ERROR" in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as a list of strings
errors.filter(lambda line: "MySQL" in line).collect()
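
The same API covers the word-frequency part of your question. A minimal sketch along the same lines (reusing the sc context from above; the path is a placeholder):

counts = (sc.textFile("hdfs://...")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
# The top-20 result is tiny, so it is safe to bring back to the driver
print(counts.takeOrdered(20, key=lambda pair: -pair[1]))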
Boris Pavlović
  • Thanks, Boris. I still wonder whether anyone has dealt with files this large in practice. I would like to know whether Hadoop and Apache Spark are the only options, or whether it is actually possible to use pure Python. – user1552294 Jan 21 '15 at 20:14
  • There are plenty of testimonials: https://www.youtube.com/results?search_query=apache+spark – Boris Pavlović Jan 22 '15 at 07:52
  • Spark could spare you a lot of headaches with problems that it has already solved. – Boris Pavlović Jan 22 '15 at 07:53
  • 1
    Appreciate it. IHowever I have found something that works very well for me: Pandas library. Up to few GBs, it's perfect. No need to set up a distributed system. However, I'm sure I could use spark too. – user1552294 Jan 23 '15 at 07:59
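
For reference, the chunked pandas approach mentioned in the last comment looks roughly like this (a sketch, not the commenter's actual code; the path, pattern, and chunk size are placeholders):

import csv
import pandas as pd

match_count = 0
# Stream the file as a single-column table, one million rows at a time;
# sep="\x01" is a byte that should not occur, so each line stays one field
for chunk in pd.read_csv("big_file.txt", sep="\x01", header=None,
                         names=["line"], dtype=str, quoting=csv.QUOTE_NONE,
                         chunksize=1_000_000):
    match_count += int(chunk["line"].str.contains("MySQL", regex=False).sum())

print(match_count)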