Read files up to 4GB without MemoryError

Question

I can extract the output from a small file, but when I use a big file I get a MemoryError.

The big file is up to 4GB in size.

Here is the code


with open('file2.txt', 'r') as k:
    keywords = k.read().splitlines()
    
#2
with open('file1.txt') as f, open('output.txt', 'w') as o:
    for line in f:
        if any(key in line for key in keywords):
            o.writelines(line)

Error

Traceback (most recent call last):
  File "C:\crack\match.py", line 2, in <module>
    keywords = k.read().splitlines()
MemoryError

@OneCricketeer `readlines()` leaves all the newlines in the strings so the keyword matching won't work. — Barmar, Nov 10 '21 at 17:08
@Barmar The error message says that it is the keywords file. — ekhumoro, Nov 10 '21 at 17:10
Welcome to Stack Overflow! Please take the [tour] and read [ask]. What have you already tried to fix the problem? Have you considered processing `file2.txt` in smaller chunks? If `file2.txt` has lots of duplicates, have you tried using a `set`? You can [edit] to add the details. — wjandrea, Nov 10 '21 at 17:12
How big is your computer's memory? Have you considered upgrading it or clearing space so that this program can run? Maybe you could add a swapfile or pagefile, though it could be slow. — wjandrea, Nov 10 '21 at 17:13
What are you doing that you have 4GB of keywords? That seems like more words than there are in the language. — Barmar, Nov 10 '21 at 17:14
`any(key in line for key in keywords)` for every line in the other file is going to be really slow with so many keywords. — Barmar, Nov 10 '21 at 17:16
I think he can use something like https://docs.python.org/3/library/linecache.html — Mehrdad Heshmat, Nov 10 '21 at 17:23
@Barmar I have 2 file file1 = 'email:pass', 'file2 = 'email' i want to match email from file1 when i use small file around 100-200 Mb its work but not working with Big file — mrinfoleet, Nov 10 '21 at 17:23
And you have 100's of millions of emails? What is this, a spam list? — Barmar, Nov 10 '21 at 17:25
@mrinfoleet So does file1 contain lots of duplicates? If so, read it line by line and add the passwords to a set. — ekhumoro, Nov 10 '21 at 17:32
@ekhumoro No sir file1 does not have a duplicates How to read line by line? — mrinfoleet, Nov 10 '21 at 17:34
@wjandrea Processor : Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz 2.29 GHz Ram 8GB — mrinfoleet, Nov 10 '21 at 17:37
@mrinfoleet Processor? What do you mean? **edit**: Oh I see you edited to add the RAM — wjandrea, Nov 10 '21 at 17:38
@mrinfoleet `keywords = set(); for line in k: keywords.add(line.strip())`. But if you have less than 4Gb of accessible memory, that probably won't solve the problem. You should really put the keywords in a sqlite db. — ekhumoro, Nov 10 '21 at 17:39
@mrinfoleet Sorry, I meant to ask above if *file2* has lots of duplicates (i.e. the one with the keywords). If you do both `read()` **and** `splitlines()` it will more than double the memory usage, since it must create both a string and a list. Reading the file line by line into a set might help, but your algorithm for checking each line in the emails is going to be ***really*** slow if the length of the keyword set is very large. — ekhumoro, Nov 10 '21 at 17:52
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, Nov 11 '21 at 12:43
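
The sqlite suggestion from the comments could look roughly like the sketch below. It assumes, as the asker described, that file2.txt holds one email per line and file1.txt holds `email:pass` lines. The function name `filter_with_sqlite`, the table name, and the default database path are made up for illustration. Note that this does an exact lookup on the email part, not the substring test (`key in line`) from the question.

```python
import sqlite3


def filter_with_sqlite(keywords_path, data_path, out_path, db_path="keywords.db"):
    """Write the lines of data_path whose email part is listed in keywords_path."""
    con = sqlite3.connect(db_path)
    try:
        # PRIMARY KEY creates an on-disk index, so lookups do not need all keywords in RAM.
        con.execute("CREATE TABLE IF NOT EXISTS keywords (email TEXT PRIMARY KEY)")
        with open(keywords_path) as k:
            # Stream the keywords file; the generator yields one row at a time.
            rows = ((line.rstrip("\n"),) for line in k)
            con.executemany("INSERT OR IGNORE INTO keywords VALUES (?)", rows)
        con.commit()

        with open(data_path) as f, open(out_path, "w") as o:
            for line in f:
                # Assumed format "email:pass": the email is everything before the first ":".
                email = line.split(":", 1)[0].rstrip("\n")
                hit = con.execute(
                    "SELECT 1 FROM keywords WHERE email = ?", (email,)
                ).fetchone()
                if hit:
                    o.write(line)
    finally:
        con.close()
```

Calling `filter_with_sqlite('file2.txt', 'file1.txt', 'output.txt')` would mirror the file names in the question. Building the table for hundreds of millions of rows takes time and disk space, so this trades speed for memory.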

Manparvesh · Answer 1 · 2021-11-10T18:13:00.997

It looks like you need a memory-efficient data structure to store the keywords and later check whether they occur in file1.txt. I suggest using a trie so that you don't run out of memory.

Also, avoid reading the whole file at once. Read the keyword file one line at a time and insert each line into the trie.

A quick refactor of your code might look like this:

from collections import defaultdict

class Node:
    def __init__(self):
        self.children = defaultdict(Node)
        self.end = False

class Trie:

    def __init__(self):
        self.root = Node()

    def insert(self, word: str) -> None:
        cur = self.root
        for c in word:
            cur = cur.children[c]
        cur.end = True

    def contains(self, word: str) -> bool:
        cur = self.root
        for c in word:
            if c not in cur.children: return False
            cur = cur.children[c]
        return cur.end

# changes to your code below:
keywords = Trie()
with open('file2.txt', 'r') as k:
    for keyword in k:
        # strip the newline so a final line without one still compares equal
        keywords.insert(keyword.rstrip('\n'))

#2
with open('file1.txt') as f, open('output.txt', 'w') as o:
    for line in f:
        if keywords.contains(line.rstrip('\n')):
            o.write(line)

Note: A possible disadvantage of using a trie here is that if the keywords are unique and share few common prefixes, there is very little overlap and the trie can still use a lot of memory. Whether it stays below 4GB depends on how much the keywords overlap, since in pure Python every node carries its own dict.
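
To get a feel for how much prefix overlap matters, one could count the trie nodes for keywords that share a prefix and for keywords that do not. This is only an illustrative sketch: it uses a nested-dict trie rather than the `Node` class above, the helpers `build_trie` and `count_nodes` are hypothetical, and the sample words are invented.

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker (assumes "$" never occurs in the words)
    return root


def count_nodes(node):
    """Count the dict nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for key, child in node.items() if key != "$")


shared = ["anna", "anne", "annie"]   # common prefix "ann", 13 characters in total
distinct = ["bob", "carl", "dave"]   # no common prefix, 11 characters in total

print(count_nodes(build_trie(shared)))    # 8
print(count_nodes(build_trie(distinct)))  # 12
```

With no shared prefixes the trie needs one node per character plus the root, so it saves nothing over storing the strings themselves.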

Your test is the wrong way round. The OP wants to know which lines include any of the keywords, not which lines are contained in the keywords. You need to split up each line into a set of words and then check each one. It might be more efficient to also use a set for the keywords, since that would allow checking all the words of a line in a single operation - i.e. `if not keywords.isdisjoint(line.split()): o.write(line)`. (NB: don't use [writelines](https://docs.python.org/3/library/io.html#io.IOBase.writelines), since that takes a list of lines). — ekhumoro, Nov 10 '21 at 18:49
@Manparvesh I am getting invalid syntax error File "match.py", line 13 def insert(self, word: str) -> None: ^ SyntaxError: invalid syntax — mrinfoleet, Nov 10 '21 at 19:05
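
The set-based alternative from the comment above could be sketched as follows. It assumes one keyword per line in the keywords file and whitespace-separated words in the data file. The function names `load_keywords` and `filter_lines` are made up for illustration, and the set still has to fit in memory.

```python
def load_keywords(path):
    """Read one keyword per line into a set, without loading the file as one big string."""
    keywords = set()
    with open(path) as k:
        for line in k:
            word = line.strip()
            if word:  # skip blank lines
                keywords.add(word)
    return keywords


def filter_lines(keywords, data_path, out_path):
    """Copy the lines of data_path that contain at least one keyword as a whole word."""
    with open(data_path) as f, open(out_path, "w") as o:
        for line in f:
            # line.split() splits on whitespace; isdisjoint is False if any word is a keyword.
            if not keywords.isdisjoint(line.split()):
                o.write(line)
```

For the `email:pass` layout described in the question's comments, `line.split()` would not separate the email from the password. A test such as `line.split(':', 1)[0] in keywords` is probably closer to what is wanted there.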
