I can extract the output with a small file, but when I use a big file I get a memory error.

The big file is up to 4GB in size.

Here is the code:


with open('file2.txt', 'r') as k:
    keywords = k.read().splitlines()
    
#2
with open('file1.txt') as f, open('output.txt', 'w') as o:
    for line in f:
        if any(key in line for key in keywords):
            o.writelines(line)

Error

Traceback (most recent call last):
  File "C:\crack\match.py", line 2, in <module>
    keywords = k.read().splitlines()
MemoryError
  • Do not use `k.read().splitlines()`. Use `k.readlines()` – OneCricketeer Nov 10 '21 at 17:06
  • 3
    Which file is the big file? – Barmar Nov 10 '21 at 17:07
  • @OneCricketeer `readlines()` leaves all the newlines in the strings so the keyword matching won't work. – Barmar Nov 10 '21 at 17:08
  • 1
    @Barmar The error message says that it is the keywords file. – ekhumoro Nov 10 '21 at 17:10
  • Welcome to Stack Overflow! Please take the [tour] and read [ask]. What have you already tried to fix the problem? Have you considered processing `file2.txt` in smaller chunks? If `file2.txt` has lots of duplicates, have you tried using a `set`? You can [edit] to add the details. – wjandrea Nov 10 '21 at 17:12
  • 1
    How big is your computer's memory? Have you considered upgrading it or clearing space so that this program can run? Maybe you could add a swapfile or pagefile, though it could be slow. – wjandrea Nov 10 '21 at 17:13
  • 2
    What are you doing that you have 4GB of keywords? That seems like more words than there are in the language. – Barmar Nov 10 '21 at 17:14
  • 2
    `/usr/share/dict/words` is only 2MB – Barmar Nov 10 '21 at 17:14
  • 1
    `any(key in line for key in keywords)` for every line in the other file is going to be really slow with so many keywords. – Barmar Nov 10 '21 at 17:16
  • I think he can use something like https://docs.python.org/3/library/linecache.html – Mehrdad Heshmat Nov 10 '21 at 17:23
  • @Barmar I have two files: file1 contains `email:pass` lines and file2 contains `email` lines. I want to match the emails in file1. With small files of around 100-200 MB it works, but it does not work with the big file. – mrinfoleet Nov 10 '21 at 17:23
  • And you have 100's of millions of emails? What is this, a spam list? – Barmar Nov 10 '21 at 17:25
  • I just want to know how many of the emails have passwords in file1 – mrinfoleet Nov 10 '21 at 17:27
  • @mrinfoleet So does file1 contain lots of duplicates? If so, read it line by line and add the passwords to a set. – ekhumoro Nov 10 '21 at 17:32
  • @ekhumoro No sir, file1 does not have duplicates. How do I read it line by line? – mrinfoleet Nov 10 '21 at 17:34
  • @wjandrea Processor : Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz 2.29 GHz Ram 8GB – mrinfoleet Nov 10 '21 at 17:37
  • @mrinfoleet Processor? What do you mean? **edit**: Oh I see you edited to add the RAM – wjandrea Nov 10 '21 at 17:38
  • @mrinfoleet `keywords = set(); for line in k: keywords.add(line.strip())`. But if you have less than 4Gb of accessible memory, that probably won't solve the problem. You should really put the keywords in a sqlite db. – ekhumoro Nov 10 '21 at 17:39
  • @mrinfoleet Sorry, I meant to ask above if *file2* has lots of duplicates (i.e. the one with the keywords). If you do both `read()` **and** `splitlines()` it will more than double the memory usage, since it must create both a string and a list. Reading the file line by line into a set might help, but your algorithm for checking each line in the emails is going to be ***really*** slow if the length of the keyword set is very large. – ekhumoro Nov 10 '21 at 17:52
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Nov 11 '21 at 12:43
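The SQLite suggestion from the comments can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the table name `keywords`, and the `email:pass` line format (taken from the asker's comment) are not from the original code, and performance on a 4GB file has not been measured here.

```python
import sqlite3


def build_keyword_db(keyword_path, db_path):
    """Load one keyword per line into an indexed SQLite table (hypothetical helper)."""
    con = sqlite3.connect(db_path)
    # PRIMARY KEY gives an index for fast exact lookups and drops duplicates.
    con.execute("CREATE TABLE IF NOT EXISTS keywords (email TEXT PRIMARY KEY)")
    with open(keyword_path, encoding="utf-8", errors="replace") as k:
        # Generator: rows are streamed, the file is never held in memory at once.
        rows = ((line.strip(),) for line in k if line.strip())
        con.executemany("INSERT OR IGNORE INTO keywords VALUES (?)", rows)
    con.commit()
    return con


def filter_with_db(con, in_path, out_path, sep=":"):
    """Copy lines whose part before `sep` is a stored keyword; return the match count."""
    query = "SELECT 1 FROM keywords WHERE email = ?"
    matched = 0
    with open(in_path, encoding="utf-8", errors="replace") as f, \
            open(out_path, "w", encoding="utf-8") as o:
        for line in f:
            email = line.split(sep, 1)[0].strip()
            if con.execute(query, (email,)).fetchone():
                o.write(line)
                matched += 1
    return matched
```

With the asker's file names this would be called as `con = build_keyword_db('file2.txt', 'keywords.db')` followed by `filter_with_db(con, 'file1.txt', 'output.txt')`. The keywords live on disk, so memory use stays small, at the cost of one indexed query per input line.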

1 Answer

Looks like you need a memory-efficient data structure to store the keywords and later check whether they appear in file1.txt. I suggest using a trie so that you don't run out of memory.

Also, avoid reading the whole file at once: read the keyword file one line at a time and insert each keyword into the trie.

A quick refactor of your code might look like this:

from collections import defaultdict

class Node:
    def __init__(self):
        self.children = defaultdict(Node)
        self.end = False

class Trie:

    def __init__(self):
        self.root = Node()

    def insert(self, word: str) -> None:
        cur = self.root
        for c in word:
            cur = cur.children[c]
        cur.end = True

    def contains(self, word: str) -> bool:
        cur = self.root
        for c in word:
            if c not in cur.children: return False
            cur = cur.children[c]
        return cur.end

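# changes to your code below:
keywords = Trie()
with open('file2.txt', 'r') as k:
    for keyword in k:
        # strip the trailing newline so it does not become part of the key
        keywords.insert(keyword.rstrip('\n'))

#2
with open('file1.txt') as f, open('output.txt', 'w') as o:
    for line in f:
        # matches only lines that are exactly equal to a keyword
        if keywords.contains(line.rstrip('\n')):
            o.write(line)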

Note: A possible disadvantage of using a trie here is that if the keywords are mostly unique, they may share very few prefixes, so the trie might still use a lot of memory. How much it saves depends on that overlap, so it is worth checking on a sample of the keywords before relying on it staying under 4GB.
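One way to check the memory concern in the note is to measure on a sample before running on the full file. The sketch below uses the standard `tracemalloc` module to compare a per-character trie (same node layout as above) with a plain `set`. The helper names and the generated sample addresses are made up for illustration; real keywords may behave differently. In CPython every trie node is a full object with its own dict, so the per-keyword cost can be considerably higher than for a set when few prefixes are shared.

```python
import tracemalloc
from collections import defaultdict


class Node:
    def __init__(self):
        self.children = defaultdict(Node)
        self.end = False


def build_trie(words):
    """Insert every word into a fresh trie and return its root node."""
    root = Node()
    for word in words:
        cur = root
        for c in word:
            cur = cur.children[c]
        cur.end = True
    return root


def traced_bytes(build, make_items):
    """Return the bytes still allocated after build(make_items()) finishes."""
    tracemalloc.start()
    container = build(make_items())  # keep a reference so nothing is freed early
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del container
    return current


def sample_emails(n=10000):
    # Hypothetical sample data; swap in the first N lines of the real keyword file.
    return (f"user{i}@example.com" for i in range(n))
```

Calling `traced_bytes(set, sample_emails)` and `traced_bytes(build_trie, sample_emails)` gives two numbers that can be divided by the sample size and scaled to the real keyword count for a rough estimate.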

  • Your test is the wrong way round. The OP wants to know which lines include any of the keywords, not which lines are contained in the keywords. You need to split up each line into a set of words and then check each one. It might be more efficient to also use a set for the keywords, since that would allow checking all the words of a line in a single operation - i.e. `if not keywords.isdisjoint(line.split()): o.write(line)`. (NB: don't use [writelines](https://docs.python.org/3/library/io.html#io.IOBase.writelines), since that takes a list of lines). – ekhumoro Nov 10 '21 at 18:49
  • @Manparvesh I am getting an invalid syntax error: `File "match.py", line 13`, `def insert(self, word: str) -> None:`, `SyntaxError: invalid syntax` – mrinfoleet Nov 10 '21 at 19:05
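Following the set suggestion in the comment above, a minimal sketch could look like the one below. It assumes, based on the asker's description, that each line of file1 has the form `email:pass` and each line of file2 is a single email, so an exact lookup on the part before the first `:` is enough. The function names are illustrative only, and whether the set fits in memory depends on how many keywords there are.

```python
def load_keywords(path):
    """Read one keyword per line into a set, one line at a time."""
    keywords = set()
    with open(path, encoding="utf-8", errors="replace") as k:
        for line in k:
            key = line.strip()
            if key:  # skip blank lines
                keywords.add(key)
    return keywords


def filter_lines(keywords, in_path, out_path, sep=":"):
    """Write lines whose part before `sep` is in `keywords`; return the match count."""
    matched = 0
    with open(in_path, encoding="utf-8", errors="replace") as f, \
            open(out_path, "w", encoding="utf-8") as o:
        for line in f:
            email = line.split(sep, 1)[0].strip()
            # Set membership is a hash lookup, so its cost does not grow
            # with the number of keywords the way any(... for key in ...) does.
            if email in keywords:
                o.write(line)
                matched += 1
    return matched
```

Usage with the file names from the question would be `filter_lines(load_keywords('file2.txt'), 'file1.txt', 'output.txt')`. The returned count also answers how many emails were matched.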