
This small script reads a file, tries to match each line with a regex, and appends matching lines to another file:

regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")

with open("dbtropes-v2.nt", "a") as output, open("dbtropes.nt", "rb") as input:
    for line in input.readlines():
        if re.findall(regex,line):
            output.write(line)

input.close()
output.close()

However, the script abruptly stops after about 5 minutes. The terminal says "Process stopped", and the output file stays blank.

The input file can be downloaded here: http://dbtropes.org/static/dbtropes.zip (a 4.3 GB N-Triples file).

Is there something wrong with my code? Is it something else? Any hint would be appreciated on this one!

kormak
  • Try using `top` to see how much memory the process is using. And/or add some progress output. – Jesse W at Z - Given up on SE Oct 28 '14 at 18:22
  • As a side note, you probably don't want `findall` if you're just checking whether there are any matches. It probably won't have a _huge_ performance impact to find all the matches instead of just the first one, but it can't help, and since it's also conceptually a little confusing, better to just not do it. – abarnert Oct 28 '14 at 18:25
  • Also, if you're going to compile a pattern to a regex object, use its methods (`regex.findall(line)`), not the top-level functions (`re.findall(regex, line)`). The performance impact is probably even smaller here; again, it's about readability. (Also, the methods are more flexible, if you ever want to, say, extend things to, e.g., ignore the first 3 characters.) – abarnert Oct 28 '14 at 18:28
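To make the two comments above concrete, here is a minimal sketch comparing `search` with `findall` on a compiled pattern. The shortened pattern and the sample line are made up for illustration; they are not the ones from the question.

```python
import re

# Hypothetical, shortened pattern and sample line (illustration only).
pattern = re.compile(
    r"<http://example\.org/Film/.*?> <type> <http://example\.org/Main/.*?> \."
)
line = "<http://example.org/Film/A> <type> <http://example.org/Main/B> .\n"

# search() stops at the first match and returns a match object, or None.
first_match = pattern.search(line)

# findall() scans the whole string and builds a list of every match,
# which is more work than a yes/no check needs.
all_matches = pattern.findall(line)

# Methods of the compiled pattern also accept a start position,
# e.g. to ignore the first 3 characters of the line.
skipped = pattern.search(line, 3)
```

For a plain "does this line match?" test, `pattern.search(line)` used directly in an `if` is enough, since a match object is truthy and `None` is falsy.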

2 Answers

7

It stopped because it ran out of memory. input.readlines() reads the entire file into memory before returning a list of the lines.

Instead, use input as an iterator. This only reads a few lines at a time, and returns them immediately.

Don't do this:

for line in input.readlines():

Do do this:

for line in input:
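As a rough sketch of the same idea wrapped in a function: the names `copy_matching_lines` and `FILM_TYPE` are invented here, and the bytes pattern plus binary append mode (`"ab"`) are assumptions so that the snippet also runs on Python 3, where lines read in `"rb"` mode are `bytes`.

```python
import re

def copy_matching_lines(src_path, dst_path, pattern):
    """Append lines of src_path that match pattern to dst_path; return the count."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        # Iterating over the file object yields one line at a time, so only
        # the current line (plus a read buffer) is held in memory.
        for line in src:
            if pattern.search(line):
                dst.write(line)
                copied += 1
    return copied

# Bytes pattern, because both files are opened in binary mode.
FILM_TYPE = re.compile(
    br"<http://dbtropes\.org/resource/Film/.*?> "
    br"<http://www\.w3\.org/1999/02/22-rdf-syntax-ns#type> "
    br"<http://dbtropes\.org/resource/Main/.*?> \."
)
```

It would be called as `copy_matching_lines("dbtropes.nt", "dbtropes-v2.nt", FILM_TYPE)`. Returning the count is just a convenience for progress checks, not something the question asked for.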

Taking everyone's advice into account, your program becomes:

import re

regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")

with open("dbtropes.nt", "rb") as input:
    with open("dbtropes-v2.nt", "a") as output:
        # Iterate over the file object so lines are read lazily.
        for line in input:
            if regex.search(line):
                output.write(line)
Robᵩ
  • Beat me to it. Even in Python, when working with enough data you have to be careful how you handle it, because you can end up using more memory than your computer has. – Brandon Nadeau Oct 28 '14 at 18:27
1

Use `for line in input` rather than `readlines()` so the whole file isn't read into memory at once.

A minor point: You don't need to close files if you open them as context managers. You might find it cleaner like this:

with open("dbtropes-v2.nt", "a") as output:
    with open("dbtropes.nt", "rb") as input:
        # Both files are closed automatically when the blocks exit.
        for line in input:
            if re.findall(regex, line):
                output.write(line)
theodox
  • I like this code sample. I might reorder the `with` statements, though, to open input before opening output. That way, if the input file is not present, no extra resources will be allocated, and no spurious output files will be created. – Jonathan Eunice Oct 28 '14 at 18:29
  • Why is it not necessary to close the file in this context? I read earlier (see here for example: http://stackoverflow.com/questions/5972277/write-not-working-in-python) that due to buffering a file might not get written at all if it's not properly closed. – kormak Oct 28 '14 at 18:38
  • It is not necessary to explicitly close the file because `with` automatically calls `.close()` at the end of the indented statement. It *is* necessary to close the file: that's why you used `with`. See the example at [`file.close()`](https://docs.python.org/2/library/stdtypes.html#file.close). – Robᵩ Oct 28 '14 at 19:42
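The closing behaviour described in the last comment can be checked with a small sketch. The scratch file and the variable names are made up for this demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as output:
    output.write("hello\n")
    closed_inside = output.closed   # still open inside the block

closed_after = output.closed        # `with` closed the file on exit

# The file is also closed when the block is left through an exception.
try:
    with open(path, "a") as output2:
        raise ValueError("simulated failure")
except ValueError:
    pass
closed_after_error = output2.closed

# The buffered write was flushed to disk when the file was closed.
with open(path) as f:
    content = f.read()
```

Because closing flushes the buffer, the written data reaches the file without an explicit `close()` call, which is the buffering concern raised earlier in the comments.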