
I am trying to find a string near the end of a text file. The problem is that the text file can vary greatly in size, from 3 MB to 4 GB, and every time I try to run a script to find this string in a text file that is around 3 GB, my computer runs out of memory. So I was wondering if there was any way for Python to find the size of the file and then read just the last megabyte of it.

The code I am currently using is as follows, but as I said earlier, I do not seem to have enough memory to read such large files.

find_str = "ERROR"
file = open(file_directory)
last_few_lines = file.readlines()[-20:]

error = False

for line in last_few_lines:
    if find_str in line:
        error = True
nkshakya1

3 Answers


Use file.seek():

import os
find_str = b"ERROR"  # a bytes literal, because the file is read in binary mode
error = False
# Open file with 'b' to specify binary mode
with open(file_directory, 'rb') as file:
    file.seek(-1024 * 1024, os.SEEK_END)  # Note minus sign
    if find_str in file.read():
        error = True

You must specify binary mode when you open the file or you will get undefined behavior. Under Python 2 it might work anyway (it did for me), but under Python 3 seek() will raise an io.UnsupportedOperation exception if the file was opened in the default text mode. The Python 3 docs are here. Though it isn't clear from those docs, the SEEK_* constants are still in the os module. Note also that in binary mode the data read back is bytes, so the search string must be a bytes literal as well.
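One further caveat: if the file can be smaller than the window you seek back by, the negative seek from os.SEEK_END raises OSError. A sketch that clamps the offset first (read_tail is a hypothetical helper name, not from the question):

```python
import os

def read_tail(path, nbytes=1024 * 1024):
    """Return up to the last nbytes of a file, clamping the seek
    position so files smaller than nbytes don't raise OSError."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)   # jump to the end to learn the size
        size = f.tell()
        f.seek(max(size - nbytes, 0))  # never seek before the start
        return f.read()
```

Since the file is opened in binary mode, search the result with a bytes literal, e.g. `b"ERROR" in read_tail(path)`.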

Update: Using with statement for safer resource management, as suggested by Chris Betti.

Aryeh Leib Taurog
  • @nkshakya1 I updated the answer. You don't need `with` at all; it's just a convenience for closing the file. I added that at the end in a comment as a reminder to close it if you're finished with it. – Aryeh Leib Taurog Sep 30 '13 at 10:37
  • 1
    Alternatively if you have jython 2.5 you can add `from __future__ import with_statement` at the top of your code, then you can use `with` statements. – Aryeh Leib Taurog Sep 30 '13 at 10:43
  • Thanks for the help everyone! I think its working now. For another case, if I have a text file that is 5-10 kB and I want to look at the last 2 kB of it, do i just replace the '-1024 * 1024' with '-2 * 2' ? – nkshakya1 Sep 30 '13 at 12:09
  • 1
    @nkshakya1 Replace with `-2 * 1024` or just `-2048`. (1024 = kB) – Aryeh Leib Taurog Sep 30 '13 at 12:23
  • Thanks guys! Its finally working!! Really appreciate all the help! – nkshakya1 Sep 30 '13 at 12:36
  • @AryehLeibTaurog you edited to remove the 'with open' syntax, which seems to work with seek from end. 'with open' is a safer approach to filehandles, so out of curiosity, why? – Chris Betti Jul 02 '15 at 01:29
  • @ChrisBetti less relevant now, but as I recall primarily for compatibility with older python versions which don't support the `with` statement. – Aryeh Leib Taurog Jul 05 '15 at 14:58
  • @ChrisBetti changed it back now. – Aryeh Leib Taurog Jul 05 '15 at 15:08
  • I'm not sure what `os.SEEK_END` represents. Basically before invocation I will get size of file and then want to read 256 bytes before that. File is constantly growing 30 times a second. If it matters it is output from `ffplay` that states current seconds played of song. – WinEunuuchs2Unix Mar 14 '21 at 00:34

You can use the tail recipe with a deque to get the last n lines of a large file:

from collections import deque

def tail(fn, n):
    with open(fn) as fin:
        return list(deque(fin, n))

Now test this.

First create a big file:

>>> with open('/tmp/lines.txt', 'w') as f:
...    for i in range(1,10000000+1):
...       print >> f, 'Line {}'.format(i)  # Python 3: print('Line {}'.format(i), file=f)

# about 128 MB on my machine

Then test:

print tail('/tmp/lines.txt', 20)  # Python 3: print(tail('/tmp/lines.txt', 20))
# ['Line 9999981\n', 'Line 9999982\n', 'Line 9999983\n', 'Line 9999984\n', 'Line 9999985\n', 'Line 9999986\n', 'Line 9999987\n', 'Line 9999988\n', 'Line 9999989\n', 'Line 9999990\n', 'Line 9999991\n', 'Line 9999992\n', 'Line 9999993\n', 'Line 9999994\n', 'Line 9999995\n', 'Line 9999996\n', 'Line 9999997\n', 'Line 9999998\n', 'Line 9999999\n', 'Line 10000000\n']

This will return the last n lines rather than the last X bytes of a file, so the memory used is proportional to those n lines, not to the size of the file. The file object fin is used as an iterator over the lines of the file, so the entire file is never resident in memory at once.

dawg
  • won't that still load the whole file in memory? (or will it use fin as if it is a generator expression?) – Daan Timmer Sep 27 '13 at 10:08
  • `fin` will be used as a generator, so the entire file will not be in memory at the same time. – dawg Sep 27 '13 at 10:13
  • But, you can't specify an amount of bytes to read back this way? (Or perhaps OP doesn't need that specific 1MB) – Daan Timmer Sep 27 '13 at 10:26
  • @DaanTimmer: If you specify lines that are a variable width, you cannot specify bytes; if you specify bytes of a file that contains variable-width lines, you cannot specify exact lines. – dawg Sep 27 '13 at 15:04

The proposed answer using seek is a correct answer to your question, but I think it's not what you really want to do. Your solution loads the whole file into memory, just to get the last 20 lines. That's the main cause of your problem. The following would solve your memory issue:

for line in open(file_directory):  # file() was Python 2 only; open() works everywhere
    if find_str in line:
        error = True

This will iterate over all lines in the file, releasing each line after it has been processed. I would guess that this solution is already much faster than yours, so no further optimization is needed. But if you really do want just the last 20 lines, put them in a deque with a max length of 20.
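That deque variant might look like this (a sketch; error_in_tail is a hypothetical helper name, and the 20-line window matches the question):

```python
from collections import deque

def error_in_tail(path, find_str="ERROR", n=20):
    # Stream the file line by line; a deque with maxlen=n silently
    # discards the oldest line once it already holds n lines, so at
    # most n lines are ever in memory.
    last_lines = deque(maxlen=n)
    with open(path) as f:
        for line in f:
            last_lines.append(line)
    return any(find_str in line for line in last_lines)
```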

Achim