6

Say I have an Ubuntu VPS in the USA with a 10GB HDD (and I live somewhere else), and I have a 9GB text file on the hard drive. I have 512MB of RAM, and about the same amount of swap.

Given that I cannot add more disk space and cannot move the file elsewhere for processing, is there an efficient method to remove some lines from the file using Python (preferably, but any other language is acceptable)?

James Lin

5 Answers

3

How about this? It edits the file in place. I've tested it on some small text files (in Python 2.6.1), but I'm not sure how well it will perform on massive files because of all the jumping around, but still...

I've used an indefinite while loop with a manual EOF check, because "for line in f:" didn't work correctly (presumably all the jumping around messes up the normal iteration). There may be a better way to check this, but I'm relatively new to Python, so someone please let me know if there is.

Also, you'll need to define the function isRequired(line).

writeLoc = 0
readLoc = 0
with open( "filename" , "r+" ) as f:
    while True:
        line = f.readline()

        #manual EOF check; not sure of the correct
        #Python way to do this manually...
        if line == "":
            break

        #save how far we've read
        readLoc = f.tell()

        #if we need this line write it and
        #update the write location
        if isRequired(line):
            f.seek( writeLoc )
            f.write( line )
            writeLoc = f.tell()
            f.seek( readLoc )

    #finally, chop off the rest of file that's no longer needed
    f.truncate( writeLoc )
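
For Python 3, the same idea can be sketched as a function that opens the file in binary mode, so that tell() and seek() work on plain byte offsets. The name remove_lines_in_place and the is_required(line_no, line) signature are assumptions made for this illustration (the line number is passed in because it matters for the asker's filter); this is a sketch, not a tested tool for a 9GB file.

```python
def remove_lines_in_place(path, is_required):
    """Keep only lines for which is_required(line_no, line) is true.

    Hypothetical helper: the file is compacted in place and truncated
    at the end. Lines are bytes and line_no starts at 1.
    """
    write_loc = 0
    line_no = 0
    with open(path, "r+b") as f:
        while True:
            line = f.readline()
            if not line:  # empty bytes object means EOF
                break
            line_no += 1
            read_loc = f.tell()
            if is_required(line_no, line):
                # Only rewrite once something before this line was dropped.
                if write_loc != read_loc - len(line):
                    f.seek(write_loc)
                    f.write(line)
                    f.seek(read_loc)
                write_loc += len(line)
        # Chop off the tail that is no longer needed.
        f.truncate(write_loc)
    return write_loc
```

As in the code above, the file only shrinks at the final truncate(), and an interruption leaves it partially rewritten.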
DMA57361
  • +1: Almost exactly my solution, but with all the unclear details filled in. And tested. – Björn Pollex Dec 17 '10 at 11:55
  • Thanks for the suggestion. I am a bit worried that if anything goes wrong during this process, my file will no longer be in its original state, given that line numbers actually matter in the isRequired(line) function. I am aware that I can log to a file to "remember" what has been changed and continue afterward, but I would like to see if there is an effortless way to achieve this. – James Lin Dec 17 '10 at 12:22
  • @James Correct, the file is modified immediately, so if this fails for any reason the file *will* have been changed. You could record `readLoc` and `writeLoc` to allow you to resume running (I guess a `f.flush()` is probably then a good idea), but this won't help you roll back any changes. Importantly, note that the file size is not reduced until the call to `truncate()` - which is the very last action - so any change logs you create would have to fit in the spare space that is already available. Do you have any idea how much of the original file you need to keep? – DMA57361 Dec 17 '10 at 12:37
  • @DMA57361 I was only wondering how easily this can be achieved if I only want to remove the first few lines. It seems inefficient that removing even the very first line of the file causes the whole 10GB of data to shift. I thought there might be a "clever" function that magically changes the file pointer, LMAO. – James Lin Dec 17 '10 at 12:48
  • @James - if there is (who knows, there might be!) it'd be a fairly low-level file system operation, and outside my realm of knowledge. If you only need to trim the front of the file - not lines spread throughout it as I'd assumed when writing the above - then you might consider asking another question to see if such a thing exists. – DMA57361 Dec 17 '10 at 12:54
2

Try this:

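# Assumes `file` is open in "r+b" mode and remove(line) is defined.
removedLinesLength = 0
while True:
    lineStart = file.tell()
    # readline() keeps tell() accurate; "for line in file" does not
    line = file.readline()
    if not line:
        break
    currentReadPos = file.tell()
    if remove(line):
        removedLinesLength += len(line)
    elif removedLinesLength:
        # move the kept line back over the bytes removed so far
        file.seek(lineStart - removedLinesLength)
        file.write(line)  # line already ends with its newline
        file.seek(currentReadPos)
# cut off the leftover tail
file.truncate(file.tell() - removedLinesLength)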

I have not run this, but the idea is to modify the file in place by overwriting the lines you want to remove with the lines you want to keep. I am not sure how seeking and modifying interact with iterating over the file.

Björn Pollex
1

Update:

I have tried fileinput with inplace by creating a 1GB file. What I expected was different from what happened. I read the documentation properly this time.

Optional in-place filtering: if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently).

from docs/fileinput

So this no longer seems to be an option for you, because the backup copy needs disk space you don't have. Please check the other answers.
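
The behaviour quoted above can be seen in a small sketch. The helper name filter_with_fileinput and the ".bak" suffix are assumptions chosen for illustration; fileinput.input() with inplace=True and backup= is the documented API. While the loop runs, both the backup and the new file exist on disk, which is why this approach needs room for a second copy.

```python
import fileinput


def filter_with_fileinput(path, keep, backup=".bak"):
    """Rewrite path keeping only lines where keep(line) is true.

    Hypothetical helper: returns the name of the backup file that
    fileinput leaves behind when a backup suffix is given.
    """
    with fileinput.input(files=[path], inplace=True, backup=backup) as lines:
        for line in lines:
            if keep(line):
                # stdout is redirected into the new copy of the file
                print(line, end="")
    return path + backup
```

If no backup suffix is passed, fileinput deletes the backup when the input is closed, but the extra space is still needed while the file is being processed.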


Before Edit:

If you want to edit the file in place, check out Python's fileinput module (Docs).

I am really not sure about its efficiency when used with a 10GB file. But, to me, this seemed to be the only option you have using Python.

dheerosaur
0

Just sequentially read and write to the files.

f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

Source
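
As a rough illustration of the quoted behaviour, the sketch below reads a file in batches of complete lines. The generator name iter_line_batches and the 1 MiB default are assumptions for this example; the size argument is the one the quote calls sizehint (Python 3 documents it as hint).

```python
def iter_line_batches(f, hint=1024 * 1024):
    """Yield lists of complete lines, each list roughly `hint` bytes.

    Hypothetical helper: f is any open file object, and memory use is
    bounded by about one batch rather than by the whole file.
    """
    while True:
        batch = f.readlines(hint)
        if not batch:  # an empty list means EOF
            break
        yield batch
```

Each batch can then be filtered and written out before the next one is read.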

phant0m
0

Process the file in chunks of 10 to 20 MB or more. This would be the fastest way.

Another way of doing this is to stream the file and filter it, using AWK for example.

Example pseudocode:

file = open(rw)
linesCnt=50
newReadOffset=0
tmpWrtOffset=0
rule=1
processFile()
{
  while(rule)
  {
      (lines, newReadOffset) = getLines(file, newReadOffset) #should return the offset to read from next time
      if lines:
          x = [line for line in lines if line == cool]
          tmpWrtOffset = writeBackToFile(file, x, tmpWrtOffset) #should return new offset to write for the next time
      else:
          rule=0
  }
}

To resize the file at the end, use truncate(size=None).
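
A minimal Python sketch of the chunked approach might look like the following. The name filter_in_chunks, the keep(line) predicate and the 16 MiB default are assumptions for illustration, not part of the pseudocode above. A partial line at the end of a chunk is carried over to the next one, and the sketch assumes no single line is too large to hold in memory.

```python
def filter_in_chunks(path, keep, chunk_size=16 * 1024 * 1024):
    """Drop lines for which keep(line) is false, rewriting path in place.

    Hypothetical helper: lines are bytes and include their trailing newline.
    """
    read_pos = 0
    write_pos = 0
    carry = b""  # partial line left over from the previous chunk
    with open(path, "r+b") as f:
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            read_pos = f.tell()
            parts = (carry + chunk).split(b"\n")
            carry = parts.pop()  # text after the last newline, maybe empty
            kept = b"".join(p + b"\n" for p in parts if keep(p + b"\n"))
            # The write position never passes the read position, so
            # unread data is not overwritten.
            f.seek(write_pos)
            f.write(kept)
            write_pos = f.tell()
        if carry and keep(carry):  # final line without a newline
            f.seek(write_pos)
            f.write(carry)
            write_pos = f.tell()
        f.truncate(write_pos)
    return write_pos
```

Reading large chunks reduces the number of seeks compared with moving one line at a time, at the cost of holding one chunk in memory.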

bua
  • '-1 hater' please explain why do you think it's crap? – bua Dec 17 '10 at 10:52
  • Why would you write pseudocode for Python? For that matter, why would you write pseudocode that looks lower-level than Python itself normally does? – Karl Knechtel Dec 17 '10 at 10:59
  • Because I have no way to check whether it runs, and I'm not a native Python programmer. This should just give an idea of what he should be aware of. What the final code will look like is not my problem.... – bua Dec 17 '10 at 11:05