
Hello Stack Overflow,

I am trying to get this code to write the data to `stored_output` without line 1 (the title row).

What I have tried:

with open(filenamex, 'rb') as currentx:
    current_data = currentx.read()
    ## because of my file size I don't want to go through each line via the route shown below to remove the first line (title row)
    for counter, line in enumerate(current_data):
        if counter != 0:
            data.writeline(line)
    #stored_output.writelines(current_data)

Because of the file size I don't want to do a for loop (for efficiency).

Any constructive comments or code snippets would be appreciated.
Thanks AEA

  • I think since the file size is large, it's actually a good idea to loop through it; otherwise it would fill up your memory. – aIKid Nov 01 '13 at 00:46
  • Also, the way you've done it, `current_data` is one giant string, so `enumerate(current_data)` gives the index and value of each _character_, not each _line_. If you really want to read the whole file into memory as lines (which you probably don't), either do `current_data = currentx.read().splitlines()` or, better, `current_data = list(currentx)`. – abarnert Nov 01 '13 at 00:48
  • @aIKid Thanks for the comment; I am storing it in memory, otherwise we would be performing a file write a large number of times. Memory isn't the issue for me here. – AEA Nov 01 '13 at 00:49
  • hcwhsa's answer finishes the problem completely, check it out. – aIKid Nov 01 '13 at 00:52
  • @abarnert I am opening files sequentially and appending them in order. I do not need to read or process the data in any way. Would you suggest an alternative method? Thanks – AEA Nov 01 '13 at 00:52
  • @AEA: The most important thing by far is using an appropriate buffer size for reads and writes. The wasted CPU cost of splitting lines and looping over them will probably be nothing compared to the I/O cost of the reads and writes, even on a very slow computer with a very fast drive (e.g., an early MacBook Air), so I wouldn't worry about it at all unless you have good measurements that tell you otherwise. But, if you need it, I'll edit my answer to show what you can do. – abarnert Nov 01 '13 at 00:59

2 Answers

9

You can use next() on the file iterator to skip the first line and then write the rest of the content using file.writelines:

with open(filenamex, 'rb') as currentx, open('foobar', 'w') as data:
    next(currentx)            #drop the first line
    data.writelines(currentx) #write rest of the content to `data`

Note: Don't use file.read() if you want to read a file line by line; simply iterate over the file object to get one line at a time.
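
For completeness, here is a minimal sketch of the same idea written out end to end with placeholder paths (the `filenamex` and `stored_output` names simply mirror the question and are assumptions, not tested code from the question's environment):

filenamex = 'input.csv'  # placeholder input path
with open(filenamex, 'r') as currentx, open('stored_output.csv', 'w') as stored_output:
    next(currentx)                      # skip the title/header row
    stored_output.writelines(currentx)  # stream the remaining lines to the output file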

Ashwini Chaudhary
  • No idea we can do that.. Awesome solution! – aIKid Nov 01 '13 at 00:51
  • What to do when you get two answers which are brilliant? I have used this solution, but I do love abarnert's detailed description of different ways of being efficient. Thanks :) – AEA Nov 01 '13 at 02:08
6

Your first problem is that currentx.read() returns one giant string, so looping over it loops over each of the characters in that string, not each of the lines in the file.

You can read a file into memory as a giant list of strings like this:

current_data = list(currentx)

However, this is almost guaranteed to be slower than iterating over the file a line at a time (because you waste time allocating memory for the whole file, rather than letting Python pick a reasonable-size buffer) or processing the whole file at once (because you're wasting time splitting on lines). In other words, you get the worst of both worlds this way.

So, either keep it as an iterator over lines:

next(currentx) # skip a line
for line in currentx:
    # do something with each line

… or keep it as a string and split off the first line:

current_data = currentx.read()
first, _, rest = current_data.partition(b'\n')  # split off the first line (bytes, since the file was opened 'rb')
# do something with rest
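
Applied to the original task, that read-everything-at-once variant might look something like the sketch below (paths are placeholders, and it assumes the whole file fits comfortably in memory):

with open('input.csv', 'rb') as currentx, open('stored_output.csv', 'wb') as stored_output:
    current_data = currentx.read()                  # whole file in memory at once
    first, _, rest = current_data.partition(b'\n')  # split off the title row
    stored_output.write(rest)                       # write everything after it in one call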

What if it turns out that reading and writing the whole file at once is too slow (which is likely: it forces the early blocks out of any cache before they can be written, prevents interleaving, and wastes time allocating memory), but a line at a time is also too slow (which is unlikely, but not impossible: searching for newlines, copying small strings, and looping in Python aren't free; it's just that CPU time is so much cheaper than I/O time that it rarely matters)?

The best you can do is pick an ideal block size and do unbuffered reads and writes yourself, and only waste time searching for newlines until you find the first one.

If you can assume that the first line will never be longer than the block size, this is pretty easy:

BLOCK_SIZE = 8192  # a usually-good default, but if it matters, test
with open(inpath, 'rb', 0) as infile, open(outpath, 'wb', 0) as outfile:
    buf = infile.read(BLOCK_SIZE)
    first, _, rest = buf.partition(b'\n')  # drop everything up to and including the first newline
    outfile.write(rest)
    while True:
        buf = infile.read(BLOCK_SIZE)
        if not buf:
            break
        outfile.write(buf)

If I were going to do that more than once, I'd write a block file iterator function (or, better, look for a pre-tested recipe—they're all over ActiveState and the mailing lists).
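
For illustration, such a block iterator might look something like this minimal sketch (the `iter_blocks` name and the paths are made up here, not taken from any particular recipe):

def iter_blocks(fileobj, block_size=8192):
    """Yield successive blocks of up to block_size bytes from fileobj."""
    while True:
        block = fileobj.read(block_size)
        if not block:
            return
        yield block

# Usage sketch: drop everything up to and including the first newline,
# then copy the remaining blocks verbatim. Paths are placeholders.
with open('input.csv', 'rb', 0) as infile, open('stored_output.csv', 'wb', 0) as outfile:
    blocks = iter_blocks(infile)
    first_block = next(blocks, b'')
    outfile.write(first_block.partition(b'\n')[2])  # keep only what follows the header row
    for block in blocks:
        outfile.write(block)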

abarnert
  • I wanted to make sure I thanked you for this answer. I have accepted the other one, since it was the one I used, but I really appreciate some of the content of this answer. Thanks :) – AEA Nov 01 '13 at 02:11