I have a file in which lines are separated by a delimiter, say '.'. I want to read this file line by line, where a line ends at '.' instead of at a newline.

One way is:

with open('file', 'r') as f:
    for line in f.read().strip().split('.'):
        pass  # ....do some work

But this is not memory-efficient if the file is large. Instead of reading the whole file at once, I want to read it line by line.

open supports a 'newline' parameter, but it only accepts None, '', '\n', '\r', and '\r\n', as mentioned in the documentation.

Is there any way to read a file line by line efficiently, but based on a pre-specified delimiter?

Abhishek Gupta

3 Answers

You could use a generator:

def myreadlines(f, newline):
  buf = ""
  while True:
    while newline in buf:
      pos = buf.index(newline)
      yield buf[:pos]
      buf = buf[pos + len(newline):]
    chunk = f.read(4096)
    if not chunk:
      yield buf
      break
    buf += chunk

with open('file') as f:
  for line in myreadlines(f, "."):
    print(line)
NPE
  • Could be simplified a bit by changing the start of the outer loop to `for chunk in iter(functools.partial(f.read, 4096), ''): buf += chunk` and adding `if buf: yield buf` after the loop (not inside). – Harvey Mar 21 '17 at 17:56
  • @Harvey Nice! Is there a reason to use `functools.partial(f.read, 4096)` instead of `lambda: f.read(4096)`? – wjandrea Nov 04 '18 at 18:30
  • Lambdas are discouraged, and all the cool kids use `partial`. There's probably a good reason. For example, lambdas can't be pickled and therefore can't be passed to thread pools. – Harvey Nov 07 '18 at 21:24
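
The simplification suggested in the comments can be sketched as follows for Python 3. This is only an illustrative sketch, not the answerer's own code: the name `read_delimited` and the `chunk_size` parameter are hypothetical, and it assumes the file is opened in text mode. Unlike the generator above, it skips an empty trailing piece.

```python
import functools
import io


def read_delimited(f, delimiter, chunk_size=4096):
    """Yield pieces of text file `f` separated by `delimiter`, reading in chunks."""
    buf = ""
    # iter(callable, sentinel) keeps calling f.read(chunk_size) until it returns ""
    for chunk in iter(functools.partial(f.read, chunk_size), ""):
        buf += chunk
        # everything before the last delimiter is complete; keep the rest buffered
        *pieces, buf = buf.split(delimiter)
        yield from pieces
    if buf:
        # trailing text after the last delimiter
        yield buf


sample = io.StringIO("first.second.third")
print(list(read_delimited(sample, ".", chunk_size=4)))
# prints ['first', 'second', 'third']
```

Because the unfinished tail stays in `buf` until the next chunk arrives, a delimiter longer than one character is still found when it straddles two chunks.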

Here is a more efficient answer using FileIO and bytearray, which I used for parsing a PDF file:

import io
import re


# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = b'\r\n|\r|\n'  

# the end-of-file char
EOF = b'%%EOF'



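def readlines(fio):
    buf = bytearray(4096)
    pending = b''  # incomplete line carried over between chunks
    while True:
        n = fio.readinto(buf)  # number of bytes actually read
        pending += bytes(buf[:n])
        eof_pos = pending.find(EOF)
        if eof_pos != -1 or not n:
            # stop at the EOF marker, or at the real end of the file
            if eof_pos != -1:
                pending = pending[:eof_pos]
            for line in re.split(EOL_REGEX, pending):
                yield line
            break
        # hold back a trailing '\r': it may be the first half of '\r\n'
        cut = len(pending) - 1 if pending.endswith(b'\r') else len(pending)
        *lines, rest = re.split(EOL_REGEX, pending[:cut])
        pending = rest + pending[cut:]
        for line in lines:
            yield line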


with io.FileIO("test.pdf") as fio:
    for line in readlines(fio):
        ...

The above example also handles a custom EOF. If you don't want that, use this:

import io
import os
import re


# the end-of-line chars, separated by a `|` (logical OR)
EOL_REGEX = b'\r\n|\r|\n'  


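def readlines(fio, size):
    buf = bytearray(4096)
    pending = b''  # incomplete line carried over between chunks
    while fio.tell() < size:
        n = fio.readinto(buf)  # number of bytes actually read
        if not n:
            break  # the file ended earlier than `size` suggested
        pending += bytes(buf[:n])
        # hold back a trailing '\r': it may be the first half of '\r\n'
        cut = len(pending) - 1 if pending.endswith(b'\r') else len(pending)
        *lines, rest = re.split(EOL_REGEX, pending[:cut])
        pending = rest + pending[cut:]
        for line in lines:
            yield line
    # whatever is left belongs to the final line(s)
    for line in re.split(EOL_REGEX, pending):
        yield line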

size = os.path.getsize("test.pdf")
with io.FileIO("test.pdf") as fio:
    for line in readlines(fio, size):
         ...
Dev Aggarwal

The easiest way would be to preprocess the file to generate newlines where you want them.

Here's an example using perl (assuming you want the string 'abc' to be the newline):

perl -pe 's/abc/\n/g' text.txt > processed_text.txt

If you also want to ignore the original newlines, use the following instead:

perl -ne 's/\n//; s/abc/\n/g; print' text.txt > processed_text.txt
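
For readers without perl at hand, the same preprocessing idea can be sketched in Python. This is a rough equivalent under stated assumptions, not a drop-in replacement: the function name `replace_delimiter` and its `drop_newlines` flag are hypothetical, and the delimiter is treated as a literal string rather than a regular expression. Like the perl commands, it reads the input one newline-terminated line at a time, so an input with no newlines at all would still be loaded as a single line.

```python
def replace_delimiter(src_path, dst_path, delimiter, drop_newlines=False):
    """Copy src_path to dst_path, turning each `delimiter` into a newline."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:  # one input line at a time, like perl -p / -n
            if drop_newlines:
                # counterpart of s/\n//: discard the original line break
                line = line.rstrip("\n")
            dst.write(line.replace(delimiter, "\n"))
```

Calling `replace_delimiter('text.txt', 'processed_text.txt', 'abc')` would mirror the first command, and passing `drop_newlines=True` would mirror the second.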
Bruno Gomes