
I am reading a file using Python, and the file is divided into sections, each of which starts with a header line beginning with the '#' character:

#HEADER1, SOME EXTRA INFO
data first section
1 2
1 233 
...
// THIS IS A COMMENT
#HEADER2, SECOND SECTION
452
134
// ANOTHER COMMENT
...
#HEADER3, THIRD SECTION

Now I wrote code to read the file as follows:

with open(filename) as fh:

    enumerated = enumerate(iter(fh.readline, ''), start=1)

    for lino, line in enumerated:

        # handle special section
        if line.startswith('#'):

            print("="*40)
            print(line)

            while True:

                start = fh.tell()
                lino, line = next(enumerated)

                if line.startswith('#'):
                    fh.seek(start)
                    break

                print("[{}] {}".format(lino,line))

The output is:

========================================
#HEADER1, SOME EXTRA INFO

[2] data first section

[3] 1 2

[4] 1 233 

[5] ...

[6] // THIS IS A COMMENT

========================================
#HEADER2, SECOND SECTION

[9] 452

[10] 134

[11] // ANOTHER COMMENT

[12] ...

========================================
#HEADER3, THIRD SECTION

As you can see, the line counter lino is no longer valid after the first section because I'm using seek: the header line that I rewind to gets counted twice. It won't help if I decrement lino before breaking out of the loop, because the counter lives inside enumerate and is incremented on every call to next. Is there an elegant way to solve this problem in Python 3.x? Also, is there a better way of handling the StopIteration at the end of the file than putting a pass statement in an except block?

UPDATE

So far I have adopted an implementation based on the suggestion made by @Dunes. I had to change it a bit so I can peek ahead to see if a new section is starting. I don't know if there's a better way to do this, so please jump in with comments:

class EnumeratedFile:

    def __init__(self, fh, lineno_start=1):
        self.fh = fh
        self.lineno = lineno_start

    def __iter__(self):
        return self

    def __next__(self):
        result = self.lineno, self.fh.readline()
        if result[1] == '':
            raise StopIteration

        self.lineno += 1
        return result

    def mark(self):
        self.marked_lineno = self.lineno
        self.marked_file_position = self.fh.tell()

    def recall(self):
        self.lineno = self.marked_lineno
        self.fh.seek(self.marked_file_position)

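    def section(self):
        # Peek at the next character without consuming it. We are still in
        # the current section unless a new header ('#') or the end of the
        # file ('') follows; without the EOF check, next(e) would raise
        # StopIteration inside the processing loop.
        pos = self.fh.tell()
        char = self.fh.read(1)
        self.fh.seek(pos)
        return char not in ('', '#')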

And then the file is read and each section is processed as follows:

# create enumerated object
e = EnumeratedFile(fh)

header = ""
for lineno, line, in e:

    print("[{}] {}".format(lineno, line))

    header = line.rstrip()

    # HEADER1
    if header.startswith("#HEADER1"):

        # process header 1 lines
        while e.section():

            # get node line
            lineno, line = next(e)
            # do whatever needs to be done with the line

    elif header.startswith("#HEADER2"):

        # etc.
aaragon
  • You cannot reset the `enumerate()` count, no. Mixing seeking and iteration is not a good idea, anyway. – Martijn Pieters Dec 09 '14 at 15:33
  • What is the goal here? To number the lines in each section, starting at 1 for each new section? – Martijn Pieters Dec 09 '14 at 15:46
  • The goal is to alert the user there's a problem in certain line number from the input file in case something goes wrong when reading it. I could replace enumerate by a counter and increase it every time I call next, and decrease it every time I find a new section when calling seek. – aaragon Dec 09 '14 at 15:47
  • I'm not sure why you need to seek *at all*. Why not store the lines read in a buffer instead? – Martijn Pieters Dec 09 '14 at 15:48
  • I don't want to use a buffer, because the data contained in each section may be really huge. The thing is that each section is delimited by finding a new starting character '#', and thus I need to stop processing that section and move to the next one (and therefore the use of seek). – aaragon Dec 09 '14 at 15:49
  • I see you are using `iter(fh.readline, '')`. That trick is not needed in Python 3, where `TextIOWrapper` can handle seeking and iteration together. – Martijn Pieters Dec 09 '14 at 15:49
  • All you need to buffer then is the section header.. You are effectively seeking just to reread one line. – Martijn Pieters Dec 09 '14 at 15:50
  • I'm not familiar with TextIOWrapper, I'll check it out right away. – aaragon Dec 09 '14 at 15:50
  • No need to check out TextIOWrapper; I am just pointing out you are applying a work-around for a Python 2 problem that no longer exists in Python 3. – Martijn Pieters Dec 09 '14 at 15:51
  • Do you have an example that uses code in Python 3? – aaragon Dec 09 '14 at 15:52
  • I posted an answer; the `iter(fh.readline, '')` work-around can just be replaced by `fh`; iteration directly over the file object. – Martijn Pieters Dec 09 '14 at 15:59

2 Answers

You cannot alter the counter of the enumerate() iterable, no.

You don't need to at all here, nor do you need to seek. Instead use a nested loop and buffer the section header:

with open(filename) as fh:
    enumerated = enumerate(fh, start=1)
    header = None
    for lineno, line in enumerated:
        # skip ahead to the first section header
        if line.startswith('#'):
            header = line
            break

    while header is not None:
        print("=" * 40)
        print(header.rstrip())
        header = None
        for lineno, line in enumerated:
            if line.startswith('#'):
                # new section: buffer its header and end the current one
                header = line
                break

            # section line, handle as such
            print("[{}] {}".format(lineno, line.rstrip()))

This buffers the header line only; every time we come across a new header, it is stored and the current section loop is ended. The outer loop is a while rather than a for so that moving on to the next section does not pull another line from the iterator, which would silently drop the first data line of that section.

Demo:

>>> from io import StringIO
>>> demo = StringIO('''\
... #HEADER1, SOME EXTRA INFO
... data first section
... 1 2
... 1 233 
... ...
... // THIS IS A COMMENT
... #HEADER2, SECOND SECTION
... 452
... 134
... // ANOTHER COMMENT
... ...
... #HEADER3, THIRD SECTION
... ''')
>>> enumerated = enumerate(demo, start=1)
>>> header = None
>>> for lineno, line in enumerated:
...     # skip ahead to the first section header
...     if line.startswith('#'):
...         header = line
...         break
... 
>>> while header is not None:
...     print("=" * 40)
...     print(header.rstrip())
...     header = None
...     for lineno, line in enumerated:
...         if line.startswith('#'):
...             # new section: buffer its header and end the current one
...             header = line
...             break
...         # section line, handle as such
...         print("[{}] {}".format(lineno, line.rstrip()))
... 
========================================
#HEADER1, SOME EXTRA INFO
[2] data first section
[3] 1 2
[4] 1 233
[5] ...
[6] // THIS IS A COMMENT
========================================
#HEADER2, SECOND SECTION
[8] 452
[9] 134
[10] // ANOTHER COMMENT
[11] ...
========================================
#HEADER3, THIRD SECTION

The line numbers now match the file, including after each header. The third section has no lines, so only its header is printed.
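If the goal is mainly to report the offending line number when a section fails to parse, the header-buffering idea can also be packaged as a generator. The following is only a sketch: `tagged_lines` is a hypothetical helper name, and the integer check for `#HEADER2` is an assumed validation rule used purely for illustration. It streams the input, keeps just the current header in memory, and never seeks.

```python
from io import StringIO


def tagged_lines(lines, start=1):
    """Yield (lineno, header, line) for every data line under a '#' header.

    Hypothetical helper, not part of the original answer. Only the current
    header is buffered; lines before the first header are skipped.
    """
    header = None
    for lineno, line in enumerate(lines, start):
        if line.startswith('#'):
            header = line.rstrip()  # buffer the header only
        elif header is not None:
            yield lineno, header, line.rstrip()


demo = StringIO(
    "#HEADER1, SOME EXTRA INFO\n"
    "1 2\n"
    "#HEADER2, SECOND SECTION\n"
    "452\n"
    "oops\n"
)

for lineno, header, line in tagged_lines(demo):
    if header.startswith("#HEADER2"):
        try:
            int(line)  # assumed rule: HEADER2 lines must be integers
        except ValueError:
            print("line {}: expected an integer, got {!r}".format(lineno, line))
```

Because `enumerate` is never rewound, the reported number is always the physical line number in the file.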

Martijn Pieters

You can copy the iterator, and then restore the iterator from that copy. However, you can't copy a file object. You could take a shallow copy of the enumerator and then seek to the respective part of the file when you start using the copied enumerator.

However, the best thing to do would be to write your own iterator class, with a __next__ method that produces line numbers and lines, and mark and recall methods that record a state and return to a previously recorded state.

class EnumeratedFile:

    def __init__(self, fh, lineno_start=1):
        self.fh = fh
        self.lineno = lineno_start

    def __iter__(self):
        return self

    def __next__(self):
        result = self.lineno, next(self.fh)
        self.lineno += 1
        return result

    def mark(self):
        self.marked_lineno = self.lineno
        self.marked_file_position = self.fh.tell()

    def recall(self):
        self.lineno = self.marked_lineno
        self.fh.seek(self.marked_file_position)

You would use it like this:

from io import StringIO
demo = StringIO('''\
#HEADER1, SOME EXTRA INFO
data first section
1 2
1 233 
...
// THIS IS A COMMENT
#HEADER2, SECOND SECTION
452
134
// ANOTHER COMMENT
...
#HEADER3, THIRD SECTION
''')

e = EnumeratedFile(demo)
seen_header2 = False
for lineno, line, in e:
    if seen_header2:
        print(lineno, line)
        assert (lineno, line) == (2, "data first section\n")
        break
    elif line.startswith("#HEADER1"):
        e.mark()
    elif line.startswith("#HEADER2"):
        e.recall()
        seen_header2 = True
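When the only thing that ever needs to be undone is the single line just read (the next section's header), a lighter variant is possible. The sketch below is an assumption-laden alternative, not the class above: `PushbackLines` and `push_back` are hypothetical names, and it keeps a one-item pushback slot instead of calling `tell()` and `seek()`, so the line counter never drifts and it works on any iterable of lines.

```python
from io import StringIO


class PushbackLines:
    """Number lines and allow the most recently read one to be un-read.

    Hypothetical alternative to mark()/recall(): no tell() or seek().
    """

    def __init__(self, lines, start=1):
        self._numbered = enumerate(lines, start)
        self._pending = None  # holds at most one (lineno, line) pair

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending is not None:
            item, self._pending = self._pending, None
            return item
        return next(self._numbered)  # StopIteration propagates at EOF

    def push_back(self, item):
        self._pending = item


demo = StringIO("#HEADER1\n1 2\n#HEADER2\n452\n")
lines = PushbackLines(demo)
seen = []
for lineno, line in lines:
    if line.startswith('#'):
        seen.append((lineno, line.rstrip()))
        # consume this section's data lines
        for lineno, line in lines:
            if line.startswith('#'):
                lines.push_back((lineno, line))  # next header: un-read it
                break
            seen.append((lineno, line.rstrip()))
```

The inner `for` loop ends cleanly at the end of the file, so no explicit `StopIteration` handling is needed, and the pushed-back header is returned with its original line number.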
Dunes
  • I am checking your code and it gives me an error at line `self.marked_file_position = self.fh.tell()`, it says `OSError: telling position disabled by next() call`. Any ideas? – aaragon Dec 10 '14 at 13:55
  • That wasn't an issue for my version of Python. Hmm. I was using `next` as it will raise `StopIteration` automatically at the end of the file. However, you can replace `next(self.fh)` with `self.fh.readline()` and test if the line is `''` and `raise StopIteration` when it is. – Dunes Dec 10 '14 at 14:03
  • That worked. Could you comment a bit on your implementation? I mean, I don't understand how it works as my level of Python is not that high yet. What's the purpose of `mark` and `recall`? – aaragon Dec 10 '14 at 14:12
  • They're just methods to remember the current position of the iterator and to restore the position from a remembered position. Perhaps `record_position` and `restore_position` would have been better names. So after you see the first header, you mark this position so you can jump back to this position if you need to. – Dunes Dec 10 '14 at 14:39
  • Now if every time I enter a section of the file, I would like to do some processing of the data until another section starts, how would I do that? That was the original purpose of my post. As far as I understand your implementation, I can only loop over the iterable and then go back to a saved state. But how do I loop over the data lines until I reach another section? (the section starts with a # character) – aaragon Dec 10 '14 at 14:44
  • Maybe inside the `next` function I could test if the new found line starts with #, and if so, recall to that state? But then what I return? – aaragon Dec 10 '14 at 14:48