
I have a large text file whose values are split into sections by header lines starting with "#". If a header matches a given condition, I would like to read the file from that header until the next "#" header and skip the rest of the file.

To test this, I'm trying to read the following text file, named test234.txt:

# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr

The code I wrote is:

file_t = open('test234.txt')
cond = True
while cond:
    for line_ in file_t:
        print(line_)
        if file_t.read(1) == "#":
            cond = False
file_t.close()

But, the output I'm getting is:

# abcdefgh

fnrnf

rkfr

foiernfr

erfnr

something

jndjen kj

jkndjke

vcrvr

Instead I would like the output between two headers separated by "#" which is:

1fnrnf
mrkfr
nfoiernfr
nerfnr

How can I do that? Thanks!

EDIT: The question "Reading in file block by block using specified delimiter in python" covers reading a file in groups separated by headers, but I don't want to read every section. I only want to read the section whose header meets a given condition, and to stop reading the file as soon as the next header marked by '#' is reached.

Light_B
  • The line has a newline character at the end and print adds another one. Use `print(line.rstrip())` to remove the trailing newline. – Matthias Feb 26 '18 at 15:52
  • Is your file using Windows line endings `\r\n`? If so, use the `rstrip` method. – Alexander Ejbekov Feb 26 '18 at 15:52
  • Yes, the file has \n character but I simply want the output between the 2 headers specified by "#" – Light_B Feb 26 '18 at 15:55
  • Possible duplicate of [Reading in file block by block using specified delimiter in python](https://stackoverflow.com/questions/38655176/reading-in-file-block-by-block-using-specified-delimiter-in-python) – Chris_Rands Feb 26 '18 at 16:16

2 Answers


itertools.groupby can help:

from io import StringIO
from itertools import groupby

text = '''# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr'''


with StringIO(text) as file:
    lines = (line.strip() for line in file)  # removing trailing '\n'
    # startswith() also copes with empty lines, where x[0] would raise IndexError
    for key, group in groupby(lines, key=lambda x: x.startswith('#')):

        if key:
            # found a line that starts with '#'
            print('found header: {}'.format(next(group)))
        else:
            # group now contains all lines that do not start with '#'
            print('\n'.join(group))

note that all of this is lazy: you'd only ever have the lines between two headers in memory at once.

to read your actual file, replace `with StringIO(text) as file:` with `with open('test234.txt', 'r') as file:`.

the output for your test is:

found header: # abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
found header: # something
njndjen kj
ejkndjke
found header: #vcrvr

UPDATE: i misunderstood the question. here is a fresh attempt (text is the same string as above):

from io import StringIO
from collections import deque
from itertools import takewhile

from_line = '# abcdefgh'
to_line = '# something'

with StringIO(text) as file:
    lines = (line.strip() for line in file)  # removing trailing '\n'

    # fast-forward up to from_line
    deque(takewhile(lambda x: x != from_line, lines), maxlen=0)

    for line in takewhile(lambda x: x != to_line, lines):
        print(line)

where i use itertools.takewhile to get an iterator over the lines until a condition is met (in your case, until the closing header is found).

the deque part is just the consume pattern suggested in the itertools recipes: it fast-forwards through the lines until the given condition no longer holds. note that takewhile also consumes the line that ends it, so the from_line header itself is not printed.
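if only the first few characters of the header are known, and reading should stop at whichever header comes next, the same fast-forward idea can be adapted. what follows is a minimal sketch, not a definitive implementation: the helper name `read_section` and its `prefix` parameter are made up for illustration, and it uses `itertools.dropwhile` in place of the deque consume pattern.

```python
from io import StringIO
from itertools import dropwhile, takewhile


def read_section(file, prefix):
    """Yield the lines after the first header starting with `prefix`,
    stopping at the next line that starts with '#'.

    `read_section` and `prefix` are hypothetical names for this sketch.
    """
    lines = (line.rstrip('\r\n') for line in file)
    # skip everything before the wanted header
    lines = dropwhile(lambda x: not x.startswith(prefix), lines)
    next(lines, None)  # drop the header line itself (no error if it was never found)
    # yield lazily until the next header; later lines are never requested
    yield from takewhile(lambda x: not x.startswith('#'), lines)


text = '''# abcdefgh
1fnrnf
mrkfr
nfoiernfr
nerfnr
# something
njndjen kj
ejkndjke
#vcrvr'''

with StringIO(text) as file:
    for line in read_section(file, '# abc'):
        print(line)
```

with a real file you would pass the object returned by `open('test234.txt')` instead of the StringIO. if no header matches the prefix, the generator simply yields nothing.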

hiro protagonist
  • Could you explain how the groupby is working? Is group iterating line by line in the for loop? Also, the original file I want to read is pretty big, so is the code reading the file line by line or all the lines at once? Thanks! – Light_B Feb 26 '18 at 16:14
  • as mentioned: this is lazy. everything used here is a generator. so yes: the file is treated line-by-line and not read as a whole. `groupby` reads until the condition (first character in the line == '#'?) changes and returns the `key` (the value of the condition) and an iterator over the `group` (which is all the lines in between). the documentation is pretty helpful. – hiro protagonist Feb 26 '18 at 16:18
  • It took me some time to understand your solution being a beginner. It's removing the headers and grouping all the data as one. Instead, I would only like to read the data between two given headers and skip the whole file as specified in the question. Maybe you already meant it but I'm not able to work it out. Also, the question asked by Chris is reading the whole data in different sections separated by headers whereas I only want to read one section of my data specified by a given header and skip everything else. – Light_B Feb 26 '18 at 17:11
  • Thanks for the update. I also found using regex to be very simple for understanding as a beginner as suggested by @accumulatorax in the other solution. Do you think what you suggested is faster & efficient over using regex? I've developed my own solution building on accumulatorax solution. I can post it for comparison? – Light_B Feb 26 '18 at 20:30
  • if you wonder about speed, there is the `timeit` module. i'd say that regex is overkill if you know exactly what the header you are looking for looks like. regex is great if you know only its structure. – hiro protagonist Feb 26 '18 at 20:56
  • Is it that it only works if the whole header matches instead of a part of it? As in my case, it's only possible to specify the first few characters of the header line beforehand. The user doesn't know that whole header before-hand to specify it as a condition. – Light_B Feb 26 '18 at 21:14
  • I figured I can index over the line to match whichever characters I want. So, it will work. Thanks a lot :) – Light_B Feb 26 '18 at 21:24

Learn and use regex. It will help you with all kinds of text-processing tasks.

import re  # regex library

with open('test234.txt') as f:  # file stream
    lines = f.readlines()       # reads all lines into memory

p = re.compile('^#.*')          # pattern for lines that start with '#'

for l in lines:
    if p.match(l) is None:      # keep only the non-header lines
        # strip the line ending; l[:-2] would also cut off a real character
        print(l.rstrip('\r\n'))
mujdecisy
  • Could you add some comments for a beginner like me to understand more? What is re.compile doing there? – Light_B Feb 26 '18 at 16:15
  • Regular expressions let you find patterns (that you describe yourself) in strings. _^#.*_ means that you are looking for strings that start with a # mark. Check [this](https://www.computerhope.com/jargon/r/regex.htm) out for some more info. – mujdecisy Feb 26 '18 at 16:28
  • Will it work if I don't want to read all the lines at once since the file is pretty large? – Light_B Feb 26 '18 at 16:33
  • Of course. You can do it inside the "with" block, in a while loop using the readline() function. – mujdecisy Feb 26 '18 at 16:38
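
For a large file, the regex approach can also run line by line by iterating over the file object instead of calling readlines(). The following is a minimal sketch under stated assumptions: the pattern `wanted` and the function name `section_lines` are hypothetical, chosen only for illustration, and a plain for loop over the file stands in for the while/readline() loop mentioned above.

```python
import re
from io import StringIO

header = re.compile(r'^#')            # any header line
wanted = re.compile(r'^#\s*abc')      # hypothetical pattern for the header we want


def section_lines(file):
    """Yield the lines between the wanted header and the next header."""
    inside = False
    for line in file:                 # the file object yields one line at a time
        if inside:
            if header.match(line):
                break                 # next header reached: stop reading
            yield line.rstrip('\r\n')
        elif wanted.match(line):
            inside = True             # start collecting after this header


sample = StringIO('# abcdefgh\n1fnrnf\nmrkfr\n# something\nnjndjen kj\n')
for line in section_lines(sample):
    print(line)
```

With the real file, the StringIO would be replaced by `with open('test234.txt') as f:` and `section_lines(f)` consumed inside that block, so nothing after the closing header is read into the loop.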