
Suppose I have a text file like

asdfsa
fasdf
asdf
-1
2412
asdf
fddsfw
efe
st
-1
ghhgg

I need an efficient way to export the entire chunk between 2412 and -1, perhaps as a dataframe that I can do transformations on later. Notice that immediately before 2412 there is also a -1 that I need to control for, so the trigger to copy always begins with -1 followed immediately by 2412, and ends with another -1. I was building the dataframe like this:

# Build df while looping through the text file
import pandas as pd

df = pd.DataFrame(columns=['ID', 'string'])
i = 1
with open(PATH, 'rt') as file:
    lines = file.readlines()
    for index, line in enumerate(lines):
        if line.strip('\r\n').strip(' ') == '-1':
            if lines[index + 1].strip('\r\n').strip(' ') == '2412':
                while lines[index + i + 1].strip('\r\n').strip(' ') != '-1':
                    transformed_strings = do_transforms_with_multiple_lines(lines)  # some transformation function on lines
                    df = df.append(transformed_strings)  # append transforms here to df
                    i = i + 1  # go to next line
                break  # break out of original for loop when next -1 is reached

You can see I'm trying to build a dataframe by looping line by line once I see -1, 2412, then stopping at the next -1. This works quickly for small files, but for larger ones it is much too slow. I'm hoping I can export the whole chunk between 2412 and -1 somehow, then apply pd.DataFrame() and my transformations afterwards to speed things up. I found this post here but it doesn't seem to get me what I want. Exporting simply as a txt file would also be fine; I could pull the txt file in later with pd and do my transforms, so appending to a df is not necessary.

Something like

df = pd.DataFrame(columns=['ID', 'string'])
i = 1
with open(PATH, 'rt') as file:
    lines = file.readlines()
    for index, line in enumerate(lines):
        if line.strip('\r\n').strip(' ') == '-1':
            if lines[index + 1].strip('\r\n').strip(' ') == '2412':
                while lines[index + i + 1].strip('\r\n').strip(' ') != '-1':
                    write_line_to_txt_file()  # OR
                    df = df.append(line)
                    i = i + 1  # go to next line
                break  # break out of original for loop when next -1 is reached

would also be a solution. Thanks for the help!
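For reference, the extract-then-transform idea could be sketched roughly like this: strip all the lines once, find the -1/2412 marker pair, and take the whole chunk with a single slice before handing it to pd.DataFrame(). The sample data and the single-column frame are illustrative assumptions; the real ID/transform logic would come afterwards.

```python
import pandas as pd

def extract_chunk(lines):
    """Return the lines between the '-1', '2412' marker pair and the next '-1'."""
    cleaned = [line.strip() for line in lines]
    for index in range(len(cleaned) - 1):
        if cleaned[index] == '-1' and cleaned[index + 1] == '2412':
            start = index + 2
            try:
                end = cleaned.index('-1', start)  # next '-1' after the markers
            except ValueError:
                end = len(cleaned)  # no closing '-1': take the rest of the file
            return cleaned[start:end]
    return []

# Sample data from the question
lines = ['asdfsa', 'fasdf', 'asdf', '-1', '2412', 'asdf',
         'fddsfw', 'efe', 'st', '-1', 'ghhgg']
chunk = extract_chunk(lines)
df = pd.DataFrame({'string': chunk})
```

This does one pass to find the markers and one slice to grab the chunk, so there is no per-line `df.append` inside the loop.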

jim
  • you could use `read(size)` to read some part of the file and check if there is `\n-1\n` - using `find()`. If there is no `\n-1\n` then `read(size)` another part, append to the previous, and check again. And when you find `\n-1\n` then you `slice using index` or `split()` on `\n-1\n` - to get everything before `\n-1\n` and keep the rest in a variable for reading the next part. This method is common in reading data from sockets - you might call it `"reading with buffer"` – furas Sep 30 '22 at 12:18

1 Answer


This seems to work but maybe there's something more efficient. The idea is to locate the beginning of the sequence -1, 2412 with a for loop, then write the lines until you reach the next -1. This sequence only occurs once so we can break out of the original for loop.

with open(PATH, 'rt') as file, open(EXP_PATH + 'test.txt', 'w') as outfile:
    i = 1
    lines = file.readlines()
    for index, line in enumerate(lines):
        if line.strip('\r\n').strip(' ') == '-1':
            if lines[index + 1].strip('\r\n').strip(' ') == '2412':
                while lines[index + i + 1].strip('\r\n').strip(' ') != '-1':
                    outfile.write(lines[index + i + 1])
                    i = i + 1
                break
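A possible refinement of the above, untested against large files: once the marker pair is found, take the whole chunk with one list slice and a single `writelines()` call instead of testing and writing line by line. `dump_chunk` is a hypothetical helper name.

```python
import io

def dump_chunk(lines, outfile):
    """Write the lines between the '-1'/'2412' markers and the
    next '-1' to outfile in one slice, instead of one at a time."""
    stripped = [line.strip() for line in lines]
    for index in range(len(stripped) - 1):
        if stripped[index] == '-1' and stripped[index + 1] == '2412':
            start = index + 2
            try:
                end = stripped.index('-1', start)  # next closing '-1'
            except ValueError:
                end = len(stripped)  # no closing '-1': write the rest
            outfile.writelines(lines[start:end])
            return

# Usage with the sample data from the question
sample = ['asdfsa\n', 'fasdf\n', 'asdf\n', '-1\n', '2412\n', 'asdf\n',
          'fddsfw\n', 'efe\n', 'st\n', '-1\n', 'ghhgg\n']
buf = io.StringIO()
dump_chunk(sample, buf)
```

This also avoids the index-out-of-range risk of the `while` loop when the closing -1 is missing, since `index()` falls back to the end of the list.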
jim