Extract Values between two strings in a text file

Question

Let's say I have a text file with the content below:

fdsjhgjhg
fdshkjhk
 Start
     Good Morning
     Hello World
 End
dashjkhjk
dsfjkhk
Start
  hgjkkl
  dfghjjk
  fghjjj
Start
   Good Evening
   Good 
End

I wrote the following code:

infile = open('test.txt','r')
outfile= open('testt.txt','w')
copy = False
for line in infile:
    if line.strip() == "Start":
        copy = True
    elif line.strip() == "End":
        copy = False
    elif copy:
        outfile.write(line)

I have this result in outfile:

     Good Morning
     Hello World
     hgjkkl
     dfghjjk
     fghjjj
     Good Evening
     Good

My problem is that I only want the data between a Start and the next End, not the data between two consecutive Start lines or two consecutive End lines.

You should try to use a buffer variable to store what is encountered after "Start" until you meet "End" then write it to your file. — Jacques Gaudin, Apr 11 '16 at 21:20

Adib · Accepted Answer · 2016-04-11T21:46:58.760

Great problem! This is a bucket problem where each start needs an end.

You got that result because the file contains two consecutive 'Start' lines, so copying stays switched on through the block that never reaches an 'End'.

It's best to store the lines somewhere until 'End' is reached, and only then write them out.

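with open('test.txt', 'r') as infile, open('testt.txt', 'w') as outfile:
    copy = False
    bucket = []
    for line in infile:

        if line.strip() == "Start":
            # a new Start throws away anything buffered from an unfinished block
            bucket = []
            copy = True

        elif line.strip() == "End":
            # only a complete Start ... End pair gets written out
            for strings in bucket:
                outfile.write(strings + '\n')
            bucket = []  # a stray second End then writes nothing
            copy = False

        elif copy:
            bucket.append(line.strip())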
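
The same bucket idea can also be packaged as a small generator, which makes it easy to try out without touching the file system. This is only a sketch: the name `extract_blocks` and the in-memory `sample` list are made up for illustration, and, as in the code above, the indentation of each captured line is stripped.

```python
def extract_blocks(lines, start="Start", end="End"):
    """Yield one list of lines for each Start ... End pair.

    A repeated Start discards whatever was buffered so far, and an End
    without a preceding Start is ignored. (Hypothetical helper, not part
    of the original answer.)
    """
    bucket = None  # None means "not inside a block"
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            bucket = []       # (re)start buffering, dropping any unfinished block
        elif stripped == end:
            if bucket is not None:
                yield bucket  # the block is complete, hand it over
            bucket = None
        elif bucket is not None:
            bucket.append(stripped)


# Shortened version of the sample file, with an extra stray End at the bottom
sample = [
    "fdsjhgjhg", " Start", "     Good Morning", "     Hello World", " End",
    "dsfjkhk", "Start", "  hgjkkl", "Start", "   Good Evening", "   Good ", "End",
    "End",
]
blocks = list(extract_blocks(sample))
for block in blocks:
    print("\n".join(block))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `sample`.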

score 0 · Answer 2 · answered Apr 11 '16 at 21:21

You could keep a temporary list of lines, and only commit them after you know that a section meets your criteria. Maybe try something like the following:

with open('test.txt', 'r') as infile, open('testt.txt', 'w') as outfile:
    copy = False
    tmpLines = []
    for line in infile:
        if line.strip() == "Start":
            copy = True
            tmpLines = []  # drop lines from an unfinished section
        elif line.strip() == "End":
            copy = False
            for tmpLine in tmpLines:
                outfile.write(tmpLine)
            tmpLines = []  # so a stray second End writes nothing
        elif copy:
            tmpLines.append(line)

This gives the output:

     Good Morning
     Hello World
   Good Evening
   Good

score 0 · Answer 3 · answered Apr 11 '16 at 21:28

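Here's a hacky but perhaps more intuitive way using regex. It finds the text between each "Start" and "End" pair (restarting at the most recent "Start" if one is repeated) and prints it with the surrounding whitespace trimmed.

import re

with open('test.txt', 'r') as infile:
    text = infile.read()

# re.DOTALL lets '.' span newlines; the lookahead stops a match from
# running across a second "Start", so only the last Start before an End counts
matches = re.findall(r'Start((?:(?!Start).)*?)End', text, flags=re.DOTALL)
for m in matches:
    # the capture group already excludes the Start and End markers
    print(m.strip())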

score 0 · Answer 4 · answered Apr 11 '16 at 21:45

You can do this with regular expressions. The pattern below excludes rogue Start and End lines:

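import re

with open('test.txt', 'r') as f:
    txt = f.read()

# re.M makes ^ and $ match at every line; the lookahead rejects any line that
# is itself a (possibly indented) Start, so an unfinished block is skipped
matches = re.findall(
    r'^[ \t]*Start[ \t]*\n((?:(?![ \t]*Start[ \t]*$).*\n)*?)^[ \t]*End[ \t]*$',
    txt, flags=re.M)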

answered Apr 11 '16 at 21:45

Brendan Abel

`Start\s*((?:(?!Start).*$\s)+?)\s*End` will be more efficient. – James Buck Apr 11 '16 at 21:56
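
To try the pattern from the comment above, something like the following sketch may help. The pattern is copied verbatim from the comment; the `re.MULTILINE` flag is an assumption added here (without it `$` cannot match at the end of each line), and `sample_text` is a shortened, made-up version of the file from the question.

```python
import re

# Pattern quoted from the comment; re.MULTILINE is an assumption added here
# so that `$` can match at the end of every line.
pattern = re.compile(r"Start\s*((?:(?!Start).*$\s)+?)\s*End", flags=re.MULTILINE)

sample_text = (
    "fdshkjhk\n"
    " Start\n"
    "     Good Morning\n"
    "     Hello World\n"
    " End\n"
    "dsfjkhk\n"
    "Start\n"
    "  hgjkkl\n"
    "Start\n"
    "   Good Evening\n"
    "   Good \n"
    "End\n"
)

# findall returns only the captured group, i.e. the lines inside each pair
blocks = pattern.findall(sample_text)
for block in blocks:
    print(block.rstrip())
```

Note that the `\s*` right after `Start` swallows the indentation of the first line in each block, so only the later lines keep their leading spaces.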

Felk · Answer 5 · 2016-04-11T21:36:19.330

If you don't expect to get nested structures, you could do this:

import re

with open('test.txt') as f:
    text = f.read()

# match everything between "Start" and "End"
occurrences = re.findall(r"Start(.*?)End", text, re.DOTALL)
# discard text before a duplicated "Start"
occurrences = [oc.rsplit("Start", 1)[-1] for oc in occurrences]
# optionally trim surrounding newlines
occurrences = [oc.strip("\n") for oc in occurrences]

This prints:

>>> for oc in occurrences: print(oc)
     Good Morning
     Hello World
   Good Evening
   Good

You can also make the \n part of the Start and End patterns if you want.
