2

I'm using Python 3.6 on a Windows machine. I'm reading in a text file using with open() and readlines(). After reading in the lines, I want to write them to a new text file but exclude certain ranges of lines. I do not know the line numbers of the lines to exclude. The text files are massive, and the ranges of lines to exclude vary among the files I'm reading. There are known keywords I can search for to find the start and end of each range to exclude.

I've searched everywhere online but I can't seem to find an elegant solution that works. The following is an example of what I'm trying to achieve.

a  
b  
BEGIN  
c  
d  
e  
END  
f  
g  
h  
i  
j  
BEGIN  
k  
l  
m  
n  
o  
p  
q  
END  
r  
s  
t  
u  
v  
BEGIN  
w  
x  
y  
END  
z 

In summary, I want to read the above into Python and then write it to a new file, excluding every line from each BEGIN keyword through the matching END keyword (both keyword lines included).

The new file should contain the following:

a  
b  
f  
g  
h  
i  
j  
r  
s  
t  
u  
v  
z  
Ashish Ranjan
probat

3 Answers

1

You can use the following regex to achieve this:

regex = r"(\bBEGIN\b([\w\n]*?)\bEND\b\n)"

Match with the regex above and replace each match with an empty string ('').

Here's a working example in Python:

CODE

import re

result = re.sub(regex, '', test_str)  # test_str is your file's content
print(result)

Output:
a
b
f
g
h
i
j
r
s
t
u
v
z
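
Note that the character class [\w\n] only covers word characters and newlines, so a block whose lines contain spaces or punctuation would not be matched. A minimal sketch of a more permissive variant follows, assuming the keywords always appear at the start of a line; the names BLOCK_RE and strip_blocks are illustrative and not part of the answer above.

```python
import re

# Assumption: BEGIN and END each start a line; anything may appear between them.
# DOTALL lets "." cross newlines, MULTILINE lets "^" match at each line start.
BLOCK_RE = re.compile(r"^BEGIN\b.*?^END\b[^\n]*\n?", re.DOTALL | re.MULTILINE)


def strip_blocks(text):
    """Return text with every BEGIN...END block (keyword lines included) removed."""
    return BLOCK_RE.sub("", text)


sample = "a\nb\nBEGIN\nc d\ne-f\nEND\ng\nBEGIN\nh\nEND\nz\n"
print(strip_blocks(sample))
```

Like the original regex, this still needs the whole file in memory as one string, so it suits small files better than massive ones.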
Ashish Ranjan
1

If the text files are massive, as you say, you'll want to avoid readlines(), since it loads the entire file into memory. Instead, read line by line and use a state variable to track whether you're inside a block where output should be suppressed. Something like this:

import re

begin_re = re.compile("^BEGIN.*$")
end_re = re.compile("^END.*$")
should_write = True

with open("input.txt") as input_fh:
    with open("output.txt", "w", encoding="UTF-8") as output_fh:
        for line in input_fh:
            # Strip off whitespace: we'll add our own newline
            # in the print statement
            line = line.strip()

            if begin_re.match(line):
                should_write = False
            if should_write:
                print(line, file=output_fh)
            if end_re.match(line):
                should_write = True
Rob Hansen
  • I ended up using this. I don't need to use a regular expression in my particular situation so I'm not going to use the re module. Also, I changed 'print(line, file=output_fh)' to output_fh.write(line) since the print statement raised the following warning: Expected type 'Optional[IO[str]]', got 'TextIOWrapper[str]' instead. Thank you everyone for your support! – probat Oct 29 '17 at 17:25
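
For reference, the regex-free variant described in the comment above might look roughly like the sketch below. It assumes the keywords sit at the very start of a line; the names drop_blocks and filter_file are hypothetical, and lines are written back unchanged, so their original newlines are kept.

```python
def drop_blocks(lines, begin="BEGIN", end="END"):
    """Yield only the lines that fall outside every begin...end block."""
    skipping = False
    for line in lines:
        if line.startswith(begin):
            skipping = True        # the BEGIN line itself is dropped
        elif skipping and line.startswith(end):
            skipping = False       # the END line itself is dropped too
        elif not skipping:
            yield line             # untouched, trailing newline included


def filter_file(src_path, dst_path):
    """Stream src_path to dst_path line by line, omitting the blocks."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_blocks(src))
```

Because drop_blocks is a generator fed by the file iterator, only one line is held in memory at a time. As with the ^BEGIN.*$ pattern, startswith also matches longer words such as BEGINNING, so a stricter comparison may be needed for real data.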
0

Have you tried something like this:

with open("<readfile>") as read_file:
    with open("<savefile>", "w") as write_file:
        currently_skipping = False
        for line in read_file:
            # Compare without the trailing newline and whitespace
            keyword = line.strip()
            if keyword == "BEGIN":
                currently_skipping = True
            elif keyword == "END":
                currently_skipping = False
                continue  # don't write the END line itself

            if currently_skipping:
                continue

            write_file.write(line)

That should do what you need. The key point is not to read everything into memory via readlines() but to process the file line by line, which also uses far less memory.

actionjezus6