How to extract text between two substrings from a Python file

Question

I want to read the text between two substrings (“#*” and “#@”) from a file. The file contains thousands of records in the format shown below. I tried the code below, but it does not return the required output.

import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in myfile:
    line = line.rstrip()
    print(line[line.find(start)+len(start):line.rfind(end)])
myfile.close()

My Input:

#*OQL[C++]: Extending C++ with an Object Query Capability

#@José A. Blakeley

#t1995

#cModern Database Systems

#index0

#*Transaction Management in Multidatabase Systems

#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz

#t1995

#cModern Database Systems

#index1

My Output:

51103
OQL[C++]: Extending C++ with an Object Query Capability

t199
cModern Database System
index
...

Expected output:

OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems

Could you perhaps highlight your input/output/expected output? — norok2, Jul 22 '19 at 10:19
Remove your `for` cycle and add `contents = myfile.read()` and then `print(re.findall(r'#\*(.*?)#@', contents, re.S))` — Wiktor Stribiżew, Jul 22 '19 at 10:24
Do you mean like this? `^#\*(.*)(?:\r?\n){2}#@` https://regex101.com/r/5ouxbw/1 — The fourth bird, Jul 22 '19 at 10:24
@Wiktor Stribiżew thank you for help. This code is returning all data between these two strings from a file. I basically want all this one by one. For example, I want to read the first title between "#*" and "@". Then next title between "#*" and "@" and so on. — BiSarfraz, Jul 22 '19 at 10:36
So, no problem: `for match in re.findall(r'#\*(.*?)#@', contents, re.S): // do something with the match` — Wiktor Stribiżew, Jul 22 '19 at 11:17

score 2 · Answer 1 · answered Jul 22 '19 at 10:24

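Use the following regex:

#\*([\s\S]*?)#@

The group ([\s\S]*?) lazily captures any characters, both whitespace (including line breaks) and non-whitespace, between #* and #@. The /g (global) flag used in the demo has no Python equivalent; call re.findall or re.finditer to get every match.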
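As a minimal sketch of how this pattern could be applied in Python: the sample string below is an assumption that mirrors the layout of the question's input, and the variable names are illustrative only.

```python
import re

# Sample text laid out like the question's input (assumed: records are
# separated by blank lines, as shown in the question).
sample = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n\n"
    "#@José A. Blakeley\n\n"
    "#t1995\n\n"
    "#*Transaction Management in Multidatabase Systems\n\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
)

# [\s\S] matches any character, line breaks included, so no flag is needed.
pattern = re.compile(r"#\*([\s\S]*?)#@")

# findall returns only the captured group; strip() drops the trailing newlines.
titles = [chunk.strip() for chunk in pattern.findall(sample)]
print(titles)
```

Because the quantifier is lazy, each match stops at the first #@ after a #*, so consecutive records do not merge into a single match.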

CinCout

score 2 · Accepted Answer · answered Jul 22 '19 at 11:34

You are reading the file line by line, but your matches span multiple lines. Read the whole file into a string and process it with a regex that can match any characters across line breaks:

import re
start = '#*'
end = '#@'
# Escape special chars and build the pattern dynamically; the group captures only the text in between
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))
with open('lorem.txt') as myfile:
    contents = myfile.read()                     # Read the whole file into a variable
for match in re.findall(rx, contents, re.S):     # re.S makes . match line breaks, too
    print(match.strip())                         # Process each match individually

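To handle the matches one by one, as asked in the comments, the same idea can be wrapped in a generator. This is only a sketch under stated assumptions: iter_between is a hypothetical helper name, the file is assumed to be UTF-8, and a temporary file stands in for lorem.txt so the example runs on its own.

```python
import os
import re
import tempfile


def iter_between(path, start="#*", end="#@", encoding="utf-8"):
    """Yield each stripped chunk of text found between start and end.

    Hypothetical helper, not part of the original answer.
    """
    rx = re.compile(
        "{}(.*?){}".format(re.escape(start), re.escape(end)),
        re.S,  # let . match line breaks as well
    )
    with open(path, encoding=encoding) as fh:
        contents = fh.read()  # whole file in memory, as in the answer
    for m in rx.finditer(contents):
        yield m.group(1).strip()


# Demo data shaped like the question's input (stand-in for lorem.txt).
records = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n\n"
    "#@José A. Blakeley\n\n#t1995\n\n#cModern Database Systems\n\n#index0\n\n"
    "#*Transaction Management in Multidatabase Systems\n\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n\n"
    "#t1995\n\n#cModern Database Systems\n\n#index1\n"
)

with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(records)
    tmp_path = tmp.name

try:
    found = list(iter_between(tmp_path))
finally:
    os.remove(tmp_path)  # clean up the temporary file

for title in found:
    print(title)
```

re.finditer yields match objects lazily, so each title can be processed as it is found without first building the full list that re.findall returns.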
