I want to read the text between two markers (“#*” and “#@”) in a file. The file contains thousands of records in the format shown below. I tried the code below, but it does not return the required output.

import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in myfile:
    line = line.rstrip()
    print (line[line.find(start)+len(start):line.rfind(end)])
myfile.close()

My Input:

#*OQL[C++]: Extending C++ with an Object Query Capability

#@José A. Blakeley

#t1995

#cModern Database Systems

#index0

#*Transaction Management in Multidatabase Systems

#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz

#t1995

#cModern Database Systems

#index1

My Output:

51103
OQL[C++]: Extending C++ with an Object Query Capability

t199
cModern Database System
index
...

Expected output:

OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems

BiSarfraz

2 Answers

Use the following regex:

#\*([\s\S]*?)#@ /g

This regex lazily captures every character, whitespace (including line breaks) and non-whitespace alike, between #* and the next #@.

Demo
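
A minimal Python sketch of how this pattern could be applied, assuming the records are already held in a string. The `sample` text and the `extract_between` helper name are illustrative and not part of the answer above. The trailing /g is JavaScript-style notation for "all matches"; in Python, `re.findall` already returns every match.

```python
import re

# Lazy group captures everything (including newlines) between "#*" and "#@".
PATTERN = re.compile(r"#\*([\s\S]*?)#@")

def extract_between(text):
    """Return the stripped text found between each '#*' and the next '#@'."""
    return [m.strip() for m in PATTERN.findall(text)]

# Illustrative sample in the format shown in the question.
sample = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "#@José A. Blakeley\n"
    "#t1995\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
)

for title in extract_between(sample):
    print(title)
```

The `strip()` call drops the line break that sits between each title and the following #@ marker.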

CinCout

You are reading the file line by line, but your matches span multiple lines. You need to read the whole file into memory and process it with a regex that can match any characters across line breaks:

import re
start = '#*'
end = '#@'
# Escape special chars and build the pattern dynamically; the group captures the text between the markers
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))
with open('lorem.txt') as myfile:
    contents = myfile.read()                     # Read the whole file into a variable
    for match in re.findall(rx, contents, re.S): # Note re.S will make . match line breaks, too
        print(match.strip())                     # Process each match individually

See the regex demo.
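
If every title sits on a single line that begins with #*, as in the sample input, a line-by-line pass is another option that avoids loading the whole file. This is an alternative to the regex approach, not the method above, and it would miss titles that wrap onto a second line. The `iter_titles` helper is a hypothetical name, and `io.StringIO` stands in for the real file.

```python
import io

def iter_titles(lines, start="#*"):
    """Yield the text after `start` for every line that begins with it."""
    for line in lines:
        line = line.strip()
        if line.startswith(start):
            yield line[len(start):]

# io.StringIO stands in for an open file such as open('lorem.txt').
fake_file = io.StringIO(
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "\n"
    "#@José A. Blakeley\n"
    "\n"
    "#t1995\n"
    "\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
)

titles = list(iter_titles(fake_file))
for title in titles:
    print(title)
```

Because the generator consumes one line at a time, the same function can be passed a real file object without reading thousands of records into memory at once.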

Wiktor Stribiżew