Let's say I have an input file like this:

#Backup TOC
boot.tar.gz    /boot/

#Filesystems
/boot               /dev/mapper/VolGroup-lv_root xfs

#Devices
/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0-part1 PHY /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0

#UnhandledFS
/var/
/var/log
/var/log/audit
/var/tmp

I want to extract the content under every #header (the last one, #UnhandledFS, can be ignored). Once it is extracted, I have to check whether each section contains any entries.

This is the code I use to extract the content between two #headers. However, it does not match every section:

import re
lines = open("./input").readlines()
re.compile(r'#\w+(.*?)#\w+', re.DOTALL | re.M).findall(''.join(lines))
petezurich
Ibrahim Quraish

1 Answer

The problem with your regex is that it consumes the "end" #header, so the search for the next match starts after it. That causes it to skip #Filesystems and breaks your matches.

What you need is called a "lookahead": a way to match a pattern without consuming it.
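To make the skipped section visible, here is a small sketch that runs the pattern from the question on an inline copy of the sample. The `SAMPLE` string and the shortened device path are assumptions made for illustration; the real input would come from the file.

```python
import re

# Inline copy of the sample input (device path shortened for illustration).
SAMPLE = """\
#Backup TOC
boot.tar.gz    /boot/

#Filesystems
/boot               /dev/mapper/VolGroup-lv_root xfs

#Devices
/dev/disk/by-path/example-part1 PHY /dev/disk/by-path/example

#UnhandledFS
/var/
/var/log
"""

# Pattern from the question: the trailing #\w+ is consumed by each match,
# so the header that ends one match cannot start the next one.
consuming = re.compile(r'#\w+(.*?)#\w+', re.DOTALL)
matches = consuming.findall(SAMPLE)

for body in matches:
    print(repr(body))
```

Only the Backup and Devices bodies come back, because #Filesystems and #UnhandledFS were used up as the "end" markers.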

Here is a regex that will work for you:

re.compile(r'#[^\n]*\n([^#]*)(?=#)', re.DOTALL | re.M).findall(''.join(lines))
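As a rough sketch of how the lookahead pattern above could also answer the "is there any entry" part of the question: the version below adds a second capture group for the header text, which is my own addition, and the `sections` helper and `SAMPLE` string are hypothetical names. The #Devices section is left empty in this sample on purpose, to show the check.

```python
import re

# Sample input with an intentionally empty #Devices section (assumption).
SAMPLE = """\
#Backup TOC
boot.tar.gz    /boot/

#Filesystems
/boot               /dev/mapper/VolGroup-lv_root xfs

#Devices

#UnhandledFS
/var/
/var/log
"""

# Same lookahead pattern, plus a capture group for the header text.
SECTION = re.compile(r'#([^\n]*)\n([^#]*)(?=#)')

def sections(text):
    """Map each header to its non-blank lines; the final section is skipped
    because no '#' follows it, so the lookahead cannot succeed there."""
    result = {}
    for header, body in SECTION.findall(text):
        result[header.strip()] = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return result

parsed = sections(SAMPLE)
for header, entries in parsed.items():
    print(header, "->", "has entries" if entries else "empty")
```

Note that this approach assumes no entry line contains a `#` character, since `[^#]*` stops at the first one.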

It also fixes a problem with headers that contain a space, like the first header in your example. Since #\w+ stops at the space, it matches only #Backup, and the word TOC ends up as part of the captured content.

But if you want minimal changes to your regex, this will work too (except for the TOC part):

re.compile(r'#\w+(.*?)(?=#\w+)', re.DOTALL | re.M).findall(''.join(lines))
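A short sketch of what "except the TOC part" means in practice, using a trimmed, hypothetical `SAMPLE` string: the minimal fix finds every section before the last one, but the first captured body still begins with the rest of the header line.

```python
import re

# Trimmed sample (assumption): two sections plus the final one to ignore.
SAMPLE = """\
#Backup TOC
boot.tar.gz    /boot/

#Filesystems
/boot               /dev/mapper/VolGroup-lv_root xfs

#UnhandledFS
/var/
"""

# Minimal fix: the end header is only looked at, not consumed.
minimal = re.compile(r'#\w+(.*?)(?=#\w+)', re.DOTALL)
bodies = minimal.findall(SAMPLE)

# \w+ stops at the space in "#Backup TOC", so " TOC" leaks into the body.
print(repr(bodies[0]))
```

If that leftover matters, dropping the first line of each body or using the `[^\n]*` header pattern avoids it.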
Lev M.