Find a pattern with a given start and end tag that may contain any number of newlines

Question

Below is the block that I want to extract from a file, i.e. starting with <XYZ> and ending with </XYZ>, but there may be any number of newlines inside it:

<XYZ>
<beta1>aaaaa</beta1>
<beta>aaaaa</beta>
<beta0>aaaaa</beta0>
<identity>key01_adent</identity>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
</XYZ>

import re

with open('D:\\pyth_project\\policy.xml', 'r') as f:
    read_object = f.read()
print(re.findall("<XYZ>\n+.*\n</XYZ>", read_object))

Use a XML parser like [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) — gogaz, Jul 05 '19 at 11:52

score 1 · Answer 1 · answered Jul 05 '19 at 12:19

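You shouldn't use regular expressions for XML-like files. You can use lxml instead.

from lxml import etree

tree = etree.parse('D:\\pyth_project\\policy.xml')
# XML tag names are case-sensitive, so search for 'XYZ', not 'xyz'.
xyzs = tree.findall('.//XYZ')  # find all XYZ tags recursively

for xyz in xyzs:
    # encoding='unicode' returns a str instead of bytes
    print(etree.tostring(xyz, encoding='unicode'))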

See "How to find recursively for a tag of XML using LXML?" for more information.

score 0 · Accepted Answer · answered Jul 05 '19 at 12:29

As said in other answers, if you are dealing with XML syntax there are better solutions than a simple regex.

But if you really want to use regex, this is how you can do it:

import re

with open('yourfile', 'r') as f:
    read_object = f.read()
print(re.findall(r"<XYZ>.*?</XYZ>", read_object, flags=re.DOTALL))

The re.DOTALL flag allows the . special character to also match newlines (by default, it matches every character except a newline).
The *? is the non-greedy version of *, matching as few characters as possible. So if you have multiple <XYZ>...</XYZ> tags each one will be a separate match.
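
A minimal sketch of how the flag and the non-greedy quantifier interact. The sample string is made up for illustration and is not the asker's file:

```python
import re

# Hypothetical sample: two separate <XYZ> blocks, each spanning several lines.
sample = "<XYZ>\n<beta>a</beta>\n</XYZ>\n<other/>\n<XYZ>\n<beta>b</beta>\n</XYZ>\n"

# Without re.DOTALL, "." cannot cross a newline, so nothing matches.
no_flag = re.findall(r"<XYZ>.*?</XYZ>", sample)

# Non-greedy with re.DOTALL: one match per block.
lazy = re.findall(r"<XYZ>.*?</XYZ>", sample, flags=re.DOTALL)

# Greedy with re.DOTALL: one match from the first <XYZ> to the last </XYZ>.
greedy = re.findall(r"<XYZ>.*</XYZ>", sample, flags=re.DOTALL)

print(len(no_flag), len(lazy), len(greedy))  # 0 2 1
```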

The assumption here is that you don't have nested <XYZ>...</XYZ> tags. If you have nested tags, it is better to use lxml as in @blueteeth's answer.
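
To see why nesting matters, here is a small sketch on an invented nested input. It uses the standard-library xml.etree.ElementTree rather than lxml so that it runs without extra installs; the lxml calls in the other answer behave similarly for this case.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input with one <XYZ> nested inside another.
nested = "<root><XYZ><a>1</a><XYZ><b>2</b></XYZ><c>3</c></XYZ></root>"

# The non-greedy regex stops at the first </XYZ>, which belongs to the inner
# element, so the outer block is cut short.
regex_hits = re.findall(r"<XYZ>.*?</XYZ>", nested, flags=re.DOTALL)

# A parser tracks open/close pairs and returns both elements intact.
root = ET.fromstring(nested)
parsed = [ET.tostring(el, encoding="unicode") for el in root.findall(".//XYZ")]

print(regex_hits)
print(parsed)
```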

If so, please consider to accept the answer, thanks! https://stackoverflow.com/help/someone-answers — Valentino, Jul 10 '19 at 17:36

Vinod Srivastav · Answer 3 · 2019-07-05T12:53:11.430

The following sample shows how to read the key01_adent value, where stuff is an imaginary XML document:

import xml.etree.ElementTree as ET

# Named xml_text to avoid shadowing the built-in input().
xml_text = '''
<stuff>
    <XYZ>
      <beta1>aaaaa</beta1>
      <beta>aaaaa</beta>
      <beta0>aaaaa</beta0>
      <identity>key01_adent</identity>
      <beta>aaaaa</beta>
      <beta>aaaaa</beta>
      <beta>aaaaa</beta>
    </XYZ>
</stuff>'''

stuff = ET.fromstring(xml_text)
lst = stuff.findall('./XYZ')  # direct XYZ children of <stuff>
print('count:', len(lst))

for item in lst:
    print('identity = {}'.format(item.find('identity').text))

An item can contain any number of child elements; I expect the identity tag to be unique within each item.
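
If some items might lack an identity tag, item.find('identity') returns None and reading .text raises AttributeError. A hedged variation using findtext with a default value; the second XYZ block and the '(missing)' placeholder are assumptions added for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second <XYZ> has no <identity> child.
xml_text = """<stuff>
  <XYZ><identity>key01_adent</identity><beta>aaaaa</beta></XYZ>
  <XYZ><beta>aaaaa</beta></XYZ>
</stuff>"""

stuff = ET.fromstring(xml_text)

# findtext returns the text of the first matching child, or the default
# when no such child exists.
identities = [
    item.findtext("identity", default="(missing)")
    for item in stuff.findall("./XYZ")
]
print(identities)  # ['key01_adent', '(missing)']
```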

