0

Below is the block that I want to extract from a file. It starts with <XYZ> and ends with </XYZ>, but there may be any number of newlines inside it.

<XYZ>
<beta1>aaaaa</beta1>
<beta>aaaaa</beta>
<beta0>aaaaa</beta0>
<identity>key01_adent</identity>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
<beta>aaaaa</beta>
</XYZ>
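import re

f=open('D:\\pyth_project\\policy.xml', 'r')
read_object=f.read()
f.close()
# "." does not match newlines by default, so this only matches a single line between the tags
print(re.findall("<XYZ>\n+.*\n</XYZ>",read_object))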
Vinod Srivastav
    Use an XML parser like [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) – gogaz Jul 05 '19 at 11:52

3 Answers

1

You shouldn't use regular expressions for XML-like files. You can use lxml instead.

from lxml import etree

root = etree.parse('D:\\pyth_project\\policy.xml')
xyzs = root.findall('.//XYZ') # find all XYZ tags recursively (tag names are case-sensitive).

for xyz in xyzs:
    print(etree.tostring(xyz, encoding='unicode'))  # text rather than bytes

See "How to find recursively for a tag of XML using LXML?" for more information.
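If lxml is not installed, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration: the inline xml_text and its <policy> root element are assumptions standing in for the real policy.xml, whose full structure is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of policy.xml; <policy> is an assumed root.
xml_text = """<policy>
<XYZ>
<beta1>aaaaa</beta1>
<identity>key01_adent</identity>
</XYZ>
</policy>"""

root = ET.fromstring(xml_text)

# Tag names are case-sensitive, so search for "XYZ", not "xyz".
# iter() walks the whole tree, so nested depth does not matter.
blocks = [
    ET.tostring(xyz, encoding="unicode").strip()  # strip() drops trailing whitespace after </XYZ>
    for xyz in root.iter("XYZ")
]

for block in blocks:
    print(block)
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring(xml_text).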

blueteeth
0

As said in the other answers, if you are dealing with XML syntax there are better solutions than a plain regex.

But if you really want to use regex, this is how you can do it:

import re

# The with block closes the file automatically
with open('yourfile', 'r') as f:
    read_object = f.read()
print(re.findall(r"<XYZ>.*?</XYZ>", read_object, flags=re.DOTALL))

The re.DOTALL flag makes the . special character match newlines as well (by default it matches any character except a newline).
The *? is the non-greedy version of *, matching as few characters as possible. So if you have multiple <XYZ>...</XYZ> blocks, each one will be a separate match.

The assumption here is that you don't have nested <XYZ>...</XYZ> tags. If you do have nested tags, it is better to use lxml as in @blueteeth's answer.
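To see what the flag and the non-greedy quantifier each contribute, here is a small sketch on an inline string. The sample text (two blocks with an <other> element between them, and the key02_adent value) is made up for illustration.

```python
import re

# Hypothetical sample text with two separate <XYZ> blocks.
text = """<XYZ>
<identity>key01_adent</identity>
</XYZ>
<other>ignored</other>
<XYZ>
<identity>key02_adent</identity>
</XYZ>"""

# Without re.DOTALL, "." cannot cross a newline, so nothing matches.
no_flag = re.findall(r"<XYZ>.*?</XYZ>", text)

# Non-greedy with re.DOTALL: one match per block.
lazy = re.findall(r"<XYZ>.*?</XYZ>", text, flags=re.DOTALL)

# Greedy with re.DOTALL: a single match running from the first <XYZ>
# to the last </XYZ>, swallowing everything in between.
greedy = re.findall(r"<XYZ>.*</XYZ>", text, flags=re.DOTALL)

print(len(no_flag), len(lazy), len(greedy))  # 0 2 1
```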

Valentino
0

The following sample shows how to read the key01_adent value, where stuff is an imaginary root element wrapping the XML document:

import xml.etree.ElementTree as ET

# Named xml_text rather than input to avoid shadowing the built-in input()
xml_text = '''
<stuff>
    <XYZ>
      <beta1>aaaaa</beta1>
      <beta>aaaaa</beta>
      <beta0>aaaaa</beta0>
      <identity>key01_adent</identity>
      <beta>aaaaa</beta>
      <beta>aaaaa</beta>
      <beta>aaaaa</beta>
    </XYZ>
</stuff>'''

stuff = ET.fromstring(xml_text)
lst = stuff.findall('./XYZ')  # direct <XYZ> children of the root
print('count:', len(lst))

for item in lst:
    print('identity = {}'.format(item.find('identity').text))

Each XYZ item can have any number of child elements in it. I assume the tag you look up (identity here) occurs only once per item, because find returns only the first match.
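When a tag does repeat, as beta does in the question, a short sketch of the difference between find and findall may help. The element texts first and second and the gamma tag are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical item with a repeated <beta> tag.
item = ET.fromstring("""<XYZ>
  <beta>first</beta>
  <identity>key01_adent</identity>
  <beta>second</beta>
</XYZ>""")

# find() returns only the first matching child.
first_beta = item.find('beta').text

# findall() returns every matching child, in document order.
all_betas = [b.text for b in item.findall('beta')]

# findtext() with a default avoids an AttributeError when the tag is absent.
missing = item.findtext('gamma', default='(none)')

print(first_beta, all_betas, missing)
```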


Vinod Srivastav