I have this XML file:

<?xml version="1.0" encoding="UTF-8" standalone="true"?>

<Component>

<Custom/>
<ID>1</ID>
<LongDescription>
<html><html> <head> <style type="text/css"> <!-- .style9 { color: #ffff33; background-color: #ff00ff } .style8 { color: #990099; background-color: #66ffcc } .style7 { color: #0066cc; background-color: #ccffcc } .style6 { color: #009900; background-color: #ffffcc } .style11 { color: #000066; background-color: #ccffcc } .style5 { color: #cc0033; background-color: #99ff99 } .style10 { color: #99ff99; background-color: #00cccc } .style4 { color: #cc0033; background-color: #ccffff } .style3 { color: #0000dd; background-color: teal } .style2 { color: #0000cc; background-color: aqua } .style1 { color: blue; background-color: silver } .style0 { color: #000099; background-color: #ffffcc } --> </style> </head> <body> </body> </html> </html>
</LongDescription>
<Name>ip_bridge</Name>
</Component>

I am reading this file with the xml.etree.ElementTree library as follows:

import re
import xml.etree.ElementTree as ET

def getTokens(xml_string_file):
    tokensList = []
    tree = ET.parse(xml_string_file)
    root = tree.getroot()
    tokensList.append('<component>')
    for child in root:
        firstTag = '<' + child.tag + '>'
        lastTag = '</' + child.tag + '>'
        tokensList.append(firstTag)
        if child.text is None:
            tokensList.append('')
        elif re.findall(r"\n", child.text, re.DOTALL):  # a newline in the text means the element has children
            tokensList = tokensList + extractTags(root=child)
        else:
            tokensList.append(child.text)
        tokensList.append(lastTag)
    tokensList.append('</component>')
    return tokensList

with the helper function extractTags:

def extractTags(root):
    tokensList = []
    for child in root:
        firstTag = '<' + child.tag + '>'
        lastTag = '</' + child.tag + '>'
        tokensList.append(firstTag)
        if child.text is None:
            tokensList.append('')
        elif re.findall(r"\n", child.text, re.DOTALL):  # To extract the children of the children
            tokensList = tokensList + extractTags(root=child)
        else:
            tokensList.append(child.text)
        tokensList.append(lastTag)
    return tokensList

As a result I get this token list:

['<component>', '<custom>', '', '</custom>', '<ID>', '1', '</ID>', '<LongDescription>', '<html>', '</html>', '</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</component>']

I also want to extract everything between the html tags as one token (one text).

Emna Jaoua
  • Can you post your expected output? – Rakesh Jun 01 '18 at 10:28
  • expected output ['', '', '', '', '', '1', '', '', '', '',' – Emna Jaoua Jun 01 '18 at 10:31
  • @Rakesh I forgot to add the extractTags function also. It's updated now in the post. – Emna Jaoua Jun 01 '18 at 10:49
  • This looks like a very complicated approach for creating a replica of the original tree. The output you create has nothing the actual XML tree doesn't have; I'm convinced it would be much simpler to skip creating this strange "token list" and work with the XML tree directly. What's the purpose or goal you want to achieve? – Tomalak Jun 01 '18 at 11:14
  • The purpose of the project is to regenerate an unseen XML file using machine learning methods. The token list is first encoded with a one-hot encoder. The encoded vector is then fed to an autoencoder model so that we can regenerate it, specifically with the decoder layer. So I need that token list. After regenerating the same token list, it will be written to an XML file. – Emna Jaoua Jun 01 '18 at 11:23
  • Ah, I see. It would make sense to add that information to the question. – Tomalak Jun 01 '18 at 11:26
  • I apologize for not being that clear – Emna Jaoua Jun 01 '18 at 11:43
  • No problem. Sometimes people do something seemingly pointless because they have not thought through the task properly, or cannot think of a better way of doing something. I think it's better to ask for the purpose rather than blindly solve a problem that might not really exist. – Tomalak Jun 01 '18 at 11:46
  • yes I agree with you ! – Emna Jaoua Jun 01 '18 at 12:59

1 Answer

I would suggest a simple recursive generator that traverses the tree and yields tokens.

These can be put into a list very easily through a list comprehension.

from io import StringIO

xml = """<Component>
    <Custom/>
    <ID>1</ID>
    <LongDescription>
        <html>
            <html>
                <head>
                    <style type="text/css">
                        <!-- .style9 { color: #ffff33; } ... --> 
                    </style>
                </head>
                <body>
                </body>
            </html>
        </html>
    </LongDescription>
    <Name>ip_bridge</Name>
</Component>"""
xml_string_file = StringIO(xml)

# -----------------------------------------------------------------------
import xml.etree.ElementTree as ET

def tokenize_tree(element):
    yield '<%s>' % element.tag 
    yield element.text if element.text else ''
    for child in element:
        yield from tokenize_tree(child)
    yield '</%s>' % element.tag 

tree = ET.parse(xml_string_file)    

token_list = [token for token in tokenize_tree(tree.getroot())]
print(token_list)

The output for me is:

['<Component>', '\n    ', '<Custom>', '', '</Custom>', '<ID>', '1', '</ID>', 
 '<LongDescription>', '\n        ', '<html>', '\n            ', '<html>', 
 '\n                ',  '<head>', '\n                    ',  '<style>', 
 '\n                         \n                    ', '</style>', '</head>', 
 '<body>', '\n                ', '</body>', '</html>', '</html>', 
 '</LongDescription>', '<Name>', 'ip_bridge', '</Name>', '</Component>']
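
If the markup inside <LongDescription> should instead come out as a single token, as the question asks, one possible variant is to serialize that element's children back to a string instead of recursing into them. This is only a sketch: the function name tokenize_with_opaque and the OPAQUE_TAGS set are hypothetical names introduced here, and the sample document is shortened.

```python
import xml.etree.ElementTree as ET

# Hypothetical setting: tags whose whole content should become one token.
OPAQUE_TAGS = {'LongDescription'}

def tokenize_with_opaque(element, opaque_tags=OPAQUE_TAGS):
    yield '<%s>' % element.tag
    if element.tag in opaque_tags:
        # Serialize the leading text and every child subtree into one string.
        inner = (element.text or '') + ''.join(
            ET.tostring(child, encoding='unicode') for child in element
        )
        yield inner.strip()
    else:
        yield element.text.strip() if element.text else ''
        for child in element:
            yield from tokenize_with_opaque(child, opaque_tags)
    yield '</%s>' % element.tag

# Shortened stand-in for the document in the question.
sample = (
    "<Component><ID>1</ID>"
    "<LongDescription><html><body>hi</body></html></LongDescription>"
    "</Component>"
)
tokens = list(tokenize_with_opaque(ET.fromstring(sample)))
print(tokens)
```

Note that the default ElementTree parser drops comments, so the commented-out CSS rules in the <style> element would not appear in that token.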

You can handle whitespace-only text nodes and comments (such as the one in the <style> element) as you see fit. For example by doing:

if element.text and element.text.strip():
    yield element.text.strip()
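
Put together, a whitespace-stripping version of the generator could look like the sketch below. It differs slightly from the snippet above: it still yields an empty string for elements without text, so the token pattern stays the same as in the first version. The name tokenize_tree_stripped and the sample input are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def tokenize_tree_stripped(element):
    yield '<%s>' % element.tag
    # Whitespace-only or missing text collapses to an empty-string token.
    yield (element.text or '').strip()
    for child in element:
        yield from tokenize_tree_stripped(child)
    yield '</%s>' % element.tag

root = ET.fromstring("<Component>\n  <Custom/>\n  <ID> 1 </ID>\n</Component>")
tokens = list(tokenize_tree_stripped(root))
print(tokens)
```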

For text node processing with ElementTree, look at Python element tree - extract text from element, stripping tags - you might instead want to add something like:

for text in element.itertext():
    yield text

to the function above.
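
To see what itertext() produces before wiring it in, here is a small standalone sketch on a made-up element. Keep in mind that itertext() already descends into all nested elements, so calling it on every level of a recursive function would repeat the inner text; calling it once on the element whose text you want is probably what you are after.

```python
import xml.etree.ElementTree as ET

# Made-up element with text mixed around a child element.
p = ET.fromstring("<p>Hello <b>world</b>!</p>")

# itertext() yields the text of the element and all descendants in document order.
pieces = list(p.itertext())
joined = ''.join(pieces)

print(pieces)
print(joined)
```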

For HTML in general, which will have text nodes and element nodes intermixed, see Python ElementTree - iterate through child nodes and text in order
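
As a rough illustration of the mixed-content case: ElementTree stores text that follows a child element in that child's tail attribute, so a tokenizer that should keep everything in document order also has to yield the tails. The function name tokenize_mixed and the sample element are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

def tokenize_mixed(element):
    yield '<%s>' % element.tag
    if element.text:
        yield element.text
    for child in element:
        yield from tokenize_mixed(child)
        # Text that follows the child's closing tag lives in child.tail.
        if child.tail:
            yield child.tail
    yield '</%s>' % element.tag

tokens = list(tokenize_mixed(ET.fromstring("<p>Hello <b>world</b>!</p>")))
print(tokens)
```

Unlike the first version, this one skips empty text instead of yielding '' for it; which behavior fits better depends on how the tokens are encoded afterwards.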

Tomalak