Python lxml : how to get encoding scheme before parsing

Question

I have some UTF-16 XML documents that I need to parse using Python's lxml ElementTree. If I just pass the documents as strings, it fails to build a tree. I need to run some XPath queries on the documents, so I need the tree structure. Here is the code I have been trying:

    import re
    from lxml.html.soupparser import fromstring

    def join_text(paras):
        # Replace every whitespace character (or literal '+') with a space,
        # exactly as the original '[\s+]' pattern does.
        return " ".join(re.sub(r'[\s+]', ' ', para.text.strip()) for para in paras)

    # The function name is illustrative; the original snippet was a function body.
    def extract_text(inString):
        root = fromstring(inString)

        backups = root.xpath(".//p3")
        nodes = root.xpath("./doc/p1/p2/p3[contains(text(),'ABC')]//preceding::p1//p3")

        if not nodes:
            print("No ABC")
            nodes = root.xpath("./doc/p1/p2/p3[contains(text(),'XYZ')]//preceding::p1//p3")

            if not nodes:
                print("No XYZ")
                return join_text(backups)

        return join_text(nodes)

Note that I want to look for a <p3> tag whose text contains ABC. If this node is found, I ignore everything that comes after it, hence the XPath. Otherwise, I look for a <p3> tag with the text XYZ; if that is found, I ignore everything that comes after it. Otherwise, I just process all the <p3> nodes and return.
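The fallback order described above can be sketched in plain Python, independent of the XPath. This is only one reading of the description: it assumes the marker node itself is excluded, it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and the function name `texts_before_marker` is hypothetical. It is not equivalent to the `preceding::` XPath in the snippet.

```python
import xml.etree.ElementTree as ET

def texts_before_marker(root, markers=("ABC", "XYZ")):
    """Return <p3> texts that come before the first marker that is found.

    Markers are tried in order; if none is present, all <p3> texts are returned.
    Assumption: the marker paragraph itself is not part of the result.
    """
    paras = [(p.text or "").strip() for p in root.iter("p3")]
    for marker in markers:
        for i, text in enumerate(paras):
            if marker in text:
                return paras[:i]
    return paras

sample = ET.fromstring(
    "<doc><p1><p2><p3>sfsdg</p3><p3>sgsg</p3><p3>ABC</p3><p3>agewg</p3></p2></p1></doc>"
)
print(texts_before_marker(sample))
```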

This works fine for UTF-8 documents but fails for UTF-16. For any UTF-16 document, I always get an empty string, even though I can see that there are <p3> nodes with text like ABC and XYZ. I noticed that instead of the expected

<p3>ABC</p3>

the utf-16 document text appears as

&lt;p3&gt;ABC&lt;/p3&gt;

Hence lxml.etree is not able to parse it as proper XML.

How should I solve this? Is there another library that I can use for this?

Edit

I found something here:

Why does ElementTree reject UTF-16 XML declarations with "encoding incorrect"?

This suggests that I should do

    root = fromstring(inString.encode('utf-16-be'))

But for this, I need to know whether the incoming document is UTF-16 encoded or not. How would I do that? I have both UTF-8 and UTF-16 documents in my data. The UTF-8 documents look like

<doc>
<p1>
    <p2 dd="ert" ji="pp">

        <p3>sfsdg</p3>
        <p3>sgsg</p3>
        <p3>ABC</p3>
        <p3>agewg</p3>

     </p2>
</p1>
</doc>

So here I am easily able to navigate to the <p3> node with text ABC using my XPath. However, I also get some documents like

<doc>
&lt;?xml version="1.0" encoding="UTF-16" standalone="yes"?&gt;
&lt;p1&gt;
    &lt;p2 dd="ert" ji="pp"&gt;

        &lt;p3&gt;sfsdg&lt;/p3&gt;
        &lt;p3&gt;sgsg&lt;/p3&gt;
        &lt;p3&gt;ABC&lt;/p3&gt;
        &lt;p3&gt;agewg&lt;/p3&gt;

     &lt;/p2&gt;
&lt;/p1&gt;
</doc>

So here it is explicitly specified that it is UTF-16. How can I detect the encoding so that I know what to parse?
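For the literal question of detecting an encoding before parsing, a rough sketch is to inspect the raw bytes: look for a byte order mark first, then fall back to the `encoding` attribute of the XML declaration. The function name `sniff_xml_encoding` is hypothetical, the check covers only UTF-8 and UTF-16 (no UTF-32 or EBCDIC), and it assumes the input is still `bytes` and has not already been decoded to text.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an ASCII-compatible XML declaration.
_DECL_ENCODING = re.compile(
    rb'\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML byte string without parsing it."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: a UTF-16 document starting with '<' has a NUL next to it.
    if data[:2] == b"<\x00":
        return "utf-16-le"
    if data[:2] == b"\x00<":
        return "utf-16-be"
    match = _DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # XML without a BOM or declaration is UTF-8 by default.
    return "utf-8"

print(sniff_xml_encoding('<?xml version="1.0" encoding="UTF-16"?><doc/>'.encode("utf-16")))
print(sniff_xml_encoding(b"<doc><p3>ABC</p3></doc>"))
```

Note that this would report UTF-8 for the second sample document shown above, because its declaration is escaped text inside `<doc>` rather than a real declaration at the start of the byte stream.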

Quick and Dirty Solution

I found a quick and dirty way:

    if "UTF-16" in inString or "utf-16" in inString:
        root = fromstring(inString.replace("&lt;", "<").replace("&gt;", ">"))
    else:
        root = fromstring(inString)

Is there a better way?
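One alternative to the string replacement, sketched here under assumptions, is to let the parser do the unescaping: after parsing the outer `<doc>`, the escaped inner document is available as ordinary text on the root element, and it can be parsed a second time. The sketch uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, the helper name `parse_possibly_wrapped` is hypothetical, and it assumes the wrapper is always a single element whose only content is the escaped document.

```python
import re
import xml.etree.ElementTree as ET

# An XML declaration at the start of the inner text; it is dropped because the
# inner document is already decoded text by the time it is parsed again.
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>')

def parse_possibly_wrapped(outer_xml):
    """Parse a document and, if its content is escaped XML, parse that as well."""
    root = ET.fromstring(outer_xml)
    text = root.text or ""
    if len(root) == 0 and text.lstrip().startswith("<"):
        inner = _XML_DECL.sub("", text, count=1).strip()
        root.text = None
        root.append(ET.fromstring(inner))  # graft the real tree under <doc>
    return root

plain = "<doc><p1><p2><p3>sfsdg</p3><p3>ABC</p3></p2></p1></doc>"
wrapped = (
    "<doc>\n"
    '&lt;?xml version="1.0" encoding="UTF-16" standalone="yes"?&gt;\n'
    '&lt;p1&gt;&lt;p2 dd="ert" ji="pp"&gt;'
    "&lt;p3&gt;sfsdg&lt;/p3&gt;&lt;p3&gt;ABC&lt;/p3&gt;"
    "&lt;/p2&gt;&lt;/p1&gt;\n"
    "</doc>"
)

for doc in (plain, wrapped):
    print([p.text for p in parse_possibly_wrapped(doc).iter("p3")])
```

Unlike the `replace` calls, this also handles other escapes such as `&amp;` or `&quot;` inside the inner document, since the outer parse resolves all of them.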

It seems that the problem you are having has nothing to do with UTF-16. The &lt; and &gt; are escape sequences, which denote the < and > characters that should not be considered part of the XML markup. I suggest you examine why your data comes escaped in this way in the first place. — KT., Dec 02 '15 at 16:28
Makes sense. It would be great if they gave me the data in the proper format in the first place. I thought that I was getting &lt; because the data was in UTF-16. — AbtPst, Dec 02 '15 at 16:32
Still, how can I check for the encoding scheme before parsing? — AbtPst, Dec 02 '15 at 16:32
In theory, you should have no need for that here. The data always comes to you as a byte stream at the place you read it. The only thing you need to do is make sure that you feed this byte stream as-is to the XML parser (i.e. without converting it to a unicode stream somewhere along the way). The parser will read out the encoding declaration from the stream and figure out the rest. — KT., Dec 02 '15 at 16:50
Also note that wrapping an XML file with an encoding declaration in the way you have it wrapped may be rather meaningless, if the external XML document is provided in a different encoding and the internal one is not kept in a CDATA section. — KT., Dec 02 '15 at 16:52
True. Unfortunately, that's how I get the documents. I will check with them if they can improve the formatting before sending. — AbtPst, Dec 02 '15 at 16:58
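The point made in the comments about feeding the byte stream to the parser as-is can be illustrated with a small sketch. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; `lxml.etree.fromstring` also accepts `bytes`, but that is not exercised here. The sample documents and the helper name `parse_bytes` are made up for illustration.

```python
import xml.etree.ElementTree as ET

text = '<?xml version="1.0" encoding="UTF-16"?><doc><p1><p3>ABC</p3></p1></doc>'

utf16_bytes = text.encode("utf-16")  # byte order mark followed by UTF-16 data
utf8_bytes = b"<doc><p1><p3>ABC</p3></p1></doc>"  # no declaration: UTF-8 default

def parse_bytes(raw):
    # No decoding step: the parser reads the BOM and the declaration itself.
    return ET.fromstring(raw)

for raw in (utf16_bytes, utf8_bytes):
    root = parse_bytes(raw)
    print([p.text for p in root.iter("p3")])
```

Both inputs go through the same call, so the calling code does not need to know which encoding a given document uses as long as it keeps the data as bytes.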
