
I am fetching product information from an endpoint. To parse the response I am using a filter, which is a suds MessagePlugin.

The incoming data looks like the following. (This is not the whole response, only a small part of it.)

<env:Envelope xmlns:env='http://schemas.xmlsoap.org/soap/envelope/'><env:Header></env:Header><env:Body><prod:getProductsResponse xmlns:prod='https://product.individual.ns.listinsgapi.aa.com'><return><ackCode>success</ackCode><responseTime>13/09/2021 09:47:34</responseTime><timeElapsed>211 ms</timeElapsed><productCount>199</productCount><products><product><productId>01201801947</productId><product><categoryCode>cn1g</categoryCode><storeCategoryId>0</storeCategoryId><title>Morphy Richards Sensörlü çöp kutusu, 30 litre, yuvarlak, siyah paslanmaz çelik</title><specs><spec required="false" value="Standart Çöp Kovası" name="Ürün Tipi"/><spec required="false" value="Montajsız" name="Montaj Tipi"/><spec required="false" value="Sensörlü Kapak" name="Kapak Tipi"/><spec required="false" value="26 lt-30 lt" name="İç Hacim"/><spec required="false" value="Çelik" name="Malzeme"/><spec required="false" value="Sıfır" name="Durum"/></specs><photos><photo photoId="0"><url>https://mcdn301.gi1ttigidliyor.net/622080/620801947_0.jpg</url></photo><photo photoId="1"><url>https://mcdn011.gittigidliyor.net/620380/62081101947_1.jpg</url></photo><photo photoId="2"><url>https://mcdn021.gittigidliyor.net/620180/6210801947_2.jpg</url></photo><photo photoId="3"><url>https://mcdn201.gittigidliyor.net/620850/6208013947_3.jpg</url></photo><photo photoId="4"><url>https://mcdn301.gittigidliyor.net/623080/6208101947_4.jpg</url></photo><photo photoId="5"><url>https://mcdn01.gittigidiyor.net/62080/620801947_5.jpg</url></photo><photo photoId="6"><url>https://mcdn01.gittigidiyor.net/62080/620801947_6.jpg</url></photo></photos><pageTemplate>4</pageTemplate><description>&lt;body&gt;
 &lt;ul class=&quot;a-unordered-list a-vertical a-spacing-mini&quot; style=&quot;padding-right: 0px; padding-left: 0px; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: &quot;&gt; 
  &lt;li style=&quot;box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;&quot;&gt;&amp;nbsp; &lt;h2 style=&quot;box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizelegibility; line-height: 32px; font-family: &quot;&gt;Ürün Bilgileri&lt;/h2&gt; &lt;span style=&quot;background-color:rgb(255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14px&quot;&gt;Renk:&lt;strong style=&quot;box-sizing:border-box; font-weight:700&quot;&gt;Paslanmaz Çelik&lt;/strong&gt;&lt;/span&gt; 
   &lt;div class=&quot;a-row a-spacing-top-base&quot; style=&quot;box-sizing: border-box; width: 1213px; color: rgb(15, 17, 17); font-family: &quot;&gt; 
    &lt;div class=&quot;a-column a-span6&quot; style=&quot;box-sizing: border-box; margin-right: 24.25px; float: left; min-height: 1px; overflow: visible; width: 593.734px;&quot;&gt; 
     &lt;div class=&quot;a-row a-spacing-base&quot; style=&quot;box-sizing: border-box; width: 593.734px; margin-bottom: 12px !important;&quot;&gt; 
      &lt;div class=&quot;a-row a-expander-container a-expander-extend-container&quot; style=&quot;box-sizing: border-box; width: 593.734px;&quot;&gt; 
       &lt;div class=&quot;a-row&quot; style=&quot;box-sizing: border-box; width: 593.734px;&quot;&gt;

I just want to apply HTML decoding to the description part of the response, because for some products the HTML tags in the description are not fully parsed and the description comes out broken.

For example:

0979c08d37cd.CR0,0,2000,2000_PT0_SX220_.jpg style=-webkit-tap-highlight-color:transparent; border:none; box-sizing:border-box; display:block; margin:0px auto; max-width:100%; padding:0px; vertical-align:top/p /div /th /tr /tbody /table /div /div /div /div /div /div /div /div /div /body

To solve this problem I tried two different approaches.

Before digging into the approaches, some background on the plugin hook:

context: the reply context. Its `reply` attribute is the raw text.
context.reply = incoming data
type(context.reply) = bytes

1.

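from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):

    def received(self, context):
        from lxml import etree
        from io import BytesIO

        parser = etree.XMLParser(recover=True) # Initialize a parser that tolerates malformed XML
        request_string = context.reply.decode("utf-8") # Converting incoming data byte to string
        # Manually turning the escaped angle brackets back into real ones
        replaced_string = request_string.replace("&gt;", ">").replace("&lt;", "<")

        byte_rep_string = str.encode(replaced_string) # Converting the data from string to byte

        doc = etree.parse(BytesIO(byte_rep_string), parser)
        byte_str_doc = etree.tostring(doc)
        context.reply = byte_str_doc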

This approach didn't work. It didn't throw an error, but nothing changed: the HTML tags in the product description were still broken.

2.

class UnicodeFilter(MessagePlugin):

    def received(self, context):
        from lxml import etree
        from io import BytesIO
        import html

        parser = etree.XMLParser(recover=True) # Initialize the parser
        request_string = context.reply.decode("utf-8") # Converting incoming data byte to string
        html_decoded = html.unescape(request_string) # Html decoding to the data
        byte_rep_string = str.encode(html_decoded) # Converting the data from string to byte
      
        doc = etree.parse(BytesIO(byte_rep_string), parser)
        byte_str_doc = etree.tostring(doc)
        context.reply = byte_str_doc

With this approach, I got the error `TypeNotFound: Type not found: 'body'`.
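One possible explanation for that error, sketched below: unescaping the whole envelope turns the escaped markup inside the description into real XML child elements, so a `body` element appears where the schema expects plain text, and suds may then fail to resolve it as a type. The sketch uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra packages, and the `envelope` string is a shortened, made-up stand-in for the real response.

```python
import html
import xml.etree.ElementTree as ET

# Assumption: a tiny, well-formed stand-in for the real SOAP response.
envelope = (
    "<return><product><description>"
    "&lt;body&gt;&lt;p&gt;Renk: &lt;strong&gt;Paslanmaz Çelik&lt;/strong&gt;"
    "&lt;/p&gt;&lt;/body&gt;"
    "</description></product></return>"
)

# Parse first: the XML parser resolves &lt; and &gt; on its own, so the
# description stays a single text node that already contains decoded HTML.
description = ET.fromstring(envelope).find(".//description")
decoded_text = description.text
child_tags_parsed_first = [child.tag for child in description]

# Unescape first: the HTML tags become real XML elements under <description>.
unescaped_envelope = html.unescape(envelope)
description_after = ET.fromstring(unescaped_envelope).find(".//description")
child_tags_unescaped_first = [child.tag for child in description_after]

print(decoded_text)
print(child_tags_parsed_first, child_tags_unescaped_first)
```

If this is what happens in the plugin, leaving the envelope escaped and reading the description text after parsing would avoid the extra `body` element.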

To summarize what I want to do: first, I want to parse the incoming data with the lxml library, because some characters in the data cause problems and I get the "not well-formed (invalid token)" error (I have solved this). Second, I want to HTML-decode only the description part of the data, to fix the HTML tag issue.

Any help would be great.

bufferoverflow
  • Please provide a proper [mcve]. Does reproducing the problem really require a class that inherits from MessagePlugin? – mzjn Sep 22 '21 at 06:59
  • The "incoming data" sample is not well-formed (end tags missing). – mzjn Sep 22 '21 at 09:00
  • @mzjn Actually, that is the error message I got in the beginning, but I solved that problem by filtering with a suds MessagePlugin. The given incoming data is not the full data; it is a sample of the original data, which is a huge file. So my problem is about the HTML decoding part. (You can see the HTML part below the description tag in the incoming data sample.) For some reason, using `html.unescape(request_string)` throws an error (TypeNotFound: Type not found: 'body'). The question is how I can parse the incoming data and its inner description tag separately. – bufferoverflow Sep 22 '21 at 09:34
  • It is hard to reproduce the problem, but I have tried to give a more detailed explanation. Basically you get some XML data like ...... That data contains a description element, and that description element holds an HTML body with HTML tags. So I would like to parse the XML data and the inner description part separately. I am using `parser = etree.XMLParser(recover=True)` to parse the incoming XML data and `html_decoded = html.unescape(request_string)` to parse the HTML part. – bufferoverflow Sep 22 '21 at 10:23

1 Answer


I'm not sure I can reproduce your specific error, but I would use this approach using etree.fromstring() once you have the string from the request. (I've tried to clean up and close the tags for the test data so it can be parsed to demonstrate the solution. There's also an extra <product> tag in there that prevents parsing that you may have to deal with.)


In [104]: import lxml.etree

In [105]: string = '''<env:Envelope xmlns:env='http://schemas.xmlsoap.org/soap/envelope/'><env:Header></env:Header><env:Body><pr
     ...: od:getProductsResponse xmlns:prod='https://product.individual.ns.listinsgapi.aa.com'><return><ackCode>success</ackCode
     ...: ><responseTime>13/09/2021 09:47:34</responseTime><timeElapsed>211 ms</timeElapsed><productCount>199</productCount><pro
     ...: ducts><product><productId>01201801947</productId><categoryCode>cn1g</categoryCode><storeCategoryId>0</storeCategoryId>
     ...: <title>Morphy Richards Sensörlü çöp kutusu, 30 litre, yuvarlak, siyah paslanmaz çelik</title><specs><spec required="fa
     ...: lse" value="Standart Çöp Kovası" name="Ürün Tipi"/><spec required="false" value="Montajsız" name="Montaj Tipi"/><spec 
     ...: required="false" value="Sensörlü Kapak" name="Kapak Tipi"/><spec required="false" value="26 lt-30 lt" name="İç Hacim"/
     ...: ><spec required="false" value="Çelik" name="Malzeme"/><spec required="false" value="Sıfır" name="Durum"/></specs><phot
     ...: os><photo photoId="0"><url>https://mcdn301.gi1ttigidliyor.net/622080/620801947_0.jpg</url></photo><photo photoId="1"><
     ...: url>https://mcdn011.gittigidliyor.net/620380/62081101947_1.jpg</url></photo><photo photoId="2"><url>https://mcdn021.gi
     ...: ttigidliyor.net/620180/6210801947_2.jpg</url></photo><photo photoId="3"><url>https://mcdn201.gittigidliyor.net/620850/
     ...: 6208013947_3.jpg</url></photo><photo photoId="4"><url>https://mcdn301.gittigidliyor.net/623080/6208101947_4.jpg</url><
     ...: /photo><photo photoId="5"><url>https://mcdn01.gittigidiyor.net/62080/620801947_5.jpg</url></photo><photo photoId="6"><
     ...: url>https://mcdn01.gittigidiyor.net/62080/620801947_6.jpg</url></photo></photos><pageTemplate>4</pageTemplate><descrip
     ...: tion>&lt;body&gt;
     ...:  &lt;ul class=&quot;a-unordered-list a-vertical a-spacing-mini&quot; style=&quot;padding-right: 0px; padding-left: 0px
     ...: ; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: &quot;&gt; 
     ...:   &lt;li style=&quot;box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;&quot;&gt;&amp;n
     ...: bsp; &lt;h2 style=&quot;box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizeleg
     ...: ibility; line-height: 32px; font-family: &quot;&gt;Ürün Bilgileri&lt;/h2&gt; &lt;span style=&quot;background-color:rgb
     ...: (255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14
     ...: px&quot;&gt;Renk:&lt;strong style=&quot;box-sizing:border-box; font-weight:700&quot;&gt;Paslanmaz Çelik&lt;/strong&gt;
     ...: &lt;/span&gt; 
     ...:    &lt;div class=&quot;a-row a-spacing-top-base&quot; style=&quot;box-sizing: border-box; width: 1213px; color: rgb(15
     ...: , 17, 17); font-family: &quot;&gt; 
     ...:     &lt;div class=&quot;a-column a-span6&quot; style=&quot;box-sizing: border-box; margin-right: 24.25px; float: left;
     ...:  min-height: 1px; overflow: visible; width: 593.734px;&quot;&gt; 
     ...:      &lt;div class=&quot;a-row a-spacing-base&quot; style=&quot;box-sizing: border-box; width: 593.734px; margin-botto
     ...: m: 12px !important;&quot;&gt; 
     ...:       &lt;div class=&quot;a-row a-expander-container a-expander-extend-container&quot; style=&quot;box-sizing: border-
     ...: box; width: 593.734px;&quot;&gt; 
     ...:        &lt;div class=&quot;a-row&quot; style=&quot;box-sizing: border-box; width: 593.734px;&quot;&gt;
     ...: </description>
     ...: </product>
     ...: </products>
     ...: </return>
     ...: </prod:getProductsResponse>
     ...: </env:Body>
     ...: </env:Envelope>'''

In [106]: root = lxml.etree.fromstring(string)

In [108]: descriptions = root.xpath('//description')

In [109]: description = descriptions[0]

In [110]: description.text
Out[110]: '<body>\n <ul class="a-unordered-list a-vertical a-spacing-mini" style="padding-right: 0px; padding-left: 0px; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: "> \n  <li style="box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;">&nbsp; <h2 style="box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizelegibility; line-height: 32px; font-family: ">Ürün Bilgileri</h2> <span style="background-color:rgb(255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14px">Renk:<strong style="box-sizing:border-box; font-weight:700">Paslanmaz Çelik</strong></span> \n   <div class="a-row a-spacing-top-base" style="box-sizing: border-box; width: 1213px; color: rgb(15, 17, 17); font-family: "> \n    <div class="a-column a-span6" style="box-sizing: border-box; margin-right: 24.25px; float: left; min-height: 1px; overflow: visible; width: 593.734px;"> \n     <div class="a-row a-spacing-base" style="box-sizing: border-box; width: 593.734px; margin-bottom: 12px !important;"> \n      <div class="a-row a-expander-container a-expander-extend-container" style="box-sizing: border-box; width: 593.734px;"> \n       <div class="a-row" style="box-sizing: border-box; width: 593.734px;">\n'

In [112]: html_root = lxml.etree.fromstring(description.text, lxml.etree.HTMLParser())

In [114]: lxml.etree.tostring(html_root)
Out[114]: b'<html><body>\n <ul class="a-unordered-list a-vertical a-spacing-mini" style="padding-right: 0px; padding-left: 0px; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: "> \n  <li style="box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;">&#160; <h2 style="box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizelegibility; line-height: 32px; font-family: ">&#220;r&#252;n Bilgileri</h2> <span style="background-color:rgb(255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14px">Renk:<strong style="box-sizing:border-box; font-weight:700">Paslanmaz &#199;elik</strong></span> \n   <div class="a-row a-spacing-top-base" style="box-sizing: border-box; width: 1213px; color: rgb(15, 17, 17); font-family: "> \n    <div class="a-column a-span6" style="box-sizing: border-box; margin-right: 24.25px; float: left; min-height: 1px; overflow: visible; width: 593.734px;"> \n     <div class="a-row a-spacing-base" style="box-sizing: border-box; width: 593.734px; margin-bottom: 12px !important;"> \n      <div class="a-row a-expander-container a-expander-extend-container" style="box-sizing: border-box; width: 593.734px;"> \n       <div class="a-row" style="box-sizing: border-box; width: 593.734px;">\n</div></div></div></div></div></li></ul></body></html>'

If you need to manipulate the html after this, it would be better to manipulate the html_root rather than attempt to manipulate the string. If so, I can expand the answer as needed.
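If only the readable text of the description is needed, a rough alternative is to walk the decoded fragment with the standard library's `html.parser`, which also tolerates the unclosed tags seen above. This is a different technique from the lxml `HTMLParser` tree used in this answer; `TextCollector`, `extract_text`, and the shortened `fragment` are hypothetical names and data made up for the illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects non-empty text chunks from HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(fragment):
    """Return the text chunks of an HTML fragment, in document order."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    return collector.chunks


# Assumption: a shortened stand-in for description.text, with the same
# kind of unclosed <div> tags as the sample data.
fragment = (
    '<body><ul><li><h2 style="margin: 3px">Ürün Bilgileri</h2> '
    '<span>Renk:<strong>Paslanmaz Çelik</strong></span><div><div>'
)
text_chunks = extract_text(fragment)
print(text_chunks)
```

For anything beyond text extraction, such as editing attributes or re-serializing repaired markup, working on the parsed lxml tree is likely the better fit.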

Forensic_07
  • First of all, thank you for your reply. At `root = etree.fromstring(string_context_reply)` I got the error lxml.etree.XMLSyntaxError: PCDATA invalid Char value. Could the reason be that your "string" value is a bytes object in my case? I convert it to a string object with `string_context_reply = context.reply.decode("UTF-8")` and then call `root = etree.fromstring(string_context_reply)`. – bufferoverflow Sep 23 '21 at 11:15
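
A "PCDATA invalid Char value" error usually points to a character that XML 1.0 does not allow (most control characters, for example) rather than to the bytes versus str distinction. One way to handle it, sketched here under that assumption, is to strip such characters before parsing. The helper name `strip_invalid_xml_chars` and the sample `raw_reply` are made up for the illustration, and the standard library parser stands in for lxml so the sketch runs without extra packages.

```python
import re
import xml.etree.ElementTree as ET

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(reply: bytes) -> bytes:
    """Hypothetical helper: drop characters XML 1.0 forbids, keep the rest."""
    text = reply.decode("utf-8", errors="replace")
    return _INVALID_XML_CHARS.sub("", text).encode("utf-8")


# Assumption: a made-up reply containing a vertical tab and a NUL byte.
raw_reply = "<description>Paslanmaz \x0bÇelik\x00</description>".encode("utf-8")

cleaned_reply = strip_invalid_xml_chars(raw_reply)
description_text = ET.fromstring(cleaned_reply).text
print(description_text)
```

Passing the question's `etree.XMLParser(recover=True)` as the second argument to `etree.fromstring` is another option, with the trade-off that a recovering parser may silently drop content it cannot make sense of.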