I am fetching product information from a SOAP endpoint. To parse the response, I use a filter that is a suds MessagePlugin.
The incoming data looks like the following (this is not the whole response, only a small part of it):
<env:Envelope xmlns:env='http://schemas.xmlsoap.org/soap/envelope/'><env:Header></env:Header><env:Body><prod:getProductsResponse xmlns:prod='https://product.individual.ns.listinsgapi.aa.com'><return><ackCode>success</ackCode><responseTime>13/09/2021 09:47:34</responseTime><timeElapsed>211 ms</timeElapsed><productCount>199</productCount><products><product><productId>01201801947</productId><product><categoryCode>cn1g</categoryCode><storeCategoryId>0</storeCategoryId><title>Morphy Richards Sensörlü çöp kutusu, 30 litre, yuvarlak, siyah paslanmaz çelik</title><specs><spec required="false" value="Standart Çöp Kovası" name="Ürün Tipi"/><spec required="false" value="Montajsız" name="Montaj Tipi"/><spec required="false" value="Sensörlü Kapak" name="Kapak Tipi"/><spec required="false" value="26 lt-30 lt" name="İç Hacim"/><spec required="false" value="Çelik" name="Malzeme"/><spec required="false" value="Sıfır" name="Durum"/></specs><photos><photo photoId="0"><url>https://mcdn301.gi1ttigidliyor.net/622080/620801947_0.jpg</url></photo><photo photoId="1"><url>https://mcdn011.gittigidliyor.net/620380/62081101947_1.jpg</url></photo><photo photoId="2"><url>https://mcdn021.gittigidliyor.net/620180/6210801947_2.jpg</url></photo><photo photoId="3"><url>https://mcdn201.gittigidliyor.net/620850/6208013947_3.jpg</url></photo><photo photoId="4"><url>https://mcdn301.gittigidliyor.net/623080/6208101947_4.jpg</url></photo><photo photoId="5"><url>https://mcdn01.gittigidiyor.net/62080/620801947_5.jpg</url></photo><photo photoId="6"><url>https://mcdn01.gittigidiyor.net/62080/620801947_6.jpg</url></photo></photos><pageTemplate>4</pageTemplate><description><body>
<ul class="a-unordered-list a-vertical a-spacing-mini" style="padding-right: 0px; padding-left: 0px; box-sizing: border-box; margin: 0px 0px 0px 18px; color: rgb(17, 17, 17); font-family: ">
<li style="box-sizing: border-box; list-style: disc; overflow-wrap: break-word; margin: 0px;">&nbsp; <h2 style="box-sizing: border-box; padding: 0px 0px 4px; margin: 3px 0px 7px; text-rendering: optimizelegibility; line-height: 32px; font-family: ">Ürün Bilgileri</h2> <span style="background-color:rgb(255, 255, 255); box-sizing:border-box; color:rgb(15, 17, 17); font-family:amazon ember,arial,sans-serif; font-size:14px">Renk:<strong style="box-sizing:border-box; font-weight:700">Paslanmaz Çelik</strong></span>
<div class="a-row a-spacing-top-base" style="box-sizing: border-box; width: 1213px; color: rgb(15, 17, 17); font-family: ">
<div class="a-column a-span6" style="box-sizing: border-box; margin-right: 24.25px; float: left; min-height: 1px; overflow: visible; width: 593.734px;">
<div class="a-row a-spacing-base" style="box-sizing: border-box; width: 593.734px; margin-bottom: 12px !important;">
<div class="a-row a-expander-container a-expander-extend-container" style="box-sizing: border-box; width: 593.734px;">
<div class="a-row" style="box-sizing: border-box; width: 593.734px;">
I only want to apply HTML decoding to the description part of the response, because for some products the HTML tags in the description are not parsed correctly and the description comes out broken.
For example:
0979c08d37cd.CR0,0,2000,2000_PT0_SX220_.jpg style=-webkit-tap-highlight-color:transparent; border:none; box-sizing:border-box; display:block; margin:0px auto; max-width:100%; padding:0px; vertical-align:top/p /div /th /tr /tbody /table /div /div /div /div /div /div /div /div /div /body
To solve this problem, I tried two different approaches.
Before getting into the approaches, some background on the plugin hook:
context: the reply context; context.reply is the raw text.
context.reply = the incoming data
type(context.reply) = bytes
1.
from suds.plugin import MessagePlugin

class UnicodeFilter(MessagePlugin):
    def received(self, context):
        from lxml import etree
        from io import BytesIO
        parser = etree.XMLParser(recover=True)  # Tolerate malformed input
        request_string = context.reply.decode("utf-8")  # bytes -> str
        # Turn escaped angle brackets back into literal ones
        replaced_string = request_string.replace("&gt;", ">").replace("&lt;", "<")
        byte_rep_string = replaced_string.encode("utf-8")  # str -> bytes
        doc = etree.parse(BytesIO(byte_rep_string), parser)
        byte_str_doc = etree.tostring(doc)
        context.reply = byte_str_doc
This approach didn't work. It didn't throw an error, but nothing changed: the HTML tags in the product description were still broken.
2.
class UnicodeFilter(MessagePlugin):
    def received(self, context):
        from lxml import etree
        from io import BytesIO
        import html
        parser = etree.XMLParser(recover=True)  # Initialize the parser
        request_string = context.reply.decode("utf-8")  # Convert the incoming bytes to a string
        html_decoded = html.unescape(request_string)  # HTML-decode the whole response
        byte_rep_string = html_decoded.encode("utf-8")  # Convert the string back to bytes
        doc = etree.parse(BytesIO(byte_rep_string), parser)
        byte_str_doc = etree.tostring(doc)
        context.reply = byte_str_doc
In this approach, I got the TypeNotFound: Type not found: 'body' error.
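My guess at why this happens, shown as a small standard-library sketch: unescaping the whole response turns the escaped markup inside the description into real XML elements, and suds then presumably tries to look up a type named body in the WSDL. The ESCAPED sample and the child_tags helper are made up for illustration, and the link to TypeNotFound is an assumption on my part.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical one-element sample; the real response is much larger.
ESCAPED = "<description>&lt;body&gt;&lt;p&gt;hi&lt;/p&gt;&lt;/body&gt;</description>"


def child_tags(xml_text):
    """Return the tags of the direct children of the root element."""
    return [child.tag for child in ET.fromstring(xml_text)]


# Escaped markup is plain character data, so the element has no children.
print(child_tags(ESCAPED))                 # []

# After unescaping the whole document, <body> is a real child element.
print(child_tags(html.unescape(ESCAPED)))  # ['body']
```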
To summarize: first, I want to parse the incoming data with the lxml library, because some characters in the data cause a "not well-formed (invalid token)" error (I have solved this part). Second, I want to HTML-decode only the description part of the data, to fix the broken HTML tags.
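A minimal sketch of the kind of thing I have in mind for the second goal. It assumes the description arrives as escaped text (for example double-escaped entities) and not as child elements; decode_descriptions is a hypothetical helper name, not something suds or lxml provides. The decoded markup stays character data, because lxml escapes it again when serializing, so no new elements are introduced.

```python
import html
from io import BytesIO

from lxml import etree


def decode_descriptions(raw: bytes) -> bytes:
    """HTML-decode the text of every <description> element, nothing else."""
    # recover=True lets lxml skip over tokens that are not well-formed.
    parser = etree.XMLParser(recover=True)
    doc = etree.parse(BytesIO(raw), parser)
    # Assumption: the description elements are not namespace-qualified.
    for node in doc.iter("description"):
        if node.text:
            # The parser already decoded one level of entities;
            # this removes a second level if there is one.
            node.text = html.unescape(node.text)
    # Serializing escapes <, > and & in text again, so the result is valid XML.
    return etree.tostring(doc)
```

Inside the plugin this would presumably be called as context.reply = decode_descriptions(context.reply), leaving every other element untouched.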
Any help would be great.