
I was trying to parse the following XML and fetch the specific tags I'm interested in for my business need, and I guess I'm doing something wrong. Not sure how to parse my required tags. I wanted to leverage pandas so that I can further filter for specifics. Appreciate all the support.

My XML, coming from the URI:

<couponfeed>
 <TotalMatches>1459</TotalMatches>
 <TotalPages>3</TotalPages>
 <PageNumberRequested>1</PageNumberRequested>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
    <promotiontype id="11">Percentage off</promotiontype>
   </promotiontypes>
   <offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
   <offerstartdate>2020-07-24</offerstartdate>
   <offerenddate>2020-07-26</offerenddate>
   <clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
    <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
    <advertiserid>3184</advertiserid>
    <advertisername>cys.com</advertisername>
    <network id="1">US Network</network>
  </link>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
   <promotiontype id="11">Percentage off</promotiontype>
  </promotiontypes>
  <offerdescription>25% Off Boys' Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
  <offerstartdate>2020-07-24</offerstartdate>
  <offerenddate>2020-07-26</offerenddate>
  <clickurl>https://click.synergy.com/fs-bin/click?id=ZZvk49eM&offerid=777210.100474695&type=3&subid=0</clickurl>
  <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZZvk49NAwbids=777210.100474695&type=3&subid=0</impressionpixel>
  <advertiserid>3184</advertiserid>
  <advertisername>cys.com</advertisername>
  <network id="1">US Network</network>
 </link>
</couponfeed>

My Code

from xml.dom import minidom
import urllib
import pandas as pd 
url = "http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500"
xmldoc = minidom.parse(urllib.request.urlopen(url)) 

#itemlist = xmldoc.getElementsByTagName('clickurl')


df_cols = ["promotiontype","category","offerdescription", "offerstartdate", "offerenddate", "clickurl","impressionpixel","advertisername","network"]
rows = []

for entry in xmldoc.couponfeed:
    s_promotiontype = couponfeed.get("promotiontype","")
    s_category = couponfeed.get("category","")
    s_offerdescription = couponfeed.get("offerdescription", "")
    s_offerstartdate = couponfeed.get("offerstartdate", "")
    s_offerenddate = couponfeed.get("offerenddate", "")
    s_clickurl = couponfeed.get("clickurl", "")
    s_impressionpixel = couponfeed.get("impressionpixel", "")
    s_advertisername = couponfeed.get("advertisername","")
    s_network = couponfeed.get ("network","")
       
        
    rows.append({"promotiontype":s_promotiontype, "category": s_category, "offerdescription": s_offerdescription, 
                 "offerstartdate": s_offerstartdate, "offerenddate": s_offerenddate,"clickurl": s_clickurl,"impressionpixel":s_impressionpixel,
                 "advertisername": s_advertisername,"network": s_network})

out_df = pd.DataFrame(rows, columns=df_cols)


out_df.to_csv(r"C:\\Users\rai\Downloads\\merchants_offers_share.csv", index=False)

Trying an easy way, but I don't get any results:

import lxml.etree as ET 
import urllib

response = urllib.request.urlopen('http://couponfeed.synergy.com/coupon?token=xxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500')
xml = response.read()

root = ET.fromstring(xml)

for item in root.findall('.//item'):
    title = item.find('category').text
    print (title)

Another try:

from lxml import etree
import pandas as pd 
import urllib

    url = "http://couponfeed.synergy.com/coupon?token=xxxxxxd39f4e5fe392a25538bb122b&network=1&resultsperpage=500"
    xtree = etree.parse(urllib.request.urlopen(url)) 
    
    for value in xtree.xpath("/root/couponfeed/categories"):
        print(value.text)

3 Answers


Another method.

from simplified_scrapy import SimplifiedDoc, utils, req
# html = req.get('http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage=500')
html = '''
<couponfeed>
 <TotalMatches>1459</TotalMatches>
 <TotalPages>3</TotalPages>
 <PageNumberRequested>1</PageNumberRequested>
 <link type="TEXT">
  <categories>
   <category id="1">Apparel</category>
  </categories>
  <promotiontypes>
    <promotiontype id="11">Percentage off</promotiontype>
   </promotiontypes>
   <offerdescription>25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!</offerdescription>
   <offerstartdate>2020-07-24</offerstartdate>
   <offerenddate>2020-07-26</offerenddate>
   <clickurl>https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0</clickurl>
    <impressionpixel>https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0</impressionpixel>
    <advertiserid>3184</advertiserid>
    <advertisername>cys.com</advertisername>
    <network id="1">US Network</network>
  </link>
 </couponfeed>
'''
doc = SimplifiedDoc(html)
df_cols = [
    "promotiontype", "category", "offerdescription", "offerstartdate",
    "offerenddate", "clickurl", "impressionpixel", "advertisername", "network"
]
rows = [df_cols]

links = doc.couponfeed.links  # Get all links
for link in links:
    row = []
    for col in df_cols:
        row.append(link.select(col).text)  # Get col text
    rows.append(row)

utils.save2csv('merchants_offers_share.csv', rows)  # Save to csv file

Result:

promotiontype,category,offerdescription,offerstartdate,offerenddate,clickurl,impressionpixel,advertisername,network
Percentage off,Apparel,25% Off Boys Quiksilver Apparel. Shop now at Macys.com! Valid 7/23 through 7/25!,2020-07-24,2020-07-26,https://click.synergy.com/fs-bin/click?id=Z&offerid=777210.100474694&type=3&subid=0,https://ad.synergy.com/fs-bin/show?id=ZNAweM&bids=777210.100474694&type=3&subid=0,cys.com,US Network

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
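Since the original goal was a pandas DataFrame, the rows list built above (header row first) can also be loaded straight into one; a minimal sketch, assuming rows is exactly the list constructed above:

import pandas as pd

df = pd.DataFrame(rows[1:], columns=rows[0])  # first row holds the column names
df.to_csv('merchants_offers_share.csv', index=False)  # index=False drops the index column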

Remove the last empty row

import io
with io.open('merchants_offers_share.csv', "rb+") as f:
    f.seek(-1, 2)      # jump to the last byte of the file
    l = f.read()
    if l == b"\n":     # the file ends with a line break
        f.seek(-2, 2)
        f.truncate()   # cut off the trailing line ending (two bytes, e.g. \r\n)
  • THANK YOU. Works like a charm. Quick question: why does the csv output leave an empty row, i.e. I get records alternating with empty rows? How do I get a clean .csv without empty rows? – sunny babau Jul 27 '20 at 04:23
  • @sunnybabau I'm glad to be able to help you. The last empty row is for the convenience of appending data. If you don't need it, you can delete it in the way above. I changed the answer. – dabingsou Jul 27 '20 at 08:51
  • Can you please suggest a feasible solution for this: https://stackoverflow.com/questions/63122779/python-parsing-html-from-url-into-pd-valueerror-no-tables-found?noredirect=1#comment111624853_63122779 – sunny babau Jul 28 '20 at 04:11

First, the XML document wasn't parsing because you copied raw ampersands (&) from the source page; & is a reserved character in XML and must be escaped as &amp;. When your browser renders XML (or HTML), it displays &amp; as &, which is why the copied text ends up unescaped.
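For illustration, a minimal sketch of repairing such a copy before parsing (the feed.xml filename is hypothetical; the regex escapes only ampersands that don't already start an entity):

import re
from xml.dom import minidom

raw = open('feed.xml', encoding='utf-8').read()  # hypothetical local copy of the pasted XML
# escape any bare & that is not already part of an entity like &amp; or &#38;
fixed = re.sub(r'&(?![a-zA-Z]+;|#\d+;)', '&amp;', raw)
xmldoc = minidom.parseString(fixed)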

As for the code, the easiest way to get the data is to iterate over df_cols and call getElementsByTagName for each column, which returns the list of elements for that column in document order.

from xml.dom import minidom
import pandas as pd
import urllib.request

limit = 500
url = f"http://couponfeed.synergy.com/coupon?token=xxxxxxxxx122b&network=1&resultsperpage={limit}"

xmldoc = minidom.parse(urllib.request.urlopen(url))

df_cols = ["promotiontype", "category", "offerdescription", "offerstartdate", "offerenddate", "clickurl", "impressionpixel", "advertisername", "network"]

# create a dict for each expected row
rows = [{} for i in range(limit)]

for row_name in df_cols:
    # get every element for this column, in document order
    nodes = xmldoc.getElementsByTagName(row_name)
    for i, node in enumerate(nodes):
        rows[i][row_name] = node.firstChild.nodeValue

out_df = pd.DataFrame(rows, columns=df_cols)


This isn't the most efficient way to do this, but I'm not sure how else to do it using minidom. If efficiency is a concern, I'd recommend using lxml instead.
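For example, a minimal sketch of the lxml route, assuming the url and df_cols defined above; lxml's iterparse streams each <link> element as it is parsed, so a large feed never has to sit fully in memory:

import lxml.etree as ET
import urllib.request

rows = []
with urllib.request.urlopen(url) as resp:
    for _, link in ET.iterparse(resp, tag='link'):
        # the './/' prefix reaches nested tags such as category and promotiontype
        rows.append({col: link.findtext('.//' + col, default='') for col in df_cols})
        link.clear()  # free the element once its values are copied out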

Lord Elrond
  • 13,430
  • 7
  • 40
  • 80

Assuming there is no issue with retrieving your XML from the URL (the link is not available on our end), your first lxml attempt can work if you parse actual nodes. Specifically, there is no <item> node in the XML document.

Instead, use link. And consider a nested list/dict comprehension to migrate the content into a data frame. With lxml you can swap findall for xpath to return the same result (a sketch of the xpath spelling follows the output below).

df = pd.DataFrame([{item.tag: item.text if item.text is not None and item.text.strip() != "" else item.findtext("*")
                    for item in lnk.findall("*")}
                   for lnk in root.findall('.//link')])
print(df)
#   categories  promotiontypes                                   offerdescription  ... advertiserid advertisername     network
# 0    Apparel  Percentage off  25% Off Boys Quiksilver Apparel. Shop now at M...  ...         3184        cys.com  US Network
# 1    Apparel  Percentage off  25% Off Boys' Quiksilver Apparel. Shop now at ...  ...         3184        cys.com  US Network
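For reference, a sketch of the same comprehension in the xpath spelling mentioned above (same root, same output):

df = pd.DataFrame([{item.tag: item.text if item.text is not None and item.text.strip() != "" else item.findtext("*")
                    for item in lnk.xpath("*")}
                   for lnk in root.xpath('//link')])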
  • Thanks. So I get AttributeError: 'NoneType' object has no attribute 'strip' – sunny babau Jul 24 '20 at 22:30
  • You might have `` nodes without child elements. See the edit filtering out `None` text in the inner `for` – Parfait Jul 25 '20 at 00:01
  • Sorry, I get . But at the same time, the solution below provided by @dabingsou works, which implies nothing is wrong with the URL. However, thank you for your support – sunny babau Jul 27 '20 at 04:22
  • I like your approach; not sure what the issue is here. Can you please help with parsing this one into a dataframe: http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200 – sunny babau Jul 27 '20 at 21:55
  • Did you try at all? Simply replace `.//link` with `.//item` for the eBay XML. As for the other issue, I don't know why you receive the error, but it has nothing to do with this code. The answer above works on your posted XML. Check the API instructions for accessing the URL. Maybe you have to send params/headers? – Parfait Jul 27 '20 at 22:12
  • Thank you. Yes, my code: import lxml.etree as ET; import urllib; response = urllib.request.urlopen('http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US'); xml = response.read(); root = ET.fromstring(xml); for item in root.findall('.//item'): df = pd.DataFrame([{item.tag: item.text if item.text.strip() != "" else item.find("*").text for item in lnk.findall("*") if item is not None} for lnk in root.findall('.//item')]); print(df) – sunny babau Jul 27 '20 at 22:19
  • Try putting the whole URL path including `https://...`. – Parfait Jul 27 '20 at 23:53
  • Yes sir, I did; no luck. I have posted my eBay question here and tagged you as well: https://stackoverflow.com/questions/63124680/feedparser-to-dataframe-doesnt-ouput-all-columns and https://stackoverflow.com/questions/63122779/python-parsing-html-from-url-into-pd-valueerror-no-tables-found?noredirect=1#comment111624853_63122779 – sunny babau Jul 28 '20 at 00:07
  • But an empty data frame means you do read the XML but are not parsing it correctly. This is different from a URL error. – Parfait Jul 28 '20 at 00:50