Python Parsing HTML from url into PD ValueError: No tables found

Question

I'm trying to parse the HTML below into a DataFrame, and I keep getting an error even though I can clearly see a table defined in the HTML. I'd appreciate your help.

<table><tr><td><a <table><tr><td><a

Error

ValueError: No tables found

My code

import pandas as pd 
url='http://rssfeeds.s3.amazonaws.com/goldbox?'
#dfs = pd.read_html(requests.get(url).text)
dfs = pd.read_html(url)
dfs[0].head()

I also tried feedparser, with no luck: I don't get any data.

import feedparser
import pandas as pd
import time

rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
    
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.dealUrl, post.discountPercentage))
df = pd.DataFrame(posts, columns=['title', 'dealUrl', 'discountPercentage'])
df.tail()

What you have posted as "below HTML" is not proper HTML. The ``, none of the elements are closed. For example `...` — AlpacaJones, Jul 27 '20 at 21:54
Looks like this guy has done the dataframe part. https://pythonprogramming.net/community/722/Building%20An%20RSS%20Scraper%20How%20do%20I%20get%20my%20data%20into%20a%20database/ — AlpacaJones, Jul 27 '20 at 22:05
@sunnybabau I posted an answer, but I don't know if it's what you want. — dabingsou, Jul 28 '20 at 06:22
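
A side note on the feedparser attempt in the question, offered as a guess rather than a confirmed diagnosis: in Python, parentheses around a single string do not create a tuple, so `rawrss` is a plain string and `for url in rawrss` walks over it one character at a time. Each character would then be passed to `feedparser.parse`, which could explain the empty result. The sketch below only demonstrates the string-versus-list behaviour; the name `feed_urls` is hypothetical.

```python
# Parentheses alone do not make a tuple; this is still a str.
rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
looped = [url for url in rawrss]  # iterates over single characters

# A list (or a tuple with a trailing comma) yields whole URLs instead.
feed_urls = ['http://rssfeeds.s3.amazonaws.com/goldbox']  # hypothetical name
looped_urls = [url for url in feed_urls]

print(type(rawrss).__name__, looped[:4])
print(looped_urls)
```

With a list, the loop body receives the full URL once, which is what the original loop appears to intend.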

dabingsou · Accepted Answer · 2020-07-29T02:18:38.213

The page contains so much data that the request tends to time out, so I set a long timeout. In addition, the content I got seems to be different from yours.

import pandas as pd
from simplified_scrapy import SimplifiedDoc, utils, req
html = req.get('http://rssfeeds.s3.amazonaws.com/goldbox',
               timeout=600)

posts = {'title': [], 'link': [], 'description': []}
doc = SimplifiedDoc(html)
items = doc.selects('item')
for item in items:
    posts['title'].append(item.title.text)
    posts['link'].append(item.link.text)
    posts['description'].append(item.description.text)

df = pd.DataFrame(posts)
df.tail()

Get data from description

posts = {'listPrice': [], 'dealPrice': [], 'expires': []}
doc = SimplifiedDoc(html)
descriptions = doc.selects('item').description # Get all descriptions
for table in descriptions:
    d = SimplifiedDoc(table.unescape()) # Using description to build a doc object
    img = d.img.src # Get the image src
    listPrice = d.getElementByText('List Price:')
    if listPrice:
        listPrice=listPrice.strike.text
    else: listPrice = ''

    dealPrice = d.getElementByText('Deal Price: ')
    if dealPrice:
        dealPrice = dealPrice.text[len('Deal Price: '):]
    else: dealPrice = ''

    expires = d.getElementByText('Expires ')
    if expires:
        expires = expires.text[len('Expires '):]
    else: expires = ''

    posts['listPrice'].append(listPrice)
    posts['dealPrice'].append(dealPrice)
    posts['expires'].append(expires)
df = pd.DataFrame(posts)
df.tail()
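
For comparison, here is a minimal standard-library sketch of the same idea, in case installing simplified_scrapy is not an option. It is not the code above and makes several assumptions: the `SAMPLE` fragment is hypothetical (shaped like an unescaped description, with made-up prices), and `CellCollector` and `extract_fields` are illustrative names. It collects the text of each `<td>` cell and matches the same prefixes the answer looks for.

```python
from html.parser import HTMLParser

# Hypothetical description fragment (already unescaped); values are made up.
SAMPLE = (
    "<table><tr><td><img src='https://example.com/p.jpg' alt='Product Image'/></td>"
    "<td><tr><td>Example Watch</td></tr>"
    "<tr><td>List Price: <strike>$129.95</strike></td></tr>"
    "<tr><td>Deal Price: $79.99</td></tr>"
    "<tr><td>Expires Jun 29, 2018</td></tr></td></tr></table>"
)


class CellCollector(HTMLParser):
    """Collects the text of each <td> cell and the first <img> src."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.image = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.cells.append('')  # start a new cell
        elif tag == 'img' and not self.image:
            self.image = dict(attrs).get('src', '')

    def handle_data(self, data):
        if self.cells:
            self.cells[-1] += data  # text belongs to the most recent cell


def extract_fields(description_html):
    """Return image, listPrice, dealPrice and expires ('' when absent)."""
    parser = CellCollector()
    parser.feed(description_html)
    parser.close()
    fields = {'image': parser.image, 'listPrice': '', 'dealPrice': '', 'expires': ''}
    prefixes = {'List Price:': 'listPrice', 'Deal Price:': 'dealPrice', 'Expires': 'expires'}
    for cell in parser.cells:
        text = cell.strip()
        for prefix, key in prefixes.items():
            if text.startswith(prefix):
                fields[key] = text[len(prefix):].strip()
    return fields


fields = extract_fields(SAMPLE)
print(fields)
```

Fields that are missing from a description stay as empty strings, mirroring the `else` branches in the answer.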

The page data I get is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Amazon.com Gold Box Deals</title>
    <link>http://www.amazon.com/gp/goldbox</link>
    <description>Amazon.com Gold Box Deals</description>
    <pubDate>Thu, 28 Jun 2018 08:50:16 GMT</pubDate>
    <dc:date>2018-06-28T08:50:16Z</dc:date>
    <image>
      <title>Amazon.com Gold Box Deals</title>
      <url>http://images.amazon.com/images/G/01/rcm/logo2.gif</url>
      <link>http://www.amazon.com/gp/goldbox</link>
    </image>
    <item>
      <title>Deal of the Day: Withings Activit? Steel - Activity and Sleep Tracking Watch</title>
      <link>https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20</link>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;a href="https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20" target="_blank"&gt;&lt;img src="https://images-na.ssl-images-amazon.com/images/I/41O4Qc3FCBL._SL160_.jpg" alt="Product Image" style='border:0'/&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;tr&gt;&lt;td&gt;Withings Activit? Steel - Activity and Sleep Tracking Watch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
      <pubDate>Thu, 28 Jun 2018 07:00:10 GMT</pubDate>
      <guid isPermaLink="false">http://promotions.amazon.com/gp/goldbox/</guid>
      <dc:date>2018-06-28T07:00:10Z</dc:date>
    </item>
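
In the feed shown above, the table markup inside `<description>` is entity-escaped (`&lt;table&gt;`), so the document contains no literal `<table>` tag, which seems consistent with `pd.read_html` reporting that no tables were found. As a rough illustration, the sketch below parses a small inline feed with the standard library's `xml.etree.ElementTree`; `SAMPLE_FEED` is a shortened, hypothetical stand-in for the real response and `rows` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the real feed response.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Amazon.com Gold Box Deals</title>
    <item>
      <title>Deal of the Day: Example Watch</title>
      <link>https://www.example.com/dp/EXAMPLE</link>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)

# One dict per <item>; the XML parser unescapes the description text.
rows = [
    {
        'title': item.findtext('title', default=''),
        'link': item.findtext('link', default=''),
        'description': item.findtext('description', default=''),
    }
    for item in root.iter('item')
]

print(rows[0]['description'])
```

If pandas is installed, `pd.DataFrame(rows)` turns this list of dicts into a DataFrame with one column per key.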

Correct. My goal is to be able to load this into a data frame. — sunny babau, Jul 28 '20 at 12:57
@sunnybabau I updated the answer and loaded the data into the data frame. — dabingsou, Jul 29 '20 at 00:23
Thank you. The issue is that we need to further extract elements from the description tag. Is that possible? Currently everything is lumped together in description. Is there a way to fetch the following fields from description: {Product Image, List Price, strike, Deal Price, Expires}? — sunny babau, Jul 29 '20 at 00:42
You're welcome. If you have any questions, just send me a message. — dabingsou, Jul 29 '20 at 04:09
Thank you, everything works. Just one issue: the image2255 column always comes back as NULL for eBay, even though the data for it exists (in fact, it always has values). Also, how about using the same array style for a JSON response? I have updated the example: https://stackoverflow.com/questions/63124680/feedparser-to-dataframe-doesnt-ouput-all-columns — sunny babau, Aug 01 '20 at 19:10
