16

I want to fetch data from another URL using urllib and Beautiful Soup. The data I need is inside a table tag (which I found using the Firefox console), but when I try to fetch the table by its id the result is None, so I guess the table must be added dynamically by some JS code.

I have tried both parsers, 'lxml' and 'html5lib', but I still can't get the table data.

I have also tried one more thing:

import urllib
from bs4 import BeautifulSoup

web = urllib.urlopen("my url")
html = web.read()
soup = BeautifulSoup(html, 'lxml')
js = soup.find("script")
ss = js.prettify()
print ss

Result:

<script type="text/javascript">
 myPage = 'ETFs';
        sectionId = 'liQuotes'; //section tab
        breadCrumbId = 'qQuotes'; //page
        is_dartSite = "quotes";
        is_dartZone = "news";
        propVar = "ETFs";
</script>

But now I don't know how to get the data out of these JS variables.

So I have two options: either get the table content or get the JS variables. Either one would fulfil my task, but unfortunately I don't know how to do either, so please tell me how I can solve one of these problems.

Thanks

Inforian
  • There's no point in guessing whether javascript is generating the table content - you need to confirm that first. Is the URL publicly accessible? If so, what is it? – mhawke Jun 09 '14 at 10:45
  • Yes, I confirm the table data is generated by JS code; you can check here: http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx – Inforian Jun 09 '14 at 10:53

2 Answers

21

EDIT

This will do the trick, using the re module to extract the data and then loading it as JSON:

import urllib
import json
import re
from bs4 import BeautifulSoup

web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
soup = BeautifulSoup(web.read(), 'lxml')
data = soup.find_all("script")[19].string  # the script tag containing table_body is hard-coded at index 19
p = re.compile('var table_body = (.*?);')
m = p.match(data)
stocks = json.loads(m.groups()[0])

>>> for stock in stocks:
...     print stock
... 
[u'ASPS', u'Altisource Portfolio Solutions S.A.', 116.96, 2.2, 1.92, 86635, u'N', u'N']
[u'AGNC', u'American Capital Agency Corp.', 23.76, 0.13, 0.55, 3184303, u'N', u'N']
.
.
.
[u'ZION', u'Zions Bancorporation', 29.79, 0.46, 1.57, 2154017, u'N', u'N']

The problem with this is that the script tag offset is hard-coded and there is not a reliable way to locate it within the page. Changes to the page could break your code.

ORIGINAL answer

Rather than trying to screen-scrape the data, you can download a CSV representation of the same data from http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download.

Then use the Python csv module to parse and process it. Not only is this more convenient, it will be a more resilient solution because any changes to the HTML could easily break your screen scraping code.
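For example, a minimal sketch of that approach, assuming the ?render=download URL above still returns a CSV file with a header row:

import csv
import urllib

# Download the CSV rendering of the stock list (assumes the ?render=download
# URL above still serves plain CSV)
web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download")
reader = csv.reader(web)

header = next(reader)  # first row contains the column names
for row in reader:
    if row:            # skip any blank trailing lines
        print row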

Otherwise, if you look at the actual HTML you will find that the data is available within the page in the following script tag:

<script type="text/javascript">var table_body = [["ATVI", "Activision Blizzard, Inc", 20.92, 0.21, 1.01, 6182877,  .1, "N", "N"],
["ADBE", "Adobe Systems Incorporated", 66.91, 1.44, 2.2, 3629837,  .6, "N", "N"],
["AKAM", "Akamai Technologies, Inc.", 57.47, 1.57, 2.81, 2697834,  .3, "N", "N"],
["ALXN", "Alexion Pharmaceuticals, Inc.", 170.2, 0.7, 0.41, 659817,  .1, "N", "N"],
["ALTR", "Altera Corporation", 33.82, -0.06, -0.18, 1928706,  .0, "N", "N"],
["AMZN", "Amazon.com, Inc.", 329.67, 6.1, 1.89, 5246300,  2.5, "N", "N"],
....
["YHOO", "Yahoo! Inc.", 35.92, 0.98, 2.8, 18705720,  .9, "N", "N"]];
mhawke
  • Actually, sorry, CSV is not available for the URL that you posted, in which case you will have to extract it from the javascript variable. It looks like this: ` – mhawke Jun 09 '14 at 12:46
  • Yes, I know I have to extract the array from that variable; that's why I asked how to get data from a JS variable using Beautiful Soup. Is there any way to do this? – Inforian Jun 09 '14 at 12:58
4

Just to add to @mhawke's answer: rather than hard-coding the offset of the script tag, you can loop through all the script tags and pick the one that matches your pattern:

import urllib
import json
import re
from bs4 import BeautifulSoup

web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
pattern = re.compile('var table_body = (.*?);')

soup = BeautifulSoup(web.read(), "lxml")
scripts = soup.find_all('script')
for script in scripts:
    match = pattern.match(str(script.string))  # str() guards against tags whose .string is None
    if match:
        stocks = json.loads(match.groups()[0])
        print stocks
parkerproject
  • `find_all()` also accepts [a `string` argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) (which was called `text` in earlier versions), so you could write `script = soup.find_all('script', string=pattern)` and iterate over those results. – Gregor Jan 01 '23 at 21:44
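A rough sketch of that variant, reusing the URL and regex from the answers above (on older Beautiful Soup versions you would pass text=pattern instead of string=pattern):

import urllib
import json
import re
from bs4 import BeautifulSoup

pattern = re.compile('var table_body = (.*?);')

web = urllib.urlopen("http://www.nasdaq.com/quotes/nasdaq-financial-100-stocks.aspx")
soup = BeautifulSoup(web.read(), 'lxml')

# find_all() applies the regex to each tag's string, so only the script
# tag that defines table_body is returned
for script in soup.find_all('script', string=pattern):
    stocks = json.loads(pattern.search(script.string).groups()[0])
    for stock in stocks:
        print stock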