I'm learning to program and I want to scrape a webpage without its JavaScript code. I'm following an example from a book. The code below should return just the HTML from the website, but it only returns the site's title and some JavaScript code at the bottom. Can someone please tell me where I went wrong? Cheers.

import urllib2 
from bs4 import BeautifulSoup

url = "http://www.theurl.com/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")

[x.extract() for x in soup.find_all('script')]

print soup.get_text()

This is what it returns after the title:

var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-11092338-1']);
      _gaq.push(['_trackPageview']);
      (function() {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
      })();
user2179119
  • possible duplicate of [BeatifulSoup4 get\_text still has javascript](http://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript) – Andy Jul 17 '15 at 17:06
  • Also: http://stackoverflow.com/questions/10524387/beautifulsoup-get-text-does-not-strip-all-tags-and-javascript – Andy Jul 17 '15 at 17:07
  • I visited the possible duplicate and tried the highest-voted code, but I still got JavaScript code. – user2179119 Jul 17 '15 at 17:16
  • @Andy Your answer did not resolve my problem and considering I'm a newbie programmer, it's so unkind of you to strip some points off me and not even bother to help answer the question! – user2179119 Jul 17 '15 at 17:26
  • Before you go accusing someone of downvoting you, you should be aware that voting is anonymous. You are making a very big assumption that because I posted a comment, I also downvoted you. In this case, [I did not downvote you](http://i.imgur.com/OK9TMzd.png). I *did* provide you with possible alternatives. If they did not work, you are free to do exactly what you did: say they don't work. Finally, "newbie"-ness is NOT a reason for holding back on votes. I suggest you edit your question to explain why those suggested duplicates don't work. You'll get better answers that way. – Andy Jul 17 '15 at 17:34
  • @Andy Fair point - my apologies, thanks for trying to help. – user2179119 Jul 17 '15 at 17:56

1 Answer

Have you tried printing soup.contents? soup.get_text() returns only the text of the document, not its HTML markup. Please try the following code, which prints the parsed contents instead.

import urllib2 
from bs4 import BeautifulSoup

url = "http://www.theurl.com/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")

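# Remove every <script> element from the parse tree.
for script in soup.find_all('script'):
    script.extract()

# soup.contents holds the top-level nodes; printing each one shows the remaining HTML.
for node in soup.contents:
    print node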
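
For comparison, here is a minimal offline sketch of the same idea, assuming Python 3 with the bs4 package installed. SAMPLE_HTML and strip_scripts are hypothetical names made up for this illustration, and the inline string only stands in for the downloaded page. It shows the difference between the remaining markup and the output of get_text() once the script tags are gone.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """<html><head><title>Example</title>
<script>var _gaq = _gaq || [];</script></head>
<body><p>Hello, world.</p>
<script type="text/javascript">_gaq.push(['_trackPageview']);</script>
</body></html>"""


def strip_scripts(html):
    """Return a soup of `html` with every <script> element removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script"):
        tag.decompose()  # remove the tag and its contents from the tree
    return soup


soup = strip_scripts(SAMPLE_HTML)
remaining_html = str(soup)   # markup that is left, tags included
text_only = soup.get_text()  # text nodes only, no tags

print(remaining_html)
print(text_only)
```

To run this against a live page in Python 3, urllib.request.urlopen(url).read() takes the place of urllib2.urlopen(url). If JavaScript still shows up in the output, one thing that may be worth trying (an assumption, not something verified against the page in the question) is passing a different parser name such as "lxml" or "html5lib" to BeautifulSoup, since parsers can build different trees from malformed markup.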
Sam Al-Ghammari