
I'm trying to strip all of the HTML and JavaScript from a page using bs4; however, it doesn't get rid of the JavaScript. I still see it there with the text. How can I get around this?

I tried using nltk, which works fine; however, clean_html and clean_url are deprecated and will be removed in future releases. Is there a way to use soup's get_text and get the same result?

I tried looking at other questions, such as this one:

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I'm using nltk's deprecated functions.

EDIT

Here's an example:

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()

I still see the following for CNN:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the js?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it's really slow at times and creates noticeable lag, which is one thing nltk was always very good at.


2 Answers


Based partly on Can I remove script tags with BeautifulSoup?

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
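
Note that this snippet (like the code in the question) is written for Python 2, where urllib.urlopen exists and print is a statement. A minimal Python 3 sketch of the same approach, assuming bs4 is installed, might look like this:

from urllib.request import urlopen  # urllib.urlopen moved here in Python 3
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # naming the parser avoids the bs4 warning

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text, then normalize whitespace exactly as above
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)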
  • instead of `script.extract()`, it's better to use `script.decompose()` which only deletes without returning the tag object. – Aminah Nuraini Nov 18 '16 at 18:31
  • That is a lot of data structures you're building just so you don't have to write `re.sub("[ \n\r\t]{2,}", " ", text)` :) – badp May 28 '18 at 15:11
  • @badp Especially that you can just say `soup.get_text(" ", strip=True)` ? – Csaba Toth Jul 12 '18 at 09:09
  • @CsabaToth, @badp, you actually don't want to use `strip=True` because it will cause strings to be concatenated incorrectly. It's important to preserve them, then use `splitlines`, then sanitize each individual string. – Riley Steele Parsons Mar 31 '19 at 18:09
  • @HughBothwell, still not able to stop script tags completely Eg. [here](https://pastebin.com/h0uV3b8V) – Anu Feb 07 '20 at 07:08
  • For people curious, `soup(['script', 'style'])` is identical to `soup.find_all(['script', 'style'])`, as explained in `help(soup.__call__)`. – whatacold Dec 27 '21 at 23:48
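
As the comments above suggest, much of the whitespace handling can be pushed into get_text itself. A shorter sketch of that variant, with the caveat from the thread that strip=True may join adjacent strings in ways you don't want:

# soup(["script", "style"]) is shorthand for soup.find_all(["script", "style"])
for script in soup(["script", "style"]):
    script.decompose()

# let get_text insert a separator and strip whitespace itself
text = soup.get_text(" ", strip=True)
print(text)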

To prevent encoding errors at the end...

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"    # any URL; using the one from the question
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
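
On Python 3 the trailing encode call is usually unnecessary, since print handles Unicode directly. If you're writing the result to a file instead, a short sketch (the file name here is just an example):

# Python 3: open the output file with an explicit encoding instead of
# encoding the string by hand
with open("page_text.txt", "w", encoding="utf-8") as f:  # illustrative file name
    f.write(text)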