
I'm trying to strip all of the HTML and JavaScript from a page using bs4; however, it doesn't get rid of the JavaScript. I still see it there with the text. How can I get around this?

I tried using nltk, which works fine; however, clean_html and clean_url are deprecated and will be removed in future releases. Is there a way to use soup's get_text and get the same result?

I tried looking at other questions, such as this one:

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I'm using nltk's deprecated functions.

EDIT

Here's an example:

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()

I still see the following for CNN:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the js?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it's really slow at times and creates noticeable lag, which is one thing nltk was always very good at.


2 Answers


Based partly on Can I remove script tags with BeautifulSoup?

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
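
Note that this snippet (like the code in the question) is written for Python 2, where urllib.urlopen exists and print is a statement. A minimal Python 3 sketch of the same approach, assuming bs4 is installed, might look like this:

from urllib.request import urlopen  # urllib.urlopen moved here in Python 3
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")  # naming the parser avoids the bs4 warning

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text, then normalize whitespace exactly as above
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)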
  • instead of `script.extract()`, it's better to use `script.decompose()` which only deletes without returning the tag object. – Aminah Nuraini Nov 18 '16 at 18:31
  • That is a lot of data structures you're building just so you don't have to write `re.sub("[ \n\r\t]{2,}", " ", text)` :) – badp May 28 '18 at 15:11
  • @badp Especially that you can just say `soup.get_text(" ", strip=True)` ? – Csaba Toth Jul 12 '18 at 09:09
  • @CsabaToth, @badp, you actually don't want to use `strip=True` because it will cause strings to be concatenated incorrectly. It's important to preserve them, then use `splitlines`, then sanitize each individual string. – Riley Steele Parsons Mar 31 '19 at 18:09
  • @HughBothwell, still not able to stop script tags completely Eg. [here](https://pastebin.com/h0uV3b8V) – Anu Feb 07 '20 at 07:08
  • For people curious, `soup(['script', 'style'])` is identical to `soup.find_all(['script', 'style'])`, as explained in `help(soup.__call__)`. – whatacold Dec 27 '21 at 23:48
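
As the comments above suggest, much of the whitespace handling can be pushed into get_text itself. A shorter sketch of that variant, with the caveat from the thread that strip=True may join adjacent strings in ways you don't want:

# soup(["script", "style"]) is shorthand for soup.find_all(["script", "style"])
for script in soup(["script", "style"]):
    script.decompose()

# let get_text insert a separator and strip whitespace itself
text = soup.get_text(" ", strip=True)
print(text)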

To prevent encoding errors at the end...

import urllib
from bs4 import BeautifulSoup

url = "http://www.cnn.com"    # any URL; using the one from the question
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
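
On Python 3 the trailing encode call is usually unnecessary, since print handles Unicode directly. If you're writing the result to a file instead, a short sketch (the file name here is just an example):

# Python 3: open the output file with an explicit encoding instead of
# encoding the string by hand
with open("page_text.txt", "w", encoding="utf-8") as f:  # illustrative file name
    f.write(text)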