I keep getting an AttributeError, and none of the solutions I've been able to find deal with '_all_strings'.

I want to write a web crawler, but the page I'm working with has a lot of unwanted noise at the top and bottom. As a first step towards excluding it, I'm trying to strip the HTML markup and get the plain text.

When I run the code below, the last line raises an AttributeError:

from __future__ import division
from urllib.request import urlopen
from bs4 import BeautifulSoup

textSource = 'http://celt.ucc.ie/irlpage.html'
html = urlopen(textSource).read()
raw = BeautifulSoup.get_text(html)

This is the full Traceback I get:

Traceback (most recent call last):
  File "...Crawler_Celt_Namelink_Test.py", line 7, in <module>
    raw = BeautifulSoup.get_text(html)
  File "...Python\Python35\lib\site-packages\bs4\element.py", line 950, in get_text
    return separator.join([s for s in self._all_strings(
AttributeError: 'bytes' object has no attribute '_all_strings'

Has anybody encountered this error before? Or can anyone suggest how I can overcome it, please?

AdeDoyle

1 Answer

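get_text is a method of a parsed BeautifulSoup object. In your code it is called on the class and handed the raw bytes returned by urlopen(...).read(), so the bytes object ends up as self, which is why the error says 'bytes' has no attribute '_all_strings'. As the BeautifulSoup docs show, you need to parse the HTML into a soup object first and then get the text from that: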

from urllib.request import urlopen
from bs4 import BeautifulSoup
textSource = 'http://celt.ucc.ie/irlpage.html'
html = urlopen(textSource).read()

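# Parse the downloaded bytes into a BeautifulSoup object first
soup = BeautifulSoup(html, 'html.parser')

# Then call get_text on the parsed object, not on the raw bytes
raw = soup.get_text()
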
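
To see why the message mentions 'bytes', here is a minimal sketch that uses only the standard library. The Doc class is a hypothetical stand-in I made up for illustration, not bs4's real implementation; it only mimics the get_text / _all_strings relationship visible in your traceback.

```python
class Doc:
    """Hypothetical stand-in for a parsed document; not bs4's real class."""

    def __init__(self, strings):
        self._strings = list(strings)

    def _all_strings(self):
        # Return an iterator over every text fragment stored in this document.
        return iter(self._strings)

    def get_text(self, separator=""):
        # Relies on `self` being a Doc; any other type lacks _all_strings.
        return separator.join(self._all_strings())


html = b"<p>Hello</p>"

# Calling through the class passes `html` as `self`, so the lookup fails.
try:
    Doc.get_text(html)
except AttributeError as exc:
    error_message = str(exc)

# Building an instance first gives get_text the object it expects.
doc = Doc(["Hello", "world"])
text = doc.get_text(" ")
```

The failing call has the same shape as BeautifulSoup.get_text(html) in the question: whatever you pass as the first argument becomes self, and a bytes object has no _all_strings method.
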
susitsm