
I made a web crawler using Python and everything runs fine until it gets to this section of the code:

    # Use BeautifulSoup modules to format web page as text that can
    # be parsed and indexed
    #
    soup = bs4.BeautifulSoup(response, "html.parser")
    tok = "".join(soup.findAll("p", text=re.compile(".")))
    # pass the text extracted from the web page to the parsetoken routine for indexing
    parsetoken(db, tok)
    documents += 1

The error I get is `TypeError: sequence item 0: expected str instance, Tag found`, raised on the `tok` line of the code.
I think my syntax could be the issue, but I am not sure. How can I fix this?

joshkmartinez
xhenier
  • what you are passing to `''.join` is not an iterable of strings, which it must be. `soup.findAll` returns a sequence of some type of custom objects I can only assume – juanpa.arrivillaga Jan 02 '19 at 18:13
  • You probably need `tok = "".join([x.text for x in soup.findAll("p", text=re.compile("."))])` – C.Nivs Jan 02 '19 at 18:17

1 Answer


There are a few issues here:

  • First, I'm not sure where you're getting `response` from, but it should be a string of actual HTML. Make sure you're not just capturing a response/status code from the site that tells you whether the request was successful.
  • More importantly, note that `findAll` returns a list of BeautifulSoup `Tag` objects, not a list of strings, so `join` doesn't know what to do with them: it looks at the first item in the list, sees that it isn't a string, and fails with the "expected str instance" error. The good news is that you can use `.text` to extract the actual text from a given `<p>` element, as shown in the sketch after this list.
  • Even once you use `.text` to extract the text from every `<p>` element, on Python 2 the `join()` may still fail if your list mixes `unicode` and `str` values, so you may need to encode everything to the same type before joining.
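Applied to the snippet in the question, the fix might look roughly like this (a sketch only; it assumes `response` already holds the page's HTML and that `db`, `parsetoken`, and `documents` are defined elsewhere in your crawler, as in the original code):

    import re
    import bs4

    soup = bs4.BeautifulSoup(response, "html.parser")
    # findAll returns Tag objects, so pull the text out of each one
    # before joining; that way join() only ever sees strings
    paragraphs = soup.findAll("p", text=re.compile("."))
    tok = " ".join(p.get_text() for p in paragraphs)
    parsetoken(db, tok)
    documents += 1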

Here's an example I did using this very page:

>>> import bs4, re
>>> import urllib2
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text.encode('utf-8') for t in L]
>>> print(" ".join(M))

This prints the combined text of everything found in a `<p>` tag.

EDIT: This example was on Python 2.7.x. For 3.x, drop the ".encode('utf-8')".
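For reference, a rough Python 3 equivalent of the same session might look like this (a sketch, using `urllib.request` in place of `urllib2`; it assumes the page is still reachable and is returned without needing extra request headers):

>>> import bs4, re
>>> import urllib.request
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib.request.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text for t in L]  # .text is already str on Python 3, no encode needed
>>> print(" ".join(M))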

Bill M.
  • This is Python 3, no need for `text.encode('utf-8')` – juanpa.arrivillaga Jan 02 '19 at 18:46
  • This will *not* work on Python 3, `.encode` returns `bytes` objects, and you are trying to join using a `str` object, i.e. `" ".join`, this will throw a type error. You could do `b" ".join(...)`, but then, why would you *want* a bytes object in Python 3? Look, if Python 2 and 3 could easily be written to handle the issue of unicode strings vs bytes strings, then there would have *been no Python 2 and 3*. But otherwise, this is correct. – juanpa.arrivillaga Jan 02 '19 at 18:51
  • OK, I've updated it. Now go back to "pulling out your hair", Juanpa. – Bill M. Jan 02 '19 at 18:55