
I made a web crawler using Python and everything runs fine until it gets to this section of the code:

    # Use BeautifulSoup modules to format web page as text that can
    # be parsed and indexed
    #
    soup = bs4.BeautifulSoup(response, "html.parser")
    tok = "".join(soup.findAll("p", text=re.compile(".")))
    # pass the text extracted from the web page to the parsetoken routine for indexing
    parsetoken(db, tok)
    documents += 1

The error I get is `TypeError: sequence item 0: expected str instance, Tag found`, raised on the `tok` line of the code.
I think my syntax could be the issue, but I am not sure. How can I fix this?

joshkmartinez
xhenier
  • what you are passing to `''.join` is not an iterable of strings, which it must be. `soup.findAll` returns a sequence of some type of custom objects I can only assume – juanpa.arrivillaga Jan 02 '19 at 18:13
  • You probably need `tok = "".join([x.text for x in soup.findAll("p", text=re.compile("."))])` – C.Nivs Jan 02 '19 at 18:17

1 Answer


There are a few issues here:

  • First, I'm not sure where you're getting `response` from, but it should be a string of actual HTML. Make sure you're not just capturing a response/status code from the site that tells you whether the request was successful.
  • More importantly, note that `findAll` returns a list of BeautifulSoup `Tag` objects, not a list of strings, so `join` doesn't know what to do with them: it looks at the first item in the list, sees that it isn't a string, and fails with the "expected str instance" error. The good news is that you can use `.text` to extract the actual text from a given `<p>` element, as shown in the sketch after this list.
  • Even once you use `.text` to extract the text from every `<p>` element, on Python 2 the `join()` may still fail if your list mixes `unicode` and `str` values, so you may need to encode everything to the same type before joining.
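Applied to the snippet in the question, the fix might look roughly like this (a sketch only; it assumes `response` already holds the page's HTML and that `db`, `parsetoken`, and `documents` are defined elsewhere in your crawler, as in the original code):

    import re
    import bs4

    soup = bs4.BeautifulSoup(response, "html.parser")
    # findAll returns Tag objects, so pull the text out of each one
    # before joining; that way join() only ever sees strings
    paragraphs = soup.findAll("p", text=re.compile("."))
    tok = " ".join(p.get_text() for p in paragraphs)
    parsetoken(db, tok)
    documents += 1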

Here's an example I did using this very page:

>>> import bs4, re
>>> import urllib2
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text.encode('utf-8') for t in L]
>>> print(" ".join(M))

This prints the combined text of everything found in a `<p>` tag.

EDIT: This example was on Python 2.7.x. For 3.x, drop the ".encode('utf-8')".
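For reference, a rough Python 3 equivalent of the same session might look like this (a sketch, using `urllib.request` in place of `urllib2`; it assumes the page is still reachable and is returned without needing extra request headers):

>>> import bs4, re
>>> import urllib.request
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib.request.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text for t in L]  # .text is already str on Python 3, no encode needed
>>> print(" ".join(M))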

Bill M.
  • This is Python 3, no need for `text.encode('utf-8')` – juanpa.arrivillaga Jan 02 '19 at 18:46
  • This will *not* work on Python 3, `.encode` returns `bytes` objects, and you are trying to join using a `str` object, i.e. `" ".join`, this will throw a type error. You could do `b" ".join(...)`, but then, why would you *want* a bytes object in Python 3? Look, if Python 2 and 3 could easily be written to handle the issue of unicode strings vs bytes strings, then there would have *been no Python 2 and 3*. But otherwise, this is correct. – juanpa.arrivillaga Jan 02 '19 at 18:51
  • OK, I've updated it. Now go back to "pulling out your hair", Juanpa. – Bill M. Jan 02 '19 at 18:55