There are a few issues here:
- First, I'm not sure where you're getting `response` from, but it should be a string of actual HTML. Make sure you're not just capturing a response *code* from scraping a site, which only tells you whether the request succeeded.
- More importantly, note that `findAll` returns a list of BeautifulSoup `Tag` objects, not a list of strings, so `join` doesn't know what to do with them. It looks at the first object in the list, sees that it's not a string, and this is why it errors out with the complaint that it "expected str instance". The good news is you can use `.text` to extract the actual text from a given `<p>` element.
- Though even if you do use `.text` to extract the actual text from every `<p>` object, your `join()` may still fail if your list is a mix of `unicode` and `str` objects. So you may have to do some encoding tricks to get everything into the same type before you join.
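A minimal sketch of the failure and the fix (assuming `bs4` is installed; the HTML string here is made up for illustration):

```python
import bs4

html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = bs4.BeautifulSoup(html, "html.parser")

paragraphs = soup.findAll("p")  # a list of Tag objects, not strings

# " ".join(paragraphs) raises:
#   TypeError: sequence item 0: expected str instance, Tag found

# Pulling out .text first gives join() plain strings to work with:
combined = " ".join(p.text for p in paragraphs)
print(combined)  # First paragraph. Second paragraph.
```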
Here's an example I did using this very page:
```python
>>> import bs4, re
>>> import urllib2
>>> url = "https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python"
>>> html = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(html, "html.parser")
>>> L = soup.findAll("p", text=re.compile("."))
>>> M = [t.text.encode('utf-8') for t in L]
>>> print(" ".join(M))
```
This prints the combined text of everything found in a `<p>` tag.
EDIT: This example was run on Python 2.7.x. For 3.x, use `urllib.request` in place of `urllib2` and drop the `.encode('utf-8')`.
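For reference, here is a sketch of the same pipeline on Python 3 (it assumes `bs4` is installed; a `data:` URL stands in for a real page so the snippet runs without network access):

```python
import bs4
import urllib.request

# A data: URL stands in for a real page so this runs offline;
# swap in any http(s) URL in practice.
url = "data:text/html,<html><body><p>Alpha.</p><p>Beta.</p></body></html>"
html = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(html, "html.parser")

# No .encode('utf-8') needed on Python 3: .text is already str.
combined = " ".join(p.text for p in soup.findAll("p"))
print(combined)  # Alpha. Beta.
```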