Questions tagged [beautifulsoup]

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of this package is version 4, imported as bs4.

Beautiful Soup is a Python library for parsing HTML and XML files, which is useful in web scraping. It can use Python's standard HTML parser as well as other parsers such as lxml or html5lib. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
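As a rough illustration of the navigating, searching, and modifying mentioned above, here is a minimal sketch. The sample markup and variable names are made up for this example, and it assumes the built-in "html.parser" is an acceptable parser choice:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page.
html = """
<html><body>
  <h1 id="title">Hello</h1>
  <p class="intro">First <b>bold</b> paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

# "html.parser" is Python's standard parser; "lxml" or "html5lib"
# can be named instead if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access returns the first matching descendant tag.
heading = soup.body.h1
heading_text = heading.get_text()

# Searching: find() returns the first match, find_all() returns every match.
intro = soup.find("p", class_="intro")
paragraph_count = len(soup.find_all("p"))

# Modifying: rename a tag and change an attribute in place.
intro.b.name = "strong"
heading["id"] = "main-title"

print(heading_text)
print(paragraph_count)
print(intro)
```

The changes are made on the parse tree itself, so converting the soup back to a string afterwards reflects the renamed tag and the new id.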

Beautiful Soup 4 (commonly known as bs4, after the name of its Python module) is the latest version of Beautiful Soup, and is mostly backwards-compatible with Beautiful Soup 3. Beautiful Soup is published under the MIT License.

Since version 4.7.0, Beautiful Soup supports a wide range of CSS4 selectors, adding to its already rich collection of tools for selecting HTML/XML elements. The supported CSS selectors and pseudo-classes are described in the documentation of the soupsieve library, which bs4 uses for selector support.
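A short sketch of CSS selection through select() and select_one() may help here. The markup is invented for illustration, and the example assumes a Beautiful Soup release of 4.7.0 or later with soupsieve installed alongside it:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<ul id="menu">
  <li class="item"><a href="/home">Home</a></li>
  <li class="item active"><a href="/docs">Docs</a></li>
  <li class="item"><a href="https://example.com/blog">Blog</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector.
links = [a["href"] for a in soup.select("ul#menu > li.item > a")]

# select_one() returns only the first match (or None if nothing matches).
active = soup.select_one("li.active a").get_text()

# Attribute selector: links whose href starts with "https://".
external = [a.get_text() for a in soup.select('a[href^="https://"]')]

# Pseudo-classes: the last list item, and items without the "active" class.
last = soup.select_one("li:last-child a").get_text()
inactive = [li.a.get_text() for li in soup.select("li.item:not(.active)")]

print(links)
print(active, external, last, inactive)
```

Both methods accept the same selector strings; which one to use depends on whether one element or a list of elements is wanted.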

To install the latest version with pip, use pip install beautifulsoup4. The library is then imported into a project like this: from bs4 import BeautifulSoup
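Once the package is installed, a first script might look like the sketch below. The HTML string is a made-up example, and naming "html.parser" explicitly is just one reasonable choice of parser:

```python
from bs4 import BeautifulSoup

# Hypothetical document used only to show the import in action.
html = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"

# Passing the parser name explicitly keeps the choice of parser
# the same across machines with different parsers installed.
soup = BeautifulSoup(html, "html.parser")

# .title is the first <title> tag; .string is its single text child.
title = soup.title.string

print(title)
print(soup.prettify())
```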

Notice: Beautiful Soup 3 works only on Python 2.x, while Beautiful Soup 4 works on both Python 2 (2.7+) and Python 3. Beautiful Soup 4 releases after 4.9.3 support Python 3 only.

32305 questions
79
votes
18 answers

Converting html to text with Python

I am trying to convert an html block to text using Python. Input:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean…

Aaron Bandelli
  • 1,238
  • 2
  • 14
  • 16
78
votes
4 answers

Using BeautifulSoup to search HTML for string

I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org When I used: find_string = soup.body.findAll(text='Python'), find_string…
kachilous
  • 2,499
  • 11
  • 42
  • 56
77
votes
8 answers

BeautifulSoup innerhtml?

Let's say I have a page with a div. I can easily get that div with soup.find(). Now that I have the result, I'd like to print the WHOLE innerhtml of that div: I mean, I'd need a string with ALL the html tags and text all together, exactly like the…
Matteo Monti
  • 8,362
  • 19
  • 68
  • 114
75
votes
1 answer

Beautifulsoup : Difference between .find() and .select()

When you use BeautifulSoup to scrape a certain part of a website, you can use soup.find() and soup.findAll() or soup.select(). Is there a difference between the .find() and the .select() methods (e.g. in performance or flexibility)? Or are…
Dieter
  • 2,499
  • 1
  • 23
  • 41
75
votes
3 answers

BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article. This is what I have so far: from BeautifulSoup import BeautifulSoup import…
Darren Wadley
  • 761
  • 1
  • 5
  • 5
73
votes
3 answers

Using BeautifulSoup to find a HTML tag that contains certain text

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2>this is cool #12345678901</h2>

So, the previous would match by using: soup('h2',text=re.compile(r' #\S{11}')) And the results would be…
sotangochips
  • 2,700
  • 6
  • 28
  • 38
73
votes
6 answers

What should I use to open a url instead of urlopen in urllib3

I wanted to write a piece of code like the following: from bs4 import BeautifulSoup import urllib2 url = 'http://www.thefamouspeople.com/singers.php' html = urllib2.urlopen(url) soup = BeautifulSoup(html) But I found that I have to install urllib3…
niloofar
  • 2,244
  • 5
  • 23
  • 44
73
votes
4 answers

How to get rid of BeautifulSoup user warning?

After I installed BeautifulSoup, whenever I run Python from the command line, this warning comes out: D:\Application\python\lib\site-packages\beautifulsoup4-4.4.1-py3.4.egg\bs4\__init__.py:166: UserWarning: No parser was explicitly specified,…
jellyfishhuang
  • 809
  • 1
  • 6
  • 6
73
votes
3 answers

Python BeautifulSoup give multiple tags to findAll

I'm looking for a way to use findAll to get two tags, in the order they appear on the page. Currently I have: import requests import BeautifulSoup def get_soup(url): request = requests.get(url) page = request.text soup =…
DasSnipez
  • 2,182
  • 4
  • 20
  • 29
71
votes
2 answers

UnicodeEncodeError: 'ascii' codec can't encode character at special name

My Python (version 2.7) script runs well when getting company names from local HTML files, but when it comes to some specific country names, it gives this error: "UnicodeEncodeError: 'ascii' codec can't encode character". I especially get the error when…
rhb1
  • 753
  • 1
  • 7
  • 8
69
votes
4 answers

Extract the 'src' attribute from an 'img' tag using Beautiful Soup

Consider: I want to extract the source (i.e., src) attribute from an image (i.e., img) tag using Beautiful Soup. I use Beautiful Soup 4, and I cannot…
iDelusion
  • 775
  • 1
  • 8
  • 9
67
votes
8 answers

beautifulsoup, html5lib: module object has no attribute _base

When I updated my packages, I got this new error: class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder): AttributeError: 'module' object has no attribute '_base' I tried to update beautifulsoup, but it made no difference. How can I fix…
Ehvince
  • 17,274
  • 7
  • 58
  • 79
67
votes
5 answers

Get meta tag content property with BeautifulSoup and Python

I am trying to use python and beautiful soup to extract the content part of the tags below: I'm…
the_t_test_1
  • 1,193
  • 1
  • 12
  • 28
66
votes
4 answers

Using BeautifulSoup to extract text without tags

My webpage looks like this:

YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''

myloginid
  • 1,463
  • 2
  • 22
  • 37
63
votes
3 answers

How to write the output to html file with Python BeautifulSoup

I modified an HTML file by removing some of the tags using BeautifulSoup. Now I want to write the results back to an HTML file. My code: from bs4 import BeautifulSoup from bs4 import Comment soup =…
Kim Hyesung
  • 727
  • 1
  • 6
  • 13