Questions tagged [beautifulsoup]

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of this package is version 4, imported as bs4.

Beautiful Soup is a Python library for parsing HTML and XML files, which is useful in web scraping. It can use Python's standard HTML parser as well as other parsers such as lxml or html5lib. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Beautiful Soup 4 (commonly known as bs4, after the name of its Python module) is the latest version of Beautiful Soup, and is mostly backwards-compatible with Beautiful Soup 3. Beautiful Soup is published under the MIT License.

Since version 4.7.0, Beautiful Soup supports a wide range of CSS4 selectors, adding to its already rich collection of tools for selecting HTML/XML elements. The supported selectors and pseudo-classes are described in the documentation of the soupsieve library, which bs4 uses for CSS selection.
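A minimal sketch of CSS selection with select() and select_one(). The markup, class names, and URLs are made up for illustration, and the standard-library html.parser backend is assumed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div class="events">
  <a href="/a">First</a>
  <a href="https://example.com/b">Second</a>
  <span>No link here</span>
</div>
<div class="other"><a href="/c">Third</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: every <a> inside div.events.
event_links = [a["href"] for a in soup.select("div.events a")]

# Attribute prefix selector: only links whose href starts with https://.
absolute = [a.get_text() for a in soup.select('a[href^="https://"]')]

# :not() pseudo-class: links inside any div that lacks the "events" class.
outside = [a.get_text() for a in soup.select("div:not(.events) a")]

# select_one() returns the first match, or None when nothing matches.
second = soup.select_one("div.events a:nth-of-type(2)")

print(event_links, absolute, outside, second.get_text())
```

select() always returns a list, so an empty result is an empty list, not an error.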

To install the latest version with pip, run pip install beautifulsoup4. The library is then imported into a project with: from bs4 import BeautifulSoup
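Once installed, a typical session parses a document, navigates and searches the tree, and optionally modifies it. The sketch below assumes the built-in html.parser backend and uses an invented HTML string:

```python
from bs4 import BeautifulSoup

# Invented sample document for illustration only.
html = (
    "<html><body>"
    "<h1 id='title'>Hello</h1>"
    "<p class='intro'>First <b>bold</b> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

# "html.parser" ships with Python; "lxml" or "html5lib" can be passed
# instead if those packages are installed.
soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access returns the first tag with that name.
heading = soup.h1.get_text()

# Searching: find_all() returns every match, find() only the first.
paragraphs = soup.find_all("p")
intro_text = soup.find("p", class_="intro").get_text()

# Modifying: replace a tag's text and change one of its attributes.
soup.h1.string = "Goodbye"
soup.h1["id"] = "farewell"

print(heading, len(paragraphs), intro_text, soup.h1)
```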

Notice: Beautiful Soup 3 works only on Python 2.x, while Beautiful Soup 4 was written to work on both Python 2 (2.7+) and Python 3. Releases after 4.9.3 support Python 3 only.

32305 questions

7 answers

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn…

asked Dec 17 '09 at 14:08

Monika Sulik

12 answers

Remove a tag using BeautifulSoup but keep its contents

Currently I have code that does something like this: soup = BeautifulSoup(value) for tag in soup.findAll(True): if tag.name not in VALID_TAGS: tag.extract() soup.renderContents() Except I don't want to throw away the contents inside…

python beautifulsoup

asked Nov 19 '09 at 19:19

Jason Christa
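One possible approach to the question above, sketched with the bs4 API (find_all() and unwrap() in place of the Beautiful Soup 3 names in the excerpt). The contents of VALID_TAGS and the sample value are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist; the question does not show its real contents.
VALID_TAGS = {"p", "b"}

# Invented input for illustration.
value = "<p>Keep <i>this <b>nested</b> text</i> please</p>"
soup = BeautifulSoup(value, "html.parser")

# find_all(True) matches every tag. unwrap() removes the tag itself
# but leaves its children in place, unlike extract() or decompose().
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()

cleaned = str(soup)
print(cleaned)
```

Here the <i> element is dropped while its text and the nested <b> element survive.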

2 answers

How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?

I am new to Python and am learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug…

python css css-selectors beautifulsoup firebug

asked Jul 17 '14 at 10:48

Flecha

2 answers

Rendered HTML to plain text using Python

I'm trying to convert a chunk of HTML text with BeautifulSoup. Here is an example:

Some text more text even more text

list item
yet another list…

python beautifulsoup

asked Nov 12 '12 at 02:06

btatarov

4 answers

Beautifulsoup - nextSibling

I'm trying to get the content "My home address" using the following but got the AttributeError: address = soup.find(text="Address:") print address.nextSibling This is my HTML: Address: My home address What is a good way to…

python beautifulsoup

asked May 14 '11 at 04:09

ready

10 answers

How can I get href links from HTML using Python?

import urllib2 website = "WEBSITE" openwebsite = urllib2.urlopen(website) html = getwebsite.read() print html So far so good. But I want only href links from the plain text HTML. How can I solve this problem?

python html hyperlink beautifulsoup href

asked Jun 19 '10 at 12:58

user371012
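A minimal sketch of collecting href values with Beautiful Soup, as one way to approach the question above. It parses an invented HTML string instead of fetching a page; the base URL is a placeholder, and on Python 3 the download step would use urllib.request in place of urllib2:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented HTML standing in for a downloaded page.
html = """
<p><a href="https://example.com/">Home</a>
<a name="anchor-only">No href</a>
<a href="/about">About</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Placeholder base URL, used to resolve relative links.
base_url = "https://example.com/page"
absolute_links = [urljoin(base_url, href) for href in links]

print(links)
print(absolute_links)
```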

3 answers

How can I parse a website using Selenium and Beautifulsoup in python?

New to programming and figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now but not sure where to start. Can someone hold my hand a sec and point me in the right direction? Any help appreciated -

python selenium beautifulsoup

asked Dec 19 '12 at 20:06

twitch after coffee

4 answers

Deleting a div with a particular class using BeautifulSoup

I want to delete a specific div from the soup object. I am using Python 2.7 and bs4. According to the documentation, we can use div.decompose(), but that would delete all the divs. How can I delete a div with a specific class?

python python-2.7 beautifulsoup

asked Aug 18 '15 at 05:10

Riken Shah
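A sketch of one way to do this: decompose() acts on a single tag, so restricting the search to a class removes only the matching divs. The class names and markup are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented markup; "ad" is a hypothetical class name.
html = (
    '<div class="keep">A</div>'
    '<div class="ad banner">B</div>'
    '<div class="keep">C</div>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches any div whose class list contains "ad",
# including multi-class values such as "ad banner".
for div in soup.find_all("div", class_="ad"):
    div.decompose()  # removes the tag and everything inside it

remaining = [d.get_text() for d in soup.find_all("div")]
print(remaining)
```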

3 answers

Selenium versus BeautifulSoup for web scraping

I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize on Python but I saw that the website had a button that created content via JavaScript so I decided to use Selenium. Given that I can find elements and get…

javascript python selenium beautifulsoup

asked Jul 02 '13 at 21:19

elie

7 answers

Python BeautifulSoup extract text between element

I try to extract "THIS IS MY TEXT" from the following HTML:

Text

something

THIS IS MY TEXT

something else

…

python beautifulsoup

asked May 30 '13 at 11:54

ɥɔǝnq ɹǝƃloɥ

8 answers

BeautifulSoup: extract text from anchor tag

I want to extract the src of the image tag and the text of the anchor tag that is inside the div with class data. I successfully managed to extract the img src, but am having trouble extracting the text from the anchor tag.

python html beautifulsoup tags scraper

asked Jul 30 '12 at 06:32

add-semi-colons

8 answers

Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Is there a way to get around the following? httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt Is the only way around this to contact the site-owner (barnesandnoble.com).. i'm building a site that would bring them more sales,…

python screen-scraping beautifulsoup mechanize http-status-code-403

asked May 17 '10 at 00:35

Diego

5 answers

BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

I'm trying to scrape all the inner html from the

elements in a web page using BeautifulSoup. There are internal tags, but I don't care, I just want to get the internal text. For example, for:

Red

Blue

Yellow

Light…

python beautifulsoup

asked Jun 02 '10 at 10:58

AP257

2 answers

BeautifulSoup4 get_text still has javascript

I'm trying to remove all the html/javascript using bs4, however, it doesn't get rid of javascript. I still see it there with the text. How can I get around this? I tried using nltk which works fine however, clean_html and clean_url will be removed…

python beautifulsoup nltk

asked Apr 02 '14 at 01:39

KVISH
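A common way to handle this, sketched under the assumption that the unwanted text comes from script and style elements: remove those elements from the tree before calling get_text(). The HTML is invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented page containing inline JavaScript and CSS.
html = """
<html><head><style>p { color: red; }</style>
<script>var tracking = "on";</script></head>
<body><p>Visible text.</p><script>console.log("hidden");</script></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for find_all(); a list of names
# matches any of them. decompose() deletes each element and its contents.
for element in soup(["script", "style"]):
    element.decompose()

# strip=True trims each text fragment and drops whitespace-only ones.
text = soup.get_text(separator=" ", strip=True)
print(text)
```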

2 answers

Beautiful Soup findAll doesn't find them all

I'm trying to parse a website and get some info with the find_all() method, but it doesn't find them all. This is the code: #!/usr/bin/python3 from bs4 import BeautifulSoup from urllib.request import urlopen page = urlopen…

python html python-3.x beautifulsoup

asked May 01 '13 at 17:07

Clepto
