Questions tagged [beautifulsoup]

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of this package is version 4, imported as bs4.

Beautiful Soup is a Python library for parsing HTML and XML files, which is useful in web scraping. It can use Python's standard HTML parser as well as other parsers such as lxml or html5lib. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Beautiful Soup 4 (commonly known as bs4, after the name of its Python module) is the latest version of Beautiful Soup, and is mostly backwards-compatible with Beautiful Soup 3. Beautiful Soup is published under the MIT License.

Since version 4.7.0, Beautiful Soup supports a wide range of CSS4 selectors, adding to its already rich collection of tools for selecting HTML/XML elements. The selectors and pseudo-classes are implemented by the soupsieve library, which bs4 uses; its documentation lists everything that is supported.
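
As a rough illustration of the selector support, the sketch below uses select() and select_one() on a small made-up HTML snippet. The markup, class names, and variable names are assumptions chosen for the example, not part of any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div class="events">
  <a href="/a">First</a>
  <a href="/b">Second</a>
</div>
<div class="other"><a href="/c">Third</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Child combinator plus attribute selector: direct <a href> children of div.events.
links = [a["href"] for a in soup.select("div.events > a[href]")]

# A pseudo-class: the last element child inside div.events.
last = soup.select_one("div.events a:last-child").get_text()

print(links)
print(last)
```

select() returns a list of matching tags, while select_one() returns the first match or None, so the result of select_one() is worth checking before use on real pages.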

To install the latest version with pip, use pip install beautifulsoup4. The library is then imported in a project like this: from bs4 import BeautifulSoup
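
After installation, a first parse might look like the following minimal sketch. The HTML string is a made-up example, and "html.parser" is Python's built-in parser, used here so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical document, invented for this example.
html = (
    "<html><head><title>Example page</title></head>"
    "<body><p id='intro'>Hello, <b>world</b>!</p></body></html>"
)

# Build the parse tree with the standard-library parser.
soup = BeautifulSoup(html, "html.parser")

# Navigate: the <title> tag and its text.
title = soup.title.string

# Search: find a tag by name and attribute, then take all text inside it.
intro = soup.find("p", id="intro").get_text()

print(title)
print(intro)
```

Passing "lxml" or "html5lib" as the second argument would select one of the other parsers, assuming that package is installed.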

Notice: Beautiful Soup 3 works only on Python 2.x, while Beautiful Soup 4 works on Python 3. Beautiful Soup 4 releases up to 4.9.3 also support Python 2.7.

32305 questions
62
votes
7 answers

Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn…
Monika Sulik
  • 16,498
  • 15
  • 50
  • 52
62
votes
12 answers

Remove a tag using BeautifulSoup but keep its contents

Currently I have code that does something like this: soup = BeautifulSoup(value) for tag in soup.findAll(True): if tag.name not in VALID_TAGS: tag.extract() soup.renderContents() Except I don't want to throw away the contents inside…
Jason Christa
  • 12,150
  • 14
  • 58
  • 85
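
One possible approach to the question above, sketched under the assumption that bs4 is in use: Tag.unwrap() removes a tag but leaves its children in place. The VALID_TAGS set and the sample markup are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist and input, invented for this example.
VALID_TAGS = {"p", "b"}
html = "<p>Keep <i>this <b>text</b></i> please</p>"

soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag; unwrap() drops the tag itself
# but keeps whatever was inside it.
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()

result = str(soup)
print(result)
```

By contrast, extract() and decompose() remove the tag together with its contents, which is the behaviour the question wants to avoid.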
60
votes
2 answers

How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?

I am new to Python and I am learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of 'a' tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug…
Flecha
  • 615
  • 1
  • 5
  • 4
60
votes
2 answers

Rendered HTML to plain text using Python

I'm trying to convert a chunk of HTML text with BeautifulSoup. Here is an example:

Some text more text even more text

  • list item
  • yet another list…
btatarov
  • 657
  • 1
  • 5
  • 8
59
votes
4 answers

Beautifulsoup - nextSibling

I'm trying to get the content "My home address" using the following but got the AttributeError: address = soup.find(text="Address:") print address.nextSibling This is my HTML: Address: My home address What is a good way to…
ready
  • 1,189
  • 2
  • 9
  • 9
59
votes
10 answers

How can I get href links from HTML using Python?

import urllib2 website = "WEBSITE" openwebsite = urllib2.urlopen(website) html = openwebsite.read() print html So far so good. But I want only the href links from the plain text HTML. How can I solve this problem?
user371012
  • 593
  • 1
  • 4
  • 4
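
A minimal sketch of one way to collect href values with bs4, using an inline HTML string in place of a downloaded page. The markup and URLs are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, invented for this example.
html = """
<p><a href="https://example.com/a">A</a></p>
<p><a name="anchor-without-href">No link</a></p>
<p><a href="/b">B</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(links)
```

Relative links such as "/b" come back unchanged; urllib.parse.urljoin from the standard library can resolve them against the page URL if absolute links are needed.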
59
votes
3 answers

How can I parse a website using Selenium and Beautifulsoup in python?

New to programming and figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now but not sure where to start. Can someone hold my hand a sec and point me in the right direction? Any help appreciated -
57
votes
4 answers

Deleting a div with a particular class using BeautifulSoup

I want to delete a specific div from a soup object. I am using Python 2.7 and bs4. According to the documentation, we can use div.decompose(), but that would delete all the divs. How can I delete a div with a specific class?
Riken Shah
  • 3,022
  • 5
  • 29
  • 56
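
As a hedged sketch of one way to do this with bs4: find_all() accepts a class_ filter, so decompose() can be called only on the matching divs. The class names and markup below are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = '<div class="ads">Buy now</div><div class="content">Article</div>'

soup = BeautifulSoup(html, "html.parser")

# Only divs whose class attribute includes "ads" are removed.
for div in soup.find_all("div", class_="ads"):
    div.decompose()

result = str(soup)
print(result)
```

decompose() destroys the tag and its contents, so the other divs in the document are left untouched.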
57
votes
3 answers

Selenium versus BeautifulSoup for web scraping

I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize on Python but I saw that the website had a button that created content via JavaScript so I decided to use Selenium. Given that I can find elements and get…
elie
  • 581
  • 1
  • 5
  • 4
56
votes
7 answers

Python BeautifulSoup extract text between element

I try to extract "THIS IS MY TEXT" from the following HTML:
Text

something

THIS IS MY TEXT

something else


56
votes
8 answers

BeautifulSoup: extract text from anchor tag

I want to extract the src of the image tag and the text of the anchor tag inside the div with class data. I managed to extract the img src, but I am having trouble extracting the text from the anchor tag.
add-semi-colons
  • 18,094
  • 55
  • 145
  • 232
55
votes
8 answers

Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Is there a way to get around the following? httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales,…
52
votes
5 answers

BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

I'm trying to scrape all the inner html from the

elements in a web page using BeautifulSoup. There are internal tags, but I don't care, I just want to get the internal text. For example, for:

Red

Blue

Yellow

Light…

AP257
  • 89,519
  • 86
  • 202
  • 261
52
votes
2 answers

BeautifulSoup4 get_text still has javascript

I'm trying to remove all the html/javascript using bs4, however, it doesn't get rid of javascript. I still see it there with the text. How can I get around this? I tried using nltk which works fine however, clean_html and clean_url will be removed…
KVISH
  • 12,923
  • 17
  • 86
  • 162
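
One common approach to the question above, sketched with made-up markup: remove the script and style tags from the tree before calling get_text(), so their contents cannot end up in the output.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for this example.
html = (
    "<html><head><script>var x = 1;</script>"
    "<style>p {color: red}</style></head>"
    "<body><p>Visible</p><p>text</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for find_all(); drop every
# <script> and <style> tag together with its contents.
for tag in soup(["script", "style"]):
    tag.decompose()

# Join the remaining text fragments with single spaces.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The separator and strip arguments only tidy whitespace; the removal of the script and style tags is what keeps JavaScript and CSS out of the result.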
50
votes
2 answers

Beautiful Soup findAll doesn't find them all

I'm trying to parse a website and get some info with the find_all() method, but it doesn't find them all. This is the code: #!/usr/bin/python3 from bs4 import BeautifulSoup from urllib.request import urlopen page = urlopen…
Clepto
  • 685
  • 1
  • 6
  • 7