
I am currently using Selenium to crawl data from some websites. Unlike with urllib, it seems that I do not really need a parser like BeautifulSoup to parse the HTML: I can simply find an element with Selenium and use WebElement.text to get the data that I need. However, I have seen some people using Selenium and BeautifulSoup together for web crawling. Is it really necessary? Are there any special features that bs4 offers that would improve the crawling process? Thank you.
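For reference, here is roughly what I am doing now (the URL and selector are placeholders, not my actual targets):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com")  # placeholder URL

    # locate an element and read its rendered text directly,
    # with no separate HTML parser involved
    element = driver.find_element(By.CSS_SELECTOR, "h1")
    print(element.text)

    driver.quit()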

jackycflau
  • http://stackoverflow.com/questions/17436014/selenium-versus-beautifulsoup-for-web-scraping?rq=1 I have read this post, and the sites I am currently crawling are all dynamic, so I must use Selenium instead of urllib2. – jackycflau Apr 02 '17 at 03:46

1 Answer


Selenium itself is quite powerful in terms of locating elements, and it basically has everything you need for extracting data from the HTML. The problem is that it is slow: every single Selenium command goes through the JSON wire protocol as an HTTP request, and there is substantial overhead.
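For example, pulling the text out of many elements one by one multiplies that overhead, since each call below is its own command sent to the browser (the table selector is made up for illustration, and driver is assumed to be an already-started WebDriver):

    from selenium.webdriver.common.by import By

    # each find_elements() call and each .text access is a separate
    # round trip between the Python client and the browser
    for cell in driver.find_elements(By.CSS_SELECTOR, "table td"):
        print(cell.text)  # one wire-protocol request per cell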

To improve the performance of the HTML-parsing part, it is usually much faster to retrieve the rendered HTML once via driver.page_source and let BeautifulSoup or lxml do the parsing.


In other words, a common workflow for a dynamic web page is something like the following (a code sketch follows the list):

  • open the page in a browser controlled by selenium
  • perform the necessary browser actions
  • once the desired data is on the page, get the driver.page_source and close the browser
  • pass the page source to an HTML parser for further parsing
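Here is a minimal sketch of that workflow (the URL, the .product selector and the wait condition are made up for illustration):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/products")  # placeholder URL

        # browser actions: here, just wait for the dynamic content to render
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".product"))
        )

        # grab the rendered HTML once
        html = driver.page_source
    finally:
        driver.quit()

    # all further parsing happens locally, with no more round trips
    soup = BeautifulSoup(html, "lxml")
    for product in soup.select(".product"):
        print(product.get_text(strip=True))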
alecxe
  • Very high-quality post, thank you alecxe for your contributions. Selenium can be used with headless browsers too (I'm sure you know that), which speeds up the process and usually lowers memory usage. I didn't know that the commands reach the browser as JSON, thank you for that. – innicoder Apr 02 '17 at 14:24