I am currently working on a web scraping project using Selenium in Python.

My code works as intended when run from a web driver in non-headless mode, but not when run in headless mode. For instance, if I try to extract text from a website, the non-headless mode returns the text, while the headless mode returns None. (I have included some code below for reference.)

First, I constructed the webdriver with the following code (opts.headless is set to True or False to switch between headless and non-headless mode):

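from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

def getHeadlessDriver():
    opts = webdriver.ChromeOptions()
    opts.headless = False  # set to True to run without a visible browser window
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
    return driver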

Then, I used the find_elements_by_xpath function to extract text from a website. Sample code is provided below:

driver = getHeadlessDriver()
feedbacks = driver.find_elements_by_xpath(
    "//div[contains(@class, 'LiveFeedbackSectionViewController__LiveFeedbackStatusItem-sc-1ahetk9-4 cUJPkM')]")
for feedback in feedbacks:
    print(feedback.text)

I did some googling to find an explanation for why headless mode does not work, but I am still not sure. From my understanding, headless mode "acts the same", just without a graphical user interface.

Could there be a problem with the implementation of my code? Or does headless mode differ in ways other than lacking a graphical user interface?

Thank you.

checodes
  • You're missing parenthesis after `getHeadlessDriver`. Change it to `getHeadlessDriver()` and it should work :D – Leyla Zwolinski Jun 01 '21 at 18:17
  • The missing parenthesis was a typo, sorry. Do headless and non-headless modes function the same besides the GUI? – checodes Jun 01 '21 at 18:21
  • There are some cheeky websites that won't load the full page in headless mode. To debug how the website differs in headless mode, I would suggest using `driver.save_screenshot()` and then viewing the image to check whether the page loaded properly. – Hammad Jun 02 '21 at 04:14
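
The screenshot suggestion in the last comment could be sketched as a small helper. This is only an illustration, not code from the thread: the function name `dump_debug_artifacts` and the file prefix are hypothetical, and the helper assumes a Selenium driver exposing `save_screenshot()` and `page_source`.

```python
from pathlib import Path


def dump_debug_artifacts(driver, prefix="headless_debug"):
    """Save a screenshot and the current HTML so a headless run can be inspected.

    `driver` is assumed to be a Selenium WebDriver; any object exposing
    save_screenshot() and page_source works. The function name and the
    default prefix are hypothetical, chosen for illustration.
    """
    screenshot_path = Path(f"{prefix}.png")
    html_path = Path(f"{prefix}.html")

    # save_screenshot() returns False if the image could not be written.
    saved = driver.save_screenshot(str(screenshot_path))

    # page_source is the page markup as the browser currently has it.
    html_path.write_text(driver.page_source, encoding="utf-8")

    return saved, screenshot_path, html_path
```

Calling it once in headless mode and once in non-headless mode, with different prefixes, gives two pairs of files that can be compared side by side.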

3 Answers

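If the website you are trying to scrape has dynamic elements rendered by JavaScript that fail to load in headless mode, you can run the browser in regular (non-headless) mode on a virtual display with Xvfb.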

sudo apt-get install -y xvfb

"Xvfb or X virtual framebuffer is a display server implementing the X11 display server protocol. In contrast to other display servers, Xvfb performs all graphical operations in virtual memory without showing any screen output."

In Python, there are two wrappers for Xvfb.

1- xvfbwrapper

pip install xvfbwrapper

Then add to your Python file:

from xvfbwrapper import Xvfb

display = Xvfb()
display.start()

2- pyvirtualdisplay

pip install PyVirtualDisplay

And then in your code:

from pyvirtualdisplay import Display

display = Display(visible=0, size=(1024, 768))
display.start()
Zhor

I can usually work around this problem with time.sleep(10). However, there was one particular website that I couldn't scrape with either time.sleep(10) or driver.implicitly_wait(10).

I think that the website checks the browser's user agent.

To work around this, I set a user agent on the headless browser, and it worked.

browser_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30'

# options_edge is the browser options object, created elsewhere
options_edge.add_argument(f'user-agent={browser_agent}')

You can get your user agent from websites like this: https://whatmyuseragent.com/ (not affiliated)
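
For the Chrome driver used in the question, the same idea might look like the sketch below. It is an assumption-laden illustration rather than the answerer's code: `build_chrome_arguments` and `make_driver` are hypothetical names, and it assumes Selenium and a matching chromedriver are installed. Headless Chrome has historically put a `HeadlessChrome` token in its default user agent, which is one way a site can tell the two modes apart.

```python
def build_chrome_arguments(user_agent, headless=True):
    """Return the Chrome command-line arguments for a scraping session.

    Hypothetical helper: it only assembles strings, so it can be
    inspected without starting a browser.
    """
    arguments = [f"user-agent={user_agent}"]
    if headless:
        arguments.append("--headless")
    return arguments


def make_driver(user_agent, headless=True):
    """Start Chrome with the arguments above (assumes Selenium is installed)."""
    # Imported here so build_chrome_arguments works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(user_agent, headless=headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

After creating the driver, `driver.execute_script("return navigator.userAgent")` shows which user agent the page actually sees.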

Jules

I think I found a potential answer to this problem.

A headless browser in Selenium runs faster than a non-headless one. As a result, the Python program may try to read elements before the DOM is fully loaded.

In other words, functions trying to access web elements may return None since the elements were not loaded before the function was invoked.

To solve this issue, we can use the implicitly_wait function included in Selenium. For instance,

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(3)  # timeout in seconds

This tells the driver to keep retrying element lookups for up to the specified number of seconds, which gives the DOM time to load.

I have tested my functions in headless mode using this method, and it seems to be working for now. But please feel free to comment if there are other solutions to this problem!
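
Another option is to wait explicitly for the specific elements instead of setting a global timeout. Selenium ships `WebDriverWait` and `expected_conditions` for this; the hand-rolled sketch below only illustrates the polling idea behind such a wait. The helper name `wait_for_elements` and its parameters are hypothetical.

```python
import time


def wait_for_elements(find_elements, timeout=10.0, poll_interval=0.5):
    """Call find_elements() repeatedly until it returns a non-empty list.

    `find_elements` is any zero-argument callable, for example a lambda
    wrapping a Selenium lookup. Returns the elements found, or an empty
    list if nothing appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        elements = find_elements()
        if elements:
            return elements
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll_interval)
```

With the driver from the question, this could be called as `wait_for_elements(lambda: driver.find_elements_by_xpath(xpath))`, where `xpath` is the expression being searched for.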

checodes