PhantomJS stability when rendering multiple pages

Question

I am running PhantomJS on a large set of pages to scrape some specific JS-generated content. I am using the Python Selenium bindings, which make it easy to perform XPath queries on the results. I have noticed that if I instantiate a single webdriver.PhantomJS object and perform the entire job with it (by "reusing" it, so to speak), my script soon becomes unstable, with sporadic memory and connectivity issues. My next attempt was to instantiate a new driver for every render call (calling quit() on it when it's done), which also didn't work for more than a few requests. My final attempt was to use subprocess to isolate the rendering call in its own process space. But even with this technique, which is the most stable by far, I still need to wrap my entire script in supervisor to handle occasional hiccups.

I am really wondering if I might be doing something wrong, or if there is something I should be aware of. I understand that PhantomJS (and other automated browsers) are not really meant for scraping per se (more for testing), but is there a way to make it work with great stability nevertheless?
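For readers unfamiliar with the subprocess isolation mentioned above, here is a minimal sketch of the general pattern. It is not the asker's actual script: render_isolated, EXAMPLE_WORKER, and render_worker.py are hypothetical names, and the worker is assumed to take the URL as its last argument and print the rendered result to stdout.

```python
import subprocess
import sys

# Hypothetical worker script that renders one URL and prints the result.
EXAMPLE_WORKER = [sys.executable, "render_worker.py"]


def render_isolated(worker_cmd, url, timeout=60, retries=2):
    """Run a render worker in its own process and return its stdout as text."""
    last_error = None
    for _ in range(retries + 1):
        try:
            done = subprocess.run(
                worker_cmd + [url],
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                timeout=timeout,  # a hung worker is killed instead of stalling the batch
                check=True,       # a non-zero exit status raises CalledProcessError
            )
            return done.stdout.decode("utf-8")
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            # Remember the failure and try again with a fresh process.
            last_error = exc
    raise last_error
```

Note that the timeout only kills the worker process itself; a browser started by the worker may outlive it unless the worker cleans up after itself.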

score 1 · Answer 1 · answered Mar 01 '16 at 04:26

I use Selenium with pyvirtualdisplay with a normal browser in a manner similar to this: Python - Headless Selenium WebDriver Tests using PyVirtualDisplay (though I'm using Chrome; just a matter of a different driver).

Much more stable than my experience with PhantomJS from both node and Python. You'll still likely want to use a process manager, just in case, but this way has been far less error-prone for me.

Also, I suggest writing a little Python wrapper class so you can use a with block and ensure your environment always gets cleaned up; if you don't kill the session appropriately you can end up with an orphaned browser eating memory.

From my project:

import os, time

from selenium import webdriver
from pyvirtualdisplay import Display


class ChromeSession(object):
    def __enter__(self):
        self.display = Display(visible=0, size=(1024, 768))
        self.display.start()

        chromedriver = "/usr/lib/chromium/chromedriver"
        os.environ["websession.chrome.driver"] = chromedriver

        self.driver = webdriver.Chrome(chromedriver)
        # Tell the driver to wait (if necessary) in case UI rendering takes a while...
        self.driver.implicitly_wait(5)

        return self.driver

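    def __exit__(self, exc_type, exc_val, exc_tb):
        # Log any exception raised inside the with block; returning None lets it propagate.
        if exc_type:
            print(exc_type, exc_val)
            print(exc_tb)
        try:
            self.driver.quit()
        finally:
            # Stop the virtual display even if quitting the driver fails.
            self.display.stop()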
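As a rough illustration of how a wrapper like this might be used for a batch of pages, here is a minimal sketch. The helper render_all and its parameters are hypothetical, and opening one session per URL is just one possible design (a single session for the whole batch would be cheaper but shares state between pages). It assumes session_factory is a class like ChromeSession above, whose __enter__ returns a Selenium driver.

```python
def render_all(urls, session_factory, xpath):
    """Visit each URL in its own session and collect the text of nodes matching xpath.

    Returns a dict mapping each URL to a list of strings, or to the exception
    that was raised while processing it.
    """
    results = {}
    for url in urls:
        try:
            # __exit__ runs even if get() or the query raises,
            # so no browser or display is left behind.
            with session_factory() as driver:
                driver.get(url)
                # "xpath" is the string value of selenium's By.XPATH
                nodes = driver.find_elements("xpath", xpath)
                results[url] = [node.text for node in nodes]
        except Exception as exc:
            # Record the failure and move on to the next page.
            results[url] = exc
    return results
```

With the class above, a call could look like render_all(urls, ChromeSession, "//h1"), where the XPath is only a placeholder.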
