I am running PhantomJS on a large set of pages to scrape some specific JS-generated content. I am using the Python Selenium bindings, which make it easy to run XPath queries on the results.

I have noticed that if I instantiate a single webdriver.PhantomJS object and perform the entire job with it (by "reusing" it, so to speak), my script soon becomes unstable, with sporadic memory and connectivity issues. My next attempt was to instantiate a new driver for every render call (and to call quit() on it when done), which also didn't work for more than a few requests. My final attempt was to use subprocess to isolate each rendering call in its own process. But even with this technique, which is by far the most stable, I still need to wrap my entire script in supervisor to handle occasional hiccups.

I am really wondering if I might be doing something wrong, or if there is something I should be aware of. I understand that PhantomJS (and other automated browsers) are not really meant for scraping per se (more for testing), but is there a way to make it work reliably nevertheless?
cjauvin
1 Answer
I use Selenium with pyvirtualdisplay and a normal browser, in a manner similar to this: Python - Headless Selenium WebDriver Tests using PyVirtualDisplay (though I'm using Chrome; it's just a matter of a different driver).
This has been much more stable than my experience with PhantomJS from both Node and Python. You'll still likely want to use a process manager, just in case, but this way has been far less error-prone for me.
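For the "just in case" part, here is a minimal sketch of running each render in its own child process with a timeout and a couple of retries, using only the standard library. The function name run_isolated, its parameters, and the render.py script mentioned below are hypothetical, not something from my project; a process manager would still sit on top of this.

```python
import subprocess
import sys

# Interpreter used to launch the (hypothetical) per-page render script.
PYTHON = sys.executable


def run_isolated(args, timeout=60, retries=2):
    """Run a command in its own process, retrying on a crash or a hang.

    Returns the child's stdout as text, or None if every attempt failed.
    """
    for _ in range(retries + 1):
        try:
            result = subprocess.run(
                args,
                capture_output=True,
                text=True,
                timeout=timeout,  # the child is killed if it runs longer
            )
        except subprocess.TimeoutExpired:
            continue  # hung render: try again with a fresh process
        if result.returncode == 0:
            return result.stdout
    return None
```

It could be called as run_isolated([PYTHON, "render.py", url]), where render.py is an assumed script that renders one page and prints the result. A return value of None means the page failed on every attempt and can be logged and skipped.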
Also, I suggest writing a little Python wrapper class so you can use a with block and ensure your environment always gets cleaned up; if you don't kill the session appropriately, you can end up with an orphaned browser eating memory.
From my project:
import os
from selenium import webdriver
from pyvirtualdisplay import Display


class ChromeSession(object):
    """Context manager that runs Chrome on a virtual display."""

    def __enter__(self):
        # Start a virtual display so a normal browser can run without a screen.
        self.display = Display(visible=0, size=(1024, 768))
        self.display.start()
        chromedriver = "/usr/lib/chromium/chromedriver"
        os.environ["websession.chrome.driver"] = chromedriver
        # Selenium 3 style; Selenium 4 takes the driver path via a Service object.
        self.driver = webdriver.Chrome(chromedriver)
        # Tell the driver to wait (if necessary) in case UI rendering takes a while...
        self.driver.implicitly_wait(5)
        return self.driver

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type:
            print(exc_type, exc_val)
            print(exc_tb)
        try:
            self.driver.quit()
        finally:
            # Stop the display even if quitting the browser fails.
            self.display.stop()
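A rough sketch of how a class like this might be used, assuming you only want the page title of each URL. The function scrape_titles and its session_factory parameter are made up for illustration; the only requirement is that the factory returns a context manager yielding a driver, as ChromeSession does.

```python
def scrape_titles(urls, session_factory):
    """Open one browser session and collect the page title of every URL."""
    titles = {}
    # The with block guarantees __exit__ runs, so the browser and the
    # display are torn down even if a page raises partway through.
    with session_factory() as driver:
        for url in urls:
            driver.get(url)
            titles[url] = driver.title
    return titles
```

With the class above this would be called as scrape_titles(urls, ChromeSession). Passing the session class in as an argument also makes it easy to swap in a different browser later.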

kungphu