The Goal:
I am trying to do some scraping in Python using a headless browser: Selenium with PhantomJS and GhostDriver.
I am using Python 2.7 on a Mac running Mavericks. I work within Emacs (although the script also failed when run from Terminal). I have already overcome some errors, such as "phantomjs - no such file or directory exists", and have downloaded the latest binaries from here, which are supposed to include a pending patch from the official PhantomJS team.
My Test Script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
desired_cap = {
'phantomjs.page.settings.loadImages' : True,
'phantomjs.page.settings.resourceTimeout' : 10000,
'phantomjs.page.settings.userAgent' : "my_user_agent"
}
driver = webdriver.PhantomJS(executable_path= "/usr/local/bin/phantomjs", desired_capabilities=desired_cap)
driver.set_window_size(1024, 768)
driver.get('https://google.com/')
driver.save_screenshot("testing.png")
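# page_source is a property, not a method, so write it to the file explicitly
with open("source_code.txt", "wb") as f:
    f.write(driver.page_source.encode("utf-8"))
element = driver.find_element_by_xpath('//*[@id="gbqfq"]')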
element.send_keys('testing')
element.send_keys(Keys.ENTER)
Here is a link to the webdriver's simple explanation: http://selenium.googlecode.com/svn/trunk/docs/api/py/webdriver_phantomjs/selenium.webdriver.phantomjs.webdriver.html
The Error Message:
What I have tried:
I took a simple example from a tutorial before trying anything more complicated, but I still get errors. I tried one tutorial and then a second.
I made one last change to the phantomjs service.py file, which I found here. Namely, I changed:
self.process = subprocess.Popen(self.service_args,
stdout=self._log, stderr=self._log)
to:
self.process = subprocess.Popen(['/usr/bin/env', 'phantomjs', '--webdriver=59202'])
The last argument, --webdriver, seems rather arbitrary to my inexperienced eyes. I thought it might correspond to the port used by GhostDriver, which is displayed after each run in ghostdriver.log and is different each time. Because the port changes on every run, hard-coding it does not seem to make sense, and when I tried it anyway, it didn't work.
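To illustrate the port issue, here is a minimal sketch of choosing the port at run time instead of hard-coding 59202, and of checking whether anything is actually listening on it. The helper names (find_free_port, build_phantomjs_command, is_port_open) are hypothetical and are not part of Selenium or PhantomJS; the only PhantomJS detail assumed is the --webdriver=PORT flag already used above.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a TCP port that is currently unused (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]
    finally:
        sock.close()


def build_phantomjs_command(port, executable="phantomjs"):
    """Build the argument list for subprocess.Popen with a dynamic port."""
    return [executable, "--webdriver=%d" % port]


def is_port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:  # includes "connection refused" and timeouts
        return False
    conn.close()
    return True


if __name__ == "__main__":
    port = find_free_port()
    print(build_phantomjs_command(port))
    # Nothing has been started on this port yet, so this should print False.
    print(is_port_open(port))
```

If is_port_open still returns False a few seconds after PhantomJS has been launched with that port, the process probably never started listening, which would be consistent with a refused connection.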
The Question:
Does anybody have any idea why the connection is being refused?