
I want to crawl a site that has some content generated by JS. The site runs a JS update every 5 seconds (it requests a new encrypted JS file that I can't parse).

My code:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)

driver.get(url)

trs = driver.find_elements_by_css_selector('.table tbody tr')

print len(trs)

items = []
for tr in trs:
    try:
        items.append(tr.text)
    except:
        # the js has updated the content, so this tr is missing
        pass

print len(items)

len(items) does not match len(trs). How can I tell Selenium to stop executing JS (or stop working entirely) after I run trs = driver.find_elements_by_css_selector('.table tbody tr')?

I need to use trs later, so I cannot call driver.quit().
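
One workaround, a minimal sketch (the idea is to take the snapshot inside the browser, so Python never holds a WebElement reference that can go stale; the assumption is that a single execute_script round trip is fast enough to beat the 5-second update):

# Collect all row texts in one atomic round trip: the browser builds the
# whole list before returning, so no element reference can go stale mid-loop.
items = driver.execute_script("""
    var rows = document.querySelectorAll('.table tbody tr');
    return Array.prototype.map.call(rows, function (tr) {
        return tr.innerText || tr.textContent;
    });
""")
print len(items)

If individual td values are needed, the script can return nested lists of cell texts instead; the point is that everything is captured before the next update lands.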

Exception detail:

---------------------------------------------------------------------------
StaleElementReferenceException            Traceback (most recent call last)
<ipython-input-84-b80e3579efca> in <module>()
     11 items = []
     12 for tr in trs:
---> 13     items.append(tr.text)
     14     #items.append(map_label(hidemyass_label, tr.find_elements_by_tag_name('td')))
     15 

C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in text(self)
     69     def text(self):
     70         """The text of the element."""
---> 71         return self._execute(Command.GET_ELEMENT_TEXT)['value']
     72 
     73     def click(self):

C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in _execute(self, command, params)
    452             params = {}
    453         params['id'] = self._id
--> 454         return self._parent.execute(command, params)
    455 
    456     def find_element(self, by=By.ID, value=None):

C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.pyc in execute(self, driver_command, params)
    199         response = self.command_executor.execute(driver_command, params)
    200         if response:
--> 201             self.error_handler.check_response(response)
    202             response['value'] = self._unwrap_value(
    203                 response.get('value', None))

C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.pyc in check_response(self, response)
    179         elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
    180             raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
--> 181         raise exception_class(message, screen, stacktrace)
    182 
    183     def _value_or_default(self, obj, key, default):

StaleElementReferenceException: Message: {"errorMessage":"Element is no longer attached to the DOM","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:63305","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"GET","url":"/text","urlParsed":{"anchor":"","query":"","file":"text","directory":"/","path":"/text","relative":"/text","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/text","queryKey":{},"chunks":["text"]},"urlOriginal":"/session/4bb16340-a3b6-11e5-8ce5-9d0be40203a6/element/%3Awdc%3A1450243990539/text"}}
Screenshot: available via screen

Apparently the tr is missing.

PS: I need to use Selenium to select elements. Other libs like lxml and pyquery don't know which elements are display:none, their .text() often picks up comments or content inside <script> tags, and so on. It's sad that Python does not have a perfect clone of jQuery.

Mithril
  • Could you remove the `try/except` - what error(s) are you getting if any? – alecxe Dec 16 '15 at 03:19
  • @alecxe I have pasted the exception. – Mithril Dec 16 '15 at 05:40
  • @Mithril Maybe you can try getting the whole text of the table in one go, like `.find_elements_by_css_selector('.table tbody tr').text`, then store the text with string operations. – Mesut GUNES Dec 16 '15 at 07:39
  • @Mesut Güneş I just wrote that code as an example; I need to manipulate some td elements under each tr, so the for loop may take several seconds, and then the tr goes missing... – Mithril Dec 16 '15 at 07:49

1 Answer


Use scrapy. Once you are sure the page has loaded, grab the body using:

from scrapy.http import TextResponse

response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')

You now have a static copy of the page, so you can use scrapy's response.xpath to pull whatever data you need. This answer has more detail.
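
For example, a minimal sketch of reading the question's rows out of that snapshot (the XPath is an assumption mirroring the `.table tbody tr` selector from the question):

# `response` is the static TextResponse built above; it never changes
# underneath you, so this loop can take as long as it likes.
rows = response.xpath('//table[contains(@class, "table")]/tbody/tr')
items = [' '.join(row.xpath('.//text()').extract()).strip() for row in rows]
print len(items)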

Steve
  • scrapy is just a crawler framework; adding selenium or splash support to scrapy would not help with my question. Your code is like using `lxml` to parse `driver.page_source`, but `lxml` (I think scrapy uses `lxml` too) doesn't know which elements are `display:none` (among many other issues). I need to use selenium to select elements, hence the question in the title. – Mithril Dec 25 '15 at 06:39
  • @Mithril Once you have the static copy then there's no JS update every 5 seconds. You have a snapshot so even if your for loop takes several seconds nothing changes. You could even save the `response.body` and process it the next day. – Steve Dec 26 '15 at 15:10
  • You cannot pass static HTML to selenium. I need to use selenium to select elements. – Mithril Dec 28 '15 at 01:09
  • @Mithril, We may be talking at cross-purposes. I would use `Selenium` to grab the page with all the JS rendered content. Copy that HTML as I described above, then use `scrapy` to extract the information I need from the static HTML. Are you saying that you can't use `scrapy` for some reason? – Steve Dec 28 '15 at 10:58
  • No, I have mentioned that no Python package can handle HTML as well as jQuery. Antispider sites have a lot of random rubbish DOM elements to disturb extraction of the real content. Only jQuery (node.js) and selenium can extract the real content easily. I don't want to waste time breaking the antispider rules. – Mithril Dec 29 '15 at 06:48