The results page for a local Google search typically contains 20 results.

To get the full contact details for any given result on the left-hand side, the result has to be clicked. After a lengthy wait, this brings up an overlay (I'm not sure of the technical term) over the Google Maps pane. That is the behaviour in Firefox; other web browsers do something different.

I am extracting the business name, address, phone and website with Python and WebDriver thus:

address = driver.find_element_by_xpath("//div[@id='akp_uid_0']/div/div/ol/li/div/div/div/ol/table/tbody/tr[2]/td/li/div/div/span[2]").text

name = driver.find_element_by_css_selector(".kno-ecr-pt").text.encode('raw_unicode_escape')
phone = driver.find_element_by_css_selector("div._mr:nth-child(2) > span:nth-child(2)").text

website = driver.find_element_by_css_selector("a.lua-button:nth-child(1)").get_attribute("href")

This works reliably but is extremely slow: loading each Maps overlay can take tens of seconds. I've tried PhantomJS via WebDriver, but it was quickly blocked by Google's bot detection.

If my reading of Firebug is correct, each of these links on the left-hand side is defined like so:

<a data-ved="0CA4QyTMwAGoVChMIj66ruJHGxwIVTKweCh03Sgw0" data-async-trigger="" data-height="0" data-cid="11660382088875336582" data-akp-stick="H4sIAAAAAAAAAGOovnz8BQMDgycHm5SIoaGZmYGxhZGBhYWFuamxsZmphZESVtEoyeSMzKL8gqLE5JL8omLtvNRyhcr8omztvMrkA51e-lt5XiW0n3kw-e7MFfkJwUIAxqbXGGYAAAA" data-akp-oq="Body in Balance Chiropractic New York, NY" jsl="$x 3;" data-rtid="ifLMvGmjeYOk" jsaction="r.UQJvbqFUibg" class="ifLMvGmjeYOk-6WH35iSZ2V0 rllt__link rllt__content" tabindex="0" role="link"><div class="_Ml"><div class="_pl _ki"><div role="heading" aria-level="3" style="margin-right:0px" class="_rl">Body in Balance <wbr></wbr>Chiropractic</div><div class="_lg"><span aria-hidden="true" class="rtng" style="margin-right:5px">5.0</span><g-review-stars><span aria-label="Rated 5.0 out of 5" class="_pxg _Jxg"><span style="width:70px"></span></span></g-review-stars><div style="display:inline;font-size:13px;margin-left:5px"><span>20 reviews</span></div></div><div class="_tf"><span>Chiropractor</span>&nbsp;·&nbsp;W 45th St</div><div class="_CRe"><div><span>Opens at 8:00 am</span></div></div></div></div></a>
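Going by the snippet above alone, the link already carries a few fields (the `data-cid` identifier, the `data-akp-oq` query text and the heading with the business name), though not the phone number or website. As a minimal sketch of reading those without clicking anything, the following uses Python 3's standard `html.parser`. The class name `ResultLinkParser` is hypothetical, and `sample` is a trimmed copy of the markup shown above, not a live page.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect data-cid, data-akp-oq and the heading text of each result link."""

    def __init__(self):
        super().__init__()
        self.results = []          # one dict per <a class="... rllt__link ...">
        self._in_heading = False   # True while inside <div role="heading">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "rllt__link" in (attrs.get("class") or "").split():
            self.results.append({"cid": attrs.get("data-cid"),
                                 "query": attrs.get("data-akp-oq"),
                                 "name": ""})
        elif tag == "div" and attrs.get("role") == "heading" and self.results:
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_heading = False

    def handle_data(self, data):
        # The heading text is split by <wbr>, so append the pieces as they arrive.
        if self._in_heading and self.results:
            self.results[-1]["name"] += data


# Trimmed copy of the link markup from the question (assumption: the real
# page follows the same shape for every result).
sample = ('<a data-cid="11660382088875336582" '
          'data-akp-oq="Body in Balance Chiropractic New York, NY" '
          'class="ifLMvGmjeYOk-6WH35iSZ2V0 rllt__link rllt__content">'
          '<div class="_Ml"><div class="_pl _ki">'
          '<div role="heading" aria-level="3" class="_rl">'
          'Body in Balance <wbr></wbr>Chiropractic</div></div></div></a>')

parser = ResultLinkParser()
parser.feed(sample)
parser.close()
print(parser.results)
```

With WebDriver, the same parser could presumably be fed `driver.page_source` once the results list has loaded. Whether the page source holds anything beyond these fields before a click is exactly the open question here.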

My knowledge of CSS and JavaScript is practically nil, so I may not be asking the right question. But is there a way to get at the underlying source of what eventually hovers over the Maps pane (there's probably a more technical term for it) without having to click the link on the left-hand side to bring it up? My thinking is that if I can parse that HTML without actually having to trigger it, I can save a lot of time.

Pyderman

1 Answer


I have tried to check the DOM structure of the page you provided. IE handles this page very differently from Firefox: IE navigates to another page once you click one of the left-hand-side items.

Because of limitations in my environment, I could only check this in IE. For Firefox, you could try the following code. There may be minor issues (apologies, I am unable to test it).

Note: I wrote the demo in Java (it only looks up the phone number) because that is the language I am familiar with. I am also not good with CSS selectors, so I used XPath instead. I hope it helps.

        driver.get("https://www.google.com/search?q=chiropractors%2Bnew%20york%2Bny&rflfq=1&tbm=lcl&tbs=lf:1,lf_ui:2&oll=40.754671143320074,-73.97722375000001&ospn=0.017814865199625274,0.040340423583984375&oz=15&fll=40.75807315356519,-73.99290368792725&fspn=0.01641614335274255,0.040340423583984375&fz=15&ved=0CJIBENAnahUKEwj1jtnmtcbHAhVTCo4KHfkkCYM&bav=on.2,or.r_cp.&biw=1360&bih=608&dpr=1&sei=y4LdVYvcFsa7uATo_LngCQ&ei=4YTdVbWaENOUuAT5yaSYCA&emsg=NCSR&noj=1&rlfi=hd:;si:#emsg=NCSR&rlfi=hd:;si:&sei=y4LdVYvcFsa7uATo_LngCQ");

        // 0. Crude wait for the results to load; only needed on a slow connection to Google.
        Thread.sleep(5000);

        // 1. Collect the divs (class '_gt') that wrap each of the 20 results on the left-hand side.
        //    This works in both IE and Firefox.
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='_gt']"));

        // 2. Walk the results, using 'data-cid' as the identifier.
        //    Firefox only: in IE there is no data-cid attribute.
        for (WebElement e : elements) {
            WebElement aTag = e.findElement(By.tagName("a"));
            String dataCid = aTag.getAttribute("data-cid");

            // 3. In Firefox, the div holding the details should be identifiable by 'data-cid'.
            //    Untested; note that the question's XPath has 'akp_uid_0' as an id, not a class.
            WebElement parentDivOfTable = driver.findElement(By.xpath("//div[@class='akp_uid_0' and @data-cid='" + dataCid + "']"));

            // 4. Get the information table. The leading '.' keeps the search inside parentDivOfTable.
            WebElement table = parentDivOfTable.findElement(By.xpath(".//table[@class='_B5g']"));

            // 5. Get the phone number from the element that follows the 'Phone:' label.
            String phoneNum = table.findElement(By.xpath(".//span[text()='Phone:']/following-sibling::*[1]")).getText();
        }
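Instead of the fixed `Thread.sleep` in step 0, a polling wait is one way to avoid sleeping longer than needed. Below is a minimal Python sketch of that idea (Python being the language of the question). The helper `wait_until` and the stand-in `overlay_ready` are hypothetical names, not part of Selenium, and the stand-in only simulates an element that appears on the third check.

```python
import time


def wait_until(condition, timeout=30.0, poll=0.25):
    """Call condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for the overlay (assumption): reports "ready" on the third check.
checks = {"count": 0}


def overlay_ready():
    checks["count"] += 1
    return "overlay" if checks["count"] >= 3 else None


found = wait_until(overlay_ready, timeout=5.0, poll=0.01)
print(found, checks["count"])
```

With WebDriver, the condition might be something like `lambda: driver.find_elements_by_css_selector(".kno-ecr-pt")`, which returns an empty (falsy) list until the element exists. Selenium's own `WebDriverWait` offers the same kind of polling. This only removes unnecessary sleeping; it does not make the overlay itself load any faster.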
J.Lyu
  • I am sorry if you cannot work with Java code. Still, the code and the comment above each line should give you a line of thinking to follow. Overall, I suggest concentrating on the overall structure of the page and how the items relate to each other. If I misunderstood your problem or there are issues, please raise them here. Thanks. – J.Lyu Aug 26 '15 at 10:29
  • Many thanks, J. I have successfully mapped your code to Python, and your approach seems reasonable to me. **table** is giving some trouble though: WebDriver is not finding this element; see http://pastebin.com/A5YeuaGy. Your XPath knowledge is clearly better than mine, so I presume the XPath expression is correct. What would you suggest? – Pyderman Aug 26 '15 at 15:58
  • I meant **parentDivTable**: it's **parentDivTable** that cannot be constructed – Pyderman Aug 26 '15 at 16:43
  • I am sorry to tell you that I made a mistake yesterday. I have now set up a workaround that lets me run WebDriver in Firefox. Unfortunately, after testing and checking the source again, I have concluded that it is not possible to get the detailed information for every item up front. Google loads the phone, name and address only after you open the popup panel, and you can only ever find the information for the most recently opened (or currently opening) panel. So, within the limits of my knowledge, I have no solution for this. If you want to cut the run time, you can try HtmlUnitDriver. – J.Lyu Aug 27 '15 at 02:25
  • I apologize for my mistake. It was a serious error on my part to post this without having tested it. Here is a reference in case you are able to use HtmlUnitDriver: http://stackoverflow.com/questions/4618373/how-do-i-use-the-htmlunit-driver-with-selenium-from-python/5518175#5518175 Hope it helps. – J.Lyu Aug 27 '15 at 02:37