I would like to get search results from Google Patents. Can anyone help?

This is an example for Google Search:

<?php
require_once('simple_html_dom.php');
$url  = 'https://www.google.com/search?hl=en&q=facebook&num=1';
$html = file_get_html($url);
$linkObjs = $html->find('h3.r a');

foreach ($linkObjs as $linkObj) {
    $title = trim($linkObj->plaintext);
    $link  = trim($linkObj->href);

    // if it is not a direct link but url reference found inside it, then extract
    if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&amp;sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
        $link = $matches[1];
    } else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
        continue;    
    }

    echo '<p>Title: ' . $title . '<br />';
    echo 'Link: ' . $link . '</p>';    
}

?>

Result:

Title: Welcome to Facebook - Log In, Sign Up or Learn More
Link: https://www.facebook.com/

I like this result, but I need to search Google Patents instead.

If there are other, better methods, please tell me. I would be very grateful.

brian kong
  • Be aware that Google doesn't allow scraping its search result pages, so you may get blocked for doing this at any time. See http://stackoverflow.com/a/22703153/582278 for details. Patent Search has an API at https://developers.google.com/patent-search/ that is deprecated, but will continue to be active. – Dan Blows Jan 07 '15 at 10:28
  • @Blowski But it seems to have been discontinued – brian kong Jan 08 '15 at 02:15
  • The documentation says it's deprecated, but still available. Either way, you would be breaking Google's TOS by scraping their results page and they will probably block you. – Dan Blows Jan 08 '15 at 09:08
  • @Blowski But I can't find it. Could you help me find it and point me to it, please? – brian kong Jan 09 '15 at 08:35
  • What are you trying to find out? How to scrape the page (which I'm not going to help with because it breaks the TOS and the internet is already heaving with instructions on how to do that anyway)? Or the API documentation - that's [here](https://developers.google.com/patent-search/). – Dan Blows Jan 09 '15 at 10:42
  • @Blowski But I can't find the .js file for the Google Patents API. Or does it not need to be used? – brian kong Jan 14 '15 at 02:35

1 Answer

If you are looking for patents on a "multifunctional keypad", set $url to "https://www.google.com/search?tbm=pts&hl=en&q=multi+function+keypad&num=1"

But remember that if you search for something that has no patent on that site, you might get a result from some other site, or no result at all. You will need to handle these situations (e.g. check whether the result has www.google.com/patents/ in it).
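
The check described above could be sketched roughly as follows. The helper name `filterPatentLinks` and the sample links are hypothetical; the only thing taken from this answer is the `www.google.com/patents/` prefix. It keeps the links that point at a patent page and lets the caller handle the empty case.

```php
<?php
// Hypothetical helper: keeps only links that point at a Google Patents page.
function filterPatentLinks(array $links)
{
    $patentLinks = array();
    foreach ($links as $link) {
        // Accept http or https, with or without "www.".
        if (preg_match('#^https?://(www\.)?google\.com/patents/#i', $link)) {
            $patentLinks[] = $link;
        }
    }
    return $patentLinks;
}

// Made-up sample input, for illustration only.
$links = array(
    'https://www.google.com/patents/US7724240',
    'https://www.example.com/some-other-page',
);
$patents = filterPatentLinks($links);

if (count($patents) === 0) {
    // Nothing from the patents site came back: handle it explicitly.
    echo "No patent results found\n";
} else {
    echo implode("\n", $patents), "\n";
}
```

An empty array here covers both cases mentioned above: no result at all, or results that only come from other sites.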

A much more effective way to search would be to use the Google API. Search for patent and PHP on https://developers.google.com/web-search/docs/

Hope this helps.

Update: I wrote a little script to show that what I described can work. I didn't want to learn simple_html_dom.php, so I didn't use it. You may be able to improve my code using simple_html_dom.php.

Sometimes it needs a couple of refreshes to work. My code picks a random IP, and when Google doesn't treat it as valid, it returns no result. Feel free to use your own IP, but it might soon get blocked if you run this too frequently. Randomizing the IP may still not prevent your IP from being blocked if you run this too often (Google asks you to enter a CAPTCHA when it detects scraping-like activity). I also randomize a few other things, such as the HTTP headers and the user agent. Here is the code:

<?php

// Search Google Patents and return the patent URLs found in the result page.
function searchGooglePatent($searchString){
    $url = "https://www.google.com/search?tbm=pts&hl=en&q=".rawurlencode($searchString);//."&num=1"; // add &num=1 if you need only one result
    echo $url;
    $html = geturl($url);
    // Capture group 1 is the patent URL without its query string.
    $ids = match_all('/<a.*?href=\"(https:\/\/www\.google\.com\/patents\/\w\w\d+)\?.*?\".*?>.*?<\/a>/ms', $html, 1);
    return $ids;
}

// Return capture group $i of every match, or false if the regex fails.
function match_all($regex, $str, $i = 0){
    if(preg_match_all($regex, $str, $matches) === false) {
        return false;
    } else {
        return $matches[$i];
    }
}

// Fetch a URL with cURL and return the response body (false on failure).
function geturl($url){
    $ch = curl_init();
    // Skips TLS certificate verification (insecure).
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    // Send a random IP in two custom headers (this does not change the real source address of the request).
    $ip=rand(0,255).'.'.rand(0,255).'.'.rand(0,255).'.'.rand(0,255);
    echo "<br>".$ip."<br>";
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
    // Randomize the version numbers in the user agent string.
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/".rand(3,5).".".rand(0,3)." (Windows NT ".rand(3,5).".".rand(0,2)."; rv:2.0.1) Gecko/20100101 Firefox/".rand(3,5).".0.1");
    set_time_limit(90);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$searchResult = searchGooglePatent("Multi function keypad");
echo "<pre>";
var_dump($searchResult);
echo "</pre>";

?>

The result page would look like this:

    https://www.google.com/search?tbm=pts&hl=en&q=Multi%20function%20keypad
    71.10.79.131
    array (size=4)
      0 => string 'https://www.google.com/patents/US7724240' (length=40)
      1 => string 'https://www.google.com/patents/US6876312' (length=40)
      2 => string 'https://www.google.com/patents/US8259073' (length=40)
      3 => string 'https://www.google.com/patents/US7523862' (length=40)
karmendra
  • If I set $url to "https://www.google.com/search?tbm=pts&hl=en&q=multi+function+keypad&num=1", I don't get any results. – brian kong Jan 08 '15 at 02:17
  • Tested successfully! Thank you very much. I have a small request: how do I get the title for each URL? – brian kong Jan 09 '15 at 09:50
  • Glad it worked for you. To get the title, var_dump the $html and look in the source for a pattern you can match, then pull it out using preg_match (or even the match_all function above). To understand regular expressions better, try [regex101](http://www.regex101.com). – karmendra Jan 11 '15 at 06:18
  • But that gets the titles of all URLs, e.g. Search, Images, Maps, etc. How can I get only the titles from the Google Patents results? – brian kong Jan 14 '15 at 08:00
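
One possible way to get only the patent titles, sketched under assumptions: capture the link text in the same regex that matches the patent URL, so navigation links such as Search, Images, and Maps never match. The function name `extractPatentTitles` is hypothetical, the sample markup is made up, and the pattern assumes the result links still look like `<a href="https://www.google.com/patents/XX123?...">title</a>`, which may not hold for the real page.

```php
<?php
// Hypothetical helper: returns an array of patent URL => link text
// for every patent link found in $html.
function extractPatentTitles($html)
{
    // Group 1: patent URL without its query string. Group 2: the link's inner HTML.
    $regex = '/<a[^>]*?href="(https:\/\/www\.google\.com\/patents\/\w\w\d+)\?[^"]*"[^>]*>(.*?)<\/a>/s';
    $titles = array();
    if (preg_match_all($regex, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // Strip nested tags such as <b> and decode entities such as &amp;.
            $titles[$m[1]] = html_entity_decode(trim(strip_tags($m[2])), ENT_QUOTES, 'UTF-8');
        }
    }
    return $titles;
}

// Made-up sample markup, for illustration only.
$html = '<a href="/maps">Maps</a>'
      . '<a href="https://www.google.com/patents/US7724240?cl=en">Multi-function <b>keypad</b></a>';

print_r(extractPatentTitles($html));
```

Because the URL is used as the array key, a patent that is linked more than once on the page appears only once, with the text of its last link.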