Update Yahoo error

OK, so I got it all working, but preg_match_all won't work against Yahoo. If you look at http://se.search.yahoo.com/search?p=random&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t you can see that their HTML contains <span class="url" id="something random"> the actual link </span>, but when I run preg_match_all on it, I get no results.

preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $urlContents[2], $yahoo);

Anyone got an idea?

End of update

I'm trying to run preg_match_all on the results I get from Google, which I fetch with cURL's curl_multi_getcontent.

I have succeeded in fetching the page, but when I try to extract the result links, the pattern matches far too much.

I'm currently using: preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);

The match starts where it should, but it doesn't stop at the first closing tag; it just keeps going. Check the HTML at www.google.com/search?q=random for example and you will see that every link starts with <cite> and ends with </cite>.

Could someone possibly help me retrieve this information? I only need the actual link address of each result.

Update Entire PHP Script

public function multiSearch($question)
{
    $sites['google'] = "http://www.google.com/search?q={$question}&gl=sv";
    $sites['bing'] = "http://www.bing.com/search?q={$question}";
    $sites['yahoo'] = "http://se.search.yahoo.com/search?p={$question}";

    $urlHandler = array();

    foreach($sites as $site)
    {
        $handler = curl_init();
        curl_setopt($handler, CURLOPT_URL, $site);
        curl_setopt($handler, CURLOPT_HEADER, 0);
        curl_setopt($handler, CURLOPT_RETURNTRANSFER, 1);

        array_push($urlHandler, $handler);
    }

    $multiHandler = curl_multi_init();
    foreach($urlHandler as $key => $url)
    {
        curl_multi_add_handle($multiHandler, $url);
    }

    $running = null;
    do
    {
        curl_multi_exec($multiHandler, $running);
    }
    while($running > 0);

    $urlContents = array();
    foreach($urlHandler as $key => $url)
    {
        $urlContents[$key] = curl_multi_getcontent($url);
    }

    foreach($urlHandler as $key => $url)
    {
        curl_multi_remove_handle($multiHandler, $url);
    }

    foreach($urlContents as $urlContent)
    {
        preg_match_all('/<li class="g">(.*?)<\/li>/si', $urlContent, $matches);
        //$this->view_data['results'][] = "Random";
    }
    preg_match_all('#<div id="search"(.*)</ol></div>#i', $urlContents[0], $match);
    preg_match_all('#<cite>(.+)</cite>#si', $urlContents[0], $links);
    var_dump($links);

}
Vanchi
  • Can you please put your PHP script so we can check it ? – Laurent Brieu Oct 17 '12 at 07:12
  • Sure. But as I mentioned, it does retrieve the actual HTML document, so there's nothing wrong with the fetching part of the script. I used preg_match_all to get just the result section, but it won't work for the links only. Anyway, I'll update the main post with the entire script. – Vanchi Oct 17 '12 at 07:14

2 Answers

Run the regular expression in ungreedy mode by adding the U modifier:

preg_match_all('#<cite>(.+)</cite>#siU', $urlContents[0], $links);
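To make the difference concrete, here is a minimal sketch comparing the question's greedy pattern with the U-modified one. The `$html` string is a made-up stand-in for a fetched results page, not real Google markup.

```php
<?php
// Hypothetical sample markup (an assumption, not actual Google output).
$html = '<cite>example.com/a</cite><p>text</p><cite>example.org/b</cite>';

// Greedy: .+ consumes as much as possible, so the match runs
// from the first <cite> all the way to the LAST </cite>.
preg_match_all('#<cite>(.+)</cite>#si', $html, $greedy);

// U modifier: quantifiers become ungreedy by default, so .+ stops
// at the first </cite> that lets the pattern succeed.
preg_match_all('#<cite>(.+)</cite>#siU', $html, $ungreedy);

print_r($greedy[1]);   // one oversized match
print_r($ungreedy[1]); // one entry per <cite> element
```

Keep in mind that U inverts greediness for the whole pattern, so a quantifier followed by ? becomes greedy again under this modifier.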
Maxim Krizhanovsky
  • Do you have any idea how to solve my updated issue @Darhazer or @Jack? – Vanchi Oct 17 '12 at 08:18
  • @DanielRunnakkoLöfgren Same problem, but this time with the `(.*)`; you should have `(.*?)`. – Ja͢ck Oct 17 '12 at 08:31
  • @Jack It won't work this time. When I use the regex I posted in Notepad++, it works as intended, but not in PHP for some reason. – Vanchi Oct 17 '12 at 08:33

As in Darhazer's answer, you can turn on ungreedy mode for the whole regex with the U pattern modifier, or make just that one quantifier ungreedy (lazy) by following it with a ?:

preg_match_all('#<cite>(.+?)</cite>#si', ...
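The same lazy-quantifier idea can be tried on the Yahoo pattern from the question's update. The sketch below uses invented markup modelled only on the question's description (`<span class="url" id="...">`); the real page may order or quote its attributes differently, so treat it as an illustration rather than a verified fix.

```php
<?php
// Hypothetical markup based on the question's description of Yahoo's result links.
$html = '<span class="url" id="a1">example.com/a</span>'
      . '<span class="url" id="b2">example.org/b</span>';

// Original pattern: the greedy (.*) for the id runs on to the LAST `">`,
// swallowing the first link along the way.
preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $html, $greedy);

// Lazy (.*?) keeps each id inside its own tag, so every span is matched.
preg_match_all('#<span class="url" id="(.*?)">(.+?)</span>#si', $html, $lazy);

print_r($greedy[2]); // only the last link survives
print_r($lazy[2]);   // both links
```

If the lazy version still returns nothing on the real page, as reported in the comments above, it may be worth dumping the HTML that cURL actually fetched and comparing it with what the browser shows, since the two are not guaranteed to be identical.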
MrWhite