3

Have you guys ever noticed that FB scrapes a link you post on Facebook (in a status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail, various images from the linked page, or a video thumb for a video-related link (like YouTube)?

Any ideas how one would replicate this functionality? I'm thinking about a couple of Gearman workers, or even better, just JavaScript that does XHR requests and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)

thanks!

Toby
  • 2,720
  • 5
  • 29
  • 46
  • c'mon guys... seriously? nobody? ;) – Toby Jun 07 '10 at 22:21
  • 1
    Anything like this would need a custom-written tool for each site you were scraping from. Try to avoid `RegEx`, use `DOM` instead. Try to find a raw data feed from the site before scraping their actual web page. If you can't find raw data, I strongly recommend testing with static files stored on your server. – drudge Oct 19 '10 at 22:19
  • don't think so. I already have a working prototype that generates the same output as the FB scraper; the only problem is the scalability... – Toby Oct 20 '10 at 11:58

3 Answers

14

FB scrapes the meta tags from the HTML.

I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.

As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.

Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/

I've had a look at how FB does it, and it looks like the scraping is done on the server side.


    // Requires the Simple HTML DOM library (include simple_html_dom.php
    // before calling; it provides file_get_html)

    class ScrapedInfo
    {
        public $url;
        public $title;
        public $description;
        public $imageUrls;
    }

    function scrapeUrl($url)
    {
        $info = new ScrapedInfo();
        $info->url = $url;
        $html = file_get_html($info->url);

        //Grab the page title (guard against pages without one)
        $titleEl = $html->find('title', 0);
        $info->title = $titleEl ? trim($titleEl->plaintext) : '';

        //Grab the page description from <meta name="description">
        foreach($html->find('meta') as $meta)
                if (strtolower($meta->name) == "description")
                        $info->description = trim($meta->content);

        //Grab the image URLs
        $parts = parse_url($url);
        $base = $parts['scheme'] . '://' . $parts['host'];
        $imgArr = array();
        foreach($html->find('img') as $element)
        {
                $rawUrl = $element->src;

                //Turn relative URLs into absolutes against the site root
                //(still simplistic: path-relative URLs really resolve
                //against the page's directory, not the root)
                if (substr($rawUrl, 0, 4) != "http")
                        $imgArr[] = $base . '/' . ltrim($rawUrl, '/');
                else
                        $imgArr[] = $rawUrl;
        }
        $info->imageUrls = $imgArr;

        return $info;
    }
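The dimension filter mentioned above could be sketched like this. The 50×50 minimum is just a guess, not anything FB documents, and `getimagesize()` can read remote URLs only when `allow_url_fopen` is enabled:

```php
<?php
// Keep only images big enough to make a sensible preview thumbnail,
// skipping button graphics, 1px spacers, and the like.
function filterThumbnailCandidates(array $imageUrls, $minWidth = 50, $minHeight = 50)
{
    $candidates = array();
    foreach ($imageUrls as $imgUrl) {
        // getimagesize() returns array(width, height, ...) or false on
        // failure; it accepts URLs as well as local paths.
        $size = @getimagesize($imgUrl);
        if ($size !== false && $size[0] >= $minWidth && $size[1] >= $minHeight) {
            $candidates[] = $imgUrl;
        }
    }
    return $candidates;
}
```

You'd feed it the `$info->imageUrls` array from `scrapeUrl()`; note that it downloads (at least the header of) every image, so for mass crawling you'd want to cache or parallelize these checks.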

stevenroberts
  • 231
  • 3
  • 6
0

As I am developing a project like that, I can say it is not as easy as it seems. Encoding issues, content rendered with JavaScript, and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behavior in particular is always tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.
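For the encoding issue in particular, a minimal PHP sketch, assuming the charset is declared in a meta tag (a real implementation should check the HTTP Content-Type header first, since it takes precedence):

```php
<?php
// Normalize scraped HTML to UTF-8 before parsing it.
// Requires the mbstring extension.
function normalizeToUtf8($html)
{
    $charset = 'UTF-8';
    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="...; charset=..."> form
    if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
        $charset = strtoupper($m[1]);
    }
    if ($charset !== 'UTF-8') {
        $html = mb_convert_encoding($html, 'UTF-8', $charset);
    }
    return $html;
}
```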

0

Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
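That check might look something like this using PHP's built-in DOM extension (just a sketch of the idea described above, not how FB actually implements it):

```php
<?php
// Look for a developer-supplied <link rel="image_src" href="..."> tag
// and return its href, or null if the page doesn't provide one.
function findPreferredThumbnail($html)
{
    $doc = new DOMDocument();
    // Suppress warnings about the malformed HTML found on real-world pages
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'image_src') {
            return $link->getAttribute('href');
        }
    }
    return null; // caller falls back to scanning <img> tags or a thumbnail service
}
```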

Nev Stokes
  • 9,051
  • 5
  • 42
  • 44
  • oh really ;) - I meant more in terms of mass data crawling. Also, they use pics from within the page, not screenshots. – Toby Oct 20 '10 at 11:57
  • What do you think a screenshot is Tobias? It *is* a picture! – Nev Stokes Oct 20 '10 at 13:17
  • they collect all images within the page and choose one as the preview thumbnail. They do not generate a screenshot of the page; it's a picture taken from the page itself. – Toby May 19 '12 at 14:52