3

Have you guys ever noticed that FB scrapes a link you post on Facebook (in a status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail, various images from the linked page, or a video thumb for a video-related link (like YouTube)?

Any ideas how one would replicate this functionality? I'm thinking about a couple of Gearman workers, or even better, just JavaScript that does XHR requests and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)

thanks!

Toby
  • 2,720
  • 5
  • 29
  • 46
  • c'mon guys... seriously? nobody? ;) – Toby Jun 07 '10 at 22:21
  • 1
    Anything like this would need a custom-written tool for each site you were scraping from. Try to avoid `RegEx`, use `DOM` instead. Try to find a raw data feed from the site before scraping their actual web page. If you can't find raw data, I strongly recommend testing with static files stored on your server. – drudge Oct 19 '10 at 22:19
  • don't think so. I already have a working prototype that generates the same output as the FB scraper; the only problem is the scalability... – Toby Oct 20 '10 at 11:58

3 Answers

14

FB scrapes the meta tags from the HTML.

I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.

As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.

Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/

I've had a look at how FB does it, and it looks like the scraping is done on the server side.


    // Requires the Simple HTML DOM library (include simple_html_dom.php
    // before calling; it provides file_get_html)

    class ScrapedInfo
    {
        public $url;
        public $title;
        public $description;
        public $imageUrls;
    }

    function scrapeUrl($url)
    {
        $info = new ScrapedInfo();
        $info->url = $url;
        $html = file_get_html($info->url);

        //Grab the page title (guard against pages without one)
        $titleEl = $html->find('title', 0);
        $info->title = $titleEl ? trim($titleEl->plaintext) : '';

        //Grab the page description from <meta name="description">
        foreach($html->find('meta') as $meta)
                if (strtolower($meta->name) == "description")
                        $info->description = trim($meta->content);

        //Grab the image URLs
        $parts = parse_url($url);
        $base = $parts['scheme'] . '://' . $parts['host'];
        $imgArr = array();
        foreach($html->find('img') as $element)
        {
                $rawUrl = $element->src;

                //Turn relative URLs into absolutes against the site root
                //(still simplistic: path-relative URLs really resolve
                //against the page's directory, not the root)
                if (substr($rawUrl, 0, 4) != "http")
                        $imgArr[] = $base . '/' . ltrim($rawUrl, '/');
                else
                        $imgArr[] = $rawUrl;
        }
        $info->imageUrls = $imgArr;

        return $info;
    }
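The dimension filter mentioned above could be sketched like this. The 50×50 minimum is just a guess, not anything FB documents, and `getimagesize()` can read remote URLs only when `allow_url_fopen` is enabled:

```php
<?php
// Keep only images big enough to make a sensible preview thumbnail,
// skipping button graphics, 1px spacers, and the like.
function filterThumbnailCandidates(array $imageUrls, $minWidth = 50, $minHeight = 50)
{
    $candidates = array();
    foreach ($imageUrls as $imgUrl) {
        // getimagesize() returns array(width, height, ...) or false on
        // failure; it accepts URLs as well as local paths.
        $size = @getimagesize($imgUrl);
        if ($size !== false && $size[0] >= $minWidth && $size[1] >= $minHeight) {
            $candidates[] = $imgUrl;
        }
    }
    return $candidates;
}
```

You'd feed it the `$info->imageUrls` array from `scrapeUrl()`; note that it downloads (at least the header of) every image, so for mass crawling you'd want to cache or parallelize these checks.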

stevenroberts
  • 231
  • 3
  • 6
0

As I am developing a project like that, I can say it is not as easy as it seems. Encoding issues, content rendered with JavaScript, and the existence of so many non-semantic websites are some of the big problems I encountered. Extracting video info and trying to get auto-play behavior in particular is always tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as on FB.
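For the encoding issue in particular, a minimal PHP sketch, assuming the charset is declared in a meta tag (a real implementation should check the HTTP Content-Type header first, since it takes precedence):

```php
<?php
// Normalize scraped HTML to UTF-8 before parsing it.
// Requires the mbstring extension.
function normalizeToUtf8($html)
{
    $charset = 'UTF-8';
    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="...; charset=..."> form
    if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
        $charset = strtoupper($m[1]);
    }
    if ($charset !== 'UTF-8') {
        $html = mb_convert_encoding($html, 'UTF-8', $charset);
    }
    return $html;
}
```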

0

Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
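That check might look something like this using PHP's built-in DOM extension (just a sketch of the idea described above, not how FB actually implements it):

```php
<?php
// Look for a developer-supplied <link rel="image_src" href="..."> tag
// and return its href, or null if the page doesn't provide one.
function findPreferredThumbnail($html)
{
    $doc = new DOMDocument();
    // Suppress warnings about the malformed HTML found on real-world pages
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'image_src') {
            return $link->getAttribute('href');
        }
    }
    return null; // caller falls back to scanning <img> tags or a thumbnail service
}
```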

Nev Stokes
  • 9,051
  • 5
  • 42
  • 44
  • oh really ;) - I meant more in terms of mass data crawling. Also, they use pics from within the page, not screenshots. – Toby Oct 20 '10 at 11:57
  • What do you think a screenshot is Tobias? It *is* a picture! – Nev Stokes Oct 20 '10 at 13:17
  • they collect all images within the page and choose one as the preview thumbnail. They do not generate a screenshot of the page; it's a picture taken from the page itself. – Toby May 19 '12 at 14:52