From what I can tell, Facebook pulls from the `<meta name="description">` tag's `content` attribute on the linked page.
If no meta description tag is available, it seems to pull from the beginning of the first `<p>` tag it can find on the page.
Images are pulled from the available `<img>` tags on the page, with a carousel selection to pick from when posting.
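Concretely, for a page with markup like the following (hypothetical example), the candidates would be the meta description's `content`, the first `<p>`, and the `<img>` tags:

```html
<head>
  <meta name="description" content="A short summary of the page.">
</head>
<body>
  <p>First paragraph text, used as a fallback summary.</p>
  <img src="/images/photo.jpg" alt="Candidate preview thumbnail">
</body>
```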
Finally, the link subtext is also user-editable (start a status update, include a link, and then click in the link subtext area that appears).
Personally, I would go this route: cURL the page, parse it for a meta description tag, and if none is present grab some likely text using a basic algorithm (or just the first paragraph tag); then allow user editing of whatever was presented. That is friendlier to the user and also sidesteps issues with pages that return different content depending on the user-agent. Do the user-facing control as ajax so that you don't block on however long it takes your server to fetch the link being previewed.
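A rough sketch of the fetch step might look like this (assuming the cURL extension is enabled; the timeouts and user-agent string are placeholders to adjust to taste):

```php
<?php
// Fetch the remote page body; returns null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,    // keep the ajax endpoint from hanging
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; LinkPreviewBot/1.0)', // placeholder UA
    ]);
    $html = curl_exec($ch);
    $ok   = $html !== false && curl_getinfo($ch, CURLINFO_RESPONSE_CODE) < 400;
    curl_close($ch);
    return $ok ? $html : null;
}
```

Your ajax endpoint would call this, run the parse step, and return the proposed description/images as JSON for the user to edit.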
I'd recommend using a DOM library (you could even use DOMDocument if you're comfortable with it and know how to handle possibly malformed HTML pages) instead of regex to parse the page for the `<meta>`, `<p>`, and potentially also `<img>` tags. Building a regex that properly handles all of the myriad cases you'll encounter "in the wild", versus a known set of sites, can get very rough. QueryPath usually comes recommended, and there are stackoverflow threads covering many of the available options.
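A minimal DOMDocument sketch of that extraction (the libxml error suppression is there precisely because real-world pages are often malformed; the 300-character cap on the fallback paragraph is an arbitrary choice):

```php
<?php
// Extract the description and candidate image URLs from raw HTML.
function extractPreviewData(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed "in the wild" markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $description = null;
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $description = trim($meta->getAttribute('content'));
            break;
        }
    }

    // Fall back to the first non-empty <p> if there was no meta description.
    if ($description === null || $description === '') {
        foreach ($doc->getElementsByTagName('p') as $p) {
            $text = trim($p->textContent);
            if ($text !== '') {
                $description = substr($text, 0, 300); // arbitrary cap
                break;
            }
        }
    }

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->getAttribute('src') !== '') {
            $images[] = $img->getAttribute('src');
        }
    }

    return ['description' => $description, 'images' => $images];
}
```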
Most modern sites, especially larger ones, are good about populating the meta description tag, even for dynamically generated pages.
You can scrape the page for `<img>` tags as well, but you'll want to host the images locally: either host all of the images and then delete all except the one chosen, or host thumbnails (assuming you have an image processing library installed and enabled). Which you choose depends on whether bandwidth and storage matter more than the one-time cost of running `imagecopyresampled`, `imagecopyresized`, `Gmagick::thumbnailimage`, etc., etc. (pick whatever you have at hand / your favorite). You don't want to hotlink to the images on the original page, both because of the bandwidth you'd be leeching and, especially, because of the likelihood of ending up with broken images on any site that has hotlink prevention (referrer checks and the like) or that expires its image URLs. Personally, I would probably go for storing thumbnails.
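For the GD route, a thumbnail sketch might look like this (assumes the GD extension is installed; JPEG in, JPEG out, and a 150px width cap, all chosen just for illustration):

```php
<?php
// Save a proportionally scaled JPEG thumbnail no wider than $maxWidth pixels.
function makeThumbnail(string $srcPath, string $destPath, int $maxWidth = 150): bool
{
    $src = @imagecreatefromjpeg($srcPath);
    if ($src === false) {
        return false;
    }
    $w     = imagesx($src);
    $h     = imagesy($src);
    $scale = min(1.0, $maxWidth / $w);      // never upscale
    $tw    = (int) round($w * $scale);
    $th    = (int) round($h * $scale);

    $thumb = imagecreatetruecolor($tw, $th);
    imagecopyresampled($thumb, $src, 0, 0, 0, 0, $tw, $th, $w, $h);
    $ok = imagejpeg($thumb, $destPath, 85); // 85 = JPEG quality
    imagedestroy($src);
    imagedestroy($thumb);
    return $ok;
}
```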
You can wrap the entire link entity up as an object for handling expiration/etc if you want to eventually delete the image/thumbnail files on your own server. I'll leave particular implementation up to you since you asked for a high level idea.
But the thing is that in a lot of situations the text returned by Facebook is not available on the page.
Have you looked at the page's meta tags? I've tested with a few pages so far, and that is generally where content not otherwise visible on the rendered page comes from; it seems to be the first choice for Facebook's algorithm.