
I am looking for a way to create functionality similar to what happens when you post a link to an existing website on Facebook. If that statement is ambiguous, I will elaborate.

When you paste your link and submit your post, Facebook shows, alongside your link, a small preview of the page you are posting (some text and maybe a small image).

What are the ways to achieve this?

I read a similar post, but I do not really need the image; text will be sufficient.

I am working in PHP, but the language is not important because I am looking for a high-level idea. I was previously thinking about fetching the link's content with cURL and parsing it, but the problem is that in a lot of situations the text Facebook shows is not available on the page.

Are there other ways?

Salvador Dali

2 Answers


From what I can tell, Facebook pulls from the content attribute of the <meta name="description"> tag on the linked page.

If no meta description tag is available, it seems to pull from the beginning of the first paragraph <p> tag it can find on the page.

Images are pulled from available <img> tags on the page, with a carousel selection available to pick from when posting.

Finally, the link subtext is also user-editable (start a status update, include a link, and then click in the link subtext area that appears).

Personally, I would go this route: cURL the page; parse it for a meta description tag, and if there isn't one, grab some likely text using a basic algorithm or just take the first paragraph tag; then allow the user to edit whatever was presented (it's friendlier to the user and also sidesteps pages that return different content depending on user agent). Do the user-facing control as AJAX so that you don't have issues with however long it takes your server to fetch the link you want to preview.

I'd recommend using a DOM library (you could even use DOMDocument if you're comfortable with it and know how to handle possibly malformed HTML pages) instead of regex to parse the page for the <meta>, <p>, and potentially also <img> tags. Building a regex that properly handles all of the myriad cases you will encounter "in the wild", versus a known set of sites, gets very rough. QueryPath usually comes recommended, and there are Stack Overflow threads covering many of the available options. A rough sketch of the approach follows.
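
To make that route concrete, here is a minimal sketch using cURL plus DOMDocument; the function name, timeout, and 300-character fallback cutoff are arbitrary illustration choices, not anything Facebook specifies:

    <?php
    // Sketch only: fetch the page, prefer the meta description,
    // fall back to the first <p> on the page.
    function fetchPreviewText($url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        curl_close($ch);
        if ($html === false) {
            return null;
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings from malformed markup

        // First choice: <meta name="description" content="...">
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            if (strtolower($meta->getAttribute('name')) === 'description') {
                return trim($meta->getAttribute('content'));
            }
        }

        // Fallback: the beginning of the first paragraph.
        $p = $doc->getElementsByTagName('p')->item(0);
        return $p ? trim(substr($p->textContent, 0, 300)) : null;
    }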

Most modern sites, especially larger ones, are good about populating the meta description tag, especially for dynamically generated pages.

You can scrape the page for <img> tags as well, but you'll want to host the images locally: either host all of the images and then delete all except the one chosen, or host thumbnails (assuming you have an image processing library installed and enabled). Which you choose depends on whether bandwidth and storage matter more than the one-time cost of running imagecopyresampled, imagecopyresized, Gmagick::thumbnailimage, etc. (pick whatever you have at hand/your favorite). You don't want to hotlink to the images on the original page, both because of the bandwidth you'd be leeching and because you're likely to end up with broken images on any site with hotlink prevention (referrer checks and the like), or when the images expire. Personally, I would probably go for storing thumbnails.
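
As a rough illustration of the thumbnail option, here is a GD-based sketch built around imagecopyresampled (the function name, JPEG output, and 100px width are example choices; swap in whichever image library you actually have):

    <?php
    // Sketch only: assumes the GD extension is enabled.
    function makeThumbnail($srcPath, $dstPath, $thumbWidth = 100) {
        $info = getimagesize($srcPath);
        if ($info === false) {
            return false;
        }
        list($w, $h) = $info;
        $src = imagecreatefromstring(file_get_contents($srcPath));
        if ($src === false) {
            return false;
        }
        $thumbHeight = (int) round($h * ($thumbWidth / $w));
        $thumb = imagecreatetruecolor($thumbWidth, $thumbHeight);
        imagecopyresampled($thumb, $src, 0, 0, 0, 0,
                           $thumbWidth, $thumbHeight, $w, $h);
        $ok = imagejpeg($thumb, $dstPath, 85); // 85 = JPEG quality
        imagedestroy($src);
        imagedestroy($thumb);
        return $ok;
    }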

You can wrap the entire link entity up as an object to handle expiration and the like, if you want to eventually delete the image/thumbnail files on your own server. I'll leave the particular implementation up to you since you asked for a high-level idea.

"...but the problem is that in a lot of situations the text Facebook shows is not available on the page."

Have you looked at the page's meta tags? I've tested a few pages so far, and that is generally where content not otherwise visible on the rendered page comes from; it seems to be the first choice for Facebook's algorithm.

taswyn

Full disclosure up front: I'm a developer at ThumbnailApp.com.

It's a JSON API service with an optional JavaScript SDK that I think does exactly what you're after: it will parse a string to detect any URLs and return the title, description, and thumbnail of the asset. If the page has OpenGraph tags, it will use those for the image thumbnail. It's currently in private beta, but we're adding more accounts each week.

If you feel that you really need a do-it-yourself solution:

Check out the Python-based webkit2png and the headless browser PhantomJS. They can render web pages to an image (the default size is 800x600); you'll then have to write some code to resize and crop the image, as taswyn mentioned. Ideally you would upload the resized image to Amazon S3 and serve it from a CDN such as CloudFront.
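
For instance, assuming PhantomJS is installed on the server, you could shell out to its bundled examples/rasterize.js script from PHP (the binary path, script path, and function name below are assumptions for illustration):

    <?php
    // Sketch only: render a URL to a PNG via PhantomJS.
    function renderPageToImage($url, $outputPng) {
        $cmd = sprintf(
            'phantomjs /path/to/phantomjs/examples/rasterize.js %s %s',
            escapeshellarg($url),
            escapeshellarg($outputPng)
        );
        exec($cmd, $output, $exitCode);
        return $exitCode === 0 && file_exists($outputPng);
    }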

To get the title and description, first fetch the URL content (with cURL or whatever) and check the Content-Type header to make sure it's a web page. If it is, you can then use an HTML parser such as the SimpleHTMLDOM PHP library to grab the title and description metadata. If you want it to work exactly like Facebook, you will also need to check for any OpenGraph tags, specifically the og:image tag.
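
A minimal sketch of that flow, using PHP's built-in DOMDocument in place of SimpleHTMLDOM (the function name and array keys are my own naming; og:image and og:description are standard OpenGraph properties):

    <?php
    // Sketch only: fetch, verify content type, then read the title,
    // meta description, and OpenGraph tags.
    function getLinkMetadata($url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 10,
        ));
        $html = curl_exec($ch);
        $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
        curl_close($ch);

        // Only try to parse actual web pages.
        if ($html === false || strpos((string) $contentType, 'text/html') === false) {
            return null;
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html);

        $meta = array('title' => null, 'description' => null, 'image' => null);
        $titleNode = $doc->getElementsByTagName('title')->item(0);
        if ($titleNode) {
            $meta['title'] = trim($titleNode->textContent);
        }
        foreach ($doc->getElementsByTagName('meta') as $tag) {
            $name = strtolower($tag->getAttribute('name'));
            $prop = strtolower($tag->getAttribute('property'));
            if ($name === 'description' && $meta['description'] === null) {
                $meta['description'] = $tag->getAttribute('content');
            } elseif ($prop === 'og:description') {
                $meta['description'] = $tag->getAttribute('content');
            } elseif ($prop === 'og:image') {
                $meta['image'] = $tag->getAttribute('content');
            }
        }
        return $meta;
    }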

Also, don't forget about caching. The first render and description parse can take a long time. Even if your site is fast, the page you're rendering could be slow, so the best approach is to render/parse it once, then save and return the resized image and metadata for subsequent requests. Depending on your requirements, you may need to refresh the cached data every hour, or you may get away with refreshing it once a day.
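
One possible shape for that, reusing the getLinkMetadata sketch from above with a simple file-based cache (the one-hour TTL and /tmp path are arbitrary example values):

    <?php
    // Sketch only: parse once, then serve cached metadata.
    function getCachedMetadata($url, $ttl = 3600) {
        $cacheFile = '/tmp/linkcache_' . md5($url) . '.json';
        if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
            return json_decode(file_get_contents($cacheFile), true);
        }
        $meta = getLinkMetadata($url); // the slow fetch/parse runs once
        if ($meta !== null) {
            file_put_contents($cacheFile, json_encode($meta));
        }
        return $meta;
    }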

Doing it yourself takes quite a bit of work and a lot of server configuration. I feel that using a third-party service is a better way to go, but obviously I have a biased opinion :)

Pete