
Here's an example URL:

http://www.motherjones.com/mojo/2012/05/reince-priebus-lgbt-workplace-discrimination

The above used to pull in no image, title or description when pasted into the Facebook status update box -- it remained a bare URL. I then ran it through the debugger, which found no problems. It now pulls in the headline, image and description when pasted into the status update box.

For comparison, here's a post I have not yet debugged. It does not transform when pasted into the update box. As soon as I or anyone else runs it through the debugger, however, it will start pulling in the headline (although this one doesn't have an image or description).

http://www.motherjones.com/kevin-drum/2012/05/health-insurers-required-credit-obama-when-sending-out-rebate-checks

This could simply be a timing issue -- FB is slow to prepare the metadata on our pages -- but we have noticed that it takes hours, maybe days for the sharing to start working properly. That's long after the piece has peaked in traffic, so it does us little good.

We started seeing this around April 9.

My question: is there something about our pages that is making Facebook slow to scrape them? What am I missing? If there is a problem, why doesn't the debugger tell me? It does seem like there's a slightly updated version of the doctype to try, but that doesn't seem likely to be the culprit. Also -- is there any reason I shouldn't write a hook to run everything through the debugger at publish time?

Luke Smith
  • Should also note that clicking "like" on the page itself produces a normal share with metadata (but doesn't fix paste-in sharing). – Luke Smith May 14 '12 at 20:41

1 Answer


Facebook caches the scraped data on its side for faster responses when users share. The Like Button documentation says:

When does Facebook scrape my page?

Facebook needs to scrape your page to know how to display it around the site.

Facebook scrapes your page every 24 hours to ensure the properties are up to date. The page is also scraped when an admin for the Open Graph page clicks the Like button and when the URL is entered into the Facebook URL Linter. Facebook observes cache headers on your URLs - it will look at "Expires" and "Cache-Control" in order of preference. However, even if you specify a longer time, Facebook will scrape your page every 24 hours.

The user agent of the scraper is: "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

As you can see, when you use the linter (a.k.a. the debug tool) it clears the cache for that URL and replaces it with fresh data, which is why you get different sharing results after you debug the page. That doesn't quite square with your observation that it sometimes takes days, but maybe their documentation isn't completely accurate on that point; after all, they have a lot of pages to scrape.

If the page is new, that is, it wasn't scraped before, then there's no cache and you should get the right result when sharing; it's only when the og data has changed that you need to clear the cache. So if you update the data for an already-scraped page, be sure to debug it afterwards. You can simply issue an HTTP request from the server side to the same URL the debug tool uses; you don't need to use the web interface.
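For example, here is a minimal Python sketch of that server-side refresh. It assumes the Graph API accepts a POST with id and scrape=true (the same refresh the debug tool performs); newer API versions may also require an access_token, so treat the parameters as illustrative rather than definitive:

    import urllib.parse
    import urllib.request

    GRAPH_ENDPOINT = "https://graph.facebook.com/"

    def force_rescrape(page_url, access_token=None):
        # Ask Facebook to re-scrape page_url, clearing its cached og data.
        # This is the same refresh that running the URL through the debug
        # tool by hand would trigger.
        params = {"id": page_url, "scrape": "true"}
        if access_token:
            params["access_token"] = access_token
        data = urllib.parse.urlencode(params).encode("utf-8")
        request = urllib.request.Request(GRAPH_ENDPOINT, data=data)  # data => POST
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")  # JSON describing the object

    # Example: call this from a publish hook or whenever og tags change.
    # force_rescrape("http://www.motherjones.com/mojo/2012/05/reince-priebus-lgbt-workplace-discrimination")

You could run this from a cron job or a post-save hook, keeping in mind the caveat in the next paragraph about not firing it on every publish if the normal scrape already works.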

If things still don't work as you expect, you can check the user agent string of incoming requests and compare it with facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php). If it matches, log the response you send back and compare it with the results you get when sharing; if they're inconsistent, try to file a bug report. As for "hooking" a debugger request into every publish, I would advise against it: it's unnecessary traffic if things work as they should. I believe it's better to solve the problem than to rely on a workaround.
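Here is a rough sketch of that logging check, assuming a Python/WSGI stack; the middleware name is made up, so adapt it to whatever actually serves your pages:

    import logging

    FB_SCRAPER_UA = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

    class ScraperLogMiddleware:
        # Wraps a WSGI app and logs the status and body of every response
        # served to the Facebook scraper, so it can be compared against
        # what actually shows up when the URL is shared.
        def __init__(self, app):
            self.app = app
            self.log = logging.getLogger("fb.scraper")

        def __call__(self, environ, start_response):
            is_scraper = environ.get("HTTP_USER_AGENT", "") == FB_SCRAPER_UA
            captured = []

            def logging_start_response(status, headers, exc_info=None):
                if is_scraper:
                    self.log.info("scrape %s -> %s", environ.get("PATH_INFO"), status)
                return start_response(status, headers, exc_info)

            for chunk in self.app(environ, logging_start_response):
                if is_scraper:
                    captured.append(chunk)
                yield chunk

            if is_scraper and captured:
                self.log.info("scrape body: %s", b"".join(captured)[:2000])

Wrapping the application (app = ScraperLogMiddleware(app)) is enough to see exactly which og: tags Facebook received for any given URL.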

Nitzan Tomer
  • Thanks for your answer. I will look in logs for the scraper and check my cache headers. It seems that bad data / no data must be getting cached somehow. Sharing with the like button works normally even when paste-in sharing doesn't, and sharing continues to be bad, even after many likes and shares, until cache is cleared by the debugger. This isn't a case where we need to make sure updates get through -- the first scrape must be bad. It does eventually get the metadata but a delay of hours is enough to really hurt us. If I figure out what's causing this I'll be sure to update this space. – Luke Smith May 15 '12 at 17:16
  • I now have a new theory. We stage a lot of content unpublished. The logs reveal that FB is trying to hit this content and getting a 403 (as it should). Then the question is -- what causes FB to know about the unpublished page? Is it the like button itself, the SDK, or both/either? What do I have to keep off of unpublished pages to prevent a scrape? – Luke Smith May 15 '12 at 18:11
  • 1
  • There are some triggers for scraping a page, one of them being the rendering of a Like button. And if the URL returns a 403, that result gets cached. Do you use the same URLs for staging and production? – Nitzan Tomer May 15 '12 at 18:30
  • 1
  • Yup. And buttons are rendered on unpubbed pages. My likely solution will be to stop rendering the buttons on unpublished content and see if that prevents early scraping (sketched after these comments). The reason I didn't immediately conclude this was that we've been doing it for as long as we've been using the SDK (at least a year), and we only started seeing the problem around April 9. But maybe FB just changed their scraping algorithm. I'll confirm the fix and update here. – Luke Smith May 15 '12 at 19:32
  • 1
  • I can confirm the fix. Thanks. – Luke Smith May 21 '12 at 18:03
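A hypothetical sketch of the fix described in the comments above: render the Like button (and any other scrape trigger) only on published pages. The page dictionary here is made up for illustration; the point is simply to gate the button markup on publication status:

    LIKE_BUTTON = (
        '<div class="fb-like" data-href="{url}" '
        'data-send="false" data-layout="button_count"></div>'
    )

    def like_button_markup(page):
        # Emitting the button on unpublished (403) pages is what invites the
        # scraper too early and gets a bad result cached, so emit nothing.
        if not page.get("published"):
            return ""
        return LIKE_BUTTON.format(url=page["url"])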