My website is having difficulty with Facebook/LinkedIn/social crawlers, and I think this may be due to the redirect headers it seems to be returning. Browsers can access the site completely fine, but crawlers of all kinds only seem to be able to access it after multiple attempts.
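
To approximate what the crawlers see, a request along these lines (example.com standing in for my real domain, using Facebook's documented crawler user agent) should reproduce the first-attempt behaviour:

curl -I -A "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)" http://example.com/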

This is the (wordpress autogenerated) .htaccess I have running:

RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

Every page on my website (including the index) seems to return a 302 Found status code rather than the 200 I expect. I don't know whether a 302 is actually what I should be expecting, but the Facebook debugger complains that "URL requested a HTTP redirect, but it could not be followed" - and only on the first attempt.

Requesting the server root with curl -I returns, on the first attempt:

HTTP/1.1 302 Found
Connection: close
Pragma: no-cache
cache-control: no-cache
Location: /

with nothing else. (Does the fact that the Location is relative rather than absolute cause a problem? Supposedly the absolute-URI requirement from RFC 2616 no longer applies; RFC 7231 allows relative references in Location.)
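
To check whether the relative Location is actually what trips the crawler up, curl can be asked to follow the redirect itself; a quick test along these lines (example.com again a placeholder) shows whether the second hop goes through:

curl -IL http://example.com/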

Subsequent attempts return:

HTTP/1.1 200 OK
Date: Thu, 18 Sep 2014 14:59:54 GMT
Server: XXXXX
X-Powered-By: XXXXX
Set-Cookie: PHPSESSID=XXXXX; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Pingback: http://XXXXX/xmlrpc.php
Content-Type: text/html; charset=UTF-8

and the html follows as normal.

Is this to be expected? Why is the crawler not automatically following the redirect initially? Perhaps more strangely, why is the server returning this as a redirect at all?
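
In case it is relevant, one way I can think of to hunt for whatever is issuing the redirect is to grep the WordPress tree for code that sends its own Location header (the document root path here is only an example):

grep -rn --include='*.php' -e 'wp_redirect' -e "header('Location" /var/www/html/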

For completeness, my DNS has an A record pointing to the IP of the dedicated server. I have read that some DNS setups might cause issues like these, but I can't see why mine would.
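
For reference, the record can be checked directly with something like the following (example.com again standing in for the real domain), and it resolves straight to the dedicated server's IP:

dig +short example.com A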

tom
  • Don't you get a `Server` or `X-Powered-By` header on the first request? That seems strange, like there's another webserver answering. – jornane Sep 18 '14 at 17:27
  • Nope, that's all I receive on the first request. I've checked in curl verbose mode, and it's connecting to the same IP on port 80 in both requests. Apache and gunicorn are both running, but gunicorn won't serve outside of a specific subdirectory and always provides a Server header. – tom Sep 18 '14 at 18:02
  • Your configuration shows `index.php`, but you talk about gunicorn (which is a Python WSGI server), maybe you can give some more information about your setup? – jornane Sep 19 '14 at 06:16
  • It is just a standard LAMP (PHP) setup. A certain part of the site requires Python. Gunicorn only serves (through a reverse proxy) when a specific subdirectory is requested - this part is working correctly. – tom Sep 19 '14 at 11:24

0 Answers