I'm having some difficulty with a super simple htaccess redirect.

All I want to do is redirect absolutely everything, except a couple of files.

htaccess looks like this:

RewriteEngine On
RewriteCond %{REQUEST_URI} !sitemap
RewriteCond %{REQUEST_URI} !robots
RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]

The part that works is that everything gets redirected to the new domain, as it should be. I can also access robots.txt without being forwarded, but not sitemap.xml. If I try to go to sitemap.xml, the request is forwarded anyway and opens the sitemap file on the new domain.

I have this exact same issue when trying to "ignore" index.html. I can ignore robots, I can ignore alternate html or php files, but if I want to ignore index.html, the regex fails.

Since I can't actually SEE what is in the REQUEST_URI variable, my guess is that somehow index.html and sitemap.xml are some kind of "special" files that don't end up in REQUEST_URI? I know this because of a stupid test. If I choose to ignore index.html like this:

RewriteCond %{REQUEST_URI} !index.html

Then if I type example.com/index.html I will be forwarded. But if I just type example.com/ the ignore actually works and it shows the content of index.html without forwarding!

How is it that when I choose to ignore the regex "index.html", it only works when "index.html" is not actually typed in the address bar!?!

And it gets even weirder! Should I type something like example.com/index.html?option=value, then the ignore rule works and I do NOT get forwarded when there are attributes like this. But index.html by itself doesn't work, and then just having the slash root, the rule works again.

I'm completely confused! Why does it seem like REQUEST_URI is not able to see some filenames like index.html and sitemap.xml? I've been Googling for 2 days and not only can I not find out if this is true, but I can't seem to find any websites which actually give examples of what these htaccess server variables actually contain!

Thanks!

Vigilante
  • is that all that's in your htaccess file? – Jon Lin Sep 17 '14 at 17:25
  • Yes, it is a very basic file. And I'm testing on two completely different servers. My guess is if something is "getting in the way" it could be in the server config itself. Perhaps it has something to do with index.html being set as a default index file? I just can't figure it out. – Vigilante Sep 17 '14 at 17:40
  • The htaccess looks like this (doing some testing): `RewriteCond %{REQUEST_URI} !(sitemap|index|alternate|alt) [NC] RewriteRule .* alternate.html [R,L]` Again, if I try to visit sitemap, alternate, or alt files, it is NOT redirected, but if I visit index.html, I am redirected. It's as if index.html is excluded from REQUEST_URI?? I can't confirm this. I upgraded Apache to version 2.4.9. – Vigilante Sep 18 '14 at 23:52

2 Answers

my guess is that somehow index.html and sitemap.xml are some kind of "special" files that don't end up in REQUEST_URI?

This is not true. There is no such special treatment of any requested URL. The REQUEST_URI server variable contains the URL-path (only) of the request. This notably excludes the scheme + hostname and any query string (which are available in their own variables).

However, if there are any other mod_rewrite directives that precede this (including the server config) that rewrite the URL then the REQUEST_URI server variable is also updated to reflect the rewritten URL.
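As a rough sketch of what that means for the rule in the question: since REQUEST_URI holds the full URL-path, starting with a slash, the exceptions can be anchored to exact paths rather than matching a substring anywhere in the URL. This assumes both files sit in the document root and that the .htaccess file lives there too; adjust the paths if that is not the case.

```apache
RewriteEngine On

# REQUEST_URI is the URL-path only, e.g. "/sitemap.xml" (no host, no query string).
# Anchoring with ^ and $ excludes exactly these two files and nothing else.
# Assumption: both files are in the document root.
RewriteCond %{REQUEST_URI} !^/sitemap\.xml$
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]
```

With the unanchored `!sitemap` from the question, any URL that merely contains "sitemap" somewhere in its path would also be excluded, which may or may not be what you want.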

index.html (Directory Index)

index.html is possibly a special case. Although, if you are explicitly requesting index.html as part of the URL itself (as you appear to be doing) then this does not apply.

If, on the other hand, you are requesting a directory, e.g. http://example.com/subdir/, and relying on mod_dir issuing an internal subrequest for the directory index (i.e. index.html), then the REQUEST_URI variable may or may not contain index.html, depending on the version of Apache (2.2 vs 2.4) you are on. On Apache 2.2, mod_dir executes first, so you would need to check for /subdir/index.html. However, on Apache 2.4, mod_rewrite executes first, so you simply check for the requested URL: /subdir/. It's safer to check for both, particularly if you have other rewrites and there is a possibility of a second pass through the rewrite engine.
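Checking for both forms might look something like the following. This is only a sketch for the document root case from the question (example.com/ and example.com/index.html), assuming index.html is the configured directory index and the .htaccess file is in the document root.

```apache
RewriteEngine On

# Matches "/" (the bare directory request) as well as "/index.html"
# (the explicit request, or the URL after mod_dir's internal subrequest),
# so the exception holds regardless of which module runs first.
RewriteCond %{REQUEST_URI} !^/(index\.html)?$
RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]
```

For a subdirectory, the same idea would be `!^/subdir/(index\.html)?$`.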

Caching problems

However, the most probable cause in this scenario is simply a caching issue. If the 301 redirect has previously been in place without these exceptions then it's possible these redirections have been cached by the browser. 301 (permanent) redirects are cached persistently by the browser and can cause issues with testing (as well as your users that also have these redirects cached - there is little you can do about that unfortunately).

RewriteCond %{REQUEST_URI} !(sitemap|index|alternate|alt) [NC]
RewriteRule .* alternate.html [R,L]

The example you presented in comments further suggests a caching issue, since you are now getting different results for sitemap than those posted in your question. (It appears to be working as intended in your second example).
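One way to keep cached redirects from muddying the results is to test with a temporary (302) redirect and only switch to 301 once the exceptions behave as expected. A sketch using the rule from the question, with only the status code changed:

```apache
RewriteEngine On
RewriteCond %{REQUEST_URI} !sitemap
RewriteCond %{REQUEST_URI} !robots
# R=302 while testing, since browsers do not cache it persistently the way they do a 301.
# Change back to R=301 once the exceptions are confirmed to work.
RewriteRule ^(.*)$ http://example.com/$1 [L,R=302]
```

Any 301 responses the browser has already cached will still need clearing (or test in a private window or with a different browser) before the new behaviour becomes visible.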

Examining Apache server variables

@zzzaaabbb mentioned one method to examine the value of the Apache server variable. (Note that the Apache server variable REQUEST_URI is different to the PHP variable of the same name.) You can also assign the value of an Apache server variable to an environment variable, which is then readable in your application code.

For example:

RewriteRule ^ - [E=APACHE_REQUEST_URI:%{REQUEST_URI}]

You can then examine the value of the APACHE_REQUEST_URI environment variable in your server-side code. Note that if you have any other rewrites that cause the rewriting process to start over, then you could get multiple env vars, each prefixed with REDIRECT_.
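If you would rather not write any server-side code, another option (not something the question uses, just a sketch) is to echo the environment variable back as a response header and inspect it in the browser's network tools. This assumes mod_headers is enabled; the header name X-Apache-Request-Uri is made up for illustration.

```apache
RewriteEngine On

# Copy Apache's REQUEST_URI into an environment variable (as above).
RewriteRule ^ - [E=APACHE_REQUEST_URI:%{REQUEST_URI}]

# Assumption: mod_headers is available. "always" is used so the header is
# also attached to redirect responses, not just successful ones.
# X-Apache-Request-Uri is a hypothetical header name chosen for this test.
Header always set X-Apache-Request-Uri "%{APACHE_REQUEST_URI}e"
```

If the header comes back empty, the variable may have been renamed with the REDIRECT_ prefix after an internal redirect, in which case try `%{REDIRECT_APACHE_REQUEST_URI}e` instead.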

MrWhite

With the index.html problem, you probably just need to escape the dot (index\.html). You are in the regex pattern-matching area on the right-hand side of RewriteCond. With the un-escaped dot in there, there would need to be a character at that spot in the request, to match, and there isn't, so you're not matching and are getting the unwanted forward.

For the sitemap not matching problem, you could check what REQUEST_URI actually contains by creating an empty dummy file (to avoid throwing a 404) and then adding a redirect at the top of your .htaccess. Then, in the browser's address bar, type in any URL you want to see the REQUEST_URI for, and the value will show in the address bar after the redirect.

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^ /test.php?var=%{REQUEST_URI} [NE,R,L]

Credit MrWhite with that easy test method.

Hopefully that will show that sitemap in URL ends up as something else, so will at least partially explain why it's not pattern-matching and preventing redirect, when it should be pattern-matching and preventing redirect.

I would also test by being sure that the server isn't stepping in front of things with custom 301 directive that for whatever reason makes sitemap behave unexpectedly. Put this at the top of your .htaccess for that test.

ErrorDocument 301 default

Rabbid76
zzzaaabbb
  • "With the un-escaped dot in there, there would need to be a character at that spot in the request, to match, and there isn't" - Ah, but there is, there's a dot! An unescaped dot matches almost _anything_, including a literal dot. By escaping the dot, it will _only match a dot_. Whilst the dot should be escaped (in order to avoid matching too much), it would make no difference whether it is escaped or not in the regex as to whether it matches the requested URL in this case. – MrWhite Jun 07 '19 at 00:04
  • Many of the OP's symptoms could simply be the result of a _cached_ redirect. If they have been testing with 301s (as opposed to 302s) then these are likely to be cached (persistently) by the browser - and can often make testing problematic. If the _exception_ (i.e. negated condition) is added later then the browser may not see it, as the 301 is already cached. This is further backed up by the OP's later test in comments (the next day) - now, "sitemap" is _magically_ excluded from the redirect, which contradicts the OP's first example the day before. (?) – MrWhite Jun 07 '19 at 00:15
  • Nice suggestion about the `ErrorDocument` directive. However, that can't be the problem in this instance, since the redirect IS occurring when it shouldn't be. An overriding error document could theoretically prevent (or change) the 301 if it was supposed to occur, but in this case the desired action is "nothing": the 301 should not be triggered in the first place, so the defined `ErrorDocument` wouldn't apply. – MrWhite Jun 07 '19 at 00:20