
quick ref: area = portal type page.

I would like old urls http://domain.com/long/rubbish/url/blah/blah/index.cfm?id=12345 to redirect to http://domain.com/area/12345-short-title

http://domain.com/area/12345-short-title should display the content.

So far I have worked out that I could use Apache to rewrite all URLs to

http://domain.com/index.cfm/long/rubbish/url/blah/blah/index.cfm?id=12345 and http://domain.com/index.cfm/area/12345-short-title

The index.cfm will either serve the content or apply a permanent redirect, but it will need to get the title and area information from the database first.

There are 50,000 pages on this website. I also have other ideas for subdomain redirects, and permanent subdomains and controlling how they act through the index.cfm.

Infrastructure are keen to do as much through Apache rewrite as possible, as we suspect it would be faster. However, I'm not sure we have that choice if we need to get the area and title information for each page.

Has anyone got some experience with this that can provide input?

--

Something to note, I'm assuming we'll have to keep all the internal URLs used on the website in the old format. It would be a mega job to change them all.

This means all internal URLs will have to use a permanent redirect every time.

Daniel Cook
  • How many areas are you dealing with? Is it a fairly static list? – James Lawruk Oct 11 '12 at 12:49
  • Well there are 100 areas, and there are 50,000 pages. It grows/changes daily. Areas are space missions, and the pages are content on those missions and 'belong' to the specific areas. – Daniel Cook Oct 11 '12 at 13:01
  • I don't understand what you mean with _"...need to get the title and area information from the database first."_ - why do you need those? – Peter Boughton Oct 11 '12 at 13:41
  • If someone tried to access the website with an old URL http://domain.com/long/rubbish/url/blah/blah/index.cfm?id=12345, my expectation was that they would be redirected to the new URL instead. Is this the normal way of things? To do this the 'controller' would need to know the title and area names. – Daniel Cook Oct 11 '12 at 13:56
  • Yep, and when you send a 301 or 302 response that's what you get. But since you've got different URL formats, the lookup only needs to be done for old URLs. I don't see the benefit in sending both types to a single index.cfm – Peter Boughton Oct 11 '12 at 14:00

2 Answers


Rather than redirecting both groups of URLs to the same script, why not simply send them to two distinct scripts?

Simply like this:

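# New-style URLs (area/12345-short-title) go to the content script
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^\w+/\d+-[\w-]+$ /content.cfm/$0 [L]

# Everything else that is not a real file goes to the redirect script
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^.* /redirect.cfm/$0   [L,QSA]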

Then, the redirect.cfm can lookup the replacement URL and do the 301 redirect, whilst content.cfm simply serves the content.

(You haven't specified how your CF is set up; you may need to update the JRun/Tomcat/other config to support /content.cfm/* and /redirect.cfm/* - it'll be done the same way as it's done for index.cfm)


For performance reasons, you still want to avoid the database hits for redirecting if you can, and you can do that by generating a rewrite rule for each page that performs the 301 redirect on the Apache side. This can be as simple as appending a line to the .htaccess file, like so:

<cfset NewLine = 'RewriteRule #ReEscape(OldUrl)# #NewUrl#   [L,QSA,R=301]' />

<cffile action="append" file="#ExpandPath('./.htaccess')#" output="#NewLine#" />

(Where OldUrl and NewUrl have been looked-up from the database.)
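
One thing worth checking before generating rules this way: a RewriteRule pattern is matched against the URL path only, never the query string, so an old URL that is identified by ?id=12345 needs a RewriteCond on %{QUERY_STRING} as well. A minimal sketch of what one generated rule could look like in .htaccess, using the example URLs from the question (the exact old path is an assumption):

```apache
# Sketch of one generated redirect for page id 12345 (example values from the question).
# The RewriteRule pattern only sees the path, so the id is tested separately.
RewriteCond %{QUERY_STRING} (^|&)id=12345(&|$)
# The trailing "?" on the target drops the old query string from the redirect.
RewriteRule ^long/rubbish/url/blah/blah/index\.cfm$ /area/12345-short-title? [R=301,L]
```

This sketch leaves out QSA on purpose, since QSA would carry the old id=12345 parameter over to the new URL.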

You might also want to investigate using mod_alias redirect instead of mod_rewrite RewriteRule, where the syntax would be Redirect permanent #OldUrl# #NewUrl# - since the OldUrl is an exact path match it would likely be faster.

Note that these rules will need to be checked before the above redirect.cfm redirect is done - if they are in the same .htaccess you can't simply do an append, but if they are in the site's general Apache config files then the .htaccess rules will be checked first.

Also, as per Sharon's comment, you should verify if your Apache will handle 50k rules - whilst I've seen it reported that "thousands" of regex-based Apache rewrites are perfectly fine, there may well be some limit (or at least the need to split across multiple files).
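
If the per-page rule count does turn out to be a problem, one alternative (not part of the approach above) is mod_rewrite's RewriteMap, which keeps the id-to-URL pairs in a lookup file and needs only a single rule. A rough sketch, assuming you have access to the server or virtual-host config (RewriteMap cannot be declared in .htaccess); the map name oldids and the file path are hypothetical:

```apache
# Server or virtual-host context. Map name and file path are made up for illustration.
# redirects.txt holds one "id new-path" pair per line, e.g.:  12345 area/12345-short-title
RewriteEngine On
RewriteMap oldids "txt:/etc/apache2/redirects.txt"

# Capture the numeric id from the query string as %1.
RewriteCond %{QUERY_STRING} (?:^|&)id=(\d+)(?:&|$)
# Only redirect when the map actually has an entry for that id.
RewriteCond ${oldids:%1} !=""
# Look the id up again for the target; the trailing "?" drops the old query string.
RewriteRule ^/.*index\.cfm$ /${oldids:%1}? [R=301,L]
```

Ids with no map entry fall through to whatever rule comes next (such as the redirect.cfm catch-all), and CF would append new pairs to the text file rather than to .htaccess. A dbm: map is an option if the plain-text lookup proves slow.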

Peter Boughton
  • Thanks, I will keep this in mind. I was more concerned about the performance and necessity of using ColdFusion to do what I described. – Daniel Cook Oct 11 '12 at 14:03
  • Heh, just been editing - once CF has looked up the URL once it doesn't need to do it again - it can append a generated rule to a .htaccess file – Peter Boughton Oct 11 '12 at 14:06
  • This is pretty cool. But as someone who's had to deal with generated .htaccess files, I'd *freak out* if I opened one up that had cached redirects for 50,000 possible urls. It makes me wonder if there is a theoretical and/or practical limit to the number of redirect rules you can have in one .htaccess. – Sharondio Oct 11 '12 at 14:33
  • Also, don't forget appropriate use of <cflock> around all of those writes and reads. – Sharondio Oct 11 '12 at 14:35
  • Well it's certainly possible to have "thousands of regex-based rewrite rules" for a high profile site ([see last para](https://groups.google.com/group/railo/msg/6fa1cb855f16af39?noredirect)) but yeah, worth investigating if that might be an issue. Also, I suspect (once all rules are generated/identified) it might be possible to compact them in some fashion - in theory only one rule is needed per area, rather than per page, so there's probably some way to reduce it down significantly. – Peter Boughton Oct 11 '12 at 14:42
  • I would _hope_ that a single cffile/append would not need any extra locking - it's opening the file, adding a line, writing, then releasing it again - should be an atomic operation for CF. But yeah, I haven't checked if that's the case, so might be an issue. – Peter Boughton Oct 11 '12 at 14:45
  • Each page would have a unique title in the URL, so it is not obvious to me how that could be condensed into fewer than 1 rule per URL. – Daniel Cook Oct 11 '12 at 14:52
  • Well if some of your old/long URLs also contain that short title in any way, you can use regex backreferences. - i.e. `/long/(area)/stuff/(title)/index.cfm?id=(number)` to `$1/$3-$2` - but yeah, it depends whether any of your URLs have that structure (and whether there's a programmatic way to identify those that do). – Peter Boughton Oct 11 '12 at 14:58
  • Unfortunately they don't contain any useful information that can be reused (except for the id!) – Daniel Cook Oct 11 '12 at 15:08

Using Apache rewrites would only be faster if they were static rewrites, or if they all followed some rule that you could write as a regex within the .htaccess file. If you're having to touch the database for these redirects, then it may not make sense to do it in .htaccess.

Another approach is the one used by most CMSs for handling virtual directories and redirects. An index.cfm file at the root of the site handles all incoming requests and returns the correct pages and pathing. MURA CMS uses this approach (as well as Joomla and most of the others.)

Basically you're using the CGI.path_info variable on an incoming request, searching for it in your DB, and doing a redirect to the new path. As usual, Ben Nadel has a good write-up of how to use this approach: Ben Nadel: Using IIS URL Rewriting And CGI.PATH_INFO With IIS MOD-Rewrite

You can, however, use the .htaccess to remove the "index.cfm" from the url string entirely if you want by redirecting all incoming requests to the root URL with something that looks like this in your .htaccess:

RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
RewriteRule ^([a-zA-Z0-9-]{1,})/([a-zA-Z0-9/-]+)$ /$1/index.cfm/$2 [PT]

Basically this would internally rewrite something like http://www.yourdomain.com/area/your-new-url/ to http://www.yourdomain.com/area/index.cfm/your-new-url/, where it could be processed as described by the blog post above. The user would never see the index.cfm.

Sharondio
  • Thanks, I have seen Ben's article. My 'question' describes use of the /index.cfm/ and I expected to use the CGI.PATH_INFO. So I think what you've described is what I had already intended to do. Funnily enough it was after trying out Mura that I got really interested in finally implementing a SEO URL solution. That and I intend to solve the problem of the site being indexed through many subdomains! – Daniel Cook Oct 11 '12 at 15:11