Recently, I came across a fantastic article on fixing duplicate content and URL issues on Apache, and I could not resist blogging about it.

Recently, we’ve had a lot of discussion about domain and URL canonicalization, mainly centered around avoiding duplicate-content problems in Google. There has also been some discussion of fixing type-in URLs, typos in inbound links, and badly-coded inbound links.

To be clear, a “canonical” domain is the single domain you want your site to be known by, and a canonical URL is the single URL you want your page to be known by. Any others are non-canonical.

The word canonical is a religion-related term, and means “according to canon law, scripture or doctrine.” But in general use, it just means “usual, standard, conventional, customary, or according to the rules.” So as a Webmaster, you choose what single domain you want to use for your site, and what single URL should be used to request each of your pages.

Member g1smd has posted in several of these threads the very good advice that it’s best to avoid “stacked redirects” (multiple redirects invoked by a single client request) while doing things like index page and domain canonicalization. WebmasterWorld admin tedster reiterated this in a recent thread.
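To make the “stacked redirects” idea concrete, here is a minimal Python sketch (not the article's code, and not mod_rewrite). It assumes a hypothetical canonical host, www.example.com, and only two fix-ups. It compares a server that issues one redirect per fix-up with one that composes all fix-ups and answers with a single redirect.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # hypothetical canonical domain (assumption)


def fix_host(url):
    """Point the URL at the canonical host."""
    return urlunsplit(urlsplit(url)._replace(netloc=CANONICAL_HOST))


def fix_index(url):
    """Turn a trailing /index.html into a bare trailing slash."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return urlunsplit(parts._replace(path=path))


FIXES = [fix_host, fix_index]


def stacked_hops(url):
    """One redirect per fix-up: every Location the client has to follow."""
    hops = []
    for fix in FIXES:
        fixed = fix(url)
        if fixed != url:
            hops.append(fixed)
            url = fixed
    return hops


def single_hop(url):
    """All fix-ups composed first, then at most one redirect."""
    fixed = url
    for fix in FIXES:
        fixed = fix(fixed)
    return [fixed] if fixed != url else []
```

For a request to http://example.com/index.html, stacked_hops returns two hops (first the host fix, then the index fix), while single_hop returns only the final URL. The single-hop version is the behavior the advice above is asking for.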

I have coded various routines to do these kinds of fix-ups on an ad-hoc basis, but have never actually written a single-redirect-does-it-all solution. Actually, that’s not quite true — I had *tried* before, but a nasty mod_rewrite bug in Apache 1.3.x had repeatedly stymied my efforts.

However, when I returned to the subject after almost a year spent experimenting and dashing off code in the WebmasterWorld Apache forum, I found that one trick I had figured out along the way is a work-around for the bug.

So I set out anew to create a domain/URL canonicalization and type-in fix-up routine that would do the following:

  • Canonicalize the domain (e.g. redirect non-www and IP address to www)
  • Canonicalize my index pages (redirect “/index.html” to “/”)
  • Remove multiple slashes in the URL
  • Remove spurious query strings (my sites’ pages are mostly “static” with a few exceptions)
  • Fix up common typos in type-in URLs
  • Fix up invalid inbound links caused by bad HTML mark-up
  • Fix up URLs resulting from bad copy-and-pastes
  • Fix up outdated or otherwise incorrect query strings
  • Suppress the fix-up redirect if the resulting URL does not resolve to an existing file
  • Suppress the fix-up if the link is on my own site (In this case, I want to see the 404 error)
  • Suppress the fix-up if the remote user is me or a site tester (Again, we want to see the 404 error)
  • Avoid recursion in mod_rewrite running in a per-directory .htaccess context
  • Avoid the nasty mod_rewrite bug in Apache 1.3.x
  • Do all of the above using a single 301-Moved Permanently redirect
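The decision logic behind this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mod_rewrite routine the article goes on to build. The host www.example.com, the tester address, the set of “junk” trailing characters, the http scheme, and the function name canonical_redirect are all hypothetical. It also assumes every page is static, so any query string is dropped, and it does not model the Apache 1.3.x bug or .htaccess recursion at all.

```python
import re

CANONICAL_HOST = "www.example.com"         # hypothetical canonical domain
TESTER_IPS = {"192.0.2.10"}                # hypothetical tester address
TRAILING_JUNK = re.compile(r"[.,;:)>]+$")  # assumed copy-and-paste / mark-up debris


def canonical_redirect(host, path, query, file_exists,
                       referer_host="", remote_ip=""):
    """Return the single 301 target, or None when no redirect should be sent.

    file_exists is a caller-supplied function: path -> bool.
    """
    new_path = re.sub(r"/{2,}", "/", path)                    # collapse multiple slashes
    new_path = TRAILING_JUNK.sub("", new_path)                # strip trailing debris
    new_path = re.sub(r"(^|/)index\.html$", r"\1", new_path)  # /index.html -> /
    new_query = ""                                            # assumption: static pages only

    # Already canonical: nothing to do.
    if (host, path, query) == (CANONICAL_HOST, new_path, new_query):
        return None
    # Link from our own site, or a tester: let the error show instead.
    if referer_host == CANONICAL_HOST or remote_ip in TESTER_IPS:
        return None
    # Never redirect to something that does not exist.
    if not file_exists(new_path):
        return None
    return "http://" + CANONICAL_HOST + new_path
```

For example, with file_exists backed by the set {"/", "/docs/", "/page.html"}, a request for //docs//index.html?x=1 on example.com yields the one target http://www.example.com/docs/, while /nope.html, yields no redirect because the cleaned-up path does not exist. All fix-ups are computed before any response is chosen, which is what keeps the result to a single 301.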