The Scenario
You've created a web site that is secured, but you want the search engine spiders to be able to crawl the content. Okay, in this case it's not secured in the sense of needing a username and password; it's secured by requiring that visitors first acknowledge some terms and conditions before they can access any other portion of the site. This initial button click seems to stop all web spiders in their tracks, so your site's content never gets added to any search engine's index.
I posted a question regarding this situation on HouseofFusion and did get one interesting answer, but it didn't really address my dilemma. The answer I received was that it is possible to configure Google Analytics to be able to log in to your site in its efforts to index content, but in investigating this further I found that it is only for the purposes of serving ads contextually, not search engine results. Since this site isn't serving any ads and my goal is to help people find their way to the site's front door based on the content BEHIND the door, I needed another answer. Here's what I ended up doing:
In my Application.cfc, I have code in the onRequestStart method that checks whether the user has already acknowledged the disclaimer (by looking at a session variable that is set to true when they do). If true, the original request goes through; if false, the user is redirected to the disclaimer. I then added a private method to my Application.cfc, called "isSpider", that checks cgi.http_user_agent against a list of known spider agents and returns true or false. So, before I check the session variable's value, I first call the isSpider method. If the visitor IS a spider, I set the session variable to true before I do the redirection check against it. Here are the relevant methods:
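<cffunction name="onRequestStart" returntype="void" output="false">
    <cfargument name="targetPage" type="string" required="true" />
    <!--- Default the flag so the check below never hits an undefined variable --->
    <cfparam name="session.acknowledged" default="false" />
    <!--- Known spiders are treated as having acknowledged the disclaimer --->
    <cfif isSpider()>
        <cfset session.acknowledged = true />
    </cfif>
    <!--- Skip the redirect for the disclaimer page itself to avoid a redirect loop --->
    <cfif not session.acknowledged and listLast(arguments.targetPage, "/") neq "acknowledgeDisclaimer.cfm">
        <cflocation url="acknowledgeDisclaimer.cfm" addtoken="no" />
    </cfif>
</cffunction>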
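<cffunction name="isSpider" access="private" returntype="boolean" output="false" hint="I check the user agent string for the occurrence of any of the known spider user agent values">
    <!--- Var-scope the loop index so it stays local to this function --->
    <cfset var s = "" />
    <cfloop index="s" list="#application.spiderlist#">
        <!--- A case-insensitive substring match is enough to flag a known spider --->
        <cfif findnocase(s,cgi.http_user_agent) gt 0>
            <cfreturn true />
        </cfif>
    </cfloop>
    <cfreturn false />
</cffunction>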
In my onApplicationStart, I create the string of partial spider user agent values:
<cfset application.spiderlist = "Googlebot,Yahoo,msnbot,AOL,Ask Jeeves,Lycos" />
It is true that there are literally hundreds of other spiders running around out there, but rather than attempt to validate all possible indexers, I chose only the six that show up in my site analytics as the ones most people use to find my other sites. I also opted to check the user agent for any occurrence of a specific substring rather than match against the entire string, for efficiency's sake, since each search engine can have several different user agents (and those could change at any time!). For instance, Google has (to the best of my knowledge) the following User Agent values for its spiders:
- Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)
- Googlebot/2.1 (+http://www.google.com/bot.html)
- Googlebot/2.1 (+http://www.googlebot.com/bot.html)
- Googlebot/Test (+http://www.googlebot.com/bot.html)
Hence, my choice to simply search the user agent for the string "Googlebot" in order to determine if it was a Google spider or not.
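To illustrate why the substring check is enough, here is a minimal sketch that runs the same findNoCase test used in isSpider against a few sample strings. The Googlebot values are based on the list above; the browser-style string and the variable names (sampleAgents, agent) are made up for this example.

```cfml
<!--- Sample user agents: three Googlebot variants plus one made-up browser string --->
<cfset sampleAgents = [
    "Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "Googlebot/Test (+http://www.googlebot.com/bot.html)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0"
] />

<cfoutput>
<cfloop array="#sampleAgents#" index="agent">
    <!--- findNoCase returns the match position, or 0 when the substring is absent --->
    #htmlEditFormat(agent)#: #yesNoFormat(findNoCase("Googlebot", agent) gt 0)#<br />
</cfloop>
</cfoutput>
```

The three Googlebot variants should report Yes and the browser string No, so one list entry covers every variant without my having to track the version numbers.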
I found what appears to be a VERY comprehensive list of spider user agent values (and other metadata) at this URL: http://www.user-agents.org/index.shtml. They also offer RSS and XML feeds if anybody wants to do something really cool with the data.
I also used the following spider simulation site in order to test my code changes: http://tools.summitmedia.co.uk/spider/
Their user agent value looks like the following: "K2-Summit (+http://tools.summitmedia.co.uk/spider/) leond@summitmedia.co.uk", so I just added the value "K2-Summit" to my spiderlist variable in order to let them bypass the disclaimer acknowledgement.
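With that entry added, the line in onApplicationStart would look roughly like this (a sketch; where the new value sits in the list doesn't matter for the substring check):

```cfml
<!--- Same list as before, with the spider simulator's identifier appended for testing --->
<cfset application.spiderlist = "Googlebot,Yahoo,msnbot,AOL,Ask Jeeves,Lycos,K2-Summit" />
```

Since onApplicationStart only runs when the application starts, the application has to be restarted (or the variable reset some other way) before the new value is picked up.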
Though the site I based this post on doesn't require username and password authentication, I do believe it would be a simple matter to apply the same principle to a site secured in that manner: when a known spider arrives (one that YOU want crawling your site), simply issue it a visitor's pass in the form of manually set credentials and let it do its job!
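As a rough sketch of what that might look like, the onRequestStart below hands a known spider a logged-in session. I haven't built this; session.isLoggedIn, session.userRole, the "spider" role, and login.cfm are all hypothetical names standing in for whatever your own login code uses, and isSpider is the method shown earlier.

```cfml
<cffunction name="onRequestStart" returntype="void" output="false">
    <cfargument name="targetPage" type="string" required="true" />

    <!--- Hypothetical session keys; substitute whatever your login code sets --->
    <cfparam name="session.isLoggedIn" default="false" />
    <cfparam name="session.userRole" default="" />

    <!--- Issue a known spider its visitor's pass: a logged-in session with a limited role --->
    <cfif isSpider() and not session.isLoggedIn>
        <cfset session.isLoggedIn = true />
        <cfset session.userRole = "spider" />
    </cfif>

    <!--- Everyone else goes to the (hypothetical) login page, which is never redirected to itself --->
    <cfif not session.isLoggedIn and listLast(arguments.targetPage, "/") neq "login.cfm">
        <cflocation url="login.cfm" addtoken="no" />
    </cfif>
</cffunction>
```

Giving the spider its own role means the rest of the site can decide what that role is allowed to see. Because the user agent header is supplied by the client, I would only expose content to that role that I'm comfortable having indexed publicly.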
I am by no means a search engine guru, so if anybody out there knows a better way, sees any gaping, dangerous holes in my solution, or just has any suggestions or comments, please do share!
Doug out.