13 October 2008
Sneaking Spiders Past Security

The Scenario
You've created a web site that is secured, but you still want the search engine spiders to be able to crawl its content. In this case it isn't secured in the sense of needing a username and password; instead, the visitor must first acknowledge some terms and conditions before they can access any other part of the site. That initial button click seems to stop all web spiders in their tracks, so your site's content never gets added to the search engines' indexes.

I posted a question about this situation on House of Fusion and did get one interesting answer, but it didn't really address my dilemma. The answer was that it is possible to configure Google Analytics to log in to your site in its efforts to index content, but on investigating further I found that this is only for the purpose of serving contextual ads, not for search engine results. Since this site isn't serving any ads and my goal is to help people find their way to the site's front door based on the content BEHIND the door, I needed another answer. Here's what I ended up doing:

In my Application.cfc, the onRequestStart method checks whether the user has already acknowledged the disclaimer (by looking at a session variable that is set to 'true' when they do). If true, the original request goes through; if false, the visitor is redirected to the disclaimer. I then added a private method to Application.cfc, called "isSpider", that checks cgi.http_user_agent against a list of known spider agents and returns true or false. Before checking the session variable, I first call isSpider. If the visitor IS a spider, I set the session variable to true before doing the redirection check against it. Here are the relevant methods:

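<cffunction name="onRequestStart" returntype="void" output="false">
 <cfargument name="targetPage" type="string" required="true" />
 <!--- Default the flag so the check below never hits an undefined variable --->
 <cfparam name="session.acknowledged" default="false" />
 <!--- Known spiders are treated as if they had already accepted the disclaimer --->
 <cfif isSpider()>
  <cfset session.acknowledged = true />
 </cfif>
 <!--- Everyone else goes to the disclaimer; the disclaimer page itself is skipped to avoid a redirect loop --->
 <cfif not session.acknowledged and listLast(arguments.targetPage, "/") neq "acknowledgeDisclaimer.cfm">
  <cflocation url="acknowledgeDisclaimer.cfm" addtoken="no" />
 </cfif>
</cffunction>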

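<cffunction name="isSpider" access="private" returntype="boolean" output="false" hint="I check the user agent string for the occurrence of any of the known spider user agent values">
 <!--- Keep the loop index local to this function call --->
 <cfset var s = "" />
 <cfloop index="s" list="#application.spiderlist#">
  <!--- Case-insensitive substring match against the visitor's user agent --->
  <cfif findnocase(s,cgi.http_user_agent) gt 0>
   <cfreturn true />
  </cfif>
 </cfloop>
 <cfreturn false />
</cffunction>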

In my onApplicationStart, I create the string of partial spider user agent values:

<cfset application.spiderlist = "Googlebot,Yahoo,msnbot,AOL,Ask Jeeves,Lycos" />
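
For context, that line would sit inside the onApplicationStart method of the same Application.cfc, roughly as sketched below. Everything other than the spiderlist assignment is an assumption; a real application would probably initialize other things here as well.

```cfml
<cffunction name="onApplicationStart" returntype="boolean" output="false">
 <!--- Partial user agent strings; isSpider() does a case-insensitive substring match on each one --->
 <cfset application.spiderlist = "Googlebot,Yahoo,msnbot,AOL,Ask Jeeves,Lycos" />
 <!--- Returning true lets the application start normally --->
 <cfreturn true />
</cffunction>
```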

 

It is true that there are literally hundreds of other spiders running around out there, but rather than attempt to validate all possible indexers, I chose only the six that my analytics show bring the most visitors to my other sites. I also opted to check the user agent for any occurrence of a specific substring rather than match against the entire string, both for efficiency's sake and because each search engine can have several different user agents (and those could change at any time!). For instance, Google has (to the best of my knowledge) the following User Agent values for its spiders:

  • Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)
  • Googlebot/2.1 (+http://www.google.com/bot.html)
  • Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • Googlebot/Test (+http://www.googlebot.com/bot.html)

Hence, my choice to simply search the user agent for the string "Googlebot" in order to determine if it was a Google spider or not.
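
To see the substring check at work, here is a small scratch page that runs it against the Google values listed above plus one ordinary browser. It is only a sketch: the variable names `agents` and `i` are mine, and the Firefox string is just a sample of what a regular browser might send.

```cfml
<!--- Sample user agents: the four Google values from the post, plus one sample browser --->
<cfset agents = arrayNew(1) />
<cfset arrayAppend(agents, "Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)") />
<cfset arrayAppend(agents, "Googlebot/2.1 (+http://www.google.com/bot.html)") />
<cfset arrayAppend(agents, "Googlebot/2.1 (+http://www.googlebot.com/bot.html)") />
<cfset arrayAppend(agents, "Googlebot/Test (+http://www.googlebot.com/bot.html)") />
<cfset arrayAppend(agents, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3") />

<!--- Print each agent with Yes/No depending on whether it contains "Googlebot" --->
<cfoutput>
<cfloop from="1" to="#arrayLen(agents)#" index="i">
 #agents[i]#: #yesNoFormat(findNoCase("Googlebot", agents[i]) gt 0)#<br />
</cfloop>
</cfoutput>
```

Because findNoCase ignores case and only looks for the substring, a new Googlebot version number or a changed URL in the parentheses would not break the match.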

I found what appears to be a VERY comprehensive list of spider user agent values (and other metadata) at this URL: http://www.user-agents.org/index.shtml . They also offer RSS and XML feeds if anybody wants to do something really cool with the data.

I also used the following spider simulation site in order to test my code changes: http://tools.summitmedia.co.uk/spider/
Their user agent value looks like the following: "K2-Summit (+http://tools.summitmedia.co.uk/spider/) leond@summitmedia.co.uk" , so I just added the value "K2-Summit" to my spiderlist variable in order to let them bypass the disclaimer acknowledgement.
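
With that value appended, the assignment in onApplicationStart would look something like the line below. Keep in mind that onApplicationStart only runs when the application starts, so the new value is picked up after the application restarts.

```cfml
<!--- Same list as before, with the spider simulator's "K2-Summit" added for testing --->
<cfset application.spiderlist = "Googlebot,Yahoo,msnbot,AOL,Ask Jeeves,Lycos,K2-Summit" />
```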

Though the site I based this post on doesn't require username and password authentication, I do believe it would be a simple matter to apply the same principle to a site secured in that manner: when a known spider arrives (one that YOU want crawling your site), simply issue it a visitor's pass in the form of manually set credentials and let it do its job!
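
I haven't built this for a login-protected site, but a rough sketch of the visitor's pass idea might look like the following. The `session.user` struct, its `username` and `role` keys, and the `login.cfm` page are all hypothetical; substitute whatever your own login code sets and checks. Since the user agent header is supplied by the client, the pass should only unlock content you would be comfortable showing to anyone who claims to be a spider.

```cfml
<cffunction name="onRequestStart" returntype="void" output="false">
 <cfargument name="targetPage" type="string" required="true" />
 <!--- Hypothetical: hand known spiders a limited, read-only login --->
 <cfif isSpider() and not structKeyExists(session, "user")>
  <cfset session.user = structNew() />
  <cfset session.user.username = "spider" />
  <cfset session.user.role = "readonly" />
 </cfif>
 <!--- Anyone without a login is sent to the (hypothetical) login page, except when already requesting it --->
 <cfif not structKeyExists(session, "user") and listLast(arguments.targetPage, "/") neq "login.cfm">
  <cflocation url="login.cfm" addtoken="no" />
 </cfif>
</cffunction>
```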
 
I am by no means a search engine guru, so if anybody out there knows a better way, sees any gaping, dangerous holes in my solution, or just has any suggestions or comments, please do share!

Doug out.

Posted by dougboude at 3:22 AM | 2 comments