13 October 2008
Sneaking Spiders Past Security

The Scenario
You've created a web site that is secured, but you want the search engine spiders to be able to crawl the content. In this case it's not secured in the sense of needing a username and password; instead, the visitor must first acknowledge some terms and conditions before they can access any other portion of the site. This initial button click seems to stop all web spiders in their tracks, so your site's content never gets added to the search engines' indexes.

I posted a question regarding this situation on HouseofFusion and did get one interesting answer, but it didn't really address my dilemma. The answer was that it is possible to configure Google Analytics to log in to your site in its efforts to index content. In investigating this further, though, I found that this is only for the purposes of serving ads contextually, not search engine results. Since this site isn't serving any ads and my goal is to help people find their way to the site's front door based on the content BEHIND the door, I needed another answer. Here's what I ended up doing:

In my Application.cfc, I have code in the onRequestStart method that checks whether the user has already acknowledged the disclaimer (by looking at a session variable that is set to 'true' when they do). If true, the original request is allowed to go through; if false, the visitor is redirected to the disclaimer. I then created an additional, private method in my Application.cfc called "isSpider" that checks cgi.http_user_agent against a list of known spider agents and returns either true or false. So, before I check my session variable's value, I first call the isSpider method. If the visitor IS a spider, I set the session variable to true before I do the redirection check against it. Here are the relevant methods:

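<cffunction name="onRequestStart" returntype="void" output="false">
 <cfargument name="targetPage" type="string" required="true" />
 <!--- Default the flag so a brand-new session does not throw on the check below --->
 <cfparam name="session.acknowledged" default="false" />
 <!--- Known spiders skip the disclaimer --->
 <cfif isSpider()>
  <cfset session.acknowledged = true />
 </cfif>
 <!--- Everyone else goes to the disclaimer, unless that is the page already being requested --->
 <cfif not session.acknowledged and listLast(arguments.targetPage, "/") neq "acknowledgeDisclaimer.cfm">
  <cflocation url="acknowledgeDisclaimer.cfm" addtoken="no" />
 </cfif>
</cffunction>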

<cffunction name="isSpider" access="private" returntype="boolean" output="false" hint="I check the user agent string for the occurrence of any of the known spider user agent values">
 <!--- Keep the loop variable local to this function --->
 <cfset var s = "" />
 <cfloop index="s" list="#application.spiderlist#">
  <cfif findNoCase(s, cgi.http_user_agent) gt 0>
   <cfreturn true />
  </cfif>
 </cfloop>
 <cfreturn false />
</cffunction>

In my onApplicationStart, I create the list of partial spider user agent values:

<cfset application.spiderlist = "Googlebot,Yahoo,msnbot,AOL,Ask Jeeves,Lycos" />

It is true that there are literally hundreds of other spiders running around out there. Rather than attempt to validate all possible indexers, I chose only the top six that show up in my site analytics as the ones most people find my other sites by. I also opted to simply check the user agent for any occurrence of a specific substring rather than match against the entire string, for efficiency's sake, since each search engine can have several different user agents (and those could change at any time!). For instance, Google has (to the best of my knowledge) the following User Agent values for its spiders:

  • Googlebot-Image/1.0 ( http://www.googlebot.com/bot.html)
  • Googlebot/2.1 ( http://www.google.com/bot.html)
  • Googlebot/2.1 ( http://www.googlebot.com/bot.html)
  • Googlebot/Test ( http://www.googlebot.com/bot.html)

Hence, my choice to simply search the user agent for the string "Googlebot" in order to determine if it was a Google spider or not.
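To see the substring check in action without waiting for a real spider, something like the following throwaway test page could help. This is only a sketch: the matchesSpider function is a hypothetical stand-in for isSpider that takes the user agent and the list as arguments, and the two browser user agents at the end are illustrative values that do not come from the post.

```cfml
<!--- Hypothetical test page; matchesSpider is NOT part of the original Application.cfc --->
<cffunction name="matchesSpider" returntype="boolean" output="false"
  hint="Same substring test as isSpider, but the user agent and list are passed in">
 <cfargument name="userAgent" type="string" required="true" />
 <cfargument name="spiderList" type="string" required="true" />
 <cfset var s = "" />
 <cfloop index="s" list="#arguments.spiderList#">
  <cfif findNoCase(s, arguments.userAgent) gt 0>
   <cfreturn true />
  </cfif>
 </cfloop>
 <cfreturn false />
</cffunction>

<cfset spiderList = "Googlebot,Yahoo,msnbot,AOL,Ask Jeeves,Lycos" />

<!--- The four Google values listed above, plus two illustrative browser values --->
<cfset sampleAgents = arrayNew(1) />
<cfset arrayAppend(sampleAgents, "Googlebot-Image/1.0 ( http://www.googlebot.com/bot.html)") />
<cfset arrayAppend(sampleAgents, "Googlebot/2.1 ( http://www.google.com/bot.html)") />
<cfset arrayAppend(sampleAgents, "Googlebot/2.1 ( http://www.googlebot.com/bot.html)") />
<cfset arrayAppend(sampleAgents, "Googlebot/Test ( http://www.googlebot.com/bot.html)") />
<cfset arrayAppend(sampleAgents, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3") />
<cfset arrayAppend(sampleAgents, "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.0; Windows NT 5.1)") />

<!--- Print each user agent next to the result of the substring test --->
<cfoutput>
<cfloop index="i" from="1" to="#arrayLen(sampleAgents)#">
 #htmlEditFormat(sampleAgents[i])#: #matchesSpider(sampleAgents[i], spiderList)#<br />
</cfloop>
</cfoutput>
```

The four Google values should all match on "Googlebot" and the Firefox value should not match anything. The last value is worth a look, though: it is an ordinary browser whose user agent happens to contain "AOL", so it would be treated as a spider too. Short tokens in the list are a trade-off between catching every variant of a spider and letting a few regular visitors through.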

I found what appears to be a VERY comprehensive list of spider user agent values (and other metadata) at this url: http://www.user-agents.org/index.shtml . They also offer RSS and XML feeds if anybody wants to do something really cool with the data.

I also used the following spider simulation site in order to test my code changes: http://tools.summitmedia.co.uk/spider/
Their user agent value looks like the following: "K2-Summit (+http://tools.summitmedia.co.uk/spider/) leond@summitmedia.co.uk" , so I just added the value "K2-Summit" to my spiderlist variable in order to let them bypass the disclaimer acknowledgement.
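Since onApplicationStart only runs when the application starts, one way to add a test tool to the list without a restart might be a one-off page along these lines. It is a minimal sketch that assumes application.spiderlist already exists as created above; the duplicate check is an addition for illustration.

```cfml
<!--- Add the simulator's token only if it is not already in the list --->
<cfif not listFindNoCase(application.spiderlist, "K2-Summit")>
 <cfset application.spiderlist = listAppend(application.spiderlist, "K2-Summit") />
</cfif>
```

Remember to add the same value to the string in onApplicationStart as well, or it will disappear the next time the application restarts.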

Though the site I based this post on doesn't require username and password authentication, I do believe it would be a simple matter to apply the same principle to a site secured in that manner: when a known spider arrives (one that YOU want crawling your site), simply issue it a visitor's pass in the form of manually set credentials and let it do its job!
 
I am by no means a search engine guru, so if anybody out there knows a better way, sees any gaping, dangerous holes in my solution, or just has any suggestions or comments, please do share!

Doug out.




Posted by dougboude at 3:22 AM | 2 comments

Re: Sneaking Spiders Past Security
comment from JD: "Hi Doug, Great post. Good for the example you gave of only needing to get a bot past terms of service and other minor things like that. But I just wanted to point out that it is not a good idea to put something like this in place on a site that requires a user/pass. It is quite a trivial thing for me to spoof the cgi.http_user_agent variable in my browser. It only requires a simple plugin in Firefox (Opera actually has the functionality built-in) and I can tell your site that my Firefox is really GoogleBot."
Posted by dougboude on October 13, 2008 at 12:04 PM

Re: Sneaking Spiders Past Security
Thanks for the input, JD; point taken. I might point out though that you could also check the spider's IP address and make sure it is within a valid range (info on the IP addresses can be found at the site cited in the post above). I suppose IP addresses can be spoofed as well, but you could always issue the visiting spider a "read only" pass, just in case it's not really a spider.
Posted by dougboude on October 21, 2008 at 6:49 AM
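
As a rough illustration of the IP check suggested above, here is one possible sketch. Note that it swaps in a different technique from a hard-coded IP range: a reverse DNS lookup followed by a forward lookup, which is the approach Google documents for verifying Googlebot. The function name isVerifiedGooglebot is hypothetical, it relies on the standard java.net.InetAddress class being reachable from ColdFusion, and it assumes genuine Googlebot host names end in googlebot.com or google.com.

```cfml
<cffunction name="isVerifiedGooglebot" access="private" returntype="boolean" output="false"
  hint="Hypothetical helper: confirms a Googlebot claim with a reverse and then a forward DNS lookup">
 <cfargument name="ipAddress" type="string" required="true" />
 <cfset var inet = createObject("java", "java.net.InetAddress") />
 <cfset var hostName = "" />
 <cfset var forward = "" />
 <cfset var i = 0 />
 <cftry>
  <!--- Reverse lookup: IP address to host name --->
  <cfset hostName = inet.getByName(arguments.ipAddress).getCanonicalHostName() />
  <!--- Assumption: genuine Googlebot hosts end in googlebot.com or google.com --->
  <cfif not reFindNoCase("\.(googlebot|google)\.com$", hostName)>
   <cfreturn false />
  </cfif>
  <!--- Forward lookup: the host name must resolve back to the same IP address --->
  <cfset forward = inet.getAllByName(hostName) />
  <cfloop index="i" from="1" to="#arrayLen(forward)#">
   <cfif forward[i].getHostAddress() eq arguments.ipAddress>
    <cfreturn true />
   </cfif>
  </cfloop>
  <cfcatch type="any">
   <!--- Any lookup failure is treated as "not verified" --->
  </cfcatch>
 </cftry>
 <cfreturn false />
</cffunction>
```

It could then be combined with the user agent test, for example `<cfif isSpider() and isVerifiedGooglebot(cgi.remote_addr)>`. DNS lookups are slow compared to a string search, so caching the result per IP address would probably be wise, and other search engines would each need their own host name pattern.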
