<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: What is the best way to deal with Spiders/Bots/Crawlers?</title>
	<atom:link href="http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=what-is-the-best-way-to-deal-with-spidersbotscrawlers</link>
	<description>Anuj @ Flex, ColdFusion and other RIA stuff....</description>
	<lastBuildDate>Fri, 03 Sep 2010 07:52:52 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
	<item>
		<title>By: secure vpn</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-5262</link>
		<dc:creator>secure vpn</dc:creator>
		<pubDate>Fri, 27 Aug 2010 14:57:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-5262</guid>
		<description>Usually you should know if you&#039;re connecting to a VPN, because you have to specifically set up a VPN. It&#039;s not something that happens when you go to a website; it&#039;s something you configure intentionally. A secure HTTPS connection is not a VPN connection.</description>
		<content:encoded><![CDATA[<p>Usually you should know if you&#8217;re connecting to a VPN, because you have to specifically set up a VPN. It&#8217;s not something that happens when you go to a website; it&#8217;s something you configure intentionally. A secure HTTPS connection is not a VPN connection.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: erik</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3663</link>
		<dc:creator>erik</dc:creator>
		<pubDate>Mon, 03 May 2010 23:45:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3663</guid>
		<description>Does someone have some good code examples of this to share, or maybe a .cfc? It would be greatly appreciated.

Erik</description>
		<content:encoded><![CDATA[<p>Does someone have some good code examples of this to share, or maybe a .cfc? It would be greatly appreciated.</p>
<p>Erik</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: car magnets</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3644</link>
		<dc:creator>car magnets</dc:creator>
		<pubDate>Mon, 05 Apr 2010 02:48:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3644</guid>
		<description>I might keep this handy for clients as well as for myself. Have a great Easter!</description>
		<content:encoded><![CDATA[<p>I might keep this handy for clients as well as for myself. Have a great Easter!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anuj Gakhar</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3565</link>
		<dc:creator>Anuj Gakhar</dc:creator>
		<pubDate>Mon, 08 Feb 2010 08:03:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3565</guid>
		<description>Michael, since you have something like this ready already, would you mind sharing it here?</description>
		<content:encoded><![CDATA[<p>Michael, since you have something like this ready already, would you mind sharing it here?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Erik</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3563</link>
		<dc:creator>Erik</dc:creator>
		<pubDate>Wed, 03 Feb 2010 18:52:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3563</guid>
		<description>Michael, Anuj...

Could someone post example code that does what Michael suggests in an App.cfc format? Thanks!</description>
		<content:encoded><![CDATA[<p>Michael, Anuj&#8230;</p>
<p>Could someone post example code that does what Michael suggests in an App.cfc format? Thanks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anuj Gakhar</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3551</link>
		<dc:creator>Anuj Gakhar</dc:creator>
		<pubDate>Fri, 29 Jan 2010 10:00:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3551</guid>
		<description>Hi Michael, thanks for the explanation. It does appear to be an interesting technique, one that could solve a lot of problems around this &quot;bot&quot; issue. Let me know when you update your blog post on the subject. :)</description>
		<content:encoded><![CDATA[<p>Hi Michael, thanks for the explanation. It does appear to be an interesting technique, one that could solve a lot of problems around this &#8220;bot&#8221; issue. Let me know when you update your blog post on the subject. <img src='http://www.anujgakhar.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Dinowitz</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3549</link>
		<dc:creator>Michael Dinowitz</dc:creator>
		<pubDate>Fri, 29 Jan 2010 04:17:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3549</guid>
		<description>@ziggy
I create a structure which will use the visitor&#039;s IP address as a key and the number of times they&#039;ve been to the site that second as the value for that key. If the value is greater than 3, I then abort that request. I clear out the structure in the next second. A bit of memory hashing, but not anything major in the least.

@Anuj
Yes, every bot has a cfid/cftoken of 1 so they all have the same session. This avoids the problem of each bot getting a new session, even if the bot&#039;s session timed out after 2 seconds (my previous method). Every time a bot got its own session, it incremented the cfid counter and forced CF to generate a cftoken for it. Again, not any real performance hit, but why have the work done if not needed?

It is rare for a legitimate bot to visit a page more than once or twice a second. I&#039;ve found that certain firewall products will spider a site with a dozen or more requests a second. This can be very detrimental as the same spider has some really strange connection delay which holds the connection open for a while rather than getting the page results immediately. So the obvious result is a dozen CF connections, all hung waiting to deliver data and a few dozen more being added to the queue every second or so. Can you say crash? I can. :(

I&#039;ve got to update my blog post on this sometime.</description>
		<content:encoded><![CDATA[<p>@ziggy<br />
I create a structure which will use the visitor&#8217;s IP address as a key and the number of times they&#8217;ve been to the site that second as the value for that key. If the value is greater than 3, I then abort that request. I clear out the structure in the next second. A bit of memory hashing, but not anything major in the least.</p>
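<p>A minimal sketch of the per-IP, per-second counter described above, written in Python rather than CFML so that it stays self-contained. The class name <code>PerSecondLimiter</code>, the <code>allow</code> method and the injectable <code>clock</code> are assumptions made for illustration; this is not the code running on any site mentioned here.</p>

```python
import time


class PerSecondLimiter:
    """Counts requests per IP inside the current one-second window."""

    def __init__(self, max_hits=3, clock=time.time):
        self.max_hits = max_hits  # threshold from the comment: more than 3 is aborted
        self.clock = clock        # injectable so the window can be tested
        self.window = None        # the whole second the counts belong to
        self.hits = {}            # IP address mapped to hits in this second

    def allow(self, ip):
        now = int(self.clock())
        if now != self.window:
            # A new second has started: clear out the previous counts.
            self.window = now
            self.hits = {}
        self.hits[ip] = self.hits.get(ip, 0) + 1
        # False means the caller should abort this request.
        return self.hits[ip] <= self.max_hits


limiter = PerSecondLimiter()
for attempt in range(1, 6):
    print(attempt, limiter.allow("203.0.113.7"))
```

<p>In a ColdFusion application the same idea would probably live in onRequestStart, with the structure kept in the application scope. The sketch does no locking, so a multi-threaded server would need to guard the shared structure.</p>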
<p>@Anuj<br />
Yes, every bot has a cfid/cftoken of 1 so they all have the same session. This avoids the problem of each bot getting a new session, even if the bot&#8217;s session timed out after 2 seconds (my previous method). Every time a bot got its own session, it incremented the cfid counter and forced CF to generate a cftoken for it. Again, not any real performance hit, but why have the work done if not needed?</p>
<p>It is rare for a legitimate bot to visit a page more than once or twice a second. I&#8217;ve found that certain firewall products will spider a site with a dozen or more requests a second. This can be very detrimental as the same spider has some really strange connection delay which holds the connection open for a while rather than getting the page results immediately. So the obvious result is a dozen CF connections, all hung waiting to deliver data and a few dozen more being added to the queue every second or so. Can you say crash? I can. <img src='http://www.anujgakhar.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </p>
<p>I&#8217;ve got to update my blog post on this sometime.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robert Zehnder</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3546</link>
		<dc:creator>Robert Zehnder</dc:creator>
		<pubDate>Thu, 28 Jan 2010 15:47:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3546</guid>
		<description>I have a very basic list of bot user agents that I use in WhosOnCFC.  If you would like a starting point you can find the current active list at http://www.kisdigital.com/botlist.xml. I do a findNoCase against the user agent with the list and it does a decent job of finding the bots that advertise themselves.</description>
		<content:encoded><![CDATA[<p>I have a very basic list of bot user agents that I use in WhosOnCFC.  If you would like a starting point you can find the current active list at <a href="http://www.kisdigital.com/botlist.xml" rel="nofollow">http://www.kisdigital.com/botlist.xml</a>. I do a findNoCase against the user agent with the list and it does a decent job of finding the bots that advertise themselves.</p>
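<p>A rough sketch of that kind of case-insensitive user-agent check, in Python rather than CFML. The three signatures are placeholder samples, not the contents of botlist.xml, and the function name <code>is_bot</code> is an assumption made for illustration.</p>

```python
# Placeholder sample entries (assumed); a real list would be loaded from a file.
BOT_SIGNATURES = ["googlebot", "slurp", "msnbot"]


def is_bot(user_agent, signatures=BOT_SIGNATURES):
    """Case-insensitive substring match, similar in spirit to findNoCase."""
    ua = (user_agent or "").lower()
    return any(sig.lower() in ua for sig in signatures)


print(is_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(is_bot("Mozilla/5.0 (Windows NT 6.1) Firefox/3.6"))
```

<p>As noted above, this only catches bots that advertise themselves; a crawler that sends a browser user agent passes straight through.</p>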
]]></content:encoded>
	</item>
	<item>
		<title>By: Anuj Gakhar</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3545</link>
		<dc:creator>Anuj Gakhar</dc:creator>
		<pubDate>Thu, 28 Jan 2010 13:24:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3545</guid>
		<description>That sounds pretty promising, Michael. So you are saying every bot shares the same session no matter which bot it is: Googlebot, Yahoo Slurp, etc. But normally bots visit tens of pages within a second - so you don&#039;t let them visit more than 3 (or whatever the threshold is) pages per second?</description>
		<content:encoded><![CDATA[<p>That sounds pretty promising, Michael. So you are saying every bot shares the same session no matter which bot it is: Googlebot, Yahoo Slurp, etc. But normally bots visit tens of pages within a second &#8211; so you don&#8217;t let them visit more than 3 (or whatever the threshold is) pages per second?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ziggy</title>
		<link>http://www.anujgakhar.com/2010/01/26/what-is-the-best-way-to-deal-with-spidersbotscrawlers/comment-page-1/#comment-3544</link>
		<dc:creator>ziggy</dc:creator>
		<pubDate>Thu, 28 Jan 2010 05:08:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.anujgakhar.com/?p=645#comment-3544</guid>
		<description>How do you detect how many connections they have?</description>
		<content:encoded><![CDATA[<p>How do you detect how many connections they have?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
