What is the best way to deal with Spiders/Bots/Crawlers?

On January 26, 2010, in ColdFusion, by Anuj Gakhar

Search engine spiders (e.g. Googlebot, Yahoo Slurp) are a good thing: if they don't spider your website, you will probably not appear in search results. At the same time, they create a lot of unnecessary management work on the developer side. Here are a few of the things I have noticed:

1) They create a lot of unnecessary active sessions on the server if you have session management turned on (which most sites do).
2) There is a growing number of bots, and it's really hard to tell which ones to block and which ones to allow.
3) You do want the major search engines to spider you, but you don’t necessarily want every search engine to spider you, and this list becomes hard for a developer to maintain.

Apart from the above, another major problem is when someone intentionally tries to download or crawl your website using a custom-written script or a site downloader. In this case, the browser/referrer is usually spoofed to look like something it is not. These crawlers most probably do not even respect the robots.txt file.

My question is: apart from checking the name of the bot and restricting its session to a few seconds, what other checks do people put in place to solve these problems?

13 Responses to What is the best way to deal with Spiders/Bots/Crawlers?

Robert Zehnder says:

January 26, 2010 at 2:37 pm

I will generally check to see how many connections I have from one given IP address. After about 10 hits I will assume they are a bot and adjust their session timeout.
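
For readers asking what this looks like in code, here is a minimal sketch of the hit-counting idea, written in Python only to keep it language-neutral. The class name, the two timeout values, and the fact that the counter never resets are all assumptions for illustration; the comment above only gives the "about 10 hits" figure. In ColdFusion the same logic would most likely sit in the onRequestStart method of Application.cfc.

```python
from collections import Counter

BOT_HIT_THRESHOLD = 10             # "about 10 hits", from the comment above
NORMAL_TIMEOUT_SECONDS = 20 * 60   # assumed value for ordinary visitors
BOT_TIMEOUT_SECONDS = 5            # assumed short timeout for suspected bots


class HitTracker:
    """Counts requests per IP address and picks a session timeout."""

    def __init__(self, threshold=BOT_HIT_THRESHOLD):
        self.threshold = threshold
        self.hits = Counter()

    def session_timeout_for(self, ip):
        """Record one hit from `ip` and return the timeout to apply."""
        self.hits[ip] += 1
        if self.hits[ip] > self.threshold:
            # Too many hits from one address: treat it as a bot.
            return BOT_TIMEOUT_SECONDS
        return NORMAL_TIMEOUT_SECONDS
```

A real deployment would probably count hits inside a time window and expire old entries, since a busy human visitor (or an office behind one IP address) will also pass 10 hits sooner or later.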

Anuj Gakhar says:

January 26, 2010 at 5:14 pm

Yep, that's a good idea. I normally just check for the name of the bot in the Referrer, but that doesn’t always work.

Michael Dinowitz says:

January 27, 2010 at 11:22 pm

I do a few things, two of which are key. The first is to detect how many connections an IP is making per second, and if that number is over a certain threshold (3), I block all other requests for that second. The next thing I do is to detect whether the visitor is a bot based on their http_user_agent, and if it’s a bot, assign it a cfid and a cftoken of 1. This means that every bot has the same session and there is no session thrashing or pseudo-memory leak.

- Anuj Gakhar says:
  
  January 28, 2010 at 2:24 pm
  
  That sounds pretty promising, Michael. So you are saying every bot shares the same session no matter which bot it is, Googlebot or Yahoo Slurp etc. But bots normally visit tens of pages within a second, so you don't let them visit more than 3 (or whatever the threshold is) pages per second?
  
ziggy says:

January 28, 2010 at 6:08 am

How do you detect how many connections they have?

Robert Zehnder says:

January 28, 2010 at 4:47 pm

I have a very basic list of bot user agents that I use in WhosOnCFC. If you would like a starting point you can find the current active list at http://www.kisdigital.com/botlist.xml. I do a findNoCase against the user agent with the list and it does a decent job of finding the bots that advertise themselves.
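
A case-insensitive substring match of this kind can be sketched as follows. This is Python rather than CFML, and the three signatures are sample entries chosen for illustration, not the contents of the list linked above; the function name is also hypothetical.

```python
# Sample signatures only; a real list would be loaded from a maintained source.
BOT_SIGNATURES = ["googlebot", "slurp", "msnbot"]


def is_known_bot(user_agent, signatures=BOT_SIGNATURES):
    """Return True if any signature occurs in the user agent, ignoring case."""
    ua = (user_agent or "").lower()   # a missing header counts as "not a known bot"
    return any(sig.lower() in ua for sig in signatures)
```

As the comment says, this only catches bots that advertise themselves. A crawler that sends a browser's user agent string passes straight through, so it works best alongside a rate check.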

Michael Dinowitz says:

January 29, 2010 at 5:17 am

@ziggy
I create a structure which will use the visitor’s IP address as a key and the number of times they’ve been to the site that second as the value for that key. If the value is greater than 3, I then abort that request. I clear out the structure in the next second. A bit of memory hashing, but not anything major in the least.
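
The structure described here might look roughly like the sketch below. It is Python, not the actual ColdFusion code, and the class name, the injectable clock, and the choice to reset the whole structure when the second changes are assumptions made for illustration.

```python
import time


class PerSecondLimiter:
    """Allows at most `max_per_second` requests per IP in any one clock second."""

    def __init__(self, max_per_second=3, clock=time.time):
        self.max_per_second = max_per_second
        self.clock = clock            # injectable so the limiter can be tested
        self.current_second = None
        self.counts = {}              # IP address -> hits during current_second

    def allow(self, ip):
        """Record a request from `ip`; return False if it should be aborted."""
        now = int(self.clock())
        if now != self.current_second:
            # A new second has started: clear out the old counts.
            self.counts = {}
            self.current_second = now
        self.counts[ip] = self.counts.get(ip, 0) + 1
        return self.counts[ip] <= self.max_per_second
```

With the default threshold, the fourth and later requests from one address inside the same second are refused, and the address is allowed again once the next second begins. On a multi-threaded server the counter update would need a lock (in ColdFusion, presumably cflock around the shared structure).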

@Anuj
Yes, every bot has a cfid/cftoken of 1, so they all have the same session. This avoids the problem of each bot getting a new session, even if the bot’s session timed out after 2 seconds (my previous method). Every time a bot gets its own session, it increments the cfid counter and forces CF to generate a cftoken for it. Again, not any real performance hit, but why have the work done if it's not needed?
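
To show why a shared identifier keeps the session store from growing, here is a small Python model of the idea. It is not ColdFusion's session mechanism: the class, the `is_bot` callable, and the simplification that every non-bot request gets a fresh id (a real application would reuse the visitor's id from a cookie) are all assumptions for illustration.

```python
import itertools

SHARED_BOT_SESSION_ID = 1   # mirrors the "cfid/cftoken of 1" idea above


class SessionAssigner:
    """Hands every bot the same session and each other visitor a new one."""

    def __init__(self, is_bot):
        self.is_bot = is_bot                  # callable: user agent -> bool
        self._next_id = itertools.count(2)    # ids after the shared bot id
        self.sessions = {}                    # session id -> session data

    def session_for(self, user_agent):
        """Return (session_id, session_data) for a request."""
        if self.is_bot(user_agent):
            sid = SHARED_BOT_SESSION_ID       # no new id, no new session
        else:
            sid = next(self._next_id)
        return sid, self.sessions.setdefault(sid, {})
```

However many bot requests arrive, the store holds a single bot session, which is the "no session thrashing" effect described above. The trade-off is that nothing bot-specific can be kept in that shared session.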

It is rare for a legitimate bot to visit a page more than once or twice a second. I’ve found that certain firewall products will spider a site with a dozen or more requests a second. This can be very detrimental as the same spider has some really strange connection delay which holds the connection open for a while rather than getting the page results immediately. So the obvious result is a dozen CF connections, all hung waiting to deliver data and a few dozen more being added to the queue every second or so. Can you say crash? I can. 🙁

I’ve got to update my blog post on this sometime.

Anuj Gakhar says:

January 29, 2010 at 11:00 am

Hi Michael, thanks for the explanation. It does appear to be an interesting technique, one that could solve a lot of problems around this “bot” issue. Let me know when you update your blog post on the subject. 🙂

Erik says:

February 3, 2010 at 7:52 pm

Michael, Anuj…

Could someone post example code that does what Michael suggests in an App.cfc format? Thanks!

- Anuj Gakhar says:
  
  February 8, 2010 at 9:03 am
  
  Michael, since you have something like this ready already, you mind sharing it here?
  
car magnets says:

April 5, 2010 at 3:48 am

I might keep this handy for clients as well as for myself. Have a great Easter!

erik says:

May 4, 2010 at 12:45 am

Does someone have some good code examples of this to share, or maybe a .cfc? It would be greatly appreciated.

Erik

secure vpn says:

August 27, 2010 at 3:57 pm

Usually you should know if you’re connecting to a VPN, because you have to specifically set up a VPN. It’s not something that happens when you go to a website; it’s something you configure intentionally. A secure https connection is not a VPN connection.

