It’s such a pain: my log files show that 80% of all traffic is generated by robots. This is such a waste of energy.
Even worse, it slows down my site and makes me lose visitors. Unfortunately, most of these bots ignore the robots.txt in the server root, so the only way is to block them at the server level. My current .htaccess is taken from stackoverflow and is far from exhaustive:
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
SetEnvIfNoCase Referer "^$" bad_user
SetEnvIfNoCase User-Agent "^GbPlugin" bad_user
SetEnvIfNoCase User-Agent "^Wget" bad_user
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_user
SetEnvIfNoCase User-Agent "^EmailWolf" bad_user
SetEnvIfNoCase User-Agent "^libwww-perl" bad_user
Deny from env=bad_user
Unfortunately, even that did not keep unwanted crawlers out, much like spam bot lists, IP ranges, etc. But there are other solutions: bots can be recognized by their behavioral pattern, because they try to fetch prohibited and non-existent pages. Let’s dive into the first option.
One of my favorite security measures here at Perishable Press is the site’s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots immediately are denied access to your site.
To make things a bit more attractive, I modified the Perishable approach by generating dynamic blackholes with constantly changing rules (in robots.txt) and redirects (in .htaccess).
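The trap idea can be sketched in a few lines. This is only an illustration in Python, not the Perishable Press code and not my own setup: the function names, the `/trap-…/` naming scheme and the in-memory blacklist are all assumptions made for the example, and the WHOIS lookup and the .htaccess redirects are left out.

```python
import secrets


def new_trap_path() -> str:
    """Return a fresh, hard-to-guess trap directory (hypothetical naming scheme)."""
    return f"/trap-{secrets.token_hex(4)}/"


def robots_txt(trap_path: str) -> str:
    """robots.txt body that forbids the trap directory to every crawler."""
    return f"User-agent: *\nDisallow: {trap_path}\n"


def hidden_link(trap_path: str) -> str:
    """Link that visitors do not see but that crawlers find in the markup."""
    return f'<a href="{trap_path}" rel="nofollow" style="display:none">&nbsp;</a>'


def handle_request(path: str, ip: str, trap_path: str, blacklist: set) -> int:
    """Return an HTTP status; any IP that enters the trap is blacklisted."""
    if ip in blacklist:
        return 403  # already caught earlier
    if path.startswith(trap_path):
        blacklist.add(ip)  # ignored robots.txt and followed the hidden link
        return 403
    return 200
```

Calling `new_trap_path()` on a schedule and regenerating robots.txt and the hidden link from its result gives the "dynamic" part: a bot that has learned one trap URL still walks into the next one.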
My second approach is to monitor clicks to non-existent pages and put these IPs on a blacklist as well.
ErrorDocument 400 /404.php
ErrorDocument 401 /404.php
# 403.1 and 403.14 are IIS sub-status codes; Apache expects a plain three-digit code
ErrorDocument 403 /404.php
ErrorDocument 404 /404.php
ErrorDocument 500 /404.php
The 404 page does a quick database lookup, and whenever the limit is reached the IP is blacklisted. Only these last two approaches (blackhole and 404 blacklist) finally reduced my traffic.
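The counting logic behind such an error page might look roughly like this. It is a Python sketch rather than the actual 404.php: the SQLite tables, the function names and the limit of 5 are assumptions for illustration, since the real schema and threshold are not given here.

```python
import sqlite3

LIMIT = 5  # hypothetical threshold; the real limit is not stated in this post


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the two (assumed) tables if missing."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS misses (ip TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    db.execute("CREATE TABLE IF NOT EXISTS blacklist (ip TEXT PRIMARY KEY)")
    return db


def record_miss(db: sqlite3.Connection, ip: str, limit: int = LIMIT) -> bool:
    """Count one error-page hit for ip; return True once the IP is blacklisted."""
    db.execute("INSERT OR IGNORE INTO misses (ip, hits) VALUES (?, 0)", (ip,))
    db.execute("UPDATE misses SET hits = hits + 1 WHERE ip = ?", (ip,))
    (hits,) = db.execute("SELECT hits FROM misses WHERE ip = ?", (ip,)).fetchone()
    if hits >= limit:
        db.execute("INSERT OR IGNORE INTO blacklist (ip) VALUES (?)", (ip,))
    db.commit()
    return hits >= limit
```

A low limit catches scanners quickly but also risks locking out a visitor who follows a few dead links, so the threshold is worth tuning against your own logs.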
Note added in proof: As most spiders come from many different IP addresses, I am now blocking everything from top level 172.168.255.xxx.
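Checking an address against such a range is straightforward with Python's standard `ipaddress` module. The sketch below assumes that 172.168.255.xxx means the /24 network 172.168.255.0 to 172.168.255.255; the constant and function names are made up for the example.

```python
import ipaddress

# Assumed reading of "172.168.255.xxx": the whole /24 network
BLOCKED_RANGE = ipaddress.ip_network("172.168.255.0/24")


def is_blocked(ip: str) -> bool:
    """Return True if ip lies inside the blocked range."""
    return ipaddress.ip_address(ip) in BLOCKED_RANGE
```

Blocking a whole range is a blunt instrument: it also shuts out any legitimate visitor who happens to share that network.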