{"id":8009,"date":"2016-01-22T09:20:55","date_gmt":"2016-01-22T08:20:55","guid":{"rendered":"http:\/\/www.wjst.de\/blog\/?p=8009"},"modified":"2016-01-22T09:28:47","modified_gmt":"2016-01-22T08:28:47","slug":"up-to-90-of-webserver-traffic-is-now-by-robots","status":"publish","type":"post","link":"https:\/\/www.wjst.de\/blog\/sciencesurf\/2016\/01\/up-to-90-of-webserver-traffic-is-now-by-robots\/","title":{"rendered":"Up to 80% of webserver traffic is now by robots"},"content":{"rendered":"<p>It&#8217;s such a pain &#8211; my log files show that 80% of all traffic is being generated by robots. This is such a waste of energy.<br \/>\nEven worse, it slows down my site and makes me lose visitors. Unfortunately most of these bots ignore the robots.txt in the server root, so the only way is to block them at the server. My current .htaccess is taken from <a href=\"http:\/\/stackoverflow.com\/questions\/7372551\/block-by-useragent-or-empty-referer\">stackoverflow<\/a> but is far from exhaustive<\/p>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\r\nOptions +FollowSymlinks  \r\nRewriteEngine On  \r\nRewriteBase \/  \r\nSetEnvIfNoCase Referer &quot;^$&quot; bad_user\r\nSetEnvIfNoCase User-Agent &quot;^GbPlugin&quot; bad_user\r\nSetEnvIfNoCase User-Agent &quot;^Wget&quot; bad_user\r\nSetEnvIfNoCase User-Agent &quot;^EmailSiphon&quot; bad_user\r\nSetEnvIfNoCase User-Agent &quot;^EmailWolf&quot; bad_user\r\nSetEnvIfNoCase User-Agent &quot;^libwww-perl&quot; bad_user\r\nDeny from env=bad_user\r\n<\/pre>\n<p>Unfortunately even that did not prevent unwanted crawlers &#8211; and neither did spam bot lists, IP ranges, etc. But there are more solutions &#8211; bots can be recognized by their behavioral pattern: they try to get <a href=\"https:\/\/perishablepress.com\/blackhole-bad-bots\/\">prohibited<\/a> and non-existent pages. 
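<p>For Apache 2.4, where the old <code>Deny<\/code> directive is deprecated, an equivalent of the env-based ruleset above should look roughly like this &#8211; a sketch only, assuming mod_setenvif and mod_authz_core are enabled:<\/p>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\r\nSetEnvIfNoCase User-Agent &quot;^Wget&quot; bad_user\r\n&lt;RequireAll&gt;\r\n    Require all granted\r\n    Require not env bad_user\r\n&lt;\/RequireAll&gt;\r\n<\/pre>\n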
Let&#8217;s dive into the first option<\/p>\n<blockquote><p>One of my favorite security measures here at Perishable Press is the site\u2019s virtual Blackhole trap for bad bots. The concept is simple: include a hidden link to a robots.txt-forbidden directory somewhere on your pages. Bots that ignore or disobey your robots rules will crawl the link and fall into the trap, which then performs a WHOIS Lookup and records the event in the blackhole data file. Once added to the blacklist data file, bad bots immediately are denied access to your site.<\/p><\/blockquote>\n<p>To make things a bit more attractive, I modified the Perishable approach by generating dynamic blackholes with permanently changing rules (in robots.txt) and redirects (in .htaccess).<\/p>\n<p>My second approach is to monitor clicks to non-existent pages and put these IPs on a blacklist as well.<\/p>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\r\nErrorDocument 400 \/404.php\r\nErrorDocument 401 \/404.php\r\nErrorDocument 403 \/404.php\r\nErrorDocument 404 \/404.php\r\nErrorDocument 500 \/404.php\r\n<\/pre>\n<p>The 404 page does a quick database lookup, and whenever the limit is reached the IP is blacklisted. Only the last two approaches (blackhole and 404 blacklist) finally reduced my traffic.<\/p>\n<p><span style=\"text-decoration: underline;\">Note added in proof:<\/span> As most spiders come with\u00a0different IP addresses, I am now blocking everything from the top-level range\u00a0172.168.255.xxx<\/p>\n<div class=\"bottom-note\">\n  <span class=\"mod1\">CC-BY-NC Science Surf, accessed 07.04.2026<\/span>\n <\/div>","protected":false},"excerpt":{"rendered":"<p>It&#8217;s such a pain &#8211; my log files show that 80% of all traffic is being generated by robots. This is such a waste of energy. Even worse, it slows down my site and makes me lose visitors. 
Unfortunately most of these bots ignore the robots.txt in the server root, so the only way is &hellip; <a href=\"https:\/\/www.wjst.de\/blog\/sciencesurf\/2016\/01\/up-to-90-of-webserver-traffic-is-now-by-robots\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Up to 80% of webserver traffic is now by robots<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-8009","post","type-post","status-publish","format-standard","hentry","category-computer-software"],"_links":{"self":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/8009","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/comments?post=8009"}],"version-history":[{"count":6,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/8009\/revisions"}],"predecessor-version":[{"id":8026,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/8009\/revisions\/8026"}],"wp:attachment":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/media?parent=8009"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/categories?post=8009"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/tags?post=8009"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}