
How to block 'bad bots' and 'unwanted web crawlers'?
Written: 2019-01-14 22:21:54 Last update: 2019-08-29 22:32:59

Search engine companies such as Google, Microsoft (Bing) and Yahoo run crawlers (robots) that read, parse and index our web pages for their search engines. These crawlers are welcome: we want them to come, read our content and store it in their systems so that visitors can find our website through their search results. Their crawlers visit regularly, with a frequency that depends on our website's content; they return often to pick up new content and to determine our website's and pages' ranking.

Unfortunately, there are also many 'bad bots' and 'unwanted web crawlers' on the net: programs or scripts written to read and harvest our web pages for bad purposes. They have different goals, such as:

  • To steal web content (reading without permission), for example by requesting many random URL paths and parameters to guess at hidden content.
  • To break security, for example by brute-forcing username/password logins with many repeated attempts.
  • To attack with DDoS (Distributed Denial of Service): requesting so many URLs or web APIs that our web server wastes its computing resources (heavy CPU and RAM use) processing them unnecessarily and can no longer serve requests from real human visitors. This kind of attack is normally launched simultaneously from many different IP addresses, because a single IP address would be easy to spot in the web server's visitor log and block.
  • etc.

Web server administrators normally implement one or more of these solutions:

  • A firewall to block bad bots; most eCommerce websites use this option. It is the easiest solution because it does not require any source code changes.
  • Reduce (as much as possible) the dynamic content created by JSP/PHP/etc. (header, footer, menu, navigation, etc.) and replace it with static files, which may be regenerated regularly. This is the hardest solution to implement but the best one, because it provides two benefits:
    • Overall website performance increases, because the web server mostly serves static content and the remaining dynamic content is provided via Ajax calls.
    • Bad bots will only read static resources, which do not consume much of the server's CPU and RAM, unless the volume reaches DDoS levels.
  • On a shared hosting plan the web server is usually Apache, whose per-directory .htaccess file can be used to block bad bots.
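Before deciding which bots to block, it helps to see which user-agents actually hit the server most often. A statistics tool like awstats shows this, but a quick sketch with awk over an Apache "combined" format access log works too (the log path below is an assumption; shared hosts put it in different places):

```shell
# top_agents: print the most frequent user-agents in an Apache
# "combined" format access log. The user-agent is the last
# double-quoted field, so splitting on '"' puts it in field 6.
top_agents() {
  awk -F'"' '{ ua[$6]++ } END { for (u in ua) printf "%7d  %s\n", ua[u], u }' "$1" \
    | sort -rn | head -20
}

# usage (log path is an assumption, adjust for your host):
# top_agents /var/log/apache2/access.log
```

User-agents that appear near the top with suspiciously high counts are the candidates for the blocklist below.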

One of my websites, hosted on shared hosting, had a bad-bot problem, and I had to decide whether to write PHP code to detect and block their IP addresses or to use "blackhole" (https://perishablepress.com/blackhole-bad-bots/). After some consideration I felt that blackhole was overkill for the problem, and I did not want to maintain its 'blackhole.dat' file. After further research, I updated Apache's '.htaccess' file with the following additional content:

Options All -Indexes
RewriteEngine on

# flag any user-agent which starts with text 'bot'
SetEnvIfNoCase User-Agent "^bot" bad_bot

# see awstats (or another web statistics tool) for the list of bad/unwanted bots.
# this list is just a sample of what I use; it must be adapted for each website.
# the format '.*blabla.*' matches 'blabla' anywhere inside the User-Agent string.

SetEnvIfNoCase User-Agent .*SemrushBot.* bad_bot
SetEnvIfNoCase User-Agent .*MJ12bot.* bad_bot
SetEnvIfNoCase User-Agent .*BLEXBot.* bad_bot
SetEnvIfNoCase User-Agent .*AhrefsBot.* bad_bot
SetEnvIfNoCase User-Agent .*baidu.* bad_bot
SetEnvIfNoCase User-Agent .*crawler.* bad_bot
SetEnvIfNoCase User-Agent .*yandex.* bad_bot

# last, catch empty/blank user-agents, which are normally used by personal crawlers or scrapers (thieves)
SetEnvIfNoCase User-Agent ^$ bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
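The Order/Allow/Deny directives above are Apache 2.2 syntax; Apache 2.4 still accepts them via mod_access_compat, which shared hosts commonly enable. If that module is not available, a sketch of the equivalent block in native 2.4 syntax uses Require instead:

```apache
<Limit GET POST HEAD>
  <RequireAll>
    # allow everyone, except requests flagged bad_bot by SetEnvIfNoCase
    Require all granted
    Require not env bad_bot
  </RequireAll>
</Limit>
```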

The simple solution above is just an example that fits my case; it may not be suitable for other websites. It only blocks incoming crawlers/robots whose 'User-Agent' matches the rules above. At the time, I blocked them because I saw too many visits from them and could not tell whether those visits were legitimately from the official crawlers or from bad scripts using a custom 'User-Agent' to imitate a search engine.

After the .htaccess file is updated, verifying whether the solution works is easy; I personally use curl in a terminal to call the website, such as:

curl -A "myblabla 000YANdex0000 bla bla" https://quick.work

This curl call should be blocked; on my shared hosting it returns a 403 error response. Below is an example of the output from my web server:
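The matching logic can also be sanity-checked locally before calling the live server. The following sketch approximates the case-insensitive SetEnvIfNoCase rules above with grep -Ei (the pattern list is copied from the .htaccess; is_bad_bot is just an illustrative helper name):

```shell
# Approximate the .htaccess rules: '^bot' matches user-agents starting
# with 'bot', the middle alternatives match anywhere in the string,
# and '^$' catches empty/blank user-agents.
is_bad_bot() {
  printf '%s\n' "$1" \
    | grep -Eiq '^bot|SemrushBot|MJ12bot|BLEXBot|AhrefsBot|baidu|crawler|yandex|^$'
}

is_bad_bot "myblabla 000YANdex0000 bla bla" && echo blocked   # prints "blocked"
```

Any user-agent that makes is_bad_bot succeed here should also receive a 403 from the server once the .htaccess rules are in place.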

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /
on this server.<br />
</p>
<p>Additionally, a 403 Forbidden
error was encountered while trying to use an ErrorDocument to handle the request.</p>
</body></html>

Warning: pay careful attention when blocking web crawlers. If we block all of them permanently, our website may no longer be visible in search engines such as Google, Bing, Yahoo, etc.

The Quick.Work website also had a small problem with these bad bots and crawlers, and that problem is one of the reasons I decided to convert from MPA (server-side rendering) to SPA (Single Page Application, client-side rendering). My SPA uses only static HTML, CSS and JS files, so this solution may not be needed for an SPA. All 'good' crawlers such as Google, Bing, etc. will read sitemap.xml to understand the website. I may lose a little SEO, but I gain better server performance, and that trade-off is worth it. Hopefully this short article can help others with a similar problem ... ^.^

Update 20190830: I converted quick.work back from SPA to MPA. After checking my web statistics for a few months, I found that going SPA hurt this website's ranking, because many search engines cannot read dynamic content rendered client-side. It is 2019 and Google's crawlers still have difficulty properly indexing all pages of a full SPA. It was very sad to waste so much time and energy converting to SPA and then back to MPA.