Fail2ban and Unwanted Robots, Crawlers and Spiders

Thursday, 29 January, 2015

Besides server attacks, there is other unwanted activity you may encounter. For example, you may not want any bots other than Google’s to crawl your site. Robots.txt should advise crawlers which parts of a website to crawl. There are different types of bots crawling the web.
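For example, a robots.txt that invites only Googlebot and turns everyone else away could look roughly like this (keep in mind these rules are purely advisory):

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /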

Using Fail2ban

A large number of these bots can be considered “good”: they are not an issue because they respect robots.txt. However, most crawlers, bots, parsers and spiders are “bad” and do not respect this file, so there are additional steps you can take to deny them access.

Edit /etc/fail2ban/jail.local:

Add this to the end:
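The jail below is a sketch matching the parameters explained next; the log path assumes a default Debian/Ubuntu Apache layout, so adjust it (and the dest= address) to your own setup:

    [apache-badbots]
    enabled  = true
    filter   = apache-badbots
    action   = iptables-multiport[name=BadBots, port="http,https"]
               sendmail-buffered[name=BadBots, lines=5, dest=abuse@example.com]
    logpath  = /var/log/apache2/access.log
    bantime  = 172800
    findtime = 360
    maxretry = 1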

Fail2ban scans the access log; if a bot from the list has accessed the site, it is banned immediately!

Simple and effective.

bantime – number of seconds the IP will be banned (e.g. 172800 = 48 hours, or 2 days)

findtime – the time window, in seconds, within which matches are counted (e.g. 360 = 6 minutes)

maxretry – how many times a bot must appear in the log within findtime before it is banned

In the action, change dest= to your own address, e.g. abuse@example.com

Now we have to define the filter and the regex that Fail2ban will use to find the bots.

Edit the file /etc/fail2ban/filter.d/apache-badbots.conf.

This file already exists, so you just have to edit the lines to match these:
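The relevant part of the stock filter looks roughly like this; the badbots list below is heavily abbreviated, so extend badbotscustom with the user agents you actually want to ban:

    [Definition]

    badbotscustom = EmailCollector|WebEMailExtrac|TrackBack/1\.02|sogou music spider
    badbots = Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|ContentSmartz|DataCha0s/2\.0|Missigua Locator 1\.9

    failregex = ^<HOST> -.*"(GET|POST).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$

    ignoreregex =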

It is possible that this configuration won’t suit your needs, so I have commented the regexes that work.

Now let’s check whether the detection works.

The syntax is:
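fail2ban-regex takes the log file first and the filter second; assuming the default Debian/Ubuntu Apache log location, the test run looks like this:

    fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-badbots.conf

fail2ban-regex will report how many lines matched the failregex.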

* access.log = path to the Apache access log

After these configurations, restart with:
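On a sysvinit/upstart box (typical in 2015) this is the service command; on systemd machines use systemctl instead:

    service fail2ban restart
    # or, on systemd systems:
    systemctl restart fail2ban

Afterwards, fail2ban-client status apache-badbots shows whether the jail is loaded and which IPs are currently banned.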

That’s all folks!

Sebastijan Placento

