Fighting Spam with Bogofilter

This page describes the system I use to avoid being annoyed by the huge amount of spam that is sent to me. My goals for this system have been:

I feel that I've achieved these goals, and have written this page in the hope that other people can use it to be free of spam as well.

It is presumed throughout this page that you use IMAP for accessing your mail, Maildirs for storing the mail on the server and Procmail for mail filtering. You might still find some inspiration if you use other systems.

Required software

The system I describe here is based on the following software

I would have preferred to use maildrop in place of procmail, since I find it easier to use. But my hosting provider uses procmail.

How to use it

When you have set up a system according to the directions given on this page, you will have the following additional folders on your IMAP account: "Spam", "BogoTrain/Spam" and "BogoTrain/NotSpam".

Ideally all spam sent to you should be placed in the "Spam" folder.

When some spam reaches your inbox, move it to the "BogoTrain/Spam" folder. The system then uses that mail to train Bogofilter, making it less likely that similar spam will get past Bogofilter the next time. Once the mail has been used for training, it is moved to the "Spam" folder.

Likewise, if any relevant mail is ever placed in the "Spam" folder, you move it to the "BogoTrain/NotSpam" folder in order to tell Bogofilter that this is not spam. The mail will later be moved back into your inbox.

How to set it up

I won't go into much detail about how to install the needed software. I expect most of it to be installed already, or to be installable with your Linux distribution's package manager.

Setting up an MTA with procmail is beyond the scope of this page, but it should be described on the procmail homepage or in the documentation for your MTA.

If you need to compile Bogofilter from source, it comes with a "configure" script, and can be installed in the same way as the countless other programs using this scheme.

I don't know of any Linux distributions that ship without a cron program.

Creating folders

The system uses the three folders mentioned above. You should create these folders as the first step.
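If you have shell access to the server, the folders can also be created directly on disk. The sketch below assumes the on-disk names that the training script further down relies on (".Spam", ".BogoTrain.Spam" and ".BogoTrain.NotSpam" inside the maildir); the helper name create_folders is only for illustration. Some IMAP servers expect additional bookkeeping or a folder subscription, so creating the folders from your mail client remains the safest route.

```ruby
require "fileutils"

# Folder names as they appear on disk (an assumption; check how your
# IMAP server names subfolders inside the maildir).
FOLDERS = [".Spam", ".BogoTrain.Spam", ".BogoTrain.NotSpam"]

# Create each folder with the three subdirectories every maildir needs.
def create_folders(maildir)
  FOLDERS.each do |folder|
    %w[cur new tmp].each do |sub|
      FileUtils.mkdir_p(File.join(maildir, folder, sub))
    end
  end
end

# Example call (the path is an assumption):
#   create_folders(File.join(ENV["HOME"], "Maildir"))
```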

Configuration of procmail

You will need the following in your .procmailrc

MAILDIR=path to your maildir
      
:0fw
| bogofilter -p -e

:0
* ^X-Bogosity: Spam
$MAILDIR/.Spam/

Replace "path to your maildir" with the path to your maildir. This is usually "~/Maildir" (written as "$HOME/Maildir" in .procmailrc, since procmail does not expand "~"), but it depends on the configuration of the MTA on your server.

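The first rule makes procmail pass every mail through the "bogofilter -p -e" command. The output of this command is then used in place of the mail when processing the following rules. The "-p -e" options make bogofilter run in a mode where it will write the mail to its output with an "X-Bogosity" header added (-p), and exit with status 0 whether the mail is spam or not, so that procmail does not mistake the classification for a failed filter (-e).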

The second rule moves the mail into the "Spam" folder if it contains a header starting with "X-Bogosity: Spam", which is what bogofilter adds when it categorizes a mail as spam.
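To make the second rule more concrete, here is a rough Ruby equivalent of the test it performs. It is only a sketch of the matching logic, not something procmail runs: the method name marked_as_spam? is made up, the header value in the comment is an illustrative sample, and the match is case-insensitive because that is how procmail matches by default.

```ruby
# Return true if the header block of a raw mail contains a line that
# starts with "X-Bogosity: Spam". Only the headers are inspected; the
# body (everything after the first blank line) is ignored.
def marked_as_spam?(raw_mail)
  headers = raw_mail.split(/\r?\n\r?\n/, 2).first.to_s
  headers.each_line.any? { |line| line.match?(/\AX-Bogosity: Spam/i) }
end

# Illustrative input (the header value is a sample, not real output):
#   marked_as_spam?("Subject: hi\nX-Bogosity: Spam, tests=bogofilter\n\nbody\n")
```

Note that a mail which merely mentions "X-Bogosity: Spam" in its body or in the middle of another header does not match.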

Setting up the training script

The system uses the Ruby script shown below to train the bogofilter program and move mails out of the "BogoTrain/Spam" and "BogoTrain/NotSpam" folders.

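#! /usr/bin/env ruby

require "fileutils"
require "shellwords"
include FileUtils

# Log file; by default "~/bogotrain.log".
LOGFILE = File.new(File.join(ENV["HOME"], "bogotrain.log"), "a")
# Path to the bogofilter executable.
BOGO_PROG = File.join(ENV["HOME"], "local", "bin", "bogofilter")

# The Maildir paths below are relative to the home directory.
Dir.chdir(ENV["HOME"])

def log(str)
  LOGFILE.write(str)
  LOGFILE.write("\n")
end

# Register the mail as spam (-s), then move it to the "Spam" folder.
def mark_as_spam(filename)
  log "Marking this file as SPAM and moving it to the spam folder: #{filename}"
  # Escape the file name, since the command is run through the shell.
  cmd = "#{BOGO_PROG} -s < #{filename.shellescape}"
  log cmd
  system cmd
  mv filename, "Maildir/.Spam/cur"
end

# Register the mail as non-spam (-n), then move it back to the inbox.
def mark_as_ham(filename)
  log "Marking this file as ham and moving it back to the inbox: #{filename}"
  cmd = "#{BOGO_PROG} -n < #{filename.shellescape}"
  log cmd
  system cmd
  mv filename, "Maildir/cur"
end

log "Started at #{Time.now}"
Dir.glob("Maildir/.BogoTrain.Spam/cur/*").each { |fn| mark_as_spam fn }
Dir.glob("Maildir/.BogoTrain.NotSpam/cur/*").each { |fn| mark_as_ham fn }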

You will need this file or an equivalent script in some other language somewhere on your server where cron can find it.

The script writes a log of what it is doing to the file named in the "LOGFILE" line. If you leave this line unchanged, the log will be written to "~/bogotrain.log". If you don't want any logging, remove the "LOGFILE" line and the two "LOGFILE." lines in the "log" function.

The "BOGO_PROG" line tells the script where to find the "bogofilter" executable. If you leave the line unchanged, it will expect to find it at "~/local/bin/bogofilter". Change it to fit the setup on your server.

Make sure that the file containing the script is executable (for example with "chmod +x"), since cron runs it directly.
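One possible refinement, which is not part of the script above, is to move a mail only when bogofilter reports success, so that a failed training run leaves the mail in the training folder for the next attempt. The sketch below assumes bogofilter's -I option (read the message from a named file) so that no shell is involved; train_and_move and its parameters are hypothetical names.

```ruby
require "fileutils"

# Train bogofilter on one mail and move it only if training succeeded.
#   prog:     path to the bogofilter executable
#   flag:     "-s" to register as spam, "-n" to register as non-spam
#   filename: the mail to train on
#   dest_dir: directory the mail is moved to afterwards
# Returns true if the mail was trained and moved, false otherwise.
def train_and_move(prog, flag, filename, dest_dir)
  # Passing the arguments separately bypasses the shell, so unusual
  # characters in the file name need no quoting.
  trained = system(prog, flag, "-I", filename)
  return false unless trained

  FileUtils.mv(filename, dest_dir)
  true
end

# Example call (paths are assumptions):
#   train_and_move("/usr/bin/bogofilter", "-s", mail_path, "Maildir/.Spam/cur")
```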

Configuration of cron

You will need the text below in your crontab. If you are using Vixie Cron, you can edit your crontab by issuing a "crontab -e" command.

20 * * * *  path to the ruby script/bogotrain.rb

This will make cron run the script every hour at 20 minutes past the hour. Replace "path to the ruby script" with the path to the directory where you stored the Ruby script.
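If the five crontab fields are unfamiliar: the first one is the minute, and the four asterisks mean "every hour, every day of the month, every month, every day of the week". The small sketch below works out when such an entry fires next; next_run is a made-up helper, and it ignores daylight saving time changes.

```ruby
# Next time a "20 * * * *" crontab entry fires strictly after `now`.
def next_run(now)
  # Minute 20 of the current hour, in the same UTC offset as `now`.
  candidate = Time.new(now.year, now.month, now.day, now.hour, 20, 0, now.utc_offset)
  # If that moment has already passed, the next run is one hour later.
  candidate > now ? candidate : candidate + 3600
end

# Example: puts next_run(Time.now)
```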

Initial training

In order to work, Bogofilter has to know what your spam and your non-spam look like. When you follow the directions given above, it will learn this while it categorizes your mail.

But Bogofilter needs some initial training before you can start using it. This means that you will have to archive some spam and non-spam manually for a period of time, and then use this to train Bogofilter.

When you have a collection of spam and non-spam with spam in one folder, and non-spam in another, you can perform the initial training by issuing these commands:

bogofilter -s -B Maildir/.Spam
bogofilter -n -B Maildir/.NonSpam

Replace the path in the first line with the path to the maildir where you stored the spam, and the path in the second line with the path to the maildir where you stored the non-spam.

The -s and -n options mean that Bogofilter should register the input as spam and non-spam respectively. The -B option tells Bogofilter to run in bulk mode, which in this case means that it will process all mails in the given maildir.
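Before running the initial training it can be reassuring to check that both collections actually contain mail. The following is a minimal sketch with made-up helper names (maildir_count, training_commands); it only counts files and builds the two commands shown above as argument lists, and it assumes the usual maildir layout with "cur" and "new" subdirectories.

```ruby
# Count the messages in a maildir, looking in both "cur" and "new".
def maildir_count(dir)
  %w[cur new].sum do |sub|
    Dir.glob(File.join(dir, sub, "*")).count { |f| File.file?(f) }
  end
end

# Build the two training commands as argument lists.
# The default program name assumes bogofilter is on your PATH.
def training_commands(spam_dir, ham_dir, prog = "bogofilter")
  [[prog, "-s", "-B", spam_dir], [prog, "-n", "-B", ham_dir]]
end

# Example (paths are assumptions):
#   puts maildir_count("Maildir/.Spam")
#   training_commands("Maildir/.Spam", "Maildir/.NonSpam").each { |cmd| system(*cmd) }
```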

Closing remarks

At the time of writing I have been using this system for three months.

During this time only a very small fraction of the spam sent to me has reached my inbox. Most of the spam that did get this far was marked with a bogosity level above 0.9.

I have never had to rescue a non-spam message from the Spam folder. Most of the non-spam messages have been assigned very low bogosity values, typically below 0.1. Some bulk email (that I actually wanted to receive) was assigned values around 0.5 until I taught Bogofilter that it was not spam.

This has made me consider lowering the spam-cutoff value in order to get rid of the last few spam messages in my inbox. But they are few enough that I have not bothered.
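To illustrate the trade-off, here is a toy sketch of how a cutoff separates mail by bogosity. The scores and both cutoff values are invented for the example (loosely following the ranges mentioned above), and it assumes that a score at or above the cutoff counts as spam.

```ruby
# Classify a bogosity score against a spam cutoff.
# Assumption: a score at or above the cutoff counts as spam.
def spam?(score, spam_cutoff)
  score >= spam_cutoff
end

# Invented example scores.
scores = { "missed spam" => 0.93, "wanted bulk mail" => 0.5, "ordinary mail" => 0.05 }

# Hypothetical cutoffs: a strict one and a lowered one.
[0.99, 0.9].each do |cutoff|
  caught = scores.select { |_, s| spam?(s, cutoff) }.keys
  puts "cutoff #{cutoff}: #{caught.inspect}"
end
```

With these numbers the lowered cutoff catches the missed spam, while the bulk mail at 0.5 still gets through untouched.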

I regard these results as a great success. The Bogofilter program is very effective and relatively simple to use. Now, go ahead and get rid of that spam!

