So What Makes a Good Spam Filter Anyway?
By Alan Hearnshaw
Spam Filters. Most of us know we need one. Some of us know we need
a better one, but how many stop to think what actually makes a
good spam filter in the first place?
This is not just a rhetorical question. It is a question that
many users – and many developers – do not ask, and consequently,
Maybe this could be better answered by defining here the
qualities of the perfect spam filter. We’ll call our perfect
spam filter the “SpamSplatter 3000”. Here are some of the
defining qualities of “SpamSplatter 3000”:
1. It requires zero interaction from the user.
2. It produces zero false positives (good messages identified as
   bad) and zero false negatives (bad messages identified as
   good).
3. It is transparent – that is, you only ever see good messages
   and never need even be aware that spam exists.
That’s it. Not much of a shopping list, is it? Of course,
“SpamSplatter 3000” hasn’t been invented yet (and if it ever is, I
want a piece of the action), but it does give us a frame of
reference when looking for the best filter we can find.
Let’s take each point in turn:
It requires zero interaction from the user
There are two kinds
of filters that come near to this ideal currently: Bayesian
Filters and Community Filters. Bayesian filters strip messages
down to small “word bites”, or tokens and maintain a database
containing lists of good and bad tokens. When a new message is
encountered, the filter strips this message down to tokens,
compares them to the database, and applies a formula based on the
English mathematician Thomas Bayes’ theorem for calculating
probabilities. Over time, the Bayesian filter “learns” the
characteristics of spam messages.
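The token-and-database idea described above can be sketched in a few lines of Python. This is only an illustration, assuming a naive Bayes scorer with add-one smoothing; real products use their own tokenizers and formulas, and the class name `BayesianFilter` and its methods are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Strip a message down to lowercase word tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class BayesianFilter:
    """Hypothetical naive Bayes filter (an assumed, simplified design)."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.good_tokens = Counter()  # token -> number of good messages containing it
        self.spam_msgs = 0
        self.good_msgs = 0

    def train(self, text, is_spam):
        """Record each distinct token of a message the user has classified."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.good_tokens.update(tokens)
            self.good_msgs += 1

    def spam_probability(self, text):
        """Combine per-token evidence with Bayes' theorem, in log-odds form."""
        log_odds = math.log((self.spam_msgs + 1) / (self.good_msgs + 1))
        for tok in set(tokenize(text)):
            # Add-one smoothing keeps unseen tokens from producing zero.
            p_spam = (self.spam_tokens[tok] + 1) / (self.spam_msgs + 2)
            p_good = (self.good_tokens[tok] + 1) / (self.good_msgs + 2)
            log_odds += math.log(p_spam / p_good)
        # Numerically stable conversion from log-odds to a probability.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)
```

Each call to `train` is the "learning" step: the more messages the user corrects, the better the token counts reflect that user's own mail.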
Community Filters simply work on a voting system whereby every
user that receives a spam message “votes” it as spam. This
information is stored on a central server and when enough votes
are received, the message is blocked for all users in the
community.
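A minimal Python sketch of such a voting scheme follows. It assumes that messages are matched by an exact hash of their normalized text and that three votes are enough to ban a message; both are illustrative choices, and the `CommunityFilter` class is hypothetical rather than any vendor's design.

```python
import hashlib
from collections import defaultdict


class CommunityFilter:
    """Hypothetical central vote store (an assumed, simplified design)."""

    def __init__(self, vote_threshold=3):
        self.vote_threshold = vote_threshold  # assumed value, not from the article
        self.votes = defaultdict(set)         # fingerprint -> ids of users who voted

    @staticmethod
    def fingerprint(message):
        """Hash the message after collapsing case and whitespace."""
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def vote_spam(self, user_id, message):
        """Register one user's vote; repeat votes by the same user count once."""
        self.votes[self.fingerprint(message)].add(user_id)

    def is_banned(self, message):
        """A message is banned once enough distinct users have voted."""
        voters = self.votes.get(self.fingerprint(message), ())
        return len(voters) >= self.vote_threshold
```

An exact hash is easy to defeat by changing a single character, so a real service would need a fuzzier fingerprint; the sketch only shows the voting logic.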
As can be seen, user interaction with these types of filters
is mainly limited to a two-button operation – correcting wrongly
identified messages – and the more accurate the filter, the less
those buttons are used.
OK, so that’s pretty good. Not exactly zero interaction, but if
the filter is accurate enough, then it should be pretty near.
That brings us to point two:
It produces zero false positives or negatives
This is the area in which most spam filter development is
concentrated, and things are getting pretty good nowadays. It is
not at all
unusual to see an efficient modern filter achieve accuracy of
96% or better. It is, of course, far better to have a false
negative than a false positive if you are ever going to tear
yourself away from the killed mail folder!
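To see why a single accuracy figure is not the whole story, here is a small Python calculation. The mailbox below (60 spam and 40 good messages) is invented purely for illustration, and `filter_stats` is a hypothetical helper.

```python
def filter_stats(results):
    """Summarize (actually_spam, flagged_spam) boolean pairs."""
    tp = fp = tn = fn = 0
    for actually_spam, flagged_spam in results:
        if actually_spam and flagged_spam:
            tp += 1   # spam correctly caught
        elif actually_spam:
            fn += 1   # false negative: spam that reached the inbox
        elif flagged_spam:
            fp += 1   # false positive: good mail thrown in the killed folder
        else:
            tn += 1   # good mail correctly delivered
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Invented example: 57 spam caught, 3 spam missed, 1 good message lost,
# 39 good messages delivered.
mailbox = ([(True, True)] * 57 + [(True, False)] * 3
           + [(False, True)] * 1 + [(False, False)] * 39)
stats = filter_stats(mailbox)
print(stats)
```

With these made-up numbers the filter scores 96% accuracy, yet one good message in forty still lands in the killed mail folder, which is exactly the kind of error that keeps you checking it.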
Of course, by definition, community filters cannot reach 100%
accuracy as someone has to be getting the spam to be voting it
as such! Theoretically, a Bayesian filter may be able to
eventually get quite close to 100% accuracy, so at least there
is hope there. Content-based filters (those that look for
certain words, phrases or other indicators in a message to
identify it as spam) will almost certainly not get much higher
accuracy figures than the best of them can achieve today.
Adapting to changing spam requires new filters to be created on
an ongoing basis.
And finally, we come to the holy grail of spam filtering:
It is transparent
Strangely enough, not enough work seems to be
done in trying to achieve this goal. Some of the best filters on
the market today identify spam with impressive accuracy and then
simply place it in a “killed mail” folder for your later
perusal. Now, forgive me if I’m missing something here, but
isn’t the point to save you having to wade through the junk
mail? Isn’t that what you bought the filter for? With the
“SpamSplatter 3000”, you don’t need to do that.
As we haven’t achieved 100% accuracy yet (and probably never
will), the only way to free us from checking the killed mail
folder is a challenge/response system. This is where a message
is automatically sent back to the sender requiring them to take
some action for their message to actually be delivered.
Some systems tend to go overboard with the challenge/response
system. These systems – often called “Whitelist” systems – block
messages from anyone that isn’t in the user’s friends list.
Guaranteed 100% effective, but too drastic a measure for most
users.
Now, it seems that the most intelligent use of this system would
be to send challenges only to the senders of messages that were
flagged as “questionable”. Good messages can be delivered,
definite spam can be deleted and questionable ones would earn
themselves a challenge.
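The three-way split of deliver, delete and challenge might look like the following Python sketch. The score thresholds (0.2 and 0.95), the `triage` function and the `ChallengeQueue` class are all assumptions made for illustration; the spam score is taken to be a probability between 0 and 1 from whatever filter is in use.

```python
import secrets
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    CHALLENGE = "challenge"
    DELETE = "delete"


def triage(spam_score, sender, friends, deliver_below=0.2, delete_above=0.95):
    """Decide what to do with a message; thresholds are assumed values."""
    if sender in friends:
        return Action.DELIVER        # known senders are never challenged
    if spam_score < deliver_below:
        return Action.DELIVER        # clearly good
    if spam_score > delete_above:
        return Action.DELETE         # definite spam
    return Action.CHALLENGE          # questionable: ask the sender to confirm


class ChallengeQueue:
    """Holds questionable messages until the sender answers the challenge."""

    def __init__(self):
        self.pending = {}            # challenge token -> (sender, message)

    def hold(self, sender, message):
        """Store the message and return the token to send to the sender."""
        token = secrets.token_hex(8)
        self.pending[token] = (sender, message)
        return token

    def respond(self, token, friends):
        """Release the held message and remember the sender as a friend."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None              # unknown or already used token
        sender, message = entry
        friends.add(sender)
        return message
```

Because only the questionable middle band is challenged, most legitimate senders never see a challenge at all, and those who do only need to answer once.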
So, to sum up, let’s rewrite the qualities of our perfect filter
and get a shopping list of what to look for while we wait for
the “SpamSplatter 3000” to arrive:
1. Simple, minimal setup and maintenance.
2. Extremely low rate of false positives and as few false
   negatives as possible.
3. A transparent “fail-safe” mechanism whereby the victims of
   those false positives can force the message through to you.
It’s simple really. Now, who’s going to build me this
Alan Hearnshaw is the owner of http://www.WhichSpamFilter.com, a
site which provides weekly in-depth spam filter reviews, user
help and guidance and a community forum.