Spam from a Technical Viewpoint


Anti-spam filtering and prevention tools

This page lists some of the more commonly-used anti-spam filtering and prevention tools. Please note that the presence or absence of any specific product on this page does not represent any kind of value judgment by either the author or InternetNZ: we strongly recommend that you conduct your own web searches for products that might suit your needs, regarding this page as simply a starting resource.

If you run a Linux-based mail server, then either SpamAssassin or CRM114 may be just the tool you need. Both of these tools boast very high detection rates, but they also require some expertise to install and maintain. John Graham-Cumming's POPFile is also an excellent solution based on Bayesian filtering that can operate with many different mail servers and clients. Among New Zealand's contributions to spam handling is MailWasher Pro (a free version is available).

Various companies offer spam pre-filtering services, where they act as a front-end filter for your mail server. The largest and best known is probably Brightmail (note - this site requires Flash), but there are many others including New Zealand's own Death2Spam service. Front-end filtering services are good if you don't have an in-house system administrator.

Spam from a technical viewpoint

Technical approaches to dealing with spam can be divided into two broad classes: those that prevent spam from reaching your mailbox in the first place, and those that remove it from your mailbox before you have to read it (these are known as server-side and client-side technologies respectively). Preventing the spam from reaching you in the first place is clearly the better approach, because it reduces the amount of Internet traffic and disk storage wasted carrying the stuff. It is, unfortunately, often also more difficult because it creates a significant bottleneck in the processing of e-mail.

Spammers use innumerable tricks to get their rubbish into your mailbox, preferably at someone else's expense: forged headers, misleading subject lines, deliberate misspellings of "naughty" words, erroneous or perverted use of the HTML language used to create formatting in web sites - all of these are commonplace in spam, and scarcely a day goes by without spammers coming up with new tricks to bypass the latest technologies.

This section of the StopSpam web site is intended only as the briefest of overviews into some of the methods available to prevent or remove spam, and is not presented as being in any way authoritative. For a much more detailed look into how spam "works", and the methods that can be used to detect it, we recommend that you examine the author's Spam White Paper (a 400KB document in Adobe PDF format). For the remainder of this section, we will examine two key anti-spam technologies to illustrate the problems associated with technical management of spam: the first of these technologies uses Black Lists to flag known spammers, and White Lists to flag mail from known "good guys". The second is the general process of filtering mail - using sets of pre-written rules that look for patterns, keywords and structural characteristics that identify messages as spam.

Black and White

A Black List is a kind of specialized online database that a mail server can query with the address of another system on the Internet (usually the address of a system that has connected to the server and asked to send some mail). If the other system's address is known by the online database, the mail server can assume that the system belongs to a spammer and can terminate the connection. On the surface, blacklists sound like a good solution to spam and indeed, they can be quite effective, but they suffer from two key problems: the first is that they are run by humans, and humans can make decisions based on erroneous information or skewed preferences... As a result, some blacklists have gained a reputation for blacklisting sites unfairly or even incorrectly. The second problem is that spammers move around a lot, and also have a tendency to use "stolen bandwidth" - that is, to use otherwise innocent systems that have been compromised in some way (usually by either a configuration error or the presence of a virus) to send their dross. Chosen carefully, a good blacklist, such as the SpamHaus Project's SBL, can be a valuable component of an anti-spam technique, but the process of evaluating blacklists is tricky and best left to experts.
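As an illustration of the query described above, here is a minimal sketch in Python. It assumes the blacklist is published through DNS, which is a common arrangement but not something this page specifies: the connecting system's IPv4 address is reversed and looked up under the list's zone, and any answer at all means "listed". The zone name bl.example.org and the function names are hypothetical placeholders, not the details of any real list.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the list answers for this address, False if the name does not resolve.

    `resolve` can be swapped out so the logic can be tried without network access.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not known to this (hypothetical) list.
        return False
```

A mail server would call is_listed() with the address of the connecting system and drop the connection on a True result. Note that this sketch treats a failed lookup as "not listed", which is one reasonable choice but not the only one.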

A White List, on the other hand, approaches the problem of spam from the inside, by preventing anyone not on the list from sending mail to the user's mailbox. Many Internet Service Providers offer whitelisting services, but there are a number of drawbacks: if you only ever exchange mail with a small group of people that does not change much, then a whitelist may be a very effective solution for you, but if you need to receive mail from strangers, or if you use e-commerce facilities, then a whitelist may result in too great a loss of mail to be effective. Most white lists have mechanisms whereby someone who is not on your list can mail you by going through a confirmation process of some kind, but there is clear evidence that many people simply will not make the effort to complete that process, instead discarding their mail to you.

The term White List is also used in some products to refer to a list of specific exceptions to a black list. While this sounds like a kind of balancing act, it can actually be surprisingly effective, and the ability to combine the best aspects of both types of list is being seen increasingly commonly in commercial anti-spam solutions.
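The idea of a white list acting as a set of exceptions to a black list can be sketched as follows. This is an illustration under our own assumptions rather than a description of any particular product: both lists are held as IP networks in CIDR notation, the exceptions are checked first, and the function name is_blocked is hypothetical.

```python
import ipaddress


def is_blocked(sender_ip, blacklist, exceptions):
    """Block a sender that falls in a blacklisted range, unless an exception covers it."""
    addr = ipaddress.ip_address(sender_ip)
    # Exceptions (the "white list" in this sense) always win over the black list.
    if any(addr in ipaddress.ip_network(net) for net in exceptions):
        return False
    return any(addr in ipaddress.ip_network(net) for net in blacklist)


# Example data only: a blacklisted range with one trusted host inside it.
BLACKLIST = ["203.0.113.0/24"]
EXCEPTIONS = ["203.0.113.25/32"]
```

With these example lists, mail from 203.0.113.7 would be blocked, while mail from 203.0.113.25 would be allowed through even though it sits inside the blacklisted range.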

Filtering

Filtering is the term used to describe any process that attempts to identify and remove spam after it has been accepted by the mail server, but before it appears in the user's new mail folder. The idea behind filtering is that spam will have certain specific characteristics that separate it from "regular" mail, and that it is possible to determine that any given message is spam by looking for those characteristics. Filtering can do a very good job - good systems can achieve detection rates higher than 90% - but as with any automated computer process of this kind, there will always be some messages that slip through the cordon (these are called false negatives) and more importantly, valid messages that are incorrectly identified as spam (called false positives). All filtering systems will generate a certain number of false positives and false negatives - part of the process of choosing a system will involve working out acceptable rates. In the broadest terms, there are two different types of filtering system in common use at present:

Rule-based filtering

With this type of filtering, the message is compared against a list of "rules" that identify words, phrases or aspects of message structure that might point to the message being spam. More sophisticated systems use a process of weighting, in which the more rules are matched, the higher the "weight" the message will get; once the weight passes a particular threshold, it is considered spam and handled accordingly. Probably the most widely-used rule-based filtering system at present is the open source SpamAssassin system, but there are many other products that will do this type of test, including MailMarshall, MailWasher, Brightmail's subscription service and Mercury/32. Many mail clients, such as Outlook and Pegasus Mail, also have rule-based spam detection built-in. Rule-based filtering works best at the mail server level: its most significant disadvantage is that it must be kept up-to-date to remain effective.
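The weighting process described above can be sketched in a few lines of Python. The rules, weights and threshold here are invented purely for illustration and are not taken from SpamAssassin or any other product; a real rule set would be far larger and, as noted, would need regular updating.

```python
import re

# Hypothetical rules: (pattern, weight). Real systems ship hundreds of these.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.5),
    (re.compile(r"v[i1!]agra", re.IGNORECASE), 3.0),   # catches deliberate misspellings
    (re.compile(r"click here", re.IGNORECASE), 1.0),
    (re.compile(r"^Subject:.*[A-Z]{8,}", re.MULTILINE), 1.0),  # shouting subject line
]

# Assumed cut-off: a message scoring at or above this is treated as spam.
THRESHOLD = 3.0


def score(message):
    """Add up the weights of every rule that matches the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_spam(message):
    """Classify the message by comparing its total weight with the threshold."""
    return score(message) >= THRESHOLD
```

A single matched rule, such as the word "free" in an otherwise ordinary message, is not enough to pass the threshold in this sketch; it is the accumulation of matches that tips a message over.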

Bayesian filtering

This type of filtering works by building a statistical language database that identifies two particular types of word - those that are very likely to appear in spam, and those that are very unlikely to appear in spam. It does this by "learning" from messages that the user explicitly identifies as either spam or not-spam (also known as "ham" in the trade). If you keep feeding a Bayesian filtering system samples of both your good mail and the spam you receive, it will, over time, build up a solid statistical model for your mail that it can use to distinguish between spam and ham. Bayesian filtering works best for single users (because it depends on the specific types of mail the user receives), and can achieve extremely high detection rates (an aggressively-maintained Bayesian filter can often trap more than 98% of spam). The disadvantage of Bayesian filtering is that it requires ongoing input from the user and it can slow down the processing of the user's inbox significantly. Nonetheless, for those willing to invest the effort required to use a Bayesian filtering system properly, few better filtering mechanisms exist. Practically all existing Bayesian spam filters are based on pioneering work done by Paul Graham. Bayesian filtering plugins are available for many existing mail clients.
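To make the "learning" process more concrete, here is a minimal sketch of a filter of this general kind. It is a textbook naive Bayes classifier with add-one smoothing, chosen for brevity; it is not Paul Graham's exact method, nor the method used by any particular product, and the class and method names are our own.

```python
import math
import re
from collections import Counter


class BayesFilter:
    """Toy spam/ham classifier that learns word counts from labelled messages."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Record one message the user has identified as 'spam' or 'ham'."""
        self.words[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate the probability (0.0 to 1.0) that the text is spam."""
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed share of training messages with this label.
            log_p = math.log((self.messages[label] + 1) / (total + 2))
            n = sum(self.words[label].values())
            for word in self.tokenize(text):
                # Add-one smoothing so unseen words never give a zero probability.
                log_p += math.log((self.words[label][word] + 1) / (n + vocab + 1))
            log_score[label] = log_p
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

In use, every message the user marks as spam or ham is passed to train(), and incoming mail is scored with spam_probability(). An untrained filter returns 0.5 for everything, which reflects the point made above: the filter is only as good as the samples it has been fed.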

Limits of technology

Anyone who is actively involved in developing technical anti-spam solutions will tell you that the process is effectively an arms race - the spammers are continually developing more devious methods to bypass the new techniques that the anti-spammers come up with. Unfortunately, the anti-spam lobby is always on the back foot in this arms race because they are necessarily in a reactive posture - they have to respond to what the spammers produce, and it is hard to create pre-emptive remedies.

Part of the problem faced by developers of anti-spam tools is the sheer quantity of spam being distributed. Current estimates suggest that something in the order of 12 billion spams are sent out every day, and even if it were possible to create an anti-spam technology that could deflect 99% of them (itself a very unlikely prospect), that would still mean that 120 million were getting through every day.
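The arithmetic behind that figure is easy to check. The short sketch below simply restates the calculation using the estimate quoted above; the function name is our own, and the block rate is taken as a whole-number percentage to keep the sum exact.

```python
def spam_getting_through(sent_per_day, block_rate_percent):
    """Messages per day that slip past a filter blocking the given whole-number percentage."""
    return sent_per_day * (100 - block_rate_percent) // 100


# 12 billion sent, 99% blocked: 1% of 12,000,000,000 still arrives.
remaining = spam_getting_through(12_000_000_000, 99)
```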

Added to this is the overriding problem of the false positive, or a legitimate message that is incorrectly classified as spam. For most businesses, false positives constitute a far more serious problem than almost any amount of spam, yet as anti-spam technologies get more aggressive in an attempt to cope with the torrents of spam, the risk of generating false positives grows higher and higher. While some anti-spam technologies can claim very good success rates in trapping spam, no technology based on non-human evaluation can guarantee that it will produce no false positives at all.
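The trade-off between aggressiveness and false positives can be illustrated with a small sketch. It assumes a filter that gives each message a numeric score and flags anything at or above a threshold; the scores in SAMPLE are made up for the example and do not come from any real mail.

```python
def error_rates(scored, threshold):
    """Compute error rates for a list of (score, is_spam) pairs at a given threshold."""
    spam_scores = [s for s, is_spam in scored if is_spam]
    ham_scores = [s for s, is_spam in scored if not is_spam]
    missed = sum(1 for s in spam_scores if s < threshold)          # spam let through
    wrongly_flagged = sum(1 for s in ham_scores if s >= threshold)  # good mail trapped
    return {
        "false_negative_rate": missed / len(spam_scores) if spam_scores else 0.0,
        "false_positive_rate": wrongly_flagged / len(ham_scores) if ham_scores else 0.0,
    }


# Invented scores: four spam messages and four legitimate ones.
SAMPLE = [
    (6.0, True), (4.5, True), (3.2, True), (1.8, True),
    (2.5, False), (0.4, False), (0.9, False), (3.4, False),
]

cautious = error_rates(SAMPLE, threshold=4.0)
aggressive = error_rates(SAMPLE, threshold=2.0)
```

On this invented data, the cautious threshold traps no legitimate mail but misses half the spam, while the aggressive threshold misses only a quarter of the spam at the cost of trapping half the legitimate messages.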

On the positive side of the ledger, ongoing improvements in anti-spam technology are leading to increasingly desperate measures being taken by spammers to get their crud past the defenses. At the time of writing, a lot of spam that gets through the barriers has been so deliberately mashed up that the words it contains are barely recognizable - we have to assume that advertisements that the mark can scarcely read are going to be increasingly ineffectual and that sales will suffer as a result. Or so we hope.

From the end-user's perspective, the problem with technical anti-spam solutions is that they are... well, technical. There are so many different solutions using so many different techniques and approaches that you practically need the proverbial degree in astrophysics to choose between them, let alone configure or maintain them. This technical diaspora has always been a hallmark of the Internet, the end result usually being the creation of a so-called "High Priesthood" who understand the mysteries, while lesser mortals are left wondering.

It seems unlikely that it will ever be possible to produce a purely technical solution to spam - to do so would require levels of artificial intelligence not even visible on the computing horizon at this point. More likely, the solution to spam will be a combination of legislation (to make examples of the worst offenders), technical methods (to deal with the small fry that are left over) and education (to make the public more aware, and hence less likely to buy the "products").