A Hierarchical Spam Filter

When multiple filters are used to filter an email, each can assign a probability that the email is spam, and classification of the email as spam can be based upon a weighted average of the filter probabilities.
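As a rough illustration of the weighted approach, the sketch below combines per-filter spam probabilities into one score and compares it with a cutoff. The function names, the example weights, and the 0.5 threshold are assumptions chosen for illustration; they are not taken from Emmesmail.

```python
def weighted_spam_probability(probabilities, weights):
    """Return the weighted average of per-filter spam probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per filter probability")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(p * w for p, w in zip(probabilities, weights))
    return weighted_sum / total_weight


def is_spam(probabilities, weights, threshold=0.5):
    """Classify as spam when the weighted average reaches the threshold.

    The default threshold of 0.5 is an assumed value, not one from the source.
    """
    return weighted_spam_probability(probabilities, weights) >= threshold


# Hypothetical example: three filters, the second trusted twice as much.
print(is_spam([0.9, 0.2, 0.6], [1.0, 2.0, 1.0]))
```

In this scheme every filter sees every email, and no single filter can settle the classification on its own unless its weight dominates the others.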

Alternatively, the order in which the filters are applied can be specified, together with a set of rules that decide, after each filter test, whether to classify the email on the basis of that result or to forward it to the next filter for further testing. We refer to this procedure as a hierarchical filter, to distinguish it from the "weighted approach".

Older versions of Emmesmail used multiple filters in the following sequence: 1) a sender filter (using a locally generated, user-specific whitelist and blacklist); 2) a Bayesian filter; 3) a token-paucity filter, which examined the average number of characters per token; and, whenever the user-selectable sender-filtering option was enabled, 4) an appropriateness filter, which examined what fraction of the included tokens had been seen in previous emails. Only those emails that passed each of the filters in sequence avoided being classified as spam. Only those emails found to be on the whitelist during sender filtering avoided further filtering.

Currently, Emmesmail uses multiple filters in the following sequence: 1) a whitelist filter (using a locally generated, user-specific whitelist); 2) a Bayesian filter; 3) an unrecognized-words filter, which examines what fraction of the included tokens were seen in previous emails. Only those emails that pass each of the filters in sequence avoid being classified as spam. Only those emails found to be on the whitelist during sender filtering are classified as non-spam without further filtering.
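The current sequence can be sketched as a chain of filters, each of which either returns a final verdict or forwards the email to the next filter. This is a minimal sketch and not Emmesmail's actual code: the Verdict type, the function names, the dictionary layout of an email, the whitespace tokenizer, and both threshold values are assumptions made for illustration. It also assumes that an email with too many unrecognized tokens is the one classified as spam.

```python
from enum import Enum


class Verdict(Enum):
    HAM = "non-spam"
    SPAM = "spam"
    PASS = "forward to next filter"


def whitelist_filter(email, whitelist):
    # A whitelisted sender is accepted at once, with no further filtering.
    return Verdict.HAM if email["sender"].lower() in whitelist else Verdict.PASS


def bayesian_filter(email, spam_probability, threshold=0.9):
    # spam_probability is any callable returning P(spam) for the email;
    # the 0.9 cutoff is an assumed value.
    return Verdict.SPAM if spam_probability(email) >= threshold else Verdict.PASS


def unrecognized_words_filter(email, known_tokens, max_unknown_fraction=0.5):
    # Fraction of tokens never seen in previous emails (assumed 0.5 cutoff).
    tokens = email["body"].lower().split()
    if not tokens:
        return Verdict.PASS
    unknown = sum(1 for token in tokens if token not in known_tokens)
    if unknown / len(tokens) > max_unknown_fraction:
        return Verdict.SPAM
    return Verdict.PASS


def classify(email, filters):
    # Apply the filters in order; the first definite verdict wins.
    for apply_filter in filters:
        verdict = apply_filter(email)
        if verdict is not Verdict.PASS:
            return verdict
    # An email that passes every filter is not classified as spam.
    return Verdict.HAM


# Hypothetical usage with a stand-in Bayesian scorer.
whitelist = {"friend@example.com"}
known_tokens = {"hello", "meeting", "tomorrow", "at", "noon"}
filters = [
    lambda e: whitelist_filter(e, whitelist),
    lambda e: bayesian_filter(e, lambda _email: 0.1),
    lambda e: unrecognized_words_filter(e, known_tokens),
]
message = {"sender": "stranger@example.com", "body": "hello meeting tomorrow"}
print(classify(message, filters).value)
```

Unlike the weighted approach, later filters run only for emails that earlier filters could not decide, so the order of the filters is part of the design.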



Emmes Technologies