Added an interesting link of example phishing scams to previous posts below. The link was provided by Tony.
Here is an explanation I gave to Tony, which I think is relevant to share with anyone using or interested in using AccuSpam.
Subject: Re: Is this from a spammer?
Yes it is a spam.
You received it because it had arrived in your mailbox within 3 minutes before you POPed email from your mailbox.
If using the paid version, it would be impossible for you to receive these.
The spammer sent an email from email@example.com to firstname.lastname@example.org. AccuSpam sent a confirmation to email@example.com, because there is no way for AccuSpam to know that firstname.lastname@example.org is an alias for same POP mailbox as email@example.com. For example, AccuSpam would not know if firstname.lastname@example.org is same mailbox as email@example.com.
But AccuSpam has a way to find out. When AccuSpam finds this email in the POP mailbox, it checks its database and realizes that it received the confirmation in the same POP mailbox as it sent it from. So then AccuSpam deletes the confirmation and the original spam.
The only reason AccuSpam did not do this is that you downloaded the confirmation email below before AccuSpam had a chance to check the mailbox again. AccuSpam checked the POP mailbox, sent the confirmation, and then waited 3 minutes to check the POP mailbox again. During that wait, the confirmation came back to the same POP mailbox (because firstname.lastname@example.org and email@example.com are aliases for the same POP mailbox), and you downloaded it. That is why this form of spam can only be received in the free version, and only in the rare case that you happen to hit that 3-minute wait window.
AccuSpam must wait 3 minutes between inspecting your POP mailbox, because if it opened your POP mailbox more frequently than that, then you would be unable to open your own POP mailbox, because it can only be open to one client at a time.
The paid version solves this by using two mailboxes, one that AccuSpam inspects and the other that you POP from. This is called a "proxy". There are other ways we could attempt to do a proxy, but in our analysis they were all inferior to what we chose. For example, putting a proxy on the client computer of the user would not work well because it would not work with WebMail or when user uses other computer, so we chose the dual mailbox proxy for paid version instead.
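The alias check described above (a confirmation arriving back in the same POP mailbox it was sent on behalf of) can be sketched roughly as follows. This is only an illustration, not AccuSpam's actual code: the `Message` type, the `sent` lookup, and the assumption that the number in a `cnfm_77741_...` address is a confirmation id are all hypothetical.

```python
import re
from dataclasses import dataclass

# Assumption: the digits in "cnfm_<id>_<token>@accuspam.com" identify the confirmation.
CONFIRM_RE = re.compile(r"cnfm_(\d+)_")


@dataclass
class Message:
    uid: str      # mailbox-unique id of the message
    sender: str   # address from the From: header


def find_self_confirmations(mailbox, sent):
    """Return uids to delete: confirmations that bounced back into the
    mailbox they protect, plus the original email that triggered each one.

    `sent` maps confirmation id -> uid of the email that triggered it
    (a hypothetical stand-in for the database lookup).
    """
    present = {m.uid for m in mailbox}
    to_delete = set()
    for m in mailbox:
        match = CONFIRM_RE.search(m.sender)
        if match and match.group(1) in sent:
            to_delete.add(m.uid)              # the returned confirmation
            original = sent[match.group(1)]
            if original in present:
                to_delete.add(original)       # the forged spam itself
    return to_delete
```

The check can only run on the next mailbox inspection, which is why a download during the 3-minute wait slips through in the free version.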
At 09:01 PM 8/17/2004 -0400, you wrote:
>Received: (qmail 98099 invoked by uid 3052); 18 Aug 2004 00:17:50 -0000
>Received: (qmail 98096 invoked by uid 3052); 18 Aug 2004 00:17:50 -0000
>Date: 18 Aug 2004 00:17:50 -0000
>Subject: Received your email: [BuddyN]
>From: "firstname.lastname@example.org" <cnfm_77741_HZuJwmIsI9PTY6Y5@accuspam.com>
>I [email@example.com] received the email from you [firstname.lastname@example.org],
>containing the subject above.
>If you need me to reply more urgently, simply click Reply
>and send back this entire confirmation email.
>If you sent the email to [email@example.com], the following
>does not apply to you.
>If you did NOT send an email to [firstname.lastname@example.org],
>http://AccuSpam.com can help you stop forgery spam.
>Join free http://AccuSpam.com
>100% spam blocked. 0% of non-spam blocked.
Added "web.de" so impossible to get forged emails from a web.de email address, same as was done for hotmail and yahoo.
It is as easy as follows to add forgery blocking to AccuSpam for each free email provider:
Code:
dig -x220.127.116.11

; <<>> DiG 9.2.3rc4 <<>> -x18.104.22.168
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15696
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;22.214.171.124.in-addr.arpa. IN PTR

;; ANSWER SECTION:
126.96.36.199.in-addr.arpa. 3353 IN PTR fmmailgate01.web.de.

;; AUTHORITY SECTION:
192.72.217.in-addr.arpa. 3353 IN NS nsx2.cinetic.de.
192.72.217.in-addr.arpa. 3353 IN NS nsx1.cinetic.de.

;; Query time: 2 msec
;; SERVER: 188.8.131.52#53(184.108.40.206)
;; WHEN: Wed Aug 18 09:20:27 2004
;; MSG SIZE rcvd: 124

mysql> show create table dns;
| Table | Create Table |
| dns   | CREATE TABLE `dns` (
  `Tld` varchar(127) NOT NULL default '',
  `MajorISP` tinyint(3) unsigned NOT NULL default '1',
  `PTRSupported` tinyint(3) unsigned NOT NULL default '2',
  `PTRRequired` tinyint(3) unsigned NOT NULL default '0',
  `TldNSMatches` varchar(127) NOT NULL default '',
  PRIMARY KEY (`Tld`)
) TYPE=MyISAM |
1 row in set (0.00 sec)

mysql> insert into dns values ('web.de','1','1','1','cinetic.de');
Query OK, 1 row affected (0.00 sec)
I was prompted to prioritize adding "web.de" (in advance of a planned comprehensive addition of all known (10,000+) free email providers), as I noticed the following forged email from a "web.de" address was NOT blocked by our best competitor, http://BrightMail.com:
Code:
Return-Path: <email@example.com>
Received: from 220.127.116.11 ([18.104.22.168]) by robin (EarthLink SMTP Server) with SMTP id 1bX4fK48O3NZFjX0 Tue, 17 Aug 2004 06:43:47 -0700 (PDT)
Received: from dns3.web.de (dns3.web.de [22.214.171.124]) by 126.96.36.199 with SMTP id d7AJB51Jv7; Tue, 17 Aug 2004 18:38:44 +0400
From: "Carmen Shepard" <firstname.lastname@example.org>
Reply-To: "Carmen Shepard" <email@example.com>
Subject: of 9 but
To: firstname.lastname@example.org
Cc: email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com
Message-ID: <B84EE85692174DC@web.de>
X-Mailer: crank case 62 curses
Date: Tue, 17 Aug 2004 20:43:44 +0600
Organization: philosopher 870 brides
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="=====250893080900=_"
X-ELNK-AV: 0

Dewey Blair,%RND_SYB ,cretin ,strengthen .%RND_SY
Under ground C D !Check Your spouse and staff, Investigates anyone own cREDIT-HISTORY, Govenment don't want me to sell. hacking someone P C !Get a new passport! Disappear in your city very easy!
http://acadu.bettersites.info/amite/CD3/
insomniac ,hypothetic , ,pouch ,din ,formidable ,adolphus . kinky ,lack .
Note that a reverse DNS query on the IP address in the first Received: header of the above spam does not return the "web.de" domain and the "cinetic.de" nameserver. This indicates the email was not sent over the "web.de" webmail and is thus (with very high probability) forged:
Code:
dig -x188.8.131.52

; <<>> DiG 9.2.3rc4 <<>> -x184.108.40.206
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 54402
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;220.127.116.11.in-addr.arpa. IN PTR

;; Query time: 229 msec
;; SERVER: 18.104.22.168#53(22.214.171.124)
;; WHEN: Wed Aug 18 08:48:54 2004
;; MSG SIZE rcvd: 45
In contrast, look at the following email, which I sent from a "web.de" account I created. The IP address in its first Received: header is the one I used in the query that configured our database (as shown in the first Code section above).
Code:
Return-Path: <firstname.lastname@example.org>
Received: from fmmailgate01.web.de ([126.96.36.199]) by sparrow (EarthLink SMTP Server) with ESMTP id 1bXqku7yV3NZFjV0 for <email@example.com>; Wed, 18 Aug 2004 06:18:14 -0700 (PDT)
Received: by fmmailgate01.web.de (8.12.6/8.12.6/webde Linux 0.7) with SMTP id i7IDHq1d016358; Wed, 18 Aug 2004 15:18:12 +0200
Received: from 188.8.131.52 by freemailng2002.web.de with HTTP; Wed, 18 Aug 2004 15:18:07 +0200
Date: Wed, 18 Aug 2004 15:18:07 +0200
Message-Id: <firstname.lastname@example.org>
MIME-Version: 1.0
From: "Shelby Moore" <email@example.com>
To: firstname.lastname@example.org, email@example.com
Subject: firstname.lastname@example.org, email@example.com
Precedence: fm-user
Organization: http://freemail.web.de/
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-ELNK-AV: 0

Test
____________________________________________________
Aufnehmen, abschicken, nah sein - So einfach ist WEB.DE Video-Mail: http://freemail.web.de/?mc=021200
Enforcing reverse dns on free (webmail exclusive) email providers deletes forged spam that apparently http://BrightMail.com does not block.
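For readers who want to see the idea in code, here is a minimal sketch of the reverse DNS test, assuming a rule table that mirrors the single `dns` row inserted above. The names `DNS_RULES`, `ptr_hostname` and `is_probably_forged` are made up for illustration, and the nameserver comparison against `TldNSMatches` is left out to keep it short, so this is not the production check.

```python
import socket

# Assumption: an in-memory copy of the `dns` table, holding only the web.de row.
DNS_RULES = {
    "web.de": {"ptr_required": True, "ns_matches": "cinetic.de"},
}


def ptr_hostname(ip):
    """Reverse-resolve an IP (what `dig -x` does); None when there is no PTR record."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def is_probably_forged(from_domain, ptr_name):
    """True when the From: domain requires a matching PTR and the PTR name
    of the first Received: hop is missing or outside that domain."""
    domain = from_domain.lower()
    rule = DNS_RULES.get(domain)
    if rule is None or not rule["ptr_required"]:
        return False  # no anti-forgery rule configured for this domain
    if ptr_name is None:
        return True   # NXDOMAIN, as in the forged example above
    host = ptr_name.rstrip(".").lower()
    return not (host == domain or host.endswith("." + domain))
```

In use, something like `is_probably_forged("web.de", ptr_hostname(ip))` would be called with the IP taken from the first Received: header.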
I understand it is possible, though non-standard and complex, for some (1 in 10,000?) users of free email to configure an email client (see "Method 2: How to Set Up a New Account that Sends Messages by Using an SMTP Server") so that it does not send over the free email provider's network. But my opinion and assumption is that it simply isn't worth receiving all that forged spam from free email domains to insure against that rare chance (1 in a million overall for all email received?). Those rare cases are easily handled by adding those rare users to the Approved Senders list. My assumption rests on the nature of free email providers: they entice users who want webmail and an easy, free solution (not a complex one that requires paid password access to a non-open SMTP relay).
Last edited by accuspam; 08-18-04 at 09:10 AM.
At 11:55 AM 8/17/2004 -0400, you wrote:
Thank you very much for your free AccuSpam. I do not think I will need anything further as I am not a heavy user of email. I was just so annoyed at the spam mail and resulting pop-ups.
I just need to know if my existing blockers will interfere with your service?
As long as you are receiving the "Twice Daily Summary" emails from "AccuSpam Robot" then you are probably okay.
But if you lose a legit email, you will have to suspect your existing blockers. I noticed you are also using Spam Assassin (or am I mistaken?), which is known to delete legit email sometimes (severity depending on the Spam Assassin threshold set).
I would suggest turning off your existing blockers to see if AccuSpam can sufficiently block the spam. If not, then, assuming you are receiving the "Twice Daily Summary" emails from "AccuSpam Robot", turn your existing blockers back on. Repeat this test every couple of months, until you are satisfied that AccuSpam is sufficient without your existing blockers. Then leave the existing blockers off.
Added that free version of AccuSpam is not compatible if you use forwarding to the protected email address:
AccuSpam FAQ Requirements
This is because the reverse dns anti-forgery I added recently must have access to the original Received: headers of the email, which are normally deleted by most forwarding methods.
Most users do not use email forwarding.
Internal ISP forwarding that retains the Received: headers is compatible.
Failure to follow this Requirement can lead to lost email.
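To illustrate why the original Received: headers matter, here is a small sketch that pulls the connecting IP out of the first (topmost) Received: header using Python's standard `email` module. The function name and the bracketed-IP pattern are assumptions for illustration; real headers vary more than this handles.

```python
import re
from email import message_from_string

# Assumption: the connecting IP appears in square brackets, as in the
# "([a.b.c.d])" form seen in the headers quoted in earlier posts.
IP_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def first_received_ip(raw_message):
    """Return the bracketed IP in the topmost Received: header, or None."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None  # e.g. a forwarder stripped the headers
    match = IP_RE.search(received[0])
    return match.group(1) if match else None
```

If forwarding has removed or replaced these headers, there is nothing left to run the reverse DNS check against, which is the incompatibility described above.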
The rough snapshot estimate of current AccuSpam performance from Tony's account:
Approximately 1300 emails avg. per day processed by AccuSpam over the last month.
Only 40 emails per day in Twice Daily Summary with no probability to be spam, thus (1300 - 80) / 1300 = 94% spam deletion if Bayesian level false positive risk accepted.
Only 66 emails per day in Twice Daily Summary with greater than 99 in 100 probability to be spam, thus (1300 - 122) / 1300 = 91% spam deletion with medium false positive risk.
Only 103 emails per day in Twice Daily Summary in total, thus (1300 - 206) / 1300 = 84% spam deletion with 0% (< 1 in million) false positive risk.
This shows about 1% improvement from where we were last week. 10 other AccuSpam users are now correlated with Tony, compared with 8 last week. We only have about 100 AccuSpam users. We really need about 1000 for the spam deletion rate in the Daily Summaries to hit Bayesian level without the Bayesian level risk of false positive.
We are in the process of implementing anti-spam "honeypots" (aka "spam probes") to reduce the length of Daily Summaries without having to wait for more AccuSpam users.
This should be completed within August hopefully.
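The deletion percentages quoted above for Tony's account all follow the same formula, (processed - shown) / processed. A quick sketch that reproduces them, taking the subtracted figures (80, 122 and 206) exactly as written in the post; the function name is just for illustration:

```python
def deletion_rate(processed_per_day, shown_per_day):
    """Fraction of processed email that never reaches the summaries."""
    return (processed_per_day - shown_per_day) / processed_per_day


# Subtracted values as quoted in the post, against ~1300 emails per day.
for label, shown in [("Bayesian-level risk", 80),
                     ("medium risk", 122),
                     ("less than 1 in a million risk", 206)]:
    print(f"{label}: {deletion_rate(1300, shown):.0%}")
```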
SenderKeys Anti-Forgery proposal drastically improved and now has discussion list:
It can now optionally be implemented entirely at the MTA (mail server) level without requiring MUA upgrades!
It now depends on (any one of) the 3 major anti-forgery proposals, so it will be seen as less of a threat to them and more complementary.
My Blocked senders list keeps getting bigger and bigger. I think it's time to give AccuSpam a try.
Example input we are receiving from satisfied AccuSpam users.
To: "The Old Map Company" <postmaster@oldmapxxxxxx>
Subject: Re: AccuSpam Comments
Hope you do not mind if we post your comments to our Forum so others can be aware of the benefits you initially got with AccuSpam.
Actually you will find that AccuSpam will improve over time (we are still improving the algorithms), and eventually you will only get a Daily Summary if you have legit e-mail from previously unknown sender.
The Cool Page button was linked, but we added a link in the text based on your feedback.
Yes you can be sure with AccuSpam that you will never receive any spam that you did not specifically request, except as per the caveats in our FAQ if you are using the free version. If ever you need an absolute 100% insurance, you can upgrade to our paid version when it is available.
At 09:55 AM 9/7/2004 +0100, Steve wrote:
>Trialing for a few days now and this all looks very promising. There was
>always the odd spam mail that demanded a quick look and the Newsletters
>(with the ad links that also demanded a look) one should have un-subscribed
>from, but never got around to hitting the button. Also I can now let my
>family have access to my mail box in the knowledge they will not be exposed
>to anything unpleasant. AccuSpam must be saving me an hour a day! That's
>more than two weeks a year, or placing a conservative value on my time as
>£10 per hour - £3,650 (US$6,500!) Congratulations.
>PS You have a link missing on your site - To quickly design cool, creative
>web sites, we recommend ?
>Trust it's Coolpage!
This seems like the private/public key model that PGP and GPG use already for encryption, tied in with email.
shikaza is her :irate:
A superior underlying algorithm for AccuSpam will be released probably today. Nothing will need to change in the user interface of AccuSpam at this time.
The new algorithm correlates (among all users in a safe manner) on highly recursive content fragments instead of domain of sender, making it less susceptible to error from excessive email forgery of a domain, and more accurate against ISPs (domains) which send both spam and non-spam.
This algorithm also effectively increases the statistical reach of AccuSpam's user count, because spam content fragments cross-correlate more often than domain of sender of spam.
Unlike the very popular Bayesian statistics for anti-spam (e.g. used in Spam Assassin, which is used by many ISPs), this algorithm continually re-trains itself. It will not generate a false positive (delete non-spam) or false negative (fail to block spam) when YOUR current non-spam or spam suddenly has a shift in content that (in terms of Bayesian statistics) resembles YOUR past spam or non-spam respectively. The risks of Bayesian were detailed further in past posts:
A reply we sent to a customer today:
Your promo on the home page says you just sign up and carry on as before. How can that be when an approved sender list must be compiled?
A single Daily Summary email is sent to you (automatically by our robot) with a COMPACT list of temporarily blocked emails (those that we were not sure if spam or not) and you can reply to that email back to our robot with an "A" in the [ ] box next to each sender you want to receive email from.
Once a sender is added to your Approved Senders list (by any method, i.e. directly, by replying to the Daily Summary, or by other methods in future such as login), you always get email from that sender immediately in your Inbox and the following does not apply to that sender any more.
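As a rough illustration of the reply-with-an-"A" mechanism described above, here is a sketch that scans a replied Daily Summary for ticked boxes. The line layout (`[A] sender@example.com`, optionally prefixed with `>` quote marks) is an assumption made for this example; the real Daily Summary format is not shown in this thread.

```python
import re

# Assumed layout: optional ">" quoting, a "[ ]" box, then the sender address.
BOX_RE = re.compile(r"^\s*>*\s*\[\s*([Aa]?)\s*\]\s+(\S+@\S+)")


def approved_from_reply(reply_text):
    """Return the senders whose box contains an "A" in a replied summary."""
    approved = []
    for line in reply_text.splitlines():
        match = BOX_RE.match(line)
        if match and match.group(1):   # box present and ticked with "A"
            approved.append(match.group(2).lower())
    return approved
```

Senders left with an empty box stay blocked, so nothing needs to be done for the spam entries.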
a) A friend of mine gives my address to another friend who emails me.
b) A person sees my address on a business card and emails me showing an interest in my product.
c) A person sees my address on a business card and emails me with info about his product, relevant to my industry.
All three above are unsolicited, all from addresses not seen before by the anti spam software, and yet are welcome.
They will all appear in your Daily Summary email. And as our usership grows, less and less spam appears in the Daily Summary. Sometime this year, all you will see in the Daily Summary are new senders. In that future scenario (where our underlying statistical spam detection is 99.99%), on the days you do not get new senders you will not get Daily Summary emails.
See our announcement of superior statistical algorithm yesterday:
Without intervention by the user, no antispam system could possibly know which of the above emails are welcome and which are not.
Not true. Our underlying statistical algorithm is able to know this; we just do not have enough users yet (we have 1184 users as of today) to detect 99.99% of spam statistically. Once we have 10,000+ users, there will be an option for paid users to turn off the Daily Summary and allow new senders directly into the Inbox. However, note that we need to know the Approved Senders in order to drive our statistical algorithm. In that future scenario, we will be able to auto-populate the Approved Senders by seeing that you have received email from a new sender more than once and have not chosen to block the sender. Without a Daily Summary, you will log in to AccuSpam.com to report any spam received in your Inbox. But we won't enable such an option until our underlying spam detection is 99.99%.
Right now AccuSpam is 100% because it blocks everything that is not from an Approved Sender and then compiles it into a Daily Summary. About 50% of incoming spam is detected (40+% by detecting nonexistent sender and 10% by statistics) and not included in the Daily Summary. The nonexistent sender and statistical algorithms are mathematically certain to never delete a non-spam more frequently than once in a million spams. In other words, the false positive accuracy is always 99.9999%. The 99.99% accuracy we are aiming for is to increase the statistical spam detection from the current 10% to 99.99%.
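The auto-populate rule mentioned above (a new sender seen more than once and not blocked by the user) could look something like this. It is only a sketch of the stated rule; the function name and data shapes are assumptions, and this option does not exist yet.

```python
from collections import Counter


def auto_approve(received_senders, blocked, approved):
    """Return new senders to add to Approved Senders: seen more than once,
    not blocked by the user, and not already approved."""
    counts = Counter(sender.lower() for sender in received_senders)
    return {
        sender
        for sender, seen in counts.items()
        if seen > 1 and sender not in blocked and sender not in approved
    }
```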
If this point is agreed with then how would Accuspam be different to, say, Mailwasher where mail received from an address not yet seen by Mailwasher must be viewed by the user before being manually blacklisted.
1. AccuSpam is currently deleting 50% of spam automatically before the Daily Summary with 99.9999% accuracy. 10% is being done by correlating spam and non-spam content among all users, and this detection will increase to 99.99% this year as our usership grows. The exact algorithm is secret and includes some "magic" (math) which will hopefully be patented soon. It is quite different from Bayesian algorithm that many anti-spam products use, with some distinct advantages.
(Note the previous statistical algorithm, which correlated sender domain, was not achieving the desired 99.9999% accuracy because many ISPs' domains are used to send spam as well as non-spam. This wasn't a big problem, because the statistical sender domain correlation was only affecting (detecting as spam) 10% or less of incoming email and still with a very high accuracy. However, we have fixed this with the announcement mentioned above. It would have become a bigger problem as our usership increases, and now we have a very accurate statistical algorithm to build on as usership grows).
2. 100% spam protection is guaranteed by the Daily Summary email, which is a much more COMPACT and SAFE way of reviewing suspect email not caught by the underlying algorithm than MailWasher, which downloads all the spam and viruses to your computer BEFORE you are shown them and choose whether to blacklist or receive them.
3. MailWasher is not correlating spam statistics with other users and has no underlying statistical way to detect spam automatically. Some products (maybe Mailwasher) will attempt to correlate only YOUR spam stats to detect spam (Bayesian), and the drawbacks of Bayesian are discussed in the link to the AccuSpam Forum I gave above.
4. MailWasher only protects email you download to your computer. AccuSpam protects your mailbox, no matter where or how you access it, e.g. using WebMail or from other computer.
5. You have to download and install MailWasher (and learn to use with each mail program you use) to every computer you want to protect. With AccuSpam, just signup your mailbox online in 1 minute and you are done.
6. There are sure to be technical compatibility issues for some computers and some mail programs when using a program such as MailWasher which runs on the computer you are using. AccuSpam runs on our server and communicates with your mailbox on your ISP's server via the standard POP3 protocol, so compatibility issues are very, very rare and any that exist are discovered when you attempt to sign up. If your ISP's POP3 mailbox server is not compatible, you won't be able to sign up for AccuSpam (we do numerous POP3 compatibility checks at signup). You won't have nasty problems later. And such incompatibility is very, very, very rare, because POP3 is a very, very, very universal standard for email mailbox delivery.
7. If your computer crashes or gets a virus, your anti-spam does not crash or get compromised with AccuSpam.
And a promotional email may be sent out in bulk to subscribers, with a few extra to relevant (as above) industry participants, so if software scrubs it just because a lot of others like it have also been sent out it will fail the user again.
Our new underlying statistical algorithm will not scrub a desirable bulk email (newsletter, etc), because some of our users will have the sender of the desirable bulk email on their Approved Senders list and this will tell our algorithm that the content of that bulk email is not spam.
Again that is why I said we need the Approved Senders list to feed our statistical correlation algorithm. And again I said we can eventually get rid of the Daily Summary, once we reach critical mass of usership. In the meantime, it works very well, which is why we have a growing usership.
We will post our reply to our Forum for the benefit of the public knowledge.
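To make the idea in this reply a little more concrete: content that reaches many users is treated as bulk, unless some user's Approved Senders list vouches for it. The sketch below is NOT the actual AccuSpam algorithm (which is secret); the word-window fragments, the `min_recipients` threshold and all names are assumptions chosen only to illustrate the vouching idea.

```python
from collections import defaultdict


def fragments(text, size=4):
    """Split a body into overlapping word windows (a stand-in for real content fragments)."""
    words = text.lower().split()
    if not words:
        return set()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def unvouched_bulk_fragments(messages, approved, min_recipients=3):
    """messages: iterable of (recipient, sender, body).
    approved: dict mapping recipient -> set of Approved Senders.
    Returns fragments seen by many recipients that no recipient's
    Approved Senders list vouches for."""
    seen_by = defaultdict(set)
    vouched = set()
    for recipient, sender, body in messages:
        is_vouched = sender in approved.get(recipient, set())
        for frag in fragments(body):
            seen_by[frag].add(recipient)
            if is_vouched:
                vouched.add(frag)   # one approving user protects the content for everyone
    return {f for f, users in seen_by.items()
            if len(users) >= min_recipients and f not in vouched}
```

In this toy version a newsletter sent to many users is left alone as soon as one of them has its sender approved, which is the behavior described for desirable bulk email.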
The following descriptions of AccuSpam's algorithms are not a license, nor a public grant of any rights. AccuSpam reserves all rights. A patent will be filed on these algorithms.
The Major Algorithm Update is working as expected. The amount of spam summarized in the Daily Summary is drastically reduced, because most spam is being recognized and deleted (safely, with a less than 1 in a million (0.0001%) risk of losing non-spam) by this new statistical algorithm, which we will call "Chunk".
The "Chunk" algorithm has many benefits as compared to the per-user Bayesian algorithm used by most all other anti-spam (e.g. Spam Assassin uses Bayesian and is used by many ISPs):
(1) Analyzes data from all users
(2) Automatically trains itself in real-time
(3) Only needs to be told what *some* non-spam is (does not need to have every incoming email trained on). We get the non-spam data from users' whitelists.
(4) Automatically recognizes new strains of spam in real-time (does not have to be trained on new spam), e.g. "Viagra" changed to "Ciali$". Can not be fooled by changes in spam (randomization) between spam runs.
(5) Automatically recognizes new strains of non-spam (new to one user, but not new to all users). In other words, it doesn't get confused if you contact an insurance company for a quote, but you have classified insurance emails as spam in the past.
(6) Detects much higher rate of spam, with a much lower rate of false positives. The false positive rate can be set in the probability calculations of the algorithm (e.g. 1 in million is 0.0001%), compared to 0.03% (1 in 3333) for Bayesian. So Bayesian will lose a legit email every 3333 emails received, whereas AccuSpam will never (1 in million) lose non-spam:
(See Page 9 of the PDF linked at top)
(7) 100% immune to users who misclassify non-spam as spam.
(8) Is more immune than Bayesian to users who misclassify spam as non-spam. Also we monitor users to discover spammers who sign up for AccuSpam to approve their own spam. Besides, getting spam past the statistical algorithm in AccuSpam is useless, because it goes into the Daily Summary, is still blocked, and the body of the email is never read by users.
(9) Uses an *EXACT* probability calculation. Bayesian counts statistical evidence, then uses an ad hoc (inexact/guess) calculation to sum/weight those evidence probabilities. Bayesian's ad hoc calculation makes many assumptions (e.g. that spam and non-spam never share the same corpus and are mutually independent), and when these assumptions fail, Bayesian makes mistakes (false positives). Fundamentally Bayesian samples only one user, and more importantly it samples spam over history yet spam changes in real-time, so it suffers aliasing (sampling below Nyquist of some event) errors. Some heuristics (guesses) have been developed to combat Bayesian's fundamental aliasing, but guesses+guesses is still not exact or 100% reliable. Bayesian works reasonably well for users whose non-spam and spam rarely change. Our "Chunk" works for all users, all the time.
(10) To understand more the strategic differences between the very popular Bayesian anti-spam algorithm and AccuSpam's new "Chunk" algorithm ("spam genetic correlator"), look at the following links to ideas our developer wrote years ago:
Semantics of spam is UBE
Spam that learns to not be statistically identified?
Bayesian does not measure the real-time bulkness of spam. The "Chunk" algorithm was born from a conceptual paradigm that starts with the semantics of spam. Spam is UBE (unsolicited bulk email), which means the vast majority of bulk email. So filter on what is statistically sent in bulk, then the receiver can white list those few sources of BE (bulk email) that is not unsolicited. The current "Chunk" algorithm improves this further by using the data from many receivers to determine which BE is definitely unsolicited and should be deleted without showing in Daily Summary. Thus legit BE is not deleted.
By filtering spam at the semantic level (BE), spam has to change semantically. But if spam stops being BE, then it isn't spam any more. Contrast this with the Bayesian algorithm. Bayesian filters spam by assuming the semantics of spam is the word distribution of the message. That is not what makes spam a problem. Most of us get a lot of non-spam which has annoying word distributions.
Bayesian encourages spam content to change such that either, a) its annoying words change often, e.g. "Viagra" changes to "Ciali$". That is why spam is becoming more annoying, as it isn't even readable most of the time, or b) spam has word distributions very similar to normal email, and we do now see some spams which are not quickly discernible as being spam, or c) spams have fewer words (and often more images instead), and we are seeing an increasing # of these.
So although "Chunk" is also correlating on spam content, the semantics it correlates are the UBE factor of an email. This is a big difference from Bayesian which correlates the word distributions of email histories.
(11) Paul Graham, the "creator" of Bayesian for anti-spam, asks a question which he does not adequately answer, "So why did we get such different numbers?" when comparing his horrible false positive rate (0.03% is 1 in 3333 non-spams lost) to Pantel and Lin's 1.16% (is 1 in 86 non-spams lost).
The answer is that Bayesian is only sampling the side effects of spam, and not sampling the semantics of spam, which is that spam is BE. Bayesian can not measure the bulkness in real-time.
The differing error rates are a well known phenomenon in science called "aliasing". It occurs when the sample rate (frequency) is too low, specifically when it is lower than 2x the frequency of the signal (Nyquist).
So different people will get totally different error rates with Bayesian, and it will be very unpredictable. As stated, if a recipient's non-spam and spam do not change much over time, then performance of Bayesian is adequate (because then it is approximating a bulk measurement). But in reality, non-spam and spam change quite frequently! For some people the change rate may be 1 in 3333 and for others it may be 1 in 86 or even worse! Our "Chunk" algorithm does not have this variable aliasing error rate problem!
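The "1 in N" figures used throughout this post are simply the reciprocal of the percentage rates. A tiny sketch to check the conversions quoted above (the function name is only for illustration):

```python
def one_in_n(rate_percent):
    """Convert a false positive rate in percent to "1 in N" form."""
    return round(100 / rate_percent)


for name, rate in [("Graham's Bayesian result", 0.03),
                   ("Pantel and Lin", 1.16),
                   ("Chunk target", 0.0001)]:
    print(f"{name}: {rate}% is about 1 in {one_in_n(rate)}")
```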
Last edited by accuspam; 01-15-05 at 05:25 AM.
Subject: Semantics of Bayesian leads to varied false positive rates
Hi Paul Graham (aka "creator" of Bayesian),
You might remember me as perhaps first to warn you about possible bad effects of Bayesian in Dec. 2002:
In item #11 at the following web page, I have answered the more important question you posed but did not adequately answer in your, http://www.paulgraham.com/better.html, web page:
Subject: Spammers can subvert your Filters Fight Back idea
Hi Paul Graham (aka "creator" of Bayesian),
Spammers can subvert your Filters Fight Back idea:
They simply customize the link with a randomly chosen password (of sufficient length) for each recipient and ignore invalid links. They can then detect which recipients are hammering them and ignore their http requests (but not necessarily stop sending them spam).
And one of the links in the email could be an unsubscribe link that causes http requests to be ignored, but does not unsubscribe from the mailing list.
I can see this incorporated into popular spamware programs that spammers use to create and send spam.
Not to mention that protecting against the bad side effects of this idea would be extremely complex in the real world.
Also, correlating the content of the linked web sites as you suggest is unnecessary overhead (not needed in our "Chunk" algorithm).
Subject: You do not account the training cost of Bayesian
Hi Paul Graham (aka "creator" of Bayesian),
In your web page below, "So Far So Good", you fail to mention that if spammers conceal their bad tokens by changing them frequently, then these spams do get past the Bayesian filter and involve a continual training cost for the recipient. And we observe that spammers are doing that all the time now. Eventually spammers could learn how to randomize their bad tokens (including urls, email addresses, and headers) sufficiently to avoid Bayesian all the time. But IMO they won't have to, because too many recipients are not satisfied with the false positive rates of Bayesian and effectively turn off the filter often (even if just by browsing their spam folder in the normal course of using their Bayesian anti-spam).
Conversely, changes in good tokens have to be trained as well, especially when they change to overlap pre-existing bad tokens.
These are unavoidable consequences of incorrect semantics of Bayesian for spam filtering.