$Id: markov.txt,v 1.00 2009/12/22 12:25:59 sbajic Exp $

To implement Markovian weighting, the following pieces must be configured:

1. The storage driver. Be sure to compile with Bill Yerazunis' CRM114
   Sparse Spectra driver (hash_drv). This is the only driver that is presently
   fast enough to handle the extra data generated by the tokenizer used.

   NOTE: If you plan on doing TEFT or TUM type training, you'll need a huge
         database. In dspam.conf, HashRecMax should be set to around 5000000
         with a HashExtentSize of around 1000000. If you run into performance
         issues, consider increasing these values or running csscompress
         after training.

   NOTE: Bill has told me that TOE yields the best results on real-world
         email; however, for initial training, TEFT or a TUNE approach might
         be best.

2. The tokenizer. Bill Yerazunis' CRM114 uses OSB/Markovian. You'll want to
   set the tokenizer to 'osb', or for old-school CRM114, 'sbph'.

3. The value computing algorithm. This should be set to 'markov', which uses
   Markovian weighting. Comment out graham.

4. The combination algorithm (Algorithm). This should be set to 'naive' to
   act like CRM114, or you may consider 'burton' or a combination of
   "graham burton", both of which gave me better results than naive.
   Comment out any existing algorithms.

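As a rough sketch, the four settings above might look like this in
dspam.conf. The text names only HashRecMax, HashExtentSize, and Algorithm;
the StorageDriver, Tokenizer, and PValue directive names are assumed from a
stock dspam.conf, and the driver path is a placeholder that depends on your
installation:

    # 1. Storage driver: hash_drv (placeholder path, adjust to your build)
    StorageDriver /usr/local/lib/dspam/libhash_drv.so
    # Sizing suggested in the NOTE above for TEFT or TUM training
    HashRecMax 5000000
    HashExtentSize 1000000

    # 2. Tokenizer: 'osb', or 'sbph' for old-school CRM114
    #    (directive name assumed)
    Tokenizer osb

    # 3. Value computing algorithm (directive name assumed)
    #PValue graham
    PValue markov

    # 4. Combination algorithm: 'naive' acts like CRM114
    #Algorithm graham burton
    Algorithm naive
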
This implements the "standard" CRM114-style Markovian discrimination, but
you can also mix and match different tokenizers and combination algorithms
if you want to experiment. It's quite possible you will get better results
from a different combination. The only thing that is certain is that the
value computing algorithm should always be 'markov'.

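For example, a mix-and-match variant might keep 'markov' and change only the
tokenizer and the combination algorithm. This is an illustrative sketch, not
a tested recommendation; the Tokenizer and PValue directive names are assumed
from a stock dspam.conf:

    # Old-school CRM114 tokenizer (directive name assumed)
    Tokenizer sbph
    # The value computing algorithm stays 'markov' (directive name assumed)
    PValue markov
    # Alternative combination algorithm
    Algorithm graham burton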