$Id: markov.txt,v 1.00 2009/12/22 12:25:59 sbajic Exp $

To implement Markovian weighting, the following pieces must be configured:

1. The storage driver. Be sure to compile with Bill Yerazunis' CRM114
   Sparse Spectra driver (hash_drv). It is presently the only driver
   fast enough to handle the extra data generated by the tokenizer used.

   NOTE: If you plan on doing TEFT or TUM type training, you'll need a huge
         database. In dspam.conf, HashRecMax should be set to around 5000000
         with a HashExtentSize of around 1000000. If you run into performance
         issues, consider increasing these values or running csscompress
         after training.

   NOTE: Bill has told me that TOE yields the best results on real-world
         email; for initial training, however, TEFT or a TUNE approach
         might be best.

2. The tokenizer. Bill Yerazunis' CRM114 uses OSB/Markovian. You'll want to
   set the tokenizer to 'osb', or for old-school CRM114, 'sbph'.

3. The value computing algorithm (PValue). This should be set to 'markov',
   which uses Markovian weighting. Comment out 'graham'.

4. The combination algorithm (Algorithm). This should be set to 'naive' to
   act like CRM114, or you may consider 'burton' or a combination of
   "graham burton", both of which gave me better results than naive.
   Comment out any existing algorithms.

This implements the "standard" CRM114ish Markovian type discrimination, but
you can also mix and match different tokenizers and combination algorithms
if you want to experiment. A different combination may well give better
results. The only fixed requirement is that the value computing algorithm
is always 'markov'.

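As a rough sketch, the four settings above might appear in dspam.conf as
follows. The driver path is a placeholder that depends on where DSPAM was
installed, and the directive names (StorageDriver, Tokenizer, PValue,
Algorithm) are assumed to match those in the dspam.conf shipped with your
version; check them against your own file before copying anything.

```conf
# 1. Storage driver. The path below is hypothetical; adjust to your system.
StorageDriver /usr/local/lib/dspam/libhash_drv.so

# Sizing suggested above for TEFT or TUM type training.
HashRecMax      5000000
HashExtentSize  1000000

# 2. Tokenizer. Use sbph instead for old-school CRM114 behaviour.
Tokenizer osb
#Tokenizer sbph

# 3. Value computing algorithm. graham is commented out.
#PValue graham
PValue markov

# 4. Combination algorithm. Only one Algorithm line should be active.
Algorithm naive
#Algorithm burton
#Algorithm graham burton
```

To try the alternatives mentioned in step 4, comment out the 'naive' line
and enable one of the other Algorithm lines, leaving PValue set to 'markov'.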