source: npl/mailserver/dspam/dspam-3.10.2/doc/markov.txt @ d36701a

Last change on this file since d36701a was c5c522c, checked in by Edwin Eefting <edwin@datux.nl>, 8 years ago
initial commit, transferred from cleaned syn3 svn tree

$Id: markov.txt,v 1.00 2009/12/22 12:25:59 sbajic Exp $

To implement Markovian weighting, the following pieces must be configured:

1. The storage driver. Be sure to compile with Bill Yerazunis' CRM114
Sparse Spectra driver (hash_drv). It is presently the only driver fast
enough to handle the extra data generated by the tokenizer.

NOTE: If you plan on TEFT (train everything) or TUM (train until mature)
training, you'll need a huge database. In dspam.conf, set HashRecMax to
around 5000000 and HashExtentSize to around 1000000. If you run into
performance issues, consider increasing these values or running
csscompress after training.

NOTE: Bill has told me that TOE (train on error) yields the best results
on real-world email; for initial training, however, TEFT or a TUNE
(train until no errors) approach might be best.

2. The tokenizer. Bill Yerazunis' CRM114 uses OSB/Markovian. Set the
tokenizer to 'osb' or, for old-school CRM114, 'sbph'.

3. The value computing algorithm. Set this to 'markov', which uses
Markovian weighting. Comment out 'graham'.

4. The combination algorithm (Algorithm). Set this to 'naive' to act like
CRM114, or consider 'burton' or the combination 'graham burton', both of
which gave me better results than 'naive'. Comment out any other existing
algorithms.

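As a rough sketch, the four steps above might translate into a dspam.conf
fragment like the one below. HashRecMax, HashExtentSize and Algorithm are
named in this document; StorageDriver, Tokenizer and PValue are the
directive names as I recall them from the stock dspam.conf, and the
library path is purely hypothetical, so check all of them against your
own installation.

```conf
# Step 1: storage driver. The path is an assumption; it depends on how
# and where DSPAM was built and installed.
StorageDriver /usr/local/lib/dspam/libhash_drv.so

# Sizing suggested above for TEFT or TUM training.
HashRecMax      5000000
HashExtentSize  1000000

# Step 2: tokenizer ('osb', or 'sbph' for old-school CRM114).
Tokenizer osb

# Step 3: value computing algorithm. Comment out the existing setting.
#PValue graham
PValue markov

# Step 4: combination algorithm. Comment out the existing setting.
#Algorithm graham burton
Algorithm naive
```

Only one PValue line and one Algorithm line should remain uncommented.
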
This implements the "standard" CRM114-style Markovian discrimination, but
you can also mix and match tokenizers and combination algorithms if you
want to experiment. A different combination may well give better results.
The only certainty is that the value computing algorithm should always be
'markov'.

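As one illustration of mixing and matching, a variant that keeps 'markov'
but swaps the tokenizer and the combination algorithm might look like
this. Tokenizer and PValue are assumed here to be the dspam.conf directive
names for the tokenizer and the value computing algorithm; verify them
against your own dspam.conf.

```conf
# Alternative combination: old-school CRM114 tokenizer with the
# 'graham burton' combination reported above to beat 'naive'.
Tokenizer sbph
Algorithm graham burton

# The value computing algorithm stays fixed.
PValue markov
```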
