A look at the code libraries
MailParserLib
What's there
fcl's MailParserLib implements the basic parsing of MIME messages.
The classes use a streaming, event-based design to minimize the amount of memory taken by different representations of the messages, and have built-in representations for the different MIME fields and their associated information. Other units also implement the required MIME transfer encoding and decoding methods.
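As a rough illustration of what a streaming, event-based parser looks like, here is a minimal C++ sketch. It is not MailParserLib's actual interface: the names MessageEvents, parseMessage, and Collector are hypothetical, and the sketch only splits headers from body lines (no multipart handling or transfer decoding). The point it shows is that the parser fires an event per header or body line and never holds the whole message in memory.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event sink; the real MailParserLib interfaces are not shown on this page.
struct MessageEvents {
    virtual ~MessageEvents() = default;
    virtual void onHeader(const std::string& name, const std::string& value) = 0;
    virtual void onBodyLine(const std::string& line) = 0;
};

// Reads one message from `in`, firing events as it goes. Only the header
// currently being assembled (including folded continuation lines) is buffered.
void parseMessage(std::istream& in, MessageEvents& events) {
    std::string line, pending;
    bool inHeaders = true;
    auto flush = [&] {
        if (pending.empty()) return;
        const auto colon = pending.find(':');
        if (colon != std::string::npos) {
            const auto start = pending.find_first_not_of(" \t", colon + 1);
            events.onHeader(pending.substr(0, colon),
                            start == std::string::npos ? std::string()
                                                       : pending.substr(start));
        }
        pending.clear();
    };
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (!inHeaders) { events.onBodyLine(line); continue; }
        if (line.empty()) { flush(); inHeaders = false; }             // end of headers
        else if (line[0] == ' ' || line[0] == '\t') pending += line;  // folded header
        else { flush(); pending = line; }                             // new header starts
    }
    flush();  // message ended while still in the headers
}

// Example handler: keeps the headers but only counts body lines.
struct Collector : MessageEvents {
    std::vector<std::pair<std::string, std::string>> headers;
    int bodyLines = 0;
    void onHeader(const std::string& name, const std::string& value) override {
        headers.emplace_back(name, value);
    }
    void onBodyLine(const std::string&) override { ++bodyLines; }
};
```

A caller would construct a Collector (or any other MessageEvents implementation), pass it to parseMessage together with an input stream, and inspect what was collected afterwards.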
Points to note
MailParserLib was originally written for Firetrust's Benign product, and the designers tried to make sure it interpreted mail in the same way as the major MUAs did, especially in the face of "dirty tricks" used by viruses and spammers, and the not-strictly-RFC messages put out by a lot of mail clients and systems.
Accordingly, once again any semantic changes (to the parsing in particular) would have to be well justified and even better tested. If the changes are substantial, a new release of the FirstAlert signature algorithm may be required, since the mails of interest would be decoded differently. We wouldn't expect changes of that size to be needed, though, as the basic parsing logic is quite stable and consistent with the de facto standard behaviour.
FCL doesn't contain a lot of tests for these units, and we'd like to see a lot more. Firetrust do however have a lot of Benign-specific tests (unfortunately quite wrapped around that product) which test out most aspects of this library's functionality, so the semantics of the library have been well tested in the past.
CharsetLib
What's there
fcl's CharsetLib implements the character set decoders needed by MailWasher Server's mail parsing and analysis systems.
It contains decoders for nearly all the Western and Asian encodings in common use, plus a number of less common character sets that are also supported by major MUAs.
Points to note
Given that most OSs already come with character set conversion libraries, developers may wonder why we wrote our own. Basically, it comes down to cross-platform portability and consistency. The character set decoding system converts the text of incoming messages into Unicode before the system analyses it. If we used different character set libraries on different platforms, each supporting a different set of character sets, we would get different message signatures for FirstAlert! (which would result in false negatives), and inconsistent handling of foreign-language emails.
We have chosen to emulate, as closely as possible, the semantics of Outlook Express as far as character set names and aliases are concerned, on the basis that it is the de facto target that spammers write for. In practice, this means we support all of the official, canonical names, a selection of the non-canonical but registered aliases, and a few unofficial but common labels. We also support its sometimes non-standard behaviour when text labelled as one character set actually turns out to use one of the "extensions" to that set, a particular problem with Asian character sets.
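The name and alias handling described above amounts to a case-insensitive lookup from a label to a canonical character set name. The following C++ sketch shows the idea only: the function name canonicalCharset and the handful of table entries are illustrative assumptions, not CharsetLib's actual API or alias list.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: maps a charset label from a message to a canonical
// name, or returns an empty string if the label is not recognised.
std::string canonicalCharset(std::string label) {
    // Charset labels are matched case-insensitively.
    std::transform(label.begin(), label.end(), label.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    // Illustrative entries only; a real table would be far larger.
    static const std::map<std::string, std::string> aliases = {
        {"iso-8859-1", "ISO-8859-1"},  // canonical name
        {"latin1",     "ISO-8859-1"},  // registered alias
        {"iso_8859-1", "ISO-8859-1"},  // registered alias
        {"us-ascii",   "US-ASCII"},    // canonical name
        {"ascii",      "US-ASCII"},    // unofficial but common label
        {"utf-8",      "UTF-8"},       // canonical name
        {"utf8",       "UTF-8"},       // unofficial but common label
    };
    const auto it = aliases.find(label);
    return it == aliases.end() ? std::string() : it->second;
}
```

Keeping the table inside the library, rather than deferring to the platform's converter, is what makes the same label resolve to the same decoder on every platform.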
HTMLParserLib
What's there
fcl's HTMLParserLib is a simple HTML parser class, plus a few utility classes that help with URI/URL manipulation and converting HTML to plaintext.
Points to note
The comments about the origin, design goals, semantics, and tests made above for MailParserLib all apply here also.
MailAnalysisLib
What's there
fcl's MailAnalysisLib builds on the previous three libraries and adds a variety of classes used for email inspection and filtering.
The MessageTextParser unit implements the basic message decoding logic. TextTokenizer, HTMLTextTokenizer, and HTMLTagLibrary contain the code to tokenize plain and HTML text. MessageTokenizer glues these units and the following trait detection units together, implementing everything required to convert an email into a stream of text tokens and code-detected traits.
Trait detection is implemented in MessageTokenizer and HTMLTraitDetector. These two units are where you should look to add a new trait. The trait IDs and names are exported by the TraitIDs and TraitNames units, but the actual list of traits is in TraitIDs.inc. See below for more on traits.
The OccuranceData class is a simple value object used to store the trained weightings table for Bayesian-style analysis. The AnalysisAlgorithm class receives the stream of tokens and traits from MessageTokenizer and calculates the probability of the message being junk using the probability data from the supplied OccuranceData object. AnalysisAlgorithm currently uses the 'geometric mean' technique to calculate this probability; our testing found that it performed as well as chi^2 (to the limit of statistical significance) and was considerably simpler, which is consistent with the results others have found.
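For readers unfamiliar with the 'geometric mean' technique, here is a minimal C++ sketch of one widely used formulation (Gary Robinson's). Whether AnalysisAlgorithm uses exactly this variant is an assumption; the function name combineGeometricMean is hypothetical and the per-token probabilities are taken as given rather than looked up in an OccuranceData object. Given per-token junk probabilities p_1..p_n, it computes P = 1 - ((1-p_1)...(1-p_n))^(1/n), Q = 1 - (p_1...p_n)^(1/n), and returns (1 + (P-Q)/(P+Q)) / 2.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of MailAnalysisLib: combines per-token junk
// probabilities (each strictly between 0 and 1) into one score in [0, 1],
// where values above 0.5 lean towards junk.
double combineGeometricMean(const std::vector<double>& tokenProbs) {
    if (tokenProbs.empty()) return 0.5;  // no evidence either way
    double logJunk = 0.0;  // sum of log(p_i)
    double logHam = 0.0;   // sum of log(1 - p_i)
    for (double p : tokenProbs) {
        assert(p > 0.0 && p < 1.0);  // 0 and 1 would make the logs blow up
        logJunk += std::log(p);
        logHam += std::log(1.0 - p);
    }
    const double n = static_cast<double>(tokenProbs.size());
    // Work in log space so long messages don't underflow the products.
    const double P = 1.0 - std::exp(logHam / n);   // "junkiness"
    const double Q = 1.0 - std::exp(logJunk / n);  // "hamminess"
    const double S = (P - Q) / (P + Q);            // in [-1, 1]
    return (1.0 + S) / 2.0;                        // rescaled to [0, 1]
}
```

Compared with a chi-squared combiner, this needs only logarithms and one exponentiation per side, which is the simplicity referred to above.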
Finally, this library contains a glue class, SynopsisExtractor, which is used to grab the first bit of textual content from a message when it is quarantined (the synopsis is shown in the quarantined message summary - it makes it a lot easier for users to confirm that a message is junk than just looking at the subject).
Points to note
"Traits" are characteristics often observed in junk mail captured in the wild that rarely or never occur in legitimate mail - usually because the characteristics are produced by the commercial junk mail applications used by many spammers, or written in by the spammers themselves in an attempt to evade other junk mail filters. MailWasher Server detects these traits and considers them in addition to the message's words when performing statistical analysis of the message.
Traits are a very useful tool because they combine the filtering power offered by manual rules in many other mail filtering systems together with the accuracy gained from having the traits trained on the actual mail received by the users - ensuring that, for example, the filter is quickly trained so that legitimate commercial mailings that also happen to express some traits are not undesirably filtered.
Traits are one of the most important areas the community can contribute to. We have a page with guidelines for writing new traits that gives some important suggestions.
BDBLib
What's there
MailWasher Server uses SleepyCat's Berkeley DB Concurrent Data Store engine. Berkeley DB is not a database server in the traditional sense: it doesn't understand the format of the data stored in the tables and doesn't offer a query interface; it simply efficiently stores, retrieves, and manipulates binary key and value data.
fcl's BDBLib is a set of C++ classes and templates for Berkeley DB.
BDBLib can be thought of as having two parts. As above, BDB just stores and retrieves binary data; applications are free to represent their data in whatever format they like. This means applications have to define their own record formatting or serialization mechanism. The basic BDBLib classes - Table, UniqueIndex, TableCursor etc. - just handle chunks of binary data (wrapped in a BOB object - like SQL's BLOBs, but not necessarily L).
The other part of BDBLib is the template system, which provides the mechanism used to convert actual C++ object instances to and from the storable binary data representation. The template versions of the table, index, and cursor classes are templated on the key and data types and automatically make the conversions to and from binary format for you, meaning that the data access methods actually take and return your native C++ value objects.
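The split between a binary layer and a typed template layer can be sketched in a few lines of C++. Everything here is an assumption made for illustration: RawTable, Codec, and TypedTable are hypothetical names rather than BDBLib's classes, and a std::map stands in for Berkeley DB so that the example is self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using Bytes = std::string;  // opaque binary data

// Stand-in for the binary layer: stores and retrieves bytes, nothing more.
class RawTable {
public:
    void put(const Bytes& key, const Bytes& value) { rows_[key] = value; }
    bool get(const Bytes& key, Bytes& out) const {
        const auto it = rows_.find(key);
        if (it == rows_.end()) return false;
        out = it->second;
        return true;
    }
private:
    std::map<Bytes, Bytes> rows_;
};

// Hypothetical serialization trait; specialise it once per value type.
template <typename T> struct Codec;

template <> struct Codec<std::string> {
    static Bytes encode(const std::string& s) { return s; }
    static std::string decode(const Bytes& b) { return b; }
};

template <> struct Codec<std::uint32_t> {
    static Bytes encode(std::uint32_t v) {  // big-endian, so byte order sorts numerically
        Bytes b(4, '\0');
        for (int i = 0; i < 4; ++i) b[i] = static_cast<char>((v >> (24 - 8 * i)) & 0xFF);
        return b;
    }
    static std::uint32_t decode(const Bytes& b) {
        assert(b.size() == 4);
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i) v = (v << 8) | static_cast<unsigned char>(b[i]);
        return v;
    }
};

// Typed layer: callers see K and V, never the binary representation.
template <typename K, typename V>
class TypedTable {
public:
    void put(const K& key, const V& value) {
        raw_.put(Codec<K>::encode(key), Codec<V>::encode(value));
    }
    bool get(const K& key, V& out) const {
        Bytes bytes;
        if (!raw_.get(Codec<K>::encode(key), bytes)) return false;
        out = Codec<V>::decode(bytes);
        return true;
    }
private:
    RawTable raw_;
};
```

With this in place, a declaration such as TypedTable<std::string, std::uint32_t> gives put and get methods that accept and return ordinary values, and passing the wrong key or value type fails at compile time.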
Points to note
Although Berkeley DB does come with its own C++ wrappers, the developers were unhappy with them as they aren't very "C++-like". They use a single class for all kinds of tables and indexes, and similarly, a single class for all kinds of cursors, meaning that it was never clear what operations could be performed on what tables. This is not always obvious with BDB: for example, the mechanics of secondary indexes mean that a lot of operations can't be performed on them, different cursor operations are available if the table has non-unique keys or is an index, and it's not well documented when you can and can't change the key or data the cursor is sitting on.
So instead, they created their own library, BDBLib. It still wraps around the C API, but makes it a lot easier to see what interfaces are available, and provides strong typing to prevent common usage errors. The templated system makes it really quick to develop with - once you've made your value classes and defined the table/index types, you can simply access the table's data directly, without having to manually write queries or DML statements.
MWServer common
What's there
Most of the units and classes in MWServer's common directory are the database value classes, enumerations, and types, the units that define the table and index structures, and the classes that group the tables and indexes together for convenience of instantiation.
There are also some units that are used by both daemons and/or the conduits, for example the SMTPConnections unit which is used by both the MPD and MWI to inject new emails, and the Milters unit which is used by conduits to connect to the MPD and check messages - and also by the MWI to check that the MPD is up and running healthily.
Finally, some utility units implement RNGs and password generation.
Points to note
Except for the few database classes or tables used only by one daemon (for example, the MPD's CFS content upload queue), all of the database structures and formats are defined here. We'll have a page about the database formats.
The use of Milter protocol connections for all conduits is new in MailWasher Server version 2. Previous versions used a simple SMTP-like protocol, and a mail conduit program/library/daemon was used for all MTAs, including Sendmail. Once the Milter protocol had been popularised and support added for other MTAs, and once the expanded version 2 feature set required additional communications between MPD and mail conduit, it was decided to replace the custom mail conduit protocol with the Milter protocol, bringing benefits of improved efficiency and existing support. However, in order to comply with Sendmail's licensing terms for their milter library (which restrict it to use with Sendmail specifically), an implementation was written that did not use, and was not based on, the Sendmail code. A small protocol extension is used by the MWI to query the MPD's status (it uses the same interface so it will see the same problems mail conduits do - if any); since it can only ever do this to the MPD specifically, not to other Milters, this shouldn't ever be a problem.
The others
That's all the libraries. The daemons and conduits are discussed in the architecture guide. The remaining programs are debugging and testing utilities, including SMTP and MPD load tests and functional tests for the MPD.
