Mbox
Revision as of 15:05, 4 February 2013
mbox is the format typically used in Unix-like systems for storing collections of e-mail messages, dating back to the early days of Unix and UUCP network connections. It has several minor variants, but they all consist of a series of messages in Internet e-mail message format (RFC 822 and its successors) appended together, including headers and body.
Some problems and incompatibilities of this format stem from a design decision that looks rather shortsighted in hindsight: the boundary between messages is marked by a "From" line inserted by the mailer program, and messages are split wherever the characters "From " (with a trailing space) appear at the beginning of a line. This separator is distinct from the "From:" line in the headers of a message, which has a colon after the keyword. In the separator line, the keyword is followed by the originating mailbox name (originally a UUCP "bang path", a series of nodes separated by exclamation points showing the route the message took from its originating node to its current location), then the date and possibly other information.
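The splitting rule can be illustrated with a minimal Python sketch. The function name `split_mbox` and the sample addresses are made up for this example, and the code ignores escaping and every variant-specific detail; it only shows that a line beginning with "From " starts a new message while a "From:" header does not.

```python
def split_mbox(text):
    """Split mbox text into messages at lines that start with 'From '.

    Hypothetical helper for illustration: the separator lines themselves
    are dropped and no unescaping is attempted.
    """
    messages = []
    current = None
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # Separator line: close the previous message, start a new one.
            if current is not None:
                messages.append("".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append("".join(current))
    return messages


sample = (
    "From alice@example.com Mon Feb  4 15:05:00 2013\n"
    "From: alice@example.com\n"
    "Subject: one\n"
    "\n"
    "Hello\n"
    "\n"
    "From bob@example.com Mon Feb  4 15:06:00 2013\n"
    "From: bob@example.com\n"
    "Subject: two\n"
    "\n"
    "World\n"
)

messages = split_mbox(sample)
print(len(messages))
```

A body line that happened to begin with "From " would be treated as a separator by this code, which is exactly the problem the escaping schemes try to avoid.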
Because "From" is a common English word which often appears in the body of messages, sometimes at the beginning of lines, "escaping" has to be done by the mailer programs to alter such lines so they are not mistaken for message breakpoints. This is usually done by prefixing the line with a greater-than sign (>).
This is where the variant formats come in. The original escaping system merely added the prefix character when "From " appeared at the start of a line, so if it was already escaped as ">From ", no further escaping was done. This made the escaping non-reversible, since there was no way to distinguish the case where a "From " was escaped (and the character should be stripped on reading or export) from cases where a greater-than sign was already present (e.g., when the "From " is part of a quoted message in a reply, with all lines prefixed with angle brackets) and should be left alone.
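The loss of information can be demonstrated with a short Python sketch. The helper names `mboxo_escape` and `mboxo_unescape` are invented for this example and are not taken from any particular mailer; the unescape step assumes a reader that strips one ">" from every ">From " line.

```python
def mboxo_escape(body):
    """Prefix '>' only to lines that start with a bare 'From '."""
    out = []
    for line in body.split("\n"):
        if line.startswith("From "):
            line = ">" + line
        out.append(line)
    return "\n".join(out)


def mboxo_unescape(body):
    """Strip one '>' from every line that starts with '>From '."""
    out = []
    for line in body.split("\n"):
        if line.startswith(">From "):
            line = line[1:]
        out.append(line)
    return "\n".join(out)


# Line 1 needs escaping; line 2 is a quoted line that already has a '>'.
original = "From here on, see the quote:\n>From the earlier mail"
stored = mboxo_escape(original)
restored = mboxo_unescape(stored)

print(stored)
print(restored == original)
```

After escaping, both lines begin with ">From ", so the reader cannot tell which ">" was added by the mailer and which was part of the message. Stripping damages the quoted line; not stripping leaves the escaped line altered.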
Some "improved" mbox formats solve this by always adding a ">" sign to any line in which "From " appears either at the very start or after one or more ">" signs at the start. Then, on reading, exporting, or otherwise handling the messages, one ">" sign is stripped from the beginning of any line that contains "From " after one or more ">" signs. Thus, the ">" sign count increases by one on encoding and decreases by one on decoding, and everything works in a perfectly reversible way if all software cooperates. Trouble arises when software encodes too many times without decoding (which results in a ">"-sign pileup) or decodes too many times (which strips more ">" signs than necessary and can leave a bare "From " at the start of a line in a message that didn't have one to begin with).
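A minimal Python sketch of this reversible scheme, using regular expressions, might look like the following. The function names are hypothetical, and the code operates on a message body held in a string rather than on a real mailbox file.

```python
import re

# A line that starts with zero or more '>' followed by 'From '.
_ESCAPE = re.compile(r"^(>*From )", re.MULTILINE)
# A line that starts with at least one '>' followed by 'From '.
_UNESCAPE = re.compile(r"^>(>*From )", re.MULTILINE)


def reversible_escape(body):
    """Add exactly one '>' to every line matching '>*From '."""
    return _ESCAPE.sub(r">\1", body)


def reversible_unescape(body):
    """Remove exactly one '>' from every line matching '>+From '."""
    return _UNESCAPE.sub(r"\1", body)


original = "From a\n>From b\n>>From c\nnot From d"
escaped = reversible_escape(original)
round_trip = reversible_unescape(escaped)

# Decoding text that was never encoded strips too many '>' signs.
over_decoded = reversible_unescape(original)

print(escaped)
print(round_trip == original)
print(over_decoded)
```

The round trip restores the original text, while the over-decoded result turns the quoted ">From b" into a bare "From b" that a splitter would mistake for a message boundary.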
Often, mail software doesn't strip any of the ">" signs, so those characters intrude into plain-text messages wherever the word "From" begins a line. Sometimes these artifacts have even made it into print in magazines, newspapers, and books that published articles which were e-mailed at some point (and not adequately proofread). They turn up on web pages too; a lyric page is one example.
The known variants (in terminology dating to the 1990s):
- mboxo: The "original" mbox format, which escapes "From " lines in a non-reversible manner.
- mboxrd: Version using "reversible" escaping where the ">" signs are piled on and are supposed to be stripped later; named after its inventor, Rahul Dhesi.
- mboxcl: From Unix System V; uses "Content-Length" headers in the individual messages to determine where to find the next message, so it doesn't need to scan for "From " lines. However, this format still escapes "From " lines in message bodies (in the non-reversible manner of mboxo) anyway, making the whole thing seem pretty pointless.
- mboxcl2: A variant of mboxcl which uses "Content-Length" headers to find messages, and doesn't do any "From " escaping. This avoids corrupting message bodies, but could get into trouble if a message has a missing or incorrect "Content-Length" header, or if the file is processed by a mailer or utility that expects to be able to split messages by "From " lines.
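The "Content-Length" approach used by the last two variants can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a reference reader: the names `read_by_content_length` and `make_message` are invented, line endings are assumed to be a bare "\n", every message is assumed to carry a correct "Content-Length" header, and no unescaping is done (as in mboxcl2).

```python
def read_by_content_length(data):
    """Yield (headers, body) byte strings, using Content-Length to
    locate the end of each body instead of scanning for 'From ' lines."""
    pos = 0
    while pos < len(data):
        if not data.startswith(b"From ", pos):
            raise ValueError("expected a 'From ' line at offset %d" % pos)
        header_start = data.index(b"\n", pos) + 1
        header_end = data.index(b"\n\n", header_start)
        headers = data[header_start:header_end + 1]
        length = None
        for line in headers.split(b"\n"):
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("message has no Content-Length header")
        body_start = header_end + 2
        yield headers, data[body_start:body_start + length]
        pos = body_start + length
        # Skip the blank line(s) that separate messages.
        while data.startswith(b"\n", pos):
            pos += 1


def make_message(sender, body):
    """Build one sample message (hypothetical helper for the demo)."""
    return (b"From " + sender + b" Mon Feb  4 15:05:00 2013\n"
            + b"Content-Length: " + str(len(body)).encode("ascii") + b"\n"
            + b"\n" + body + b"\n")


body1 = b"From the start, this line is left unescaped.\n"
body2 = b"Second message.\n"
data = (make_message(b"alice@example.com", body1)
        + make_message(b"bob@example.com", body2))

parsed = list(read_by_content_length(data))
print(len(parsed))
```

The first sample body begins with an unescaped "From " yet is read back intact, because the reader jumps over the body by length. The same file would be split into three pieces by a tool that scans for "From " lines, and a wrong length value would throw this reader off entirely.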
Links
- RFC 4155 (application for MIME type, defines format)
- Wikipedia article
- Qmail man page
- Python library to handle mailbox formats
- More discussion of mbox format versions
- Info on fixing corrupt mbox files of the form used by Mozilla Thunderbird

