You are here:

ActiveXperts.com > ActiveEmail > Mime Encoding > RFC-2049

ActivEmail SMTP/POP3 Toolkit Add SMTP/POP3 capabilities to any Windows or .NET application

Quicklinks


RFC 2049 - Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples

1.  Introduction

   The first and second documents in this set define MIME header fields
   and the initial set of MIME media types.  The third document
   describes extensions to RFC822 formats to allow for character sets
   other than US-ASCII.  This document describes what portions  of MIME
   must be supported by a conformant MIME implementation. It also
   describes various pitfalls of contemporary messaging systems as well
   as the canonical encoding model MIME is based on.

2.  MIME Conformance

   The mechanisms described in these documents are open-ended.  It is
   definitely not expected that all implementations will support all
   available media types, nor that they will all share the same
   extensions.  In order to promote interoperability, however, it is
   useful to define the concept of "MIME-conformance" to define a
   certain level of implementation that allows the useful interworking
   of messages with content that differs from US-ASCII text.  In this
   section, we specify the requirements for such conformance.

   A mail user agent that is MIME-conformant MUST:

    (1)   Always generate a "MIME-Version: 1.0" header field in
          any message it creates.

    (2)   Recognize the Content-Transfer-Encoding header field
          and decode all received data encoded by either quoted-
          printable or base64 implementations.  The identity
          transformations 7bit, 8bit, and binary must also be
          recognized.

          Any non-7bit data that is sent without encoding must be
          properly labelled with a content-transfer-encoding of
          8bit or binary, as appropriate.  If the underlying
          transport does not support 8bit or binary (as SMTP
          [RFC-821] does not), the sender is required to both
          encode and label data using an appropriate Content-
          Transfer-Encoding such as quoted-printable or base64.

    (3)   Must treat any unrecognized Content-Transfer-Encoding
          as if it had a Content-Type of "application/octet-
          stream", regardless of whether or not the actual
          Content-Type is recognized.

    (4)   Recognize and interpret the Content-Type header field,
          and avoid showing users raw data with a Content-Type
          field other than text.  Implementations  must be able
          to send at least text/plain messages, with the
          character set specified with the charset parameter if
          it is not US-ASCII.

    (5)   Ignore any content type parameters whose names they do
          not recognize.

    (6)   Explicitly handle the following media type values, to
          at least the following extents:

          Text:

            -- Recognize and display "text" mail with the
            character set "US-ASCII."

            -- Recognize other character sets at least to the
            extent of being able to inform the user about what
            character set the message uses.

            -- Recognize the "ISO-8859-*" character sets to the
            extent of being able to display those characters that
            are common to ISO-8859-* and US-ASCII, namely all
            characters represented by octet values 1-127.

            -- For unrecognized subtypes in a known character
            set, show or offer to show the user the "raw" version
            of the data after conversion of the content from
            canonical form to local form.

            -- Treat material in an unknown character set as if
            it were "application/octet-stream".

          Image, audio, and video:

            -- At a minumum provide facilities to treat any
            unrecognized subtypes as if they were
            "application/octet-stream".

          Application:

            -- Offer the ability to remove either of the quoted-
            printable or base64 encodings defined in this
            document if they were used and put the resulting
            information in a user file.

          Multipart:

            -- Recognize the mixed subtype.  Display all relevant
            information on the message level and the body part
            header level and then display or offer to display
            each of the body parts individually.

            -- Recognize the "alternative" subtype, and avoid
            showing the user redundant parts of
            multipart/alternative mail.

            -- Recognize the "multipart/digest" subtype,
            specifically using "message/rfc822" rather than
            "text/plain" as the default media type for body parts
            inside "multipart/digest" entities.

            -- Treat any unrecognized subtypes as if they were
            "mixed".

          Message:

            -- Recognize and display at least the RFC822 message
            encapsulation (message/rfc822) in such a way as to
            preserve any recursive structure, that is, displaying
            or offering to display the encapsulated data in
            accordance with its media type.

            -- Treat any unrecognized subtypes as if they were
            "application/octet-stream".

    (7)   Upon encountering any unrecognized Content-Type field,
          an implementation must treat it as if it had a media
          type of "application/octet-stream" with no parameter
          sub-arguments.  How such data are handled is up to an
          implementation, but likely options for handling such
          unrecognized data include offering the user to write it
          into a file (decoded from its mail transport format) or
          offering the user to name a program to which the
          decoded data should be passed as input.

    (8)   Conformant user agents are required, if they provide
          non-standard support for non-MIME messages employing
          character sets other than US-ASCII, to do so on
          received messages only. Conforming user agents must not
          send non-MIME messages containing anything other than
          US-ASCII text.

          In particular, the use of non-US-ASCII text in mail
          messages without a MIME-Version field is strongly
          discouraged as it impedes interoperability when sending
          messages between regions with different localization
          conventions. Conforming user agents MUST include proper
          MIME labelling when sending anything other than plain
          text in the US-ASCII character set.

          In addition, non-MIME user agents should be upgraded if
          at all possible to include appropriate MIME header
          information in the messages they send even if nothing
          else in MIME is supported.  This upgrade will have
          little, if any, effect on non-MIME recipients and will
          aid MIME in correctly displaying such messages.  It
          also provides a smooth transition path to eventual
          adoption of other MIME capabilities.

    (9)   Conforming user agents must ensure that any string of
          non-white-space printable US-ASCII characters within a
          "*text" or "*ctext" that begins with "=?" and ends with

          "?=" be a valid encoded-word.  ("begins" means: At the
          start of the field-body or immediately following
          linear-white-space; "ends" means: At the end of the
          field-body or immediately preceding linear-white-
          space.) In addition, any "word" within a "phrase" that
          begins with "=?" and ends with "?=" must be a valid
          encoded-word.

    (10)  Conforming user agents must be able to distinguish
          encoded-words from "text", "ctext", or "word"s,
          according to the rules in section 4, anytime they
          appear in appropriate places in message headers.  It
          must support both the "B" and "Q" encodings for any
          character set which it supports.  The program must be
          able to display the unencoded text if the character set
          is "US-ASCII".  For the ISO-8859-* character sets, the
          mail reading program must at least be able to display
          the characters which are also in the US-ASCII set.

   A user agent that meets the above conditions is said to be MIME-
   conformant.  The meaning of this phrase is that it is assumed to be
   "safe" to send virtually any kind of properly-marked data to users of
   such mail systems, because such systems will at least be able to
   treat the data as undifferentiated binary, and will not simply splash
   it onto the screen of unsuspecting users.

   There is another sense in which it is always "safe" to send data in a
   format that is MIME-conformant, which is that such data will not
   break or be broken by any known systems that are conformant with RFC
   821 and RFC 822.  User agents that are MIME-conformant have the
   additional guarantee that the user will not be shown data that were
   never intended to be viewed as text.

3.  Guidelines for Sending Email Data

   Internet email is not a perfect, homogeneous system.  Mail may become
   corrupted at several stages in its travel to a final destination.
   Specifically, email sent throughout the Internet may travel across
   many networking technologies. Many networking and mail technologies
   do not support the full functionality possible in the SMTP transport
   environment.  Mail traversing these systems is likely to be modified
   in order that it can be transported.

   There exist many widely-deployed non-conformant MTAs in the Internet.
   These MTAs, speaking the SMTP protocol, alter messages on the fly to
   take advantage of the internal data structure of the hosts they are
   implemented on, or are just plain broken.

   The following guidelines may be useful to anyone devising a data
   format (media type) that is supposed to survive the widest range of
   networking technologies and known broken MTAs unscathed.  Note that
   anything encoded in the base64 encoding will satisfy these rules, but
   that some well-known mechanisms, notably the UNIX uuencode facility,
   will not.  Note also that anything encoded in the Quoted-Printable
   encoding will survive most gateways intact, but possibly not some
   gateways to systems that use the EBCDIC character set.

    (1)   Under some circumstances the encoding used for data may
          change as part of normal gateway or user agent
          operation.  In particular, conversion from base64 to
          quoted-printable and vice versa may be necessary.  This
          may result in the confusion of CRLF sequences with line
          breaks in text bodies.  As such, the persistence of
          CRLF as something other than a line break must not be
          relied on.

    (2)   Many systems may elect to represent and store text data
          using local newline conventions.  Local newline
          conventions may not match the RFC822 CRLF convention --
          systems are known that use plain CR, plain LF, CRLF, or
          counted records.  The result is that isolated CR and LF
          characters are not well tolerated in general; they may
          be lost or converted to delimiters on some systems, and
          hence must not be relied on.

    (3)   The transmission of NULs (US-ASCII value 0) is
          problematic in Internet mail.  (This is largely the
          result of NULs being used as a termination character by
          many of the standard runtime library routines in the C
          programming language.) The practice of using NULs as
          termination characters is so entrenched now that
          messages should not rely on them being preserved.

    (4)   TAB (HT) characters may be misinterpreted or may be
          automatically converted to variable numbers of spaces.
          This is unavoidable in some environments, notably those
          not based on the US-ASCII character set.  Such
          conversion is STRONGLY DISCOURAGED, but it may occur,
          and mail formats must not rely on the persistence of
          TAB (HT) characters.

    (5)   Lines longer than 76 characters may be wrapped or
          truncated in some environments.  Line wrapping or line
          truncation imposed by mail transports is STRONGLY
          DISCOURAGED, but unavoidable in some cases.
          Applications which require long lines must somehow

          differentiate between soft and hard line breaks.  (A
          simple way to do this is to use the quoted-printable
          encoding.)

    (6)   Trailing "white space" characters (SPACE, TAB (HT)) on
          a line may be discarded by some transport agents, while
          other transport agents may pad lines with these
          characters so that all lines in a mail file are of
          equal length.  The persistence of trailing white space,
          therefore, must not be relied on.

    (7)   Many mail domains use variations on the US-ASCII
          character set, or use character sets such as EBCDIC
          which contain most but not all of the US-ASCII
          characters.  The correct translation of characters not
          in the "invariant" set cannot be depended on across
          character converting gateways.  For example, this
          situation is a problem when sending uuencoded
          information across BITNET, an EBCDIC system.  Similar
          problems can occur without crossing a gateway, since
          many Internet hosts use character sets other than US-
          ASCII internally.  The definition of Printable Strings
          in X.400 adds further restrictions in certain special
          cases.  In particular, the only characters that are
          known to be consistent across all gateways are the 73
          characters that correspond to the upper and lower case
          letters A-Z and a-z, the 10 digits 0-9, and the
          following eleven special characters:

            "'"  (US-ASCII decimal value 39)
            "("  (US-ASCII decimal value 40)
            ")"  (US-ASCII decimal value 41)
            "+"  (US-ASCII decimal value 43)
            ","  (US-ASCII decimal value 44)
            "-"  (US-ASCII decimal value 45)
            "."  (US-ASCII decimal value 46)
            "/"  (US-ASCII decimal value 47)
            ":"  (US-ASCII decimal value 58)
            "="  (US-ASCII decimal value 61)
            "?"  (US-ASCII decimal value 63)

          A maximally portable mail representation will confine
          itself to relatively short lines of text in which the
          only meaningful characters are taken from this set of
          73 characters.  The base64 encoding follows this rule.

    (8)   Some mail transport agents will corrupt data that
          includes certain literal strings.  In particular, a

          period (".") alone on a line is known to be corrupted
          by some (incorrect) SMTP implementations, and a line
          that starts with the five characters "From " (the fifth
          character is a SPACE) are commonly corrupted as well.
          A careful composition agent can prevent these
          corruptions by encoding the data (e.g., in the quoted-
          printable encoding using "=46rom " in place of "From "
          at the start of a line, and "=2E" in place of "." alone
          on a line).

   Please note that the above list is NOT a list of recommended
   practices for MTAs.  RFC 821 MTAs are prohibited from altering the
   character of white space or wrapping long lines.  These BAD and
   invalid practices are known to occur on established networks, and
   implementations should be robust in dealing with the bad effects they
   can cause.

4.  Canonical Encoding Model

   There was some confusion, in earlier versions of these documents,
   regarding the model for when email data was to be converted to
   canonical form and encoded, and in particular how this process would
   affect the treatment of CRLFs, given that the representation of
   newlines varies greatly from system to system.  For this reason, a
   canonical model for encoding is presented below.

   The process of composing a MIME entity can be modeled as being done
   in a number of steps.  Note that these steps are roughly similar to
   those steps used in PEM [RFC-1421] and are performed for each
   "innermost level" body:

    (1)   Creation of local form.

          The body to be transmitted is created in the system's
          native format.  The native character set is used and,
          where appropriate, local end of line conventions are
          used as well.  The body may be a UNIX-style text file,
          or a Sun raster image, or a VMS indexed file, or audio
          data in a system-dependent format stored only in
          memory, or anything else that corresponds to the local
          model for the representation of some form of
          information.  Fundamentally, the data is created in the
          "native" form that corresponds to the type specified by
          the media type.

    (2)   Conversion to canonical form.

          The entire body, including "out-of-band" information
          such as record lengths and possibly file attribute
          information, is converted to a universal canonical
          form.  The specific media type of the body as well as
          its associated attributes dictate the nature of the
          canonical form that is used.  Conversion to the proper
          canonical form may involve character set conversion,
          transformation of audio data, compression, or various
          other operations specific to the various media types.
          If character set conversion is involved, however, care
          must be taken to understand the semantics of the media
          type, which may have strong implications for any
          character set conversion, e.g. with regard to
          syntactically meaningful characters in a text subtype
          other than "plain".

          For example, in the case of text/plain data, the text
          must be converted to a supported character set and
          lines must be delimited with CRLF delimiters in
          accordance with RFC 822.  Note that the restriction on
          line lengths implied by RFC 822 is eliminated if the
          next step employs either quoted-printable or base64
          encoding.

    (3)   Apply transfer encoding.

          A Content-Transfer-Encoding appropriate for this body
          is applied.  Note that there is no fixed relationship
          between the media type and the transfer encoding.  In
          particular, it may be appropriate to base the choice of
          base64 or quoted-printable on character frequency
          counts which are specific to a given instance of a
          body.

    (4)   Insertion into entity.

          The encoded body is inserted into a MIME entity with
          appropriate headers. The entity is then inserted into
          the body of a higher-level entity (message or
          multipart) as needed.

   Conversion from entity form to local form is accomplished by
   reversing these steps. Note that reversal of these steps may produce
   differing results since there is no guarantee that the original and
   final local forms are the same.

   It is vital to note that these steps are only a model; they are
   specifically NOT a blueprint for how an actual system would be built.
   In particular, the model fails to account for two common designs:

    (1)   In many cases the conversion to a canonical form prior
          to encoding will be subsumed into the encoder itself,
          which understands local formats directly.  For example,
          the local newline convention for text bodies might be
          carried through to the encoder itself along with
          knowledge of what that format is.

    (2)   The output of the encoders may have to pass through one
          or more additional steps prior to being transmitted as
          a message.  As such, the output of the encoder may not
          be conformant with the formats specified by RFC 822.
          In particular, once again it may be appropriate for the
          converter's output to be expressed using local newline
          conventions rather than using the standard RFC 822 CRLF
          delimiters.

   Other implementation variations are conceivable as well.  The vital
   aspect of this discussion is that, in spite of any optimizations,
   collapsings of required steps, or insertion of additional processing,
   the resulting messages must be consistent with those produced by the
   model described here.  For example, a message with the following
   header fields:

     Content-type: text/foo; charset=bar
     Content-Transfer-Encoding: base64

   must be first represented in the text/foo form, then (if necessary)
   represented in the "bar" character set, and finally transformed via
   the base64 algorithm into a mail-safe form.

   NOTE: Some confusion has been caused by systems that represent
   messages in a format which uses local newline conventions which
   differ from the RFC822 CRLF convention.  It is important to note that
   these formats are not canonical RFC822/MIME.  These formats are
   instead *encodings* of RFC822, where CRLF sequences in the canonical
   representation of the message are encoded as the local newline
   convention.  Note that formats which encode CRLF sequences as, for
   example, LF are not capable of representing MIME messages containing
   binary data which contains LF octets not part of CRLF line separation
   sequences.

5.  Summary

   This document defines what is meant by MIME Conformance. It also
   details various problems known to exist in the Internet email system
   and how to use MIME to overcome them. Finally, it describes MIME's
   canonical encoding model.

6.  Security Considerations

   Security issues are discussed in the second document in this set, RFC
   2046.