Network Working Group                  E. Levinson
Internet Draft: MIME/SGML              Accurate Information Systems, Inc
                                       October 19, 1993
                                       Expire: April 1993

MIME Content-types for SGML Documents

This draft document is being circulated for comment. Please send your comments to the author or to the ietf-822 mailing list. If consensus is reached this document may be submitted to the RFC editor as a Proposed Standard protocol specification for use with MIME.

Status of this Memo

This document is an Internet Draft; Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months. They may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress".

Please check the abstract listing in each Internet Draft directory for the current status of this or any other Internet Draft.

Abstract

This document specifies how a specific compound object, a complete SGML document, is to be carried within a MIME message. MIME provides a flexible mechanism for structuring RFC 822 message bodies. Using that mechanism for compound documents requires additional agreements on how the compound document is represented and labelled within the message body.

In addition, this document specifies the requirements for using MIME to carry SGML documents within a data stream in conformance with the SGML Document Interchange Format (SDIF). That format provides a mechanism for transferring one or more SGML documents.

Subtypes are proposed for the Multipart and Application content types to support SGML documents and SDIF within MIME.

Compound documents, including SGML documents, consist of a number of files, some of which may contain references to other files.
Explicit indications of the bindings between the sender's file names and the MIME body parts are needed to re-bind the sender's file names to ones on the recipient's system. A content reference header field makes the bindings explicit.

1 Introduction

Many MIME [RFC-MIME] based mail user agents can be readily configured to display (and compose) standard message body content types. These user agents invoke applications that correspond to the particular content type. Standard content types exist for data that consists of a single body part, and there are mechanisms to convey multiple body parts. However, there is no standard mechanism for objects like compound documents that contain multiple, inter-related body parts.

Compound documents are represented in various mark-up languages, e.g. troff, text/enriched. This document provides a mechanism for embedding, inside an Internet mail [RFC-822, RFC-MIME] message, complete documents of one such markup language, the Standard Generalized Markup Language (SGML) [ISO-SGML].

1.1 SGML

SGML is used in several communities to encode document structure and layout. A rigorous description of SGML is left to [ISO-SGML]; Appendix A contains an unbelievably brief description of the SGML elements relevant to this document. This document attempts to be consistent with SGML terminology and usage.

A complete SGML document consists of an SGML declaration, a document type definition (DTD), and a document instance. The document instance may, recursively, contain subdocuments, each of which consists of a DTD and an instance. The applications that process SGML documents may require the document parts to be individual files or to be combined in a single file.

For a person or application (a recipient) to receive and display a complete SGML document, a precise definition for each of the SGML document parts must be carried within the mail message.
In the sender's environment these parts may be references to standard parts or specific files in the sender's file system. Further, a DTD may reference other files, for example, images and graphics. The identity of the document parts as well as the name and contents of each file must be transferred. Sufficient information must accompany the data for the recipient to transform the sender's file name into an equivalent local reference.

1.2 SDIF

The SGML Document Interchange Format, SDIF, [ISO-SDIF] specifies the structure for a data stream which contains one or more SGML documents. SDIF is focused on transferring documents between sites and does not include a requirement that the documents be displayed as they are encountered. Users of mail based systems, however, expect to have each mail item in a multipart message displayed -- more precisely, ready for display -- when encountered. This document shows how to meet both the SDIF and display requirements.

1.3 Organization of this Memorandum

First a body part content type for a simple SGML document is defined. The discussion of that content type explains the SGML specific parameters and explores a number of the issues that arise when transferring an SGML document from one system to another. More complex documents are handled via a Multipart subtype. Discussion of that subtype explores additional document transfer issues. The discussion concludes by presenting the content types required to create an SDIF conformant data stream.

2 Processing Model for MIME/SGML

Four issues must be addressed for the recipient's user agent to display a complete SGML document: the various parts must be specified, and file and command references on the sender's system must be resolved to references on the receiver's system. Finally, an appropriate application, an unpacker, must be in control to unpack the MIME body parts containing the document and present them to the display software.
The controlling application is discussed first and then the document parts, file references, and command references.

2.1 Invoking the SGML Parser Application

MIME offers the possibility to add SGML capability to existing mail user agents. To accomplish this with existing SGML viewers and composers, a process must be interposed between the SGML application and the user agent to translate from MIME format to a form acceptable to the local SGML application. This document uses the notation in [ISO-SDIF], where the process creating the data stream (here, the MIME message) is called the packer and the corresponding one for the receiver, the unpacker.

Normally one expects a MIME capable user agent to display each body part in turn, usually in a depth first manner. For a compound document the display must be deferred until all the body parts are available to the application and are structured according to the requirements of the SGML application. For example, some SGML viewers expect the DTD and instance to be combined in a single file; others expect them to be separate. "Available" means that the files corresponding to the various document parts have been instantiated on the receiver's system. Once they are instantiated, an SGML viewer can be invoked.

Similarly, an SGML composer will create the respective parts using its own file structure. For example, an SGML application may expect the DTD and document instance to be in a single file. The packer separates these parts and encapsulates them as separate MIME body parts.

2.2 Specifying the Document Parts

Different implementations of SGML parsers use different methods for storing the SGML declaration, DTD, and document instance. Consequently the unpacker may find these parts as separate body parts or as a single part and must store them as the local application requires.

There are several ways to specify each part of a complete SGML document.
The declaration part may be a default value and not included, a file which is included, or it may be part of the document instance. It could also be a file each correspondent already has.

An easy solution would be to require a standard form, perhaps a single file that is a concatenation of the declaration, DTD, and instance. That would often require transferring much more data than needed; often only the document instance is required.

The discussion so far assumes that there is only one file (or its equivalent). While there may be many files, for SGML document instances, as opposed to files of image (or other) data, the consideration here is only how to specify the document or subdocument SGML elements. The next section considers other data.

Rather than require a standard form, this document permits the SGML document parts to be specified as parameters. Thus a sender may choose to send the declaration, DTD, and instance as a single file or may choose to specify any of them as a parameter. If the SGML declaration or the document type definition is not specified as a parameter, it may be in the message body; if it is not there, the recipient is free to apply a local default. These parameters are provided for each document or subdocument instance.

     sgml-parm := *( ";" sgml-part "=" sgml-part-spec)
                  [ ";" "version" "=" iso-sgml-spec ]
                  [ ";" "created-with" "=" ref-or-tok ]
                  [ ";" "character-set" "=" charset ]

     sgml-part := "instance" / "declaration" / "dtd" / "fosi"
                  / extension-token

     sgml-part-spec := file-token / sgml-public / extension-token

     sgml-public :=

     iso-sgml-spec :=

Sgml-parts specify the various parts of a complete document. File-tokens are discussed in the next section. If a file-token is used, that file's contents will be contained in a body part labelled with a Content-Reference: field. Sgml-public values are identifiers defined in [ISO-SGML] which represent well known files or entities. The SGML parser is expected to resolve these references on its own.
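As a rough illustration of how a user agent might read these parameters, the following Python sketch splits a Content-Type value into its sgml-part parameters and the version, created-with, and character-set guidance parameters. It is only a sketch under stated assumptions: the function name sgml_params and the layout of its result are hypothetical and not part of this specification, the standard library's email package does the header parsing, and any parameter that is not recognised is simply collected as an extension token.

```python
from email.message import Message

# sgml-part names listed in the sgml-parm grammar.
SGML_PARTS = ("instance", "declaration", "dtd", "fosi")
# Guidance parameters named in the same grammar.
GUIDANCE = ("version", "created-with", "character-set")


def sgml_params(content_type):
    """Split a Content-Type value into SGML parts and guidance parameters.

    Hypothetical helper for illustration; it is not defined by this draft.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_params() returns (name, value) pairs.  The first pair is the
    # type/subtype itself, so it is skipped.  Names come back lowercased
    # and quoted values come back unquoted.
    pairs = msg.get_params(failobj=[])[1:]

    # The draft gives us-ascii as the default character set.
    result = {"parts": {}, "extensions": {}, "character-set": "us-ascii"}
    for name, value in pairs:
        if name in SGML_PARTS:
            result["parts"][name] = value
        elif name in GUIDANCE:
            result[name] = value
        elif name != "boundary":
            # Assumption: anything else is treated as an extension token.
            result["extensions"][name] = value
    return result


if __name__ == "__main__":
    print(sgml_params(
        'Application/SGML; dtd="//USA-DOD//DTD MIL-M-21742 911991//EN"'))
```

Keeping the sgml-part values apart from the guidance parameters lets a caller decide separately which body parts to look for and which local SGML processor to invoke.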
Although the SGML definition provides for associating location (local file system) information with public data, this document does not support it. It is possible to provide support for that capability in the unpacker.

The two parameters, version and created-with, are provided for guidance to user agents. Version specifies the particular SGML standard to which the document conforms. A user agent can use this value to invoke the application appropriate to that version of the standard. The created-with parameter provides guidance in cases where inter-operability with respect to SGML may be a problem. In environments where users maintain several SGML processors, this parameter can be used to invoke the appropriate implementation.

The character-set parameter specifies the body part character set. If not specified, the default is us-ascii.

2.3 Resolving File References

SGML permits the DTD to define document parts (entities) that a document instance can reference for inclusion or interpolation. The entities point to files that can contain SGML coded text, text not to be interpreted, images, or other data. Within SGML there are two types of file reference entities, SYSTEM and PUBLIC. PUBLIC entities specify SGML document parts that are known to and resolvable by SGML viewers and editors. SYSTEM identifiers refer to files in the local environment.

In order for the recipient's SGML application to properly process the document, the file references must be resolvable in the recipient's environment. Conceptually, one must replace each of the sender's file references with a corresponding reference in the recipient's file system.

There are two issues here. First, the sending user agent must parse the document and identify the sender's file references. Second, each internally referenced file will become a MIME body part, and the correspondence between the file name and the body part must be preserved.
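The first of these steps can be pictured with a small Python sketch that scans DTD text for entity declarations carrying a SYSTEM identifier and lists the file names it finds. This is an illustration only: the function name system_references is hypothetical, a regular expression is a stand-in for a real SGML parser, and the sketch assumes the reference concrete syntax (declarations opening with "<!ENTITY" and literals in single or double quotes). System identifiers that follow a PUBLIC identifier are not picked up.

```python
import re

# Assumption: reference concrete syntax, keywords matched case-insensitively.
# An optional "%" marks a parameter entity.
ENTITY_SYSTEM = re.compile(
    r"<!ENTITY\s+(?:%\s+)?(?P<name>[^\s>]+)\s+SYSTEM\s+"
    r"(?P<quote>[\"'])(?P<file>.*?)(?P=quote)",
    re.IGNORECASE | re.DOTALL,
)


def system_references(dtd_text):
    """Return (entity name, sender file name) pairs for SYSTEM entities.

    Hypothetical helper; a real packer would rely on an SGML parser.
    """
    return [(m.group("name"), m.group("file"))
            for m in ENTITY_SYSTEM.finditer(dtd_text)]


if __name__ == "__main__":
    sample = (
        '<!ENTITY fig1 SYSTEM "/usr/ed/radio/fig1.gif" NDATA gif>\n'
        '<!ENTITY std PUBLIC "-//X//TEXT Y//EN">\n'
    )
    for name, filename in system_references(sample):
        print(name, filename)
```

The list produced this way is what a packer would need in order to decide which of the sender's files must travel as body parts.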
This document applies the principle of "sender makes right" to these issues. First, it requires that the packer convert all file references into unique tokens containing only US-ASCII characters. Second, each of those files will be a body part in a multipart MIME message, and the corresponding body part header must contain a Content-Reference: field whose value contains the file's token. Thus the internal file name, now a token which can appear in an 822 header, appears explicitly both in the document and, via the Content-Reference: field, in its corresponding MIME body part. When the unpacker stores the body part in the recipient's file system it can convert the internal file references (tokens) into valid local references.

2.4 Processors for Non-SGML Data

Non-SGML data requires the SGML parser to invoke a processor to format the data. The correspondence between the file name and the application is contained in the type field of the SGML entity declaration and the SGML notation declaration for that type. The notation declaration contains the operating system command string to invoke or launch the processor. That is, the string in the notation declaration is an arbitrary command.

There are two problems with this situation: the command may only be valid in the sender's environment and, if it is valid in the recipient's, invoking that command is a security hazard. Therefore, this document requires that any type used in an SGML notation be a valid MIME content type (or an extension token) and that the unpacker substitute a local string for the string in the notation declaration.

3 The SGML Subtypes

A complete document may be a single instance in which all the other document parts are defined by existing standards or private agreements. It may also be a set of parts, several of which must be included in the MIME message. Two SGML subtypes are defined, one for each of the content types Application and Multipart. Both body part content types use the same parameters.
The multipart subtype is considered first because it is the general case. The application subtype is a simplification for the case where the multipart would contain a single part. It is also used to contain SGML subdocument entities, that is, text with mark-up.

3.1 The Multipart/SGML Subtype

An SGML document is carried in a MIME message as a Multipart body of subtype SGML (Content-Type: Multipart/SGML). The content-type parameters specify each of the parts of the SGML document. Additional parameters specify the software that created the document and the applicable SGML standard.

In a complex document some of the SGML document parts are references to standard parts and the others are file names. In the latter case each file name token must appear in exactly one Content-Reference: header in an enclosed body part. Inside the document itself, the file names must be replaced by their tokens.

Thus a complete SGML document can appear as the following MIME message.

     Content-Type: Multipart/SGML; instance=SSBradio;
          dtd=sgml-dtd-mtce-radio; boundary=tiger-lily

     --tiger-lily
     Content-type: Application/SGML
     Content-reference: SSBradio

     --tiger-lily
     Content-type: Image/gif
     Content-reference: sgml-radio-figure-1
     ...
     --tiger-lily
     Content-type: Application/SGML
     Content-reference: sgml-dtd-mtce-radio

     --tiger-lily--

3.2 The Application/SGML Subtype

When transferring a file containing text and mark-up within a Multipart/SGML message, or when a complete SGML document can be contained in a single message, the content type Application/SGML can be used.

     application-subtype := ("octet-stream" *stream) / "postscript"
                            / ("sgml" *sgml-parm) / extension-token

The following example shows the Content-Type field of a MIME message containing a document instance which specifies a DTD.

     Content-Type: application/SGML;
          dtd="//USA-DOD//DTD MIL-M-21742 911991//EN"

3.3 Character Set Considerations

It is expected that SGML documents will use the US-ASCII character set.
For documents not in the US-ASCII character set, the charset= parameter of the Content-Type: field specifies the actual character set. Note that the values of the charset parameter must be registered with IANA, or be a mutually agreed upon extension-token (e.g., charset=X-set).

Values contained in the MIME headers must be drawn from [US-ASCII] and conform to [RFC-822]. Where the sender's file names do not meet this requirement the conventions specified in [RFC-HDRC] may be used.

4 The Content-Reference Header Field

The Content-Reference: header field provides the linkage between file references within the SGML document and the MIME body parts. It contains the unique file name token which represents the sender's file name to which the body part corresponds. The process that handles the Multipart/SGML body part will use this value to convert internal file references into valid references in the receiver's file system. The syntax is:

     reference := "Content-Reference" ":" (token / quoted-string)

5 SDIF [ISO-SDIF] Data Streams

[This part needs work -- Ed]

SDIF is an interchange format standard for SGML documents [ISO-SDIF]. It defines a data stream that may contain several SGML documents. This section defines a Multipart subtype "SDIF" for an SDIF data stream that contains one or more Multipart/SGML documents. Messages that conform to the SDIF subtype will conform to [ISO-SDIF].

Briefly, an SDIF data stream is a sequence of SGML documents and their subdocument and external entities (cf. Appendix A). These external entities are defined in the DTD and are referred to via their SGML name in the document instance. The scope of an entity name is the document or subdocument in which it is defined. Thus names are not unique across documents and subdocuments. To provide unique names within the SDIF data stream, each entity is assigned a sequential number.
Each SGML document or subdocument structure in the SDIF stream lists the number of the first entity it contains.

An SDIF data stream is encoded within a MIME message as a Multipart/SDIF body part. It contains one to three body parts. The first and last body parts are optional. They are labelled with a content description field whose values are "related-documents-A" and "related-documents-B" respectively, and they are Multipart/Mixed. These multipart bodies contain only Multipart/SGML or Application/SGML body parts (written mime/sgml here, for convenience, where the particular content type does not matter). The second of the three Multipart/SDIF body parts is a mime/sgml body part.

The Multipart/SDIF content type has a character set parameter which specifies the character set used for SGML markup tokens throughout the data stream.

There are five SDIF entity types:

     subdocument
          These can contain references to external entities as well as marked up text.

     text
          An external entity containing only marked up text.

     data
          An external entity containing non-SGML data, images, for example.

     public-text
          Corresponds to a PUBLIC external reference and contains a NULL message body. [A reminder to readers without intimate knowledge of SGML: PUBLIC text can be located by SGML processors without further identification.]

     cross-reference
          Corresponds to a previously included external entity. This avoids duplicating material previously included. It contains a NULL message body. This document requires, in contrast to [ISO-SDIF], that the referenced body part have already appeared. That requirement enables the user agent to display the SGML documents as they are encountered.

The subdocument and text SDIF entities become Application/SGML body parts, and data entities are encapsulated as the appropriate MIME content type. The last two entity types have null message bodies and are handled as parameters, public and cross-reference, of an Application/SDIF content type.
The syntax is:

     application-subtype := / "sdif" sdif-param

     sdif-param := ";" "public" "=" / ";" "cross-reference" "="

     <-- the enclosing Multipart/SDIF body part is taken as the root (level 1) for numbering body parts -->

SDIF requires the entity name to accompany each entity in the data stream. When MIME is used to transfer SDIF data streams, the entity name will be the value of the content description field in each body part.

Since SDIF does not distinguish the parts of a document entity (declaration, DTD, and instance), when SGML documents are contained in a Multipart/SDIF message the document is sent as a single body part. The application can apply default values for unspecified declarations and DTDs.

Finally, SDIF uses sequential numbers to uniquely identify each entity (an entity-identifier in [ISO-SDIF]) and to locate the position of the first external entity (a first-identifier) of each document. These are not necessary when using the methods in this document but can be derived. Within a Multipart/SDIF message, number each body part sequentially, starting at 1 with the first Application/SGML body part. Note that the only Multipart body part that can be present in a Multipart/SDIF message is Multipart/Alternative. That will resolve into a single body part and shall be treated as though it were a non-multipart body part. The subdocument, text, and data entities may, in fact, be Message/External-body body parts. With the numbering described, the unpacker may, if needed, build a table to translate body parts into SDIF entity numbers.

6 Security

An SGML parser can be directed to invoke a local process, usually to format or display a graphical image. That capability presents an opportunity for abuse. Understanding the potential problems requires understanding two SGML constructs, entity and notation statements, presented below. Capitalized items are literals, lowercase ones are tokens, and the special characters are markup escape sequences.
The document text will refer to name which, in turn, will cause the application, type, represented by qstring to be invoked. Qstring could be the DOS command "delete *.*".

To eliminate potential problems it is recommended that the unpacker replace the NOTATION statements contained within the message with the appropriate statements for the recipient's environment. An implementation may use a local configuration file that identifies the acceptable types and may inform the user of types in the message that are not available in the local environment. Those could be replaced by a no-operation NOTATION statement. It is recommended that the list of acceptable types be drawn from the MIME set of types and subtypes.

SGML also provides for sending non-interpreted data to the display device or typesetter. The security hazard presented is similar to that posed by the use of PostScript. Greater threats may be posed by more "powerful" display systems and typesetters. Unauthorized access to the recipient's system and resources may be possible.

7 References

[ISO-SGML] ISO 8879:1988, Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML).

[ISO-SDIF] ISO 9069:1988, Information Processing -- SGML Support Facilities -- SGML Document Interchange Format (SDIF).

[RFC-822] Crocker, D., Standard for the Format of ARPA Internet Text Messages, August 1982, University of Delaware, RFC 822.

[RFC-HDRC] Moore, Keith, Representation of Non-Ascii Text in Internet Message Headers, June 1992, RFC 1522.

[RFC-MIME] Borenstein, N. and Freed, N., MIME (Multipurpose Internet Mail Extensions): Mechanisms for Specifying and Describing the Format of Internet Message Bodies, June 1992, RFC 1521.

[US-ASCII] Coded Character Set -- 7-Bit American Standard Code for Information Interchange, ANSI X3.4-1986.
8 Acknowledgements

The author acknowledges Andy Gelsey, Accurate Information Systems, Inc., Nathaniel Borenstein, Bellcore, Einar Stefferud, Network Management Associates, Inc, John Klensin, MIT, and Erik Naggum, for their suggestions, explanations, and encouragement. No errors or faults in this document can be ascribed to them; they all belong to me.

UNIX is a registered trademark of UNIX System Laboratories, Inc.

9 Author's Address

     Ed Levinson
     elevinson@accurate.com
     Accurate Information Systems, Inc.
     2 Industrial Way
     Eatontown, NJ 0772

Appendix A. SGML for IETFers

This appendix describes the elements of the Standard Generalized Markup Language (SGML) that are key to understanding the relationship between SGML and the Multipurpose Internet Mail Extensions (MIME). For the purposes of this discussion, and without doing too much damage to the SGML specification, an SGML document contains text, markup, and references to non-text document elements (e.g., graphics). For a complete and accurate description see ISO 8879, Information Processing -- Text and office systems -- Standard Generalized Markup Language (SGML).

An SGML document has the following structure (the parenthesized numbers refer to productions in ISO 8879) and is processed by an application called an SGML parser. Note that Internet style ABNF is used for notation here; [ISO-SGML] uses a different style.

     sgml-doc     ::= sgml-decl dtd doc-inst     (2)
     sgml-sub-doc ::= dtd doc-inst               (3)

Sgml-decl defines the various elements and parameters of SGML, for example, the characters that introduce and end markup tags ("<" and ">" respectively will be used here), the maximum length of markup tags, etc. Dtd is a document type definition (DTD), which defines the structure of the document. Most important for interchange considerations, the DTD contains references to external files, system commands, and text to be sent directly to a typesetter or printer.
Doc-inst is the actual document text; it includes graphic elements and other text, with or without markup, by reference to DTD elements.

The remainder of this discussion focuses on two elements which a DTD uses to reference other things, entities and notations. They appear in the DTD and have the following format.

     entity   ::= ""                                          (101)
     e-text   ::= q-string | data | b-text | external         (105)
     data     ::= ( "CDATA" | "SDATA" | "PI" ) q-string       (106)
     external ::= ext-id ( "SUBDOC" | ( "NDATA" type ))       (108)
     ext-id   ::= ( "SYSTEM" q-string)
                  | ( "PUBLIC" pub-id [q-string] )            (73)
     notation ::= ""                                          (148)

where name is a character string, and the definition of b-text is left to ISO 8879; for convenience q-string has been substituted for the SGML term parameter literal. Entities referred to via the SUBDOC keyword differ from SGML documents in that they cannot contain an sgml-decl.

Using the above productions, the following sample entities demonstrate the important issues. Name, xname, and type are alphanumeric tokens and q-string is a series of characters enclosed in double (or single) quote marks.

     (A) (B) (C) (D) (E)

Form A refers to a well known or "public" name that the SGML parser is able to resolve; in the marked up text there will be a markup item "&name" that directs the parser to include the corresponding public file. Similarly, form B corresponds to a locally known file. Form C allows the markup text to refer to non-SGML data, an image for example, and the type parameter must match the type of a NOTATION element. The matching element's command parameter specifies the command which processes the file fname. Finally form E, processing instructions, specifies a string of characters to be sent directly to the output device.

These examples give rise to the following issues when the document is transferred from one environment to another.

     A    Is the public name known to the recipient?
          The recipient SGML parser may not know of the public file, and this will be discovered when it processes the document.

     B    What is the file name on the recipient system? There must be some process which binds the sender's file names to the recipient's.

     C    See B and D.

     D    Direct use of the NOTATION form is a large security risk, an invitation to a Trojan Horse attack. The recipient must be protected from a sender invoking an arbitrary command on the recipient system.

     E    Processing instructions permit the sender to manipulate the recipient's output device. This is the same risk that exists for PostScript documents and is not addressed.

Appendix B. Content-Type registrations

_________________________________

B.1 The Application/SGML Content-Type

(1) MIME type name: Application

(2) MIME subtype name: SGML

(3) Required parameters: none

(4) Optional parameters: declaration, dtd, instance, fosi, charset

(5) Encoding considerations: may be encoded

(6) Security considerations: see RFC section 6

(7) Specification: This subtype is used for text marked up with the Standard Generalized Markup Language. Body parts of this subtype will contain a Content-Reference: field if the body part is referred to as a file by an SGML document or subdocument entity or if it is explicitly referred to in a Multipart/SGML parameter.

_________________________________

B.2 The Application/SDIF Content-Type

(1) MIME type name: Application

(2) MIME subtype name: SDIF

(3) Required parameters: one of public or cross-reference

(4) Optional parameters: none

(5) Encoding considerations: none

(6) Security considerations:

(7) Specification: This subtype contains a NULL or empty message body. The value of the public parameter is an SGML PUBLIC entity identifier. The value of cross-reference is the body part identifier of a previously occurring body part.

_________________________________

B.3 The Multipart/SGML Content-Type

(1) MIME type name: Multipart

(2) MIME subtype name: SGML

(3) Required parameters: boundary

(4) Optional parameters: declaration, dtd, fosi, instance

(5) Encoding considerations: none

(6) Security considerations: see RFC section 6

(7) Specification:

_________________________________

B.4 The Multipart/SDIF Content-Type

(1) MIME type name: Multipart

(2) MIME subtype name: SDIF

(3) Required parameters: boundary

(4) Optional parameters: charset

(5) Encoding considerations: none

(6) Security considerations: none
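The Multipart/SGML example of section 3.1 and the Content-Reference field of section 4 can be tied together in a short Python sketch of an unpacker's first pass. Everything in it is an assumption made for illustration: the function name unpack, the placeholder body text, and the choice to name local files by body part position rather than by token are not part of this specification. The sketch uses the standard library's email package to read the message, stores each referenced body part in a directory, and returns the part parameters of the Multipart/SGML header together with a table from Content-Reference tokens to local file names.

```python
import email
import os
import tempfile
from email.utils import unquote

# A message shaped like the example in section 3.1; bodies are placeholders.
RAW = """\
Content-Type: Multipart/SGML; instance=SSBradio; dtd=sgml-dtd-mtce-radio; boundary=tiger-lily

--tiger-lily
Content-Type: Application/SGML
Content-Reference: SSBradio

placeholder document instance
--tiger-lily
Content-Type: Application/SGML
Content-Reference: sgml-dtd-mtce-radio

placeholder DTD
--tiger-lily--
"""


def unpack(raw, directory):
    """Store referenced body parts and map their tokens to local files.

    Hypothetical helper for illustration; it is not defined by this draft.
    """
    msg = email.message_from_string(raw)
    if msg.get_content_type() != "multipart/sgml" or not msg.is_multipart():
        raise ValueError("not a Multipart/SGML message")

    # Which token plays which role (instance, dtd, ...) in the document.
    roles = {}
    for name in ("declaration", "dtd", "fosi", "instance"):
        value = msg.get_param(name)
        if value is not None:
            roles[name] = value

    table = {}
    for index, part in enumerate(msg.get_payload(), start=1):
        token = part["Content-Reference"]
        if token is None:
            continue  # this body part is not referenced as a file
        token = unquote(token.strip())  # the value may be a quoted-string
        if token in table:
            raise ValueError("token in more than one body part: " + token)
        # Local names come from the part position, never from the token,
        # so the sender cannot choose where the file is written.
        path = os.path.join(directory, "part-%d" % index)
        with open(path, "w", encoding="us-ascii") as handle:
            handle.write(part.get_payload())
        table[token] = path
    return roles, table


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        found_roles, found_table = unpack(RAW, tmp)
        print(found_roles)
        print(sorted(found_table))
```

With the table in hand, an unpacker could go on to rewrite the tokens inside the stored document parts into the local file names, which is the conversion section 2.3 leaves to the recipient's side.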