Internet Engineering Task Force          Audio-Video Transport Working Group
INTERNET-DRAFT                                                H. Schulzrinne
                                                      AT&T Bell Laboratories
                                                           December 15, 1992
                                                            Expires:  5/1/93


                              Media Encodings


Status of this Memo


This document is an Internet Draft.   Internet Drafts are working  documents
of the Internet Engineering  Task Force (IETF), its  Areas, and its  Working
Groups.   Note that other  groups may also  distribute working documents  as
Internet Drafts).

Internet Drafts  are draft  documents valid  for a  maximum of  six  months.
Internet Drafts may be  updated, replaced, or  obsoleted by other  documents
at any time.   It  is not appropriate  to use Internet  Drafts as  reference
material or  to cite  them  other than  as a  "working  draft" or  "work  in
progress."

Please check  the I-D  abstract  listing contained  in each  Internet  Draft
directory to learn the current status of this or any other Internet Draft.

Distribution of this document is unlimited.


                                  Abstract


     This  document describes a possible structure of the media content
    for audio  and video  for Internet applications.    The definitions
    are independent  of the particular  transport mechanism used.   The
    descriptions  provide  pointers  to reference  implementations  and
    the  detailed  standards.     This  document is  meant  as  an  aid
    for implementors  of audio,  video  and other  real-time multimedia
    applications.


INTERNET-DRAFT                    Media                    December 15, 1992

1 Audio


1.1 Encoding-independent recommendations


The  following  recommendations  are  default  operating  parameters.     An
applications should  be  prepared  to  handle other  values.     The  ranges
given  are  meant  to  give  guidance   to  application  writers,   allowing
a set  of  applications  conforming  to  these  guidelines  to  interoperate
without additional  negotiation.    These  guidelines are  not  intended  to
restrict operating parameters for  application that can  negotiate a set  of
interoperable parameters, e.g., through a conference control protocol.

For packetized  audio,  the default  packetization  interval should  have  a
duration of 20  ms, unless  otherwise noted  in Table  1.   For  frame-based
encodings (marked as F in the table 1 below) such as LPC, CELP and  GSM, the
sender may choose to combine several  frame intervals into a single  message
to reduce header  overhead.   The number of  frames is single  packetization
interval, however, a sender may choose  to combine several intervals into  a
single message.  The receiver can  tell the number of frames contained in  a
message since the nominal frame duration is defined as part of the encoding.

If multiple channels are used, the left channel information always  precedes
the right-channel information.  For  more than two channels, the  convention
followed by the  AIFF-C audio  interchange format should  be followed.    It
is listed in the  table below.   (The AIFF-C  specification is available  by
anonymous ftp at sgi.sgi.com in the file sgi/aiff-c.9.26.91.ps.)


type_______channels________________________________________________________________

stereo     left        right
3 channel  left        right        center
quad       front left  front right  rear left  rear right
4 channel  left        center       right      surround
6 channel  left        left center  center     right       right center  surround


The sampling frequency should be drawn from the set:  8, 11.025, 16,  22.05,
44.1 and 48 kHz.  Preferred rates are 8, 16 and 44.1 kHz.


1.2 Recommended Audio Encodings


The table 1 shows the names, types (sample vs.  frame oriented) and  default
sampling frequencies  of  recommended encodings.    The  list  is  partially
drawn from  the  document  ``Recommended  practices  for  enhancing  digital
audio compatibility in  multimedia systems'', published  by the  Interactive
Multimedia Assocation,  Version 3.00,  Oct.    1992 (referenced  as  [IMA]).

H. Schulzrinne                    Expires 5/1/93                    [Page 2]


INTERNET-DRAFT                    Media                    December 15, 1992

The names are  for identification only;  they correspond  to the names  used
within the Real-Time Transport Protocol (RTP). Other applications may choose
different namings.


  name  nom.  sampling   rate  type  frame  description
 __________________kHz___kb/s__S/F___ms___________________________________
  L16             44.1  705.6  S            16-bit linear, 2's complement
  G722              16     64  S            CCITT subband ADPCM
  PCMU               8     64  S            CCITT mu-law PCM
  PCMA               8     64  S            CCITT A-law PCM
  G721               8     32  S            CCITT ADPCM
  DVI                8     32  S            Intel/DVI ADPCM [IMA]
  G723               8     24  S            CCITT ADPCM
  GSM                8     13  B     20     RTE/LTP GSM 06.10)
 _1016_______________8____4.8__B_____30_____CELP__________________________


                         Table 1:  Audio encodings

For multi-octet  encodings, octets  are transmitted  in network  byte  order
(i.e., most significant octet first).

A detailed description of the encodings is given below:


L16 denotes  uncompressed audio  data,  using 16-bit  signed  representation
    with 65535  equally divided  steps  between minimum  and maximum  signal
    level, ranging from -32768 to 32767.  The value is represented in two's
    complement notation.

PCMU is  specified in CCITT  recommendation G.711.   Audio  data is  encoded
    as eight  bits per sample, after  companding.   Code to convert  between
    linear and mu-law companded data is available in the IMA document.

PCMA is  specified in CCITT  recommendation G.711.   Audio  data is  encoded
    as eight  bits per sample, after  companding.   Code to convert  between
    linear and A-law companded data is available in the IMA document.

G721 through G729 are specified in the corresponding CCITT  recommendations.
    Reference implementations for G.721  and G.723 are available as part  of
    the CCITT Software Tool Library (STL) from the ITU  General Secretariat,
    Sales Service, Place  du Nations, CH-1211 Geneve  20, Switzerland.   The
    library is covered  by a license and is  available for anonymous ftp  on
    gaia.cs.umass.edu, file pub/ccitt/ccitt_tools.tar.Z.

GSM (group  speciale mobile)  denotes  the  European GSM  06.10  provisional
    standard  for full-rate  speech  transcoding,  prI-ETS  300  036,  which
    is based  on RPE/LTP  (residual pulse  excitation/long term  prediction)
    coding at a rate of 13 kb/s.  A reference implementation was  written by
    Carsten Borman and Jutta  Degener (TU Berlin, Germany) and is  available

H. Schulzrinne                    Expires 5/1/93                    [Page 3]


INTERNET-DRAFT                    Media                    December 15, 1992

    for anonymous ftp from tub.cs.tu-berlin.de, directory tub/tubmik.

1016 uses code-excited linear prediction (CELP) and is specified  in Federal
    Standard  FED-STD  1016,  published by  the  Office  of  Technology  and
    Standards, Washington, DC 20305-2010.

    The  U. S.  DoD's  Federal-Standard-1016  based 4800  bps  code  excited
    linear  prediction  voice coder  version  3.2  (CELP  3.2)  Fortran  and
    C  simulation source  codes  are available  for  worldwide  distribution
    at  no charge  (on  DOS diskettes,  but  configured  to compile  on  Sun
    SPARC stations)  from:   Bob Fenichel,  National Communications  System,
    Washington, D.C. 20305, phone +1-703-692-2124, fax +1-703-746-4960.

    Example  input  and processed  speech  files,  a  technical  information
    bulletin, and  the official standard  ``Federal Standard 1016,  Telecom-
    munications:   Analog  to Digital  Conversion of  Radio Voice  by  4,800
    bit/second Code  Excited Linear Prediction (CELP)''  are included at  no
    charge.  According  to Vincent Cate (Carnegie Mellon), the  distribution
    is  also  available  for  anonymous  ftp  at   furmint.nectar.cs.cmu.edu
    (128.2.209.111) in directory celp.audio.compression.

    The  following articles  describes  the  Federal-Standard-1016  4.8-kbps
    CELP coder:

    Campbell, Joseph  P. Jr., Thomas  E. Tremain and  Vanoy C. Welch,  ``The
    Proposed Federal  Standard 1016 4800  bps Voice Coder:   CELP,''  Speech
    Technology Magazine, April/May 1990, p.  58-64.

    Campbell, Joseph  P. Jr., Thomas  E. Tremain and  Vanoy C. Welch,  ``The
    Federal  Standard 1016  4800  bps  CELP Voice  Coder,''  Digital  Signal
    Processing, Academic Press, 1991, Vol.  1, No.  3, p.  145-155.

    Campbell, Joseph  P. Jr., Thomas  E. Tremain and  Vanoy C. Welch,  ``The
    DoD 4.8  kbps Standard (Proposed Federal  Standard 1016),'' in  Advances
    in Speech  Coding,  ed.   Atal,  Cuperman  and Gersho,  Kluwer  Academic
    Publishers, 1991, Chapter 12, p.  121-133.

    Campbell, Joseph  P. Jr., Thomas  E. Tremain and  Vanoy C. Welch,  ``The
    Proposed Federal  Standard 1016 4800  bps Voice Coder:   CELP,''  Speech
    Technology Magazine, April/May 1990, p.  58-64.

    Copies of the FS-1016 document are available for $2.50 each from:


    GSA Rm 6654
    7th & D St SW
    Washington, D.C. 20407
    1-202-708-9205


DVI/ADPCM is  specified   in  the  ``Recommended  Practices  for   Enhancing


H. Schulzrinne                    Expires 5/1/93                    [Page 4]


INTERNET-DRAFT                    Media                    December 15, 1992

    Digital Audio  Compatibility in Multimedia  Systems'', published by  the
    Interactive Multimedia  Association (IMA), Annapolis,  MD. The  document
    also contains reference implementations for mu-law to 16-bit,  ADPCM and
    sample rate conversions.


For sample-based encodings,  a receiver should  accept packets  representing
between 0 and  200 ms of  audio data.(1)   Receivers should  be prepared  to
accept multi-channel audio, but may choose to only play a single channel.


1.3 API for Codecs


The application programming interface described  here is suggested, but  not
required for interoperability.   The API shown  here is compatible with  the
one used by SunOS 4.1.


#define AUDIO_ENCODING_ULAW     (1)    /* ISDN u-law */
#define AUDIO_ENCODING_ALAW     (2)    /* ISDN A-law */
#define AUDIO_ENCODING_LINEAR   (3)    /* PCM 2's-complement (0-center) */

typedef struct {
  unsigned sample_rate;            /* samples per second */
  unsigned samples_per_unit;       /* samples per unit */
  unsigned bytes_per_unit;         /* bytes per sample unit */
  unsigned channels;               /* # of interleaved channels */
  unsigned encoding;               /* data encoding format */
  unsigned data_size;              /* length of data (optional) */
} audio_hdr_t;

void *x_init(void *state, double period);

int x_encode(void *in_buf, int in_size, audio_hdr_t *ah,
   void *out_buf, int *out_size, void *state);
int x_decode(void *in_buf, int in_size, audio_hdr_t *ah,
   void *out_buf, int *out_size, void *state)


x_init initializes a particular instance  of a codec.  If the argument  state
is zero, a memory area  sufficient to hold the  encoder or decoder state  is
allocated; if that argument is non-zero, the existing area is reinitialized.
The function returns a pointer to the area, zero if the state area could not
be allocated.   The argument period  refers to the amount  of audio data  in
each block.  It is typically only used for block-oriented codecs.

------------------------------
 1. This restriction allows reasonable buffer sizing for the receiver.


H. Schulzrinne                    Expires 5/1/93                    [Page 5]


INTERNET-DRAFT                    Media                    December 15, 1992

The generic pointer to state refers to an area of storage whose structure is
opaque to the application program.  In the functions, 'x' is replaced by the
appropriate codec name, appropriately modified to conform to C syntax (e.g.,
g711, g721, etc).

The encoder and  decoder transform the  data contained in  the input  buffer
in_buf (in_size bytes) and  deposit the  result into the  output buffer  area
out_buf.    The variable  out_size  is set  to the  number of  bytes  actually
contained in the output buffer.   The ah arguments points to a structure  of
type audio_hdr_t,  which defines the given  input data format for the  encoder
and the desired output data format for the decoder.  The functions return  0
on success, a negative number if a failure occurred.

All block-oriented audio codecs should be able to encode and decode  several
consecutive blocks.


2 Video


For further study.   (To contain definitions  and pointers of H.261,  etc.).
Items to include:  H.261 frame format, request for retransmission.


3 Address of Author


Henning Schulzrinne
AT&T Bell Laboratories
MH 2A244
600 Mountain Avenue
Murray Hill, NJ 07974
telephone:  908 582-2262
electronic mail:  hgs@research.att.com


H. Schulzrinne                    Expires 5/1/93                    [Page 6]