Move a few obsolete RFCs to the Attic

Kurt Zeilenga 2000-02-07 05:48:17 +00:00
parent bc51bd5180
commit eeefab745c
3 changed files with 0 additions and 1129 deletions

@ -1,619 +0,0 @@
Network Working Group T. Howes
Request for Comments: 1488 University of Michigan
S. Kille
ISODE Consortium
W. Yeong
Performance Systems International
C. Robbins
NeXor Ltd.
July 1993
The X.500 String Representation of Standard Attribute Syntaxes
Status of this Memo
This RFC specifies an IAB standards track protocol for the Internet
community, and requests discussion and suggestions for improvements.
Please refer to the current edition of the "IAB Official Protocol
Standards" for the standardization state and status of this protocol.
Distribution of this memo is unlimited.
Abstract
The Lightweight Directory Access Protocol (LDAP) [9] requires that
the contents of AttributeValue fields in protocol elements be octet
strings. This document defines the requirements that must be
satisfied by encoding rules used to render Directory attribute
syntaxes into a form suitable for use in the LDAP, then goes on to
define the encoding rules for the standard set of attribute syntaxes
defined in [1,2] and [3].
1. Attribute Syntax Encoding Requirements
This section defines general requirements for lightweight directory
protocol attribute syntax encodings. All documents defining attribute
syntax encodings for use by the lightweight directory protocols are
expected to conform to these requirements.
The encoding rules defined for a given attribute syntax must produce
octet strings. To the greatest extent possible, encoded octet
strings should be usable in their native encoded form for display
purposes. In particular, encoding rules for attribute syntaxes
defining non-binary values should produce strings that can be
displayed with little or no translation by clients implementing the
lightweight directory protocols.
Howes, Kille, Yeong & Robbins [Page 1]
RFC 1488 X.500 Syntax Encoding July 1993
2. Standard Attribute Syntax Encodings
For the purposes of defining the encoding rules for the standard
attribute syntaxes, the following auxiliary BNF definitions will be
used:
<a> ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' |
'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' |
's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | 'A' |
'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' |
'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' |
'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
<d> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<hex-digit> ::= <d> | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' |
'A' | 'B' | 'C' | 'D' | 'E' | 'F'
<k> ::= <a> | <d> | '-'
<p> ::= <a> | <d> | ''' | '(' | ')' | '+' | ',' | '-' | '.' |
'/' | ':' | '?' | ' '
<CRLF> ::= The ASCII newline character with hexadecimal value 0x0A
<letterstring> ::= <a> | <a> <letterstring>
<numericstring> ::= <d> | <d> <numericstring>
<keystring> ::= <a> | <a> <anhstring>
<anhstring> ::= <k> | <k> <anhstring>
<printablestring> ::= <p> | <p> <printablestring>
<space> ::= ' ' | ' ' <space>
2.1. Undefined
Values of type Undefined are encoded as if they were values of type
Octet String.
2.2. Case Ignore String
A string of type caseIgnoreStringSyntax is encoded as the string
value itself.
2.3. Case Exact String
The encoding of a string of type caseExactStringSyntax is the string
value itself.
2.4. Printable String
The encoding of a string of type printableStringSyntax is the string
value itself.
2.5. Numeric String
The encoding of a string of type numericStringSyntax is the string
value itself.
2.6. Octet String
The encoding of a string of type octetStringSyntax is the string
value itself.
2.7. Case Ignore IA5 String
The encoding of a string of type caseIgnoreIA5String is the string
value itself.
2.8. IA5 String
The encoding of a string of type iA5StringSyntax is the string value
itself.
2.9. T61 String
The encoding of a string of type t61StringSyntax is the string value
itself.
2.10. Case Ignore List
Values of type caseIgnoreListSyntax are encoded according to the
following BNF:
<caseignorelist> ::= <caseignorestring> |
<caseignorestring> '$' <caseignorelist>
<caseignorestring> ::= a string encoded according to the rules
for Case Ignore String as above.
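The '$'-separated form above can be sketched as follows. This is a minimal illustration, not part of the memo: the function names are hypothetical, and rejecting a component that contains '$' is an assumption, since the text defines no escape mechanism for the separator.

```python
def encode_dollar_list(items):
    """Join already-encoded string components with '$' (section 2.10)."""
    if not items:
        raise ValueError("the BNF requires at least one component")
    for item in items:
        if "$" in item:
            # Assumption: no escape for '$' is defined above, so reject it.
            raise ValueError("component contains the '$' separator")
    return "$".join(items)


def decode_dollar_list(encoded):
    """Split a '$'-separated value back into its components."""
    return encoded.split("$")
```

For instance, encode_dollar_list(["Ann Arbor", "Michigan"]) yields the single string "Ann Arbor$Michigan", and decode_dollar_list reverses it.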
2.11. Case Exact List
Values of type caseExactListSyntax are encoded according to the
following BNF:
<caseexactlist> ::= <caseexactstring> |
<caseexactstring> '$' <caseexactlist>
<caseexactstring> ::= a string encoded according to the rules for
Case Exact String as above.
2.12. Distinguished Name
Values of type distinguishedNameSyntax are encoded to have the
representation defined in [5].
2.13. Boolean
Values of type booleanSyntax are encoded according to the following
BNF:
<boolean> ::= "TRUE" | "FALSE"
Boolean values have an encoding of "TRUE" if they are logically true,
and have an encoding of "FALSE" otherwise.
2.14. Integer
Values of type integerSyntax are encoded as the decimal
representation of their values, with each decimal digit represented
by its character equivalent. So the digit 1 is represented by the
character '1'.
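As an illustration of sections 2.13 and 2.14, the two encodings might be sketched as below. The function names are hypothetical. The text does not say how negative integers are written, so the leading '-' that Python produces for them is an assumption of this sketch.

```python
def encode_boolean(value):
    """Return "TRUE" for a logically true value, "FALSE" otherwise."""
    return "TRUE" if value else "FALSE"


def decode_boolean(text):
    """Accept only the two strings allowed by the <boolean> rule."""
    if text == "TRUE":
        return True
    if text == "FALSE":
        return False
    raise ValueError("not a valid boolean encoding: %r" % text)


def encode_integer(value):
    """Decimal digits as characters; the sign handling is an assumption."""
    return str(int(value))
```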
2.15. Object Identifier
Values of type objectIdentifierSyntax are encoded according to the
following BNF:
<oid> ::= <descr> | <descr> '.' <numericoid> | <numericoid>
<descr> ::= <keystring>
<numericoid> ::= <numericstring> | <numericstring> '.' <numericoid>
In the above BNF, <descr> is the syntactic representation of an
object descriptor. When encoding values of type
objectIdentifierSyntax, the first encoding option should be used in
preference to the second, which should be used in preference to the
third wherever possible. That is, in encoding object identifiers,
object descriptors (where assigned and known by the implementation)
should be used in preference to numeric oids to the greatest extent
possible. For example, in encoding the object identifier representing
an organizationName, the descriptor "organizationName" is preferable
to "ds.4.10", which is in turn preferable to the string "2.5.4.10".
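The preference order can be sketched as a lookup that tries a full descriptor first, then the longest known descriptor prefix, then the plain numeric form. The table below is a hypothetical stand-in for whatever descriptors an implementation knows; the mapping of "ds" to 2.5 is inferred from the example in the text.

```python
def encode_oid(numeric_oid, descriptors):
    """Pick the most preferred string form for a dotted numeric OID.

    `descriptors` maps dotted numeric OIDs (or OID prefixes) to names.
    """
    # First option: a descriptor for the whole identifier.
    if numeric_oid in descriptors:
        return descriptors[numeric_oid]
    # Second option: a descriptor for the longest known prefix.
    arcs = numeric_oid.split(".")
    for cut in range(len(arcs) - 1, 0, -1):
        prefix = ".".join(arcs[:cut])
        if prefix in descriptors:
            return descriptors[prefix] + "." + ".".join(arcs[cut:])
    # Third option: the numeric form itself.
    return numeric_oid


# Hypothetical table for illustration only.
EXAMPLE_DESCRIPTORS = {"2.5.4.10": "organizationName", "2.5": "ds"}
```

With this table, "2.5.4.10" is written as "organizationName", an identifier such as "2.5.4.11" falls back to "ds.4.11", and an identifier with no known prefix stays numeric.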
2.16. Telephone Number
Values of type telephoneNumberSyntax are encoded as if they were
Printable String types.
2.17. Telex Number
Values of type telexNumberSyntax are encoded according to the
following BNF:
<telex-number> ::= <actual-number> '$' <country> '$' <answerback>
<actual-number> ::= <printablestring>
<country> ::= <printablestring>
<answerback> ::= <printablestring>
In the above, <actual-number> is the syntactic representation of the
number portion of the TELEX number being encoded, <country> is the
TELEX country code, and <answerback> is the answerback code of a
TELEX terminal.
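A small sketch of reading and writing this three-part form follows. The type and function names are hypothetical. Splitting on '$' is unambiguous here because '$' is not among the characters allowed by the printable string rule.

```python
from typing import NamedTuple


class TelexNumber(NamedTuple):
    actual_number: str
    country: str
    answerback: str


def parse_telex_number(encoded):
    """Split an encoded telex number into its three components."""
    parts = encoded.split("$")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected <actual-number>$<country>$<answerback>")
    return TelexNumber(*parts)


def encode_telex_number(telex):
    """Join the three components back into the encoded form."""
    return "$".join(telex)
```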
2.18. Teletex Terminal Identifier
Values of type teletexTerminalIdentifier are encoded according to the
following BNF:
<teletex-id> ::= <printablestring> 0*( '$' <printablestring>)
In the above, the first <printablestring> is the encoding of the
first portion of the teletex terminal identifier to be encoded, and
the subsequent 0 or more <printablestrings> are subsequent portions
of the teletex terminal identifier.
2.19. Facsimile Telephone Number
Values of type FacsimileTelephoneNumber are encoded according to the
following BNF:
<fax-number> ::= <printablestring> [ '$' <faxparameters> ]
<faxparameters> ::= <faxparm> | <faxparm> '$' <faxparameters>
<faxparm> ::= 'twoDimensional' | 'fineResolution' | 'unlimitedLength' |
'b4Length' | 'a3Width' | 'b4Width' | 'uncompressed'
In the above, the first <printablestring> is the actual fax number,
and the <faxparm> tokens represent fax parameters.
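The fax number form could be checked along these lines. This is only a sketch with hypothetical names: the set of parameter tokens is copied from the faxparm rule, and unknown tokens are rejected.

```python
FAX_PARAMETERS = frozenset({
    "twoDimensional", "fineResolution", "unlimitedLength",
    "b4Length", "a3Width", "b4Width", "uncompressed",
})


def parse_fax_number(encoded):
    """Return (number, [parameters]) for an encoded fax number."""
    number, *params = encoded.split("$")
    if not number:
        raise ValueError("missing fax number")
    unknown = [p for p in params if p not in FAX_PARAMETERS]
    if unknown:
        raise ValueError("unknown fax parameters: %r" % unknown)
    return number, params
```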
2.20. Presentation Address
Values of type PresentationAddress are encoded to have the
representation described in [6].
2.21. UTC Time
Values of type uTCTimeSyntax are encoded as if they were Printable
Strings with the strings containing a UTCTime value.
2.22. Guide (search guide)
Values of type Guide, such as values of the searchGuide attribute,
are encoded according to the following BNF:
<guide-value> ::= [ <object-class> '#' ] <criteria>
<object-class> ::= an encoded value of type objectIdentifierSyntax
<criteria> ::= <criteria-item> | <criteria-set> | '!' <criteria>
<criteria-set> ::= [ '(' ] <criteria> '&' <criteria-set> [ ')' ] |
[ '(' ] <criteria> '|' <criteria-set> [ ')' ]
<criteria-item> ::= [ '(' ] <attributetype> '$' <match-type> [ ')' ]
<match-type> ::= "EQ" | "SUBSTR" | "GE" | "LE" | "APPROX"
2.23. Postal Address
Values of type PostalAddress are encoded according to the following BNF:
<postal-address> ::= <t61string> | <t61string> '$' <postal-address>
In the above, each <t61string> component of a postal address value is
encoded as a value of type t61StringSyntax.
2.24. User Password
Values of type userPasswordSyntax are encoded as if they were of type
octetStringSyntax.
2.25. User Certificate
Values of type userCertificate are encoded according to the following
BNF:
<certificate> ::= <signature> '#' <issuer> '#' <validity> '#' <subject>
'#' <public-key-info>
<signature> ::= <algorithm-id>
<issuer> ::= an encoded Distinguished Name
<validity> ::= <not-before-time> '#' <not-after-time>
<not-before-time> ::= <utc-time>
<not-after-time> ::= <utc-time>
<algorithm-parameters> ::= <null> | <integervalue> |
'{ASN}' <hex-string>
<subject> ::= an encoded Distinguished Name
<public-key-info> ::= <algorithm-id> '#' <encrypted-value>
<encrypted-value> ::= <hex-string> | <hex-string> '-' <d>
<algorithm-id> ::= <oid> '#' <algorithm-parameters>
<utc-time> ::= an encoded UTCTime value
<hex-string> ::= <hex-digit> | <hex-digit> <hex-string>
2.26. CA Certificate
Values of type cACertificate are encoded as if the values were of
type userCertificate.
2.27. Authority Revocation List
Values of type authorityRevocationList are encoded according to the
following BNF:
<certificate-list> ::= <signature> '#' <issuer> '#'
<utc-time> [ '#' <revoked-certificates> ]
<revoked-certificates> ::= <algorithm> '#' <encrypted-value>
[ '#' 0*(<revoked-certificate>) '#']
<revoked-certificate> ::= <subject> '#' <algorithm> '#'
<serial> '#' <utc-time>
The syntactic components <algorithm>, <issuer>, <encrypted-value>,
<utc-time>, <subject> and <serial> have the same definitions as in
the BNF for the userCertificate attribute syntax.
2.28. Certificate Revocation List
Values of type certificateRevocationList are encoded as if the values
were of type authorityRevocationList.
2.29. Cross Certificate Pair
Values of type crossCertificatePair are encoded according to the
following BNF:
<certificate-pair> ::= <certificate> '|' <certificate>
The syntactic component <certificate> has the same definition as in
the BNF for the userCertificate attribute syntax.
2.30. Delivery Method
Values of type deliveryMethod are encoded according to the following
BNF:
<delivery-value> ::= <pdm> | <pdm> '$' <delivery-value>
<pdm> ::= 'any' | 'mhs' | 'physical' | 'telex' | 'teletex' |
'g3fax' | 'g4fax' | 'ia5' | 'videotex' | 'telephone'
2.31. Other Mailbox
Values of the type otherMailboxSyntax are encoded according to the
following BNF:
<otherMailbox> ::= <mailbox-type> '$' <mailbox>
<mailbox-type> ::= an encoded Printable String
<mailbox> ::= an encoded IA5 String
In the above, <mailbox-type> represents the type of mail system in
which the mailbox resides, for example "Internet" or "MCIMail"; and
<mailbox> is the actual mailbox in the mail system defined by
<mailbox-type>.
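One way to read this form is to split at the first '$' only. The sketch below uses a hypothetical function name. The reasoning is that a Printable String cannot contain '$', while an IA5 String may, so any later '$' is taken to belong to the mailbox.

```python
def parse_other_mailbox(encoded):
    """Return (mailbox_type, mailbox), splitting at the first '$' only."""
    mailbox_type, separator, mailbox = encoded.partition("$")
    if not separator:
        raise ValueError("missing '$' between mailbox type and mailbox")
    return mailbox_type, mailbox
```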
2.32. Mail Preference
Values of type mailPreferenceOption are encoded according to the
following BNF:
<mail-preference> ::= "NO-LISTS" | "ANY-LIST" | "PROFESSIONAL-LISTS"
2.33. MHS OR Address
Values of type MHS OR Address are encoded as strings, according to
the format defined in [10].
2.34. Photo
Values of type Photo are encoded as if they were octet strings
containing JPEG images in the JPEG File Interchange Format (JFIF), as
described in [8].
2.35. Fax
Values of type Fax are encoded as if they were octet strings
containing Group 3 Fax images as defined in [7].
3. Acknowledgements
Many of the attribute syntax encodings defined in this document are
adapted from those used in the QUIPU X.500 implementation. The
contributions of the authors of the QUIPU implementation in the
specification of the QUIPU syntaxes [4] are gratefully acknowledged.
4. Bibliography
[1] The Directory: Selected Attribute Syntaxes. CCITT,
Recommendation X.520.
[2] Information Processing Systems -- Open Systems Interconnection --
The Directory: Selected Attribute Syntaxes.
[3] Barker, P., and S. Kille, "The COSINE and Internet X.500 Schema",
RFC 1274, University College London, November 1991.
[4] The ISO Development Environment: User's Manual -- Volume 5:
QUIPU. Colin Robbins, Stephen E. Kille.
[5] Kille, S., "A String Representation of Distinguished Names", RFC
1485, July 1993.
[6] Kille, S., "A String Representation for Presentation Addresses",
RFC 1278, University College London, November 1991.
[7] Terminal Equipment and Protocols for Telematic Services -
Standardization of Group 3 facsimile apparatus for document
transmission. CCITT, Recommendation T.4.
[8] JPEG File Interchange Format (Version 1.02). Eric Hamilton, C-
Cube Microsystems, Milpitas, CA, September 1, 1992.
[9] Yeong, W., Howes, T., and S. Kille, "Lightweight Directory Access
Protocol", RFC 1487, Performance Systems International,
University of Michigan, ISODE Consortium, July 1993.
[10] Kille, S., "Mapping between X.400(1988)/ISO 10021 and RFC 822",
RFC 1327, University College London, May 1992.
5. Security Considerations
Security issues are not discussed in this memo.
6. Authors' Addresses
Tim Howes
University of Michigan
ITD Research Systems
535 W William St.
Ann Arbor, MI 48103-4943
USA
Phone: +1 313 747-4454
EMail: tim@umich.edu
Steve Kille
ISODE Consortium
PO Box 505
London
SW11 1DX
UK
Phone: +44-71-223-4062
EMail: S.Kille@isode.com
Wengyik Yeong
PSI, Inc.
510 Huntmar Park Drive
Herndon, VA 22070
USA
Phone: +1 703-450-8001
EMail: yeongw@psilink.com
Colin Robbins
NeXor Ltd
University Park
Nottingham
NG7 2RD
UK

@ -1,171 +0,0 @@
Network Working Group T. Howes
Request for Comments: 1558 University of Michigan
Category: Informational December 1993
A String Representation of LDAP Search Filters
Status of this Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
Abstract
The Lightweight Directory Access Protocol (LDAP) [1] defines a
network representation of a search filter transmitted to an LDAP
server. Some applications may find it useful to have a common way of
representing these search filters in a human-readable form. This
document defines a human-readable string format for representing LDAP
search filters.
1. LDAP Search Filter Definition
An LDAP search filter is defined in [1] as follows:
Filter ::= CHOICE {
    and            [0] SET OF Filter,
    or             [1] SET OF Filter,
    not            [2] Filter,
    equalityMatch  [3] AttributeValueAssertion,
    substrings     [4] SubstringFilter,
    greaterOrEqual [5] AttributeValueAssertion,
    lessOrEqual    [6] AttributeValueAssertion,
    present        [7] AttributeType,
    approxMatch    [8] AttributeValueAssertion
}
SubstringFilter ::= SEQUENCE {
    type AttributeType,
    SEQUENCE OF CHOICE {
        initial [0] LDAPString,
        any     [1] LDAPString,
        final   [2] LDAPString
    }
}
Howes [Page 1]
RFC 1558 Representation of LDAP Filters December 1993
AttributeValueAssertion ::= SEQUENCE {
    attributeType  AttributeType,
    attributeValue AttributeValue
}
AttributeType ::= LDAPString
AttributeValue ::= OCTET STRING
LDAPString ::= OCTET STRING
where the LDAPString above is limited to the IA5 character set. The
AttributeType is a string representation of the attribute object
identifier in dotted OID format (e.g., "2.5.4.10"), or the shorter
string name of the attribute (e.g., "organizationName", or "o"). The
AttributeValue OCTET STRING has the form defined in [2]. The Filter
is encoded for transmission over a network using the Basic Encoding
Rules defined in [3], with simplifications described in [1].
2. String Search Filter Definition
The string representation of an LDAP search filter is defined by the
following BNF. It uses a prefix format.
<filter> ::= '(' <filtercomp> ')'
<filtercomp> ::= <and> | <or> | <not> | <item>
<and> ::= '&' <filterlist>
<or> ::= '|' <filterlist>
<not> ::= '!' <filter>
<filterlist> ::= <filter> | <filter> <filterlist>
<item> ::= <simple> | <present> | <substring>
<simple> ::= <attr> <filtertype> <value>
<filtertype> ::= <equal> | <approx> | <greater> | <less>
<equal> ::= '='
<approx> ::= '~='
<greater> ::= '>='
<less> ::= '<='
<present> ::= <attr> '=*'
<substring> ::= <attr> '=' <initial> <any> <final>
<initial> ::= NULL | <value>
<any> ::= '*' <starval>
<starval> ::= NULL | <value> '*' <starval>
<final> ::= NULL | <value>
<attr> is a string representing an AttributeType, and has the format
defined in [1]. <value> is a string representing an AttributeValue,
or part of one, and has the form defined in [2]. If a <value> must
contain one of the characters '*' or '(' or ')', these characters
should be escaped by preceding them with the backslash '\' character.
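The escaping rule can be sketched as follows. The function names are hypothetical. Only the three characters named above are escaped; the text does not say how a literal backslash inside a value should be treated, so this sketch leaves backslashes untouched.

```python
def escape_filter_value(value):
    """Precede '*', '(' and ')' with a backslash."""
    return "".join("\\" + ch if ch in "*()" else ch for ch in value)


def equality_filter(attr, value):
    """Build a simple '(attr=value)' filter from an unescaped value."""
    return "(" + attr + "=" + escape_filter_value(value) + ")"
```

Escaping before assembling the filter keeps a user-supplied '*' from silently turning an equality match into a substring match.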
3. Examples
This section gives a few examples of search filters written using
this notation.
(cn=Babs Jensen)
(!(cn=Tim Howes))
(&(objectClass=Person)(|(sn=Jensen)(cn=Babs J*)))
(o=univ*of*mich*)
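A rough recursive-descent reader for this notation is sketched below, mainly to show how the prefix format nests. It is not a validating parser: the tuple shapes and names are illustrative choices, and attribute names are not checked.

```python
import re

_OPS = {"=": "equal", "~=": "approx", ">=": "greater", "<=": "less"}


def _unescape(part):
    # Drop the backslash in front of an escaped character.
    return re.sub(r"\\(.)", r"\1", part)


def _parse_item(text, pos):
    # Scan to the unescaped ')' that closes this item.
    end = pos
    while end < len(text) and text[end] != ")":
        end += 2 if text[end] == "\\" else 1
    body = text[pos:end]
    eq = body.find("=")
    if eq < 1:
        raise ValueError("item needs an attribute and a filter type")
    start = eq - 1 if body[eq - 1] in "~><" else eq
    attr, op, raw = body[:start], body[start:eq + 1], body[eq + 1:]
    if not attr:
        raise ValueError("missing attribute name")
    if op == "=" and raw == "*":
        return ("present", attr), end
    parts = re.split(r"(?<!\\)\*", raw)  # split on unescaped '*'
    if op == "=" and len(parts) > 1:
        return ("substring", attr, [_unescape(p) for p in parts]), end
    return (_OPS[op], attr, _unescape(raw)), end


def _parse_filter(text, pos):
    if pos >= len(text) or text[pos] != "(":
        raise ValueError("expected '(' at offset %d" % pos)
    pos += 1
    if pos >= len(text):
        raise ValueError("unexpected end of filter")
    op = text[pos]
    if op in "&|":
        children = []
        pos += 1
        while pos < len(text) and text[pos] == "(":
            child, pos = _parse_filter(text, pos)
            children.append(child)
        if not children:
            raise ValueError("a filter list needs at least one filter")
        node = ("and" if op == "&" else "or", children)
    elif op == "!":
        child, pos = _parse_filter(text, pos + 1)
        node = ("not", child)
    else:
        node, pos = _parse_item(text, pos)
    if pos >= len(text) or text[pos] != ")":
        raise ValueError("expected ')' at offset %d" % pos)
    return node, pos + 1


def parse_filter(text):
    """Parse a filter string into nested tuples."""
    node, pos = _parse_filter(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters after filter")
    return node
```

In a substring result, an empty first or last element stands for a NULL initial or final component.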
4. Security Considerations
Security issues are not discussed in this memo.
5. References
[1] Yeong, W., Howes, T., and S. Kille, "Lightweight Directory Access
Protocol", RFC 1487, Performance Systems International,
University of Michigan, ISODE Consortium, July 1993.
[2] Howes, T., Kille, S., Yeong, W., and C. Robbins, "The String
Representation of Standard Attribute Syntaxes", RFC 1488,
University of Michigan, ISODE Consortium, Performance Systems
International, NeXor Ltd., July 1993.
[3] "Specification of Basic Encoding Rules for Abstract Syntax
Notation One (ASN.1)", CCITT Recommendation X.209, 1988.
6. Author's Address
Tim Howes
University of Michigan
ITD Research Systems
535 W William St.
Ann Arbor, MI 48103-4943
USA
Phone: +1 313 747-4454
EMail: tim@umich.edu

@ -1,339 +0,0 @@
Network Working Group F. Yergeau
Request for Comments: 2044 Alis Technologies
Category: Informational October 1996
UTF-8, a transformation format of Unicode and ISO 10646
Status of this Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
Abstract
The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993 jointly
define a 16 bit character set which encompasses most of the world's
writing systems. 16-bit characters, however, are not compatible with
many current applications and protocols, and this has led to the
development of a few so-called UCS transformation formats (UTF), each
with different characteristics. UTF-8, the object of this memo, has
the characteristic of preserving the full US-ASCII range: US-ASCII
characters are encoded in one octet having the usual US-ASCII value,
and any octet with such a value can only be an US-ASCII character.
This provides compatibility with file systems, parsers and other
software that rely on US-ASCII values but are transparent to other
values.
1. Introduction
The Unicode Standard, version 1.1 [UNICODE], and ISO/IEC 10646-1:1993
[ISO-10646] jointly define a 16 bit character set, UCS-2, which
encompasses most of the world's writing systems. ISO 10646 further
defines a 31-bit character set, UCS-4, with currently no assignments
outside of the region corresponding to UCS-2 (the Basic Multilingual
Plane, BMP). The UCS-2 and UCS-4 encodings, however, are hard to use
in many current applications and protocols that assume 8 or even 7
bit characters. Even newer systems able to deal with 16 bit
characters cannot process UCS-4 data. This situation has led to the
development of so-called UCS transformation formats (UTF), each with
different characteristics.
UTF-1 has only historical interest, having been removed from ISO
10646. UTF-7 has the quality of encoding the full Unicode repertoire
using only octets with the high-order bit clear (7 bit US-ASCII
values, [US-ASCII]), and is thus deemed a mail-safe encoding
([RFC1642]). UTF-8, the object of this memo, uses all bits of an
octet, but has the quality of preserving the full US-ASCII range:
Yergeau Informational [Page 1]
RFC 2044 UTF-8 October 1996
US-ASCII characters are encoded in one octet having the normal US-
ASCII value, and any octet with such a value can only stand for an
US-ASCII character, and nothing else.
UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
into a pair of UCS-2 values from a reserved range. UTF-16 impacts
UTF-8 in that UCS-2 values from the reserved range must be treated
specially in the UTF-8 transformation.
UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
octets, where the number of octets, and the value of each, depend on
the integer value assigned to the character in ISO 10646. This
transformation format has the following characteristics (all values
are in hexadecimal):
- Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
correspond to octets 00 to 7F (7 bit US-ASCII values).
- US-ASCII values do not appear otherwise in a UTF-8 encoded character
stream. This provides compatibility with file systems or
other software (e.g. the printf() function in C libraries) that
parse based on US-ASCII values but are transparent to other values.
- Round-trip conversion is easy between UTF-8 and either of UCS-4,
UCS-2 or Unicode.
- The first octet of a multi-octet sequence indicates the number of
octets in the sequence.
- Character boundaries are easily found from anywhere in an octet
stream.
- The lexicographic sorting order of UCS-4 strings is preserved. Of
course this is of limited interest since the sort order is not
culturally valid in either case.
- The octet values FE and FF never appear.
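The boundary property in the list above can be illustrated with a short sketch. It relies on the octet layout defined in section 2, where every octet after the first in a multi-octet sequence has the bit pattern 10xxxxxx. The function name is hypothetical.

```python
def sequence_start(data, index):
    """Return the offset of the first octet of the sequence covering `index`."""
    # Step backwards over continuation octets (bit pattern 10xxxxxx).
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index
```

Because no more than five continuation octets can follow an initial octet, the backward scan from any position in well-formed data is short.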
UTF-8 was originally a project of the X/Open Joint
Internationalization Group XOJIG with the objective to specify a File
System Safe UCS Transformation Format [FSS_UTF] that is compatible
with UNIX systems, supporting multilingual text in a single encoding.
The original authors were Gary Miller, Greger Leijonhufvud and John
Entenmann. Later, Ken Thompson and Rob Pike did significant work for
the formal UTF-8.
A description can also be found in Unicode Technical Report #4
[UNICODE]. The definitive reference, including provisions for UTF-16
data within UTF-8, is Annex R of ISO/IEC 10646-1 [ISO-10646].
2. UTF-8 definition
In UTF-8, characters are encoded using sequences of 1 to 6 octets.
The only octet of a "sequence" of one has the higher-order bit set to
0, the remaining 7 bits being used to encode the character value. In
a sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the value of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the UCS-4
character value.
UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
Encoding from UCS-4 to UTF-8 proceeds as follows:
1) Determine the number of octets required from the character value
and the first column of the table above.
2) Prepare the high-order bits of the octets as per the second column
of the table.
3) Fill in the bits marked x from the bits of the character value,
starting from the lower-order bits of the character value and
putting them first in the last octet of the sequence, then the
next to last, etc. until all x bits are filled in.
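The three encoding steps can be sketched directly from the table. The names below are hypothetical, and the sketch covers the full 1 to 6 octet range described in this memo, which is wider than what the built-in Python codec accepts.

```python
# (upper bound of the UCS-4 range, number of octets, fixed bits of first octet)
_RANGES = [
    (0x7F, 1, 0x00),
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def utf8_encode(value):
    """Encode one UCS-4 character value as a UTF-8 octet sequence."""
    if value < 0:
        raise ValueError("character values are non-negative")
    # Step 1: find the number of octets from the first column of the table.
    for upper, count, lead_bits in _RANGES:
        if value <= upper:
            break
    else:
        raise ValueError("value does not fit in 31 bits")
    octets = [0] * count
    # Steps 2 and 3: fill the x bits, lowest-order bits into the last octet.
    for i in range(count - 1, 0, -1):
        octets[i] = 0x80 | (value & 0x3F)
        value >>= 6
    octets[0] = lead_bits | value
    return bytes(octets)
```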
The algorithm for encoding UCS-2 (or Unicode) to UTF-8 can be
obtained from the above, in principle, by simply extending each
UCS-2 character with two zero-valued octets. However, UCS-2 values
between D800 and DFFF, being actually UCS-4 characters transformed
through UTF-16, need special treatment: the UTF-16 transformation
must be undone, yielding a UCS-4 character that is then
transformed as above.
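Undoing the UTF-16 transformation might look like the sketch below. The split of the reserved range into a high half (D800 to DBFF) and a low half (DC00 to DFFF), and the arithmetic that combines them, come from the UTF-16 definition rather than from this memo. The function name is hypothetical.

```python
def ucs2_to_ucs4(units):
    """Turn a list of 16-bit values into UCS-4 values, undoing UTF-16 pairs."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High half: it must be followed by a low half.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("high half without a following low half")
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("low half without a preceding high half")
        else:
            # Ordinary UCS-2 value: extend with zero bits.
            result.append(unit)
            i += 1
    return result
```

Each resulting value can then be encoded with the UCS-4 procedure described above.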
Decoding from UTF-8 to UCS-4 proceeds as follows:
1) Initialize the 4 octets of the UCS-4 character with all bits set
to 0.
2) Determine which bits encode the character value from the number of
octets in the sequence and the second column of the table above
(the bits marked x).
3) Distribute the bits from the sequence to the UCS-4 character,
first the lower-order bits from the last octet of the sequence and
proceeding to the left until no x bits are left.
If the UTF-8 sequence is no more than three octets long, decoding
can proceed directly to UCS-2 (or equivalently Unicode).
A more detailed algorithm and formulae can be found in [FSS_UTF],
[UNICODE] or Annex R to [ISO-10646].
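A minimal sketch of the decoding steps for one complete sequence follows. The function name is hypothetical. It accumulates bits from the first octet towards the last, which gives the same result as distributing them from the last octet leftwards. It does not reject sequences longer than necessary for their value, a check this memo does not discuss.

```python
def utf8_decode_one(octets):
    """Decode a single complete UTF-8 sequence into a UCS-4 value."""
    first = octets[0]
    # Count the high-order 1 bits of the first octet.
    ones = 0
    while ones < 8 and first & (0x80 >> ones):
        ones += 1
    if ones == 1 or ones > 6:
        raise ValueError("not a valid initial octet")
    length = ones or 1  # 0xxxxxxx is a sequence of one octet
    if len(octets) != length:
        raise ValueError("expected %d octets" % length)
    # Keep only the x bits of the first octet.
    value = first & (0x7F >> ones)
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("following octets must look like 10xxxxxx")
        value = (value << 6) | (octet & 0x3F)
    return value
```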
3. Examples
The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262, 0391,
002E) may be encoded as follows:
41 E2 89 A2 CE 91 2E
The Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (0048, 0069,
0020, 004D, 006F, 006D, 0020, 263A, 0021) may be encoded as follows:
48 69 20 4D 6F 6D 20 E2 98 BA 21
The Unicode sequence representing the Han characters for the Japanese
word "nihongo" (65E5, 672C, 8A9E) may be encoded as follows:
E6 97 A5 E6 9C AC E8 AA 9E
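These three examples can be reproduced with Python's built-in UTF-8 codec, which agrees with this memo for characters in the Basic Multilingual Plane. The helper name below is hypothetical.

```python
def hex_octets(text):
    """Return the UTF-8 encoding of `text` as space-separated hex octets."""
    return " ".join("%02X" % octet for octet in text.encode("utf-8"))


# The three sequences from this section, written with \u escapes.
EXAMPLES = [
    "A\u2262\u0391.",
    "Hi Mom \u263A!",
    "\u65E5\u672C\u8A9E",
]

for example in EXAMPLES:
    print(hex_octets(example))
```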
MIME registrations
This memo is meant to serve as the basis for registration of a MIME
character encoding (charset) as per [RFC1521]. The proposed charset
parameter value is "UTF-8". This string would label media types
containing text consisting of characters from the repertoire of ISO
10646-1 encoded to a sequence of octets using the encoding scheme
outlined above.
Security Considerations
Security issues are not discussed in this memo.
Acknowledgments
The following have participated in the drafting and discussion of
this memo:
James E. Agenbroad Andries Brouwer
Martin J. Dürst David Goldsmith
Edwin F. Hart Kent Karlsson
Markus Kuhn Michael Kung
Alain LaBonte Murray Sargent
Keld Simonsen Arnold Winkler
Bibliography
[FSS_UTF] X/Open CAE Specification C501 ISBN 1-85912-082-2 28cm.
22p. pbk. 172g. 4/95, X/Open Company Ltd., "File System
Safe UCS Transformation Format (FSS_UTF)", X/Open
Preliminary Specification, Document Number P316. Also
published in Unicode Technical Report #4.
[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded
Character Set (UCS) -- Part 1: Architecture and Basic
Multilingual Plane. UTF-8 is described in Annex R,
adopted but not yet published. UTF-16 is described in
Annex Q, adopted but not yet published.
[RFC1521] Borenstein, N., and N. Freed, "MIME (Multipurpose
Internet Mail Extensions) Part One: Mechanisms for
Specifying and Describing the Format of Internet Message
Bodies", RFC 1521, Bellcore, Innosoft, September
1993.
[RFC1641] Goldsmith, D., and M. Davis, "Using Unicode with
MIME", RFC 1641, Taligent inc., July 1994.
[RFC1642] Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
Transformation Format of Unicode", RFC 1642,
Taligent, Inc., July 1994.
[UNICODE] The Unicode Consortium, "The Unicode Standard --
Worldwide Character Encoding -- Version 1.0", Addison-
Wesley, Volume 1, 1991, Volume 2, 1992. UTF-8 is
described in Unicode Technical Report #4.
[US-ASCII] Coded Character Set--7-bit American Standard Code for
Information Interchange, ANSI X3.4-1986.
Author's Address
Francois Yergeau
Alis Technologies
100, boul. Alexis-Nihon
Suite 600
Montreal QC H4M 2P2
Canada
Tel: +1 (514) 747-2547
Fax: +1 (514) 747-2561
EMail: fyergeau@alis.com