new/updated drafts

David Lawrence 2001-03-05 12:18:56 +00:00
parent 0015ab0974
commit fef2d3dce0
6 changed files with 5174 additions and 2713 deletions


@ -0,0 +1,374 @@
Internet Draft                                            Maynard Kang
draft-ietf-idn-mua-00.txt                                  i-EMAIL.net
February 5, 2001
Expires on August 5, 2001
Internationalizing Domain Names in Mail User Agents
Status of this Memo
This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document describes a way in which domain names used in Internet e-mail
can be internationalized by making changes only to end-user Mail User
Agents and, by doing so, avoid damaging other applications which handle
Internet e-mail, such as Message Transfer Agents and Delivery Agents.
1. Introduction
One of the proposed solutions for internationalized domain names (IDN)
involves updating only the user applications, with no changes required
to the DNS protocol, servers and resolvers [IDNA]. Other proposed
solutions require changes to be made to the protocol, servers,
resolvers and applications.
The underlying principle of [IDNA] may be similarly applied to the
Internet e-mail system today - by effecting changes to only the Mail
User Agent (MUA) component of the e-mail system. Thus, existing
Message Transfer Agents, Delivery Agents and other applications which
handle e-mail do not have to be changed at all.
1.1 Definitions and Conventions
Terms related to the character encoding model are used as defined in
Unicode Technical Report 17 [UTR17].
The terms "international character", "non-ASCII character" and
"multilingual character", which are used interchangeably, are taken
to mean any abstract character which is not included in the range
specified by [US-ASCII].
1.2 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in RFC
2119 [RFC2119].
1.3. Design Philosophy
As the Internet e-mail system is a diverse, distributed and
heterogeneous system with many vendors deploying a vast number of
applications, it is of utmost importance that interoperability amongst
these various components is maintained. Thus, the ideal solution would
be one which does not compromise or damage the operation of any of these
existing components once internationalized domain names are encountered.
Also, solutions which call for changes to be made to many or even all
components of the Internet e-mail system would require far too much
time and effort to deploy, given that Internet e-mail has such a huge
installed base.
This solution adheres to both of the above principles, in that
interoperability is preserved, the cost of implementation is low and
deployment is quick. All that the user has to do to use IDNs in e-mail
is update his or her MUA.
1.4. IDN Summary
This solution specifies an IDN architecture of arch-3 (just send ACE)
and a transition strategy of trans-1 (always do current plus new
architecture) as described in [IDNCOMP]. The choice of ACE format is not
defined in this document, but MUST be the same as that specified in
[IDNA] in order to maintain uniqueness and consistency.
1.5. E-mail Internationalization Summary
As many Internet e-mail standards such as the SMTP protocol [RFC821]
and the e-mail message format [RFC822] only specify usage of the 7-bit
ASCII character set [US-ASCII], international characters which use octet-
based character encoding schemes (CES) cannot be used in e-mail
transmission, headers and bodies.
Although this issue has been addressed in [RFC2045] for message bodies
and [RFC2047] for message headers through the use of a Transfer Encoding
Syntax (TES) such as Quoted-Printable or Base64, there is no similar
solution which extends the functionality of [RFC821] to include usage of
international characters, except for [RFC1652] which allows transmission
of 8-bit data passed by the DATA command in an SMTP session.
[RFC1652], however, does not fully address the problem of using IDNs in
an SMTP session - the IDN may be used in areas within the SMTP session
other than the DATA command, such as the MAIL FROM and RCPT TO commands,
where an IDN may be part of the e-mail address(es) specified there.
Hence, this would be a major stumbling block to deploying "just-send-
8bit" IDNs for use in Internet e-mail, as these IDNs would not be able
to be used in SMTP e-mail transmissions due to [RFC821] restrictions.
2. Architectural Overview
The end-user MUA may encounter IDNs in the scenarios below:
(i) When specifying the transmission server (i.e. SMTP server)
(ii) When specifying the retrieval server (i.e. POP3/IMAP4/any other
retrieval mechanism)
(iii) When specifying e-mail addresses during composition of a message
(iv) When reading messages with e-mail addresses in them
As in [IDNA], the MUA is updated to process IDNs which are input by
users and IDNs which are displayed to users, in all of the scenarios
above.
For (i) and (ii), the IDN MUST be handled in the same manner as
specified in [IDNA]. The method of handling an IDN for (iii) and (iv) is
described below in 2.1.
2.1 Interfaces between E-mail components when composing/reading a mail
The interfaces between e-mail components can be pictorially represented
as shown below.
The example assumes the setup of a POP3/IMAP4 retrieval client and
server, but the exact nature of end-to-end e-mail transmission may vary
accordingly (e.g. elm or pine would read directly from the mail store).
However, these variations do not impact an accurate description of this
solution to a large extent as no changes are required at these levels.
   +------+                                           +------+
   | User |                                           | User |
   +------+                                           +---^--+
      | User Input:          User Display: Characters/    |
      | Keyboard/Pen/etc     Glyphs on CRT or other       |
+-----v---------------+      Representation (e.g. sound)  |
| Input Method Editor |                      +------------|-----+
+---------------------+                      | Rendering Engine |
     | Input: Any localized/                 +---------^--------+
     | internationalized     Output: Any localized/    |
     | charset               internationalized         |
+----v-----------------+     charset                   |
| +------------------+ |                    +----------|-------------+
| | Mail Composition | |                    | +--------------+       |
| |     Interface    | | Sender's           | | Mail Reading |       |
| +------------------+ | MUA                | |   Interface  |       |
|    |                 |                    | +--------^-----+       |
|    | Nameprepped ACE |        Receiver's  |          | Nameprepped |
|    v                 |        MUA         |          | ACE         |
| +-------------+      |                    | +-------------------+  |
| | SMTP Client |      |                    | | POP3/IMAP4 Client |  |
| +-------------+      |                    | +-------------------+  |
+----|-----------------+                    +----------^-------------+
     | Nameprepped                                     | Nameprepped
     v ACE   Nameprepped             Nameprepped       | ACE
+-------------+   ACE   +------------+  ACE   +-------------------+
| SMTP Server | ----->  | Mail Store | -----> | POP3/IMAP4 Server |
+-------------+         +------------+        +-------------------+
2.1.1 Interface between User and Input Method Editor
For ASCII characters, input is straightforward: the user types on the
keyboard and whichever character is pressed is sent to the
application.
However, for international characters, the end-user has to use a script-
specific Input Method Editor (IME), which may or may not be built into
the OS, to interpret what the user communicates to the system and
thereafter send the respective international characters to the
application.
For example, for input of Chinese characters, some users use IMEs
which support the "Pinyin" input method. When a user types "zhongguo"
(in ASCII characters) on the keyboard and selects the characters which
represent "China" (in Chinese) from a list, the IME sends the
international characters to the application in a user-determined
charset (e.g. GB2312).
2.1.2 Interface between Input Method Editor and MUA Composition
Interface
The MUA mail composition interface (i.e. the "Compose Message"
function of the MUA) SHOULD be able to accept IDNs using 8-bit character
encoding schemes, including those represented in any localized (e.g.
GB2312) or internationalized (e.g. UTF-8) charsets.
This input typically takes place where e-mail addresses are entered,
such as the "From", "To", "Cc" and "Bcc" fields, amongst others, as IDNs
may be used on the right-hand side of the "@" sign in an e-mail address
(domain-parts).
The mail composition interface MAY allow ACE input for the same
reasons as specified in [IDNA], but this is not recommended as ACE is
opaque and ugly.
2.1.3 Interface between MUA Composition Interface and SMTP Client
The MUA composition interface communicates with the SMTP client in the
MUA typically through internal function calls within the software itself
or through an API. It is at this level where ACE conversion of any IDN
encountered by the MUA composition interface takes place.
Before converting the name parts of the IDN into ACE, the MUA MUST
prepare each name part as specified in [NAMEPREP]. Thereafter, the MUA
MUST convert the name parts into ACE before passing any data to the SMTP
client.
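
The conversion step above can be sketched as follows. This is a minimal illustration only, not the method of this memo: the memo leaves the choice of ACE to [IDNA], so the sketch substitutes Python's built-in "idna" codec (nameprep followed by Punycode, as later standardized) as a stand-in ACE. The function name to_ace_address is hypothetical.

```python
def to_ace_address(address: str) -> str:
    """Return the address with its domain part converted to ACE.

    Assumption: Python's "idna" codec stands in for the ACE of [IDNA];
    the memo itself does not fix a particular ACE format.
    """
    # Split on the last "@" so that only the domain part is converted.
    local_part, at, domain = address.rpartition("@")
    if not at or not domain:
        raise ValueError(f"not an addr-spec: {address!r}")
    # The codec prepares each name part and then encodes it as ASCII.
    ace_domain = domain.encode("idna").decode("ascii")
    # The local part is passed through untouched (see Section 6).
    return f"{local_part}@{ace_domain}"
```

Only the result of such a conversion would be handed to the SMTP client; an all-ASCII address passes through unchanged.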
The SMTP client then prepares the e-mail for transmission using the
SMTP protocol [RFC821], and thereafter establishes an SMTP connection
with the user-specified SMTP server to transmit the e-mail.
It is important to note that an IDN specified in the parameters of any
SMTP command MUST be represented in nameprepped ACE at this point in
time. This includes SMTP commands which require domain parameters (such
as the HELO and EHLO commands) and commands where e-mail addresses are
specified (such as the MAIL FROM, RCPT TO, DATA, VRFY, EXPN, SEND, SOML
and SAML commands).
As for data passed by the DATA command, ACE conversion MUST be
performed when the "domain" portion of an "addr-spec" or when a "domain"
itself, within the context of [RFC822], is encountered. This is
necessary as an updated MUA may originate a message which is read by a
non-updated MUA. If this happens, the non-updated MUA may face
operational problems dealing with IDNs that appear in the "addr-spec"
which are not in ACE.
Any transfer encoding syntax to be applied to the mail headers as
specified in [RFC2047] SHOULD be performed before nameprepped ACE
conversion. This is to reduce confusion between IDNs within "addr-spec"
and "domain" portions, in the context of [RFC822], and IDNs which appear
as arbitrary data in mail headers and bodies.
2.1.4. Interface between POP3/IMAP4 client (or local mail store) and
Mail Reading Interface
The MUA mail reading interface (i.e. "Read mail" function of an MUA)
typically displays e-mail data retrieved from either a POP3/IMAP4
client or from a local mail store through internal function calls within
the MUA software or through an API.
When e-mail containing an ACE-represented IDN is to be displayed, the
MUA SHOULD convert the ACE-represented IDN contained within the
"addr-spec" or "domain" portion specified in [RFC822] back into any
localized or internationalized charset of the user's choice, whenever
possible. In the event that it is impossible to achieve conversion back
into the selected localized charset (for example, conversion of RACE-
represented Hangeul characters into ISO-8859-1 is impossible), the MUA
should prompt the user with an error message.
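
A rough sketch of the display-side conversion described above, under the same assumption that Python's built-in "idna" codec stands in for the unspecified ACE. The function name display_domain and the use of ValueError as the "error message" path are illustrative choices, not part of this memo.

```python
def display_domain(ace_domain: str, user_charset: str) -> str:
    """Convert an ACE domain back to characters for display.

    Raises ValueError when the decoded name cannot be represented in
    the charset the user selected.
    """
    # Decode the ACE form back into abstract characters.
    text = ace_domain.encode("ascii").decode("idna")
    try:
        # Check that the user's charset can represent every character.
        text.encode(user_charset)
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"{ace_domain!r} cannot be shown in charset {user_charset!r}"
        ) from exc
    return text
```

A caller would catch the ValueError and prompt the user, for instance when a name containing Hangeul characters is to be shown in ISO-8859-1.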
It may be possible to save and retrieve information about the original
charset of the ACE-converted IDN through the use of additional
[RFC822] mail headers, but that is not (yet) addressed by this memo.
Although it is possible to render ACE into properly decoded glyphs and
display the actual abstract characters without any conversion to other
charsets, the MUA SHOULD NOT do this as it is not the primary function
of an MUA to render characters. This should be left to a rendering
engine which is separate from the MUA and typically embedded into the
OS. It is sufficient for the MUA to pass the appropriate charset to the
rendering engine for proper display.
3. ACE Length Considerations
As [RFC821] in Section 4.5.3 restricts the maximum total length of a
domain name to 64 characters, representation of IDNs using ACE may
pose a potential problem. Most ACEs typically require 3-4 ASCII
characters to represent one international character (especially in the
case of CJK characters, where compression is less effective).
That would leave only about 16-21 characters for the whole IDN,
including all name parts and dots. This is highly undesirable as names
in some languages, such as Arabic, cannot be abbreviated and the domain
names may require a greater length than that which is allowed by
[RFC821].
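
The length budget can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: it ignores dots and any ACE prefix, so the real budget is smaller, and ace_length uses Python's "idna" codec as an assumed stand-in for the ACE. Both function names are hypothetical.

```python
RFC821_DOMAIN_LIMIT = 64  # maximum domain length per [RFC821], 4.5.3


def max_international_chars(ascii_per_char: int,
                            limit: int = RFC821_DOMAIN_LIMIT) -> int:
    """Upper bound on international characters that fit in the limit."""
    return limit // ascii_per_char


def ace_length(domain: str) -> int:
    """Length in octets of the domain after ACE conversion."""
    return len(domain.encode("idna"))
```

With 4 ASCII characters per international character the bound is 16; with 3 it is 21. An MUA could compare ace_length(domain) against the limit before attempting transmission.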
To further complicate matters, several mailing list software packages
such as ezmlm embed domain names into the local-part of an e-mail
address during management of subscriptions, together with randomly-
generated subscription information. This would leave an even smaller
maximum ACE length, if interoperability with these mailing list packages
were to be maintained, given that there is also a 64 character
restriction on local parts.
4. Security Considerations
As this memo is based on [IDNA], its security considerations are similar
to those faced by [IDNA]. These include the security considerations from
[NAMEPREP] as well.
5. Other Considerations
Although this document addresses end-user MUAs (e.g. elm, mutt, pine,
Eudora, Outlook Express, etc) to a large extent, the definition of an
MUA could be extended to include web-based e-mail server software and
automated programs such as mailing list management software.
End-user MUAs may also include additional functionality where IDNs may
be encountered, such as calendaring/scheduling, directory services and
digital certificate storage. This is not (yet) addressed in this memo.
6. Future Extensions
It is possible to achieve internationalization of the entire e-mail
address by representing international characters in the local-part
of an "addr-spec" using nameprepped ACE conversion, in a fashion similar
to that described in this memo.
However, this is a different problem altogether and is currently beyond
the scope of this memo.
7. References
[IDNA] Paul Hoffman & Patrik Faltstrom, "Internationalizing Host Names
in Applications (IDNA)", draft-ietf-idn-idna.
[UTR17] K. Whistler & M. Davis, Unicode Consortium, "Character Encoding
Model", Unicode Technical Report #17,
http://www.unicode.org/unicode/reports/tr17/
[US-ASCII] United States of America Standards Institute, "USA Code for
Information Interchange", X3.4, 1968.
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.
[RFC821] Jonathan B. Postel, "Simple Mail Transfer Protocol", August
1982, RFC 821.
[RFC822] David H. Crocker, "Standard for the Format of ARPA Internet
Text Messages", August 1982, RFC 822.
[RFC2045] N. Freed & N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies",
November 1996, RFC 2045.
[RFC2047] K. Moore, "MIME (Multipurpose Internet Mail Extensions)
Part Three: Message Header Extensions for Non-ASCII Text", November
1996, RFC 2047.
[RFC1652] J. Klensin et al., "SMTP Service Extension for 8bit-
MIMEtransport", July 1994, RFC 1652.
[NAMEPREP] Paul Hoffman & Marc Blanchet, "Preparation of
Internationalized Host Names", draft-ietf-idn-nameprep.
A. Author's Address
Maynard Kang
i-EMAIL.net Pte Ltd
1 Kim Seng Promenade #12-07
Great World City West Tower
Singapore 237994
E-mail: maynard@i-email.net


@ -1,855 +0,0 @@
Internet Draft                                            Paul Hoffman
draft-ietf-idn-nameprep-00.txt                              IMC & VPNC
July 3, 2000                                             Marc Blanchet
Expires in six months                                         ViaGenie
Preparation of Internationalized Host Names
Status of this memo
This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document describes how to prepare internationalized host names for
transmission on the wire. The steps include excluding characters that
are prohibited from appearing in internationalized host names, changing
all characters that have case properties to be lowercase, and
normalizing the characters. Further, this document lists the prohibited
characters.
1. Introduction
When expanding today's DNS to include internationalized host names,
those new names will be handled in many parts of the DNS. The IDN
Working Group's requirements document [IDNReq] describes a framework for
domain name handling as well as requirements for the new names. The IDN
Working Group's comparison document [IDNComp] gives a framework for how
various parts of the IDN solution work together.
A user can enter a domain name into an application program in a myriad
of fashions. Depending on the input method, the characters entered in
the domain name may or may not be those that are allowed in
internationalized host names. Thus, there must be a way to canonicalize
the user's input before the name is resolved in the DNS.
It is a design goal of this document to allow users to enter host names
in applications and have the highest chance of getting the name correct.
This means that the user should not be limited to only entering exactly
the characters that might have been used, but to instead be able to
enter characters that unambiguously canonicalize to characters in the
desired host name. At the same time, this process must not introduce any
chance that two host names could be represented by two distinct strings
of characters that look identical to typical users. It is also a design
goal to have all preprocessing of IDN done before going on the wire, so
that no transformation is done in the DNS server space.
This document describes the steps needed to convert a name part from one
that is entered by the user to one that can be used in the DNS.
1.1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].
Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the
letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
A". In the lists of prohibited characters, the "U+" is left off to make
the lists easier to read.
1.2 IDN summary
Using the terminology in [IDNComp], this document specifies all of the
prohibited characters and the canonicalization for an IDN solution.
Specifically, it covers the following sections from [IDNComp]:
prohib-1: Identical and near-identical characters
prohib-2: Separators
prohib-3: Non-displaying and non-spacing characters
prohib-4: Private use characters
prohib-5: Punctuation
prohib-6: Symbols
canon-1.2: Normalization Form KC
canon-2.1: Case folding in ASCII
canon-2.2: Case folding in non-ASCII
Note that this document does not cover:
canon-1.1: Normalization Form C
canon-2.3: Han folding
1.3 Open issues
This is the first draft of this document. Although there has been much
discussion on the WG mailing list about the topics here, there has not
yet been much agreement on some issues. Now that there is a document to
talk about, that discussion can be more focussed.
1.3.1 Where to do name preparation
Section 2.1 says to do name preparation in the resolver. An argument can
be made for doing name preparation in the application, before the
application service interface. An advantage of that proposal is that
resolvers would not need to do any name preparation. A disadvantage is
that applications would have to be updated each time the IDN protocol is
updated, such as if new characters are added to the repertoire of
allowed characters. It seems likely that resolvers are more easily
updated than all the individual applications that use internationalized
host names.
1.3.2 Choosing between normalization form C and KC
Much of the discussion of normalization on the WG mailing list assumed
that normalization form C would be used. Near the time that this
document was written, people started considering form KC instead of C.
This document used form KC, but the reasons for doing so could be
contentious.
1.3.3 Does the prohibition catch all bad characters?
The mailing list discussed doing prohibition in two steps: a
short list of prohibited characters before case folding in order to
prevent uppercase characters that have no lowercase equivalents from
getting through, and then a full check on the output of normalization.
In this draft, all checking is done before case folding, based on the
(possibly wrong) assumption that none of the prohibited characters will
re-appear after the case folding and normalization. If that assumption
turns out to be wrong, a check for just those problematic characters can
be added after normalization, or a full check against the prohibited
characters can be added.
2. Preparation Overview
This section describes where name preparation happens and the steps that
name preparation software must take.
2.1 Where name preparation happens
Part of the chart in section 1.4 of [IDNReq] looks like this:
+---------------+
|  Application  |
+---------------+
        | Application service interface
        | For ex. GethostbyXXXX interface
+---------------+
|   Resolver    |
+---------------+
        | <----- DNS service interface
+-------------------------------------------+
In this specification, the name preparation is done in the resolver,
before the DNS service interface. That is, it is acceptable for software
in the application service interface (such as a "GetHostByName" API) to
pass the resolver a name that has not been prepared. However, the
resolver MUST prepare the name as described in this specification before
passing it to the DNS service interface.
2.2 Name preparation steps
The steps for preparing names are:
1) Input from the application service interface -- This can be done in
many ways and is not specified in this document
2) Look for prohibited input -- Check for any characters that are not
allowed in the input. If any are found, return an error to the
application service interface. This step is necessary to prevent errors
in the following two steps. This step fulfills prohib-1, prohib-2,
prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp].
3) Fold case -- Change all uppercase characters into lowercase
characters. Design note: this step could just as easily have been
"change all lowercase characters into uppercase characters". However,
the upper-to-lower folding was chosen because most users of the Internet
today enter host names in lowercase. This step fulfills canon-2.1 and
canon-2.2 from [IDNComp].
4) Canonicalize -- Normalize the characters. This step fulfills canon-1.2
from [IDNComp].
5) Resolution of the prepared name -- This must be specified in a
different IDN document.
The above steps MUST be performed in the order given in order to comply
with this specification.
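
Steps 2 through 4 above can be sketched as follows. This is a simplified illustration under stated assumptions: PROHIBITED holds only a handful of the characters from Section 3, str.lower() approximates the case folding step, and the names prepare_name_part and ProhibitedCharacter are hypothetical.

```python
import unicodedata

# Illustrative subset of Section 3 only; a real table would be complete.
PROHIBITED = {"\u0020", "\u00a0", "\u00ad", "\u002e", "\u0040"}


class ProhibitedCharacter(ValueError):
    """Raised when the input contains a prohibited character."""


def prepare_name_part(name_part: str) -> str:
    # Step 2: look for prohibited input and fail before any mapping.
    for ch in name_part:
        if ch in PROHIBITED:
            raise ProhibitedCharacter(f"U+{ord(ch):04X} is not allowed")
    # Step 3: fold case (upper to lower).
    folded = name_part.lower()
    # Step 4: canonicalize with Normalization Form KC.
    return unicodedata.normalize("NFKC", folded)
```

The order of operations mirrors the numbered steps: checking first means that folding and normalization never see prohibited input.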
3. Prohibited Input
Before the text can be processed, it must be checked for prohibited
characters. There is a variety of prohibited characters, as described in
this section.
Note that one of the goals of IDN is to allow the widest possible set of
host names as long as those host names do not cause other problems, such
as possible ambiguity. Specifically, experience with current DNS names
has shown that there is a desire for host names that include personal
names, company names, and spoken phrases. A goal of this section is to
prohibit as few characters that might be used in these contexts as
possible while making sure that characters that might easily cause
confusion or ambiguity are prohibited.
Note that every character listed in this section MUST NOT be transmitted
on the DNS service interface. Although the checking is being performed
before case folding and canonicalization, those steps cannot result in
any of these characters if these characters are not in the input stream.
[[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS
server receives a request containing a prohibited character, then the
IDN protocol MUST return an error message.
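
One way the algorithmic check called for above might be carried out is to scan every plane 0 code point that is not itself prohibited and test whether folding and normalization produce a prohibited character. The sketch below is an assumption-laden illustration: it uses Python's unicodedata module (whose Unicode version is newer than the one this document targets), str.lower() for case folding, and a hypothetical function name.

```python
import unicodedata


def find_reappearing(prohibited: frozenset) -> list:
    """Return code points whose prepared form contains a prohibited
    character even though the code point itself is not prohibited."""
    offenders = []
    for cp in range(0x10000):
        # Surrogates are handled separately (Section 3.7.2).
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if ch in prohibited:
            continue
        # Apply the same mapping as steps 3 and 4 of Section 2.2.
        prepared = unicodedata.normalize("NFKC", ch.lower())
        if any(c in prohibited for c in prepared):
            offenders.append(cp)
    return offenders
```

An empty result for the full prohibited table would support the statement; any code point returned would need to be added to the prohibited lists.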
Note that some characters listed in one section would also appear in
other sections. Each character is only listed once.
3.1 prohib-1: Identical and near-identical characters
Many characters in [ISO10646] are identical or nearly identical to other
characters. These were often included for compatibility with other
character sets.
The characters prohibited because they are identical or nearly identical
to allowed characters are:
00AD SOFT HYPHEN
00D7 MULTIPLICATION SIGN
01C3 LATIN LETTER RETROFLEX CLICK
02B0-02FF [SPACING MODIFIER LETTERS]
066D ARABIC FIVE POINTED STAR
1806 MONGOLIAN TODO SOFT HYPHEN
2010 HYPHEN
2011 NON-BREAKING HYPHEN
2012 FIGURE DASH
2013 EN DASH
2014 EM DASH
2160-217F [ROMAN NUMERALS]
FB1D-FB4F [HEBREW PRESENTATION FORMS]
FB50-FDFF [ARABIC PRESENTATION FORMS A]
FE20-FE2F [COMBINING HALF MARKS]
FE30-FE4F [CJK COMPATIBILITY FORMS]
FE50-FE6F [SMALL FORM VARIANTS]
FE70-FEFC [ARABIC PRESENTATION FORMS B]
FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS]
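
A list such as the one above mixes single code points and ranges, which suggests a table of inclusive ranges. The sketch below holds only a few of the entries as an illustration; the names PROHIB_1_RANGES and is_prohibited are hypothetical.

```python
# A few entries from the list above, as inclusive (first, last) pairs.
PROHIB_1_RANGES = [
    (0x00AD, 0x00AD),  # SOFT HYPHEN
    (0x00D7, 0x00D7),  # MULTIPLICATION SIGN
    (0x02B0, 0x02FF),  # SPACING MODIFIER LETTERS
    (0x2010, 0x2014),  # HYPHEN through EM DASH
    (0x2160, 0x217F),  # ROMAN NUMERALS
    (0xFF00, 0xFFEF),  # HALFWIDTH AND FULLWIDTH FORMS
]


def is_prohibited(ch: str) -> bool:
    """True if the character falls inside any listed range."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in PROHIB_1_RANGES)
```

A linear scan is adequate for a table of this size; a sorted table with binary search would serve if the full set of ranges were loaded.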
3.2 prohib-2: Separators
Horizontal and vertical spacing characters would make it unclear where a
host name begins and ends. The prohibited spacing characters are:
0020 SPACE
00A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
2000-200B [SPACES]
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
3000 IDEOGRAPHIC SPACE
Allowing periods and period-like characters as characters within a name
part would also cause similar confusion. The prohibited periods,
characters that look like periods, and characters that canonicalize to a
period or to a period-like character are:
002E FULL STOP
06D4 ARABIC FULL STOP
2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
2488 DIGIT ONE FULL STOP
2489 DIGIT TWO FULL STOP
248A DIGIT THREE FULL STOP
248B DIGIT FOUR FULL STOP
248C DIGIT FIVE FULL STOP
248D DIGIT SIX FULL STOP
248E DIGIT SEVEN FULL STOP
248F DIGIT EIGHT FULL STOP
2490 DIGIT NINE FULL STOP
2491 NUMBER TEN FULL STOP
2492 NUMBER ELEVEN FULL STOP
2493 NUMBER TWELVE FULL STOP
2494 NUMBER THIRTEEN FULL STOP
2495 NUMBER FOURTEEN FULL STOP
2496 NUMBER FIFTEEN FULL STOP
2497 NUMBER SIXTEEN FULL STOP
2498 NUMBER SEVENTEEN FULL STOP
2499 NUMBER EIGHTEEN FULL STOP
249A NUMBER NINETEEN FULL STOP
249B NUMBER TWENTY FULL STOP
33C2 SQUARE AM
33C7 SQUARE CO
33D8 SQUARE PM
3.3 prohib-3: Non-displaying and non-spacing characters
Many characters in the ISO 10646 character set cannot be seen. These
include control characters, non-breaking spaces, formatting
characters, and tagging characters. These characters would certainly
cause confusion if allowed in host names.
0000-001F [CONTROL CHARACTERS]
007F DELETE
0080-009F [CONTROL CHARACTERS]
070F SYRIAC ABBREVIATION MARK
180B MONGOLIAN FREE VARIATION SELECTOR ONE
180C MONGOLIAN FREE VARIATION SELECTOR TWO
180D MONGOLIAN FREE VARIATION SELECTOR THREE
180E MONGOLIAN VOWEL SEPARATOR
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
202A LEFT-TO-RIGHT EMBEDDING
202B RIGHT-TO-LEFT EMBEDDING
202C POP DIRECTIONAL FORMATTING
202D LEFT-TO-RIGHT OVERRIDE
202E RIGHT-TO-LEFT OVERRIDE
206A INHIBIT SYMMETRIC SWAPPING
206B ACTIVATE SYMMETRIC SWAPPING
206C INHIBIT ARABIC FORM SHAPING
206D ACTIVATE ARABIC FORM SHAPING
206E NATIONAL DIGIT SHAPES
206F NOMINAL DIGIT SHAPES
FEFF ZERO WIDTH NO-BREAK SPACE
FFF9 INTERLINEAR ANNOTATION ANCHOR
FFFA INTERLINEAR ANNOTATION SEPARATOR
FFFB INTERLINEAR ANNOTATION TERMINATOR
FFFC OBJECT REPLACEMENT CHARACTER
FFFD REPLACEMENT CHARACTER
3.4 prohib-4: Private use characters
Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are:
E000-F8FF [PRIVATE USE, PLANE 0]
3.5 prohib-5: Punctuation
The following characters are reserved or delimiters in URLs [RFC2396]
and [RFC2732]:
" # $ % & + , . / : ; < = > ? @ [ ]
3.5.1 Characters from URLs
The following punctuation characters are prohibited because they are
reserved or delimiters in URLs.
0022 QUOTATION MARK
0023 NUMBER SIGN
0024 DOLLAR SIGN
0025 PERCENT SIGN
0026 AMPERSAND
002B PLUS SIGN
002C COMMA
002E FULL STOP
002F SOLIDUS
003A COLON
003B SEMICOLON
003C LESS-THAN SIGN
003D EQUALS SIGN
003E GREATER-THAN SIGN
003F QUESTION MARK
0040 COMMERCIAL AT
005B LEFT SQUARE BRACKET
005D RIGHT SQUARE BRACKET
3.5.2 Characters that canonicalize to characters from URLs
The following punctuation characters are prohibited because their
normalization contains one or more of the characters from section 3.5.1.
037E GREEK QUESTION MARK
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
207A SUPERSCRIPT PLUS SIGN
207C SUPERSCRIPT EQUALS SIGN
208A SUBSCRIPT PLUS SIGN
208C SUBSCRIPT EQUALS SIGN
2100 ACCOUNT OF
2101 ADDRESSED TO THE SUBJECT
2105 CARE OF
2106 CADA UNA
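
Membership in a list like the one above can be tested mechanically by normalizing a character and looking for URL characters in the result. The sketch assumes Normalization Form KC as provided by Python's unicodedata module; the names URL_CHARS and normalizes_to_url_char are hypothetical.

```python
import unicodedata

# The reserved and delimiter characters listed in section 3.5.
URL_CHARS = set('"#$%&+,./:;<=>?@[]')


def normalizes_to_url_char(ch: str) -> bool:
    """True if ch is not itself a URL character but its NFKC form
    contains one."""
    if ch in URL_CHARS:
        return False
    normalized = unicodedata.normalize("NFKC", ch)
    return any(c in URL_CHARS for c in normalized)
```

For example, U+2105 CARE OF normalizes to the three characters "c/o", which contain a SOLIDUS.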
3.5.3 Characters that look like characters from URLs
The following are prohibited because they look indistinguishable from
the characters listed in section 3.5.1.
037E GREEK QUESTION MARK
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
061B ARABIC SEMICOLON
066A ARABIC PERCENT SIGN
201A SINGLE LOW-9 QUOTATION MARK
2030 PER MILLE SIGN
2031 PER TEN THOUSAND SIGN
2033 DOUBLE PRIME
2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D INTERROBANG
2044 FRACTION SLASH
3001 IDEOGRAPHIC COMMA
3002 IDEOGRAPHIC FULL STOP
3003 DITTO MARK
3008 LEFT ANGLE BRACKET
3009 RIGHT ANGLE BRACKET
3014 LEFT TORTOISE SHELL BRACKET
3015 RIGHT TORTOISE SHELL BRACKET
301A LEFT WHITE SQUARE BRACKET
301B RIGHT WHITE SQUARE BRACKET
3.5.4 Other punctuation
The following punctuation character is prohibited because it is unlikely
to be used in names and may be confusing to users or to character-entry
processes:
005C REVERSE SOLIDUS
3.6 prohib-6: Symbols
[UniData] has non-normative categories for symbols. The four symbol
categories are:
Symbol, Currency: Currency symbols could appear in company names and
spoken phrases, so they are not prohibited.
Symbol, Modifier: Stand-alone modifiers might appear in personal names,
company names, and spoken phrases, so they are not prohibited.
Symbol, Math: It is very unlikely that there are any significant
personal names, company names, or spoken phrases that contain
mathematical symbols. Further, many of these symbols are the same as or
similar to other punctuation, thereby leading to ambiguity. For this
reason, math-specific symbols are prohibited. These prohibited math
symbols are:
00AC NOT SIGN
00B1 PLUS-MINUS SIGN
2200-22FF [MATHEMATICAL OPERATORS]
Further, the following characters canonicalize to characters in the
above math list, and therefore are also prohibited:
00BC VULGAR FRACTION ONE QUARTER
00BD VULGAR FRACTION ONE HALF
00BE VULGAR FRACTION THREE QUARTERS
207B SUPERSCRIPT MINUS
208B SUBSCRIPT MINUS
2153 VULGAR FRACTION ONE THIRD
2154 VULGAR FRACTION TWO THIRDS
2155 VULGAR FRACTION ONE FIFTH
2156 VULGAR FRACTION TWO FIFTHS
2157 VULGAR FRACTION THREE FIFTHS
2158 VULGAR FRACTION FOUR FIFTHS
2159 VULGAR FRACTION ONE SIXTH
215A VULGAR FRACTION FIVE SIXTHS
215B VULGAR FRACTION ONE EIGHTH
215C VULGAR FRACTION THREE EIGHTHS
215D VULGAR FRACTION FIVE EIGHTHS
215E VULGAR FRACTION SEVEN EIGHTHS
215F FRACTION NUMERATOR ONE
33A7 SQUARE M OVER S
33A8 SQUARE M OVER S SQUARED
33AE SQUARE RAD OVER S
33AF SQUARE RAD OVER S SQUARED
33C6 SQUARE C OVER KG
Symbol, Other: This category covers a multitude of symbols, few of which
would ever appear in personal names, company names, and spoken phrases.
The rest of the prohibited symbols are:
2190-21FF [ARROWS]
2300-23FF [MISCELLANEOUS TECHNICAL]
2400-243F [CONTROL PICTURES]
2440-245F [OPTICAL CHARACTER RECOGNITION]
2500-257F [BOX DRAWING]
2580-259F [BLOCK ELEMENTS]
25A0-25FF [GEOMETRIC SHAPES]
2600-267F [MISCELLANEOUS SYMBOLS]
2700-27BF [DINGBATS]
2800-287F [BRAILLE PATTERNS]
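The statement in this section that certain fractions, superscripts and
unit symbols canonicalize to mathematical characters can be checked
against any implementation of normalization form KC [UTR15]. The
following is a minimal sketch in Python, assuming the standard
unicodedata module (which tracks a newer Unicode version than the one
this draft was written against). The function name kc_expansion is
illustrative only and is not defined by this document.

```python
import unicodedata

def kc_expansion(ch):
    """Return 'XXXX NAME' for each character that form KC yields for ch."""
    return ["%04X %s" % (ord(c), unicodedata.name(c))
            for c in unicodedata.normalize("NFKC", ch)]

# SUPERSCRIPT MINUS expands to a character in the 2200-22FF range.
superscript_minus = kc_expansion("\u207b")

# VULGAR FRACTION ONE HALF expands to digits around a fraction slash.
one_half = kc_expansion("\u00bd")
```

A registry tool could run such a check over candidate characters to
find any whose expansion lands in a prohibited range.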
3.7 Additional prohibited characters
3.7.1 Unassigned characters
All characters not yet assigned in [ISO10646] are prohibited. Although
this may at first seem trivial, it is extremely important because
characters that may be assigned in the future might have properties that
would cause them to be prohibited or might have case-folding properties.
As is the case of all prohibited characters, if a DNS server receives a
request containing an unassigned character, then the IDN protocol MUST
return an error message.
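As an illustration of the unassigned-character rule, the following
Python sketch rejects input containing code points with no assignment.
It assumes that the general category "Cn" reported by the standard
unicodedata module is an acceptable stand-in for "not yet assigned in
[ISO10646]". That category also covers noncharacters, and it reflects
the Unicode version bundled with the Python build rather than any
version of the IDN table. The function names are hypothetical.

```python
import unicodedata

def is_unassigned(ch):
    # "Cn" is the general category for code points that have no
    # assigned character in the bundled Unicode database.
    return unicodedata.category(ch) == "Cn"

def check_assigned(text):
    """Return text unchanged, or raise if it has an unassigned character."""
    for ch in text:
        if is_unassigned(ch):
            raise ValueError("unassigned character U+%04X" % ord(ch))
    return text
```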
3.7.2 Surrogate characters
So far, all proposals for binary encodings of internationalized name
parts have specified UTF-8 as the encoding format. In such an encoding,
surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
the following are prohibited:
D800-DFFF [SURROGATE CHARACTERS]
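A check for the surrogate range might look like the following Python
sketch. It is only an illustration under the assumption that name parts
are handled as sequences of code points before being encoded as UTF-8;
both function names are hypothetical.

```python
def contains_surrogate(text):
    """True if any code point in text lies in the range D800-DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

def to_utf8_name_part(text):
    # Refuse surrogates explicitly rather than relying on the codec error.
    if contains_surrogate(text):
        raise ValueError("surrogate code point in name part")
    return text.encode("utf-8")
```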
3.7.3 Uppercase characters with no lowercase mappings
There are many uppercase characters in [ISO10646] which do not have
lowercase equivalents in [UniData]. Therefore, they are prohibited on
input because they would get through the case mapping step while still
being in uppercase.
The characters that are prohibited on input because they are uppercase
but have no lowercase mappings are:
03D2 GREEK UPSILON WITH HOOK SYMBOL
03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0 CYRILLIC LETTER PALOCHKA
10A0-10C5 [GEORGIAN CAPITAL LETTERS]
Note that many characters in the range U+2100 to U+213A, the letterlike
symbols, also are uppercase but have no lowercase mappings. However,
they are not listed here because the entire range is already prohibited
in section 3.6.
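The test described in this section (uppercase, but no lowercase
mapping) can be approximated as in the Python sketch below. It assumes
the general category "Lu" identifies uppercase characters and uses the
lowercase mapping of the Unicode version bundled with Python. Later
Unicode versions have added lowercase mappings for some characters, so
results for parts of the list above may differ from the data this draft
was based on. The function name is illustrative.

```python
import unicodedata

def is_uppercase_without_lowercase(ch):
    # Uppercase letter ("Lu") whose lowercase mapping is itself.
    return unicodedata.category(ch) == "Lu" and ch.lower() == ch
```

For example, U+03D2 GREEK UPSILON WITH HOOK SYMBOL satisfies the test,
while U+0041 LATIN CAPITAL LETTER A does not.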
3.7.4 Radicals and Ideographic Description
Some Han characters can be informally defined in terms of ideographic
descriptions. However, ideographic descriptions can lead to multiple
character streams leading to the same character in a fashion that does
not canonicalize. Thus, the radicals for ideographic description and the
ideographic description characters themselves are prohibited. These
characters are:
2E80-2EFF [CJK RADICALS SUPPLEMENT]
2F00-2FDF [KANGXI RADICALS]
2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS]
3.8 Summary of prohibited characters
The following is a collected list from the previous sections.
0000-001F [CONTROL CHARACTERS]
0020 SPACE
0022 QUOTATION MARK
0023 NUMBER SIGN
0024 DOLLAR SIGN
0025 PERCENT SIGN
0026 AMPERSAND
002B PLUS SIGN
002C COMMA
002E FULL STOP
002F SOLIDUS
003A COLON
003B SEMICOLON
003C LESS-THAN SIGN
003D EQUALS SIGN
003E GREATER-THAN SIGN
003F QUESTION MARK
0040 COMMERCIAL AT
005B LEFT SQUARE BRACKET
005C REVERSE SOLIDUS
005D RIGHT SQUARE BRACKET
007F DELETE
0080-009F [CONTROL CHARACTERS]
00A0 NO-BREAK SPACE
00AC NOT SIGN
00AD SOFT HYPHEN
00B1 PLUS-MINUS SIGN
00BC VULGAR FRACTION ONE QUARTER
00BD VULGAR FRACTION ONE HALF
00BE VULGAR FRACTION THREE QUARTERS
00D7 MULTIPLICATION SIGN
01C3 LATIN LETTER RETROFLEX CLICK
02B0-02FF [SPACING MODIFIER LETTERS]
037E GREEK QUESTION MARK
03D2 GREEK UPSILON WITH HOOK SYMBOL
03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0 CYRILLIC LETTER PALOCHKA
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
061B ARABIC SEMICOLON
066A ARABIC PERCENT SIGN
066D ARABIC FIVE POINTED STAR
06D4 ARABIC FULL STOP
070F SYRIAC ABBREVIATION MARK
10A0-10C5 [GEORGIAN CAPITAL LETTERS]
1680 OGHAM SPACE MARK
1806 MONGOLIAN TODO SOFT HYPHEN
180B MONGOLIAN FREE VARIATION SELECTOR ONE
180C MONGOLIAN FREE VARIATION SELECTOR TWO
180D MONGOLIAN FREE VARIATION SELECTOR THREE
180E MONGOLIAN VOWEL SEPARATOR
2000-200B [SPACES]
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
2010 HYPHEN
2011 NON-BREAKING HYPHEN
2012 FIGURE DASH
2013 EN DASH
2014 EM DASH
201A SINGLE LOW-9 QUOTATION MARK
2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202A LEFT-TO-RIGHT EMBEDDING
202B RIGHT-TO-LEFT EMBEDDING
202C POP DIRECTIONAL FORMATTING
202D LEFT-TO-RIGHT OVERRIDE
202E RIGHT-TO-LEFT OVERRIDE
202F NARROW NO-BREAK SPACE
2030 PER MILLE SIGN
2031 PER TEN THOUSAND SIGN
2033 DOUBLE PRIME
2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D INTERROBANG
2044 FRACTION SLASH
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
206A INHIBIT SYMMETRIC SWAPPING
206B ACTIVATE SYMMETRIC SWAPPING
206C INHIBIT ARABIC FORM SHAPING
206D ACTIVATE ARABIC FORM SHAPING
206E NATIONAL DIGIT SHAPES
206F NOMINAL DIGIT SHAPES
207A SUPERSCRIPT PLUS SIGN
207B SUPERSCRIPT MINUS
207C SUPERSCRIPT EQUALS SIGN
208A SUBSCRIPT PLUS SIGN
208B SUBSCRIPT MINUS
208C SUBSCRIPT EQUALS SIGN
2100 ACCOUNT OF
2101 ADDRESSED TO THE SUBJECT
2105 CARE OF
2106 CADA UNA
2153 VULGAR FRACTION ONE THIRD
2154 VULGAR FRACTION TWO THIRDS
2155 VULGAR FRACTION ONE FIFTH
2156 VULGAR FRACTION TWO FIFTHS
2157 VULGAR FRACTION THREE FIFTHS
2158 VULGAR FRACTION FOUR FIFTHS
2159 VULGAR FRACTION ONE SIXTH
215A VULGAR FRACTION FIVE SIXTHS
215B VULGAR FRACTION ONE EIGHTH
215C VULGAR FRACTION THREE EIGHTHS
215D VULGAR FRACTION FIVE EIGHTHS
215E VULGAR FRACTION SEVEN EIGHTHS
215F FRACTION NUMERATOR ONE
2160-217F [ROMAN NUMERALS]
2190-21FF [ARROWS]
2200-22FF [MATHEMATICAL OPERATORS]
2300-23FF [MISCELLANEOUS TECHNICAL]
2400-243F [CONTROL PICTURES]
2440-245F [OPTICAL CHARACTER RECOGNITION]
2488 DIGIT ONE FULL STOP
2489 DIGIT TWO FULL STOP
248A DIGIT THREE FULL STOP
248B DIGIT FOUR FULL STOP
248C DIGIT FIVE FULL STOP
248D DIGIT SIX FULL STOP
248E DIGIT SEVEN FULL STOP
248F DIGIT EIGHT FULL STOP
2490 DIGIT NINE FULL STOP
2491 NUMBER TEN FULL STOP
2492 NUMBER ELEVEN FULL STOP
2493 NUMBER TWELVE FULL STOP
2494 NUMBER THIRTEEN FULL STOP
2495 NUMBER FOURTEEN FULL STOP
2496 NUMBER FIFTEEN FULL STOP
2497 NUMBER SIXTEEN FULL STOP
2498 NUMBER SEVENTEEN FULL STOP
2499 NUMBER EIGHTEEN FULL STOP
249A NUMBER NINETEEN FULL STOP
249B NUMBER TWENTY FULL STOP
2500-257F [BOX DRAWING]
2580-259F [BLOCK ELEMENTS]
25A0-25FF [GEOMETRIC SHAPES]
2600-267F [MISCELLANEOUS SYMBOLS]
2700-27BF [DINGBATS]
2800-287F [BRAILLE PATTERNS]
2E80-2EFF [CJK RADICALS SUPPLEMENT]
2F00-2FDF [KANGXI RADICALS]
2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS]
3000 IDEOGRAPHIC SPACE
3001 IDEOGRAPHIC COMMA
3002 IDEOGRAPHIC FULL STOP
3003 DITTO MARK
3008 LEFT ANGLE BRACKET
3009 RIGHT ANGLE BRACKET
33A7 SQUARE M OVER S
33A8 SQUARE M OVER S SQUARED
33AE SQUARE RAD OVER S
33AF SQUARE RAD OVER S SQUARED
33C2 SQUARE AM
33C6 SQUARE C OVER KG
33C7 SQUARE CO
33D8 SQUARE PM
D800-DFFF [SURROGATE CHARACTERS]
E000-F8FF [PRIVATE USE, PLANE 0]
FB1D-FB4F [HEBREW PRESENTATION FORMS]
FB50-FDFF [ARABIC PRESENTATION FORMS A]
FE20-FE2F [COMBINING HALF MARKS]
FE30-FE4F [CJK COMPATIBILITY FORMS]
FE50-FE6F [SMALL FORM VARIANTS]
FE70-FEFC [ARABIC PRESENTATION FORMS B]
FEFF ZERO WIDTH NO-BREAK SPACE
FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS]
FFF9 INTERLINEAR ANNOTATION ANCHOR
FFFA INTERLINEAR ANNOTATION SEPARATOR
FFFB INTERLINEAR ANNOTATION TERMINATOR
FFFC OBJECT REPLACEMENT CHARACTER
FFFD REPLACEMENT CHARACTER
Unassigned characters
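A list of this shape lends itself to a simple range lookup. The Python
sketch below is an illustration only: the table holds a small excerpt
of the summary above, not the full list, and the names
PROHIBITED_EXCERPT and first_prohibited are hypothetical. Unassigned
characters would need a separate check.

```python
# Excerpt of the summary list as (first, last) code point pairs.
PROHIBITED_EXCERPT = [
    (0x0000, 0x001F),  # [CONTROL CHARACTERS]
    (0x0020, 0x0020),  # SPACE
    (0x002E, 0x002E),  # FULL STOP
    (0x0040, 0x0040),  # COMMERCIAL AT
    (0x005C, 0x005C),  # REVERSE SOLIDUS
    (0x2190, 0x21FF),  # [ARROWS]
    (0x2200, 0x22FF),  # [MATHEMATICAL OPERATORS]
    (0xD800, 0xDFFF),  # [SURROGATE CHARACTERS]
    (0xE000, 0xF8FF),  # [PRIVATE USE, PLANE 0]
]

def first_prohibited(text, table=PROHIBITED_EXCERPT):
    """Return the first character of text found in table, or None."""
    for ch in text:
        cp = ord(ch)
        if any(first <= cp <= last for first, last in table):
            return ch
    return None
```

An implementation holding the complete table would probably sort the
ranges and use a binary search instead of the linear scan shown here.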
4. Case Folding
After it has been verified that the input text has none of the
characters prohibited for case folding, the case-folding step itself is
quite straightforward. For each character in the input, if there is a
lowercase mapping for that character in [UniData], the input character
is changed to the mapped lowercase letter.
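The step above can be sketched in Python as follows. The sketch assumes
the semicolon-separated layout of the UnicodeData file, in which the
first field is the code point and the fourteenth field (index 13) is
the simple lowercase mapping. The three sample lines are written in
that layout for illustration, and the function names are hypothetical.

```python
SAMPLE_UNIDATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
"""

def load_lowercase_map(lines):
    """Build {character: lowercase character} from UnicodeData lines."""
    mapping = {}
    for line in lines:
        fields = line.strip().split(";")
        # Field 13 is empty when the character has no lowercase mapping.
        if len(fields) > 13 and fields[13]:
            mapping[chr(int(fields[0], 16))] = chr(int(fields[13], 16))
    return mapping

def fold_case(text, mapping):
    # Characters without a mapping pass through unchanged.
    return "".join(mapping.get(ch, ch) for ch in text)

LOWERCASE = load_lowercase_map(SAMPLE_UNIDATA.splitlines())
```

Using a per-character table keeps the result the same length as the
input, which a language's built-in lowercasing routine may not
guarantee.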
5. Canonicalization
After case folding, the input string is normalized using form KC, as
described in [UTR15].
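Where a library implementation of [UTR15] is available, this step is a
single call. The Python sketch below assumes the standard unicodedata
module; the wrapper name canonicalize is illustrative only.

```python
import unicodedata

def canonicalize(text):
    """Apply normalization form KC to an already case-folded string."""
    return unicodedata.normalize("NFKC", text)
```

For instance, "e" followed by U+0301 COMBINING ACUTE ACCENT becomes the
single character U+00E9, and U+FB01 LATIN SMALL LIGATURE FI becomes the
two letters "fi".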
6. IDN Table Revisions
A table consisting of all characters allowed and prohibited and the
rules for case folding and canonicalization will be created based on the
content of [UniData] and on the content of this document. This table
will be the authority for implementations to follow and will be
normatively referenced by this document. Such a table will enable the
IDN protocol to have versions independent of the revisions to Unicode
and/or to ISO 10646 because the revision of IDN and its deployment may
not be in sync with revisions to Unicode and ISO 10646.
In a future draft of this document, IANA will be asked to keep this
table, with an initial version number of 1. Each new version of the
table will have a new, higher version number.
7. Security Considerations
Much of the security of the Internet relies on the DNS. Thus, any change
to the characteristics of the DNS can change the security of much of the
Internet.
Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized host name.
8. References
[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.
[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement.
[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
1: Architecture and Basic Multilingual Plane. Five amendments and a
technical corrigendum have been published up to now. UTF-16 is described
in Annex Q, published as Amendment 1. 17 other amendments are currently
at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE
UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]
[Normalize] "Character Normalization in IETF Protocols",
draft-duerst-i18n-norm-03.
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[RFC2396] Tim Berners-Lee, et al., "Uniform Resource Identifiers (URI):
Generic Syntax", August 1998, RFC 2396.
[RFC2732] Robert Hinden, et al., "Format for Literal IPv6 Addresses in
URL's", December 1999, RFC 2732.
[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).
[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
[UniData] The Unicode Consortium. UnicodeData File.
<ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms.
Unicode Technical Report #15.
<http://www.unicode.org/unicode/reports/tr15/>.
A. Acknowledgements
Many people from the IETF IDN Working Group and the Unicode Technical
Committee contributed ideas that went into the first draft of this
document. Mark Davis was particularly helpful in some of the early
ideas.
B. Changes From Previous Versions of this Draft
This is the -00 version, so there are no changes.
C. IANA Considerations
There are no specific IANA considerations in this draft, but there will
be in a future draft of this document.
D. Author Contact Information
Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
paul.hoffman@imc.org and paul.hoffman@vpnc.org
Marc Blanchet
Viagenie inc.
2875 boul. Laurier, bur. 300
Ste-Foy, Quebec, Canada, G1V 2M2
Marc.Blanchet@viagenie.qc.ca

@ -0,0 +1,269 @@
INTERNET-DRAFT Martin Duerst
draft-ietf-idn-uri-00 W3C/Keio University
Expires July 2001 January 6, 2001
Internationalized Domain Names in URIs and IRIs
Status of this Memo
This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Abstract
This document is a first draft for the provisions necessary to
upgrade the definitions of URIs [RFC 2396] and IRIs (Internationalized
Resource Identifiers, [IRI]) to work with internationalized domain
names.
1. Introduction
Internet domain names serve to identify hosts and services on the
Internet in a convenient way. The IETF IDN working group is currently
working on extending the character repertoire usable in domain names
beyond a subset of US-ASCII.
One of the most important places where domain names appear is in
Uniform Resource Identifiers (URIs, [RFC 2396], as modified by
[RFC 2732]). However, in the current definition of the generic URI
syntax, the restrictions on domain names are 'hard-coded'. This
document proposes to relax these restrictions by updating the syntax,
and defines how internationalized domain names are encoded in URIs.
URIs themselves are restricted to a subset of US-ASCII. However,
there is a proposal for relieving these restrictions by creating
a new protocol element called an IRI (Internationalized Resource
Identifier [IRI]). While IRIs in general allow the use of non-ASCII
characters, the syntax of IRIs has the same restriction for domain
names as the syntax of URIs. This document proposes to relax these
restrictions, too, in a way that is compatible with the new syntax
for URIs. This means that encoding an internationalized domain name in
a URI and encoding the same name in an IRI will produce a URI and an
IRI that can be converted into each other using the procedures defined
in [IRI] for these conversions.
2. URI syntax changes
The syntax of URIs [RFC 2396] currently contains the following rules
relevant to domain names:
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
The latter two rules are changed as follows:
domainlabel = escalphanum | escalphanum *( escalphanum | "-" )
escalphanum
toplabel = escalpha | escalpha *( escalphanum | "-" )
escalphanum
and the following rules are added:
escalphanum = escaped8 | alphanum
escalpha = escaped8 | alpha
escaped8 = "%" hexdig8 HEXDIG
hexdig8 = <<HEXDIG greater than 7>>
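The modified rules can be expressed as a regular expression for
experimentation. The Python sketch below is not normative. It assumes
that HEXDIG matches hexadecimal digits in either case, and the constant
and function names are hypothetical.

```python
import re

# escaped8 = "%" hexdig8 HEXDIG, where hexdig8 is a hex digit above 7.
ESCAPED8 = "%[89A-Fa-f][0-9A-Fa-f]"
ESCALPHANUM = "(?:" + ESCAPED8 + "|[A-Za-z0-9])"
ESCALPHA = "(?:" + ESCAPED8 + "|[A-Za-z])"

# label = first | first *( escalphanum | "-" ) escalphanum
LABEL_TAIL = "(?:(?:" + ESCALPHANUM + "|-)*" + ESCALPHANUM + ")?"
DOMAINLABEL = ESCALPHANUM + LABEL_TAIL
TOPLABEL = ESCALPHA + LABEL_TAIL

# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = re.compile("(?:" + DOMAINLABEL + "\\.)*" + TOPLABEL + "\\.?")

def is_hostname(text):
    """True if text matches the modified hostname rule."""
    return HOSTNAME.fullmatch(text) is not None
```

Note that this checks syntax only. It does not verify that the escaped
octets form valid UTF-8 or that the name satisfies any other IDN
restriction.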
The %HH escaping is used to encode characters outside the repertoire
of US-ASCII. This is done by first encoding the characters in UTF-8
[RFC 2279], resulting in a sequence of octets, and then escaping these
octets.
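The two-step encoding described above (UTF-8 first, then %HH escaping)
can be sketched in Python as follows. The sketch escapes only octets
above 0x7F, in line with the escaped8 rule, and does not perform name
preparation. The function name escape_label is illustrative.

```python
def escape_label(label):
    """Encode label as UTF-8 and escape every octet above 0x7F as %HH."""
    out = []
    for octet in label.encode("utf-8"):
        if octet >= 0x80:
            out.append("%%%02X" % octet)
        else:
            # US-ASCII octets are left unescaped.
            out.append(chr(octet))
    return "".join(out)
```

Applied to a label containing u-umlaut (U+00FC), this yields the octets
C3 BC in escaped form, as in "d%C3%BCrst".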
Using UTF-8 assures that this encoding interoperates with IRIs (see
Section 3). It is also aligned with the recommendations in [RFC 2277]
and [RFC 2718], and is consistent with the URN syntax [RFC 2141] as
well as recent URL scheme definitions that define encodings of
non-ASCII characters based on UTF-8 (e.g., IMAP URLs [RFC 2192] and
POP URLs [RFC 2384]).
Please note that the use of UTF-8 for encoding internationalized
domain names in URIs is independent of the encoding chosen for these
names in the DNS protocol. If something other than UTF-8 is chosen
for the latter, a future version of this document may give
instructions for the conversion if deemed necessary.
The above syntax rules do not extend the possible domain names based
on US-ASCII characters. This may have to be changed in case the IDN
WG should decide to allow such extensions.
The above rules also do not allow escaping of US-ASCII characters,
although this is allowed in the other parts of a URI (except for the
special provisions in case of reserved characters). Allowing such
escaping would either make the syntax rules considerably more
complicated, allow the restrictions on US-ASCII characters to be
circumvented by escaping, or lead to much simpler syntax rules that
no longer express these restrictions. Even if escaping of US-ASCII
characters is allowed in order to simplify processing, it is always
better not to escape US-ASCII characters in domain names, because a
resolver may be unable to unescape them. Purely US-ASCII domain names
that are left unescaped would then still be resolved by such a
resolver.
While only the restrictions on US-ASCII characters are expressed in the
rules above, all the other restrictions on internationalized
domain names that will be defined by the IDN WG MUST be respected.
The work of the IDN WG currently includes some procedures for name
preparation. Before encoding an internationalized domain name in a
URI, this preparation step SHOULD be applied. However, the resolver
MUST also apply name preparation.
3. IRI syntax changes
The syntax of IRIs [IRI] currently contains the following rules
relevant to domain names:
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
The latter two rules are changed as follows:
domainlabel = intalphanum | intalphanum *( intalphanum | "-" )
intalphanum
toplabel = intalpha | intalpha *( intalphanum | "-" )
intalphanum
and the following rules are added:
intalphanum = ichar | alphanum | escaped8
intalpha = ichar | alpha | escaped8
escaped8 = "%" hexdig8 HEXDIG
hexdig8 = <<HEXDIG greater than 7>>
where ichar, as in [IRI], is:
ichar = << any character of UCS [ISO10646] beyond
U+0080, subject to limitations in Section
3.1. of [IRI] >>
With respect to the allowed domain names based on US-ASCII characters,
the same considerations as in Section 2 apply.
As in Section 2, all the other restrictions on internationalized
domain names that will be defined by the IDN WG MUST be respected.
Also, before encoding an internationalized domain name in an IRI,
name preparation SHOULD be applied. However, the IRI resolver MUST
also apply name preparation.
It is expected that the rules in Section 3.1 of [IRI] will be less
restrictive than the rules for internationalized domain names, so that
no escaping is necessary. Nevertheless, escaping is allowed for cases
where not all characters can be directly represented.
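Because both forms rest on UTF-8, a host name with escaped octets can
be turned into its directly represented form by decoding those octets.
The Python sketch below illustrates this under stated assumptions: it
touches only escapes of octets above 0x7F, it is not the conversion
procedure of [IRI], and it applies no name preparation. The function
name is hypothetical.

```python
import re

# One or more consecutive escapes of octets above 0x7F.
_HIGH_ESCAPES = re.compile("(?:%[89A-Fa-f][0-9A-Fa-f])+")

def unescape_high_octets(hostname):
    """Replace runs of %HH escapes (HH above 7F) by the characters
    they encode in UTF-8."""
    def decode(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        # Raises UnicodeDecodeError if the octets are not valid UTF-8.
        return octets.decode("utf-8")
    return _HIGH_ESCAPES.sub(decode, hostname)
```

Decoding a whole run of escapes at once matters, since a single
character may occupy several escaped octets.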
4. Security Considerations
Besides the security considerations of [RFC 2396] and [IRI] and those
applying to the various aspects of internationalized domain names in
general, there are currently no known security problems.
Acknowledgements
To be done.
Copyright
Copyright (C) The Internet Society, 1997. All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Author's address
Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170
Note: Please write "Duerst" with u-umlaut wherever
possible, e.g. as "D&#252;rst" in XML and HTML.
References
[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers
(IRI)", Internet Draft, January 2001,
<http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-06.txt>,
work in progress.
[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet
Coded Character Set (UCS) - Part 1: Architecture and Basic
Multilingual Plane, Oct. 2000, with amendments.
[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997.
[RFC 2141] R. Moats, "URN Syntax", May 1997.
[RFC 2192] C. Newman, "IMAP URL Scheme", September 1997.
[RFC 2277] H. Alvestrand, "IETF Policy on Character Sets and
Languages", January 1998.
[RFC 2279] F. Yergeau, "UTF-8, a transformation format of ISO 10646",
January 1998.
[RFC 2384] R. Gellens, "POP URL Scheme", August 1998.
[RFC 2396] T. Berners-Lee, R. Fielding, L. Masinter, "Uniform Resource
Identifiers (URI): Generic Syntax", August 1998.
[RFC 2640] B. Curtis, "Internationalization of the File Transfer
Protocol", July 1999.
[RFC 2718] L. Masinter, H. Alvestrand, D. Zigmond, R. Petke,
"Guidelines for new URL Schemes", November 1999.
[RFC 2732] R. Hinden, B. Carpenter, L. Masinter, "Format for Literal
IPv6 Addresses in URL's", December 1999.
