From a831ffc8fec3a487183a381a6ae799116552f78b Mon Sep 17 00:00:00 2001 From: Andreas Gustafsson Date: Thu, 15 Nov 2001 23:46:00 +0000 Subject: [PATCH] new draft --- doc/draft/draft-hall-dm-idns-00.txt | 2739 +++++++++++++++++++++++++++ 1 file changed, 2739 insertions(+) create mode 100644 doc/draft/draft-hall-dm-idns-00.txt diff --git a/doc/draft/draft-hall-dm-idns-00.txt b/doc/draft/draft-hall-dm-idns-00.txt new file mode 100644 index 0000000000..d3bc4b4e0d --- /dev/null +++ b/doc/draft/draft-hall-dm-idns-00.txt @@ -0,0 +1,2739 @@ + + + INTERNET-DRAFT Eric A. Hall, Editor + Document: draft-hall-dm-idns-00.txt Consultant + Expires: May 2002 November 2001 + + + The Internationalized Domain Name System + + + Status of this Memo + + This document is an Internet-Draft and is in full conformance with + all provisions of Section 10 of RFC2026. + + Internet-Drafts are working documents of the Internet Engineering + Task Force (IETF), its areas, and its working groups. Note that + other groups may also distribute working documents as Internet- + Drafts. + + Internet-Drafts are draft documents valid for a maximum of six + months and may be updated, replaced, or obsoleted by other + documents at any time. It is inappropriate to use Internet-Drafts + as reference material or to cite them other than as "work in + progress." + + The list of current Internet-Drafts can be accessed at + http://www.ietf.org/ietf/1id-abstracts.txt. + + The list of Internet-Draft Shadow Directories can be accessed at + http://www.ietf.org/shadow.html. + + + 1. Abstract + + The principal intention of this specification is to facilitate the + deployment of a completely internationalized domain name syntax + and service which new protocols, applications and host systems can + use, but without disrupting the existing infrastructure. 
Towards + that end, this document describes a series of elective + encapsulation services and protocol extensions which cumulatively + allow internationalized domain names to be stored and transmitted + in the existing DNS message and within application data streams, + according to the compliance level of the participating systems. + + + + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + + Table of Contents + + 1. Abstract..................................................1 + 2. Definitions and Terminology...............................3 + 3. Introduction..............................................4 + 3.1. Background.............................................4 + 3.2. Objectives.............................................5 + 3.3. Common Usage Scenarios.................................7 + 3.4. User Audiences.........................................9 + 3.5. Service Overview......................................11 + 3.6. Process Example.......................................13 + 4. The Internationalized Namespace..........................19 + 4.1. Internationalized Domain Names and Labels.............20 + 4.2. Internationalized Host Identifiers....................27 + 4.3. STD13 Domain Names....................................28 + 4.4. STD13 Host Identifiers................................29 + 5. Transfer Encodings and Label Types.......................30 + 5.1. The EDNS/UTF-8 Label Type.............................31 + 5.2. The STD13 Legacy Label Type...........................33 + 6. Application Guidelines...................................36 + 6.1. Input and Output Charsets.............................37 + 6.2. Protocol and Application Data.........................38 + 6.3. DNS Lookups and Resolver Calls........................40 + 7. Resolver Guidelines......................................42 + 7.1. Resolver APIs.........................................42 + 7.2. Query Processing Services.............................44 + 7.3. 
The Hosts Database....................................48 + 8. Server Guidelines........................................49 + 8.1. Internationalized Zones...............................50 + 8.2. Namespace Visibility Restrictions.....................51 + 8.3. The Master File Format................................52 + 9. Caching Guidelines.......................................53 + 10. Security Considerations..................................53 + 11. IANA Considerations......................................54 + 12. References...............................................54 + 13. Acknowledgements.........................................55 + 14. Editor's Address.........................................55 + + + Hall I-D Expires: May 2002 [page 2] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + + 2. Definitions and Terminology + + This document unites, enhances and clarifies several pre-existing + technologies. Readers are expected to be familiar with the + following specifications: + + [AMC-ACE-Z] , "AMC-ACE-Z version + 0.3.1" + + [NAMEPREP] , "Preparation of + Internationalized Host Names" + + [STD13] (RFC 1034) "Domain names - concepts and facilities", + (RFC 1035) "Domain names - implementation and + specification" + + [STD3] (RFC 1122) "Requirements for Internet Hosts -- + Communication Layers", (RFC 1123) "Requirements for Internet + Hosts -- Application and Support" + + [BCP18] (RFC 2277) "IETF Policy on Character Sets and + Languages" + + [RFC2279] "UTF-8, a transformation format of ISO 10646" + + [RFC2671] "Extension Mechanisms for DNS (EDNS0)" + + + The following abbreviations are used throughout this document: + + UCS (Universal Character Set) - The ISO/IEC 10646 character + set repertoire, as represented by the Unicode 3.1 + specification. + + ACE (ASCII-Compatible Encoding) - A transfer encoding which + encodes UCS character codes into a seven-bit codespace + which is compatible with US-ASCII. 
+ + UTF-8 (UCS Transformation Format, Eight-Bit) - A transfer + encoding which encodes UCS characters into an eight-bit + codespace which is compatible with DNS message formats. + + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL + NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" + in this document are to be interpreted as described in RFC 2119. + + Hall I-D Expires: May 2002 [page 3] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + + + 3. Introduction + + The domain name system (DNS) [STD13] currently defines a message, + namespace and protocol. Although the DNS message is capable of + transferring eight-bit character codes as protocol data, + applications are currently limited to a subset of US-ASCII when + they interact with the DNS namespace, and this restricted syntax + is enforced by almost every TCP/IP application and protocol which + utilizes domain names as embedded data (including, surprisingly, + the DNS protocol). + + In order to allow for the use of a larger range of characters in + the namespace, this document extends and clarifies a variety of + Internet specifications so that characters from the Universal + Character Set (UCS) [ISO10646] may be used in domain names. This + document also extends the DNS message structure to allow for the + use of UTF-8 [RFC2279] encoded characters for the purpose of + transferring these domain names, but also provides an ASCII- + compatible encoding (ACE) [AMC-ACE-Z] of these character codes + which existing protocols and applications can use to access the + internationalized domain names, and also provides identification + mechanisms which allow the end-point systems to downwardly + negotiate when needed. Finally, this document defines behavior for + DNS systems which implement this architecture, including the end- + point applications which generate and store DNS domain names, and + the resolvers, caches and servers which process them. 
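The relationship between the two transfer encodings described above can be sketched as follows. This is a minimal illustration and not part of the specified mechanism: Python's built-in "punycode" codec is used purely as a stand-in for an ACE, and the output of AMC-ACE-Z may differ. The function name label_encodings is hypothetical.

```python
# Illustrative sketch only. The "punycode" codec is an assumed stand-in for
# an ASCII-compatible encoding; it is not claimed to match AMC-ACE-Z output.

def label_encodings(label: str) -> dict:
    """Return the eight-bit (UTF-8) and seven-bit (ACE-style) forms of a label."""
    utf8_form = label.encode("utf-8")      # eight-bit form, for compliant systems
    ace_form = label.encode("punycode")    # seven-bit form, for legacy systems
    return {
        "utf8": utf8_form,
        "ace": ace_form,
        # True when every octet of the ACE form fits in seven bits.
        "ace_is_ascii": all(octet < 0x80 for octet in ace_form),
    }


if __name__ == "__main__":
    forms = label_encodings("b\u00fccher")
    print(forms["utf8"])
    print(forms["ace"])
    print(forms["ace_is_ascii"])
```

Both forms name the same UCS label; a legacy system would only ever see the seven-bit form, while a compliant system could carry the UTF-8 octets directly.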
+ + The mechanisms presented here are elective. Developers, zone + administrators and network operators who wish to make use of the + internationalized domain names may do so according to their own + schedule. Those developers, administrators and operators who + cannot or prefer not to implement the specified extensions can + continue to use their legacy systems, and will still be able to + access resources from the internationalized domain name system. + + + 3.1. Background + + From one perspective, DNS is already an "eight-bit clean" system, + in that the structured DNS message is capable of storing and + transmitting eight-bit data without any additional effort. + However, this perspective only considers one particular facet of + the domain name system, and ignores the more critical aspect of + + Hall I-D Expires: May 2002 [page 4] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + the DNS namespace, which has rules that are entirely different + from those which govern the message format. + + The DNS namespace (or more appropriately, the view of the + namespace which applications use and enforce) is governed by rules + set forth in RFC952 [RFC952], STD3 [STD3], and STD13, which + collectively define the characters that are eligible for use with + host names. These rules are meant to provide a common template + which may be applied to either the DNS namespace or a local hosts + database, such that a query for "host.example.com" can be + processed through either system. The range of valid characters + currently defined are the letters, numbers and hyphen characters + from US-ASCII [ASCII] (additional rules also govern the valid + order and length of a host name). Character code values outside of + this range are valid in domain name messages, but are undefined + when used in the namespace, and are subject to interpretation by + the applications which generate them. 
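The legacy host name rules summarized above can be approximated by the following sketch. It assumes the commonly cited limits of 63 octets per label and 253 characters per name, and the RFC 1123 relaxation which permits a label to begin with a digit. The function name is_legacy_host_name is hypothetical.

```python
import re

# One legacy label: letters, digits and hyphen only, no leading or trailing
# hyphen, 1 to 63 characters (assumed limits, see the lead-in above).
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_legacy_host_name(name: str) -> bool:
    """Check a dotted name against the letters-digits-hyphen host name rules."""
    if name.endswith("."):
        name = name[:-1]                   # ignore the root label, if present
    if not name or len(name) > 253:
        return False
    # Every label must match the pattern in full; empty labels are rejected.
    return all(_LDH_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ("host.example.com", "-bad.example.com", "b\u00fccher.example.com"):
        print(candidate, is_legacy_host_name(candidate))
```

A name containing characters outside this range, such as the third candidate, is exactly the case for which this document provides the UTF-8 and ACE encodings.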
+ + The host name rules are enforced by almost every application and + protocol which uses DNS to identify a host or system. This + includes network utilities such as ping and traceroute which + simply identify systems by name, and complex protocols such as + SMTP which use domain names to determine message-routing paths. + Portions of the DNS protocol itself are also affected by these + restrictions, such as the domain names which may be used for NS + resource records with sub-domain delegation operations (since + these servers are connection targets, they are also required to be + compliant with the host name rules). + + Because these domain names are so pervasive throughout the + Internet (and even within proprietary applications that run on + private networks), it is not possible to declare a "flag day" at + which eight-bit domain names will be considered valid encodings of + a particular character set. Instead, an extended namespace with a + larger set of charset rules must be defined, an extended DNS + protocol capable of supporting these domain names must be + deployed, and a transitional mechanism which allows the old and + new systems to interact must be established. This document + attempts to meet these objectives. + + + 3.2. Objectives + + In broad terms, this document has one overall goal, which is to + facilitate the creation and use of an internationalized domain + name system around a UCS namespace, a collection of UTF-8 and + + Hall I-D Expires: May 2002 [page 5] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + legacy-compatible encodings which are suitable for transferring + internationalized domain names within DNS and the affected + application data streams, and a negotiation mechanism which allows + end-point systems to identify the encoding that they will use for + a particular operation. 
+ + One of the objectives stated above is to internationalize the + existing DNS namespace, by allowing UCS characters to be used in + host names and sub-domain delegations in old and new zones + equally. As such, this document does not define a new namespace, + but instead defines mechanisms by which leaf-nodes and sub-domains + may be created within the existing hierarchy. + + UTF-8 was chosen as the primary transfer encoding of these domain + names for several reasons. For one, there is a wide availability + of tools and expertise surrounding UTF-8, and it is already widely + deployed within development environments, operating systems and + applications. Furthermore, BCP18 [BCP18] requires that new + application protocols be able to use UTF-8 as application data, + and for many applications, this specifically means domain names + which are passed as data. All signs indicate that UTF-8 is + currently and will continue to be the preferred eight-bit encoding + on the Internet, and this specification embraces this position in + its design. + + However, most of the network services currently in use are bound + by the legacy host naming restrictions, and those applications and + protocols will also need to be able to interact with resources + from the internationalized namespace, even though they will not be + compliant with the UTF-8 encoding mechanisms defined in this + document. In order to allow these systems to participate, this + specification also embraces the use of ACE as a seven-bit + backwards-compatible encoding for legacy systems to use. + + Note that even though a single encoding could have been specified + by this document, past and present requirements would not have + been satisfied by a single choice. 
For example, supporting UTF-8 + alone would mean isolating legacy systems from resources in the + UCS namespace, while supporting ACE alone would not have provided + a truly internationalized namespace (the ACE encoded domain names + still appear in user data quite frequently). By allowing the UTF-8 + and ACE encodings to coexist, the existing and emerging + communities can both be served. + + Because both encodings will be active during the same time period, + this document also defines DNS protocol extensions which allow the + + Hall I-D Expires: May 2002 [page 6] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + end-point systems to detect the encoding that is in use for a + particular query/response pair. Note that these negotiation + mechanisms not only allow new and legacy systems to interoperate, + but they also provide a transition service for developers, zone + administrators and end-users, in that ACE encoded domain names can + be initially deployed within existing applications and DNS + systems, while individual elements of the infrastructure can be + upgraded without disturbing other components. + + + 3.3. Common Usage Scenarios + + Discussion of the mechanism provided by this document depends upon + the usage context of the domain names themselves. Domain names are + extremely pervasive, and are used by almost every TCP/IP protocol + and application in one form or another. However, most usages fall + under one or more of the following scenarios: + + * Connection identifiers - Domain names are most commonly + used as host-specific identifiers for outbound connection + requests, whether this be for a command-line application + such as ping, or as a host name which is stored in an + application's configuration file. 
Another common usage + scenario for connection identifiers is with reverse + lookups, where a server is logging incoming connections by + the corresponding domain name, or where a program such as + netstat is displaying all of the application sessions which + are currently active on a host. In both of these cases, + domain names are passed through applications to a resolver, + resulting in DNS queries and responses which eventually + provide the requested DNS data. + + A related use (but one which does not generate DNS + messages) is determining the host name of the local system. + This is commonly found with applications and protocols that + need to display the domain name of the local system as part + of a protocol operation (such as an SMTP greeting banner) + or as application data. + + Connection identifiers (and lookups in general) are + probably the largest single use of domain names today, and + this is likely to be the case with internationalized domain + names as well. This document fully supports the use of + internationalized domain names for lookup operations, as + long as the calling application, the stub resolver, the + local caching servers, and the authoritative servers for + + Hall I-D Expires: May 2002 [page 7] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + the specified domain name are compliant with this + specification. If any of these components is not capable + of supporting internationalized domain names in this + manner, the ACE equivalent domain name will be negotiated + for the operation at hand. + + * Protocol data - Some application protocols exchange domain + names as protocol data, with those domain names either + determining or altering a service-specific operation. 
+ Examples of this usage include SMTP envelopes ("RCPT TO + ") where the domain name is used to + determine whether or not a particular email message should + be accepted for delivery, the HTTP HOST header field which + identifies a specific document tree on a shared server, + BOOTP/DHCP options, WHOIS input, and more. + + Because these protocols treat domain names as protocol + data, most of these protocols also have specific formatting + requirements which must be addressed before UTF-8 domain + names can be used by these protocols directly. This + document is intended to facilitate the use of UTF-8 encoded + domain names in this manner, although it is expected that + most of the protocol development groups will need to + develop negotiation mechanisms before these protocols can + use internationalized domain names directly. Until such + work is completed, ACE equivalent domain names can be used + to provide these protocols with access to the + internationalized namespace. + + * Structured application data - Structured application data + is similar to protocol data in that it can trigger or + affect some protocol action, although this will not always + occur. For example, a web browser can process an embedded + IMG link which may be present in a web page, while a user + can manually follow an embedded email link which is also + stored in the same web page; even though both usage models + share the same structured data format (URLs), they are + processed differently by the application. Similarly, email + messages typically contain multiple domain names as + structured data in the message headers, and some of these + domain names will directly affect subsequent protocol + operations, while others will not. + + Because of this ambiguity, this document defines no + specific treatment for structured application data. 
In some + cases, no additional mechanisms will be required, while + + Hall I-D Expires: May 2002 [page 8] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + other scenarios will require negotiation mechanisms before + an internationalized domain name can be used in the + structured data (with ACE being required as the interim + format). Each protocol development group is encouraged to + analyze each usage independently, to classify the usage as + a connection identifier, protocol data, or unstructured + application data, and to determine the appropriate course + of action for each usage accordingly. + + * Unstructured application data - Many application protocols + provide free-text data which can contain domain names, but + with those domain names existing as unstructured data. For + example, an email message which is provided as a text/plain + MIME body part may contain a domain name which identifies a + system or service in the context of a specific application, + but in an unstructured form ("your files were moved from + server1 to server2"). Similarly, an email address may be + provided in WHOIS output, but as unstructured data which + does not affect the protocol. + + Given the application-specific nature of this data, it + cannot be managed by any global protocol or process. Where + a protocol has rules or restrictions on the data itself, + then those rules are maintained, but some formatting rules + may need to be extended before internationalized domain + names (or their equivalents) can be encoded in the + application data. For example, internationalized domain + names in email messages may need to be converted to a + preferred display charset, while ACE equivalents may be + necessary for protocols which only support US-ASCII. + + Each of the above scenarios represents a distinct handling case + where internationalized domain names may or may not be used + directly. 
In some cases, the internationalized domain names may be + used as soon as the applications and resolvers are configured to + use them, while in other cases, measured and cautious deployment + is required in order to prevent undue breakage. In the latter + cases, however, the backwards-compatible ACE encoding is available + so that the internationalized domain names can be used. + + + 3.4. User Audiences + + Another perspective on the changes which will result from + deploying the mechanisms described in this document can be seen by + analyzing how any such changes will affect the different + + Hall I-D Expires: May 2002 [page 9] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + "audiences" who work with domain names, and who have their own + unique context-specific usage requirements and objectives. The + three main audiences discussed in this document are: + + * Developers. Protocol and application developers need to be + able to incorporate internationalized domain names into + their systems as easily as possible, although there are + many factors which will affect such usage, including the + input and output charsets and encodings which are available + to the applications and protocols. Where feasible, this + specification allows developers to choose any charset or + encoding which may be required and suitable for use, + although in most cases, a recommendation is also made for + the use of UTF-8 in particular. + + Developers may adopt internationalized domain names for + connection identifiers and lookup operations fairly + quickly, such that users can use those systems as soon as + they have compliant systems (and they have a target domain + name to communicate with). Implementing support for + internationalized domain names in protocols and application + data will require additional effort by the affected + development groups. 
+ + Support for ACE will be harder to implement, since it is a + relatively new and untested encoding syntax, with no + existing developer tools. This will likely be the largest + hurdle to overcome when developing applications for use + with this service. + + * Zone administrators. Organizations that wish to deploy + internationalized domain names should be able to do so + easily, at a reasonable cost, and without suffering + excessive pre-conditions. Towards this objective, the + mechanisms described by this document allow organizations + to deploy and use internationalized domain names within any + zone immediately, without requiring any other zone to have + been updated beforehand (although there are specific and + strong suggestions for upgrading the Internet's high-load + servers as soon as possible). + + If an organization wishes to publish internationalized + domain names for users to access and utilize, the + authoritative servers for the affected zone must be + compliant with the naming rules and message formats + described by this document, which will almost certainly + + Hall I-D Expires: May 2002 [page 10] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + require the administrators of that zone to upgrade their + servers. However, organizations may also choose to only + deploy ACE encoded domain names if an immediate migration + is not feasible, with the caveat that internationalized + domain names in their native form will not be available + from those zones. + + * Network operators. The systems and human users which + generate DNS lookups are another area of concern, as these + protocols, programs and users will expect these lookups to + succeed, and will also expect that the visible namespace + will be compatible with the capabilities of the requesting + system at a minimum investment. This is a broad range of + requirements. 
+ + At a minimum, applications must be capable of generating + and accepting the internationalized domain names if they + are to use those domain names (see the "Developers" + discussion above for the application requirements). + Similarly, the local resolvers, caches and forwarders on + the user's network must also support the message formats if + they are to relay internationalized domain names between + their local applications and the remote zones being + queried. If the applications, resolvers and caches do not + support these requirements, intermediary systems will + perform the down-level negotiation automatically on their + behalf such that additional effort is not required on the + user's part. + + In summary, the developers, zone administrators and end-users can + immediately participate in the internationalized namespace at no + additional expense if they are content with using ACE encoded + domain names, and can use internationalized domain names in their + native form if they are willing to make the necessary investments. + Furthermore, since the native and backwards-compatible encodings + are not mutually exclusive, implementers of this specification + have the option of adopting ACE for immediate use and then + transitioning to internationalized domain names on a per-system, + per-zone, or per-application basis, according to their schedule. + + + 3.5. Service Overview + + This document specifies a variety of extensions to several + different protocols and services in order to facilitate the use of + internationalized domain names anywhere this support exists or can + + Hall I-D Expires: May 2002 [page 11] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + be implemented, and to provide a legacy-compatible domain name in + all other situations. + + More specifically, this document defines or clarifies behavior for + the following elements: + + * Host name character restrictions. 
Legacy protocols and + applications are currently restricted to the legacy host + naming rules, which only allow for a subset of US-ASCII + characters (letters, digits and the hyphen character). This + document redefines the characters which are valid within a + host name so that system identifiers, domain name parts of + host names, and new network services can use most of the + characters from the UCS. + + * DNS message format. This document defines an extended label + format based on the extended label services provided by + RFC2671 (Extension Mechanisms for DNS - EDNS0) [RFC2671], + with this label format being used to encapsulate UTF-8 + encoded internationalized domain names in DNS messages. Any + DNS message which carries the UTF-8 encoded domain names is + required to use the EDNS/UTF-8 label type defined in this + document. Any DNS message which carries legacy domain names + (including the ACE encoded equivalent domain names) is + required to use the traditional message format. + + * Application handling rules. Applications can use + internationalized domain names immediately for lookup + operations that do not directly affect external services or + protocols, and can use ACE encoding sequences to specify + internationalized domain names in legacy protocol + operations, and can use them both at the same time. + + * Stub resolvers. Stub resolvers will most likely need to + provide a series of internationalized APIs in order to + fully support applications that generate internationalized + domain name lookups. For example, these APIs will almost + certainly be required in order for the resolver to + determine that the calling application is compliant with + the host name requirements defined by this document, and + that the domain names should be encoded in the proper label + format. Although this specification does not dictate these + APIs, it encourages their use, and provides some guidance + on the issues surrounding their use. 
+ + + Hall I-D Expires: May 2002 [page 12] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + * Forwarders, resolving servers and caches. The user-side + servers which process internationalized domain names have + several protocol-specific requirements, including the + negotiated fall-back service when UTF-8 queries fail. + + * Authoritative servers. A key part of this specification is + the simultaneous support for internationalized and legacy + compatible domain names in the UCS namespace, thereby + allowing a domain name to be entered into an authoritative + zone database once, and for the appropriate response to be + generated by a server according to the label encoding from + the associated query. In order for this to work, this + specification requires authoritative servers which serve + internationalized domain names to comply with specific + conditions. This specification also allows existing servers + to serve ACE equivalent domain names when the authoritative + servers cannot be upgraded, although this typically results + in lower levels of functionality. + + The elements listed above collectively define a completely + internationalized domain name system, which is capable of + servicing internationalized domain names in all compliant systems, + and which is also capable of providing ACE encoded equivalent + domain names when any component from the internationalized service + is not available. + + + 3.6. Process Example + + This section illustrates a series of query/response transactions + under which the processes and protocols defined in this document + function. This example uses a reverse lookup for the PTR resource + record associated with the "14.2.0.192.in-addr.arpa." domain name + (forward lookups work similarly, but the issues are more fully + demonstrated by PTR lookups). Each of the various technologies + shown below is described in later sections of this document. 
The + sole purpose of this example is to provide an illustration of + these mechanisms in order to facilitate better discussion. + + Note that this illustration represents a worst-case scenario + (thereby exercising most of the functionality provided by this + specification), and does not represent a typical scenario. + + a. First, a PTR resource record for 14.2.0.192.in-addr.arpa. + is added to the internationalized zone database on the + replication master server for the 2.0.192.in-addr.arpa. + + Hall I-D Expires: May 2002 [page 13] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + zone, with the resource record data value of + "host..example.com." (where is an + internationalized domain name compliant with the host + naming rules provided in this document). Both of these + domain names have a primary representation consisting of + UCS characters in some local encoding, but are also + available as UTF-8 and ACE encoded data so they can be + encapsulated within DNS queries and responses. + + Once the zone is reloaded and is replicated by the other + authoritative servers for that zone, the domain names can + be processed. + + b. An application on a remote system generates a DNS lookup + for the PTR resource record associated with the + 14.2.0.192.in-addr.arpa. domain name. + + If this is a legacy application, it issues the lookup using + the only method it knows, which is to pass the domain name + to the legacy resolver API. This would result in the + resolver issuing a legacy DNS query for the PTR resource + record associated with the specified domain name. + + If this application is compliant with this specification, + it performs the following steps: + + 1. Verify that the resolver is capable of processing + queries for UTF-8 domain names by probing for an + internationalized API. If this step failed, then the + domain name would be converted to the legacy STD13 + octet encoding in step 3.6.b.3 and passed to the + resolver's legacy API. + + 2. 
Convert the domain name from its generated encoding to + the canonical UCS characters, and then normalize and + case-convert the UCS characters. + + 3. Convert the normalized and lowercased UCS characters + to the charset or encoding used by the resolver's + internationalized API. + + 4. Issue a lookup for the PTR resource record associated + with the internationalized domain name, via the + resolver's internationalized API. + + + Hall I-D Expires: May 2002 [page 14] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + Note that even though the domain name is compatible + with the legacy host name rules, the domain name is + passed through the internationalized API so that + servers can tell whether or not the original + application is UTF-8 compliant, and can determine the + format of any internationalized domain names which are + to be returned in the response messages. This is + required in case the queried resource record includes + internationalized domain names as resource record data + (as would be the case with PTR resource records), and + is also required for the proper handling of any SOA or + NS resource records which may be returned as + additional data in the response. + + For the purpose of this example, we will assume that each + of these steps was successfully performed. + + c. The client's stub resolver generates the query, with the + Question Section of the query containing the UTF-8 encoded + domain name encapsulated in an EDNS/UTF-8 extended label. + + d. The stub resolver sends the query to one of its configured + resolving servers. + + e. The resolving server will either answer the query from its + cache or forward the query to a name server which is + authoritative for the namespace hierarchy, as per the + normal query-resolution procedure.
For the purpose of this + example, we will assume that the server has no information + about the specified domain name, so it forwards the query + to one of the root zone's authoritative servers in order to + begin the iterative resolution process. + + f. The queried server responds with a referral, providing + delegation data for a zone in the path to the queried + domain name. For the purposes of this example, we will use + 192.in-addr.arpa. as the delegation domain specified in the + referral message. + + The specific format of the referral will depend on whether + or not the queried server understands the EDNS/UTF-8 label + encoding. If the server is compliant with this + specification (which it is, or else it wouldn't have + answered with a referral), then the referral will also + provide EDNS/UTF-8 encoded domain names in the Authority + and Additional-Data Sections of the referral. If the server + + Hall I-D Expires: May 2002 [page 15] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + was not compliant with this specification, it would return + an error upon seeing the extended label type, which would + cause the resolving server to restart the query using the + legacy label type. + + g. The resolving server decodes the UTF-8 encoded domain names + to their UCS character representation, caches the resource + records in their UCS form, and sends the query to one of + the authoritative servers for the referral zone. Note that + the cache did not normalize or case-convert the UCS + characters; only the end-systems perform this work. + + h. In this case, the queried server does not understand the + EDNS/UTF-8 label format, and has returned a FORMERR + response code. + + i. 
When these errors are encountered, the current resolver + (whether this is the client's stub resolver or a caching + server in the query path) must convert the query domain + name from its current form to a legacy-compatible encoding + (either ACE or STD13 octet sequences, depending on the UCS + characters which have been encoded), and then has to + reissue the query in that format. + + In this case, the domain name only contains printable + characters from US-ASCII, so the STD13 octet encoding is + used for the fall-back query. Because the UCS domain name + was normalized and lowercased before it was passed to the + client's stub resolver, the legacy domain name will also be + in this format (although it will be compared in a case- + neutral form by the recipient server). + + Note that once this conversion takes place, the legacy + label format is used for the remainder of the current query + chain (this prevents excessive delays from multiple fall- + back operations, which could result in timeouts at the + original resolver or application). + + + Hall I-D Expires: May 2002 [page 16] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + j. The queried server returns a delegation referral for the + 2.0.192.in-addr.arpa. zone. Since the query arrived in the + STD13 octet encoding, the server has no indicator of the + client's capabilities, so the referral NS resource records + will also be returned in legacy compatible form (either as + STD13 octet sequences or as ACE encoded data, depending on + the character codes provided in each label from each of the + associated domain names). + + Note that even though these NS resource records will be + restricted to legacy-compatible host names and label types, + they may contain and reference ACE domain names. 
In this + regard, a legacy server in the delegation path does not + prevent internationalized domain names from being delegated + or resolved, but only prevents them from being processed as + EDNS/UTF-8 extended labels. + + Also note that once the authoritative servers for a zone + have been discovered and cached, any subsequent UTF-8 + queries which are generated for the resources in that zone + will be sent directly to one of those servers, bypassing + the delegation hierarchy. As such, subsequent queries which + are provided in EDNS/UTF-8 labels can be processed directly + by the zone's authoritative servers, without the delegation + servers disrupting the process. + + k. The resolving server decodes the STD13 octet sequences and + ACE encoded domain names to their UCS character + representations, caches the resource records, and resends + the query to one of the authoritative servers for the + referral zone. + + l. The queried server processes the request. Since this query + arrived as an STD13 octet sequence, the server must compare + the seven-bit characters from the domain name (which is all + of them, in this example) in a case-neutral form. Note that + if the query had arrived as ACE or UTF-8 encoded domain + names, the server would have decoded the specified domain + name to its canonical UCS characters and performed a case- + exact match against the resulting characters. + + m. The queried server responds with the requested data. Note + that the query was submitted in the legacy label form due + to the fall-back processing which occurred in step 3.6.i, + so the server will only respond to this query with STD13 + + Hall I-D Expires: May 2002 [page 17] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + octet sequences or ACE encoded domain names, using the + STD13 legacy label. + + n. The resolving server decodes the STD13 octet sequences and + ACE encoded domain names to their UCS character + representations, and caches the resource records. 
Since the + query was originally received as an internationalized + domain name (as indicated by the EDNS/UTF-8 extended label + from the original query), the resolving server has to + encode the answer data as UTF-8 before passing it back to + the client's stub resolver. However, since the input was + not provided in an encoded UCS form, the server has to + normalize and case-convert the STD13 octet sequence in + order to provide a valid internationalized domain name. + + o. The stub resolver decodes the UTF-8 encoded domain names + which have been provided in the response message to their + UCS character representation, and passes the data to the + original calling application using the charset or encoding + favored by the resolver. + + p. The application validates the received domain name by + decoding the internationalized domain name to its canonical + UCS characters, normalizing and down-casing the resulting + domain name, and comparing the results with the answer data + which was provided by the resolver. + + As can be seen, the UTF-8 name resolution process is identical to + the current resolution process, with the addition of a single + fall-back query in step 3.6.i which resulted in one extra + query/response pair (roughly equivalent to adding one extra + delegation referral into the query path), and with several + different encoding conversions, as required by the participating + systems and services. This example also illustrates the + requirements which are placed on developers, zone administrators, + and network operators in order for typical connection identifier + services to function with UTF-8 domain names. 
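The fall-back behaviour in steps 3.6.h and 3.6.i can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the specification: the `send` transport, the `ace_encode` callable and the "edns-utf8" / "std13" label-type strings are hypothetical names introduced here, and the sketch simply prefers ACE for every label that contains a character above U+007F.

```python
from typing import Callable, Tuple

NOERROR = 0   # STD13 RCODE 0: no error
FORMERR = 1   # STD13 RCODE 1: format error

# Hypothetical transport: takes (name, label_type), returns (rcode, answer).
Send = Callable[[str, str], Tuple[int, str]]


def to_legacy(name: str, ace_encode: Callable[[str], str]) -> str:
    """Convert each label to a legacy-compatible form (step 3.6.i).

    Labels made only of US-ASCII characters are kept as STD13 octets;
    any other label is handed to the caller-supplied ACE encoder.
    """
    return ".".join(
        label if label.isascii() else ace_encode(label)
        for label in name.split(".")
    )


def query_with_fallback(name: str, send: Send,
                        ace_encode: Callable[[str], str]) -> Tuple[str, int, str]:
    """Try the EDNS/UTF-8 label first; on FORMERR, reissue once as legacy."""
    rcode, answer = send(name, "edns-utf8")
    if rcode != FORMERR:
        return "edns-utf8", rcode, answer
    # After fall-back, the legacy form is kept for the rest of the chain.
    rcode, answer = send(to_legacy(name, ace_encode), "std13")
    return "std13", rcode, answer
```

The function returns the label type that finally produced the response, so that a caller could keep using the legacy form for the remainder of the query chain, as step 3.6.i requires.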
+ + However, if each system and service had used UTF-8 for encoding + purposes (including everything between the stub resolver's APIs + and the authoritative servers for the target zone), then no + additional queries or conversions would have been required (other + than the direct UCS conversions required for validation and + caching, the latter of which can be performed separately without + affecting the processing path). In this regard, the example above + illustrates how this system can function even when only a portion + + Hall I-D Expires: May 2002 [page 18] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + of the participating systems utilize UTF-8, and also illustrates + how effective the entire operation would be if all of the + recommendations and requirements provided in this specification + were adopted. + + It is also important to reiterate here that any such costs + associated with this compliance are entirely elective by the + affected parties. If they want to streamline the process, the + option is available to them, although the system also works when + very few optimizations are implemented. + + + 4. The Internationalized Namespace + + In simple terms, this specification defines an internationalized + namespace which consists of domain names and labels that contain + UCS character codes, and also specifies a series of encoding + formats which may be used whenever the UCS values need to be + encapsulated for transmission within DNS messages or application + data streams. + + In this regard, the internationalized namespace is the UCS + representation of the domain names and labels as they are used for + comparison operations once a domain name arrives for processing, + while the transfer encodings ensure that a domain name arrives at + the destination system intact, so that it may be processed in its + canonical form. + + There are four conceptual elements to this model: + + * Character codes. 
Labels from internationalized domain names + have a single logical canonical representation as sequences + of UCS code point values. The UCS characters are used when + a particular label from a domain name is created by an + application, stored in a zone, hosts or cache database, and + is used whenever two sets of domain names or labels need to + be compared. However, different kinds of domain names have + different rules which govern the character codes that may + be used. + + * Storage encodings. Whenever a domain name is created or + copied from the network, it must be stored in a format that + is reversible to the canonical UCS character representation + of that domain name. This specification does not mandate or + require any particular storage encoding, and allows this + decision to be made on a per-implementation basis, as long + + Hall I-D Expires: May 2002 [page 19] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + as the storage encoding supports character codes which can + be converted to UCS equivalent values for comparison + purposes. However, the use of UTF-8 for this purpose is + encouraged, since it is the most common. + + * Transfer encodings. Whenever a domain name needs to be sent + over the network, it must be packaged in a form which is + compliant with the capabilities of the transfer protocol in + use. This document specifies three transfer encodings which + may be used to encode canonical UCS character codes in DNS + messages or application streams, which are: the octet + encoding from STD13, the ACE encoding from , and the + UTF-8 encoding from RFC2279. Each encoding has different + costs and benefits in different usage scenarios. + + * Comparison operations. When two domain names need to be + compared, they also follow rules which are appropriate to + the type of domain name being provided, and the transfer + encoding which may have been used to provide the domain + name to the system. 
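The storage-encoding requirement above (any encoding is acceptable as long as it is reversible to the canonical UCS code points) can be illustrated with a small Python check. This is a sketch only: the helper name `is_reversible` is introduced here for illustration, and Python's "utf-32-be" codec is used as a stand-in for a four-octet UCS storage form.

```python
def is_reversible(label: str, codec: str) -> bool:
    """True if `codec` can store `label` and recover the same code points."""
    try:
        stored = label.encode(codec)
    except UnicodeEncodeError:
        # The codec cannot represent at least one character of the label.
        return False
    return [ord(c) for c in stored.decode(codec)] == [ord(c) for c in label]
```

For example, a label containing U+00F4 round-trips through "utf-8", "utf-32-be" and "latin-1", while a label containing U+4E2D cannot be stored as "latin-1" at all, which is why the necessary character range matters when a storage encoding is chosen.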
+ + This document defines four distinct types of internationalized + domain names which may exist in the internationalized namespace, + and also describes how each of the above considerations affects + those domain names and their labels. These domain name types are + described throughout the remainder of this section. + + + 4.1. Internationalized Domain Names and Labels + + This section describes the master template rules for all domain + names and labels which may be used in the internationalized + namespace, although subordinate rules and restrictions are also + applied as secondary filters, depending on the intended usage of + the domain name. + + For example, domain names and labels which are to be used as + internationalized host identifiers (either as host names, or as + domain names which are used to specify a host) are restricted to a + specific subset of UCS characters. Meanwhile, domain names and + labels which are compliant with STD13's global rules are + restricted to eight-bit code values, while the domain names and + labels which are used as STD13 host identifiers are restricted to + a specific subset of US-ASCII. + + + Hall I-D Expires: May 2002 [page 20] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + + The following diagram illustrates how the subordinate rules are + applied and interpreted against the master restrictions: +
+                 +-----------------------+
+                 | Internationalized DNs |
+                 +-----------------------+
+                  any UCS character codes
+                    /              |
+                   /               |
+                  /                |
+                 /                 |
+      +-----------+          +-----------+     +------------+
+      | Int. Host |          | STD13 DNs +-----+ STD13 Host |
+      +-----------+          +-----------+     +------------+
+       normalized             character        ASCII letters,
+       subset of              codes 0x00       numbers, and
+       UCS chars              through 0xFF     hyphen char
+
+ As can be seen, the internationalized domain names and labels + rules allow any UCS character code to be stored, although each + particular usage of the domain names and labels will have its + own secondary rules and restrictions.
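The layering shown in the diagram can be sketched as a function that reports the narrowest rule set a label satisfies. This is an illustrative sketch, not a normative test: the function name is introduced here, the hyphen-position rule is taken from section 4.4, and internationalized host identifiers are deliberately left out because checking them requires the normalization and prohibition tables referenced in section 4.2.

```python
import re

# Letters, digits and hyphen, with no hyphen in the first or last position.
_STD13_HOST = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def narrowest_rule_set(label: str) -> str:
    """Return the most restrictive rule set from the diagram that fits."""
    if _STD13_HOST.fullmatch(label):
        return "STD13 host identifier"
    if all(ord(c) <= 0xFF for c in label):
        return "STD13 domain name"
    return "internationalized domain name"
```

A label such as "_tcp" or "joe.admin" is a valid STD13 domain name label but not an STD13 host identifier, while any label with a character above U+00FF only fits the master rules.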
+ + In order to allow future documents to define additional rules as + required for their usage, this document defines very few global + rules on the core internationalized domain names and labels. + + + 4.1.1. IDN syntax and structure + + In this specification, an internationalized domain name consists + of a variable number of labels, each of which contains a variable + number of UCS character codes, not all of which will have defined + UCS character interpretations. + + Furthermore, the encoding system which is used to store and + interpret those values on a system is not relevant to this + specification, and is therefore not defined. The characters in a + label can be stored in memory or on disk as UTF-8, UCS-4, ACE, or + any other storage encoding which is desired by the operators and + implementers of the affected system, as long as that encoding + system is reversible to the canonical UCS character code values, + and is able to represent the necessary range of UCS characters + (the "necessary range" varies by operation). + + + Hall I-D Expires: May 2002 [page 21] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + The only universal restrictions which apply to internationalized + domain names and labels are those which govern length. This + specification requires that labels from internationalized domain + names MUST be restricted to a minimum length of one character and + a maximum length of 63 characters, inclusive. The exception to + this rule is the root domain, which is always represented by a + zero-length label. Note that this rule specifically refers to the + canonical UCS characters, rather than any encoded form (encoding + will often result in labels and domain names with fewer actual + characters, due to overhead from the encoding algorithm).
+ + A fully-qualified internationalized domain name is formed by + joining a series of labels together, with the most-contextually + specific label in the left-most position of the label sequence, + and with the root domain occupying the right-most position. The + sum total of all labels in an internationalized domain name MUST + NOT exceed 255 characters, inclusive. Any number of labels MAY be + stored in the domain name, but the sum total of their lengths MUST + NOT exceed this limit. + + However, labels which contain UCS character codes greater than + U+007F will result in multi-byte UTF-8 and ACE encodings, so the + maximum length of a label or an internationalized domain name is + governed by their UTF-8 and ACE encoded lengths. Both encodings + MUST result in an encoded length of 63 octets or less in order to + be usable, with a maximum cumulative length of 255 octets. + + + 4.1.2. IDN transfer encodings + + The UCS currently occupies a 21-bit range of character code + values, containing tens of thousands of assigned characters, and + hundreds of thousands of unassigned characters. Due to the multi- + byte nature of the code point values, UCS characters cannot be + passed as protocol or application data in most of the existing + Internet protocols (including DNS messages), at least not without + the help of some kind of encoding scheme. At the very least, the + UCS character values have to be encoded as eight-bit sequences if + they are to fit within existing eight-bit data structures, and + have to be encoded as a subset of US-ASCII characters if they are + to be usable with legacy protocols and applications which only use + STD13's host identifier rules for their structured domain name + data types.
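Because encoded UCS characters can occupy several octets, the length limits from section 4.1.1 are effectively limits on the encoded form. The sketch below checks the UTF-8 case only; it assumes that the cumulative limit is the plain sum of the label lengths (the wire-format length octets are not counted), and an equivalent check against the ACE encoded form would still be needed. The function and constant names are introduced here for illustration.

```python
from typing import List

MAX_LABEL_OCTETS = 63    # per-label limit from section 4.1.1
MAX_NAME_OCTETS = 255    # cumulative limit from section 4.1.1


def fits_utf8_limits(labels: List[str]) -> bool:
    """Check every label, and their sum, against the UTF-8 octet limits.

    UTF-8 never uses fewer octets than there are characters, so a name
    that passes this check also passes the character-count limits.
    """
    octets = [len(label.encode("utf-8")) for label in labels]
    return (all(n <= MAX_LABEL_OCTETS for n in octets)
            and sum(octets) <= MAX_NAME_OCTETS)
```

A label of 32 copies of U+00E9 is well under 63 characters but needs 64 octets in UTF-8, so it is rejected even though its character count looks acceptable.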
+ + With this objective in mind, this document defines three different + transfer encoding systems which can be used to convert + + Hall I-D Expires: May 2002 [page 22] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + internationalized domain names and labels into a form which is + suitable for transfer in different data streams. These are the + legacy STD13 octet encoding, ACE, and UTF-8. Each of these + encoding schemes provides different benefits and capabilities to + the internationalized DNS effort. + + * STD13 octets. The STD13 octet encoding scheme provides a + direct one-to-one mapping between eight-bit characters and + their eight-bit values, but it is only capable of storing + character codes in the range of U+0000 through U+00FF, + which severely restricts its usefulness. + + * ACE. The ACE encoding scheme is capable of storing UCS + character code values as seven-bit sequences in STD13 legacy + labels. While this makes it practically compatible with the + legacy host identifier rules, the resulting data imposes + additional labor on the Internet community, and the reuse + of the legacy label also results in certain amounts of + ambiguity with some DNS domain names and labels. + + * UTF-8. The UTF-8 encoding scheme is capable of encoding all + UCS character code values as sequences of eight-bit data + which are compatible with legacy DNS message restrictions, + but the encoded output requires explicit support from + internationalized applications and protocols. UTF-8 output + uses a new label type in order to prevent additional + ambiguity problems from arising. + + The table below illustrates the UCS character code sequences which + are supported by each of the different encoding schemes.
+
+                      STD13
+                      Octets    ACE    UTF-8
+                     +-------+-------+--------
+                     |       |       |
+     US-ASCII        |   Y   |       |   Y
+                     |       |       |
+     Eight-Bit       |   Y   |   Y   |   Y
+                     |       |       |
+     Any UCS Chars   |       |   Y   |   Y
+                     |       |       |
+
+ + Hall I-D Expires: May 2002 [page 23] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + More specifically, the character code sequence ranges and their + valid encodings are: + + * US-ASCII. If a label only contains character codes from the + range of U+0000 through U+007F, then it MAY be encoded as a + legacy STD13 octet sequence or UTF-8, but MUST NOT be + encoded as ACE. + + Note that this specification explicitly prohibits seven-bit + labels from being encoded as ACE data, since such an action + would be redundant, would result in greater processing overhead + for those labels, and would create multiple representations which introduce + problems with caches on legacy systems. Furthermore, + certain security risks would be introduced if this were + allowed. For example, a malicious user could register or + purposefully create an ACE encoded representation of the + "example.com" label sequence such that users mistakenly + sent sensitive data to malicious systems. + + In order to prevent these problems from occurring, this + specification requires that any ACE-encoded label which + consists entirely of seven-bit characters MUST be + immediately discarded with extreme prejudice. This rule + applies to every implementation of this specification, + including any applications, resolvers, caches or servers + which process labels. + + * Eight-bit codes. If a label contains character codes from + the eight-bit range of U+0000 through U+00FF, then it MAY + be encoded as STD13 octet sequences, ACE, or UTF-8. This + rule specifically requires that the label MUST contain at + least one character from the eight-bit range, MAY contain + any number of characters from the seven-bit range, but MUST + NOT contain characters with code values which are greater + than U+00FF.
+ + Since the STD13 octet encoding and ACE both use the legacy + STD13 label type, this specification relies on the input + encoding of a domain name in order to determine the output + encoding. In some cases, however, the input encoding will + not be clear, or will not be specified, and this can result + in some ambiguity with label sequences from this range. + + For example, if the domain name provided in a query + consists of seven-bit labels, then the STD13 octet sequence + is the only valid encoding for the legacy STD13 label, + + Hall I-D Expires: May 2002 [page 24] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + meaning that ACE could not have been used in the query. If + the specified domain name exists as a CNAME resource record + which refers to a domain name that contains eight-bit + character codes, then the proper output encoding for that + domain name will not be clearly discernible. Moreover, the + STD13 and ACE encodings will generate different results, + since the STD13 octet sequence will only contain a single + octet for the eight-bit character, while the ACE encoding + will contain multiple octets of encoded data. + + When this situation arises, systems MUST give preference to + the ACE encoding, on the assumption that the referenced + character is more likely to represent a UCS character than + an eight-bit code value (the UCS characters in this range + are Latin-1, which are the most common characters after the + legacy US-ASCII set). Furthermore, the ACE encoded + representation of these characters allows for a broader + range of subsequent operations (since it complies with the + legacy host naming restrictions, it can be used with CNAME + resource records that refer to hosts), while the STD13 + octet encoded representation does not.
+ + It is possible to avoid this scenario on authoritative zone + servers (and thus the affected caches) by allowing the + operator to specify whether or not the input is Latin-1 UCS + character data or binary data, with the server generating + the proper output accordingly. Also note that the default + encoding specified by this document is UTF-8, which does + not suffer from the ambiguity problems described above. + + * Any UCS character codes. If a label consists of any + character codes greater than U+00FF, then it MAY be encoded + as ACE or UTF-8, but MUST NOT be encoded as STD13 octet + sequences. STD13 is not capable of representing character + codes greater than U+00FF, so it cannot be used with any + UCS characters beyond the eight-bit range. + + Encodings are performed on a per-label basis. Each label MUST NOT + be encoded more than once. Also note that recursive encodings + result in applications discarding the domain name. + + When the STD13 octet encoding is used to encode labels for + transmission, the labels are encoded according to the rules + specified in STD13, and are encapsulated in STD13 legacy labels. + + + Hall I-D Expires: May 2002 [page 25] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + When ACE is used to encode labels for transmission, the labels are + encoded according to the rules specified in , and are + encapsulated in STD13 legacy labels (this process is described in + section 5.2). + + When UTF-8 is used to encode labels for transmission, the labels + are encoded according to the rules specified in RFC2279, and are + encapsulated in EDNS/UTF-8 extended labels (the format of this + label is described in section 5.1). + + Note that a domain name MAY contain any combination of STD13 octet + encoded labels and ACE encoded labels. However, if a domain name + contains any UTF-8 encoded labels, then ALL of the labels from + that domain name MUST be encoded as UTF-8 data. 
This rule + primarily exists so that DNS compression services can be + maintained consistently, but it also prevents mixed referrals + which can trigger unnecessary fall-back processing, and also + provides a single encoding representation to internationalized + systems which benefits efficiency. + + The root domain (as specified by the zero-length label at the + right edge of the domain name) MUST NOT be encoded with ACE. More + specifically, zero-length labels MUST NOT contain any character + data of any kind, and since ACE labels have prefix strings, they + are explicitly forbidden from being used for the root domain. + + + 4.1.3. IDN comparison operations + + When an internationalized domain name label is received from the + network as ACE or UTF-8 encoded data, the labels MUST be decoded + to their canonical UCS character representation, and the resulting + UCS characters MUST be compared as case-exact sequences to their + stored equivalents. Except where specifically required in this + specification (EG, validity tests which are performed by + applications), normalization and case-conversion MUST NOT be + performed against the resulting UCS character codes prior to any + comparison operations being performed. + + However, internationalized domain name labels which are received + as STD13 octet sequences MUST be given special treatment, as these + domain names could have originated from legacy systems operating + under STD13's rules. In this case, the seven-bit US-ASCII + alphabetic characters (U+0041 through U+005A, and U+0061 through + U+007A) from those labels MUST be compared in a case-neutral form. + All other code values MUST be compared as case-exact code values + + Hall I-D Expires: May 2002 [page 26] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + (this particularly includes eight-bit characters, which were not + defined by STD13). + + + 4.2. 
Internationalized Host Identifiers + + Internationalized host identifiers are a subset of the + internationalized domain names described in section 4.1, which + only use a subset of the allowable UCS characters, but which reuse + the global transfer encodings and comparison routines. + + Most of the displayable characters from the UCS can be used in + host identifiers, and there are no additional rules governing the + ordering or length of their labels. However, the characters which + are used in internationalized host identifiers MUST be normalized + and case-converted before they are encoded for storage or + transfer. This requires more effort on the part of applications + and servers when the internationalized domain names are initially + created, but results in less ambiguity and lower processing + requirements for servers, caches and resolvers during subsequent + comparison operations. + + The restrictions which govern the creation of internationalized + host identifiers are as follows: + + a. Labels MUST be restricted to the subset of characters which + are permitted by [nameprep]. Characters which + are prohibited by [nameprep] MUST NOT appear in any label + of any internationalized host identifier. + + b. Labels MUST be normalized through before they + are stored or encoded for transfer. Internationalized host + identifiers will not be normalized as part of any + comparison operation, so systems MUST normalize the labels + before they are stored or transmitted. + + c. Labels MUST be converted to lowercase according to the + case-mappings rules specified in before they are + stored or encoded for transfer. Internationalized host + identifiers will not be converted to lowercase as part of + any comparison operation, so systems MUST convert the + labels to lowercase before they are stored or transmitted.
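Rules b and c above can be sketched with Python's standard library. This is an approximation under stated assumptions: Unicode NFKC normalization and `str.lower()` stand in for the normalization and case-mapping tables that the rules reference, the prohibited-character check from rule a is not performed, and the function name is introduced here for illustration.

```python
import unicodedata


def canonical_host_label(label: str) -> str:
    """Normalize a host identifier label, then convert it to lowercase."""
    normalized = unicodedata.normalize("NFKC", label)
    return normalized.lower()


# U+0041 U+0301 U+0042 composes to U+00C1 U+0042, then lowercases.
sample = canonical_host_label("A\u0301B")
print([f"U+{ord(c):04X}" for c in sample])   # ['U+00E1', 'U+0062']
```

Applying the function a second time leaves the label unchanged, which is what allows resolvers, caches and servers to compare stored labels as case-exact sequences without repeating this work.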
+ + According to the rules above, a label from an internationalized + host identifier which was originally created with the UCS + character sequence of (U+0041 U+0301 U+0042) would be + normalized and lowercased to (U+00E1 U+0062). The normalized, + lowercase form would be used as the canonical UCS character + representation of that label when it was encoded for storage and + transmission purposes, and would be the form which was used for + comparison operations on any resolvers, caches and servers. + + Internationalized host identifiers which are received from the + network can contain labels which have been encoded as STD13 octet + sequences, ACE or UTF-8. In all of these cases, the comparison + rules defined in section 4.1.3 MUST be applied. + + + 4.3. STD13 Domain Names + + STD13 allows any eight-bit code values to be used in domain name + labels. However, STD13 host identifiers (as described in section + 4.4 of this specification) are the most common form of STD13 + domain names, and have much tighter restrictions. + + There are common uses of STD13 domain names which do not comply + with the STD13 host identifier subset, however. One common example + of this is SRV identifiers, which use an underscore character + (U+005F) as part of their label syntax. Another common example is + found when email addresses are provided in SOA and RP resource + records, and where the left-hand side of the email address is + stored as an STD13 domain name label which does not represent a + host identifier. Furthermore, email addresses often contain extra + characters which are not legal in STD13 host identifiers, such as + a full-stop character (U+002E). For example, "joe.admin" could be + stored as an STD13 domain name label in the fully-qualified domain + name of "joe.admin.example.com.", which would represent the email + address of "joe.admin@example.com" when that domain name was + extracted from the SOA or RP resource record and processed. 
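The "joe.admin" example shows why a domain name has to be handled as a sequence of labels rather than as a dotted string. The following sketch, with a helper name introduced here for illustration, rebuilds the mailbox from a label sequence and contrasts it with a naive split on the full-stop character.

```python
from typing import List


def mailbox_from_labels(labels: List[str]) -> str:
    """Treat the first label as the local part and the rest as the domain."""
    local, *domain = labels
    # Drop the zero-length root label before joining the mail domain.
    return local + "@" + ".".join(label for label in domain if label)


# Label-aware handling keeps the full-stop inside the first label.
print(mailbox_from_labels(["joe.admin", "example", "com", ""]))
# joe.admin@example.com

# Splitting the presentation string loses the label boundary.
print(mailbox_from_labels("joe.admin.example.com.".split(".")))
# joe@admin.example.com
```

The second result is wrong for this record, which is why implementations need to preserve label boundaries (or the STD13 escaping) when such names are stored and displayed.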
+ + Implementations of this specification MUST allow STD13 domain + names to be created and stored, using the following rules: + + a. Labels MUST be restricted to the code values of U+0000 + through U+00FF. Restrictions on character content MUST NOT + be applied (note that if this domain name will be used as + part of an STD13 host identifier, the rules specified in + section 4.4 MUST be used instead). + + + Hall I-D Expires: May 2002 [page 28] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + b. Labels MUST NOT be normalized or lowercased before they are + stored or encoded for transfer. + + c. Systems MUST allow STD13 domain names to be specified as + exact sequences of eight-bit octet values, and MUST NOT + treat these sequences as canonical UCS characters which are + normalized or lowercased. STD13 defines an escaping + mechanism whereby the decimal value of the octet is + prefaced with a reverse-solidus (such as "\193"), which is + suggested for this usage. + + STD13 domain names which are received from the network can contain + labels which have been encoded as STD13 octet sequences, ACE or + UTF-8. In all of these cases, the comparison rules defined in + section 4.1.3 MUST be applied. Note that some of these sequences + can contain octet code values which have not been normalized or + lowercased by the originating system, since these values can be + used to specify binary domain names. + + + 4.4. STD13 Host Identifiers + + This document does not deprecate, replace or modify the host name + rules defined by RFC952, STD3 or STD13 as they apply to legacy + host identifiers. However, there are several issues which affect + the usage of these domain names and their labels in this system. + + The characters which are currently defined as valid in + STD13 host identifiers are the uppercase and lowercase letters, + numbers and hyphen character from US-ASCII. No other characters + are allowed to be used.
Furthermore, the current rules also + prohibit the use of the hyphen character in the first or last + character position of a host identifier label. + + Implementations of this specification MUST allow STD13 host + identifiers to be created and stored, using the following rules: + + a. Labels MUST be restricted to the code values of U+002D, + U+0030 through U+0039, U+0041 through U+005A, and U+0061 + through U+007A. + + b. Labels MUST NOT contain the code value of U+002D in either + the first or last character position of the label. + + + Hall I-D Expires: May 2002 [page 29] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + c. The alphabetic characters MUST be converted to lowercase + before they are stored or transmitted. STD13 host + identifiers are always compared in a case-neutral form. + + STD13 host identifiers which are received from the network can + contain labels which have been encoded as STD13 octet sequences or + UTF-8. In both cases, the comparison rules defined in section + 4.1.3 MUST be applied. + + + 5. Transfer Encodings and Label Types + + As was discussed in section 4.1.2, internationalized domain names + and labels are required to be encoded as either eight-bit or + seven-bit data whenever they are transmitted as protocol or + application data. + + The particular output encoding format which will be used for any + given label will be primarily determined by the capabilities of + the participating end-point systems. If the application or + protocol which is relaying the domain name labels supports + internationalized domain names directly then UTF-8 encoded labels + can be used, but if the protocol or application is only capable of + supporting STD13 host identifiers as domain name data, then the + STD13 octet and/or ACE encoded labels will have to be used. + + With DNS messages in particular, the "data type" is the label + encapsulation in use.
Although STD13 legacy labels allow for the + use of eight-bit codes, multiple encodings for the same basic + character data result in interpretation problems without some form + of ancillary tagging service. For this reason, each encoding is + represented differently by this specification. When the STD13 + legacy label contains STD13 octet sequences then no tagging is + provided, but if the STD13 legacy label contains ACE encoded data + then the encoded sequence is tagged with an ACE identifier (a + character prefix which does not normally appear in labels). When + UTF-8 domain names are provided, an EDNS/UTF-8 extended label is + used to encapsulate the internationalized domain name. + + Furthermore, the encoding which is used for any label in the + message will also determine the label type which is used to + encapsulate and transfer the entire domain name. If any label + is carried in an EDNS/UTF-8 extended label, then all of the labels from + that domain name are required to be encapsulated for transfer in + EDNS/UTF-8 extended labels. Conversely, if a domain name contains + ACE or STD13 octet encoded labels, then all of the labels from + + Hall I-D Expires: May 2002 [page 30] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + that domain name are required to be encapsulated for transfer + using the STD13 legacy label format. + + Note that other legacy applications and protocols will most likely + be required to provide extended encodings or negotiation features + before they can exchange internationalized domain names directly. + However, new applications and protocols which are subsequently + written to comply with BCP18 and this specification should not + require any such effort, as they should be capable of transferring + UTF-8 domain names from the beginning. + + + 5.1. The EDNS/UTF-8 Label Type + + Any internationalized domain name label which has been encoded as + UTF-8 for transmission in a DNS message MUST be encapsulated as an + EDNS/UTF-8 label.
+ + The EDNS/UTF-8 extended label is an instance of EDNS extended + label types (as defined by RFC2671). Extended labels are indicated + by the leading bit pattern of 0b01 in the label type field (the + first two bits from the "label length" octet of the STD13 legacy + label type), with the remaining six bits of this octet indicating + the extended label type in use. The EDNS/UTF-8 label type uses the + binary value of 0b000011 for this indication (note that IANA may + change this assignment). + + EDNS/UTF-8 labels contain two subordinate units of data. The first + octet contains a length indicator which works exactly the same as + the length octet used by STD13 legacy labels: if the first two + bits of this octet are 0b00 then the rest of that octet provides + the length of the label data field, but if the first two bits of + this octet are 0b11 then the label is a pointer to some other + label, and the remainder of the length octet provides an off-set + which points to the length octet of the referenced label, as per + the rules provided in section 4.1.4 of RFC 1035 (STD13, part 2). + + + Hall I-D Expires: May 2002 [page 31] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + The structure of the EDNS/UTF-8 extended label is illustrated by + the following figure. + + 1 1 1 1 1 1 1 1 1 1 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |0 1|0 0 0 0 1 1| length | label data /// | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + 0b01 - The extended label identifier. + + 0b000011 - The EDNS/UTF-8 extended label type identifier. + + Length - The number of octets in the label data, or the off- + set to the length octet of another EDNS/UTF-8 label. + + Label data - The label data, encoded as UTF-8 octets. + + The following example shows the domain name of me.com, where the + "e" in "me" is the UCS character LATIN SMALL LETTER E WITH ACUTE + (U+00E9), which has the UTF-8 encoded octet sequence of 0xC3A9.
+ + +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ + 20 | 0 1 0 0 0 0 1 1| 0x03 | + +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ + 22 | 0x6D (m) | 0xC3 (e') | + +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ + 24 | 0xA9 (e') | 0 1 0 0 0 0 1 1| + +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ + 26 | 0x03 | 0x63 (c) | + +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ + 28 | 0x6F (o) | 0x6D (m) | + +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ + 30 | 0 1 0 0 0 0 1 1| 0x00 | + +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ + + Octet 20 identifies the EDNS/UTF-8 extended label type, while + octet 21 indicates that the label is three octets long. Octet 22 + contains the UTF-8 value for lowercase "m", while octets 23 and 24 + contain the UTF-8 value for the UCS character U+00E9 (encoded as 0xC3A9). + + Similarly, octet 25 identifies another EDNS/UTF-8 extended label + type and octet 26 indicates that the label is three octets + long, while octets 27 through 29 contain the UTF-8 values for the + lowercase alphabetic sequence of "com". + + Hall I-D Expires: May 2002 [page 32] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + + Finally, octet 30 identifies another EDNS/UTF-8 extended label + type, while octet 31 indicates that the label is zero octets in + length, thereby signifying the root zone (the end of the queried + domain name). + + Note that the use of the EDNS/UTF-8 extended label type serves + multiple purposes. On the one hand, it provides a method of + signaling the resolver's capabilities to the server, so that the + server can determine which format it needs to use when returning + answers, referrals or errors. Moreover, using an encapsulation + format which is not backwards compatible prevents certain + ambiguity problems which can result from overloading the STD13 + legacy label with multiple encodings.
These problems are seen in + certain situations with STD13 octet encoding and ACE, where a + server cannot adequately determine which encoding a resolver + desires. By using a separate extended label type for UTF-8, these + kinds of ambiguities are avoided. + + There are additional benefits which come from using EDNS extended + label types, which are best expressed as "future possibilities". + Once the EDNS extended label mechanisms are widely deployed, it + becomes feasible to specify additional encoding mechanisms as soon + as the Internet community deems it desirable. In this regard, + defining alternative encodings is much easier the second time. + + + 5.2. The STD13 Legacy Label Type + + Any internationalized domain name label which has been encoded as + ACE or STD13 octet sequences for transmission in a DNS message + MUST be encapsulated within an STD13 legacy label. + + This document does not deprecate, replace or extend the STD13 + octet encoding or label encapsulation rules defined by STD13. + However, this document does provide some guidance on the creation + and interpretation of ACE encoded labels when they are stored in + legacy labels, which is necessary in order for recipient systems + to properly detect and decode the label contents. + + Note that STD13 octet sequences and ACE data MAY both be provided + in the same domain name. As such, each STD13 legacy label from a DNS + message must be examined and processed independently. + + + + Hall I-D Expires: May 2002 [page 33] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + 5.2.1. ACE encoded labels + + ACE encoded labels always begin with a fixed prefix character sequence + (this document uses "zz--" as a placeholder sequence until a + formal assignment is made). Any label which contains ACE encoded + data MUST begin with this character sequence prefix.
Similarly, + any label which begins with this character sequence MUST be + recognized and processed as an ACE encoded label, according to the + rules defined in this specification. + + Encoding and encapsulating a label as ACE data is a three-part + process, as follows: + + a. Encode the canonical UCS character data from the + internationalized domain name label into ACE using the + procedure defined in + + b. Preface the encoded output with the "zz--" prefix sequence, + thereby indicating that this label contains ACE encoded UCS + character data. + + c. Determine the length of the encoded data and store this + value in the STD13 legacy label's length octet. + + Decoding an ACE label is the opposite of that process. + + Note that whenever the ACE algorithm encounters a seven-bit + character code in the input, it is passed through unmodified to + the encoded output. If a label only contains seven-bit character + codes, the label MUST NOT be encoded as ACE, and MUST be encoded + as either STD13 octet sequences or UTF-8. Forcing a seven-bit + label to be encoded as ACE provides no benefit, incurs additional + processing on the end-point systems, and can also expose certain + security risks. Any system which is capable of generating and + deciphering ACE encoded labels is required to treat such sequences + as hostile, and MUST dispose of them immediately without any + further processing; systems are forbidden to even + return these labels in DNS error messages. + + Similarly, ACE MUST NOT be used to encode any zero-length labels + (including but not specifically limited to the root domain), since + the presence of prefix characters in these labels can invalidate + their protocol-specific interpretations.
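The encapsulation steps and restrictions above can be sketched in Python as follows. This is a hedged illustration only: the ACE algorithm itself is not reproduced in this section, so it is passed in as a caller-supplied function (ace_encode), and all of the function names are hypothetical. The sketch assumes that the length octet covers the prefix as well as the encoded data, since both are carried in the label's data field, and that the STD13 limit of 63 octets per label applies.

```python
from typing import Callable

ACE_PREFIX = "zz--"      # placeholder prefix used by this document
MAX_LABEL_OCTETS = 63    # STD13 limit for the data field of a legacy label


def may_encode_as_ace(label: str) -> bool:
    """Zero-length and seven-bit-only labels MUST NOT be encoded as ACE."""
    return len(label) > 0 and any(ord(ch) > 0x7F for ch in label)


def build_ace_label(label: str, ace_encode: Callable[[str], str]) -> bytes:
    """Return the length octet followed by the prefixed ACE data."""
    if not may_encode_as_ace(label):
        raise ValueError("label is empty or seven-bit only; ACE is not permitted")
    body = (ACE_PREFIX + ace_encode(label)).encode("ascii")  # steps a and b
    if len(body) > MAX_LABEL_OCTETS:
        raise ValueError("encoded label exceeds 63 octets")
    return bytes([len(body)]) + body                         # step c


def has_ace_prefix(label_data: bytes) -> bool:
    """Receiver-side test for the ACE prefix in a legacy label's data."""
    return label_data.startswith(ACE_PREFIX.encode("ascii"))


def toy_encoder(label: str) -> str:
    """Stand-in for illustration only; this is NOT a real ACE algorithm."""
    return label.encode("utf-8").hex()


print(build_ace_label("m\u00e9", toy_encoder))
```

Keeping the ACE algorithm behind a function parameter means the prefix and length handling can be exercised independently of whichever ACE procedure is eventually referenced.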
+ + When an STD13 legacy label is received which has "zz--" in the + first four character positions, the label MUST be treated as an + + Hall I-D Expires: May 2002 [page 34] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + ACE-encoded internationalized domain name, and MUST be decoded to + its canonical UCS character values for further processing. + + Note that STD13 legacy labels MUST be verified before the ACE + encoded data is extracted (as per the rules defined in STD13 which + govern the STD13 legacy label type), but systems which are + compliant with this specification MUST perform all subsequent + comparison, caching, or storage operations against the canonical + UCS characters, and MUST NOT use the ACE encoded label sequence + for any of these operations. + + Note that the legacy systems which are not compliant with this + specification will treat ACE encoded labels as any other STD13 + legacy label. + + + 5.2.2. STD13 octet encoded labels + + Any STD13 legacy labels which do not begin with the ACE prefix + MUST be treated as STD13 octet encoding sequences. The rules for + this process are defined by STD13's default label encapsulation + services, although this document also provides some clarifications + on the use of this encoding with internationalized domain names + and labels. + + Whenever the STD13 octet sequence is used to encode the labels + from an internationalized domain name, the octet values of the + canonical UCS characters are stored directly in the label. Because + the DNS message is limited to octets, the range of UCS character + codes which are eligible for use with STD13 octet sequences is + limited to U+0000 through U+00FF. If any UCS character codes + outside this range need to be transferred, the internationalized + domain name label will have to be encoded as ACE or UTF-8. 
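The range restriction above, and the alternative of moving the whole domain name to EDNS/UTF-8 extended labels (section 5.1), can be sketched in Python as follows. The function names are hypothetical, the label type value 0b000011 is the provisional assignment quoted in section 5.1, and compression pointers are deliberately not handled.

```python
EDNS_UTF8_TYPE = 0b01000011  # 0b01 extended-label flag + provisional type 0b000011
MAX_LABEL_OCTETS = 63        # largest value the six-bit length field can carry


def octet_eligible(label: str) -> bool:
    """True if every character is within U+0000 through U+00FF."""
    return all(ord(ch) <= 0xFF for ch in label)


def encode_std13_octets(labels) -> bytes:
    """Encode a whole name as STD13 legacy labels holding octet sequences."""
    out = bytearray()
    for label in labels:
        if not octet_eligible(label):
            raise ValueError("label needs ACE or UTF-8: %r" % label)
        data = label.encode("latin-1")  # U+0000..U+00FF map directly to octets
        if len(data) > MAX_LABEL_OCTETS:
            raise ValueError("label exceeds 63 octets")
        out += bytes([len(data)]) + data
    out.append(0)  # zero-length root label
    return bytes(out)


def encode_edns_utf8(labels) -> bytes:
    """Encode a whole name as EDNS/UTF-8 extended labels."""
    out = bytearray()
    for label in labels:
        data = label.encode("utf-8")
        if len(data) > MAX_LABEL_OCTETS:
            raise ValueError("label exceeds 63 octets")
        out += bytes([EDNS_UTF8_TYPE, len(data)]) + data
    out += bytes([EDNS_UTF8_TYPE, 0])  # zero-length root label
    return bytes(out)


name = ["m\u00e9", "com"]
print(encode_std13_octets(name).hex())
print(encode_edns_utf8(name).hex())
```

For the example name used in section 5.1, the second call yields the same twelve octets that are shown at offsets 20 through 31 of that section's figure.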
+ + Note that comparison operations for the seven-bit range of + alphabetic character values MUST be performed in a case-neutral + form, although eight-bit code values MUST NOT be normalized or + case-converted as part of a comparison operation. These rules are + required in order to ensure backwards compatibility with the STD13 + compliant systems which may be generating these labels as parts of + an STD13 domain name while also supporting the normalization and + case-conversion which may have been applied to the UCS characters + in the storage or transfer encoding systems. + + + + Hall I-D Expires: May 2002 [page 35] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + 6. Application Guidelines + + As was discussed in section 3.3, there are multiple scenarios in + which an application can make use of internationalized domain + names, ranging from simple lookups of connection identifiers to + abstract encapsulations of unstructured application data. This is + an extremely broad range of uses, which is complicated by the + extreme pervasiveness of applications and protocols that use + domain names for one or more of these purposes. + + Furthermore, network applications face a complex array of input + and output operations which will cumulatively affect the ability + of that application to make use of the internationalized domain + name system for various services and functions. 
These issues are + illustrated by the figure below: + + [IDNs] [IDNs] + | ^ + | | + +------V------+ +------+------+ + | input | | output | + | charset | | charset | + +-----------+-+ +-+-----------+ + \ / + +---+-----+---+ + | Application | + +---+-----+---+ + / \ + +-----------+-+ +-+-----------+ + | lookups | | app data <---> [IDNs] + +------+------+ +-------------+ + | + +------+------+ + | resolver <---> [IDNs] + +-------------+ + + As can be seen, the ability of an application to completely adopt + internationalized domain names will be determined by many factors, + any one of which could prevent the application from completely + incorporating the restrictions and recommendations prescribed by + this specification. + + In order to allow for a flexible adoption schedule, this + specification defines very few mandates that applications must + adopt, but instead focuses on recommendations which applications + should comply with whenever they need to use internationalized + + Hall I-D Expires: May 2002 [page 36] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + domain names, and also provides recommendations for situations + where the preferred behavior is not feasible. Applications which + are compliant with all of the recommendations provided in this + specification will be able to generate, store, transfer and + resolve internationalized domain names throughout all of their + operations, using UTF-8 as a common encoding for all of these + operations. Meanwhile, applications which are not in complete + compliance with this specification will still be able to make use + of the internationalized domain names in these operations, + although such access may be limited to using backwards-compatible + encodings which require greater amounts of effort to implement and + which provide fewer benefits. + + + 6.1.
Input and Output Charsets + + If an application is unable to accept, process, store or display + characters from the complete UCS repertoire, that application's + support for internationalized domain names will be somewhat + limited, by definition. + + Although this document does not mandate any particular charset or + encoding which all applications must use for all operations, + applications SHOULD use coded character sets or encodings which + can handle characters from a reasonable number of scripts. + + In particular, the following areas have specific requirements: + + * Input charsets and encodings. Since UTF-8 is used as the + default encoding for internationalized domain names + throughout this specification (and others, such as BCP18), + UTF-8 is also RECOMMENDED for use with input encodings of + internationalized domain names in particular, although this + is not required. Many platforms and development + environments support UTF-8 as a local encoding of the UCS + and it can be reasonably used with many types of input + (such as configuration files), although many systems will + require a specific encoding (such as UCS-2, or ISO/IEC + 8859-1) in situations which require memory access or + keyboard input. + + Regardless of the input encodings used, implementations + MUST map domain names and labels to their canonical UCS + characters for any normalization and case-conversion work + which is subsequently required by any DNS lookups (see + section 6.3). + + Hall I-D Expires: May 2002 [page 37] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + + * Output choices will likely be limited to a system-preferred + charset or encoding. In general, this document RECOMMENDS + that output systems choose an output charset or encoding + which reflects the data being provided. 
However, + applications MUST NOT display unknown characters with + generic replacement characters (such as boxes or circles) + if it is known that the original characters are not + available for display with the specified charset, as such + characters will almost certainly trigger failure conditions + in subsequent protocol operations. + + In those situations where adequate input or output charsets or + encodings are unavailable, applications MAY use ACE to encode + internationalized domain names for the purpose of ensuring that + the data is provided intact. Since ACE is capable of representing + UCS characters as sequences of seven-bit characters, it is + functionally usable as a last line of defense in almost any + environment, with the caveat that ACE encoding sequences are + extremely cryptic and will likely result in lower levels of + usability and functionality. + + + 6.2. Protocol and Application Data + + There are several interrelated issues which will determine an + application's ability to provide or accept internationalized + domain names as protocol or application data, although the + principal determining factors for any such usage will generally be + the capabilities of the underlying protocol itself. + + If a protocol allows negotiation or tagging services in order to + distinguish between different encodings, that protocol can likely + be extended to support the use of UTF-8 as protocol or application + data through command/response negotiation options or through data- + type tags. Older protocols which do not provide any negotiation + services or which mandate the use of US-ASCII in all data will + likely require the use of ACE encoded domain names as a short-term + measure until the protocol is made compliant with BCP18. + + * Protocol data. If the protocol supports UTF-8 encoded + internationalized domain names in commands or responses, + then that encoding SHOULD be used wherever it is allowed.
+ If UTF-8 is not supported by the protocol, STD13 octet + sequences and/or ACE encoded equivalents of the + internationalized domain name MUST be used. + + Hall I-D Expires: May 2002 [page 38] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + + In some cases, this negotiation can be performed on a per- + session basis, while in other cases this work will need to + be performed for each transaction within the session, while + in other cases the internationalized domain names will have + to be tagged whenever they are provided as protocol or + application data. + + The DNS protocol is itself an example of a protocol which + requires tagging in order for internationalized domain + names to be exchanged within the existing DNS message (with + these indicators taking the form of ACE encoding prefixes + and EDNS/UTF-8 extended label type codes). Meanwhile, a + protocol such as WHOIS can theoretically support a session- + wide negotiation option that allowed the use of + internationalized domain names as protocol and application + data for the duration of that session. Conversely, a + protocol such as SMTP will likely require the use of + session-specific identifiers for some operations, while + other operations may be able to use label tags (similar to + the existing support for domain literals, which are + identified by a pair of surrounding square brackets). + + Regardless of the encodings which are used, implementations + MUST map domain names and labels to their canonical UCS + characters for any normalization and case-conversion work + which is subsequently required as part of a DNS lookup (see + section 6.3). + + * Structured application data. Structured application data + such as URLs and email addresses MUST be processed + according to the rules which govern those data formats. 
+ Applications MUST NOT perform any conversion or + transliteration which is not explicitly prescribed by the + governing documents, since non-standard usages are likely + to result in misinterpreted data. + + * Unstructured application data. Domain names which appear as + unstructured data in application content are beyond the + control of this specification, and are generally subject to + the encoding and formatting desires of the end-users who + created the data. Generally speaking, it is RECOMMENDED + that applications allow users to enter or view documents in + whatever format they prefer, but that any conversion + between multiple source and destination charsets and + encodings use UCS as the translation intermediary, such + + Hall I-D Expires: May 2002 [page 39] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + that internationalized domain names are properly converted + along with the rest of the application data. + + In some cases, the application will need to probe the resolver + before it can use internationalized domain names as data. For + example, a participating system may need to determine the + internationalized domain name of the local system so that it can + provide this data in a protocol-specific banner message, and in + these cases, the application will have to communicate with the + resolver before this data can be provided. + + Due to the usage-specific nature of internationalized domain names + within protocol and application data streams, each development + group will have to analyze the restrictions and capabilities which + affect their specific services independently. + + + 6.3. 
DNS Lookups and Resolver Calls + + One of the most frequent uses for domain names is for lookup + operations, such as for locating the IP addresses associated with + a specified domain name, determining the domain name associated + with a specified IP address, or performing a protocol-specific + lookup operation for a specific resource record (such as the MX or + SOA resource records associated with a specific domain). + + Since these lookup operations do not directly affect external + protocols or data, internationalized domain names can be used for + lookup operations at the application's discretion. For example, + applications such as ping and netstat only use domain names for + display purposes, and can therefore make immediate use of + internationalized domain names within their protocol operations. + Similarly, a protocol can be limited to STD13 host identifiers as + protocol identifiers which will require the application to provide + internationalized domain names as ACE encoded sequences, but any + lookup operations which are necessary for the internationalized + domain names can still be performed in their native form. In these + cases, the protocol operations and lookup operations are separate + tasks with separate rules. + + Similarly, applications are not required to use internationalized + domain names and internationalized resolver APIs for every lookup. + In some cases, it may be more efficient for an application to only + use internationalized domain names for lookup operations against + connection identifiers, and to use STD13 octet sequences or ACE + encoded legacy lookups for domain names which were obtained as + + Hall I-D Expires: May 2002 [page 40] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + protocol or application data (this will be especially true in + those cases where the protocol does not yet provide an + internationalized domain name data-type). 
In those cases where an + application prefers to use the legacy resolution path, the + application MUST use the resolver's legacy APIs. For lookups + against internationalized domain names, the application MUST use + the resolver's internationalized APIs. + + Note that this specification does not define a mandatory encoding + which must be used between the applications and the local + resolver. However, resolvers MUST provide at least one encoding + which is capable of supporting the entire UCS repertoire of + character codes, including character codes which are currently + unassigned. Since UTF-8 is the default encoding which is used + throughout this specification, it is also RECOMMENDED for use with + resolver APIs, although this is not required. Resolvers MAY + dictate a local encoding, with the only requirement being support + for the entire range of UCS character codes. + + Regardless of the data being provided or the charset or encoding + which is used to provide that data, applications MUST normalize + and case-convert any internationalized host identifiers which they + generate or receive from a lookup operation. This process MUST + use the canonical UCS characters of the domain name according to + the rules specified in for every host identifier which + is sent to or received from a resolver. + + If the application knows that the requested data specifically + refers to a host identifier, then the domain name data which is + returned by the resolver MUST be normalized and case-converted, + and the resulting domain name MUST be compared to the original + domain name which was received prior to the normalization and + case-conversion steps. If the processed domain name does not match + the domain name which was received, the domain name MUST be + discarded as malformed.
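The verification step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the normative procedure: the normalization and case-conversion rules are defined elsewhere in this specification and are not reproduced here, so Unicode NFC and Python's built-in lowercasing are used as stand-ins, and the function names are hypothetical. The sketch also splits the name on full-stop characters, which is only adequate for host identifiers.

```python
import unicodedata


def canonical_host_label(label: str) -> str:
    """Normalize and lowercase one label (NFC is an assumed stand-in)."""
    return unicodedata.normalize("NFC", label).lower()


def accept_host_identifier(received: str) -> bool:
    """Return True only if the received name is already canonical.

    The name is processed label by label and the result is compared
    with what was received; a mismatch means the name is malformed
    and has to be discarded.
    """
    labels = received.split(".")
    return all(canonical_host_label(label) == label for label in labels)


print(accept_host_identifier("\u00e1b.example.com"))
print(accept_host_identifier("A\u0301B.example.com"))
```

With these stand-in rules, the label (U+0041 U+0301 U+0042) from section 4.2 maps to (U+00E1 U+0062), so a response that still carries the unprocessed form is rejected.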
+ + This step is necessary in order to ensure the integrity and + veracity of internationalized domain names which are processed by + applications, since there are multiple opportunities for errors to + be introduced (such as mistyped entries in the resolver's hosts + database, or malicious data which has been purposefully provided + in a zone), and these errors can result in sensitive data being + directed to the wrong network. Note that the above rule + specifically applies to host identifiers and not to all + internationalized domain names as a whole; applications MUST NOT + arbitrarily normalize and case-convert any and all domain names, + + Hall I-D Expires: May 2002 [page 41] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + but MUST apply these steps to any and all domain names which are + known to be used as host identifiers. + + As part of the processing rules for DNS lookups, it is expected + that an application can exchange internationalized domain names + with the resolver using a charset or encoding which is capable of + representing the entire UCS character code range. Towards this + objective, applications SHOULD test the capabilities of the + resolver prior to transferring internationalized domain names. In + those situations where the resolver is unable to support this + usage, the application MUST encode the internationalized domain + name as STD13 octet sequences or ACE, and pass the resulting STD13 + host identifier to the resolver. + + + 7. Resolver Guidelines + + Resolvers play a crucial role in the use of internationalized + domain names, in that they provide the internationalized namespace + which applications work with. 
As part of this service, resolvers + provide encapsulation services for the internationalized domain + names which are exchanged with the applications, resolve queries + in the internationalized namespace on behalf of the applications, + and provide lookup matching for entries which are stored in a + local hosts database. Note that resolvers which cache answer data + for subsequent operations are also governed by the caching + restrictions provided in section 9. + + + 7.1. Resolver APIs + + Stub resolvers which communicate directly with applications that + are compliant with this specification are strongly encouraged to + provide a separate set of APIs for those applications to use + whenever internationalized domain names need to be provided in + queries or response messages. + + The use of an internationalized API will generally facilitate + smoother operations for the applications, in that it will allow + the application to determine the capabilities of the resolver, to + obtain the internationalized domain name of the local system, and + to process queries for internationalized domain names as special + data types. + + Furthermore, the use of internationalized versus legacy APIs + provides a way for resolvers to separate internationalized and + + Hall I-D Expires: May 2002 [page 42] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + legacy application query paths, such that the legacy APIs only + result in STD13 legacy labels, while the internationalized APIs + generate and trigger EDNS/UTF-8 extended labels. The output + formatting of the DNS messages is controlled by tight + restrictions, and the use of alternative APIs will likely result + in simpler resolver implementations. + + For example, it is suggested that applications use the + internationalized APIs for all of the DNS lookups they generate, + even if the domain name only contains seven-bit characters.
This + is required in case the queried domain name only exists with a + CNAME or PTR resource record which references an internationalized + domain name, and the server has to know which encoding to use for + that query. If the client had not used the internationalized API + for the original lookup of the domain name, the resolver may have + chosen the wrong label type, and thus the response data would only + be returned as ACE encoded data. + + Conversely, older applications which generate malformed eight-bit + queries through the legacy APIs will result in those queries being + properly rejected by the DNS servers, preventing undue problems + with these applications from occurring. For example, an older + application may process an internationalized domain name through + the system-default charset or encoding (such as MacRoman), which + would result in the domain name being malformed when the + application tried to do something important with that domain name + (such as send an email message over SMTP). The use of multiple + APIs causes these malformed applications to break, and the invalid + domain names are kept out of the application protocol space. + + Internationalized APIs are optional to the extent that an + application MAY use an embedded resolver which is known to be + capable of generating and processing internationalized domain + names through the existing function calls. However, the use of + separate APIs for internationalized domain names is encouraged. + + Although this document does not mandate any specific APIs, the + following functions SHOULD be provided for in some form: + + * Test Wide. Applications MUST be able to test the resolver + for compliance with this specification. In those cases + where this function is performed by some other function + (such as one of the following), the capabilities of the + resolver MUST be detectable even if the requested operation + fails. 
For example, if an application issues a call for the + internationalized domain name of the local system, the + + Hall I-D Expires: May 2002 [page 43] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + capability of the resolver to handle internationalized + domain names MUST be uniquely represented even if the local + host name cannot be determined. + + * Get Wide X-By-Y. Applications SHOULD be able to specify any + resource record associated with any internationalized + domain name as part of a lookup operation. Whether this + service is provided as a series of lookup-specific APIs or + as a general purpose API is up to the resolver. + + * Get Wide Local Name. Applications which utilize + internationalized domain names as data will need to be able + to determine the internationalized form of their local + system name for some operations (such as a protocol- + specific welcome banner). When this function is called, the + resulting data MUST be provided as the canonical UCS + character code values, or their equivalent as represented + by a locally mandated charset or encoding. + + Note that an ACE equivalent of the system name SHOULD be + returned when the relevant legacy API is queried. In those + cases where the legacy and internationalized domain names + both contain seven-bit character codes (possibly because + the host name is only available in US-ASCII, or because the + host name was assigned as ACE by an external configuration + service), the internationalized host name MUST still be + accessible through the internationalized function. + + Note that this document does not specify a charset or encoding + which must be used by the resolver APIs. However, wherever an + internationalized API is presented, the resolver MUST utilize a + charset or encoding which supports the entire UCS repertoire of + character codes, including character codes which are currently + unassigned. 
Since UTF-8 is the default charset for most of the + operations specified in this document, it is also RECOMMENDED for + this service, but is not required. + + + 7.2. Query Processing Services + + Resolvers which are compliant with the recommendations provided in + this specification will provide two query paths, one of which + supports STD13 domain names and another which supports + internationalized domain names. Technically, there is no + requirement for two processing paths, although these paths will + + Hall I-D Expires: May 2002 [page 44] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + likely exist as conceptual paths even if they are not represented + or implemented uniquely in all resolvers. + + The legacy processing path is defined by STD13. This document does + not update, modify or extend the rules that resolvers operate + under when an STD13 compliant domain name is received by a legacy + application through any legacy APIs which may exist. However, when + an internationalized domain name is received from an + internationalized application through any internationalized APIs, + the processing rules defined in this section MUST be followed. + Note that these rules apply to all resolvers, whether they are + stub resolvers, forwarders or caching servers. + + Generally speaking, the internationalized domain name resolution + process has two major components: processing internationalized + domain names as queries, and performing fall-back processing if an + EDNS/UTF-8 query is rejected by an authoritative server. + + + 7.2.1. Internationalized queries + + Queries for internationalized domain names which are received + through internationalized APIs can be expected to have originated + at an application which is capable of accepting and processing + internationalized domain names in the response messages. 
+ + Resolvers MUST encode the labels from the queried domain name as + UTF-8 and encapsulate the resulting encoded labels into EDNS/UTF-8 + extended labels for transfer within DNS messages, per the + instructions provided in section 5.1. + + Any and all responses to these queries will also be encoded as + UTF-8 and encapsulated in EDNS/UTF-8 extended labels. Resolvers + MUST decode the provided response data, convert the labels to + their canonical UCS character codes, and return the requested data + to the calling application. + + The resolver MUST NOT normalize or case convert internationalized + domain names which may be received in queries or response + messages. Since the queries have originated from applications + which have indicated that they are compliant with this + specification (via the API) while the responses will have + originated from caches or servers which indicate that they are + also compliant (via the EDNS/UTF-8 extended labels), those systems + are assumed to have normalized and case-converted the domain names + before they were generated or stored. Also note that applications + + Hall I-D Expires: May 2002 [page 45] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + will validate the host identifiers that they receive in response + messages, so an additional check is expected to be performed on + the answer data by those systems. + + + 7.2.2. Fall-back processing + + If a queried server is unable to process EDNS/UTF-8 extended + labels, then it is required by STD13 to generate an error + signifying the problem. Resolvers MUST interpret these errors, + decode the UTF-8 queried domain name, re-encode it as STD13 octets + and/or ACE per the instructions provided in section 5.2, and then + reissue the query as an STD13 legacy label sequence. + + The legacy DNS error responses which will trigger this series of + events are FORMERR and NOTIMPL. 
Any other errors indicate that the + EDNS/UTF-8 extended label was successfully processed but that the + query was not matched, and those errors MUST be returned to the + application. If the fallback processing results in any error + responses whatsoever, then the resolver MUST return those errors + to the calling application. + + Any servers which subsequently receive the fall-back queries and + which are compliant with this specification will process the + queries as internationalized domain names, and will return the + answer data as STD13 octet sequences or ACE encoded data, using + the STD13 legacy label. + + Generally speaking, fall-back processing serves two purposes: + + * Answering the initial query. If a UTF-8 domain name cannot + be resolved because a server in the delegation path does + not understand the EDNS/UTF-8 label type, the resolver can + reissue the query as an ACE encoded legacy label type so + that the query proceeds past the problematic server. + + * Seeding the resolver's cache. As a result of the above, the + resolver will learn about the authoritative name servers + for the target zone, and this information can be used for + any subsequent queries for domain names within the + specified zone (for as long as the data is cached, anyway). + As such, any subsequent EDNS/UTF-8 queries which are issued + for the portion of the namespace served by that zone will + be sent directly to one of those authoritative servers + where they can be answered directly. In this regard, + + Hall I-D Expires: May 2002 [page 46] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + subsequent lookups do not require fall-back processing if + they are received during the cache window. + + Regardless of whether or not fall-back processing has been + performed, if the calling application issued the original query as + an internationalized domain name, then the resolver MUST respond + to the query in that form as well. 
This means that the resolver + MUST convert any STD13 octet sequences or ACE encoded labels into + their canonical UCS characters, convert the answer data into the + resolver's native charset or encoding, and return the data to the + calling process. The resolver MUST NOT perform any normalization + or case-conversion during this process, as such an action can + corrupt domain names which are not used for host identifiers. + + If the original query was received through the resolver's legacy + APIs, then the query MUST be generated and returned in the legacy + format, and MUST NOT be converted to an internationalized domain + name prior to the query or response being passed through. + + Once fall-back processing occurs, the process MUST NOT be repeated + for any additional queries in the current lookup operation. Any + other queries from the current lookup operation MUST NOT be sent + as EDNS/UTF-8 extended labels, since multiple fall-back operations + can result in time-outs on the client systems. + + Because the fall-back process results in two lookups being issued + against the rejecting zone, eliminating the fall-back processing + as soon as possible will be an operational requirement for many + organizations. Any caches or forwarders which are used by stub + resolvers within an end-user network are practically required to + be able to process the EDNS/UTF-8 queries, since those servers + will receive every query which is issued by the stub resolvers. + While this isn't a technical requirement (fall-back processing + will get around the problematic servers), it will likely prove to + be a consideration for network operators looking to support + internationalized domain names on their local networks. 
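As an informal illustration of the fall-back rules in section 7.2.2, the following Python sketch shows one way a resolver might track fall-back state across a single lookup operation. It is a minimal sketch and not a normative implementation. The names Lookup, send, EDNS_UTF8 and LEGACY are hypothetical and do not appear in this specification; the send callable stands in for whatever transport the resolver uses, and is assumed to perform the label encoding described in section 5. The RCODE numbers are those assigned by STD13 (RFC 1035), where "Not Implemented" is value 4.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# RCODE values from STD13 (RFC 1035, section 4.1.1).
NOERROR, FORMERR, NXDOMAIN, NOTIMP = 0, 1, 3, 4

# Only these two errors mean "the server did not understand the label type".
FALLBACK_RCODES = {FORMERR, NOTIMP}

# Hypothetical tags for the two label types; not defined by the draft.
EDNS_UTF8 = "edns-utf8"
LEGACY = "std13-legacy"

# Hypothetical transport hook: send(name, label_type) -> (rcode, answer).
SendFn = Callable[[str, str], Tuple[int, object]]


@dataclass
class Lookup:
    """State for one lookup operation, which may span several queries."""
    send: SendFn
    fell_back: bool = False

    def query(self, name: str) -> Tuple[int, object]:
        # Once fall-back has occurred, no further EDNS/UTF-8 queries are
        # sent for the remainder of this lookup operation.
        if not self.fell_back:
            rcode, answer = self.send(name, EDNS_UTF8)
            if rcode not in FALLBACK_RCODES:
                # Success, or an error unrelated to the label type:
                # hand it back to the caller unchanged.
                return rcode, answer
            self.fell_back = True
        # Whatever the legacy query yields, errors included, is final.
        return self.send(name, LEGACY)
```

A new Lookup object would be created for each lookup operation, so that a later operation is free to try EDNS/UTF-8 extended labels again, possibly against servers learned from the cache.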
+ + This document also strongly encourages the root and TLD servers to + be upgraded as soon as possible (even if they do not intend to + directly provide UTF-8 domain name delegations), in order to allow + those servers to read and process the EDNS/UTF-8 extended labels, + thereby reducing the number of fall-back queries which are sent to + those servers. + + + + Hall I-D Expires: May 2002 [page 47] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + 7.3. The Hosts Database + + Generally speaking, there are two areas of consideration for stub + resolvers that provide local hosts databases for name resolution + services. These are the input requirements for internationalized + domain names which will be added to the hosts database, and the + requirements which govern how queries will be compared to the + entries in the hosts database. + + Note that resolvers are not required to implement a hosts database + or local lookup services (STD3 says "a host MAY also implement a + host name translation mechanism that searches a local Internet + host table"). However, wherever a hosts database is provided with + an internationalized resolver, compliance with the rules specified + in this section is required. + + If a stub resolver offers the capability to compare + internationalized domain names against a local hosts database, + that database MUST be compatible with the internationalized domain + name rules specified in section 4 of this document. + + In particular, the resolver SHOULD allow internationalized domain + names with any code values to be stored, even if the canonical UCS + characters for those values are undefined or are illegal for use + with internationalized host identifiers (this is required to + support domain names which are not host identifiers). 
In those + cases where an internationalized domain name specifies an exact + sequence of octets for binary comparison, the hosts database MUST + provide a mechanism for tagging the eight-bit characters so that + they are not interpreted, processed or compared as the canonical + UCS character equivalents of those codes. + + However, entries which explicitly provide host identifiers MUST be + normalized and case-converted prior to being stored. In order to + satisfy both of these requirements, it is RECOMMENDED that hosts + databases store internationalized host identifiers as untagged + data, but that they also provide some sort of tagging service for + character code values which are to be returned as-is. STD13 + defines an escaping mechanism whereby the decimal value of the + octet is prefaced with a reverse-solidus (such as "\193"), which + is suggested for this usage. + + The storage format of the hosts database MAY use any charset or + encoding the resolver deems most suitable for that platform, as + long as the rules and restrictions provided above are followed. + Since UTF-8 is used as the default encoding throughout this + + Hall I-D Expires: May 2002 [page 48] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + specification, it is RECOMMENDED as the default encoding for hosts + databases as well, although this is not required. + + Not all of the applications which use a resolver are likely to be + compliant with this specification, so resolvers MUST ensure that + they are able to interpret and process any queries from the legacy + APIs which provide the ACE equivalent of an internationalized + domain name that is stored in the hosts database. When such a + query arrives, the domain name MUST be converted to the canonical + UCS character codes represented by the ACE encoded sequence and + compared to entries in the hosts database in that form (tagged + octets excluded). 
Any internationalized domain names which are + required to be returned through the legacy APIs MUST be converted + to STD13 octet sequences and/or ACE before they are returned. + + + 8. Server Guidelines + + When a zone administrator desires to provide internationalized + domain names in a zone, they are presented with two options: they + can add the STD13 octets or ACE encoded internationalized domain + names to an existing zone, or they can use internationalized zone + databases directly. Both of these usage scenarios have their own + benefits and restrictions. + + Using STD13 octet sequences and ACE with legacy servers allows for + the immediate deployment of internationalized domain names on + existing servers, and within hierarchies which include + internationalized domain names. However, any such queries which + originate at applications that are compliant with this + specification will always initially fail, guaranteeing that fall- + back processing will always occur for those zones. + + Conversely, using internationalized zones directly allows servers + to process legacy, ACE and EDNS/UTF-8 queries equally, thereby + providing greater value to the applications and resolvers which + have been made compliant with this specification. However, + internationalized zones have additional requirements (most + notably, they are required to be upgraded simultaneously), and + these will prove burdensome to some zone operators. + + This specification focuses on the processing requirements for + internationalized zones which support the use of internationalized + domain names as explicit data, and which also support the + necessary subordinate mechanisms such as EDNS/UTF-8 queries. When + STD13 octet sequences or ACE encoded domain names are used with + + Hall I-D Expires: May 2002 [page 49] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + legacy servers, the rules defined in STD13 for those servers MUST + be used. 
+ + Note that each zone SHOULD be configurable independently. If a + server hosts multiple zones, each of those zones SHOULD be + operable as an independent entity, with any of them using ACE or + internationalized domain names as necessary. This rule is + necessary since each zone is likely to have different replication + partners and configuration rules which will require different + migration strategies. + + + 8.1. Internationalized Zones + + All domain names which are published by an internationalized zone + MUST be compatible with the restrictions specified in section 4 of + this document. In particular, the zone database MUST allow binary + domain names to be stored as any octet value, but MUST also comply + with the normalization and case-mapping rules when a domain name + represents a host identifier. These restrictions MUST be applied + as part of the process in which the domain name is being added to + the zone database. In those cases where an internationalized + domain name specifies an exact sequence of octets for binary + comparison, the zone database MUST provide a mechanism for + tagging the eight-bit characters so that they are not interpreted, + processed or compared as the canonical UCS character equivalents + of those codes. STD13 defines an escaping mechanism whereby the + decimal value of the octet is prefaced with a reverse-solidus + (such as "\193"), which is suggested for this usage. 
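As an informal illustration of the tagging mechanism suggested above, the following Python sketch separates tagged octets (written as STD13 "\DDD" decimal escapes) from ordinary UCS characters when a label is read from text. It is a sketch under stated assumptions only: the function names parse_label and format_label are hypothetical, the (value, tagged) pair representation is one possible storage choice among many, and the single-character "\X" escape which STD13 also defines is deliberately not handled.

```python
DIGITS = "0123456789"


def parse_label(text: str) -> list:
    """Split label text into (value, tagged) pairs.

    tagged=True marks an octet written as a \\DDD escape, which is to be
    compared as a raw octet. tagged=False marks an ordinary UCS code point.
    """
    items = []
    i = 0
    while i < len(text):
        digits = text[i + 1:i + 4]
        is_escape = (
            text[i] == "\\"
            and len(digits) == 3
            and all(d in DIGITS for d in digits)
        )
        if is_escape:
            value = int(digits)
            if value > 255:
                raise ValueError(f"escape \\{digits} is not a valid octet")
            items.append((value, True))
            i += 4
        else:
            items.append((ord(text[i]), False))
            i += 1
    return items


def format_label(items: list) -> str:
    """Inverse of parse_label: tagged octets become \\DDD escapes again."""
    return "".join(f"\\{v:03d}" if tagged else chr(v) for v, tagged in items)
```

With this representation, the escaped octet "\193" and the UCS character U+00C1 carry the same numeric value but remain distinguishable, so the former can be compared as a binary octet while the latter is treated as a character.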
+ + Servers which are compliant with this specification MUST be + capable of providing UTF-8 and ACE encoded representations of the + UCS domain names which are stored in the zone, and servers MUST + restrict output to only one label type for any protocol operation, + such that queries containing STD13 legacy labels MUST be answered + with STD13 octet sequences and/or ACE encoded domain names, while + EDNS/UTF-8 queries MUST only be answered with UTF-8 encoded domain + names (this not only includes basic operations such as simple + queries, but also includes advanced operations such as zone + transfers; see section 8.2). Similarly, external operations such + as exporting the contents of the zone to a master file (as + discussed in section 8.3) MUST result in a single encoding form + being used for that specific operation. + + Note that the underlying zone database technology which may be + employed by any particular server is beyond the scope of this + + Hall I-D Expires: May 2002 [page 50] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + document. Servers MAY use any database technology, charset or + encoding deemed appropriate for the local environment, although + the contents of the zone MUST be mapped to the canonical UCS + character codes for all comparison operations (octet values + excluded). Since UTF-8 is used as the default encoding throughout + this specification, it is RECOMMENDED for use as the default + encoding with zone databases as well, but is not required. + + Servers MUST NOT normalize or case-map any UCS characters which + are decoded from UTF-8 or ACE encoded labels, and MUST restrict + comparison operations of these labels to precise matches of the + UCS domain names which are stored in the zone database. However, + the seven bit character codes from any labels which are received + as STD13 octet sequences MUST be compared in a case-neutral form, + and MUST NOT be normalized as part of the comparison operation. 
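As an informal illustration of the comparison rules in the preceding paragraph, the following Python sketch shows one possible reading of them. The function names fold_ascii and labels_match and the query_is_std13_octets flag are hypothetical; the sketch assumes that both the query label and the stored label have already been decoded to UCS characters, and it ignores tagged binary octets entirely.

```python
def fold_ascii(label: str) -> str:
    """Lower-case only the letters A-Z; leave every other code unchanged."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c
        for c in label
    )


def labels_match(query: str, stored: str, query_is_std13_octets: bool) -> bool:
    """Compare a decoded query label against a stored UCS label.

    Labels which arrived as STD13 octet sequences are compared with the
    seven-bit letters in case-neutral form. Labels decoded from UTF-8 or
    ACE must match precisely. No normalization is applied in either case.
    """
    if query_is_std13_octets:
        return fold_ascii(query) == fold_ascii(stored)
    return query == stored
```

Note that a precomposed character and its decomposed equivalent do not match under either branch, which is consistent with the requirement that normalization happen before a name is stored or queried rather than during comparison.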
+ + When a zone is converted to support internationalized domain + names, all of the servers which replicate that zone MUST be + upgraded. This is required due to ambiguities that can occur with + labels which may be encoded as either STD13 octet sequences or ACE + data, and where the label only uses character codes from the + eight-bit range of character codes (this problem is described in + detail in section 4.1.2). In order to ensure that all of the + servers for a zone respond to one of those queries correctly, all + of the servers which replicate the zone MUST fully support this + document and its requirements. + + + 8.2. Namespace Visibility Restrictions + + In all cases, the encoding format of the domain names which are + returned in response to a query MUST be the same as the encoding + format which was used by the query. If the query was provided as a + sequence of legacy labels, then all of the domain names which are + provided in the response message MUST be provided as legacy labels + (containing either ACE or STD13 octet encoded values). + + Similarly, if a query is provided as EDNS/UTF-8 encoded data, all + domain names which are provided in the response message MUST be + provided as UTF-8 encoded data in EDNS/UTF-8 extended labels. In + some situations, this process may require the server to perform an + extra conversion. + + For example, assume that the .example.com. domain name has + two associated MX resource records, one of which points to the UCS + domain name of mail..example.com, while the other points to + + Hall I-D Expires: May 2002 [page 51] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + the ACE encoded domain name of mail..example.net. (where the + "" label is the ACE equivalent of an internationalized sub- + domain in the example.net. zone). If a UTF-8 query arrives for the + MX resource records associated with the .example.com. domain + name, both resource records MUST be returned as EDNS/UTF-8 data. 
+ In order for this requirement to be satisfied, the server will + have to decode the label to its UCS canonical form for zone + storage purposes, and encode the domain name as UTF-8 for + transmission whenever an EDNS/UTF-8 answer set is required. + + The visibility rules specified in this section are mandatory for + every domain name which is provided in any message. If a system + requests a zone transfer and uses the EDNS/UTF-8 extended label + type in the request, all of the domain names in all of the + messages which are sent as part of the zone transfer MUST be + provided in their UTF-8 encoded form. Similarly, if a zone + transfer is requested and uses the legacy label type, then all of + the domain names from all of the messages which are sent as part + of the zone transfer MUST be provided as either STD13 octet + sequences or ACE encoded data, using the legacy label type. + + + 8.3. The Master File Format + + STD13 specifies a "master file" format which is used as a + platform-neutral storage and transfer format for importing and + exporting the contents of a particular zone. Note that the master + file is not the same as the operating database for a zone; the + master file format is used (or is useful) for copying a zone to + another server, storing a copy of the zone database off-line, + emailing a copy of the zone to another user or system, and + performing other off-line actions against the database's contents. + Once a zone is loaded on a server, however, any database + technology can be used for managing the zones and generating + response messages. + + In order to facilitate the continued use of master files, any zone + which is compliant with this specification MUST support the use of + UTF-8 as an import and export encoding format for the master file + associated with that zone. 
+ + Furthermore, compliant versions of a master file are required to + have the "$UTF-8" control literal at the beginning of the first + line of text in the master file if it contains UTF-8 encoded data. + Master files from zones which do not contain UTF-8 encoded domain + + Hall I-D Expires: May 2002 [page 52] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + names MUST NOT contain the "$UTF-8" control literal in the first + print position of any line. + + If the master file contains the "$UTF-8" control literal, all of + the data within the master file MUST be encoded in UTF-8 as + specified by RFC2279, and SHOULD be managed with UTF-8 compliant + tools (such as UTF-8 text editors, mailers that support UTF-8 MIME + encodings, and so forth). + + + 9. Caching Guidelines + + Whenever an internationalized domain name is stored in a cache, it + MUST be stored in its canonical UCS character code form, + regardless of whether the domain name was received as STD13 octet + encoding sequences, UTF-8, or ACE data. Caches MUST NOT normalize + or case convert any domain names that they store, as such a + process could invalidate domain names that are not used for host + identifiers. + + Any subsequent queries which are processed through the cache MUST + be compared against the stored UCS characters. Internationalized + domain name labels which are decoded from UTF-8 or ACE labels MUST + NOT be normalized or case-converted as part of the comparison + operation, although labels which are provided as STD13 octet + sequences MUST be compared as case-neutral octet values. + + Caches MUST be capable of providing UTF-8 and ACE encoded + representations of the UCS domain names which are stored in the + cache, with the appropriate format determined by the format used + in the corresponding query. 
However, answer data MUST be + restricted to only one encoding form for any protocol operation, + meaning that queries containing legacy labels MUST only be + answered with STD13 octet sequences and/or ACE encoded labels, + while UTF-8 queries MUST only be answered with UTF-8 encoded + domain names. + + + 10. Security Considerations + + This document defines an extension to the domain name system, and + as such, it inherits the weaknesses which already exist in DNS. + Where possible, this specification strengthens DNS with multiple + checks. For example, this specification requires that domain names + be validated three times before they are used by applications: + once on specification, once on entry at the authoritative zone or + + Hall I-D Expires: May 2002 [page 53] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + hosts database, and once again when the answer data is received by + the requesting application. Despite these checks, the root + weaknesses inherent in DNS are still present. + + This document uses multiple encoding algorithms, although boundary + conditions from the existing DNS are preserved for both the source + and encoded representations. + + + 11. IANA Considerations + + This document requires the use of an EDNS extended label type + identification code. This document uses the b000011 ELT code. + + + 12. 
References + + [AMC-ACE-Z] , "AMC-ACE-Z version + 0.3.1" + + [NAMEPREP] , "Preparation of + Internationalized Host Names" + + [RFC2119] "Key words for use in RFCs to Indicate Requirement + Levels" + + [RFC952] "DoD Internet host table specification" + + [STD13] (RFC 1034) "Domain names - concepts and facilities", + (RFC 1035) "Domain names - implementation and + specification" + + [STD3] (RFC 1122) "Requirements for Internet Hosts -- + Communication Layers", (RFC1123) "Requirements for Internet + Hosts -- Application and Support" + + [BCP18] (RFC 2277) "IETF Policy on Character Sets and + Languages" + + [RFC2279] "UTF-8, a transformation format of ISO 10646" + + [RFC2671] "Extension Mechanisms for DNS (EDNS0)" + + [ASCII] "ANSI X3.4-1968. USA Standard Code for Information + Interchange" + + + Hall I-D Expires: May 2002 [page 54] + INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001 + + + [ISO10646] "ISO/IEC 10646-1:2000. International Standard -- + Information technology -- Universal Multiple-Octet Coded + Character Set (UCS) -- Part 1: Architecture and Basic + Multilingual Plane" + + + 13. Acknowledgements + + This document is an assembly of multiple ideas and proposals which + have been made on the IDN working group mailing list. Many of the + ideas presented here have been proposed by multiple parties in one + form or another, although Dan Oscarsson is credited for proposing + a dual-mode operation which is capable of simultaneously + supporting UTF-8 and legacy mode encodings. Other contributors to + key elements from this specification (some of them unknowingly or + unwillingly) include (alphabetically) Marc Blanchett, Adam + Costello, Mark Davis, Martin Duerst, Patrik Faltstrom, Paul + Hoffman, David Hopwood, and many others. + + + 14. Editor's Address + + Eric A. Hall + ehall@ehsco.com + + + + + Hall I-D Expires: May 2002 [page 55]