diff --git a/doc/draft/draft-ietf-idn-amc-ace-m-00.txt b/doc/draft/draft-ietf-idn-amc-ace-m-00.txt new file mode 100644 index 0000000000..a2401e5d8d --- /dev/null +++ b/doc/draft/draft-ietf-idn-amc-ace-m-00.txt @@ -0,0 +1,1741 @@ +INTERNET-DRAFT Adam M. Costello +draft-ietf-idn-amc-ace-m-00.txt 2001-Feb-12 +Expires 2001-Aug-14 + + AMC-ACE-M version 0.1.0 + +Status of this Memo + + This document is an Internet-Draft and is in full conformance with + all provisions of Section 10 of RFC2026. + + Internet-Drafts are working documents of the Internet Engineering + Task Force (IETF), its areas, and its working groups. Note + that other groups may also distribute working documents as + Internet-Drafts. + + Internet-Drafts are draft documents valid for a maximum of six + months and may be updated, replaced, or obsoleted by other documents + at any time. It is inappropriate to use Internet-Drafts as + reference material or to cite them other than as "work in progress." + + The list of current Internet-Drafts can be accessed at + http://www.ietf.org/ietf/1id-abstracts.txt + + The list of Internet-Draft Shadow Directories can be accessed at + http://www.ietf.org/shadow.html + + Distribution of this document is unlimited. Please send comments + to the author at amc@cs.berkeley.edu, or to the idn working + group at idn@ops.ietf.org. A non-paginated (and possibly + newer) version of this specification may be available at + http://www.cs.berkeley.edu/~amc/charset/amc-ace-m + +Abstract + + AMC-ACE-M is a reversible map from a sequence of Unicode [UNICODE] + characters to a sequence of letters (A-Z, a-z), digits (0-9), and + hyphen-minus (-), henceforth called LDH characters. Such a map + (called an "ASCII-Compatible Encoding", or ACE) might be useful for + internationalized domain names [IDN], because host name labels are + currently restricted to LDH characters by [RFC952] and [RFC1123]. 
+ + AMC-ACE-M is a cross between BRACE [BRACE00] (which is efficient + but complex) and DUDE [DUDE00] (which is simple and provides case + preservation). AMC-ACE-M is much simpler than BRACE but similarly + efficient, and provides case preservation like DUDE. + + Besides domain names, there might also be other contexts where it is + useful to transform Unicode characters into "safe" (delimiter-free) + ASCII characters. (If other contexts consider hyphen-minus to be + unsafe, a different character could be used to play its role, like + underscore.) + +Contents + + Features + Name + Overview + Base-32 characters + Encoding procedure + Decoding procedure + Signature + Case sensitivity models + Comparison with RACE, BRACE, LACE, and DUDE + Example strings + Security considerations + References + Author + Example implementation + +Features + + Uniqueness: Every Unicode string maps to at most one LDH string. + + Completeness: Every Unicode string maps to an LDH string. + Restrictions on which Unicode strings are allowed, and on length, + may be imposed by higher layers. + + Efficient encoding: The ratio of encoded size to original size is + small for all Unicode strings. This is important in the context + of domain names because [RFC1034] restricts the length of a domain + label to 63 characters. + + Simplicity: The encoding and decoding algorithms are reasonably + simple to implement. The goals of efficiency and simplicity are at + odds; AMC-ACE-M aims at a good balance between them. + + Case-preservation: If the Unicode string has been case-folded prior + to encoding, it is possible to record the case information in the + case of the letters in the encoding, allowing a mixed-case Unicode + string to be recovered if desired, but a case-insensitive comparison + of two encoded strings is equivalent to a case-insensitive + comparison of the Unicode strings. This feature is optional; see + section "Case sensitivity models". 
+ + Readability: The letters A-Z and a-z and the digits 0-9 appearing + in the Unicode string are represented as themselves in the label. + This comes for free because it is usually the most efficient encoding + anyway. + +Name + + AMC-ACE-M is a working name that should be changed if it is adopted. + (The M merely indicates that it is the thirteenth ACE devised by + this author. BRACE was the third. D through L did not deliver + enough efficiency to justify their complexity.) Rather than waste + good names on experimental proposals, let's wait until one proposal + is chosen, then assign it a good name. Suggestions (assuming the + primary use is in domain names): + + UniHost + UTF-A ("A" for "ASCII" or "alphanumeric", + but unfortunately UTF-A sounds like UTF-8) + UTF-H ("H" for "host names", + but unfortunately UTF-H sounds like UTF-8) + UTF-D ("D" for "domain names") + NUDE (Normal Unicode Domain Encoding) + +Overview + + AMC-ACE-M maps characters to characters--it does not consume or + produce code points, code units, or bytes, although the algorithm + makes use of code points, and implementations will of course need to + represent the input and output characters somehow, usually as bytes + or other code units. + + Each character in the Unicode string is represented by an + integral number of characters in the encoded string. There is no + intermediate bit string or octet string. + + The encoded string alternates between two modes: literal mode and + base-32 mode. LDH characters in the Unicode string are encoded + literally, except that hyphen-minus is doubled. Non-LDH characters + in the Unicode string are encoded using base-32, in which each + character of the encoded string represents five bits (a "quintet"). + A non-paired hyphen-minus in the encoded string indicates a mode + change. 
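As an illustration of the mode alternation described above, the following Python sketch splits an encoded string into base-32 runs, literal runs, and doubled hyphen-minuses. The function name split_modes and the token labels are hypothetical and not part of the specification. The sketch assumes the string starts in base-32 mode (where the encoded parameters appear), and it neither decodes the base-32 codes nor validates its input.

```python
def split_modes(encoded):
    """Split an encoded string into (kind, text) tokens.

    kind is "base32", "literal", or "hyphen" (a doubled hyphen-minus,
    which stands for one hyphen-minus of the Unicode string).
    Hypothetical helper for illustration only.
    """
    tokens = []
    mode = "base32"          # assumption: the string starts in base-32 mode
    run = ""
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "-" and encoded[i + 1:i + 2] == "-":
            # Paired hyphen-minus: an encoded hyphen-minus, no mode change.
            if run:
                tokens.append((mode, run))
                run = ""
            tokens.append(("hyphen", "-"))
            i += 2
        elif ch == "-":
            # Non-paired hyphen-minus: toggle between the two modes.
            if run:
                tokens.append((mode, run))
                run = ""
            mode = "literal" if mode == "base32" else "base32"
            i += 1
        else:
            run += ch
            i += 1
    if run:
        tokens.append((mode, run))
    return tokens
```

Applied to "bsm-Maji-r-Koi-b2m-5-z37cxuwp", the sketch yields the base-32 run "bsm", the literal run "Maji", the base-32 run "r", the literal run "Koi", the base-32 run "b2m", the literal run "5", and the base-32 run "z37cxuwp".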
+ + In base-32 mode a group of one to five quintets is used to + represent a number, which is added to an offset to yield a + Unicode code point, which in turn represents a Unicode character. + (Surrogates, which are code units used by UTF-16 in pairs to + refer to code points, are not used and not allowed in AMC-ACE-M.) + Similarities between the code points are exploited to make the + encoding more compact. + +Base-32 characters + + "a" = 0 = 0x00 = 00000 "s" = 16 = 0x10 = 10000 + "b" = 1 = 0x01 = 00001 "t" = 17 = 0x11 = 10001 + "c" = 2 = 0x02 = 00010 "u" = 18 = 0x12 = 10010 + "d" = 3 = 0x03 = 00011 "v" = 19 = 0x13 = 10011 + "e" = 4 = 0x04 = 00100 "w" = 20 = 0x14 = 10100 + "f" = 5 = 0x05 = 00101 "x" = 21 = 0x15 = 10101 + "g" = 6 = 0x06 = 00110 "y" = 22 = 0x16 = 10110 + "h" = 7 = 0x07 = 00111 "z" = 23 = 0x17 = 10111 + "i" = 8 = 0x08 = 01000 "2" = 24 = 0x18 = 11000 + "j" = 9 = 0x09 = 01001 "3" = 25 = 0x19 = 11001 + "k" = 10 = 0x0A = 01010 "4" = 26 = 0x1A = 11010 + "m" = 11 = 0x0B = 01011 "5" = 27 = 0x1B = 11011 + "n" = 12 = 0x0C = 01100 "6" = 28 = 0x1C = 11100 + "p" = 13 = 0x0D = 01101 "7" = 29 = 0x1D = 11101 + "q" = 14 = 0x0E = 01110 "8" = 30 = 0x1E = 11110 + "r" = 15 = 0x0F = 01111 "9" = 31 = 0x1F = 11111 + + The digits "0" and "1" and the letters "o" and "l" are not used, to + avoid transcription errors. + + All decoders must recognize both the uppercase and lowercase + forms of the base-32 characters. The case may or may not convey + information, as described in section "Case sensitivity models". + +Encoding procedure + + The encoder first examines the Unicode string and chooses some + parameters. It writes these parameters into the output string, then + proceeds to encode each Unicode character, one at a time. The exact + sequence of steps is given below. All ordering of bits and quintets + is big-endian (most significant first). The >> and << operators + used below mean bit shift, as in C. 
For >> there is no question of + logical versus arithmetic shift because AMC-ACE-M makes no use of + negative numbers. + + 0) Determine the Unicode code point for each non-LDH character in + the Unicode string. Since LDH characters are encoded literally, + their code points are not needed. Depending on how the Unicode + string is presented to the encoder, this step may be a no-op. + + 1) Verify that there are no invalid code points in the input; + that is, none exceed 0x10FFFF (the highest code point in the + Unicode code space) and none are in the range D800..DFFF + (surrogates). + + 2) Determine the most populous row: Row n is defined as the 256 + code points starting with n << 8, except that this definition + would make rows D8..DF useless, because they would contain only + surrogates. Therefore AMC-ACE-M defines rows D8..DF to be the + following non-aligned blocks of 256 code points: + + row D8 = 0020..011F + row D9 = 005B..015A + row DA = 007B..017A + row DB = 00A0..019F + row DC = 00C0..01BF + row DD = 00DF..01DE + row DE = 0134..0233 + row DF = 0270..036F + + (Rationale: Whereas almost every small script is confined to + a single row, the Latin script is split across a few rows, + and the row boundaries are not especially convenient for many + languages.) + + Determine the row containing the most non-LDH input code points, + breaking ties in favor of smaller-numbered rows. (If a code + point appears multiple times in the input, it counts multiple + times. This applies to steps 3 and 4 also.) Call it row B. + Let offsetB be the first code point of row B. + + 3) Determine the most populous 16-window: For each n in 0..31 let + offset = ((offsetB >> 3) + n) << 3 and count the number of code + points in the range offset through offset + 0xF. Let A be the + value of n that maximizes this count, breaking ties in favor + of smaller values of n, and let offsetA be the corresponding + offset. 
+ + 4) Determine the most populous 20k-window: If the input is empty, + then let C = 0. Otherwise, for each input code point, let n = + code_point >> 11, and count the number of non-LDH input code + points that are not in row B and are in the range (n << 11) + through (n << 11) + 0x4FFF. Determine the value of n that + maximizes the count, breaking ties in favor of smaller values of + n, and let C be that value. + + 5) Choose a style: One of the base-32 codes used in step 7.3 has + two variants, and so base-32 mode is subdivided into two styles, + narrow and wide, depending on which variant is used. Compute + the total number of base-32 characters that would be produced + if narrow style were used, and the number if wide style were + used. The easiest way to do this is to mimic the logic of steps + 6 and 7.3. Use whichever style would produce fewer base-32 + characters. In case of a tie, use narrow style. + + 6) Encode the parameters. If narrow style is used, then let + offsetC = (offsetB >> 12) << 12, and encode B and A as three or + four base-32 characters: + + 00bbb bbbbb aaaaa if B <= 0xFF + 01bbb bbbbb bbbbb aaaaa otherwise + + If wide style is used, then let offsetC = C << 11, and encode B + and C as three or five base-32 characters: + + 10bbb bbbbb ccccc if B <= 0xFF and C <= 0x1F + 11bbb bbbbb bbbbb ccccc ccccc otherwise + + 7) Encode each input character in turn, using the first of the + following cases that applies. The mode is initially base-32. + + 7.1) The character is a hyphen-minus (U+002D). Encode it as + two hyphen-minuses. + + 7.2) The character is an LDH character. If in base-32 mode + then output a hyphen-minus and switch to literal mode. + Copy the character to the output. + + 7.3) The character is a non-LDH character. If in literal + mode then output a hyphen-minus and switch to base-32 + mode. Encode the character's code point using the + first of the following cases that applies. 
Square + brackets enclose quintets that can be used to record + the upper/lowercase attribute of the Unicode character + (because the corresponding base-32 characters are + guaranteed to be letters rather than digits) (see section + "Case sensitivity models"). + + 7.3.1) Narrow style was chosen and the code point is in + the range offsetA through offsetA + 0xF. Subtract + offsetA and encode the difference as a single + base-32 character: + + [0xxxx] + + 7.3.2) The code point is in the range offsetB through + offsetB + 0xFF. Subtract offsetB and encode the + difference as two base-32 characters: + + 1xxxx [0xxxx] + + 7.3.3) The code point is in the range offsetC through + offsetC + 0xFFF. Subtract offsetC and encode the + difference as three base-32 characters: + + 1xxxx 1xxxx [0xxxx] + + 7.3.4) Wide style was chosen and the code point is in + the range offsetC + 0x1000 through offsetC + + 0x4FFF. Subtract offsetC + 0x1000 and encode the + difference as three base-32 characters: + + [0xxxx] xxxxx xxxxx + + 7.3.5) The code point is in the range 0 through 0xFFFF. + Encode it as four base-32 characters: + + 1xxxx 1xxxx 1xxxx [0xxxx] + + 7.3.6) If we've come this far, the code point must be + in the range 0x10000 through 0x10FFFF. Subtract + 0x10000 and encode the difference as five base-32 + characters: + + 1xxxx 1xxxx 1xxxx 1xxxx [0xxxx] + +Decoding procedure + + The details of the decoding procedure are implied by the encoding + procedure. The overall sequence of steps is as follows. + + 1) Undo the encoder's step 6: From the first few base-32 + characters, determine whether narrow or wide style is used, and + determine the offsets. + + 2) Set the mode to base-32. For each remaining input character, use + the first of the following cases that applies: + + 2.1) The character is a hyphen-minus, and the following + character is also a hyphen-minus. Consume them both and + output a hyphen-minus. + + 2.2) The character is a hyphen-minus. 
Consume it and toggle + the mode flag. + + 2.3) The current mode is literal. Consume the input character + and output it. + + 2.4) Interpret the input character and up to four of its + successors as base-32. Consume characters until one is + found whose value has the form 0xxxx. That is the one + that carries the upper/lowercase information. Remember + the length of the code. If the length is one and wide + style is being used, consume two more characters. + Decode the base-32 characters into an integer, add the + appropriate offset (which depends on the remembered code + length), and output the Unicode character corresponding to + the resulting code point. + + If the case-flexible or case-preserving model is being + used (see section "Case sensitivity models"), the decoder + must either perform the case conversion as it is decoding, + or construct a separate record of the case information to + accompany the output string. + + 3) Before returning the output (be it a string or a string plus + case information), the decoder must invoke the encoder on it, + and compare the result to the input string. The comparison + must be case-sensitive if the case-sensitive or case-flexible + model is being used, case-insensitive if the case-insensitive + or case-preserving model is being used. If the two strings do + not match, it is an error. This check is necessary to guarantee + the uniqueness property (there cannot be two distinct encoded + strings representing the same Unicode string). + + If the decoder at any time encounters an unexpected character, or + unexpected end of input, then the input is invalid. + +Signature + + The issue of how to distinguish ACE strings from unencoded strings + is largely orthogonal to the encoding scheme itself, and is + therefore not specified here. In the context of domain name labels, + a standard prefix and/or suffix (chosen to be unlikely to occur + naturally) would presumably be attached to ACE labels. 
(In that + case, it would probably be good to forbid the encoding of Unicode + strings that appear to match the signature, to avoid confusing + humans about whether they are looking at a Unicode string or an ACE + string.) + + In order to use AMC-ACE-M in domain names, the choice of signature + must be mindful of the requirement in [RFC952] that labels never + begin or end with hyphen-minus. The raw encoded string will never + begin with a hyphen-minus, and will end with a hyphen-minus iff the + Unicode string ends with a hyphen-minus. The easiest solution is + to use a suffix as the signature. Alternatively, if the Unicode + strings were forbidden from ending with a hyphen-minus, a prefix + could be used. + + It appears that "---" is extremely rare in domain names; among the + four-character prefixes of all the second-level domains under .com, + .net, and .org, "---" never appears at all. Therefore, perhaps the + signature should be of the form ?--- (prefix) or ---? (suffix), + where ? could be "u" for Unicode, or "i" for internationalized, or + "a" for ACE, or maybe "q" or "z" because they are rare. + +Case sensitivity models + + The higher layer must choose one of the following four models. + + Models suitable for domain names: + + * Case-insensitive: Before a string is encoded, all its non-LDH + characters must be case-folded so that any strings differing + only in case become the same string (for example, strings could + be forced to lowercase). Folding LDH characters is optional. + The case of base-32 characters and literal-mode characters is + arbitrary and not significant. Comparisons between encoded + strings must be case-insensitive. The original case of non-LDH + characters cannot be recovered from the encoded string. + + * Case-preserving: The case of the Unicode characters is not + considered significant, but it can be preserved and recovered, + just like in non-internationalized host names. 
Before a string + is encoded, all its non-LDH characters must be case-folded + as in the previous model. LDH characters are naturally able + to retain their case attributes because they are encoded + literally. The case attribute of a non-LDH character is + recorded in one of the base-32 characters that represent + it (section "Encoding procedure" tells which one). If the + base-32 character is uppercase, it means the Unicode character + is caseless or should be forced to uppercase after being + decoded (which is a no-op if the case folding already forces + to uppercase). If the base-32 character is lowercase, it + means the Unicode character is caseless or should be forced to + lowercase after being decoded (which is a no-op if the case + folding already forces to lowercase). The case of the other + base-32 characters in a multi-quintet encoding is arbitrary + and not significant. Only uppercase and lowercase attributes + can be recorded, not titlecase. Comparisons between encoded + strings must be case-insensitive, and are equivalent to + case-insensitive comparisons between the Unicode strings. The + intended mixed-case Unicode string can be recovered as long as + the encoded characters are unaltered, but altering the case of + the encoded characters is not harmful--it merely alters the case + of the Unicode characters, and such a change is not considered + significant. + + In this model, the input to the encoder and the output of the + decoder can be the unfolded Unicode string (in which case the + encoder and decoder are responsible for performing the case + folding and recovery), or can be the folded Unicode string + accompanied by separate case information (in which case the + higher layer is responsible for performing the case folding and + recovery). Whichever layer performs the case recovery must + first verify that the Unicode string is properly folded, to + guarantee the uniqueness of the encoding. 
+ + It is easy to extend the nameprep algorithm [NAMEPREP02] to + remember case information. It merely requires an additional + bit to be associated with each output code point in the mapping + table. + + The case-insensitive and case-preserving models are interoperable. + If a domain name passes from a case-preserving entity to a + case-insensitive entity, the case information will be lost, but + the domain name will still be equivalent. This phenomenon already + occurs with non-internationalized domain names. + + Models unsuitable for domain names, but possibly useful in other + contexts: + + * Case-sensitive: Unicode strings may contain both uppercase and + lowercase characters, which are not folded. Base-32 characters + must be lowercase. Comparisons between encoded strings must be + case-sensitive. + + * Case-flexible: Like case-preserving, except that the choice + of whether the case of the Unicode characters is considered + significant is deferred. Therefore, base-32 characters must + be lowercase, except for those used to indicate uppercase + Unicode characters. Comparisons between encoded strings may be + case-sensitive or case-insensitive, and such comparisons are + equivalent to the corresponding comparisons between the Unicode + strings. + +Comparison with RACE, BRACE, LACE, and DUDE + + In this section we compare AMC-ACE-M and four other ACEs: RACE + [RACE03], BRACE [BRACE00], LACE [LACE01], and Extended DUDE + [DUDE00]. We do not include SACE [SACE], UTF-5 [UTF5], or UTF-6 + [UTF6] in the comparison, because SACE appears obviously too + complex, UTF-5 appears obviously too inefficient, and UTF-6 can + never be more efficient than its similarly simple successor, DUDE. 
+ + Case preservation support: + + DUDE, AMC-ACE-M: all characters + BRACE: only the letters A-Z, a-z + RACE, LACE: none + + RACE, BRACE, and LACE transform the Unicode string to an + intermediate bit string, then into a base-32 string, so there is no + particular alignment between the base-32 characters and the Unicode + characters. DUDE and AMC-ACE-M do not have this intermediate stage, + and enforce alignment between the base-32 characters and the Unicode + characters, which facilitates the case preservation. + + Complexity is hard to measure. This author would subjectively + describe the complexity of the algorithms as: + + RACE, LACE, DUDE: fairly simple but not trivial + AMC-ACE-M: moderate + BRACE: complex + + The complexity of AMC-ACE-M is in the number of rules, but the + individual rules are not very complex, and they are generally + non-interacting. + + The relative efficiency of the various algorithms is suggested + by the sizes of the encodings in section "Example strings". For + each ACE there is a graph below showing a horizontal bar for + each example string, representing the ACE length divided by the + minimum length among all the ACEs for that example string (so the + ratio is at least 1). Example R is excluded because it violates + nameprep [NAMEPREP02]. The other example strings all use different + languages, except that there are several Japanese examples. To + avoid skewing the results, each graph collapses all the Japanese + ratios into a single bar representing the median ratio. A ratio r + is represented by a bar of length r/0.04 characters. Since the bar + will always be at least 1/0.04 = 25 characters long, we show the + first 25 characters as "O" and the rest as "@". The bars are sorted + so that the graph looks like a cumulative distribution. Each bar + is labeled with the language of the corresponding example string. + (The difference between the Chinese and Taiwanese strings is that + the former uses simplified characters.) 
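The bar construction described above can be sketched in Python as follows. The function name ratio_bar is hypothetical, and rounding the bar length to the nearest integer is an assumption, since the text does not say how fractional lengths are handled.

```python
def ratio_bar(ace_length, min_length):
    """Render one graph bar for an ACE encoding of a given length.

    ace_length: length of this ACE's encoding of the example string.
    min_length: shortest encoding of the same string among all ACEs
                (so ace_length >= min_length and the ratio is >= 1).
    Hypothetical helper; rounding to nearest is an assumption.
    """
    # A ratio r maps to r / 0.04 characters, which equals 25 * r.
    width = round(25 * ace_length / min_length)
    # The first 25 characters are "O"; any excess is drawn as "@".
    return "O" * 25 + "@" * (width - 25)
```

For the Arabic example, the AMC-M string is 28 characters long and the shortest ACE (DUDE) is 25, so the sketch produces 25 "O" characters followed by three "@" characters, in agreement with the AMC-ACE-M graph.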
+ + RACE: + Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@ + Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@@ + Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@@@@ + Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@ + Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@ + Russian OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@ + Japanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@ + Spanish OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@ + Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@ + Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@ + Czech OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@@@@@@@@ + + LACE: + Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@@ + Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@@ + Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@ + Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@ + Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@ + Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@ + Japanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@ + Russian OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@ + Spanish OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@ + Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@ + Czech OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@@@@@@@@@@@ + + DUDE: + Russian OOOOOOOOOOOOOOOOOOOOOOOOO + Arabic OOOOOOOOOOOOOOOOOOOOOOOOO + Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@ + Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@ + Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@ + Japanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@ + Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@ + Spanish OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@ + Czech OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@ + Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@ + Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@@@@ + + AMC-ACE-M: + Czech OOOOOOOOOOOOOOOOOOOOOOOOO + Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO + Japanese OOOOOOOOOOOOOOOOOOOOOOOOO + Korean OOOOOOOOOOOOOOOOOOOOOOOOO + Russian OOOOOOOOOOOOOOOOOOOOOOOOO + Spanish OOOOOOOOOOOOOOOOOOOOOOOOO + Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO + Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO + Chinese OOOOOOOOOOOOOOOOOOOOOOOOO@ + Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@@@ + Hindi OOOOOOOOOOOOOOOOOOOOOOOOO@@@@@ + + BRACE: + Chinese OOOOOOOOOOOOOOOOOOOOOOOOO + Hindi OOOOOOOOOOOOOOOOOOOOOOOOO + Japanese OOOOOOOOOOOOOOOOOOOOOOOOO + Spanish 
OOOOOOOOOOOOOOOOOOOOOOOOO + Taiwanese OOOOOOOOOOOOOOOOOOOOOOOOO + Arabic OOOOOOOOOOOOOOOOOOOOOOOOO@ + Czech OOOOOOOOOOOOOOOOOOOOOOOOO@ + Vietnamese OOOOOOOOOOOOOOOOOOOOOOOOO@ + Hebrew OOOOOOOOOOOOOOOOOOOOOOOOO@@ + Korean OOOOOOOOOOOOOOOOOOOOOOOOO@@ + Russian OOOOOOOOOOOOOOOOOOOOOOOOO@@@ + + These results suggest that DUDE is preferable to RACE and LACE, + because it has similar simplicity, better support for case + preservation, and is somewhat more efficient. + + The results also suggest that AMC-ACE-M is preferable to BRACE, + because it has similar efficiency, better support for case + preservation, and is simpler. + + DUDE and AMC-ACE-M have equal support for case preservation, but + AMC-ACE-M offers significantly better efficiency, at the cost of + significantly greater complexity, so choosing between them entails a + value judgement. + +Example strings + + In the ACE encodings below, signatures (like "bq--" for RACE) are + not shown. Non-LDH characters in the Unicode string are forced to + lowercase before being encoded using BRACE, RACE, and LACE. For + RACE and LACE, the letters A-Z are likewise forced to lowercase. + UTF-8 and UTF-16 are included for length comparisons, with non-ASCII + bytes shown as "?". AMC-ACE-M is abbreviated AMC-M. Backslashes + show where line breaks have been inserted in ACE strings too long + for one line. The RACE and LACE encodings are courtesy of Mark + Davis's online UTF converter [UTFCONV] (slightly modified to remove + the length restrictions). + + The first several examples are all names of Japanese music artists, + song titles, and TV programs, just because the author happens to + have them handy (but Japanese is useful for providing examples + of single-row text, two-row text, ideographic text, and various + mixtures thereof). + + (A) 3B (Japanese TV program title) + + = U+5E74 (kanji) + = U+7D44 (kanji) + = U+91D1 U+516B U+5148 U+751F (kanji) + + UTF-16: ???????????????? + UTF-8: 3???B??????????????? 
+ AMC-M: utk-3-8ze-B-hkenqtymwifi9 + BRACE: u-3-ygj-b-ynb6gjc7pp4k5p5w + DUDE: j3le74G062nd44p1d1l16bk8n51f + RACE: 3aadgxtuabrh2rer2fiwwukioupq + LACE: 74adgxtuabrh2rer2fiwwukioupq + + (B) -with-SUPER-MONKEYS (Japanese music group name) + + = U+5B89 U+5BA4 U+5948 U+7F8E U+6075 (kanji) + + UTF-8: ??????????????????-with-SUPER-MONKEYS + AMC-M: u5m2j4etwif6q2zf---with--SUPER--MONKEYS + BRACE: uvj7fuaqcahy982xa---with--SUPER--MONKEYS + DUDE: lb89q4p48nf8em075-g077m9n4m8-N3LGM5N2-MdVURLN9J + UTF-16: ???????????????????????????????????????????????? + LACE: ajnytjablfeac74oafqhkeyafv3qm5difvzxk4dfoiww233onnsxs4y + RACE: 3bnysw5elfeh7dtaouac2adxabuqa5aanaac2adtab2qa4aamuaheab\ + nabwqa3yanyagwadfab4qa4y + + (C) Hello-Another-Way- (Japanese song title) + + = U+305D U+308C U+305E U+308C U+306E (hiragana) + = U+5834 U+6240 (kanji) + + UTF-8: Hello-Another-Way-????????????????????? + BRACE: ji7-Hello--Another--Way---v3jhaefvd2ufj62 + AMC-M: bsk-Hello--Another--Way---p2nq2nyqx2veyuwa + DUDE: M8lssv-Huvn4m8ln2-Nm1n9-j05docleocmel834m240 + UTF-16: ?????????????????????????????????????????????????? + LACE: ciagqzlmnrxs2ylon52gqzlsfv3wc6jnauyf3dc6rrxacwbuafrea + RACE: 3aagqadfabwaa3aan4ac2adbabxaa3yaoqagqadfabzaaliao4agcad\ + zaawtaxjqrqyf4memgbxfqndcia + + (D) 2 (Japanese TV program title) + + = U+3072 U+3068 U+3064 (hiragana) + = U+5C4B U+6839 (kanji) + = U+306E (hiragana) + = U+4E0B (kanji) + + UTF-16: ???????????????? + UTF-8: ?????????????????????2 + AMC-M: bsnzciex6wmy2vjqw8sm-2 + BRACE: ji96u56uwbhf2wqxnw4s-2 + DUDE: j072m8klc4bm839j06eke0bg032 + RACE: 3ayhemdigbsfys3iheyg4tqlaaza + LACE: 74yhemdigbsfys3iheyg4tqlaaza + + (E) MajiKoi5 (Japanese song title) + + = U+3067 (hiragana) + = U+3059 U+308B (hiragana) + = U+79D2 U+524D (kanji) + + UTF-8: Maji???Koi??????5?????? + UTF-16: ?????????????????????????? 
+ + AMC-M: bsm-Maji-r-Koi-b2m-5-z37cxuwp + BRACE: ji8-Maji-g-Koi-qe7x-5-wx7p6ma + DUDE: Mdhqpj067G06bvpj059obg035n9d2l24d + RACE: 3aag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq + LACE: 74ag2adbabvaa2jqm4agwadpabutawjqrmadk6oskjgq + + (F) de (Japanese song title) + + = U+30D1 U+30D5 U+30A3 U+30FC (katakana) + = U+30EB U+30F3 U+30D0 (katakana) + + UTF-16: ?????????????? + BRACE: 3iu8pazt-de-pygi + AMC-M: bs3jp4d9n-de-8m9di + RACE: gdi5li7475sp6zpl6pia + DUDE: j0d1lq3vcg064lj0ebv3t0 + UTF-8: ????????????de????????? + LACE: aqyndvnd7qbaazdfamyox46q + + (G) (Japanese song title) + + = U+305D U+306E (hiragana) + = U+30B9 U+30D4 U+30FC U+30C9 (katakana) + = U+3067 (hiragana) + + RACE: gbow5oou7tewo + UTF-16: ?????????????? + BRACE: bidprdmp9wt7mi + LACE: a4yf23vz2t6mszy + AMC-M: bsmfyq5j7e9n6jr + DUDE: j05dmer9t4vcs9m7 + UTF-8: ????????????????????? + + The next several examples are all translations of the sentence "Why + can't they just speak in <language>?" (courtesy of Michael Kaplan's + "provincial" page [PROVINCIAL]). Word breaks and punctuation have + been removed, as is often done in domain names. + + (H) Arabic (Egyptian): + U+0644 U+064A U+0647 U+0645 U+0627 U+0628 U+062A U+0643 U+0644 + U+0645 U+0648 U+0634 U+0639 U+0631 U+0628 U+064A U+061F + + DUDE: m44qnli7oqk3kloj4phi8kahf + BRACE: 28akcjwcmp3ciwb4t3ngd4nbaz + AMC-M: agiekhfuhuiukdefivevjvbuiktr + RACE: azceur2fe4ucuq2eivediojrfbfb6 + LACE: cedeisshiutsqksdircuqnbzgeueuhy + UTF-16: ?????????????????????????????????? + UTF-8: ?????????????????????????????????? + + (I) Chinese (simplified): + U+4ED6 U+4EEC U+4E3A U+4EC0 U+4E48 U+4E0D U+8BF4 U+4E2D U+6587 + + UTF-16: ?????????????????? + BRACE: kgcqqsgp26i5h4zn7req5i + AMC-M: uqj7g8nvk6awispn9wupdnh + DUDE: ked6ucjas0k8gdobf4ke2dm587 + UTF-8: ??????????????????????????? 
+ LACE: azhnn3b2ybea2aml6qau4libmwdq + RACE: 3bhnmtxmjy5e5qcojbha3c7ujywwlby + + (J) Czech: Proprostnemluvesky + + = U+010D + = U+011B + = U+00ED + + UTF-8: Pro??prost??nemluv????esky + AMC-M: g26-Pro-p-prost-9m-nemluv-6pp-esky + BRACE: i32-Pro-u-prost-8y-nemluv-29f3n-esky + DUDE: N0imfh0dg70imfn3kh1bg6eltsn5mudh0dg65n3mbn9 + UTF-16: ???????????????????????????????????????????? + LACE: amaha4tpaeaq2biaobzg643uaearwbyanzsw23dvo3wqcainaqagk43\ + lpe + RACE: ah7xb73s75xq373q75zp6377op7xig77n37wl73n75wp65p7o3762dp\ + 7mx7xh73l754q + + (K) Hebrew: + U+05DC U+05DE U+05D4 U+05D4 U+05DD U+05E4 U+05E9 U+05D5 U+05D8 + U+05DC U+05D0 U+05DE U+05D3 U+05D1 U+05E8 U+05D9 U+05DD U+05E2 + U+05D1 U+05E8 U+05D9 U+05EA + + AMC-M: af4nqeep8e8jfinaqdb8ijp8cb8ij8k + DUDE: ldcukktu4pt5osgujhu8t9tu2t1u8t9ua + BRACE: 27vkyp7bgwmbpfjgc4ynx5nd8xsp5nd9c + RACE: axon5vgu3xsotvoy3tin5u6r5dm53ywr5dm6u + LACE: cyc5zxwu2to6j2ov3donbxwt2huntxpc2hunt2q + UTF-8: ???????????????????????????????????????????? + UTF-16: ???????????????????????????????????????????? + + (L) Hindi: + U+092F U+0939 U+0932 U+094B U+0917 U+0939 U+093F U+0928 U+094D + U+0926 U+0940 U+0915 U+094D U+092F U+094B U+0902 U+0928 U+0939 + U+0940 U+0902 U+092C U+094B U+0932 U+0938 U+0915 U+0924 U+0947 + U+0939 U+0948 U+0902 (Devanagari) + + BRACE: 2b7xtenqdr7zc6uma2pmcz7ibage237kdemicnk9gei32 + RACE: bextsmslc44t6kcnezabktjpjmbcqokaaiwewmrycuseookiai + LACE: dyes6ojsjmltspzijuteafknf5fqekbziabcyszshaksirzzjaba + AMC-M: ajhurbvcwmthbhuiwpugitfwpurwmscuibiscunwmvcatfuerbwisc + DUDE: p2fj9ikbh7j9vi8kdi6k0h5kdifkbg2i8j9k0g2ickbj2oh5i4k7j9k\ + 8g2 + UTF-16: ???????????????????????????????????????????????????????\ + ????? + UTF-8: ???????????????????????????????????????????????????????\ + ??????????????????????????????????? 
+ + (M) Korean: + U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774 + U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74 + U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C (Hangul syllables) + + UTF-16: ???????????????????????????????????????????????? + UTF-8: ???????????????????????????????????????????????????????\ + ????????????????? + AMC-M: yhxcj2w6exiaxi68acfn92n68ezehk6xypdpwam6zehmwhk648eavwd\ + p6aqi23ieemweywn + BRACE: y394qebjusrcndbs82pkvstf96sxufcr7ffr4vbgdwsxufcx8pdktgb\ + gmnsqydmk7im56arju6pt82 + LACE: 77atrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\ + exj2mlpfzzcyjrsely5ck4ta + RACE: 3datrlgey5mlvkfu4dakzn4mwtsmo5gvlsww3rnuxf6mo5gvotkvzmx\ + exj2mlpfzzcyjrsely5ck4ta + DUDE: s138qcc4s758raa8ke0s0acr78cke4s774t55cqd6ds5b4r97cs774t\ + 574lcr2e4q74s5bcr9c8g98s88bn44qe4c + + (N) Russian: + U+041F U+043E U+0447 U+0435 U+043C U+0443 U+0436 U+0435 U+043E + U+043D U+0438 U+043D U+0435 U+0433 U+043E U+0432 U+043E U+0440 + U+044F U+0442 U+043F U+043E U+0440 U+0443 U+0441 U+0441 U+043A + U+0438 (Cyrillic) + + DUDE: K3fuk7j5sk3j6lutotljuiuk0vijfuk0jhhjao + AMC-M: aehHgrvfemvgvfgfafvfvdgvcgiwrkhgimjjca + BRACE: 269xyjvcyafqfdwyr3xfd8z8byi6z39xyi692s7ug2 + RACE: aq7t4rzvhrbtmnj6hu4d2njthyzd4qcpii7t4qcdifatuoa + LACE: dqcd6pshgu6egnrvhy6tqpjvgm7depsaj5bd6psainaucory + UTF-16: ???????????????????????????????????????????????????????\ + ??? + UTF-8: ??????????????????????????????????????????????????????? + ??? 
+ + (O) Spanish: PorqunopuedensimplementehablarenEspaol + + = U+00E9 + = U+00F1 + + UTF-8: Porqu??nopuedensimplementehablarenEspa??ol + AMC-M: aa7-Porqu-b-nopuedensimplementehablarenEspa-j-ol + BRACE: 22x-Porqu-9-nopuedensimplementehablarenEspa-j-ol + DUDE: N0mfn2hlu9mevn0lm5klun3m9tn0mcltlun4m5ohishn2m5uLn3gm1v\ + 1mfs + RACE: abyg64troxuw433qovswizloonuw24dmmvwwk3tumvugcytmmfzgk3t\ + fonygd4lpnq + LACE: faaha33sof26s3tpob2wkzdfnzzws3lqnrsw2zloorswqylcnrqxezl\ + omvzxayprn5wa + UTF-16: ???????????????????????????????????????????????????????\ + ????????????????????????? + + (P) Taiwanese: + U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D U+6587 + + UTF-16: ?????????????????? + UTF-8: ??????????????????????????? + AMC-M: uqj7g2tbgtu6a385pspnxkupdnh + BRACE: kgcqui49gatc2wyrn8y7cndgte9 + RACE: 3bhnmuaroize5qe6xvha3cvkjywwlby + LACE: 75hnmuaroize5qe6xvha3cvkjywwlby + DUDE: ked6l011n232kec0pebdke0doaaake2dm587 + + (Q) Vietnamese: + Taisaohokhngthchi\ + noitingVit + + = U+0323 + = U+00F4 + = U+00EA + = U+0309 + = U+0301 + + UTF-8: Ta??isaoho??kh??ngth????chi??no??iti????ngVi????t + AMC-M: ada-Ta-ud-isaoho-ud-kh-s9e-ngth-s8kj-chi-j-no-b-iti-s8k\ + b-ngVi-s8kud-t + BRACE: i54-Ta-8-isaoho-ay-kh-29n-ngth-s2xa6i-chi-k-no-2g-iti-2\ + 9c29-ngVi-25p48-t + UTF-16: ???????????????????????????????????????????????????????\ + ????????????????????? + DUDE: N4m1j23g69n3m1vovj23g6bov4menn4m8uaj09g63opj09g6evj01g6\ + 9n4m9uaj01g6enN6m9uaj23g74 + LACE: aiahiyibamrqmadjonqw62dpaebsgcaannupi3thoruouaidbebqay3\ + ineaqgcicabxg6aidaecaa2lunhvacaybauag4z3wnhvacazdaeahi + RACE: ap7xj73bep7wt73t75q76377nd7w6i77np7wr77u75xp6z77ot7wr77\ + kbh7wh73i75uqt73o75xqd73j752p62p75ia763x7m77xn73j77vch7\ + 3u + + The last example is an ASCII string that breaks not only the + existing rules for host name labels but also the rules proposed in + [NAMEPREP02] for internationalized domain names. 
+ + (R) -> $1.00 <- + + UTF-8: -> $1.00 <- + DUDE: -jei0kj1iej0gi0jc- + RACE: aawt4ibegexdambahqwq + LACE: bmac2praeqys4mbqea6c2 + UTF-16: ?????????????????????? + AMC-M: aae--vqae-1-q-00-avn-- + BRACE: 229--t2b4-1-w-00-i9i-- + +Security considerations + + Users expect each domain name in DNS to be controlled by a single + authority. If a Unicode string intended for use as a domain label + could map to multiple ACE labels, then an internationalized domain + name could map to multiple ACE domain names, each controlled by + a different authority, some of which could be spoofs that hijack + service requests intended for another. Therefore AMC-ACE-M is + designed so that each Unicode string has a unique encoding. + + However, there can still be multiple Unicode representations of the + "same" text, for various definitions of "same". This problem is + addressed to some extent by the Unicode standard under the topic + of canonicalization, but some text strings may be misleading or + ambiguous to humans when used as domain names, such as strings + containing dots, slashes, at-signs, etc. These issues are being + further studied under the topic of "nameprep" [NAMEPREP02]. + +References + + [ACEID01] Yoshiro Yoneya, Naomasa Maruyama, "Proposal for + a determining process of ACE identifier", 2000-Dec-19, + draft-ietf-idn-aceid-01. + + [BRACE00] Adam Costello, "BRACE: Bi-mode Row-based + ASCII-Compatible Encoding for IDN version 0.1.2", 2000-Sep-19, + draft-ietf-idn-brace-00. + + [DUDE00] Brian Spolarich, Mark Welter, "DUDE: Differential Unicode + Domain Encoding", 2000-Nov-21, draft-ietf-idn-dude-00. + + [IDN] Internationalized Domain Names (IETF working group), + http://www.i-d-n.net/, idn@ops.ietf.org. + + [LACE01] Paul Hoffman, Mark Davis, "LACE: Length-based ASCII + Compatible Encoding for IDN", 2001-Jan-05, draft-ietf-idn-lace-01. + + [NAMEPREP02] Paul Hoffman, Marc Blanchet, "Preparation + of Internationalized Host Names", 2001-Jan-17, + draft-ietf-idn-nameprep-02. 
+ + [PROVINCIAL] Michael Kaplan, "The 'anyone can be provincial!' page", + http://www.trigeminal.com/samples/provincial.html. + + [RACE03] Paul Hoffman, "RACE: Row-based ASCII Compatible Encoding + for IDN", 2000-Nov-28, draft-ietf-idn-race-03. + + [RFC952] K. Harrenstien, M. Stahl, E. Feinler, "DOD Internet Host + Table Specification", 1985-Oct, RFC 952. + + [RFC1034] P. Mockapetris, "Domain Names - Concepts and Facilities", + 1987-Nov, RFC 1034. + + [RFC1123] Internet Engineering Task Force, R. Braden (editor), + "Requirements for Internet Hosts -- Application and Support", + 1989-Oct, RFC 1123. + + [SACE] Dan Oscarsson, "Simple ASCII Compatible Encoding (SACE)", + draft-ietf-idn-sace-*. + + [UNICODE] The Unicode Consortium, "The Unicode Standard", + http://www.unicode.org/unicode/standard/standard.html. + + [UTF5] James Seng, Martin Duerst, Tin Wee Tan, "UTF-5, a + Transformation Format of Unicode and ISO 10646", draft-jseng-utf5-*. + + [UTF6] Mark Welter, Brian W. Spolarich, "UTF-6 - Yet Another + ASCII-Compatible Encoding for IDN", draft-ietf-idn-utf6-*. + + [UTFCONV] Mark Davis, "UTF Converter", + http://www.macchiato.com/unicode/convert.html. + +Author + + Adam M. Costello + http://www.cs.berkeley.edu/~amc/ + + +Example implementation + + +/******************************************/ +/* amc-ace-m.c 0.1.0 (2001-Feb-12-Mon) */ +/* Adam M. Costello */ +/******************************************/ + +/* This is ANSI C code implementing AMC-ACE-M version 0.1.*. 
 */
+
+
+/************************************************************/
+/* Public interface (would normally go in its own .h file): */
+
+#include <limits.h>
+
+enum amc_ace_status {
+  amc_ace_success,
+  amc_ace_invalid_input,
+  amc_ace_output_too_big
+};
+
+enum case_sensitivity { case_sensitive, case_insensitive };
+
+#if UINT_MAX >= 0x10FFFF
+typedef unsigned int u_code_point;
+#else
+typedef unsigned long u_code_point;
+#endif
+
+int amc_ace_m_encode(
+  unsigned int input_length,
+  const u_code_point *input,
+  const unsigned char *uppercase_flags,
+  unsigned int *output_size,
+  unsigned char *output );
+
+  /* amc_ace_m_encode() converts Unicode to AMC-ACE-M. The input */
+  /* must be represented as an array of Unicode code points */
+  /* (not code units; surrogate pairs are not allowed), and the */
+  /* output will be represented as null-terminated ASCII. The */
+  /* input_length is the number of code points in the input. The */
+  /* output_size is an in/out argument: the caller must pass */
+  /* in the maximum number of characters that may be output */
+  /* (including the terminating null), and on successful return */
+  /* it will contain the number of characters actually output */
+  /* (including the terminating null, so it will be one more than */
+  /* strlen() would return, which is why it is called output_size */
+  /* rather than output_length). The uppercase_flags array must */
+  /* hold input_length boolean values, where nonzero means the */
+  /* corresponding Unicode character should be forced to uppercase */
+  /* after being decoded, and zero means it is caseless or should */
+  /* be forced to lowercase. Alternatively, uppercase_flags may */
+  /* be a null pointer, which is equivalent to all zeros. The */
+  /* letters a-z and A-Z are always encoded literally, regardless */
+  /* of the corresponding flags.
The encoder always outputs */ + /* lowercase base-32 characters except when nonzero values */ + /* of uppercase_flags require otherwise, so the encoder is */ + /* compatible with any of the case models. The return value */ + /* may be any of the amc_ace_status values defined above; if */ + /* not amc_ace_success, then output_size and output may contain */ + /* garbage. On success, the encoder will never need to write an */ + /* output_size greater than input_length*5+6, because of how the */ + /* encoding is defined. */ + +int amc_ace_m_decode( + enum case_sensitivity case_sensitivity, + unsigned char *scratch_space, + const unsigned char *input, + unsigned int *output_length, + u_code_point *output, + unsigned char *uppercase_flags ); + + /* amc_ace_m_decode() converts AMC-ACE-M to Unicode. The input */ + /* must be represented as null-terminated ASCII, and the output */ + /* will be represented as an array of Unicode code points. */ + /* The case_sensitivity argument influences the check on the */ + /* well-formedness of the input string; it must be case_sensitive */ + /* if case-sensitive comparisons are allowed on encoded strings, */ + /* case_insensitive otherwise (see also section "Case sensitivity */ + /* models" of the AMC-ACE-M specification). The scratch_space */ + /* must point to space at least as large as the input, which will */ + /* get overwritten (this allows the decoder to avoid calling */ + /* malloc()). The output_length is an in/out argument: the */ + /* caller must pass in the maximum number of code points that */ + /* may be output, and on successful return it will contain the */ + /* actual number of code points output. The uppercase_flags */ + /* array must have room for at least output_length values, or it */ + /* may be a null pointer if the case information is not needed. 
 */
+  /* A nonzero flag indicates that the corresponding Unicode */
+  /* character should be forced to uppercase by the caller, while */
+  /* zero means it is caseless or should be forced to lowercase. */
+  /* The letters a-z and A-Z are output already in the proper case, */
+  /* but their flags will be set appropriately so that applying the */
+  /* flags would be harmless. The return value may be any of the */
+  /* amc_ace_status values defined above; if not amc_ace_success, */
+  /* then output_length, output, and uppercase_flags may contain */
+  /* garbage. On success, the decoder will never need to write */
+  /* an output_length greater than the length of the input (not */
+  /* counting the null terminator), because of how the encoding is */
+  /* defined. */
+
+
+/**********************************************************/
+/* Implementation (would normally go in its own .c file): */
+
+#include <string.h>
+
+/* Character utilities: */
+
+/* is_ldh(codept) returns 1 if the code point represents an LDH */
+/* character (ASCII letter, digit, or hyphen-minus), 0 otherwise. */
+
+static int is_ldh(u_code_point codept)
+{
+  if (codept == 45) return 1;
+  if (codept < 48) return 0;
+  if (codept <= 57) return 1;
+  if (codept < 65) return 0;
+  if (codept <= 90) return 1;
+  if (codept < 97) return 0;
+  if (codept <= 122) return 1;
+  return 0;
+}
+
+/* is_AtoZ(c) returns 1 if c is an */
+/* uppercase ASCII letter, zero otherwise. */
+
+static unsigned char is_AtoZ(unsigned char c)
+{
+  return c >= 65 && c <= 90;
+}
+
+/* special_row_offset[n] holds the offset of the */
+/* bottom of special row 0xD8 + n, where n is in 0..7. */
+
+static u_code_point special_row_offset[] =
+  { 0x0020, 0x005B, 0x007B, 0x00A0, 0x00C0, 0x00DF, 0x0134, 0x0270 };
+
+/* base32[n] is the lowercase base-32 character representing */
+/* the number n from the range 0 to 31. Note that we cannot */
+/* use string literals for ASCII characters because an ANSI C */
+/* compiler does not necessarily use ASCII.
*/ + +static const unsigned char base32[] = { + 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */ + 109, 110, /* m-n */ + 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */ + 50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */ +}; + +/* base32_decode(c) returns the value of a base-32 character, in the */ +/* range 0 to 31, or the constant base32_invalid if c is not a valid */ +/* base-32 character. */ + +enum { base32_invalid = 32 }; + +static unsigned int base32_decode(unsigned char c) +{ + if (c < 50) return base32_invalid; + if (c <= 57) return c - 26; + if (c < 97) c += 32; + if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid; + return c - 97 - (c > 108) - (c > 111); +} + +/* unequal(case_sensitivity,a1,a2,n) returns 0 if the arrays */ +/* a1 and a2 are equal in the first n positions, 1 otherwise. */ +/* If case_sensitivity is case_insensitive, then ASCII A-Z are */ +/* considered equal to a-z respectively. */ + +static int unequal( + enum case_sensitivity case_sensitivity, + const unsigned char *a1, + const unsigned char *a2, + unsigned int n ) +{ + const unsigned char *end; + unsigned char c1, c2; + + if (case_sensitivity != case_insensitive) return memcmp(a1,a2,n); + + for (end = a1 + n; a1 < end; ++a1, ++a2) { + c1 = *a1; + c2 = *a2; + if (c1 >= 65 && c1 <= 90) c1 += 32; + if (c2 >= 65 && c2 <= 90) c2 += 32; + if (c1 != c2) return 1; + } + + return 0; +} + + +/* Encoder: */ + +int amc_ace_m_encode( + unsigned int input_length, + const u_code_point *input, + const unsigned char *uppercase_flags, + unsigned int *output_size, + unsigned char *output ) +{ + unsigned int literal, wide; /* boolean */ + u_code_point codept, n, diff, morebits; + u_code_point A, B, C, offsetA, offsetB, offsetC, offset; + const u_code_point *input_end, *p, *pp; + unsigned int count, max, next_in, next_out, max_out, codelen, i; + unsigned char c; + + input_end = input + input_length; + + /* 1) Verify that only valid code points appear: */ + + for (p = 
input; p < input_end; ++p) { + if (*p >> 11 == 0x1B || *p > 0x10FFFF) return amc_ace_invalid_input; + } + + /* 2) Determine the most populous row: B and offsetB */ + + /* first check the special rows: */ + + B = 0xD8; + offsetB = special_row_offset[0]; + max = 0; + + for (n = 0; n < 8; ++n) { + offset = special_row_offset[n]; + count = 0; + + for (p = input; p < input_end; ++p) { + if (*p - offset <= 0xFF && !is_ldh(*p)) ++count; + } + + if (count > max) { + B = 0xD8 + n; + offsetB = offset; + max = count; + } + } + + /* now check the regular rows: */ + + for (pp = input; pp < input_end; ++pp) { + n = *pp >> 8; + count = 0; + + for (p = input; p < input_end; ++p) { + if (*p >> 8 == n && !is_ldh(*p)) ++count; + } + + if (count > max || (count == max && n < B)) { + B = n; + offsetB = n << 8; + max = count; + } + } + + /* 3) Determine the most populous 16-window: A and offsetA */ + + A = 0; + max = 0; + + for (n = 0; n <= 0x1F; ++n) { + offset = ((offsetB >> 3) + n) << 3; + count = 0; + + for (p = input; p < input_end; ++p) { + if (*p - offset <= 0xF && !is_ldh(*p)) ++count; + } + + if (count > max) { + A = n; + offsetA = offset; + max = count; + } + } + + /* 4) Determine the most populous 20k-window: C */ + + C = 0; + max = 0; + + for (pp = input; pp < input_end; ++pp) { + count = 0; + n = *pp >> 11; + offset = n << 11; + + for (p = input; p < input_end; ++p) { + if (*p - offset <= 0x4FFF && !is_ldh(*p)) ++count; + + if (count > max || (count == max && n < C)) { + C = n; + max = count; + } + } + } + + /* 5) Determine the style to use: wide or narrow */ + + /* if narrow style were used: */ + + offsetC = (offsetB >> 12) << 12; + count = 3 + (B > 0xFF); + + for (p = input; p < input_end; ++p) { + if (is_ldh(*p)) { } + else if (*p - offsetA <= 0xF) count += 1; + else if (*p - offsetB <= 0xFF) count += 2; + else if (*p - offsetC <= 0xFFF) count += 3; + else if (*p <= 0xFFFF) count += 4; + else count += 5; + } + + max = count; + + /* if wide style were used: */ + + offsetC 
= C << 11; + count = B <= 0xFF && C <= 0x1F ? 3 : 5; + + for (p = input; p < input_end; ++p) { + if (is_ldh(*p)) { } + else if (*p - offsetB <= 0xFF) count += 2; + else if (*p - offsetC <= 0x4FFF) count += 3; + else if (*p <= 0xFFFF) count += 4; + else count += 5; + } + + wide = (count < max); + + /* 6) Initialize offsetC, and encode the style and offsets: */ + + max_out = *output_size; + next_out = 0; + + if (wide) { + offsetC = C << 11; + + if (B <= 0xFF && C <= 0x1F) { + if (max_out - next_out < 3) return amc_ace_output_too_big; + output[next_out++] = base32[0x10 | (B >> 5)]; + output[next_out++] = base32[B & 0x1F]; + output[next_out++] = base32[C]; + } + else { + if (max_out - next_out < 5) return amc_ace_output_too_big; + output[next_out++] = base32[0x18 | (B >> 10)]; + output[next_out++] = base32[(B >> 5) & 0x1F]; + output[next_out++] = base32[B & 0x1F]; + output[next_out++] = base32[C >> 5]; + output[next_out++] = base32[C & 0x1F]; + } + } + else { + offsetC = (offsetB >> 12) << 12; + + if (B <= 0xFF) { + if (max_out - next_out < 3) return amc_ace_output_too_big; + output[next_out++] = base32[B >> 5]; + output[next_out++] = base32[B & 0x1F]; + } + else { + if (max_out - next_out < 4) return amc_ace_output_too_big; + output[next_out++] = base32[8 | (B >> 10)]; + output[next_out++] = base32[(B >> 5) & 0x1F]; + output[next_out++] = base32[B & 0x1F]; + } + + output[next_out++] = base32[A]; + } + + /* 7) Main encoding loop: */ + + literal = 0; + + for (next_in = 0; next_in < input_length; ++next_in) { + codept = input[next_in]; + + if (codept == 45 /* hyphen-minus */) { + /* case 7.1 */ + if (max_out - next_out < 2) return amc_ace_output_too_big; + output[next_out++] = 45; + output[next_out++] = 45; + continue; + } + + if (is_ldh(codept)) { + /* case 7.2 */ + if (!literal) { + if (max_out - next_out < 1) return amc_ace_output_too_big; + output[next_out++] = 45; + literal = 1; + } + + if (max_out - next_out < 1) return amc_ace_output_too_big; + output[next_out++] 
= codept; + continue; + } + + /* case 7.3 */ + + if (literal) { + if (max_out - next_out < 1) return amc_ace_output_too_big; + output[next_out++] = 45; + literal = 0; + } + + if (!wide) { + diff = codept - offsetA; + + if (diff <= 0xF) { + /* case 7.3.1 */ + codelen = 1; + goto encoder_base32_bottom; + } + } + + diff = codept - offsetB; + + if (diff <= 0xFF) { + /* case 7.3.2 */ + codelen = 2; + goto encoder_base32_bottom; + } + + diff = codept - offsetC; + + if (diff <= 0xFFF) { + /* case 7.3.3 */ + codelen = 3; + goto encoder_base32_bottom; + } + + if (wide) { + diff = codept - offsetC - 0x1000; + + if (diff <= 0x3FFF) { + /* case 7.3.4 */ + codelen = 1; + morebits = diff & 0x3FF; + diff >>= 10; + goto encoder_base32_bottom; + } + } + + if (codept <= 0xFFFF) { + /* case 7.3.5 */ + diff = codept; + codelen = 4; + goto encoder_base32_bottom; + } + + /* case 7.3.6 */ + diff = codept - 0x10000; + codelen = 5; + + encoder_base32_bottom: /* output diff as n base-32 digits: */ + if (max_out - next_out < codelen) return amc_ace_output_too_big; + i = codelen - 1; + c = base32[diff & 0xF]; + if (uppercase_flags && uppercase_flags[next_in]) c -= 32; + output[next_out + i] = c; + + while (i > 0) { + diff >>= 4; + output[next_out + --i] = base32[0x10 | (diff & 0xF)]; + } + + next_out += codelen; + + if (wide && codelen == 1) { + /* case 7.3.4 */ + if (max_out - next_out < 2) return amc_ace_output_too_big; + output[next_out++] = base32[morebits >> 5]; + output[next_out++] = base32[morebits & 0x1F]; + } + } + + /* null terminator: */ + if (max_out - next_out < 1) return amc_ace_output_too_big; + output[next_out++] = 0; + *output_size = next_out; + return amc_ace_success; +} + + +/* Decoder: */ + +int amc_ace_m_decode( + enum case_sensitivity case_sensitivity, + unsigned char *scratch_space, + const unsigned char *input, + unsigned int *output_length, + u_code_point *output, + unsigned char *uppercase_flags ) +{ + unsigned int literal, wide, large; /* boolean */ + const unsigned 
char *next_in; + unsigned char c; + unsigned int next_out, max_out, codelen, input_size, scratch_size; + u_code_point q, B, offsets[6], diff, offset; + enum amc_ace_status status; + + /* 1) Decode the style and offsets: */ + + next_in = input; + q = base32_decode(*next_in++); + if (q == base32_invalid) return amc_ace_invalid_input; + wide = q >> 4; + large = (q >> 3) & 1; + B = q & 7; + q = base32_decode(*next_in++); + if (q == base32_invalid) return amc_ace_invalid_input; + B = (B << 5) | q; + + if (large) { + q = base32_decode(*next_in++); + if (q == base32_invalid) return amc_ace_invalid_input; + B = (B << 5) | q; + } + + /* offsets[codelen] is for base-32 codes with codelen characters */ + /* (not counting the extra two in wide-style 0xxxx xxxxx xxxxx) */ + + offsets[2] = B >> 3 == 0x1B ? special_row_offset[B & 7] : B << 8; + q = base32_decode(*next_in++); + if (q == base32_invalid) return amc_ace_invalid_input; + + if (!wide) { + offsets[1] = ((offsets[2] >> 3) + q) << 3; + offsets[3] = (offsets[2] >> 12) << 12; + } + else { + offset = q << 11; + + if (large) { + q = base32_decode(*next_in++); + if (q == base32_invalid) return amc_ace_invalid_input; + offset = (offset << 5) | q; + } + + offsets[3] = offset; + offsets[1] = offset + 0x1000; + } + + offsets[4] = 0; + offsets[5] = 0x10000; + + /* 2) Main decoding loop: */ + + max_out = *output_length; + next_out = 0; + literal = 0; + + for (;;) { + c = *next_in++; + if (!c) break; + + if (c == 45 /* hyphen-minus */) { + if (*next_in == 45) { + /* case 2.1: "--" decodes to "-" */ + ++next_in; + if (max_out - next_out < 1) return amc_ace_output_too_big; + if (uppercase_flags) uppercase_flags[next_out] = 0; + output[next_out++] = 45; + continue; + } + + /* case 2.2: unpaired hyphen-minus toggles mode */ + literal = !literal; + continue; + } + + if (!is_ldh(c)) return amc_ace_invalid_input; + if (max_out - next_out < 1) return amc_ace_output_too_big; + + if (literal) { + /* case 2.3: literal letter/digit */ + if 
 (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c);
+      output[next_out++] = c;
+      continue;
+    }
+
+    /* case 2.4: base-32 sequence */
+
+    diff = 0;
+    codelen = 1;
+
+    for (;;) {
+      q = base32_decode(c);
+      if (q == base32_invalid) return amc_ace_invalid_input;
+      diff = (diff << 4) | (q & 0xF);
+      if ((q & 0x10) == 0) break;
+      if (++codelen > 5) return amc_ace_invalid_input;
+      c = *next_in++;
+    }
+
+    /* Now codelen is the number of input characters read, */
+    /* and c is the character holding the uppercase flag. */
+
+    if (wide && codelen == 1) {
+      q = base32_decode(*next_in++);
+      if (q == base32_invalid) return amc_ace_invalid_input;
+      diff = (diff << 5) | q;
+      q = base32_decode(*next_in++);
+      if (q == base32_invalid) return amc_ace_invalid_input;
+      diff = (diff << 5) | q;
+    }
+
+    offset = offsets[codelen];
+    if (uppercase_flags) uppercase_flags[next_out] = is_AtoZ(c);
+    output[next_out++] = offset + diff;
+  }
+
+  /* 3) Re-encode the output and compare to the input: */
+
+  input_size = next_in - input;
+  scratch_size = input_size;
+  status = amc_ace_m_encode(next_out, output, uppercase_flags,
+                            &scratch_size, scratch_space);
+  if (status != amc_ace_success ||
+      scratch_size != input_size ||
+      unequal(case_sensitivity, scratch_space, input, input_size)
+     ) return amc_ace_invalid_input;
+  *output_length = next_out;
+  return amc_ace_success;
+}
+
+
+/******************************************************************/
+/* Wrapper for testing (would normally go in a separate .c file): */
+
+#include <assert.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* For testing, we'll just set some compile-time limits rather than */
+/* use malloc(), and set a compile-time option rather than using a */
+/* command-line option.
*/ + +enum { + unicode_max_length = 256, + ace_max_size = 256, + test_case_sensitivity = case_insensitive +}; + + +static void usage(char **argv) +{ + fprintf(stderr, + "%s -e reads big-endian UTF-32 and writes AMC-ACE-M ASCII.\n" + "%s -d reads AMC-ACE-M ASCII and writes big-endian UTF-32.\n" + "UTF-32 is extended: bit 31 is used as force-to-uppercase flag.\n" + , argv[0], argv[0]); + exit(EXIT_FAILURE); +} + + +static void fail(const char *msg) +{ + fputs(msg,stderr); + exit(EXIT_FAILURE); +} + +static const char too_large[] = + "input or output is too large, recompile with larger limits\n"; + +static const char invalid_input[] = "invalid input\n"; + +int main(int argc, char **argv) +{ + enum amc_ace_status status; + + if (argc != 2) usage(argv); + if (argv[1][0] != '-') usage(argv); + if (argv[1][2] != '\0') usage(argv); + + if (argv[1][1] == 'e') { + u_code_point input[unicode_max_length]; + unsigned char uppercase_flags[unicode_max_length]; + unsigned char output[ace_max_size]; + unsigned int input_length, output_size; + int c0, c1, c2, c3; + + /* Read the UTF-32 input string: */ + + input_length = 0; + + for (;;) { + c0 = getchar(); + c1 = getchar(); + c2 = getchar(); + c3 = getchar(); + + if (c1 == EOF || c2 == EOF || c3 == EOF) { + if (c0 != EOF) fail("input not a multiple of 4 bytes\n"); + break; + } + + if (input_length == unicode_max_length) fail(too_large); + + if ((c0 != 0 && c0 != 0x80) + || c1 < 0 || c1 > 0x10 + || c2 < 0 || c2 > 0xFF + || c3 < 0 || c3 > 0xFF ) { + fail(invalid_input); + } + + input[input_length] = ((u_code_point) c1 << 16) | + ((u_code_point) c2 << 8) | (u_code_point) c3; + uppercase_flags[input_length] = (c0 >> 7); + ++input_length; + } + + /* Encode, and output the result: */ + + output_size = ace_max_size; + status = amc_ace_m_encode(input_length, input, uppercase_flags, + &output_size, output); + if (status == amc_ace_invalid_input) fail(invalid_input); + if (status == amc_ace_output_too_big) fail(too_large); + assert(status == 
amc_ace_success); + fputs((char *) output, stdout); + return EXIT_SUCCESS; + } + + if (argv[1][1] == 'd') { + unsigned char input[ace_max_size], scratch[ace_max_size]; + u_code_point output[unicode_max_length], codept; + unsigned char uppercase_flags[unicode_max_length]; + unsigned int output_length, i; + size_t n; + + /* Read the AMC-ACE-M ASCII input string: */ + + n = fread(input, 1, ace_max_size, stdin); + if (n == ace_max_size) fail(too_large); + input[n] = 0; + + /* Decode, and output the result: */ + + output_length = unicode_max_length; + status = amc_ace_m_decode(test_case_sensitivity, scratch, input, + &output_length, output, uppercase_flags); + if (status == amc_ace_invalid_input) fail(invalid_input); + if (status == amc_ace_output_too_big) fail(too_large); + assert(status == 0); + + for (i = 0; i < output_length; ++i) { + putchar(uppercase_flags[i] ? 0x80 : 0); + codept = output[i]; + putchar(codept >> 16); + putchar((codept >> 8) & 0xFF); + putchar(codept & 0xFF); + } + + return EXIT_SUCCESS; + } + + usage(argv); + return EXIT_SUCCESS; /* not reached, but quiets a compiler warning */ +} + + + + INTERNET-DRAFT expires 2001-Aug-12 diff --git a/doc/draft/draft-ietf-idn-mua-00.txt b/doc/draft/draft-ietf-idn-mua-00.txt new file mode 100644 index 0000000000..45c71b1fe8 --- /dev/null +++ b/doc/draft/draft-ietf-idn-mua-00.txt @@ -0,0 +1,374 @@ +Internet Draft Maynard Kang +draft-ietf-idn-mua-00.txt i-EMAIL.net +February 5, 2001 +Expires on August 5, 2001 + + Internationalizing Domain Names in Mail User Agents + +Status of this Memo + +This document is an Internet-Draft and is in full conformance with all +provisions of Section 10 of RFC2026. + +Internet-Drafts are working documents of the Internet Engineering Task +Force (IETF), its areas, and its working groups. Note that other +groups may also distribute working documents as Internet-Drafts. 
+
+Internet-Drafts are draft documents valid for a maximum of six months
+and may be updated, replaced, or obsoleted by other documents at any
+time. It is inappropriate to use Internet-Drafts as reference material
+or to cite them other than as "work in progress."
+
+
+     The list of current Internet-Drafts can be accessed at
+     http://www.ietf.org/ietf/1id-abstracts.txt
+
+     The list of Internet-Draft Shadow Directories can be accessed at
+     http://www.ietf.org/shadow.html.
+
+
+
+Abstract
+
+This document describes a way in which domain names used in Internet
+e-mail can be internationalized by changing only end-user Mail User
+Agents, which avoids damaging other applications that handle Internet
+e-mail, such as Message Transfer Agents and Delivery Agents.
+
+1. Introduction
+
+One of the proposed solutions for internationalized domain names (IDN)
+involves updating only the user applications, with no changes required
+to the DNS protocol, servers or resolvers [IDNA]. Other solutions
+require changes to the protocol, servers, resolvers and applications.
+
+The underlying principle of [IDNA] may be applied similarly to today's
+Internet e-mail system by changing only the Mail User Agent (MUA)
+component. Existing Message Transfer Agents, Delivery Agents and other
+applications which handle e-mail then do not have to be changed at all.
+
+1.1 Definitions and Conventions
+
+Terms related to the character encoding model are used as defined in
+Unicode Technical Report 17 [UTR17].
+
+The terms "international character", "non-ASCII character" and
+"multilingual character" are used interchangeably and are taken to mean
+any abstract character which is not included in the range specified by
+[US-ASCII].
+
+1.2 Terminology
+
+The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
+and "MAY" in this document are to be interpreted as described in RFC
+2119 [RFC2119].
+
+1.3 Design Philosophy
+
+As the Internet e-mail system is a diverse, distributed and
+heterogeneous system with many vendors deploying a vast number of
+applications, it is of utmost importance that interoperability amongst
+these various components is maintained. Thus, the ideal solution would
+be one which does not compromise or damage the operation of any of these
+existing components when internationalized domain names are encountered.
+
+Also, solutions which call for changes to be made to many or even all
+components of the Internet e-mail system would require far too much
+time and effort to deploy, given that Internet e-mail has such a huge
+installed base.
+
+This solution adheres to both of the above principles: interoperability
+is preserved, and implementation is quick and inexpensive. All that the
+user has to do to use IDNs in e-mail is update his or her MUA.
+
+1.4 IDN Summary
+
+This solution specifies an IDN architecture of arch-3 (just send ACE)
+and a transition strategy of trans-1 (always do current plus new
+architecture) as described in [IDNCOMP]. The choice of ACE format is not
+defined in this document, but MUST be the same as that specified in
+[IDNA] in order to maintain uniqueness and consistency.
+
+1.5 E-mail Internationalization Summary
+
+Because many Internet e-mail standards, such as SMTP [RFC821] and the
+e-mail message format [RFC822], specify only the 7-bit ASCII character
+set [US-ASCII], international characters which use octet-based
+character encoding schemes (CES) cannot be used in e-mail transmission,
+headers and bodies.
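
As an illustration of the 7-bit restriction described above, here is a
minimal C sketch, not taken from any of the cited specifications, that
tests whether a string such as an e-mail address stays within the
US-ASCII range. The function name is hypothetical.

```c
/* Hypothetical helper, for illustration only.                      */
/* Returns 1 if every octet of the null-terminated string s lies in */
/* the 7-bit US-ASCII range (0x00..0x7F), 0 otherwise.              */

static int is_7bit_ascii(const char *s)
{
  for ( ; *s; ++s) {
    /* cast first: plain char may be signed on some platforms */
    if ((unsigned char) *s > 0x7F) return 0;
  }

  return 1;
}
```

An MUA might use a check of this kind to decide whether an address
contains anything that could not be carried unchanged in an [RFC821]
command.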
+
+Although this issue has been addressed in [RFC2045] for message bodies
+and [RFC2047] for message headers through the use of a Transfer Encoding
+Syntax (TES) such as Quoted-Printable or Base64, there is no similar
+solution which extends the functionality of [RFC821] to include usage of
+international characters, except for [RFC1652], which allows
+transmission of 8-bit data passed by the DATA command in an SMTP
+session.
+
+[RFC1652], however, does not fully address the problem of using IDNs in
+an SMTP session: an IDN may appear in parts of the session other than
+the DATA command, such as the MAIL FROM and RCPT TO commands, where it
+may be part of the e-mail address(es) specified.
+
+Hence, this would be a major stumbling block to deploying
+"just-send-8bit" IDNs for use in Internet e-mail, as these IDNs could
+not be used in SMTP e-mail transmissions due to [RFC821] restrictions.
+
+2. Architectural Overview
+
+The end-user MUA may encounter IDNs in the scenarios below:
+
+(i)   When specifying the transmission server (i.e. SMTP server)
+(ii)  When specifying the retrieval server (i.e. POP3/IMAP4/any other
+      retrieval mechanism)
+(iii) When specifying e-mail addresses during composition of a message
+(iv)  When reading messages with e-mail addresses in them
+
+As in [IDNA], the MUA is updated to process IDNs which are input by
+users and IDNs which are displayed to users, in all of the scenarios
+above.
+
+For (i) and (ii), the IDN MUST be handled in the same manner as
+specified in [IDNA]. The method of handling an IDN for (iii) and (iv) is
+described below in 2.1.
+
+2.1 Interfaces between E-mail components when composing/reading a mail
+
+The interfaces between e-mail components can be pictorially represented
+as shown below.
+
+The example assumes the setup of a POP3/IMAP4 retrieval client and
+server, but the exact nature of end-to-end e-mail transmission may vary
+accordingly (e.g. elm or pine would read directly from the mail store).
+However, these variations do not materially affect the description of
+this solution, as no changes are required at these levels.
+
+    +------+                                            +------+
+    | User |                                            | User |
+    +------+                                            +---^--+
+       | User Input:          User Display: Characters/     |
+       | Keyboard/Pen/etc     Glyphs on CRT or other        |
+ +-----v---------------+      Representation (e.g. sound)   |
+ | Input Method Editor |                       +------------|-----+
+ +---------------------+                       | Rendering Engine |
+       | Input: Any localized/                 +---------^--------+
+       |   internationalized     Output: Any localized/  |
+       |   charset                 internationalized     |
+  +----v-----------------+         charset               |
+  | +------------------+ |                    +----------|-------------+
+  | | Mail Composition | |                    | +--------------+       |
+  | |    Interface     | |      Sender's      | | Mail Reading |       |
+  | +------------------+ |        MUA         | |  Interface   |       |
+  |    |                 |                    | +--------^-----+       |
+  |    | Nameprepped ACE |     Receiver's     |          | Nameprepped |
+  |    v                 |        MUA         |          |     ACE     |
+  | +-------------+      |                    | +-------------------+  |
+  | | SMTP Client |      |                    | | POP3/IMAP4 Client |  |
+  | +-------------+      |                    | +-------------------+  |
+  +----|-----------------+                    +----------^-------------+
+       | Nameprepped                                     | Nameprepped
+       v  ACE   Nameprepped             Nameprepped      |     ACE
+ +-------------+    ACE   +------------+    ACE   +-------------------+
+ | SMTP Server |  ----->  | Mail Store |  ----->  | POP3/IMAP4 Server |
+ +-------------+          +------------+          +-------------------+
+
+2.1.1 Interface between User and Input Method Editor
+
+For ASCII characters, input is straightforward: the user types on the
+keyboard and each character typed is sent to the application.
+
+However, for international characters, the end-user has to use a
+script-specific Input Method Editor (IME), which may or may not be built
+into the OS, to interpret what the user communicates to the system and
+thereafter send the respective international characters to the
+application.
+ +For example, for input of Chinese characters, some users use IMEs +which support the "Pinyin" input method. When a user types "zhongguo" +(in ASCII characters) on the keyboard and selects the characters which +represent "China" (in Chinese) from a list, the IME sends the +international characters to the application in a user-determined +charset (e.g. GB2312). + +2.1.2 Interface between Input Method Editor and MUA Composition + Interface + +The MUA mail composition interface (i.e. the "Compose Message" +function of the MUA) SHOULD be able to accept IDNs using 8-bit character +encoding schemes, including those represented in any localized (e.g. +GB2312) or internationalized (e.g. UTF-8) charsets. + +This input typically takes place where e-mail addresses are entered +such as the "From", "To", "Cc", "Bcc" fields, amongst others, as IDNs +may be used at the right-hand-side of the "@" sign in an e-mail address +(domain-parts). + +The mail composition interface MAY allow ACE input for the same +reasons as specified in [IDNA], but is not recommended as ACE is opaque +and ugly. + +2.1.3 Interface between MUA Composition Interface and SMTP Client + +The MUA composition interface communicates with the SMTP client in the +MUA typically through internal function calls within the software itself +or through an API. It is at this level where ACE conversion of any IDN +encountered by the MUA composition interface takes place. + +Before converting the name parts of the IDN into ACE, the MUA MUST +prepare each name part as specified in [NAMEPREP]. Thereafter, the MUA +MUST convert the name parts into ACE before passing any data to the SMTP +client. + +The SMTP client then prepares the e-mail for transmission using the +SMTP protocol [RFC821], and thereafter establishes an SMTP connection +with the user-specified SMTP server to transmit the e-mail. 
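As a rough illustration of the conversion step described above, the sketch below converts only the domain part of an e-mail address and leaves the local part untouched. It rests on assumptions that are not part of this memo: the ACE format is not defined here, so Python's built-in "idna" codec (which does its own name preparation and emits a later "xn--" form of ACE) is used purely as a stand-in, and the function name ace_address is hypothetical.

```python
def ace_address(address: str) -> str:
    """Return the address with its domain part in ACE; the local part is unchanged."""
    # Split at the last "@" so that only the domain part is converted.
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("not an addr-spec: missing '@'")
    # Assumption: the stdlib "idna" codec stands in for nameprep + ACE.
    # It prepares and encodes the domain one name part (label) at a time.
    ace_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ace_domain}"


if __name__ == "__main__":
    print(ace_address("user@b\u00fccher.example"))
```

An address whose domain is already plain ASCII passes through unchanged, which matches the expectation that non-updated components keep working.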
+ +It is important to note that an IDN specified in the parameters of any +SMTP command MUST be represented in nameprepped ACE at this point in +time. This includes SMTP commands which require domain parameters (such +as the HELO and EHLO commands) and commands where e-mail addresses are +specified (such as the MAIL FROM, RCPT TO, DATA, VRFY, EXPN, SEND, SOML +and SAML commands). + +As for data passed by the DATA command, ACE conversion MUST be +performed when the "domain" portion of an "addr-spec" or when a "domain" +itself, within the context of [RFC822], is encountered. This is +necessary as an updated MUA may originate a message which is read by a +non-updated MUA. If this happens, the non-updated MUA may face +operational problems dealing with IDNs that appear in the "addr-spec" +which are not in ACE. + +Any transfer encoding syntax to be applied to the mail headers as +specified in [RFC2047] SHOULD be performed before nameprepped ACE +conversion. This is to reduce confusion between IDNs within "addr-spec" +and "domain" portions, in the context of [RFC822], and IDNs which appear +as arbitrary data in mail headers and bodies. + +2.1.4. Interface between POP3/IMAP4 client (or local mail store) and + Mail Reading Interface + +The MUA mail reading interface (i.e. "Read mail" function of an MUA) +typically displays e-mail data retrieved from either a POP3/IMAP4 +client or from a local mail store through internal function calls within +the MUA software or through an API. + +When e-mail containing an ACE-represented IDN is to be displayed, the +MUA SHOULD convert the ACE-represented IDN contained within the +"addr-spec" or "domain" portion specified in [RFC822] back into any +localized or internationalized charset of the user's choice, whenever +possible. 
In the event that it is impossible to achieve conversion back +into the selected localized charset (for example, conversion of RACE- +represented Hangeul characters into ISO-8859-1 is impossible), the MUA +should prompt the user with an error message. + +It may be possible to save and retrieve information about the original +charset of the ACE-converted IDN through the use of additional +[RFC822] mail headers, but that is not (yet) addressed by this memo. + +Although it is possible to render ACE into properly decoded glyphs and +display the actual abstract characters without any conversion to other +charsets, the MUA SHOULD NOT do this as it is not the primary function +of an MUA to render characters. This should be left to a rendering +engine which is separate from the MUA and typically embedded into the +OS. It is sufficient for the MUA to pass the appropriate charset to the +rendering engine for proper display. + +3. ACE Length Considerations + +As [RFC821] in Section 4.5.3 restricts the maximum total length of a +domain name to 64 characters, representation of IDNs using ACE may +pose a problem. Most ACEs typically require 3-4 ASCII +characters to represent one international character (especially in the +case of CJK characters, where compression is less effective). + +That would leave only about 16-21 characters (64 divided by 4 and by 3, +respectively) for the whole IDN, including all name parts and dots. +This is highly undesirable as names in some languages, such as Arabic, +cannot readily be abbreviated, and the domain names may need to be +longer than [RFC821] allows. + +To further complicate matters, several mailing list programs such as +ezmlm embed domain names into the local-parts portion of an e-mail +address during management of subscriptions, together with randomly- +generated subscription information.
This would leave an even smaller +maximum ACE length, if interoperability with such mailing list software +is to be maintained, given that there is also a 64-character +restriction on local parts. + +4. Security Considerations + +As this memo is based on [IDNA], security considerations are similar +to those faced by [IDNA]. This includes security considerations from +[NAMEPREP] as well. + +5. Other Considerations + +Although this document addresses end-user MUAs (e.g. elm, mutt, pine, +Eudora, Outlook Express, etc.) to a large extent, the definition of an +MUA could be extended to include web-based e-mail server software and +automated programs such as mailing list management software. + +End-user MUAs may also include additional functionality where IDNs may +be encountered, such as calendaring/scheduling, directory services and +digital certificate storage. This is not (yet) addressed in this memo. + +6. Future Extensions + +It is possible to achieve internationalization of the entire e-mail +address by representation of international characters in the local-parts +of an "addr-spec" using nameprepped ACE conversion in a fashion similar +to that described in this memo. + +However, this is a different problem altogether and is currently beyond +the scope of this memo. + +7. References + +[IDNA] Paul Hoffman & Patrik Faltstrom, "Internationalizing Host Names +in Applications (IDNA)", draft-ietf-idn-idna. + +[UTR17] K. Whistler & M. Davis, Unicode Consortium, "Character Encoding +Model", Unicode Technical Report #17, +http://www.unicode.org/unicode/reports/tr17/ + +[US-ASCII] United States of America Standards Institute, "USA Code for +Information Interchange", X3.4, 1968. + +[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate +Requirement Levels", March 1997, RFC 2119. + +[IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name +Proposals", draft-ietf-idn-compare. + +[RFC821] Jonathan B.
Postel, "Simple Mail Transfer Protocol", August +1982, RFC 821. + +[RFC822] David H. Crocker, "Standard for the Format of ARPA Internet +Text Messages", August 1982, RFC 822. + +[RFC2045] N. Freed & N. Borenstein, "Multipurpose Internet Mail +Extensions (MIME) Part One: Format of Internet Message Bodies", +November 1996, RFC 2045. + +[RFC2047] K. Moore, "MIME (Multipurpose Internet Mail Extensions) +Part Three: Message Header Extensions for Non-ASCII Text", November +1996, RFC 2047. + +[RFC1652] J. Klensin et al., "SMTP Service Extension for 8bit- +MIMEtransport", July 1994, RFC 1652. + + +[NAMEPREP] Paul Hoffman & Marc Blanchet, "Preparation of +Internationalized Host Names", draft-ietf-idn-nameprep. + +A. Author's Address + +Maynard Kang +i-EMAIL.net Pte Ltd +1 Kim Seng Promenade #12-07 +Great World City West Tower +Singapore 237994 +E-mail: maynard@i-email.net \ No newline at end of file diff --git a/doc/draft/draft-ietf-idn-nameprep-00.txt b/doc/draft/draft-ietf-idn-nameprep-00.txt deleted file mode 100644 index da21fad96c..0000000000 --- a/doc/draft/draft-ietf-idn-nameprep-00.txt +++ /dev/null @@ -1,855 +0,0 @@ -Internet Draft Paul Hoffman -draft-ietf-idn-nameprep-00.txt IMC & VPNC -July 3, 2000 Marc Blanchet -Expires in six months ViaGenie - - Preparation of Internationalized Host Names - -Status of this memo - -This document is an Internet-Draft and is in full conformance with all -provisions of Section 10 of RFC2026. - -Internet-Drafts are working documents of the Internet Engineering Task -Force (IETF), its areas, and its working groups. Note that other groups -may also distribute working documents as Internet-Drafts. - -Internet-Drafts are draft documents valid for a maximum of six months -and may be updated, replaced, or obsoleted by other documents at any -time. It is inappropriate to use Internet-Drafts as reference material -or to cite them other than as "work in progress." 
- - - The list of current Internet-Drafts can be accessed at - http://www.ietf.org/ietf/1id-abstracts.txt - - The list of Internet-Draft Shadow Directories can be accessed at - http://www.ietf.org/shadow.html. - - -Abstract - -This document describes how to prepare internationalized host names for -transmission on the wire. The steps include excluding characters that -are prohibited from appearing in internationalized host names, changing -all characters that have case properties to be lowercase, and -normalizing the characters. Further, this document lists the prohibited -characters. - - -1. Introduction - -When today's DNS is expanded to include internationalized host names, -those new names will be handled in many parts of the DNS. The IDN -Working Group's requirements document [IDNReq] describes a framework for -domain name handling as well as requirements for the new names. The IDN -Working Group's comparison document [IDNComp] gives a framework for how -various parts of the IDN solution work together. - -A user can enter a domain name into an application program in a myriad -of fashions. Depending on the input method, the characters entered in -the domain name may or may not be those that are allowed in -internationalized host names. Thus, there must be a way to canonicalize -the user's input before the name is resolved in the DNS. - -It is a design goal of this document to allow users to enter host names -in applications and have the highest chance of getting the name correct. -This means that the user should not be limited to entering exactly -the characters that might have been used, but should instead be able to -enter characters that unambiguously canonicalize to characters in the -desired host name. At the same time, this process must not introduce any -chance that two host names could be represented by two distinct strings -of characters that look identical to typical users.
It is also a design -goal to have all preprocessing of IDN done before going on the wire, so -that no transformation is done in the DNS server space. - -This document describes the steps needed to convert a name part from one -that is entered by the user to one that can be used in the DNS. - -1.1 Terminology - -The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and -"MAY" in this document are to be interpreted as described in RFC 2119 -[RFC2119]. - -Examples in this document use the notation from the Unicode Standard -[Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the -letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER -A". In the lists of prohibited characters, the "U+" is left off to make -the lists easier to read. - -1.2 IDN summary - -Using the terminology in [IDNComp], this document specifies all of the -prohibited characters and the canonicalization for an IDN solution. -Specifically, it covers the following sections from [IDNComp]: - -prohib-1: Identical and near-identical characters -prohib-2: Separators -prohib-3: Non-displaying and non-spacing characters -prohib-4: Private use characters -prohib-5: Punctuation -prohib-6: Symbols -canon-1.2: Normalization Form KC -canon-2.1: Case folding in ASCII -canon-2.2: Case folding in non-ASCII - -Note that this document does not cover: -canon-1.1: Normalization Form C -canon-2.3: Han folding - -1.3 Open issues - -This is the first draft of this document. Although there has been much -discussion on the WG mailing list about the topics here, there has not -yet been much agreement on some issues. Now that there is a document to -talk about, that discussion can be more focussed. - -1.3.1 Where to do name preparation - -Section 2.1 says to do name preparation in the resolver. An argument can -be made for doing name preparation in the application, before the -application service interface. 
An advantage of that proposal is that -resolvers would not need to do any name preparation. A disadvantage is -that applications would have to be updated each time the IDN protocol is -updated, for example when new characters are added to the repertoire of -allowed characters. It seems likely that resolvers are more easily -updated than all the individual applications that use internationalized -host names. - -1.3.2 Choosing between normalization form C and KC - -Much of the discussion of normalization on the WG mailing list assumed -that normalization form C would be used. Near the time that this -document was written, people started considering form KC instead of C. -This document uses form KC, but the reasons for doing so could be -contentious. - -1.3.3 Does the prohibition catch all bad characters? - -On the mailing list, there was discussion of doing prohibition in two -steps: a check against a short list of prohibited characters before case -folding in order to prevent uppercase characters that have no lowercase -equivalents from getting through, and then a full check on the output of -normalization. -In this draft, all checking is done before case folding, based on the -(possibly wrong) assumption that none of the prohibited characters will -re-appear after the case folding and normalization. If that assumption -turns out to be wrong, a check for just those problematic characters can -be added after normalization, or a full check against the prohibited -characters can be added. - - -2. Preparation Overview - -This section describes where name preparation happens and the steps that -name preparation software must take. - -2.1 Where name preparation happens - -Part of the chart in section 1.4 of [IDNReq] looks like this: - -+---------------+ -| Application | -+---------------+ - | Application service interface - | For ex.
GethostbyXXXX interface -+---------------+ -| Resolver | -+---------------+ - | <----- DNS service interface -+-------------------------------------------+ - -In this specification, the name preparation is done in the resolver, -before the DNS service interface. That is, it is acceptable for software -in the application service interface (such as a "GetHostByName" API) to -pass the resolver a name that has not been prepared. However, the -resolver MUST prepare the name as described in this specification before -passing it to the DNS service interface. - -2.2 Name preparation steps - -The steps for preparing names are: - -1) Input from the application service interface -- This can be done in -many ways and is not specified in this document - -2) Look for prohibited input -- Check for any characters that are not -allowed in the input. If any are found, return an error to the -application service interface. This step is necessary to prevent errors -in the following two steps. This step fulfills prohib-1, prohib-2, -prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp]. - -3) Fold case -- Change all uppercase characters into lowercase -characters. Design note: this step could just as easily have been -"change all lowercase characters into uppercase characters". However, -the upper-to-lower folding was chosen because most users of the Internet -today enter host names in lowercase. This step fulfills canon-2.1 and -canon-2.2 from [IDNComp]. - -4) Canonicalize -- Normalize the characters. This step fulfils canon-1.2 -from [IDNComp]. - -5) Resolution of the prepared name -- This must be specified in a -different IDN document. - -The above steps MUST be performed in the order given in order to comply -with this specification. - - -3. Prohibited Input - -Before the text can be processed, it must be checked for prohibited -characters. There is a variety of prohibited characters, as described in -this section. 
- -Note that one of the goals of IDN is to allow the widest possible set of -host names as long as those host names do not cause other problems, such -as possible ambiguity. Specifically, experience with current DNS names -has shown that there is a desire for host names that include personal -names, company names, and spoken phrases. A goal of this section is to -prohibit as few characters that might be used in these contexts as -possible while making sure that characters that might easily cause -confusion or ambiguity are prohibited. - -Note that every character listed in this section MUST NOT be transmitted -on the DNS service interface. Although the checking is performed -before case folding and canonicalization, those steps cannot result in -any of these characters if these characters are not in the input stream. -[[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS -server receives a request containing a prohibited character, then the -IDN protocol MUST return an error message. - - -Note that some characters listed in one section would also appear in -other sections. Each character is only listed once. - -3.1 prohib-1: Identical and near-identical characters - -Many characters in [ISO10646] are identical or nearly identical to other -characters. These were often included for compatibility with other -character sets.
- -The characters prohibited because they are identical or nearly identical -to allowed characters are: - -00AD SOFT HYPHEN -00D7 MULTIPLICATION SIGN -01C3 LATIN LETTER RETROFLEX CLICK -02B0-02FF [SPACING MODIFIER LETTERS] -066D ARABIC FIVE POINTED STAR -1806 MONGOLIAN TODO SOFT HYPHEN -2010 HYPHEN -2011 NON-BREAKING HYPHEN -2012 FIGURE DASH -2013 EN DASH -2014 EM DASH -2160-217F [ROMAN NUMERALS] -FB1D-FB4F [HEBREW PRESENTATION FORMS] -FB50-FDFF [ARABIC PRESENTATION FORMS A] -FE20-FE2F [COMBINING HALF MARKS] -FE30-FE4F [CJK COMPATIBILITY FORMS] -FE50-FE6F [SMALL FORM VARIANTS] -FE70-FEFC [ARABIC PRESENTATION FORMS B] -FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS] - -3.2 prohib-2: Separators - -Horizontal and vertical spacing characters would make it unclear where a -host name begins and ends. The prohibited spacing characters are: - -0020 SPACE -00A0 NO-BREAK SPACE -1680 OGHAM SPACE MARK -2000-200B [SPACES] -2028 LINE SEPARATOR -2029 PARAGRAPH SEPARATOR -202F NARROW NO-BREAK SPACE -3000 IDEOGRAPHIC SPACE - -Allowing periods and period-like characters as characters within a name -part would also cause similar confusion. 
The prohibited periods, -characters that look like periods, and characters that canonicalize to a -period or to a period-like character are: - -002E FULL STOP -06D4 ARABIC FULL STOP -2024 ONE DOT LEADER -2025 TWO DOT LEADER -2026 HORIZONTAL ELLIPSIS -2488 DIGIT ONE FULL STOP -2489 DIGIT TWO FULL STOP -248A DIGIT THREE FULL STOP -248B DIGIT FOUR FULL STOP -248C DIGIT FIVE FULL STOP -248D DIGIT SIX FULL STOP -248E DIGIT SEVEN FULL STOP -248F DIGIT EIGHT FULL STOP -2490 DIGIT NINE FULL STOP -2491 NUMBER TEN FULL STOP -2492 NUMBER ELEVEN FULL STOP -2493 NUMBER TWELVE FULL STOP -2494 NUMBER THIRTEEN FULL STOP -2495 NUMBER FOURTEEN FULL STOP -2496 NUMBER FIFTEEN FULL STOP -2497 NUMBER SIXTEEN FULL STOP -2498 NUMBER SEVENTEEN FULL STOP -2499 NUMBER EIGHTEEN FULL STOP -249A NUMBER NINETEEN FULL STOP -249B NUMBER TWENTY FULL STOP -33C2 SQUARE AM -33C7 SQUARE CO -33D8 SQUARE PM - -3.3 prohib-3: Non-displaying and non-spacing characters - -The ISO 10646 character set contains many characters that cannot be -seen. These include control characters, non-breaking spaces, formatting -characters, and tagging characters. These characters would certainly -cause confusion if allowed in host names.
- -0000-001F [CONTROL CHARACTERS] -007F DELETE -0080-009F [CONTROL CHARACTERS] -070F SYRIAC ABBREVIATION MARK -180B MONGOLIAN FREE VARIATION SELECTOR ONE -180C MONGOLIAN FREE VARIATION SELECTOR TWO -180D MONGOLIAN FREE VARIATION SELECTOR THREE -180E MONGOLIAN VOWEL SEPARATOR -200C ZERO WIDTH NON-JOINER -200D ZERO WIDTH JOINER -200E LEFT-TO-RIGHT MARK -200F RIGHT-TO-LEFT MARK -202A LEFT-TO-RIGHT EMBEDDING -202B RIGHT-TO-LEFT EMBEDDING -202C POP DIRECTIONAL FORMATTING -202D LEFT-TO-RIGHT OVERRIDE -202E RIGHT-TO-LEFT OVERRIDE -206A INHIBIT SYMMETRIC SWAPPING -206B ACTIVATE SYMMETRIC SWAPPING -206C INHIBIT ARABIC FORM SHAPING -206D ACTIVATE ARABIC FORM SHAPING -206E NATIONAL DIGIT SHAPES -206F NOMINAL DIGIT SHAPES -FEFF ZERO WIDTH NO-BREAK SPACE -FFF9 INTERLINEAR ANNOTATION ANCHOR -FFFA INTERLINEAR ANNOTATION SEPARATOR -FFFB INTERLINEAR ANNOTATION TERMINATOR -FFFC OBJECT REPLACEMENT CHARACTER -FFFD REPLACEMENT CHARACTER - -3.4 prohib-4: Private use characters - -Because private-use characters do not have defined meanings, they are -prohibited. The private-use characters are: - -E000-F8FF [PRIVATE USE, PLANE 0] - -3.5 prohib-5: Punctuation - -The following characters are reserved or delimiters in URLs [RFC2396] -and [RFC2732]: - -" # $ % & + , . / : ; < = > ? @ [ ] - -3.5.1 Characters from URLs - -The following punctuation characters are prohibited because they are -reserved or delimiters in URLs. - -0022 QUOTATION MARK -0023 NUMBER SIGN -0024 DOLLAR SIGN -0025 PERCENT SIGN -0026 AMPERSAND -002B PLUS SIGN -002C COMMA -002E FULL STOP -002F SOLIDUS -003A COLON -003B SEMICOLON -003C LESS-THAN SIGN -003D EQUALS SIGN -003E GREATER-THAN SIGN -003F QUESTION MARK -0040 COMMERCIAL AT -005B LEFT SQUARE BRACKET -005D RIGHT SQUARE BRACKET - -3.5.2 Characters that canonicalize to characters from URLs - -The following punctuation characters are prohibited because their -normalization contains one or more of the characters from section 3.5.1. 
- -037E GREEK QUESTION MARK -2048 QUESTION EXCLAMATION MARK -2049 EXCLAMATION QUESTION MARK -207A SUPERSCRIPT PLUS SIGN -207C SUPERSCRIPT EQUALS SIGN -208A SUBSCRIPT PLUS SIGN -208C SUBSCRIPT EQUALS SIGN -2100 ACCOUNT OF -2101 ADDRESSED TO THE SUBJECT -2105 CARE OF -2106 CADA UNA - -3.5.3 Characters that look like characters from URLs - -The following are prohibited because they look indistinguishable from -the characters listed in section 3.5.1. - -037E GREEK QUESTION MARK -0589 ARMENIAN FULL STOP -060C ARABIC COMMA -061B ARABIC SEMICOLON -066A ARABIC PERCENT SIGN -201A SINGLE LOW-9 QUOTATION MARK -2030 PER MILLE SIGN -2031 PER TEN THOUSAND SIGN -2033 DOUBLE PRIME -2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK -2044 FRACTION SLASH -203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK -203D INTERROBANG -3001 IDEOGRAPHIC COMMA -3002 IDEOGRAPHIC FULL STOP -3003 DITTO MARK -3008 LEFT ANGLE BRACKET -3009 RIGHT ANGLE BRACKET -3014 LEFT TORTOISE SHELL BRACKET -3015 RIGHT TORTOISE SHELL BRACKET -301A LEFT WHITE SQUARE BRACKET -301B RIGHT WHITE SQUARE BRACKET - -3.5.4 Other punctuation - -The following punctuation are prohibited because they are unlikely to -be used in names and may be confusing to users or to character-entry -processes: - -005C REVERSE SOLIDUS - -3.6 prohib-6: Symbols - -[UniData] has non-normative categories for symbols. The four symbol -categories are: - -Symbol, Currency: Currency symbols could appear in company names and -spoken phrases, so they are not prohibited. - -Symbol, Modifier: Stand-alone modifiers might appear in personal names, -company names, and spoken phrases, so they are not prohibited. - -Symbol, Math: It is very unlikely that there are any significant -personal names, company names, or spoken phrases that contain -mathematical symbols. Further, many of these symbols are the same or -similar to other punctuation, thereby leading to ambiguity. For this -reason, math-specific symbols are prohibited. 
These prohibited math -symbols are: - -00AC NOT SIGN -00B1 PLUS-MINUS SIGN -2200-22FF [MATHEMATICAL OPERATORS] - -Further, the following characters canonicalize to characters in the -above math list, and therefore are also prohibited: - -00BC VULGAR FRACTION ONE QUARTER -00BD VULGAR FRACTION ONE HALF -00BE VULGAR FRACTION THREE QUARTERS -207B SUPERSCRIPT MINUS -208B SUBSCRIPT MINUS -2153 VULGAR FRACTION ONE THIRD -2154 VULGAR FRACTION TWO THIRDS -2155 VULGAR FRACTION ONE FIFTH -2156 VULGAR FRACTION TWO FIFTHS -2157 VULGAR FRACTION THREE FIFTHS -2158 VULGAR FRACTION FOUR FIFTHS -2159 VULGAR FRACTION ONE SIXTH -215A VULGAR FRACTION FIVE SIXTHS -215B VULGAR FRACTION ONE EIGHTH -215C VULGAR FRACTION THREE EIGHTHS -215D VULGAR FRACTION FIVE EIGHTHS -215E VULGAR FRACTION SEVEN EIGHTHS -215F FRACTION NUMERATOR ONE -33A7 SQUARE M OVER S -33A8 SQUARE M OVER S SQUARED -33AE SQUARE RAD OVER S -33AF SQUARE RAD OVER S SQUARED -33C6 SQUARE C OVER KG - -Symbol, Other: This category covers a multitude of symbols, few of which -would ever appear in personal names, company names, and spoken phrases. -The rest of the prohibited symbols are: - -2190-21FF [ARROWS] -2300-23FF [MISCELLANEOUS TECHNICAL] -2400-243F [CONTROL PICTURES] -2440-245F [OPTICAL CHARACTER RECOGNITION] -2500-257F [BOX DRAWING] -2580-259F [BLOCK ELEMENTS] -25A0-25FF [GEOMETRIC SHAPES] -2600-267F [MISCELLANEOUS SYMBOLS] -2700-27BF [DINGBATS] -2800-287F [BRAILLE PATTERNS] - -3.7 Additional prohibited characters - -3.7.1 Unassigned characters - -All characters not yet assigned in [ISO10646] are prohibited. Although -this may at first seem trivial, it is extremely important because -characters that may be assigned in the future might have properties that -would cause them to be prohibited or might have case-folding properties. -As is the case of all prohibited characters, if a DNS server receives a -request containing an unassigned character, then the IDN protocol MUST -return an error message. 
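One way an implementation might detect unassigned characters is to consult the Unicode general category. The sketch below is only an illustration under stated assumptions: it uses Python's unicodedata module, whose tables reflect the Unicode version bundled with the interpreter rather than the [ISO10646] repertoire this document refers to, and the function name has_unassigned is hypothetical.

```python
import unicodedata


def has_unassigned(name_part: str) -> bool:
    """Return True if any character has general category Cn (not assigned)."""
    # Category "Cn" covers code points with no assigned character,
    # which includes noncharacters such as U+FFFF.
    return any(unicodedata.category(ch) == "Cn" for ch in name_part)


if __name__ == "__main__":
    print(has_unassigned("example"))
    print(has_unassigned("abc\uffff"))
```

Because the answer depends on the table version in use, two implementations with different tables could disagree about the same name part, which is one reason a single authoritative table matters.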
- -3.7.2 Surrogate characters - -So far, all proposals for binary encodings of internationalized name -parts have specified UTF-8 as the encoding format. In such an encoding, -surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings, -the following are prohibited: - -D800-DFFF [SURROGATE CHARACTERS] - -3.7.3 Uppercase characters with no lowercase mappings - -There are many uppercase characters in [ISO10646] which do not have -lowercase equivalents in [UniData]. Therefore, they are prohibited on -input because they would get through the case mapping step while still -being in uppercase. - -The characters that are prohibited on input because they are uppercase -but have no lowercase mappings are: - -03D2 GREEK UPSILON WITH HOOK SYMBOL -03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL -03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL -04C0 CYRILLIC LETTER PALOCHKA -10A0-10C5 [GEORGIAN CAPITAL LETTERS] - -Note that many characters in the range U+2100 to U+213A, the letterlike -symbols, are also uppercase but have no lowercase mappings. However, -they are not listed here because the entire range is already prohibited -in section 3.6. - -3.7.4 Radicals and Ideographic Description - -Some Han characters can be informally defined in terms of ideographic -descriptions. However, ideographic descriptions can allow multiple -character streams to represent the same character in a fashion that does -not canonicalize. Thus, the radicals for ideographic description and the -ideographic description characters themselves are prohibited. These -characters are: - -2E80-2EFF [CJK RADICALS SUPPLEMENT] -2F00-2FDF [KANGXI RADICALS] -2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS] - -3.8 Summary of prohibited characters - -The following is a collected list from the previous sections.
- -0000-001F [CONTROL CHARACTERS] -0020 SPACE -0022 QUOTATION MARK -0023 NUMBER SIGN -0024 DOLLAR SIGN -0025 PERCENT SIGN -0026 AMPERSAND -002B PLUS SIGN -002C COMMA -002E FULL STOP -002F SOLIDUS -003A COLON -003B SEMICOLON -003C LESS-THAN SIGN -003D EQUALS SIGN -003E GREATER-THAN SIGN -003F QUESTION MARK -0040 COMMERCIAL AT -005B LEFT SQUARE BRACKET -005C REVERSE SOLIDUS -005D RIGHT SQUARE BRACKET -007F DELETE -0080-009F [CONTROL CHARACTERS] -00A0 NO-BREAK SPACE -00AC NOT SIGN -00AD SOFT HYPHEN -00B1 PLUS-MINUS SIGN -00BC VULGAR FRACTION ONE QUARTER -00BD VULGAR FRACTION ONE HALF -00BE VULGAR FRACTION THREE QUARTERS -00D7 MULTIPLICATION SIGN -01C3 LATIN LETTER RETROFLEX CLICK -02B0-02FF [SPACING MODIFIER LETTERS] -037E GREEK QUESTION MARK -03D2 GREEK UPSILON WITH HOOK SYMBOL -03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL -03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL -04C0 CYRILLIC LETTER PALOCHKA -0589 ARMENIAN FULL STOP -060C ARABIC COMMA -061B ARABIC SEMICOLON -066A ARABIC PERCENT SIGN -066D ARABIC FIVE POINTED STAR -06D4 ARABIC FULL STOP -070F SYRIAC ABBREVIATION MARK -10A0-10C5 [GEORGIAN CAPITAL LETTERS] -1680 OGHAM SPACE MARK -1806 MONGOLIAN TODO SOFT HYPHEN -180B MONGOLIAN FREE VARIATION SELECTOR ONE -180C MONGOLIAN FREE VARIATION SELECTOR TWO -180D MONGOLIAN FREE VARIATION SELECTOR THREE -180E MONGOLIAN VOWEL SEPARATOR -2000-200B [SPACES] -200C ZERO WIDTH NON-JOINER -200D ZERO WIDTH JOINER -200E LEFT-TO-RIGHT MARK -200F RIGHT-TO-LEFT MARK -2010 HYPHEN -2011 NON-BREAKING HYPHEN -2012 FIGURE DASH -2013 EN DASH -2014 EM DASH -201A SINGLE LOW-9 QUOTATION MARK -2024 ONE DOT LEADER -2025 TWO DOT LEADER -2026 HORIZONTAL ELLIPSIS -2028 LINE SEPARATOR -2029 PARAGRAPH SEPARATOR -202A LEFT-TO-RIGHT EMBEDDING -202B RIGHT-TO-LEFT EMBEDDING -202C POP DIRECTIONAL FORMATTING -202D LEFT-TO-RIGHT OVERRIDE -202E RIGHT-TO-LEFT OVERRIDE -202F NARROW NO-BREAK SPACE -2030 PER MILLE SIGN -2031 PER TEN THOUSAND SIGN -2033 DOUBLE PRIME
-2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK -203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK -203D INTERROBANG -2044 FRACTION SLASH -2048 QUESTION EXCLAMATION MARK -2049 EXCLAMATION QUESTION MARK -206A INHIBIT SYMMETRIC SWAPPING -206B ACTIVATE SYMMETRIC SWAPPING -206C INHIBIT ARABIC FORM SHAPING -206D ACTIVATE ARABIC FORM SHAPING -206E NATIONAL DIGIT SHAPES -206F NOMINAL DIGIT SHAPES -207A SUPERSCRIPT PLUS SIGN -207B SUPERSCRIPT MINUS -207C SUPERSCRIPT EQUALS SIGN -208A SUBSCRIPT PLUS SIGN -208B SUBSCRIPT MINUS -208C SUBSCRIPT EQUALS SIGN -2100 ACCOUNT OF -2101 ADDRESSED TO THE SUBJECT -2105 CARE OF -2106 CADA UNA -2153 VULGAR FRACTION ONE THIRD -2154 VULGAR FRACTION TWO THIRDS -2155 VULGAR FRACTION ONE FIFTH -2156 VULGAR FRACTION TWO FIFTHS -2157 VULGAR FRACTION THREE FIFTHS -2158 VULGAR FRACTION FOUR FIFTHS -2159 VULGAR FRACTION ONE SIXTH -215A VULGAR FRACTION FIVE SIXTHS -215B VULGAR FRACTION ONE EIGHTH -215C VULGAR FRACTION THREE EIGHTHS -215D VULGAR FRACTION FIVE EIGHTHS -215E VULGAR FRACTION SEVEN EIGHTHS -215F FRACTION NUMERATOR ONE -2160-217F [ROMAN NUMERALS] -2190-21FF [ARROWS] -2200-22FF [MATHEMATICAL OPERATORS] -2300-23FF [MISCELLANEOUS TECHNICAL] -2400-243F [CONTROL PICTURES] -2440-245F [OPTICAL CHARACTER RECOGNITION] -2488 DIGIT ONE FULL STOP -2489 DIGIT TWO FULL STOP -248A DIGIT THREE FULL STOP -248B DIGIT FOUR FULL STOP -248C DIGIT FIVE FULL STOP -248D DIGIT SIX FULL STOP -248E DIGIT SEVEN FULL STOP -248F DIGIT EIGHT FULL STOP -2490 DIGIT NINE FULL STOP -2491 NUMBER TEN FULL STOP -2492 NUMBER ELEVEN FULL STOP -2493 NUMBER TWELVE FULL STOP -2494 NUMBER THIRTEEN FULL STOP -2495 NUMBER FOURTEEN FULL STOP -2496 NUMBER FIFTEEN FULL STOP -2497 NUMBER SIXTEEN FULL STOP -2498 NUMBER SEVENTEEN FULL STOP -2499 NUMBER EIGHTEEN FULL STOP -249A NUMBER NINETEEN FULL STOP -249B NUMBER TWENTY FULL STOP -2500-257F [BOX DRAWING] -2580-259F [BLOCK ELEMENTS] -25A0-25FF [GEOMETRIC SHAPES] -2600-267F [MISCELLANEOUS SYMBOLS] -2700-27BF [DINGBATS] -2800-287F [BRAILLE 
PATTERNS] -2E80-2EFF [CJK RADICALS SUPPLEMENT] -2F00-2FDF [KANGXI RADICALS] -2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS] -3000 IDEOGRAPHIC SPACE -3001 IDEOGRAPHIC COMMA -3002 IDEOGRAPHIC FULL STOP -3003 DITTO MARK -3008 LEFT ANGLE BRACKET -3009 RIGHT ANGLE BRACKET -33A7 SQUARE M OVER S -33A8 SQUARE M OVER S SQUARED -33AE SQUARE RAD OVER S -33AF SQUARE RAD OVER S SQUARED -33C2 SQUARE AM -33C2 SQUARE AM -33C6 SQUARE C OVER KG -33C7 SQUARE CO -33D8 SQUARE PM -33D8 SQUARE PM -D800-DFFF [SURROGATE CHARACTERS] -E000-F8FF [PRIVATE USE, PLANE 0] -FB1D-FB4F [HEBREW PRESENTATION FORMS] -FB50-FDFF [ARABIC PRESENTATION FORMS A] -FE20-FE2F [COMBINING HALF MARKS] -FE30-FE4F [CJK COMPATIBILITY FORMS] -FE50-FE6F [SMALL FORM VARIANTS] -FE70-FEFC [ARABIC PRESENTATION FORMS B] -FEFF ZERO WIDTH NO-BREAK SPACE -FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS] -FFF9 INTERLINEAR ANNOTATION ANCHOR -FFFA INTERLINEAR ANNOTATION SEPARATOR -FFFB INTERLINEAR ANNOTATION TERMINATOR -FFFC OBJECT REPLACEMENT CHARACTER -FFFD REPLACEMENT CHARACTER -Unassigned characters - - -4. Case Folding - -After it has been verified that the input text has none of the -characters prohibited for case folding, the case-folding step itself is -quite straight-forward. For each character in the input, if there is a -lowercase mapping for that character in [UniData], the input character -is changed to the mapped lowercase letter. - - -5. Canonicalization - -After case folding, the input string is normalized using form KC, as -described in [UTR15]. - -6. IDN Table Revisions - -A table consisting of all characters allowed and prohibited and the -rules for case folding and canonicalization will be created based on the -content of the [UniData] and on the content of this document. This table -will be the authority for implementations to follow and will be -normatively referenced by this document. 
Such a table will enable the -IDN protocol to have versions independent of the revisions to Unicode -and/or to ISO 10646 because the revision of IDN and its deployment may -not be in sync with revisions to Unicode and ISO 10646. - -In a future draft of this document, IANA will be asked to keep this -table, with an initial version number of 1. Each new version of the -table will have a new, higher version number. - - -7. Security Considerations - -Much of the security of the Internet relies on the DNS. Thus, any change -to the characteristics of the DNS can change the security of much of the -Internet. - -Host names are used by users to connect to Internet servers. The -security of the Internet would be compromised if a user entering a -single internationalized name could be connected to different servers -based on different interpretations of the internationalized host name. - - -8. References - -[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name -Proposals", draft-ietf-idn-compare. - -[IDNReq] James Seng, "Requirements of Internationalized Domain Names", -draft-ietf-idn-requirement. - -[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information -technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part -1: Architecture and Basic Multilingual Plane. Five amendments and a -technical corrigendum have been published up to now. UTF-16 is described -in Annex Q, published as Amendment 1. 17 other amendments are currently -at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE -UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]] - -[Normalize] Character Normalization in IETF Protocols, -draft-duerst-i18n-norm-03 - -[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate -Requirement Levels", March 1997, RFC 2119. - -[RFC2396] Tim Berners-Lee, et. al., "Uniform Resource Identifiers (URI): -Generic Syntax", August 1998, RFC 2396. - -[RFC2732] Robert Hinden, et. 
al., Format for Literal IPv6 Addresses in -URL's, December 1999, RFC 2732. - -[STD13] Paul Mockapetris, "Domain names - implementation and -specification", November 1987, STD 13 (RFC 1035). - -[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version -3.0", ISBN 0-201-61633-5. Described at -. - -[UniData] The Unicode Consortium. UnicodeData File. -. - -[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms. -Unicode Technical Report #15. -. - - -A. Acknowledgements - -Many people from the IETF IDN Working Group and the Unicode Technical -Committee contributed ideas that went into the first draft of this -document. Mark Davis was particularly helpful in some of the early -ideas. - - -B. Changes From Previous Versions of this Draft - -This is the -00 version, so there are no changes. - - -C. IANA Considerations - -There are no specific IANA considerations in this draft, but there will -be in a future draft of this document. - - -D. Author Contact Information - -Paul Hoffman -Internet Mail Consortium and VPN Consortium -127 Segre Place -Santa Cruz, CA 95060 USA -paul.hoffman@imc.org and paul.hoffman@vpnc.org - -Marc Blanchet -Viagenie inc. -2875 boul. Laurier, bur. 300 -Ste-Foy, Quebec, Canada, G1V 2M2 -Marc.Blanchet@viagenie.qc.ca diff --git a/doc/draft/draft-ietf-idn-nameprep-02.txt b/doc/draft/draft-ietf-idn-nameprep-02.txt new file mode 100644 index 0000000000..b0b27f2b0e --- /dev/null +++ b/doc/draft/draft-ietf-idn-nameprep-02.txt @@ -0,0 +1,1988 @@ +Internet Draft Paul Hoffman +draft-ietf-idn-nameprep-02.txt IMC & VPNC +January 17, 2001 Marc Blanchet +Expires in six months ViaGenie + + Preparation of Internationalized Host Names + +Status of this memo + +This document is an Internet-Draft and is in full conformance with all +provisions of Section 10 of RFC2026. + +Internet-Drafts are working documents of the Internet Engineering Task +Force (IETF), its areas, and its working groups. 
Note that other groups +may also distribute working documents as Internet-Drafts. + +Internet-Drafts are draft documents valid for a maximum of six months +and may be updated, replaced, or obsoleted by other documents at any +time. It is inappropriate to use Internet-Drafts as reference material +or to cite them other than as "work in progress." + +To view the list of Internet-Draft Shadow Directories, see +http://www.ietf.org/shadow.html. + + +Abstract + +This document describes how to prepare internationalized host names for +use in the DNS. The steps include: +- mapping characters to other characters, such as to change their case +- normalizing the characters +- excluding characters that are prohibited from appearing in +internationalized host names + +1. Introduction + +When expanding today's DNS to include internationalized host names, +those new names will be handled in many parts of the DNS. The IDN +Working Group's requirements document [IDNReq] describes a framework for +domain name handling as well as requirements for the new names. The IDN +Working Group's comparison document [IDNComp] gives a framework for how +various parts of the IDN solution work together. + +A user can enter a domain name into an application program in a myriad +of fashions. Depending on the input method, the characters entered in +the domain name may or may not be those that are allowed in +internationalized host names. Thus, there must be a way to normalize +the user's input before the name is resolved in the DNS. + +It is a design goal of this document to allow users to enter host names +in applications and have the highest chance of getting the name correct. +This means that the user should not be limited to only entering exactly +the characters that might have been used, but to instead be able to +enter characters that unambiguously normalize to characters in the +desired host name. 
At the same time, this process must not introduce any +chance that two host names could be represented by two distinct strings +of characters that look identical to typical users. It is also a design +goal to have all preprocessing of IDN done before going on the wire, so +that no transformation is done in the DNS server space. Name preparation +can be done in other places, such as in the registration process. + +This document describes the steps needed to convert a name part from one +that is entered by the user to one that can be used in the DNS. + +1.1 Terminology + +The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and +"MAY" in this document are to be interpreted as described in RFC 2119 +[RFC2119]. + +Examples in this document use the notation for code points and names +from the Unicode Standard [Unicode3] and ISO 10646 [ISO10646]. For +example, the letter "a" may be represented as either "U+0061" or "LATIN +SMALL LETTER A". In the lists of prohibited characters, the "U+" is left +off to make the lists easier to read. The names of character ranges are +shown in square brackets (such as "[SYMBOLS]") and do not come from the +standards. + +Note: A glossary of terms used in Unicode and ISO 10646 can be found in +[Glossary]. Information on the 10646/Unicode character model can be +found in [CharModel]. + + +2. Preparation Overview + +The steps for preparing names are: + +1) Input from the application service interface -- This can be done in +many ways and is not specified in this document + +2) Map -- For each character in the input, check if it has a mapping +and, if so, replace it with its mapping. The mappings are a combination +of folding uppercase characters to lowercase and hyphen mapping. This +is described in Section 4. + +3) Normalize -- Normalize the characters. This is described in Section 5. + +4) Look for prohibited output -- Check for any characters that are not +allowed in the output. 
If any are found, return an error to the +application service interface. This is described in Section 6. + +5) Resolution of the prepared name -- This must be specified in a +different IDN document. + +The above steps MUST be performed in the order given in order to comply +with this specification. + +The steps in this document have associated tables in the document. The +tables are derived from outside sources, and the derivation is briefly +described in the document. Although a great deal of effort has gone into +preparing the tables, there is a chance that the tables do not correctly +reflect the outside sources. Regardless of whether or not the tables +differ from the sources, implementations MUST use the tables in this +document for their processing. That is, if there is an error in the +tables, the tables must still be used. Future versions of this document +may include corrections and additions to the tables. + + +3. Mapping + +Each character in the input stream is checked against the mapping table. +The mapping table can be found in Appendix E of this document. That +table includes all the steps described in the subsections below. + +The mappings can be one-to-none, one-to-one, or one-to-many. That is, +some characters may be eliminated or replaced by more than one +character, and the output of this step might be shorter or longer than +the input. + +Design note: Characters that are not wanted in internationalized name +parts can either be mapped to nothing in the mapping step, or cause an +error in the prohibition step. The general guideline used to pick +between the two outcomes was that removing alphabetic, non-protocol +characters be done in the mapping step, but all other removals be done +in the prohibition step. This allows for simple linguistic errors on the +part of an input mechanism to be caught in the mapping step, but to not +hide serious errors such as entering protocol characters or invisible +characters from the user. 
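The mapping step described above can be sketched as follows. This is only an illustration, not part of the specification: the names MAPPING_EXCERPT and map_step are hypothetical, and the dictionary holds just three entries copied from the Appendix E table, one for each kind of mapping (one-to-one, one-to-many, one-to-none). A conforming implementation would have to load the complete table.

```python
# Hypothetical excerpt of the Appendix E mapping table. Each key is an
# input code point; each value is the (possibly empty) sequence of code
# points it maps to. Only three entries are shown for illustration.
MAPPING_EXCERPT = {
    0x0041: (0x0061,),          # one-to-one:  "0041; 0061; Case map"
    0x00DF: (0x0073, 0x0073),   # one-to-many: "00DF; 0073 0073; Case map"
    0x00AD: (),                 # one-to-none: "00AD; ; Map out"
}


def map_step(text, table=MAPPING_EXCERPT):
    """Replace every character that has a table entry; pass others through."""
    out = []
    for ch in text:
        mapped = table.get(ord(ch))
        if mapped is None:
            out.append(ch)                        # no entry: unchanged
        else:
            out.extend(chr(cp) for cp in mapped)  # zero or more characters
    return "".join(out)
```

Because an entry can be empty or hold several code points, the output of such a function may be shorter or longer than its input, as noted above.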
+ +3.1 Case mapping + +For each character in the input, if there is a lowercase mapping for +that character, the input character is changed to the mapped lowercase +character(s). The entries in the mapping table are derived from [UTR21]. + +Design note: this step could have been "change all lowercase characters +into uppercase characters". However, the upper-to-lower folding was +chosen because most users of the Internet today enter host names in +lowercase. + +3.2 Additional folding mappings + +There are some characters that do not have mappings in [UTR21] but still +need processing. These characters include a few Greek characters and +many symbols that contain Latin characters. The list of characters to +add to the mapping table was determined by the following algorithm: + +b = Normalize(Fold(a)); +c = Normalize(Fold(b)); +if c is not the same as b, add a mapping for "a to c". + +Because Normalize(Fold(c)) always equals c, the table is stable from +that point on. + +3.3 Mapped out + +The following characters are simply deleted from the input (that is, +they are mapped to nothing) because their presence or absence should not +make two domain names different. + +Some characters are only useful in line-based text, and are otherwise +invisible and ignored. + +00AD; SOFT HYPHEN +1806; MONGOLIAN TODO SOFT HYPHEN +200B; ZERO WIDTH SPACE +FEFF; ZERO WIDTH NO-BREAK SPACE + +Variation selectors and cursive connectors select different glyphs, but +do not bear semantics. + +180B; MONGOLIAN FREE VARIATION SELECTOR ONE +180C; MONGOLIAN FREE VARIATION SELECTOR TWO +180D; MONGOLIAN FREE VARIATION SELECTOR THREE +200C; ZERO WIDTH NON-JOINER +200D; ZERO WIDTH JOINER + + +4. Normalization + +The output of the mapping step is normalized using form KC, as described +in [UTR15]. Using form KC instead of form C causes many characters that +are identical or near-identical to be converted into a single character. +Note that this specification refers to a specific version of [UTR15]. 
+If a later version of [UTR15] changes the algorithm used for normalizing, +that later version MUST NOT be used with this specification. Note that +it is likely that this specification will be revised if UTR15 is changed, +but until that happens, only the specified version of [UTR15] must +be used. + + +5. Prohibited Output + +Before the text can be emitted, it must be checked for prohibited code +points. There is a variety of prohibited code points, as described in +this section. + +One of the goals of IDN is to allow the widest possible set of host +names as long as those host names do not cause other problems, such as +conflict with other standards. Specifically, experience with current DNS +names has shown that there is a desire for host names that include +personal names, company names, and spoken phrases. A goal of this +section is to prohibit as few characters that might be used in these +contexts as possible. + +Note that every code point listed in this section MUST NOT be transmitted +on the DNS service interface. If a DNS server receives a request +containing a prohibited code point, then the DNS server MUST NOT resolve +that name. + +The collected list of prohibited code points can be found in Appendix F +of this document. The list in Appendix F MUST be used by implementations +of this specification. If there are any discrepancies between the list +in Appendix F and subsections below, the list in Appendix F always takes +precedence. + +Some code points listed in one section would also appear in other +sections. Each code point is only listed once in the table in Appendix +F. + +5.1 Currently-prohibited ASCII characters + +Some of the ASCII characters that are currently prohibited in host names +by [STD13] are also used in protocol elements such as URIs. The other +characters in the range U+0000 to U+007F that are not currently allowed +are also prohibited in host name parts to reserve them for future use in +protocol elements. 
+ +0000-002C; [ASCII] +002E-002F; [ASCII] +003A-0040; [ASCII] +005B-0060; [ASCII] +007B-007F; [ASCII] + +5.2 Space characters + +Space characters would make visual transcription of URLs nearly +impossible and could lead to user entry errors in many ways. + +0020; SPACE +00A0; NO-BREAK SPACE +2000; EN QUAD +2001; EM QUAD +2002; EN SPACE +2003; EM SPACE +2004; THREE-PER-EM SPACE +2005; FOUR-PER-EM SPACE +2006; SIX-PER-EM SPACE +2007; FIGURE SPACE +2008; PUNCTUATION SPACE +2009; THIN SPACE +200A; HAIR SPACE +202F; NARROW NO-BREAK SPACE +3000; IDEOGRAPHIC SPACE +1680; OGHAM SPACE MARK +200B; ZERO WIDTH SPACE + +5.3 Control characters + +Control characters cannot be seen and can cause unpredictable results +when displayed. + +0000-001F; [CONTROL CHARACTERS] +007F; DELETE +0080-009F; [CONTROL CHARACTERS] +2028; LINE SEPARATOR +2029; PARAGRAPH SEPARATOR + +5.4 Private use and replacement characters + +Because private-use characters do not have defined meanings, they are +prohibited. The private-use characters are: + +E000-F8FF; [PRIVATE USE, PLANE 0] +F0000-FFFFD; [PRIVATE USE, PLANE 15] +100000-10FFFD; [PRIVATE USE, PLANE 16] + +The replacement character (U+FFFD) has no known semantic definition in a +name, and is often used in renderers to say "there would be some +character here, but it cannot be rendered". For example, on a computer +with no Asian fonts, a name with three katakana characters might be +rendered with three replacement characters. + +FFFD; REPLACEMENT CHARACTER + +5.5 Non-character codepoints + +Non-character code points are code points that have been assigned in +ISO 10646 but are not characters. Because they are already assigned, +they are guaranteed not to later change into characters. 
+ +FFFE-FFFF; [NONCHARACTER CODE POINTS] +1FFFE-1FFFF; [NONCHARACTER CODE POINTS] +2FFFE-2FFFF; [NONCHARACTER CODE POINTS] +3FFFE-3FFFF; [NONCHARACTER CODE POINTS] +4FFFE-4FFFF; [NONCHARACTER CODE POINTS] +5FFFE-5FFFF; [NONCHARACTER CODE POINTS] +6FFFE-6FFFF; [NONCHARACTER CODE POINTS] +7FFFE-7FFFF; [NONCHARACTER CODE POINTS] +8FFFE-8FFFF; [NONCHARACTER CODE POINTS] +9FFFE-9FFFF; [NONCHARACTER CODE POINTS] +AFFFE-AFFFF; [NONCHARACTER CODE POINTS] +BFFFE-BFFFF; [NONCHARACTER CODE POINTS] +CFFFE-CFFFF; [NONCHARACTER CODE POINTS] +DFFFE-DFFFF; [NONCHARACTER CODE POINTS] +EFFFE-EFFFF; [NONCHARACTER CODE POINTS] +FFFFE-FFFFF; [NONCHARACTER CODE POINTS] +10FFFE-10FFFF; [NONCHARACTER CODE POINTS] + +5.6 Surrogate codes + +The following code points are permanently reserved for use as surrogate +code values in the UTF-16 encoding, will never be assigned to +characters, and are therefore prohibited: + +D800-DFFF; [SURROGATE CODES] + +5.7 Inappropriate for plain text + +The following characters should not appear in regular text. + +FFF9; INTERLINEAR ANNOTATION ANCHOR +FFFA; INTERLINEAR ANNOTATION SEPARATOR +FFFB; INTERLINEAR ANNOTATION TERMINATOR +FFFC; OBJECT REPLACEMENT CHARACTER + +5.8 Inappropriate for domain names + +The ideographic description characters allow different sequences of +characters to be rendered the same way, which makes them inappropriate +for host names that must have a single canonical order. + +2FF0-2FFF; [IDEOGRAPHIC DESCRIPTION CHARACTERS] + +5.9 Change display properties + +The following characters, some of which are deprecated in ISO 10646, +can cause changes in display or the order in which characters appear +when rendered. 
+ +200E; LEFT-TO-RIGHT MARK +200F; RIGHT-TO-LEFT MARK +202A; LEFT-TO-RIGHT EMBEDDING +202B; RIGHT-TO-LEFT EMBEDDING +202C; POP DIRECTIONAL FORMATTING +202D; LEFT-TO-RIGHT OVERRIDE +202E; RIGHT-TO-LEFT OVERRIDE +206A; INHIBIT SYMMETRIC SWAPPING +206B; ACTIVATE SYMMETRIC SWAPPING +206C; INHIBIT ARABIC FORM SHAPING +206D; ACTIVATE ARABIC FORM SHAPING +206E; NATIONAL DIGIT SHAPES +206F; NOMINAL DIGIT SHAPES + + +6. Unassigned Code Points + +All code points not yet assigned in ISO 10646 are called "unassigned +code points". Authoritative name servers MUST NOT have internationalized +name parts that contain any unassigned code points. DNS requests MAY +contain name parts that contain unassigned code points. Note that this +is the only part of this document where the requirements for queries +differ from the requirements for names in DNS zones. + +Using two different policies for where unassigned code points can appear +in the DNS prevents the need for versioning the IDN protocol [IDNRev]. +This is very useful since it makes the overall processing simpler and does +not impose a "protocol" to handle versioning. It is expected that ISO +10646 will be updated fairly frequently; recently, it has happened +approximately once a year. Each time a new version of ISO 10646 appears, +a new version of this document can be created. Some end users will want +to use the new code points as soon as they are defined. + +The list of unassigned code points can be found in Appendix G of this +document. The list in Appendix G MUST be used by implementations of this +specification. If there are any discrepancies between the list in +Appendix G and the ISO 10646 specification, the list in Appendix G always +takes precedence. + +Due to the way that versioning is handled in this section, host names +that are embedded in structures that cannot be changed (such as the +signed parts of digital certificates) MUST NOT have internationalized +name parts that contain any unassigned code points. 
+ +6.1 Categories of code points + +Each code point in ISO 10646 can be categorized by how it acts in the +process described in earlier sections of this document: + +AO Code points that may be in the output + +MN Code points that cannot be in the output because they are + mapped to nothing or never appear as output from + normalization + +D Code points that cannot be in the output because they are + disallowed in the prohibition step + +U Unassigned code points + +A subsequent version of this document that references a newer version of +ISO 10646 with new code points will inherently have some code points +move from category U to either D, MN, or AO. For backwards +compatibility, no future version of this document will move code points +from any other category. That is, no current AO, MN, or D code points +will ever change to a different category. + +Authoritative name servers MUST NOT contain any name that has code +points outside of AO for the latest version of this document. That is, +they are forbidden to contain any IDN names containing code points from +the MN, D, or U categories. + +Applications creating name queries MUST treat U code points as if they +were AO when preparing the name parts according to this document. Those +applications MAY optionally have a preprocess that provides stricter +checks: treating unassigned code points in the input as errors, or +warning the user about the fact that the code point is unassigned in the +version of this document that the software is based on; such a choice is +a local matter for the software. + +Non-authoritative DNS servers MAY reject names that contain code points +that are in categories MN or D for the version of this document that +they implement, but MUST NOT reject names because they contain name +parts with code points from category U. 
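The two policies in this section can be sketched as a pair of predicates. This is a minimal illustration under stated assumptions: CATEGORY_EXCERPT, zone_may_contain and query_may_contain are hypothetical names, and the four category assignments are toy values (U+0378 in particular is only assumed to be unassigned for the purpose of the example). A real implementation would derive categories from the tables in this document.

```python
# Hypothetical, tiny category table; real code would cover every code point.
CATEGORY_EXCERPT = {
    0x0061: "AO",  # LATIN SMALL LETTER A: may be in the output
    0x00AD: "MN",  # SOFT HYPHEN: mapped to nothing
    0x0020: "D",   # SPACE: disallowed in the prohibition step
    0x0378: "U",   # assumed unassigned, for illustration only
}


def categories(name, table=CATEGORY_EXCERPT):
    """Return the category of each code point in a name part."""
    return [table[ord(ch)] for ch in name]


def zone_may_contain(name):
    """Authoritative servers: every code point must be in category AO."""
    return all(cat == "AO" for cat in categories(name))


def query_may_contain(name):
    """Queries: category U is treated as if it were AO."""
    return all(cat in ("AO", "U") for cat in categories(name))
```

With these toy values, a name part containing U+0378 is acceptable in a query but not in a zone, which is the single point where the two sets of requirements differ.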
+ +6.2 Reasons for difference between authoritative servers and requests + +Different software using different versions of this document need to +interoperate with maximal compatibility. The scheme described in this +section (authoritative name servers MUST NOT use unassigned code points, +requests MAY include unassigned code points) allows that compatibility +without introducing any known security or interoperability issues. + +The list below shows what happens if a request contains a code point +from category U that is allowed in a newer version of this document. The +request either resolves to the domain name that was intended, or +resolves to no domain at all. In this list, the request comes from an +application using version "oldVersion" of this document, the +authoritative name server is using version "newVersion" of this +document, and the code point X was in category U on oldVersion, and has +changed category to AO, MN, or D. There are 3 possible scenarios: + +1. X becomes AO -- In newVersion, X is in category AO. Because the +application passed X through, it gets back correct data from the +authoritative name server. There is one exceptional case, where X is a +combining mark. + +The order of combining marks is normalized, so if another combining mark +Y has a lower combining class than X then XY will be put in the +canonical order YX. (Unassigned code points are never reordered, so this +doesn't happen in oldVersion). If the request contains YX, the request +will get correct data from the authoritative name server. However, no +domain name can be registered with XY, so a request with XY will get a +"no such host" error. + +2. X becomes MN -- In newVersion, X is normalized to code point "nX" and +therefore X is now put in category MN. This cannot exist in any domain +name, so any request containing X will get back a "no such host" error. +Note, however, if the request had contained the letter nX, it would have +gotten back correct data. + +3. 
X becomes D -- In newVersion, X is in category D. This cannot exist +in any domain name, so any request containing X will get back a "no such +host" error. + +In none of the cases does the request get data for a host name other +than the one it actually wanted. + +The processing in this document is always stable. If a string S is the +result of processing on newVersion, then it will remain the same when +processed on oldVersion. + +There is always a way for the application to get the correct data from +the authoritative name server. For example, suppose that was +unassigned in oldVersion, and that it is assigned in newVersion, but +case-folded to . As long as the application supplies strings +containing instead of , the correct data will be +returned. Because the processing is stable, a different application +running newVersion can pass a processed host name to the application +running oldVersion. It will only contain , and will return the +correct results from the authoritative name server. + +6.3 Versions of applications and authoritative name servers + +Another way to see that this versioning system works is to compare what +happens when an application uses a newer or older version of this +document. + +Newer application -- Suppose that an application or intermediary DNS +server is using version newVersion and the authoritative name server is +using version oldVersion. This case is simple: there will be no names on +the server that cannot be accessed by the application because the +resolver uses a superset of the code points accepted by the server. + +Newer server -- Suppose that an application or intermediary DNS server +is using oldVersion and the authoritative name server is using +newVersion. Because the application passed through any unassigned code +points, the user can access names on the server that use code points in +newVersion. No names on the site can have code points that are +unassigned in newVersion, since that is illegal. 
In this case, the +application has to enter the unassigned code points in the correct +order, and has to use unassigned code points that would make it through +both the mapping and the normalization steps. + + +7. Security Considerations + +Much of the security of the Internet relies on the DNS. Thus, any change +to the characteristics of the DNS can change the security of much of the +Internet. + +Host names are used by users to connect to Internet servers. The +security of the Internet would be compromised if a user entering a +single internationalized name could be connected to different servers +based on different interpretations of the internationalized host name. + +Current applications may assume that the characters allowed in host +names will always be the same as they are in [STD13]. This document +vastly increases the number of characters available in host names. Every +program that uses "special" characters in conjunction with host names +may be vulnerable to attack based on the new characters allowed by this +specification. + + +8. References + +[CharModel] Unicode Technical Report #17, Character Model. +. + +[Glossary] Unicode Glossary, . + +[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name +Proposals", draft-ietf-idn-compare + +[IDNReq] James Seng, "Requirements of Internationalized Domain Names", +draft-ietf-idn-requirement + +[IDNRev] Marc Blanchet, "Handling versions of internationalized domain +names protocols", draft-ietf-idn-version + +[ISO10646] ISO/IEC 10646-1:2000. International Standard -- Information +technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part +1: Architecture and Basic Multilingual Plane. + +[Normalize] Character Normalization in IETF Protocols, +draft-duerst-i18n-norm-03 + +[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate +Requirement Levels", March 1997, RFC 2119. + +[RFC2396] Tim Berners-Lee, et. 
al., "Uniform Resource Identifiers (URI): +Generic Syntax", August 1998, RFC 2396. + +[RFC2732] Robert Hinden, et. al., Format for Literal IPv6 Addresses in +URL's, December 1999, RFC 2732. + +[STD13] Paul Mockapetris, "Domain names - implementation and +specification", November 1987, STD 13 (RFC 1034 and 1035). + +[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version +3.0", ISBN 0-201-61633-5. Described at +. + +[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms. +Unicode Technical Report #15. +. + +[UTR21] Mark Davis. Case Mappings. Unicode Technical Report #21. +. + + +A. Acknowledgements + +Many people from the IETF IDN Working Group and the Unicode Technical +Committee contributed ideas that went into the first draft of this +document. Mark Davis and Patrik Faltstrom were particularly helpful in +some of the ideas, such as the versioning description. + +The IDN nameprep design team made many useful changes to the first +draft. That team and its advisors include: + +Asmus Freytag +Cathy Wissink +Francois Yergeau +James Seng +Marc Blanchet +Mark Davis +Martin Duerst +Patrik Faltstrom +Paul Hoffman + +Additional significant improvements were proposed by: + +Jonathan Rosenne +Kent Karlsson +Scott Hollenbeck + +B. Differences Between -01 and -02 Drafts + +Throughout: changed the format of lines with character names to make +the document easier to review. + +1.1: Added non-normative reference to [ISO10646]. Also added note about +range names. + +3.2: Changed "CaseFold" to "Fold" in last sentence. + +4: Corrected spelling in title. + +5: Changed "character" to "code point" in many places because some of +the things that are prohibited are not characters. Changed the last +sentence in the fifth paragraph. + +6: Changed "character" to "code point" in many places, including the +title of the section. + +A: Added Kent Karlsson and Scott Hollenbeck to the commenters list. 
+ +F: Corrected an error in the table (hyphen was called prohibited +when it obviously is not). Changed title. + +G: Fixed the table to use the proper format for the code points. +Changed title. + + +C. IANA Considerations + +[[[ We probably won't have any. ]]] + + +D. Author Contact Information + +Paul Hoffman +Internet Mail Consortium and VPN Consortium +127 Segre Place +Santa Cruz, CA 95060 USA +paul.hoffman@imc.org and paul.hoffman@vpnc.org + +Marc Blanchet +Viagenie inc. +2875 boul. Laurier, bur. 300 +Ste-Foy, Quebec, Canada, G1V 2M2 +Marc.Blanchet@viagenie.qc.ca + + +E. Mapping Table + +The following is the mapping table from Section 3. The table has three +columns: +- the character that is mapped from +- the zero or more characters that it is mapped to +- the reason for the mapping +The columns are separated by semicolons. Note that the second column may +be empty, or it may have one character, or it may have more than one +character, with each character separated by a space. + +0041; 0061; Case map +0042; 0062; Case map +0043; 0063; Case map +0044; 0064; Case map +0045; 0065; Case map +0046; 0066; Case map +0047; 0067; Case map +0048; 0068; Case map +0049; 0069; Case map +004A; 006A; Case map +004B; 006B; Case map +004C; 006C; Case map +004D; 006D; Case map +004E; 006E; Case map +004F; 006F; Case map +0050; 0070; Case map +0051; 0071; Case map +0052; 0072; Case map +0053; 0073; Case map +0054; 0074; Case map +0055; 0075; Case map +0056; 0076; Case map +0057; 0077; Case map +0058; 0078; Case map +0059; 0079; Case map +005A; 007A; Case map +00AD; ; Map out +00B5; 03BC; Case map +00C0; 00E0; Case map +00C1; 00E1; Case map +00C2; 00E2; Case map +00C3; 00E3; Case map +00C4; 00E4; Case map +00C5; 00E5; Case map +00C6; 00E6; Case map +00C7; 00E7; Case map +00C8; 00E8; Case map +00C9; 00E9; Case map +00CA; 00EA; Case map +00CB; 00EB; Case map +00CC; 00EC; Case map +00CD; 00ED; Case map +00CE; 00EE; Case map +00CF; 00EF; Case map +00D0; 00F0; Case map +00D1; 00F1; 
Case map +00D2; 00F2; Case map +00D3; 00F3; Case map +00D4; 00F4; Case map +00D5; 00F5; Case map +00D6; 00F6; Case map +00D8; 00F8; Case map +00D9; 00F9; Case map +00DA; 00FA; Case map +00DB; 00FB; Case map +00DC; 00FC; Case map +00DD; 00FD; Case map +00DE; 00FE; Case map +00DF; 0073 0073; Case map +0100; 0101; Case map +0102; 0103; Case map +0104; 0105; Case map +0106; 0107; Case map +0108; 0109; Case map +010A; 010B; Case map +010C; 010D; Case map +010E; 010F; Case map +0110; 0111; Case map +0112; 0113; Case map +0114; 0115; Case map +0116; 0117; Case map +0118; 0119; Case map +011A; 011B; Case map +011C; 011D; Case map +011E; 011F; Case map +0120; 0121; Case map +0122; 0123; Case map +0124; 0125; Case map +0126; 0127; Case map +0128; 0129; Case map +012A; 012B; Case map +012C; 012D; Case map +012E; 012F; Case map +0130; 0069; Case map +0131; 0069; Case map +0132; 0133; Case map +0134; 0135; Case map +0136; 0137; Case map +0139; 013A; Case map +013B; 013C; Case map +013D; 013E; Case map +013F; 0140; Case map +0141; 0142; Case map +0143; 0144; Case map +0145; 0146; Case map +0147; 0148; Case map +0149; 02BC 006E; Case map +014A; 014B; Case map +014C; 014D; Case map +014E; 014F; Case map +0150; 0151; Case map +0152; 0153; Case map +0154; 0155; Case map +0156; 0157; Case map +0158; 0159; Case map +015A; 015B; Case map +015C; 015D; Case map +015E; 015F; Case map +0160; 0161; Case map +0162; 0163; Case map +0164; 0165; Case map +0166; 0167; Case map +0168; 0169; Case map +016A; 016B; Case map +016C; 016D; Case map +016E; 016F; Case map +0170; 0171; Case map +0172; 0173; Case map +0174; 0175; Case map +0176; 0177; Case map +0178; 00FF; Case map +0179; 017A; Case map +017B; 017C; Case map +017D; 017E; Case map +017F; 0073; Case map +0181; 0253; Case map +0182; 0183; Case map +0184; 0185; Case map +0186; 0254; Case map +0187; 0188; Case map +0189; 0256; Case map +018A; 0257; Case map +018B; 018C; Case map +018E; 01DD; Case map +018F; 0259; Case map +0190; 025B; Case map 
+0191; 0192; Case map +0193; 0260; Case map +0194; 0263; Case map +0196; 0269; Case map +0197; 0268; Case map +0198; 0199; Case map +019C; 026F; Case map +019D; 0272; Case map +019F; 0275; Case map +01A0; 01A1; Case map +01A2; 01A3; Case map +01A4; 01A5; Case map +01A6; 0280; Case map +01A7; 01A8; Case map +01A9; 0283; Case map +01AC; 01AD; Case map +01AE; 0288; Case map +01AF; 01B0; Case map +01B1; 028A; Case map +01B2; 028B; Case map +01B3; 01B4; Case map +01B5; 01B6; Case map +01B7; 0292; Case map +01B8; 01B9; Case map +01BC; 01BD; Case map +01C4; 01C6; Case map +01C5; 01C6; Case map +01C7; 01C9; Case map +01C8; 01C9; Case map +01CA; 01CC; Case map +01CB; 01CC; Case map +01CD; 01CE; Case map +01CF; 01D0; Case map +01D1; 01D2; Case map +01D3; 01D4; Case map +01D5; 01D6; Case map +01D7; 01D8; Case map +01D9; 01DA; Case map +01DB; 01DC; Case map +01DE; 01DF; Case map +01E0; 01E1; Case map +01E2; 01E3; Case map +01E4; 01E5; Case map +01E6; 01E7; Case map +01E8; 01E9; Case map +01EA; 01EB; Case map +01EC; 01ED; Case map +01EE; 01EF; Case map +01F0; 006A 030C; Case map +01F1; 01F3; Case map +01F2; 01F3; Case map +01F4; 01F5; Case map +01F6; 0195; Case map +01F7; 01BF; Case map +01F8; 01F9; Case map +01FA; 01FB; Case map +01FC; 01FD; Case map +01FE; 01FF; Case map +0200; 0201; Case map +0202; 0203; Case map +0204; 0205; Case map +0206; 0207; Case map +0208; 0209; Case map +020A; 020B; Case map +020C; 020D; Case map +020E; 020F; Case map +0210; 0211; Case map +0212; 0213; Case map +0214; 0215; Case map +0216; 0217; Case map +0218; 0219; Case map +021A; 021B; Case map +021C; 021D; Case map +021E; 021F; Case map +0222; 0223; Case map +0224; 0225; Case map +0226; 0227; Case map +0228; 0229; Case map +022A; 022B; Case map +022C; 022D; Case map +022E; 022F; Case map +0230; 0231; Case map +0232; 0233; Case map +0345; 03B9; Case map +037A; 0020 03B9; Additional folding +0386; 03AC; Case map +0388; 03AD; Case map +0389; 03AE; Case map +038A; 03AF; Case map +038C; 03CC; Case map 
+038E; 03CD; Case map +038F; 03CE; Case map +0390; 03B9 0308 0301; Case map +0391; 03B1; Case map +0392; 03B2; Case map +0393; 03B3; Case map +0394; 03B4; Case map +0395; 03B5; Case map +0396; 03B6; Case map +0397; 03B7; Case map +0398; 03B8; Case map +0399; 03B9; Case map +039A; 03BA; Case map +039B; 03BB; Case map +039C; 03BC; Case map +039D; 03BD; Case map +039E; 03BE; Case map +039F; 03BF; Case map +03A0; 03C0; Case map +03A1; 03C1; Case map +03A3; 03C2; Case map +03A4; 03C4; Case map +03A5; 03C5; Case map +03A6; 03C6; Case map +03A7; 03C7; Case map +03A8; 03C8; Case map +03A9; 03C9; Case map +03AA; 03CA; Case map +03AB; 03CB; Case map +03B0; 03C5 0308 0301; Case map +03C2; 03C2; Case map +03C3; 03C2; Case map +03D0; 03B2; Case map +03D1; 03B8; Case map +03D2; 03C5; Additional folding +03D3; 03CD; Additional folding +03D4; 03CB; Additional folding +03D5; 03C6; Case map +03D6; 03C0; Case map +03DA; 03DB; Case map +03DC; 03DD; Case map +03DE; 03DF; Case map +03E0; 03E1; Case map +03E2; 03E3; Case map +03E4; 03E5; Case map +03E6; 03E7; Case map +03E8; 03E9; Case map +03EA; 03EB; Case map +03EC; 03ED; Case map +03EE; 03EF; Case map +03F0; 03BA; Case map +03F1; 03C1; Case map +03F2; 03C2; Case map +0400; 0450; Case map +0401; 0451; Case map +0402; 0452; Case map +0403; 0453; Case map +0404; 0454; Case map +0405; 0455; Case map +0406; 0456; Case map +0407; 0457; Case map +0408; 0458; Case map +0409; 0459; Case map +040A; 045A; Case map +040B; 045B; Case map +040C; 045C; Case map +040D; 045D; Case map +040E; 045E; Case map +040F; 045F; Case map +0410; 0430; Case map +0411; 0431; Case map +0412; 0432; Case map +0413; 0433; Case map +0414; 0434; Case map +0415; 0435; Case map +0416; 0436; Case map +0417; 0437; Case map +0418; 0438; Case map +0419; 0439; Case map +041A; 043A; Case map +041B; 043B; Case map +041C; 043C; Case map +041D; 043D; Case map +041E; 043E; Case map +041F; 043F; Case map +0420; 0440; Case map +0421; 0441; Case map +0422; 0442; Case map +0423; 0443; 
Case map +0424; 0444; Case map +0425; 0445; Case map +0426; 0446; Case map +0427; 0447; Case map +0428; 0448; Case map +0429; 0449; Case map +042A; 044A; Case map +042B; 044B; Case map +042C; 044C; Case map +042D; 044D; Case map +042E; 044E; Case map +042F; 044F; Case map +0460; 0461; Case map +0462; 0463; Case map +0464; 0465; Case map +0466; 0467; Case map +0468; 0469; Case map +046A; 046B; Case map +046C; 046D; Case map +046E; 046F; Case map +0470; 0471; Case map +0472; 0473; Case map +0474; 0475; Case map +0476; 0477; Case map +0478; 0479; Case map +047A; 047B; Case map +047C; 047D; Case map +047E; 047F; Case map +0480; 0481; Case map +048C; 048D; Case map +048E; 048F; Case map +0490; 0491; Case map +0492; 0493; Case map +0494; 0495; Case map +0496; 0497; Case map +0498; 0499; Case map +049A; 049B; Case map +049C; 049D; Case map +049E; 049F; Case map +04A0; 04A1; Case map +04A2; 04A3; Case map +04A4; 04A5; Case map +04A6; 04A7; Case map +04A8; 04A9; Case map +04AA; 04AB; Case map +04AC; 04AD; Case map +04AE; 04AF; Case map +04B0; 04B1; Case map +04B2; 04B3; Case map +04B4; 04B5; Case map +04B6; 04B7; Case map +04B8; 04B9; Case map +04BA; 04BB; Case map +04BC; 04BD; Case map +04BE; 04BF; Case map +04C1; 04C2; Case map +04C3; 04C4; Case map +04C7; 04C8; Case map +04CB; 04CC; Case map +04D0; 04D1; Case map +04D2; 04D3; Case map +04D4; 04D5; Case map +04D6; 04D7; Case map +04D8; 04D9; Case map +04DA; 04DB; Case map +04DC; 04DD; Case map +04DE; 04DF; Case map +04E0; 04E1; Case map +04E2; 04E3; Case map +04E4; 04E5; Case map +04E6; 04E7; Case map +04E8; 04E9; Case map +04EA; 04EB; Case map +04EC; 04ED; Case map +04EE; 04EF; Case map +04F0; 04F1; Case map +04F2; 04F3; Case map +04F4; 04F5; Case map +04F8; 04F9; Case map +0531; 0561; Case map +0532; 0562; Case map +0533; 0563; Case map +0534; 0564; Case map +0535; 0565; Case map +0536; 0566; Case map +0537; 0567; Case map +0538; 0568; Case map +0539; 0569; Case map +053A; 056A; Case map +053B; 056B; Case map +053C; 
056C; Case map +053D; 056D; Case map +053E; 056E; Case map +053F; 056F; Case map +0540; 0570; Case map +0541; 0571; Case map +0542; 0572; Case map +0543; 0573; Case map +0544; 0574; Case map +0545; 0575; Case map +0546; 0576; Case map +0547; 0577; Case map +0548; 0578; Case map +0549; 0579; Case map +054A; 057A; Case map +054B; 057B; Case map +054C; 057C; Case map +054D; 057D; Case map +054E; 057E; Case map +054F; 057F; Case map +0550; 0580; Case map +0551; 0581; Case map +0552; 0582; Case map +0553; 0583; Case map +0554; 0584; Case map +0555; 0585; Case map +0556; 0586; Case map +0587; 0565 0582; Case map +1806; ; Map out +180B; ; Map out +180C; ; Map out +180D; ; Map out +1E00; 1E01; Case map +1E02; 1E03; Case map +1E04; 1E05; Case map +1E06; 1E07; Case map +1E08; 1E09; Case map +1E0A; 1E0B; Case map +1E0C; 1E0D; Case map +1E0E; 1E0F; Case map +1E10; 1E11; Case map +1E12; 1E13; Case map +1E14; 1E15; Case map +1E16; 1E17; Case map +1E18; 1E19; Case map +1E1A; 1E1B; Case map +1E1C; 1E1D; Case map +1E1E; 1E1F; Case map +1E20; 1E21; Case map +1E22; 1E23; Case map +1E24; 1E25; Case map +1E26; 1E27; Case map +1E28; 1E29; Case map +1E2A; 1E2B; Case map +1E2C; 1E2D; Case map +1E2E; 1E2F; Case map +1E30; 1E31; Case map +1E32; 1E33; Case map +1E34; 1E35; Case map +1E36; 1E37; Case map +1E38; 1E39; Case map +1E3A; 1E3B; Case map +1E3C; 1E3D; Case map +1E3E; 1E3F; Case map +1E40; 1E41; Case map +1E42; 1E43; Case map +1E44; 1E45; Case map +1E46; 1E47; Case map +1E48; 1E49; Case map +1E4A; 1E4B; Case map +1E4C; 1E4D; Case map +1E4E; 1E4F; Case map +1E50; 1E51; Case map +1E52; 1E53; Case map +1E54; 1E55; Case map +1E56; 1E57; Case map +1E58; 1E59; Case map +1E5A; 1E5B; Case map +1E5C; 1E5D; Case map +1E5E; 1E5F; Case map +1E60; 1E61; Case map +1E62; 1E63; Case map +1E64; 1E65; Case map +1E66; 1E67; Case map +1E68; 1E69; Case map +1E6A; 1E6B; Case map +1E6C; 1E6D; Case map +1E6E; 1E6F; Case map +1E70; 1E71; Case map +1E72; 1E73; Case map +1E74; 1E75; Case map +1E76; 1E77; Case 
map +1E78; 1E79; Case map +1E7A; 1E7B; Case map +1E7C; 1E7D; Case map +1E7E; 1E7F; Case map +1E80; 1E81; Case map +1E82; 1E83; Case map +1E84; 1E85; Case map +1E86; 1E87; Case map +1E88; 1E89; Case map +1E8A; 1E8B; Case map +1E8C; 1E8D; Case map +1E8E; 1E8F; Case map +1E90; 1E91; Case map +1E92; 1E93; Case map +1E94; 1E95; Case map +1E96; 0068 0331; Case map +1E97; 0074 0308; Case map +1E98; 0077 030A; Case map +1E99; 0079 030A; Case map +1E9A; 0061 02BE; Case map +1E9B; 1E61; Case map +1EA0; 1EA1; Case map +1EA2; 1EA3; Case map +1EA4; 1EA5; Case map +1EA6; 1EA7; Case map +1EA8; 1EA9; Case map +1EAA; 1EAB; Case map +1EAC; 1EAD; Case map +1EAE; 1EAF; Case map +1EB0; 1EB1; Case map +1EB2; 1EB3; Case map +1EB4; 1EB5; Case map +1EB6; 1EB7; Case map +1EB8; 1EB9; Case map +1EBA; 1EBB; Case map +1EBC; 1EBD; Case map +1EBE; 1EBF; Case map +1EC0; 1EC1; Case map +1EC2; 1EC3; Case map +1EC4; 1EC5; Case map +1EC6; 1EC7; Case map +1EC8; 1EC9; Case map +1ECA; 1ECB; Case map +1ECC; 1ECD; Case map +1ECE; 1ECF; Case map +1ED0; 1ED1; Case map +1ED2; 1ED3; Case map +1ED4; 1ED5; Case map +1ED6; 1ED7; Case map +1ED8; 1ED9; Case map +1EDA; 1EDB; Case map +1EDC; 1EDD; Case map +1EDE; 1EDF; Case map +1EE0; 1EE1; Case map +1EE2; 1EE3; Case map +1EE4; 1EE5; Case map +1EE6; 1EE7; Case map +1EE8; 1EE9; Case map +1EEA; 1EEB; Case map +1EEC; 1EED; Case map +1EEE; 1EEF; Case map +1EF0; 1EF1; Case map +1EF2; 1EF3; Case map +1EF4; 1EF5; Case map +1EF6; 1EF7; Case map +1EF8; 1EF9; Case map +1F08; 1F00; Case map +1F09; 1F01; Case map +1F0A; 1F02; Case map +1F0B; 1F03; Case map +1F0C; 1F04; Case map +1F0D; 1F05; Case map +1F0E; 1F06; Case map +1F0F; 1F07; Case map +1F18; 1F10; Case map +1F19; 1F11; Case map +1F1A; 1F12; Case map +1F1B; 1F13; Case map +1F1C; 1F14; Case map +1F1D; 1F15; Case map +1F28; 1F20; Case map +1F29; 1F21; Case map +1F2A; 1F22; Case map +1F2B; 1F23; Case map +1F2C; 1F24; Case map +1F2D; 1F25; Case map +1F2E; 1F26; Case map +1F2F; 1F27; Case map +1F38; 1F30; Case map +1F39; 1F31; 
Case map +1F3A; 1F32; Case map +1F3B; 1F33; Case map +1F3C; 1F34; Case map +1F3D; 1F35; Case map +1F3E; 1F36; Case map +1F3F; 1F37; Case map +1F48; 1F40; Case map +1F49; 1F41; Case map +1F4A; 1F42; Case map +1F4B; 1F43; Case map +1F4C; 1F44; Case map +1F4D; 1F45; Case map +1F50; 03C5 0313; Case map +1F52; 03C5 0313 0300; Case map +1F54; 03C5 0313 0301; Case map +1F56; 03C5 0313 0342; Case map +1F59; 1F51; Case map +1F5B; 1F53; Case map +1F5D; 1F55; Case map +1F5F; 1F57; Case map +1F68; 1F60; Case map +1F69; 1F61; Case map +1F6A; 1F62; Case map +1F6B; 1F63; Case map +1F6C; 1F64; Case map +1F6D; 1F65; Case map +1F6E; 1F66; Case map +1F6F; 1F67; Case map +1F80; 1F00 03B9; Case map +1F81; 1F01 03B9; Case map +1F82; 1F02 03B9; Case map +1F83; 1F03 03B9; Case map +1F84; 1F04 03B9; Case map +1F85; 1F05 03B9; Case map +1F86; 1F06 03B9; Case map +1F87; 1F07 03B9; Case map +1F88; 1F00 03B9; Case map +1F89; 1F01 03B9; Case map +1F8A; 1F02 03B9; Case map +1F8B; 1F03 03B9; Case map +1F8C; 1F04 03B9; Case map +1F8D; 1F05 03B9; Case map +1F8E; 1F06 03B9; Case map +1F8F; 1F07 03B9; Case map +1F90; 1F20 03B9; Case map +1F91; 1F21 03B9; Case map +1F92; 1F22 03B9; Case map +1F93; 1F23 03B9; Case map +1F94; 1F24 03B9; Case map +1F95; 1F25 03B9; Case map +1F96; 1F26 03B9; Case map +1F97; 1F27 03B9; Case map +1F98; 1F20 03B9; Case map +1F99; 1F21 03B9; Case map +1F9A; 1F22 03B9; Case map +1F9B; 1F23 03B9; Case map +1F9C; 1F24 03B9; Case map +1F9D; 1F25 03B9; Case map +1F9E; 1F26 03B9; Case map +1F9F; 1F27 03B9; Case map +1FA0; 1F60 03B9; Case map +1FA1; 1F61 03B9; Case map +1FA2; 1F62 03B9; Case map +1FA3; 1F63 03B9; Case map +1FA4; 1F64 03B9; Case map +1FA5; 1F65 03B9; Case map +1FA6; 1F66 03B9; Case map +1FA7; 1F67 03B9; Case map +1FA8; 1F60 03B9; Case map +1FA9; 1F61 03B9; Case map +1FAA; 1F62 03B9; Case map +1FAB; 1F63 03B9; Case map +1FAC; 1F64 03B9; Case map +1FAD; 1F65 03B9; Case map +1FAE; 1F66 03B9; Case map +1FAF; 1F67 03B9; Case map +1FB2; 1F70 03B9; Case map +1FB3; 03B1 
03B9; Case map +1FB4; 03AC 03B9; Case map +1FB6; 03B1 0342; Case map +1FB7; 03B1 0342 03B9; Case map +1FB8; 1FB0; Case map +1FB9; 1FB1; Case map +1FBA; 1F70; Case map +1FBB; 1F71; Case map +1FBC; 03B1 03B9; Case map +1FBE; 03B9; Case map +1FC2; 1F74 03B9; Case map +1FC3; 03B7 03B9; Case map +1FC4; 03AE 03B9; Case map +1FC6; 03B7 0342; Case map +1FC7; 03B7 0342 03B9; Case map +1FC8; 1F72; Case map +1FC9; 1F73; Case map +1FCA; 1F74; Case map +1FCB; 1F75; Case map +1FCC; 03B7 03B9; Case map +1FD2; 03B9 0308 0300; Case map +1FD3; 03B9 0308 0301; Case map +1FD6; 03B9 0342; Case map +1FD7; 03B9 0308 0342; Case map +1FD8; 1FD0; Case map +1FD9; 1FD1; Case map +1FDA; 1F76; Case map +1FDB; 1F77; Case map +1FE2; 03C5 0308 0300; Case map +1FE3; 03C5 0308 0301; Case map +1FE4; 03C1 0313; Case map +1FE6; 03C5 0342; Case map +1FE7; 03C5 0308 0342; Case map +1FE8; 1FE0; Case map +1FE9; 1FE1; Case map +1FEA; 1F7A; Case map +1FEB; 1F7B; Case map +1FEC; 1FE5; Case map +1FF2; 1F7C 03B9; Case map +1FF3; 03C9 03B9; Case map +1FF4; 03CE 03B9; Case map +1FF6; 03C9 0342; Case map +1FF7; 03C9 0342 03B9; Case map +1FF8; 1F78; Case map +1FF9; 1F79; Case map +1FFA; 1F7C; Case map +1FFB; 1F7D; Case map +1FFC; 03C9 03B9; Case map +200B; ; Map out +200C; ; Map out +200D; ; Map out +20A8; 0072 0073; Additional folding +2102; 0063; Additional folding +2103; 00B0 0063; Additional folding +2107; 025B; Additional folding +2109; 00B0 0066; Additional folding +210B; 0068; Additional folding +210C; 0068; Additional folding +210D; 0068; Additional folding +2110; 0069; Additional folding +2111; 0069; Additional folding +2112; 006C; Additional folding +2115; 006E; Additional folding +2116; 006E 006F; Additional folding +2119; 0070; Additional folding +211A; 0071; Additional folding +211B; 0072; Additional folding +211C; 0072; Additional folding +211D; 0072; Additional folding +2120; 0073 006D; Additional folding +2121; 0074 0065 006C; Additional folding +2122; 0074 006D; Additional folding +2124; 007A; 
Additional folding +2126; 03C9; Case map +2128; 007A; Additional folding +212A; 006B; Case map +212B; 00E5; Case map +212C; 0062; Additional folding +212D; 0063; Additional folding +2130; 0065; Additional folding +2131; 0066; Additional folding +2133; 006D; Additional folding +2160; 2170; Case map +2161; 2171; Case map +2162; 2172; Case map +2163; 2173; Case map +2164; 2174; Case map +2165; 2175; Case map +2166; 2176; Case map +2167; 2177; Case map +2168; 2178; Case map +2169; 2179; Case map +216A; 217A; Case map +216B; 217B; Case map +216C; 217C; Case map +216D; 217D; Case map +216E; 217E; Case map +216F; 217F; Case map +24B6; 24D0; Case map +24B7; 24D1; Case map +24B8; 24D2; Case map +24B9; 24D3; Case map +24BA; 24D4; Case map +24BB; 24D5; Case map +24BC; 24D6; Case map +24BD; 24D7; Case map +24BE; 24D8; Case map +24BF; 24D9; Case map +24C0; 24DA; Case map +24C1; 24DB; Case map +24C2; 24DC; Case map +24C3; 24DD; Case map +24C4; 24DE; Case map +24C5; 24DF; Case map +24C6; 24E0; Case map +24C7; 24E1; Case map +24C8; 24E2; Case map +24C9; 24E3; Case map +24CA; 24E4; Case map +24CB; 24E5; Case map +24CC; 24E6; Case map +24CD; 24E7; Case map +24CE; 24E8; Case map +24CF; 24E9; Case map +3371; 0068 0070 0061; Additional folding +3373; 0061 0075; Additional folding +3375; 006F 0076; Additional folding +3380; 0070 0061; Additional folding +3381; 006E 0061; Additional folding +3382; 03BC 0061; Additional folding +3383; 006D 0061; Additional folding +3384; 006B 0061; Additional folding +3385; 006B 0062; Additional folding +3386; 006D 0062; Additional folding +3387; 0067 0062; Additional folding +338A; 0070 0066; Additional folding +338B; 006E 0066; Additional folding +338C; 03BC 0066; Additional folding +3390; 0068 007A; Additional folding +3391; 006B 0068 007A; Additional folding +3392; 006D 0068 007A; Additional folding +3393; 0067 0068 007A; Additional folding +3394; 0074 0068 007A; Additional folding +33A9; 0070 0061; Additional folding +33AA; 006B 0070 0061; Additional 
folding +33AB; 006D 0070 0061; Additional folding +33AC; 0067 0070 0061; Additional folding +33B4; 0070 0076; Additional folding +33B5; 006E 0076; Additional folding +33B6; 03BC 0076; Additional folding +33B7; 006D 0076; Additional folding +33B8; 006B 0076; Additional folding +33B9; 006D 0076; Additional folding +33BA; 0070 0077; Additional folding +33BB; 006E 0077; Additional folding +33BC; 03BC 0077; Additional folding +33BD; 006D 0077; Additional folding +33BE; 006B 0077; Additional folding +33BF; 006D 0077; Additional folding +33C0; 006B 03C9; Additional folding +33C1; 006D 03C9; Additional folding +33C3; 0062 0071; Additional folding +33C6; 0063 2215 006B 0067; Additional folding +33C7; 0063 006F 002E; Additional folding +33C8; 0064 0062; Additional folding +33C9; 0067 0079; Additional folding +33CB; 0068 0070; Additional folding +33CD; 006B 006B; Additional folding +33CE; 006B 006D; Additional folding +33D7; 0070 0068; Additional folding +33D9; 0070 0070 006D; Additional folding +33DA; 0070 0072; Additional folding +33DC; 0073 0076; Additional folding +33DD; 0077 0062; Additional folding +FB00; 0066 0066; Case map +FB01; 0066 0069; Case map +FB02; 0066 006C; Case map +FB03; 0066 0066 0069; Case map +FB04; 0066 0066 006C; Case map +FB05; 0073 0074; Case map +FB06; 0073 0074; Case map +FB13; 0574 0576; Case map +FB14; 0574 0565; Case map +FB15; 0574 056B; Case map +FB16; 057E 0576; Case map +FB17; 0574 056D; Case map +FEFF; ; Map out +FF21; FF41; Case map +FF22; FF42; Case map +FF23; FF43; Case map +FF24; FF44; Case map +FF25; FF45; Case map +FF26; FF46; Case map +FF27; FF47; Case map +FF28; FF48; Case map +FF29; FF49; Case map +FF2A; FF4A; Case map +FF2B; FF4B; Case map +FF2C; FF4C; Case map +FF2D; FF4D; Case map +FF2E; FF4E; Case map +FF2F; FF4F; Case map +FF30; FF50; Case map +FF31; FF51; Case map +FF32; FF52; Case map +FF33; FF53; Case map +FF34; FF54; Case map +FF35; FF55; Case map +FF36; FF56; Case map +FF37; FF57; Case map +FF38; FF58; Case map +FF39; 
FF59; Case map +FF3A; FF5A; Case map + + +F. Prohibited Code Point List + +0000-002C +002E-002F +003A-0040 +005B-0060 +007B-007F +0080-009F +00A0 +1680 +2000 +2001 +2002 +2003 +2004 +2005 +2006 +2007 +2008 +2009 +200A +200B +200E +200F +2028 +2029 +202A +202B +202C +202D +202E +202F +206A +206B +206C +206D +206E +206F +2FF0-2FFF +3000 +D800-DFFF +E000-F8FF +FFF9 +FFFA +FFFB +FFFC +FFFD +FFFE-FFFF +1FFFE-1FFFF +2FFFE-2FFFF +3FFFE-3FFFF +4FFFE-4FFFF +5FFFE-5FFFF +6FFFE-6FFFF +7FFFE-7FFFF +8FFFE-8FFFF +9FFFE-9FFFF +AFFFE-AFFFF +BFFFE-BFFFF +CFFFE-CFFFF +DFFFE-DFFFF +EFFFE-EFFFF +F0000-FFFFD +FFFFE-FFFFF +100000-10FFFD +10FFFE-10FFFF + +NOTE WELL: Software that follows this specification and that will be used to +check names before they are put in authoritative name servers MUST add +all unassigned code points to the list of characters that are prohibited. +See Section 6 for more details. + + +G. Unassigned Code Point List + +0220-0221 +0234-024F +02AE-02AF +02EF-02FF +034F-035F +0363-0373 +0376-0379 +037B-037D +037F-0383 +038B +038D +03A2 +03CF +03D8-03D9 +03F4-03FF +0487 +048A-048B +04C5-04C6 +04C9-04CA +04CD-04CF +04F6-04F7 +04FA-0530 +0557-0558 +0560 +0588 +058B-0590 +05A2 +05BA +05C5-05CF +05EB-05EF +05F5-060B +060D-061A +061C-061E +0620 +063B-063F +0656-065F +066E-066F +06EE-06EF +06FF +070E +072D-072F +074B-077F +07B1-0900 +0904 +093A-093B +094E-094F +0955-0957 +0971-0980 +0984 +098D-098E +0991-0992 +09A9 +09B1 +09B3-09B5 +09BA-09BB +09BD +09C5-09C6 +09C9-09CA +09CE-09D6 +09D8-09DB +09DE +09E4-09E5 +09FB-0A01 +0A03-0A04 +0A0B-0A0E +0A11-0A12 +0A29 +0A31 +0A34 +0A37 +0A3A-0A3B +0A3D +0A43-0A46 +0A49-0A4A +0A4E-0A58 +0A5D +0A5F-0A65 +0A75-0A80 +0A84 +0A8C +0A8E +0A92 +0AA9 +0AB1 +0AB4 +0ABA-0ABB +0AC6 +0ACA +0ACE-0ACF +0AD1-0ADF +0AE1-0AE5 +0AF0-0B00 +0B04 +0B0D-0B0E +0B11-0B12 +0B29 +0B31 +0B34-0B35 +0B3A-0B3B +0B44-0B46 +0B49-0B4A +0B4E-0B55 +0B58-0B5B +0B5E +0B62-0B65 +0B71-0B81 +0B84 +0B8B-0B8D +0B91 +0B96-0B98 +0B9B +0B9D +0BA0-0BA2 +0BA5-0BA7 +0BAB-0BAD +0BB6
+0BBA-0BBD +0BC3-0BC5 +0BC9 +0BCE-0BD6 +0BD8-0BE6 +0BF3-0C00 +0C04 +0C0D +0C11 +0C29 +0C34 +0C3A-0C3D +0C45 +0C49 +0C4E-0C54 +0C57-0C5F +0C62-0C65 +0C70-0C81 +0C84 +0C8D +0C91 +0CA9 +0CB4 +0CBA-0CBD +0CC5 +0CC9 +0CCE-0CD4 +0CD7-0CDD +0CDF +0CE2-0CE5 +0CF0-0D01 +0D04 +0D0D +0D11 +0D29 +0D3A-0D3D +0D44-0D45 +0D49 +0D4E-0D56 +0D58-0D5F +0D62-0D65 +0D70-0D81 +0D84 +0D97-0D99 +0DB2 +0DBC +0DBE-0DBF +0DC7-0DC9 +0DCB-0DCE +0DD5 +0DD7 +0DE0-0DF1 +0DF5-0E00 +0E3B-0E3E +0E5C-0E80 +0E83 +0E85-0E86 +0E89 +0E8B-0E8C +0E8E-0E93 +0E98 +0EA0 +0EA4 +0EA6 +0EA8-0EA9 +0EAC +0EBA +0EBE-0EBF +0EC5 +0EC7 +0ECE-0ECF +0EDA-0EDB +0EDE-0EFF +0F48 +0F6B-0F70 +0F8C-0F8F +0F98 +0FBD +0FCD-0FCE +0FD0-0FFF +1022 +1028 +102B +1033-1035 +103A-103F +105A-109F +10C6-10CF +10F7-10FA +10FC-10FF +115A-115E +11A3-11A7 +11FA-11FF +1207 +1247 +1249 +124E-124F +1257 +1259 +125E-125F +1287 +1289 +128E-128F +12AF +12B1 +12B6-12B7 +12BF +12C1 +12C6-12C7 +12CF +12D7 +12EF +130F +1311 +1316-1317 +131F +1347 +135B-1360 +137D-139F +13F5-1400 +1677-167F +169D-169F +16F1-177F +17DD-17DF +17EA-17FF +180F +181A-181F +1878-187F +18AA-1DFF +1E9C-1E9F +1EFA-1EFF +1F16-1F17 +1F1E-1F1F +1F46-1F47 +1F4E-1F4F +1F58 +1F5A +1F5C +1F5E +1F7E-1F7F +1FB5 +1FC5 +1FD4-1FD5 +1FDC +1FF0-1FF1 +1FF5 +1FFF +2047 +204E-2069 +2071-2073 +208F-209F +20B0-20CF +20E4-20FF +213B-2152 +2184-218F +21F4-21FF +22F2-22FF +237C +239B-23FF +2427-243F +244B-245F +24EB-24FF +2596-259F +25F8-25FF +2614-2618 +2672-2700 +2705 +270A-270B +2728 +274C +274E +2753-2755 +2757 +275F-2760 +2768-2775 +2795-2797 +27B0 +27BF-27FF +2900-2E7F +2E9A +2EF4-2EFF +2FD6-2FEF +2FFC-2FFF +303B-303D +3040 +3095-3098 +309F-30A0 +30FF-3104 +312D-3130 +318F +31B8-31FF +321D-321F +3244-325F +327C-327E +32B1-32BF +32CC-32CF +32FF +3377-337A +33DE-33DF +33FF +4DB6-4DFF +9FA6-9FFF +A48D-A48F +A4A2-A4A3 +A4B4 +A4C1 +A4C5 +A4C7-ABFF +D7A4-D7FF +FA2E-FAFF +FB07-FB12 +FB18-FB1C +FB37 +FB3D +FB3F +FB42 +FB45 +FBB2-FBD2 +FD40-FD4F +FD90-FD91 +FDC8-FDCF +FDFC-FE1F +FE24-FE2F +FE45-FE48 
+FE53 +FE67 +FE6C-FE6F +FE73 +FE75 +FEFD-FEFE +FF00 +FF5F-FF60 +FFBF-FFC1 +FFC8-FFC9 +FFD0-FFD1 +FFD8-FFD9 +FFDD-FFDF +FFE7 +FFEF-FFF8 +10000-1FFFD +20000-2FFFD +30000-3FFFD +40000-4FFFD +50000-5FFFD +60000-6FFFD +70000-7FFFD +80000-8FFFD +90000-9FFFD +A0000-AFFFD +B0000-BFFFD +C0000-CFFFD +D0000-DFFFD +E0000-EFFFD \ No newline at end of file diff --git a/doc/draft/draft-ietf-idn-uri-00.txt b/doc/draft/draft-ietf-idn-uri-00.txt new file mode 100644 index 0000000000..ab22a632c8 --- /dev/null +++ b/doc/draft/draft-ietf-idn-uri-00.txt @@ -0,0 +1,269 @@ +INTERNET-DRAFT Martin Duerst +draft-ietf-idn-uri-00 W3C/Keio University +Expires July 2001 January 6, 2001 + + + Internationalized Domain Names in URIs and IRIs + +Status of this Memo + +This document is an Internet-Draft and is in full conformance with all +provisions of Section 10 of RFC2026. + +Internet-Drafts are working documents of the Internet Engineering Task +Force (IETF), its areas, and its working groups. Note that other +groups may also distribute working documents as Internet-Drafts. + +Internet-Drafts are draft documents valid for a maximum of six months +and may be updated, replaced, or obsoleted by other documents at any +time. It is inappropriate to use Internet-Drafts as reference +material or to cite them other than as "work in progress." + +The list of current Internet-Drafts can be accessed at +http://www.ietf.org/ietf/1id-abstracts.txt. + +The list of Internet-Draft Shadow Directories can be accessed at +http://www.ietf.org/shadow.html. + + +Abstract + +This document is a first draft for the provisions necessary to +upgrade the definitions of URIs [RFC 2396] and IRIs (Internationalized +Resource Identifiers, [IRI]) to work with internationalized domain +names. + + +1. Introduction + +Internet domain names serve to identify hosts and services on the +Internet in a convenient way.
The IETF IDN working group is currently +working on extending the character repertoire usable in domain names +beyond a subset of US-ASCII. + +One of the most important places where domain names appear is in +Uniform Resource Identifiers (URIs, [RFC 2396], as modified by +[RFC2732]). However, in the current definition of the generic URI +syntax, the restrictions on domain names are 'hard-coded'. This +document proposes to relax these restrictions by updating the syntax, +and defines how internationalized domain names are encoded in URIs. + +URIs themselves are restricted to a subset of US-ASCII. However, +there is a proposal for relieving these restrictions by creating +a new protocol element called an IRI (Internationalized Resource +Identifier [IRI]). While IRIs in general allow the use of non-ASCII +characters, the syntax of IRIs has the same restriction for domain +names as the syntax of URIs. This document proposes to relax these +restrictions, too, in a way that is compatible with the new syntax +for URIs. This means that encoding an internationalized domain name in +an URI and encoding the same name in an IRI will produce an URI and an +IRI that can be converted into each other using the procedures defined +in [IRI] for these conversions. + +2. URI syntax changes + +The syntax of URIs [RFC 2396] currently contains the following rules +relevant to domain names: + + hostname = *( domainlabel "." ) toplabel [ "." ] + domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum + toplabel = alpha | alpha *( alphanum | "-" ) alphanum + +The latter two rules are changed as follows: + + domainlabel = escalphanum | escalphanum *( escalphanum | "-" ) + escalphanum + toplabel = escalpha | escalpha *( escalphanum | "-" ) + escalphanum + +and the following rules are added: + + escalphanum = escaped8 | alphanum + escalpha = escaped8 | alpha + escaped8 = "%" hexdig8 HEXDIG + hexdig8 = <> + +The %HH escaping is used to encode characters outside the repertoire +of US-ASCII.
This is done by first encoding the characters in UTF-8 +[RFC 2279], resulting in a sequence of octets, and then escaping these +octets. + +Using UTF-8 ensures that this encoding interoperates with IRIs (see +Section 3). It is also aligned with the recommendations in [RFC 2277] +and [RFC 2718], and is consistent with the URN syntax [RFC2141] as +well as recent URL scheme definitions that define encodings of +non-ASCII characters based on UTF-8 (e.g., IMAP URLs [RFC 2192] and POP URLs +[RFC 2384]). + +Please note that the use of UTF-8 for encoding internationalized +domain names in URIs is independent of the encoding chosen +for these names in the DNS protocol. If something other than UTF-8 +is chosen for the latter, a future version of this document may give +instructions for the conversion if deemed necessary. + +The above syntax rules do not extend the possible domain names based +on US-ASCII characters. This may have to be changed if the IDN +WG decides to allow such extensions. + +The above rules also do not allow escaping of US-ASCII characters, +although this is allowed in the other parts of an URI (except for the +special provisions in case of reserved characters). Allowing such +escaping would either make the syntax rules considerably more complicated, +allow the restrictions on US-ASCII characters to be +circumvented by using escaping, or lead to much simpler syntax +rules that no longer express these restrictions. Even if +escaping of US-ASCII characters is allowed in order to simplify +processing, it should be noted that it is always better not to escape +US-ASCII characters in domain names because of the possibility that +a resolver cannot unescape them. Left unescaped, purely US-ASCII domain +names would at least always be resolved by such a resolver.
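
As a non-normative illustration of the %HH escaping and the syntax rules
described in this section, the following Python sketch escapes the
non-ASCII characters of a host name via UTF-8 and checks the result
against the changed hostname rules. It rests on several assumptions that
are not part of this document: hexdig8 is taken to mean a hexadecimal
digit with a value of 8 or higher (every octet of a UTF-8 encoded
non-ASCII character is 0x80 or above), the function names
escape_hostname and is_valid_hostname are hypothetical, and name
preparation is not performed.

```python
import re

# Assumption: hexdig8 is a hex digit with value 8 or higher, so that
# escaped8 can only express octets 0x80-0xFF (no escaped US-ASCII).
ESCAPED8 = r"%[89A-Fa-f][0-9A-Fa-f]"
ALPHA = r"[A-Za-z]"
ALPHANUM = r"[A-Za-z0-9]"
ESCALPHANUM = rf"(?:{ESCAPED8}|{ALPHANUM})"
ESCALPHA = rf"(?:{ESCAPED8}|{ALPHA})"

# domainlabel and toplabel as changed in this section.
DOMAINLABEL = rf"{ESCALPHANUM}(?:(?:{ESCALPHANUM}|-)*{ESCALPHANUM})?"
TOPLABEL = rf"{ESCALPHA}(?:(?:{ESCALPHANUM}|-)*{ESCALPHANUM})?"

# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")


def escape_hostname(name: str) -> str:
    """Encode non-ASCII characters as UTF-8 and %HH-escape each octet.

    US-ASCII characters are copied through unescaped.
    """
    out = []
    for ch in name:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


def is_valid_hostname(escaped: str) -> bool:
    """Check an escaped host name against the sketched hostname rule."""
    return HOSTNAME.fullmatch(escaped) is not None


if __name__ == "__main__":
    escaped = escape_hostname("d\u00fcrst.example")
    print(escaped, is_valid_hostname(escaped))
```

Because US-ASCII characters are never escaped by escape_hostname, a
purely US-ASCII host name passes through unchanged, while a name that
contains an escaped US-ASCII character (such as "%41") is rejected by
is_valid_hostname under the hexdig8 assumption above.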
+ +While only the restrictions on US-ASCII characters are expressed in the +rules above, all the other restrictions on internationalized +domain names that will be defined by the IDN WG MUST be respected. + +The work of the IDN WG currently includes some procedures for name +preparation. Before encoding an internationalized domain name in an +URI, this preparation step SHOULD be applied. However, the resolver +MUST also apply name preparation. + + +3. IRI syntax changes + +The syntax of IRIs [IRI] currently contains the following rules +relevant to domain names: + + hostname = *( domainlabel "." ) toplabel [ "." ] + domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum + toplabel = alpha | alpha *( alphanum | "-" ) alphanum + +The latter two rules are changed as follows: + + domainlabel = intalphanum | intalphanum *( intalphanum | "-" ) + intalphanum + toplabel = intalpha | intalpha *( intalphanum | "-" ) + intalphanum + +and the following rules are added: + + intalphanum = ichar | alphanum | escaped8 + intalpha = ichar | alpha | escaped8 + escaped8 = "%" hexdig8 HEXDIG + hexdig8 = <> + +where ichar, as in [IRI], is: + + ichar = << any character of UCS [ISO10646] beyond + U+0080, subject to limitations in Section + 3.1. of [IRI] >> + +With respect to the allowed domain names based on US-ASCII characters, +the same considerations as in Section 2 apply. + +As in Section 2, all the other restrictions on internationalized +domain names that will be defined by the IDN WG MUST be respected. +Also, before encoding an internationalized domain name in an IRI, +name preparation SHOULD be applied. However, the IRI resolver MUST +also apply name preparation. + +It is expected that the rules in Section 3.1 of [IRI] will be less +restrictive than the rules for internationalized domain names, so that +no escaping is necessary. Nevertheless, escaping is allowed for cases +where not all characters can be directly represented. + + +4.
Security Considerations + +Besides the security considerations of [RFC 2396] and [IRI] and those +applying to the various aspects of internationalized domain names in +general, there are currently no known security problems. + + +Acknowledgements + +To be done. + + +Copyright + +Copyright (C) The Internet Society, 1997. All Rights Reserved. + +This document and translations of it may be copied and furnished to +others, and derivative works that comment on or otherwise explain it +or assist in its implementation may be prepared, copied, published +and distributed, in whole or in part, without restriction of any +kind, provided that the above copyright notice and this paragraph +are included on all such copies and derivative works. However, this +document itself may not be modified in any way, such as by removing +the copyright notice or references to the Internet Society or other +Internet organizations, except as needed for the purpose of +developing Internet standards in which case the procedures for +copyrights defined in the Internet Standards process must be +followed, or as required to translate it into languages other +than English. + +The limited permissions granted above are perpetual and will not be +revoked by the Internet Society or its successors or assigns. + +This document and the information contained herein is provided on an +"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING +TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING +BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION +HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF +MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE." + + +Author's address + + Martin J. Duerst + W3C/Keio University + 5322 Endo, Fujisawa + 252-8520 Japan + duerst@w3.org + http://www.w3.org/People/D%C3%BCrst/ + Tel/Fax: +81 466 49 1170 + + Note: Please write "Duerst" with u-umlaut wherever + possible, e.g. as "Dürst" in XML and HTML. + + +References + +[IRI] L. 
Masinter, M. Duerst, "Internationalized Resource Identifiers + (IRI)", Internet Draft, January 2001, + , + work in progress. + +[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet + Coded Character Set (UCS) - Part 1: Architecture and Basic + Multilingual Plane, Oct. 2000, with amendments. + +[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate + Requirement Levels", March 1997. + +[RFC 2141] R. Moats, "URN Syntax", May 1997. + +[RFC 2192] C. Newman, "IMAP URL Scheme", September 1997. + +[RFC 2277] H. Alvestrand, "IETF Policy on Character Sets and + Languages", January 1998. + +[RFC 2279] F. Yergeau, "UTF-8, a transformation format of ISO 10646", + January 1998. + +[RFC 2384] R. Gellens, "POP URL Scheme", August 1998. + +[RFC 2396] T. Berners-Lee, R. Fielding, L. Masinter, "Uniform Resource + Identifiers (URI): Generic Syntax", August 1998. + +[RFC 2640] B. Curtis, "Internationalization of the File Transfer + Protocol", July 1999. + +[RFC 2718] L. Masinter, H. Alvestrand, D. Zigmond, R. Petke, + "Guidelines for new URL Schemes", November 1999. + +[RFC 2732] R. Hinden, B. Carpenter, L. Masinter, "Format for Literal + IPv6 Addresses in URL's", December 1999. + + + diff --git a/doc/draft/draft-macgowan-dnsext-label-intel-manage-00.txt b/doc/draft/draft-macgowan-dnsext-label-intel-manage-00.txt index 52ff3b5d07..ed16cac511 100644 --- a/doc/draft/draft-macgowan-dnsext-label-intel-manage-00.txt +++ b/doc/draft/draft-macgowan-dnsext-label-intel-manage-00.txt @@ -1,1858 +1,802 @@ -INTERNET-DRAFT DNS Label Intelligence and Management System -UPDATES RFC 1034 February 2001 - Expires August 2001 - - - - -Domain Name System (DNS) DNS Label Intelligence and Management System - - draft-macgowan-dnsext-label-intel-manage-00.txt - - - - -Michael L. Macgowan Jr. - - -Status of This Document - -This draft is intended to become a Proposed Standard RFC. Distribution -of this document is unlimited. 
Comments should be sent to the Domain -Name Server Extensions working group mailing -list or to the author. - -This document is an Internet-Draft and is in full conformance with all -provisions of Section 10 of RFC 2026. Internet-Drafts are working -documents of the Internet Engineering Task Force (IETF), its areas, and -its working groups. Note that other groups may also distribute working -documents as Internet-Drafts. - -Internet-Drafts are draft documents valid for a maximum of six months -and may be updated, replaced, or obsoleted by other documents at any -time. It is inappropriate to use Internet- Drafts as reference -material or to cite them other than as "work in progress." - -The list of current Internet-Drafts can be accessed at -http://www.ietf.org/ietf/1id-abstracts.txt - -The list of Internet-Draft Shadow Directories can be accessed at -http://www.ietf.org/shadow.html. - - - -Abstract - - -A multidimensional array of domain label analysis and extensions are -offered to overcome a number of issues with the DNS and its use to -locate resources on the Internet. These goals are accomplished by -proposing a naming convention to add labels to domain strings. The -result will be a rational relationship to the content that will provide -a method for meeting the ever-increasing need to expand the namespace, -while providing an efficient search system to access content in a user- -friendly manner. - -A fundamental problem exists in the design of DNS. A user must know the -domain name including the Top Level Domain, TLD, and type the Uniform -Resource Locator, URL, accurately to connect to resources on the -Internet. The current lookup organization of the DNS uses domain labels -separated by periods to provide hierarchical levels for a resolver to -seek in finding a path to an authority. A new masking technique within -labels is proposed to accommodate lookups based on the request. 
-Multiple processing trees are proposed to redistribute the requests -based on the known pieces of the domain name. Rather than knowing the -fully qualified domain name, FQDN, the user can search for content -based upon known pieces of the string like group (business), country, -area code, phone number, type of organization, street address, zip code -and/or GPS location, etc.. Intelligence is added for determining the -fastest route to resolution based on user weighting, number of -requests, and traffic within the system. - -A result of the masking technique is an opportunity to provide a -completely hidden label(s) for maintenance of the system. A TTL (Time -to Live), version, and type of request could be keyed into a label to -provide information, which remains with the client but is normally lost -after a request is processed. This system could be implemented to -create automatically updated records and content. Or hidden labels -could be used to distinguish between version 4 and version 6 requests -in the TCP/IP, Transmission Control Protocol/ Internet Protocol, -rollover. - -Implementation of the new name system is facilitated by the addition of -a client interface for building requests. Longer domain names are -enhanced by smart AutoCompletes and group edit boxes. - -Table of Contents - - Status of This Document 1 - Abstract 1 - - Table of Contents 3 - - 1. Introduction 4 - 2. Inputting Request for Resolution 4 - 3. Resolution Processing 7 - 4. Processing Forest 13 - 5. Extended Label Uses 14 - 6. IANA Considerations 16 - 6. Security Considerations 16 - - References 16 - - Authors Address 16 - Expiration and File Name 17 - - - - - - - - - - - - -1. Introduction - -The Domain Name System (DNS) [RFC 1034, 1035] is the global -hierarchical replicated distributed database system for Internet -addressing, mail proxy, and other information. The DNS has been -extended to phone numbers as described in [RFC 2535]. 
It is designed to -accommodate a user-friendly name as an abstraction level over an IP -address, which provides a path to the physical connection to resources -and/or content on the Internet. This abstraction allows for changing -the physical location of the content without an update by the client. -The design, however, lacks a user-friendly method for assigning TLDs -and determining which TLD a content provider will be registered under. - -According to COMPUTERGRAM INTERNATIONAL: January 08, 2001, over 100 -million hosts are connected to the Internet with over 350 million -users. ICANN has submitted plans to increase the number of TLDs to -accommodate the lack of namespace, but the problem of organization and -extensibility continues to exist. As the number of TLDs grows, it -becomes harder for a user to input a user friendly domain name. In -essence, the user must know what derivations and which TLDs were -available to a provider at the time the organization chose a domain -name. The method of one response, in an all or nothing request, forces -precision on the part of the user that is a distraction to the original -goal of a user-friendly name. Consider a user that wants to find a new -theoretical health related company called Healthy Foods. Will the -company be called Healthyfoods.com? Or will it have an extension like -healthfoods.net, healthfoods.org, or healthfoods.health? Maybe it will -be forced to be a derivation like healthf.com, healthf.net, etc. There -is no user-friendly method to determine what the associated domain name -might be. This is a central problem of focus and organization. The -number of iterations a user must try increases with each new TLD that -is added. If a user forms multiple guesses for the TLD, excess traffic -is generated and the search is slowed by the inefficient nature of -human typing. 
Further, if a system were proposed under the current root -structure to allow for a search of all possible TLDs, the number of -requests would grow exponentially with the addition of each new TLD. - -2. Inputting Request for Resolution - -The key to making a New Name System, NNS, is to provide a user -interface, which will accommodate a friendly method of building name -requests. AutoComplete and multiple-selection drop-down, group list -boxes (some editable, some not) will make more complicated names easier -to input. Consider the previous example of Healthy Foods. Additional -extensions could be added as labels to make the namespace exponentially -larger. The web content might be reached at -www.healthy.food.US2081234567.SFairview101. In this example, www is the -Device label or content desired by the user. Healthy is the domain or -Subgroup/Group name label. Food is the item under the Type label. -US2081234567 is the item country/area code/number for the Global label. -SFairview101 is the street/address of the Local label. - -Derivations of this example provide a limitless expansion of the -namespace within the physical limits of the protocol. A competitor down -the block might have the same FQDN, except for the street number and -phone number, e.g. www.healthy.food.US2088901234.SFairview990. A second -type of business could also be run from the same location by changing -the type, e.g. www.healthy.entertainment.US2081234567.SFairview101. A -parody of the site might be offered at -www.healthy.parody.us2086669999.SState103. - -A method of using less descriptive labels could also be used to -generalize the content. For example, the site for the regional office -might use only the country and area code designation, e.g. .US208. A -corporate address might be located at www.healthy.food.US.corporate. -This way the Global and Local labels are not tied to physical -locations.
Or there may be an 800 or 888 number that could be used for -multiple sites that are tied to multiple registrations at different -street addresses in the Local label. - -The task of building these longer names with labels can be accomplished -by updating list items from the NNS and by designing a better -interface. Instead of waiting for ICANN to vote on the relative merits -of a proposal for a new TLD, items could be automatically updated and -added to the system by a list of requirements. This would force a -relationship between labels but provide a nonbiased method without -prejudice. For example, a .Bus(iness) item for the Type label would -require a copy of a business license to be granted by the governing -authorities for the area specified in the Global label or the address -specified in the Local label. A "TM" item could separate the -Intellectual Property of Trademarks and Copyrights from other -registered listings issued from the government specified by the country -code in the Global label. Additionally, the Academy of Motion Pictures -might request an Oscar item, which would restrict membership to -nominees or recipients of the coveted award. - -Just as the resolver gets an updated list of root servers upon first -connection, the resolver could also receive an updated list of items in -the Type label and return them to the client. The list could be updated -by a TTL trigger and should not be editable from the user's standpoint. -The user interface should allow for multiple selections, which could be -used to form separate requests for resolution. Finally, the -implementation should begin with at least a list that is equal to a -subject list found in the yellow pages of the phone book. This will -provide a well-known classification that will greatly reduce -competition for names of organizations, which are similar but provide -for very different products/services. Delta.airline is readily -distinguished from Delta.homeimprovement.
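
The label layout used in the examples of this section (Device at the left, then Subgroup/Group, Type, Global, and Local at the right) can be illustrated with a minimal Python sketch. The names NNSName and split_nns_name are hypothetical and do not appear in the proposal. The sketch assumes a fully specified name with no wildcards or omitted labels, and treats every label between the Device label and the Type label as a Subgroup/Group label.

```python
from typing import NamedTuple, Tuple


class NNSName(NamedTuple):
    """Hypothetical container for the labels of one NNS name."""
    device: str
    groups: Tuple[str, ...]   # Subgroup label(s) followed by the Group label
    type_label: str
    global_label: str
    local_label: str


def split_nns_name(name: str) -> NNSName:
    """Split a fully specified NNS name into its labels.

    Assumes the order Device.[Subgroup...].Group.Type.Global.Local and
    rejects names with fewer than five labels (e.g. legacy DNS names).
    """
    labels = name.split(".")
    if len(labels) < 5 or "" in labels:
        raise ValueError("expected Device.Group.Type.Global.Local")
    return NNSName(
        device=labels[0],
        groups=tuple(labels[1:-3]),
        type_label=labels[-3],
        global_label=labels[-2],
        local_label=labels[-1],
    )
```

Applied to www.healthy.food.US2081234567.SFairview101, the sketch yields the Device label www, the Group label healthy, the Type label food, the Global label US2081234567, and the Local label SFairview101. Handling of wildcards, null labels, and legacy three-label names is left out on purpose.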
- -The device label would remain largely unaffected. A list of previously -connected items such as www, toasters, lock, refrigerator, etc. would -facilitate input. The list would be editable. As the number of devices -connected to the Internet grows, this method will be invaluable. -Consider mail and faxes being sent directly to -printer.mybusiness.computer.us2081234567.sfairview101. A user that -needs to send a fax to a satellite office might also be able to try -searching for mybusiness at its other street address or telephone -number, e.g., printer.mybusiness.computer.us714*.sPennsylvaniaave2345. -Wildcards and searching are discussed in the next section. - -The items under the Groups/Subgroups labels would also be a list of -previously connected domains (less the TLD) such as sales.business -or kitchen.home. The list would contain a history of previous -connections and be editable. - -The Global label would have two characters to represent the country -code followed optionally by all or part of a representative telephone -number or mask for identifying the voice number(s) associated as items -in the domain. An international code would require a rational -relationship with world organizations. The interface would contain the -country codes and/or area codes, but the numbers would have to be -added. - -The Local label would require a single character to represent the type -of information presented, followed by the information in a standardized -form. The following codes are proposed for the Local label: "P" for -Postal code, "G" for Global Positioning System (GPS) and "S" for street -address. For example, P83706 would represent the author's postal code, -GP0445004N1162498 (since the "+" key is not valid, "p" and "n" -represent positive and negative) would represent the GPS position of -the author with padding to standardize degrees/min/sec, and SOrchard15541 -would represent the Street address (house number at the end).
Note that each -of these would require a separate name registration. The editable list -box could be a fly-out list box with one of the designators specified, -while the remainder would be user input. - -+------------------+ -|Street | -| Fairview101 | -| State101 | -|Postal | -|GPS | -+------------------+ - -The added labels would exponentially expand the name space. This may -cause an undesired relation to a Global or Local designation. This -could hamper changes to an organization or business in the future. -Hence a business might want to use a CNAME entry to refer users to -a non-distinct item in a label. For instance, a corporation might want -to register mybusiness.bus.In(ternational).corporate so that the -corporate office could be used for email addresses and bookmarking. -Content might be located at each mybusiness.bus.country.location where -the company does business. This way a corporation does not have to be -penalized for moving to a new physical location. The goal of the DNS -was to remove a physical relationship to the network, but the need of -the users is for some content to have a physical relationship to the -content, which is why, in part, the NNS is proposed. The concept of an -update is also discussed in section 5. - -The system would need to distinguish between the need for a request for -a connection to a single IP address versus multiple requests. In an -application like a browser, traditional requests for IP resolution are -all or none. Either an IP address is connected to or not. If wildcards -are added to the request, multiple entries could be returned as a "hit" -list. An option on the browser could determine the number of requests -specified by the user. The "hits" should also be weighted. For -instance, if a user wanted to find all the movie theatres in the local -area he/she might submit a request for www.*.movies.US8370*.*.
She/he -would be more inclined to desire additional theatres at different -nearby area codes than derivations of different domain names or Local -label derivations for a single theatre. A simple listing of each label -with an associated numerical value in an advanced option field could -determine how the responses are weighted against one another. The NNS -could also take into account the number of requests on the system and -further limit the number of responses based upon traffic. - -For registration, the content provider might want to register a more -global entry to be displayed on a restrictive search e.g. loans.US*.*. -A business content provider might want to register mybusiness.com.US* -so that requests for www.mybusiness.com.US208.* and -www.mybusiness.com.us714.* both resolve to mybusiness. A process would -have to be in place to copy an entry with wildcards to each of the -associated branches of the processing trees as discussed in section 4. -Similarly wildcard registrations should meet the rational requirements -required for the known item with the generalized scope. In the previous -example the provider would have to be licensed as a financial -institution in each of the states of the United States. - -3. Resolution Processing - -The key to expanding the DNS is to provide for a name space, which can -be accessed quickly and efficiently. Organization is key to this -process. The current DNS has one root organized by TLDs of the Type -label combined with Country TLDs. If a user does not know the extension -for the name, requests must be created for each one until a match is -found. The NNS creates separate roots for each label that can be used -for a search (see graphic on next page, description of TLDs is in -section 5). Instead of one tree, a forest is created, connected by a -common list of authorities for devices in the zones requested. Requests -can be organized by the known piece(s) of information. 
For instance, if -a user is trying to find Hewlett Packard and does not know that content -is provided at HP, a search of www.H*.*.US*.* should be returned -alphabetically from the Group label, not the Type label. However, if -the type item is known to be "computer", a search of the Type tree -would be fastest. If a user wants to find a local voice number for -Microsoft he/she could submit a generalized request within the -local area code for www.Microsoft.software.US504*.*. The authority -would best be located by the Global processor, which might list -www.Microsoft.software.US5041234567.SState123 and -www.Microsoft.software.US5044567890.SredmondAve123. If the request for -www.Microsoft.software.US504*.* were sent to the Local processor, every -TLD would have to be queried. The result might be one phone number with -separate Local label listings for the street address, GPS, and postal -code. This would create unwanted traffic on the system. - - -Root "." Group Root "." Type - | | - | | - "H" TLD TLD "Computer" - | | - | | - --- Authority for..HP.computer.US2081234567.SChinden12---- - | | - | | - "US208" TLD TLD "SChi" - | | - | | -Root "." Global Root "." Local - - -In addition to determining which label(s) to use to process the request, the -system would also have to take into account the weighting by the user -and the traffic on the system as discussed in the previous section. -When the FQDN is specified, the resolver would query the processor with -the fastest expected response time. An FQDN can be resolved from any of -the search processor trees. In the example -oven.macgowan.private.US2081234567.SOrchard15541, it does not matter -whether the request is sent to the Group, Type, Global, or Local -processing tree. Each leads to the authority, -macgowan.private.us2081234567.SOrchard15541. - -If wildcards or null characters exist in the request, the system should -take into account the number of requests that might be generated.
-Currently the DNS does not account for the "?" and reserves the "" for -the root. The "*" could replace the single character wildcard "?" and -the word "null" could be used in lieu of "". The following table could -be used to determine which processing tree should be the most desirable -under such conditions: - -any = any combination of characters displayed in request -reject = no preferred processor -* = match any combination of characters for response -? = match any single character for response -null = no character specified - - -Device -Sub -Group -Type -Global -Local -Result -* -* -* -* -* -* -reject -* -any -any -any -any -any -reject -* -* -any -any -any -any -reject -* -* -* -any -any -any -submit to type, global, or local -processor -* -* -* -* -any -any -submit to global, or local -processor -* -* -* -* -* -any -submit to local processor -any -* -* -* -* -* -reject -any -any -* -* -* -* -reject -any -any -any -* -* -* -submit to group processor -any -any -any -any -* -* -submit to group, or type -processor -any -any -any -any -any -* -submit to group, type, or global -processor -any -any -any -any -any -any -submit to any processor -any -* -any -any -any -any -submit to any processor -any -* -* -any -any -any -submit to type, global, or local -processor -any -* -* -* -any -any -submit to any global, or local -processor -any -* -* -* -* -any -submit to any local processor -any -any -* -any -any -any -submit to any type, global, or -local processor -any -any -* -* -any -any -submit to any global, or local -processor -any -any -* -* -* -any -submit to any local processor -any -any -any -* -any -any -submit to any group, global, or -local processor -any -any -any -* -* -any -submit to any group, or local -processor -any -any -any -any -* -any -submit to any group, type, or -local processor -any -any -any -any -* -* -submit to any group, or type -processor - - - - - - - -* -* -* -* -* -* -reject -* -any*any -any*any -any*any -any*any -any*any -reject -* -* 
-any*any -any*any -any*any -any*any -reject -* -* -* -any*any -any*any -any*any -submit to type, global, or local -processor -* -* -* -* -any*any -any*any -submit to global, or local -processor -* -* -* -* -* -any*any -submit to local processor -any*any -* -* -* -* -* -reject -any*any -any*any -* -* -* -* -reject -any*any -any*any -any*any -* -* -* -submit to group processor -any*any -any*any -any*any -any*any -* -* -submit to group, or type -processor -any*any -any*any -any*any -any*any -any*any -* -submit to group, type, or global -processor -any*any -any*any -any*any -any*any -any*any -any*any -reject -any*any -* -any*any -any*any -any*any -any*any -reject -any*any -* -* -any*any -any*any -any*any -submit to type, global, or local -processor -any*any -* -* -* -any*any -any*any -submit to any global, or local -processor -any*any -* -* -* -* -any*any -submit to any local processor -any*any -any*any -* -any*any -any*any -any*any -reject -any*any -any*any -* -* -any*any -any*any -submit to any global, or local -processor -any*any -any*any -* -* -* -any*any -submit to any local processor -any*any -any*any -any*any -* -any*any -any*any -reject -any*any -any*any -any*any -* -* -any*any -submit to any group, or local -processor -any*any -any*any -any*any -any*any -* -any*any -submit to any group, type, or -local processor -any*any -any*any -any*any -any*any -* -* -submit to any group, or type -processor - - - - - - - -* -* -* -* -* -* -reject -* -any* -any* -any* -any* -any* -reject -* -* -any* -any* -any* -any* -reject -* -* -* -any* -any* -any* -reject -* -* -* -* -any* -any* -submit to global, or local -processor -* -* -* -* -* -any* -submit to local processor -any* -* -* -* -* -* -reject -any* -any* -* -* -* -* -reject -any* -any* -any* -* -* -* -reject -any* -any* -any* -any* -* -* -reject -any* -any* -any* -any* -any* -* -reject -any* -any* -any* -any* -any* -any* -reject -any* -* -any* -any* -any* -any* -reject -any* -* -* -any* -any* -any* -submit to type, 
global, or local -processor -any* -* -* -* -any* -any* -submit to any global, or local -processor -any* -* -* -* -* -any* -submit to any local processor -any* -any* -* -any* -any* -any* -reject -any* -any* -* -* -any* -any* -submit to any global, or local -processor -any* -any* -* -* -* -any* -submit to any local processor -any* -any* -any* -* -any* -any* -reject -any* -any* -any* -* -* -any* -submit to any group, or local -processor -any* -any* -any* -any* -* -any* -reject -any* -any* -any* -any* -* -* -submit to any group, or type -processor - - - - - - - -?any -?any -?any -?any -?any -?any -reject -?any -any -any -any -any -any -reject -?any -?any -any -any -any -any -reject -?any -?any -?any -any -any -any -submit to type, global, or local -processor -?any -?any -?any -?any -any -any -submit to global, or local -processor -?any -?any -?any -?any -?any -any -submit to local processor -any -?any -?any -?any -?any -?any -reject -any -any -?any -?any -?any -?any -reject -any -any -any -?any -?any -?any -submit to group processor -any -any -any -any -?any -?any -submit to group, or type -processor -any -any -any -any -any -?any -submit to group, type, or global -processor -any -any -any -any -any -any -submit to any processor -any -?any -any -any -any -any -submit to any processor -any -?any -?any -any -any -any -submit to type, global, or local -processor -any -?any -?any -?any -any -any -submit to any global, or local -processor -any -?any -?any -?any -?any -any -submit to any local processor -any -any -?any -any -any -any -submit to any type, global, or -local processor -any -any -?any -?any -any -any -submit to any global, or local -processor -any -any -?any -?any -?any -any -submit to any local processor -any -any -any -?any -any -any -submit to any group, global, or -local processor -any -any -any -?any -?any -any -submit to any group, or local -processor -any -any -any -any -?any -any -submit to any group, type, or -local processor -any -any -any -any -?any 
-?any -submit to any group, or type -processor - - - - - - - -any?any -any?any -any?any -any?any -any?any -any?any -reject -any?any -any -any -any -any -any -submit to any processor -any?any -any?any -any -any -any -any -submit to any processor -any?any -any?any -any?any -any -any -any -submit to any processor -any?any -any?any -any?any -any?any -any -any -submit to global, or local -processor -any?any -any?any -any?any -any?any -any?any -any -submit to local processor -any -any?any -any?any -any?any -any?any -any?any -reject -any -any -any?any -any?any -any?any -any?any -reject -any -any -any -any?any -any?any -any?any -submit to group processor -any -any -any -any -any?any -any?any -submit to group, or type -processor -any -any -any -any -any -any?any -submit to any processor -any -any -any -any -any -any -submit to any processor -any -any?any -any -any -any -any -submit to any processor -any -any?any -any?any -any -any -any -submit to any processor -any -any?any -any?any -any?any -any -any -submit to any global, or local -processor -any -any?any -any?any -any?any -any?any -any -submit to any local processor -any -any -any?any -any -any -any -submit to any processor -any -any -any?any -any?any -any -any -submit to any global, or local -processor -any -any -any?any -any?any -any?any -any -submit to any local processor -any -any -any -any?any -any -any -submit to any processor -any -any -any -any?any -any?any -any -submit to any group, or local -processor -any -any -any -any -any?any -any -submit to any processor -any -any -any -any -any?any -any?any -submit to any group, or type -processor - - - - - - - -any? -any? -any? -any? -any? -any? -reject -any? -any -any -any -any -any -submit to any processor -any? -any? -any -any -any -any -submit to any processor -any? -any? -any? -any -any -any -submit to any processor -any? -any? -any? -any? -any -any -submit to any processor -any? -any? -any? -any? -any? -any -submit to any processor -any -any? -any? -any? -any? 
-any? -submit to any processor -any -any -any? -any? -any? -any? -submit to any processor -any -any -any -any? -any? -any? -submit to any processor -any -any -any -any -any? -any? -submit to any processor -any -any -any -any -any -any? -submit to any processor -any -any -any -any -any -any -submit to any processor -any -any? -any -any -any -any -submit to any processor -any -any? -any? -any -any -any -submit to any processor -any -any? -any? -any? -any -any -submit to any processor -any -any? -any? -any? -any? -any -submit to any processor -any -any -any? -any -any -any -submit to any processor -any -any -any? -any? -any -any -submit to any processor -any -any -any? -any? -any? -any -submit to any processor -any -any -any -any? -any -any -submit to any processor -any -any -any -any? -any? -any -submit to any processor -any -any -any -any -any? -any -submit to any processor -any -any -any -any -any? -any? -submit to any processor - - - - - - - -Null -any -any -any -any -any -not valid -any -Null -any -any -any -any -submit to any processor -any -any -Null -any -any -any -reject -any -any -any -Null -any -any -submit to group, global, local -processor -any -any -any -any -Null -any -submit to group, type, local -processor -any -any -any -any -any -Null -submit to group, type, global -processor -Null -Null -any -any -any -any -not valid -any -Null -Null -any -any -any -reject -any -any -Null -Null -any -any -submit to global, local -processor -any -any -any -Null -Null -any -submit to group, local -processor -any -any -any -any -Null -Null -submit to group, type -processor -Null -Null -Null -any -any -any -not valid -any -Null -Null -Null -any -any -submit to global, local -processor -any -any -Null -Null -Null -any -submit to local processor -any -any -any -Null -Null -Null -submit to group processor -Null -Null -Null -Null -any -any -not valid -any -Null -Null -Null -Null -any -submit to local processor -any -any -Null -Null -Null -Null -not valid -Null -Null -Null 
-Null -Null -any -not valid -any -Null -Null -Null -Null -Null -not valid -Null -Null -Null -Null -Null -Null -not valid - - - -4. Processing Forest - - - - |--Group Root---| - | | - |---Type Root---| - | | -client->------Resolver ->------| |----Authority->--- -return - | | - |--Global Root--| - | | - |--Local Root---| - -Once the resolver has determined which root to send the resolution -request to, each tree should be organized according to an exhaustive -replication of each name string on the route to an authority. For -instance, the Group tree would be organized alphabetically with TLDs -"A" through "Z" initially. Since there are a lot of organizations with -business name derivations using the word "micron", there might be a -need to reorganize the "M" TLD to accommodate a "Mic" and a "Mid" TLD. -Although it would be more efficient to break down each letter according -to the demands of the system, it would be easier to specify one mask -for the entire tree. The number of TLDs becomes a function of the -permutations of the number of masked characters in the available set of -usable characters rather than a select few that are added over time. -The resolver can cache the TLDs and know when to use them based upon -the mask for the tree. If a larger mask is needed to further distribute -the load, the resolvers would have to be updated. - -To replicate the current DNS entries under the additional labels -specified in this proposal a number of applications and uses would have -to be accounted for. The ARPA listings would remain unchanged or they -could be replicated under each root by recombining telephone numbers in -a single label under the e164 or padding IP addresses under the inverse -lookup tables without the periods separating the octets. - -Since the NNS uses a forest of processing trees and the current system -uses only one tree, a conversion process would have to be developed to -distinguish between DNS requests and NNS requests.
This could be -handled using a number of different methods. - -A version flag in the request could accomplish this. This way the -resolver would be able to determine which searchable labels were used -and the order of presentation by standardization. The resolver -intelligence would know which labels to use for lookup or in the -preferred embodiment. The resolver could also reorganize the labels to -be presented under the correct processor so that the Global label is -presented at the right of the name string for processing through the -Global tree. Legacy requests without a version would be sent to the -Type tree. - -Another method could accomplish the goal by combining the labels the -request for the processing tree. In the previous example, the request -oven.macgowan.private.US2081234567.SOrchard15541 could be recombined by -the submitting processor as -oven.macgowanUS2081234567SOrchard15541.private to be searched under the -Type tree. Similarly it could be recombined as -oven.macgowanprivateUS2081234567.SOrchard15541 to be searched under the -Local tree. If a legacy DNS based system submitted a request for -www.yahoo.com, it might be appended as www.yahoo.com..... The first ô.ö -after com is to end the Type label. The second ô.ö Represents the null -character at the end of the Global label. The third ô.ö is for the -Local label. The fourth ô.ö is for the root. The last ô.ö is for the -end of the sentence. If applications are affected by the reservation of -the ô.ö for the root, the request could be recreated as -www.yahoo.com.null.null.. - -A final method is to create a hidden label. Hidden labels are discussed -further in extended label uses. - -Once the authority for a label is found within the label, the system -must also determine if there are Subgroups. Subgroups can be used for a -number of internal functions and/or divisions within the authority for -an organization. 
At this point the system would continue to resolve -using subgroup labels as levels as it does under the current system -toward the device at the left of the name string. - -The remaining searchable labels would be serviced using a similar -method. The Type tree would be organized as it is in the DNS with TLDs -representing each item in the list. Since the items in the list are -limited by the system, the mask could be set to none. The Global label -should be organized by a mask, which would accommodate at least the -country and area codes. The Local label would mask the PGS items until -enough TLDs are derived to equal processing traffic under the other -trees. Provisions should be made for the non-distinct items like -ôcorporateö that may use characters not reserved for physical -locations. In addition, a null TLD could be used to organize the -remainder of name strings that have omitted labels. The null ôö -character or the word ônullö could be used to represent legacy DNS -strings under the new labels until the name strings are updated with -the longer requirements. - -The NNS allows a FQDN to be resolved from each searchable label. Please -refer to the previous example, -oven.macgowan.private.US2081234567.SOrchard15541. The authority, -ôMacgowan.private.US2081234567.SOrchard15541ö is found using the -traditional method of the DNS using a Type item of ôprivateö (mask of -zero). The authority, ôMacgowan.private.US2081234567.SOrchard15541ö is -found through the Group processor under the ôMacö branch using a mask -of three characters. The ôMacgowan.private.US2081234567.SOrchard15541ö -authority is found under ôUS208ö using a mask of four characters within -the Global processing tree. The -ôMacgowan.private.US2081234567.SOrchard15541ö authority is also found -under ôSOrö of the branch masked under the Local tree. - -5. 
Extended Label Uses - -The NNS is a simple design which can accommodate the future of Internet -name strings by incorporating additional processing trees and a large -name space organized by labels with a user friendly interface. A search -engine is automatically derived from the organization within labels as -opposed to across labels. In other words, you send the known pieces of -the request to the processing tree that will yield the quickest results -with the least amount of traffic. Once names are bookmarked or selected -from a list of AutoCompletes, requests can be sent to any processing -tree to balance the load on the system. - -The present proposal also provides an extensible path for future labels -that may or may not have associated processors. A ôContactö label -might always be masked during the request for resolution, but provide -additional value to the user with a description about the connection or -a webmasterÆs email address. This has extreme value in the event a name -can be resolved, but not reached by connection to the IP address. In -addition to adding new labels, a group or association might request a -new item under the Type label or a new area code might be added under -the Group label. Therefore, one result of this system is a combination -of devices and labels which expands exponentially to meet the demand -for namespace with an inherent capability to adjust to future needs. - -An additional hidden label (mask of all) adjacent to the root could be -hidden and give information for maintenance of the system and/or the -listing. The most important consideration is keying the order and -number of labels in the string. Or using this method, a hidden security -label could help create a firewall between valid requests from users in -the domain versus outsiders or tie to a public key for the destination. -The hidden label could also be used to pass a request for content -delivered in a specific language. 
With the addition of the Local and -Global labels it might also be necessary to add a TTL label which could -serve as a timer for the registration or the life of a bookmark or -connection. The client could use this value in a history of valid -connections to make a request for an updated TTL, a new IP address, -and/or a trigger for replacing the name with a new string. This would -allow for a change in address, phone number, new area code, etc. on the -part of the provider. Just as the domain name was an abstraction layer -over the IP address, the current domain string is an abstraction for a -future domain string. A routine could prompt a user to change an entry -in a contact/bookmark list. Services such as WWW could also -automatically update links in the content or reflect changes to related -destinations within the content. In use, the client could compare its -value to the value at the authority. If the authority has a value of -zero, the client would update its name and IP address to the new -pointer returned by the resolver. An electronically updating NNS with -updating links in content is a product of this system. - -An example of using this procedure could be applied to finding the best -cell phone plan. A user buys a cell plan. The user emails contact links -to friends and associates. The recipients use their link to dial the -user. The user determines a new provider would be more advantageous and -purchases a new plan with a new number. The user sets their old TTL to -zero in the NNS and creates a new FQDN with the new cell number. Now -when the recipients use the old string, they are pointed to the new -string. The string with the new number is updated in the recipientÆs -contact list. The user is not tied to their telephone number and the -recipients do not need to manually adjust their entries. - -Hidden labels and masking would also have to be present at the client. 
-A business might have a lot of phone numbers or locations listed on the -name servers but use a shorter version of the string for making local -connections. This way all the devices under a group could be combined -as a single domain name. The future direction of label intelligence and -the ideas expressed here suggest that there may be numerous ways to -provide abstraction levels within the label string. Even the IP address -might be used as an identifier to search for the rest of the domain -string or an item like the telephone number. - -6. IANA Considerations - -The focus of the IANA will change considerably. The need to regulate -name hoarders, TM infringement considerations, and the decision to -implement new TLDs will be greatly reduced. The IANA might be used to -determine the relationships between labels as new items are added under -the requirements that provide for fair and equal addition to the Type -label. - -7. Security Consideration - -Name resolution is an inherent problem for spoofing content, but is -beyond the scope of this proposal. The suggested ability to update name -strings at the client also increases the need to provide secure -communications between the system and the client. - - -References - - - - [RFC 1034] - "Domain names - concepts and facilities", P. - - Mockapetris, 11/01/1987. - - [RFC 1035] - "Domain names - implementation and specification", P. - - Mockapetris, 11/01/1987. - - [RFC 2535] û ôE.164 number and DNSö , P. - - P. Faltstrom, 9/1/2000. - -Authors Address - - Michael L. Macgowan Jr. - 15541 Orchard Ave. - Caldwell, ID 83607 USA - - - Telephone: +1 208.454.1177 (h) - FAX: +1 208.455.0439 - EMail: mmacgowa@yahoo.com - - -Expiration and File Name - - This draft expires in August 2001 - - Its file name is labelmanage.txt - -Full Copyright Statement - -Copyright (C) The Internet Society (February 2001). All Rights -Reserved. 
- -This document and translations of it may be copied and furnished to -others, and derivative works that comment on or otherwise explain it or -assist in its implementation may be prepared, copied, published and -distributed, in whole or in part, without restriction of any kind, -provided that the above copyright notice and this paragraph are -included on all such copies and derivative works. However, this -document itself may not be modified in any way, such as by removing the -copyright notice or references to the Internet Society or other -Internet organizations, except as needed for the purpose of developing -Internet standards in which case the procedures for copyrights defined -in the Internet Standards process must be followed, or as required to -translate it into languages other than English. - -The limited permissions granted above are perpetual and will not be -revoked by the Internet Society or its successors or assigns. This -document and the information contained herein is provided on an "AS IS" -basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE -DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED -TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT -INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR -FITNESS FOR A PARTICULAR PURPOSE." -Michael L. Macgowan Jr. February 2001 [Page 6] - -Internet Draft DNS Label Intelligence and Management System - - - +INTERNET-DRAFT DNS Label Intelligence and Management System +UPDATES RFC 1034 February 2001 + Expires August 2001 + + + + +Domain Name System (DNS) DNS Label Intelligence and Management System + draft-macgowan-dnsext-label-intel-manage-00.txt + + + +Michael L. Macgowan Jr. + + +Status of This Document + +This draft is intended to become a Proposed Standard RFC. Distribution +of this document is unlimited. Comments should be sent to the Domain +Name Server Extensions working group mailing +list or to the author. 
+ +This document is an Internet-Draft and is in full conformance with all +provisions of Section 10 of RFC 2026. Internet-Drafts are working +documents of the Internet Engineering Task Force (IETF), its areas, and +its working groups. Note that other groups may also distribute working +documents as Internet-Drafts. + +Internet-Drafts are draft documents valid for a maximum of six months +and may be updated, replaced, or obsoleted by other documents at any +time. It is inappropriate to use Internet- Drafts as reference +material or to cite them other than as "work in progress." + +The list of current Internet-Drafts can be accessed at +http://www.ietf.org/ietf/1id-abstracts.txt + +The list of Internet-Draft Shadow Directories can be accessed at +http://www.ietf.org/shadow.html. + + + +Abstract + + +A multidimensional array of domain label analysis and extensions are +offered to overcome a number of issues with the DNS and its use to +locate resources on the Internet. These goals are accomplished by +proposing a naming convention to add labels to domain strings. The +result will be a rational relationship to the content that will provide +a method for meeting the ever-increasing need to expand the namespace, +while providing an efficient search system to access content in a user- +friendly manner. + +A fundamental problem exists in the design of DNS. A user must know the +domain name including the Top Level Domain, TLD, and type the Uniform +Resource Locator, URL, accurately to connect to resources on the +Internet. The current lookup organization of the DNS uses domain labels +separated by periods to provide hierarchical levels for a resolver to +seek in finding a path to an authority. A new masking technique within +labels is proposed to accommodate lookups based on the request. +Multiple processing trees are proposed to redistribute the requests +based on the known pieces of the domain name. 
Rather than knowing the +fully qualified domain name (FQDN), the user can search for content +based upon known pieces of the string like group (business), country, +area code, phone number, type of organization, street address, zip code +and/or GPS location, etc. Intelligence is added for determining the +fastest route to resolution based on user weighting, number of +requests, and traffic within the system. + +A result of the masking technique is an opportunity to provide +completely hidden label(s) for maintenance of the system. A TTL (Time +to Live), version, and type of request could be keyed into a label to +provide information that remains with the client but is normally lost +after a request is processed. This system could be implemented to +create automatically updated records and content. Hidden labels +could also be used to distinguish between version 4 and version 6 +requests in the TCP/IP (Transmission Control Protocol/Internet +Protocol) rollover. + +Implementation of the new name system is facilitated by the addition of +a client interface for building requests. Longer domain names are +enhanced by smart AutoCompletes and group edit boxes. + +Table of Contents + + Status of This Document 1 + Abstract 1 + + Table of Contents 3 + + 1. Introduction 4 + 2. Inputting Request for Resolution 4 + 3. Resolution Processing 7 + 4. Processing Forest 13 + 5. Extended Label Uses 14 + 6. IANA Considerations 16 + 7. Security Considerations 16 + + References 16 + + Authors Address 16 + Expiration and File Name 17 + + + + + + + + + + + + +1. Introduction + +The Domain Name System (DNS) [RFC 1034, 1035] is the global +hierarchical replicated distributed database system for Internet +addressing, mail proxy, and other information. The DNS has been +extended to phone numbers as described in [RFC 2535].
It is designed to +accommodate a user-friendly name as an abstraction level over an IP +address, which provides a path to the physical connection to resources +and/or content on the Internet. This abstraction allows for changing +the physical location of the content without an update by the client. +The design, however, lacks a user-friendly method for assigning TLDs +and determining which TLD a content provider will be registered under. + +According to COMPUTERGRAM INTERNATIONAL: January 08, 2001, over 100 +million hosts are connected to the Internet with over 350 million +users. ICANN has submitted plans to increase the number of TLDs to +accommodate the lack of namespace, but the problem of organization and +extensibility continues to exist. As the number of TLDs grows, it +becomes harder for a user to input a user friendly domain name. In +essence, the user must know what derivations and which TLDs were +available to a provider at the time the organization chose a domain +name. The method of one response, in an all or nothing request, forces +precision on the part of the user that is a distraction to the original +goal of a user-friendly name. Consider a user that wants to find a new +theoretical health related company called Healthy Foods. Will the +company be called Healthyfoods.com? Or will it have an extension like +healthfoods.net, healthfoods.org, or healthfoods.health? Maybe it will +be forced to be a derivation like healthf.com, healthf.net, etc. There +is no user-friendly method to determine what the associated domain name +might be. This is a central problem of focus and organization. The +number of iterations a user must try increases with each new TLD that +is added. If a user forms multiple guesses for the TLD, excess traffic +is generated and the search is slowed by the inefficient nature of +human typing. 
Further, if a system were proposed under the current root +structure to allow for a search of all possible TLDs, the number of +requests would grow exponentially with the addition of each new TLD. + +2. Inputting Request for Resolution + + + +The key to making a New Name System, NNS, is to provide a user +interface that accommodates a friendly method of building name +requests. AutoComplete and multiple-selection drop-down, group list +boxes (some editable, some not) will make more complicated names easier +to input. Consider the previous example of Healthy Foods. Additional +extensions could be added as labels to make the namespace exponentially +larger. The web content might be reached at +www.healthy.food.US2081234567.SFairview101. In this example, www is the +Device label or content desired by the user. Healthy is the domain or +Subgroup/Group name label. Food is the item under the Type label. +US2081234567 is the item country/area code/number for the Global label. +SFairview101 is the street/address of the Local label. + +Derivations of this example provide a limitless expansion of the +namespace within the physical limits of the protocol. A competitor down +the block might have the same FQDN, except for the street number and +phone number, e.g., www.healthy.food.US2088901234.SFairview990. A second +type of business could also be run from the same location by changing +the type, e.g., www.healthy.entertainment.US2081234567.SFairview101. A +parody of the site might be offered at +www.healthy.parody.us2086669999.SState103. + +A method of using less descriptive labels could also be used to +generalize the content. For example, the site for the regional office +might use only the country and area code designation, e.g., .US208. A +corporate address might be located at www.healthy.food.US.corporate. +This way the Global and Local labels are not tied to physical +locations.
Or there may be an 800 or 888 number that could be used for +multiple sites that are tied to multiple registrations at different +street addresses in the Local label. + +The task of building these longer names with labels can be accomplished +by updating list items from the NNS and by designing a better +interface. Instead of waiting for ICANN to vote on the relative merits +of a proposal for a new TLD, items could be automatically updated and +added to the system by a list of requirements. This would force a +relationship between labels but provide a nonbiased method without +prejudice. For example, a .Bus(iness) item for the Type label would +require a copy of a business license to be granted by the governing +authorities for the area specified in the Global label or the address +specified in the Local label. A “TM” item could separate the +Intellectual Property of Trademarks and Copyrights from other +registered listings issued from the government specified by the country +code in the Global label. Additionally, the Academy of Motion Pictures +might request an Oscar item, which would restrict membership to +nominees or recipients of the coveted award. + +Just as the resolver gets an updated list of root servers upon first +connection, the resolver could also receive an updated list of items in +the Type label and return them to the client. The list could be updated +by a TTL trigger and should not be editable from the user’s standpoint. +The user interface should allow for multiple selections, which could be +used to form separate requests for resolution. Finally, the +implementation should begin with at least a list that is equal to a +subject list found in the yellow pages of the phone book. This will +provide a well-known classification that will greatly reduce +competition for names of organizations, which are similar but provide +for very different products/services. Delta.airline is readily +distinguished from Delta.homeimprovement. 
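The label order used in the Healthy Foods examples above (Device at the left, then any Subgroups, Group, Type, Global, and Local at the right) can be sketched in code. The following Python fragment is only an illustration and is not part of the proposal: the names NnsName and split_nns_name are hypothetical, and it assumes a fully specified name string with no wildcards, hidden labels, or omitted labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NnsName:
    """Hypothetical container for the labels of one NNS name string."""
    device: str
    subgroups: List[str]
    group: str
    type_: str
    global_: str
    local: str


def split_nns_name(name: str) -> NnsName:
    """Split a fully specified name string into its labels.

    Assumption: labels are read from the right as Local, Global, Type,
    Group; the leftmost label is the Device; anything in between is
    treated as Subgroup labels.
    """
    labels = name.split(".")
    if len(labels) < 5 or any(label == "" for label in labels):
        raise ValueError("expected at least five non-empty labels")
    return NnsName(
        device=labels[0],
        subgroups=labels[1:-4],
        group=labels[-4],
        type_=labels[-3],
        global_=labels[-2],
        local=labels[-1],
    )
```

For www.healthy.food.US2081234567.SFairview101 this yields Device www, Group healthy, Type food, Global US2081234567, and Local SFairview101, with no Subgroup labels.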
+ +The device label would remain largely unaffected. A list of previously +connected items such as www, toasters, lock, refrigerator, etc. would +facilitate input. The list would be editable. As the number of devices +connected to the Internet grows, this method will be invaluable. +Consider mail and faxes being sent directly to +printer.mybusiness.computer.us2081234567.sfairview101. A user that +needs to send a fax to a satellite office might also be able to try +searching for mybusiness at its other street address or telephone +number, e.g., printer.mybusiness.computer.us714*.sPennsylvaniaave2345. +Wildcards and searching are discussed in the next section. + +The items under the Groups/Subgroups labels would also be a list of +previously connected domains (less the TLD) such as sales.business +or kitchen.home. The list would contain a history of previous +connections and be editable. + +The Global label would have two characters to represent the country +code followed optionally by all or part of a representative telephone +number or mask for identifying the voice number(s) associated as items +in the domain. An international code would require a rational +relationship with world organizations. The interface would contain the +country codes and/or area codes, but the numbers would have to be +added. + +The Local label would require a single character to represent the type +of information presented, followed by the information in a standardized +form. The following codes are proposed for the Local label: “P” for +Postal code, “G” for Global Positioning System (GPS) and “S” for street +address. For example, P83706 would represent the author’s postal code; +GP0445004N1162498 (since the “+” key is not valid, “p” and “n” +represent positive and negative) would represent the GPS position of +the author with padding to standardize degrees/min/sec; and SOrchard15541 +would represent the Street address (house number at the end).
Note that each +of these would require a separate name registration. The editable list +box could be a fly out list box with one of the designators specified, +while the remainder would be user input. + ++------------------+ +|Street | +| Fairview101 | +| State101 | +|Postal | +|GPS | ++------------------+ + +The added labels would exponentially expand the name space. This may +cause an undesired relation to a Global or Local designation. This +could hamper changes to an organization or business in the future. +Hence a business might want to use a CNAME entry to reference users to +a non-distinct item in a label. For instance, a corporation might want +to register mybusiness.bus.In(ternational).corporate so that the +corporate office could be used for email addresses and bookmarking. +Content might be located at each mybusiness.bus.country.location where +the company does business. This way a corporation does not have to be +penalized for moving to a new physical location. The goal of the DNS +was to remove a physical relationship to the network, but the need of +the users is for some content to have a physical relationship to the +content, which is why, in part, the NNS is proposed. The concept of an +update is also discussed in section 5. + +The system would need to distinguish between a request for +a connection to a single IP address and multiple requests. In an +application like a browser, traditional requests for IP resolution are +all or none. Either an IP address is connected to or not. If wildcards +are added to the request, multiple entries could be returned as a “hit” +list. An option on the browser could determine the number of requests +specified by the user. The “hits” should also be weighted. For +instance, if a user wanted to find all the movie theatres in the local +area he/she might submit a request for www.*.movies.US8370*.*.
She/he +would be more inclined to desire additional theatres at different +nearby area codes than derivations of different domain names or Local +label derivations for a single theatre. A simple listing of each label +with an associated numerical value in an advanced option field could +determine how the responses are weighted against one another. The NNS +could also take into account the number of requests on the system and +further limit the number of responses based upon traffic. + +For registration, the content provider might want to register a more +global entry to be displayed on a restrictive search e.g. loans.US*.*. +A business content provider might want to register mybusiness.com.US* +so that requests for www.mybusiness.com.US208.* and +www.mybusiness.com.us714.* both resolve to mybusiness. A process would +have to be in place to copy an entry with wildcards to each of the +associated branches of the processing trees as discussed in section 4. +Similarly wildcard registrations should meet the rational requirements +required for the known item with the generalized scope. In the previous +example the provider would have to be licensed as a financial +institution in each of the states of the United States. + +3. Resolution Processing + +The key to expanding the DNS is to provide for a name space, which can +be accessed quickly and efficiently. Organization is key to this +process. The current DNS has one root organized by TLDs of the Type +label combined with Country TLDs. If a user does not know the extension +for the name, requests must be created for each one until a match is +found. The NNS creates separate roots for each label that can be used +for a search (see graphic on next page, description of TLDs is in +section 5). Instead of one tree, a forest is created, connected by a +common list of authorities for devices in the zones requested. Requests +can be organized by the known piece(s) of information. 
For instance, if +a user is trying to find Hewlett Packard and does not know that content +is provided at HP, a search of www.H*.*.US*.* should be returned +alphabetically from the Group label, not the Type label. However, if +the type item is known to be “computer”, a search of the Type tree +would be fastest. If a user wants to find a local voice number for +Microsoft he/she could submit a generalized request within the +local area code for www.Microsoft.software.US504*.*. The authority +would best be located by the Global processor, which might list +www.Microsoft.software.US5041234567.SState123 and +www.Microsoft.software.US5044567890.SredmondAve123. If the request for +www.Microsoft.software.US504*.* were sent to the Local processor, every +TLD would have to be queried. The result might be one phone number with +separate Local label listings for the street address, GPS, and postal +code. This would create unwanted traffic on the system. + + +Root “.” Group Root “.”Type + | | + | | + “H” TLD TLD “Computer” + | | + | | + --- Authority for..HP.computer.US2081234567.SChinden12---- + | | + | | + “US208” TLD TLD “SChi” + | | + | | +Root “.” Global Root “.”Local + + +In addition to determining which label(s) should process the request, the +system would also have to take into account the weighting by the user +and the traffic on the system as discussed in the previous section. +When the FQDN is specified, the resolver would query the processor with +the fastest expected response time. An FQDN can be resolved from any of +the search processor trees. In the example +oven.macgowan.private.US2081234567.SOrchard15541, it does not matter +whether the request is sent to the Group, Type, Global, or Local +processing tree. Each leads to the authority, +macgowan.private.us2081234567.SOrchard15541. + +If wildcards or null characters exist in the request, the system should +take into account the number of requests that might be generated.
+Currently the DNS does not account for the “?” and reserves the “” for +the root. The “*” could replace the single character wildcard “?” and +the word “null” could be used in lieu of “”. The following table could +be used to determine which processing tree should be the most desirable +under such conditions: + +any = any combination of characters displayed in request +reject= no preferred processor +*= match any combination of characters for response +?= match any single character for response +null= no character specified + + +Device Sub Group T G L Result +* * * * * * reject +* any any any any any reject +* * any any any any reject +* * * any any any submit to T, G, or L +* * * * any any submit to G, or L +* * * * * any submit to L +any * * * * * reject +any any * * * * reject +any any any * * * submit to group +any any any any * * submit to group, or T +any any any any any * submit to group, T, or G +any any any any any any submit to any +any * any any any any submit to any +any * * any any any submit to T, G, or L +any * * * any any submit to any G, or L +any * * * * any submit to any L +any any * any any any submit to any T, G, or L +any any * * any any submit to any G, or L +any any * * * any submit to any L +any any any * any any submit to any group, G, or L +any any any * * any submit to any group, or L +any any any any * any submit to any group, T, or L +any any any any * * submit to any group, or T + +* * * * * * reject +* any*any any*any any*any any*any any*any reject +* * any*any any*any any*any any*any reject +* * * any*any any*any any*any submit to T, G, or L +* * * * any*any any*any submit to G, or L +* * * * * any*any submit to L +any*any * * * * * reject +any*any any*any * * * * reject +any*any any*any any*any * * * submit to group +any*any any*any any*any any*any * * submit to group, or T +any*any any*any any*any any*any any*any * submit to group, T, or G +any*any any*any any*any any*any any*any any*any reject +any*any * any*any any*any any*any
any*any reject +any*any * * any*any any*any any*any submit to T, G, or L +any*any * * * any*any any*any submit to any G, or L +any*any * * * * any*any submit to any L +any*any any*any * any*any any*any any*any reject +any*any any*any * * any*any any*any submit to any G, or L +any*any any*any * * * any*any submit to any L +any*any any*any any*any * any*any any*any reject +any*any any*any any*any * * any*any submit to any group, or L +any*any any*any any*any any*any * any*any submit to any group, T, or L +any*any any*any any*any any*any * * submit to any group, or T + +* * * * * * reject +* any* any* any* any* any* reject +* * any* any* any* any* reject +* * * any* any* any* reject +* * * * any* any* submit to G, or L +* * * * * any* submit to L +any* * * * * * reject +any* any* * * * * reject +any* any* any* * * * reject +any* any* any* any* * * reject +any* any* any* any* any* * reject +any* any* any* any* any* any* reject +any* * any* any* any* any* reject +any* * * any* any* any* submit to T, G, or L +any* * * * any* any* submit to any G, or L +any* * * * * any* submit to any L +any* any* * any* any* any* reject +any* any* * * any* any* submit to any G, or L +any* any* * * * any* submit to any L +any* any* any* * any* any* reject +any* any* any* * * any* submit to any group, or L +any* any* any* any* * any* reject +any* any* any* any* * * submit to any group, or T + +?any ?any ?any ?any ?any ?any reject +?any any any any any any reject +?any ?any any any any any reject +?any ?any ?any any any any submit to T, G, or L +?any ?any ?any ?any any any submit to G, or L +?any ?any ?any ?any ?any any submit to L +any ?any ?any ?any ?any ?any reject +any any ?any ?any ?any ?any reject +any any any ?any ?any ?any submit to group +any any any any ?any ?any submit to group, or T +any any any any any ?any submit to group, T, or G +any any any any any any submit to any +any ?any any any any any submit to any +any ?any ?any any any any submit to T, G, or L +any ?any ?any ?any 
any any submit to any G, or L +any ?any ?any ?any ?any any submit to any L +any any ?any any any any submit to any T, G, or L +any any ?any ?any any any submit to any G, or L +any any ?any ?any ?any any submit to any L +any any any ?any any any submit to any group, G, or L +any any any ?any ?any any submit to any group, or L +any any any any ?any any submit to any group, T, or L +any any any any ?any ?any submit to any group, or T + +any?any any?any any?any any?any any?any any?any reject +any?any any any any any any submit to any +any?any any?any any any any any submit to any +any?any any?any any?any any any any submit to any +any?any any?any any?any any?any any any submit to G, or L +any?any any?any any?any any?any any?any any submit to L +any any?any any?any any?any any?any any?any reject +any any any?any any?any any?any any?any reject +any any any any?any any?any any?any submit to group +any any any any any?any any?any submit to group, or T +any any any any any any?any submit to any +any any any any any any submit to any +any any?any any any any any submit to any +any any?any any?any any any any submit to any +any any?any any?any any?any any any submit to any G, or L +any any?any any?any any?any any?any any submit to any L +any any any?any any any any submit to any +any any any?any any?any any any submit to any G, or L +any any any?any any?any any?any any submit to any L +any any any any?any any any submit to any +any any any any?any any?any any submit to any group, or L +any any any any any?any any submit to any +any any any any any?any any?any submit to any group, or T + +any? any? any? any? any? any? reject +any? any any any any any submit to any +any? any? any any any any submit to any +any? any? any? any any any submit to any +any? any? any? any? any any submit to any +any? any? any? any? any? any submit to any +any any? any? any? any? any? submit to any +any any any? any? any? any? submit to any +any any any any? any? any? 
submit to any +any any any any any? any? submit to any +any any any any any any? submit to any +any any any any any any submit to any +any any? any any any any submit to any +any any? any? any any any submit to any +any any? any? any? any any submit to any +any any? any? any? any? any submit to any +any any any? any any any submit to any +any any any? any? any any submit to any +any any any? any? any? any submit to any +any any any any? any any submit to any +any any any any? any? any submit to any +any any any any any? any submit to any +any any any any any? any? submit to any + +Null any any any any any not valid +any Null any any any any submit to any +any any Null any any any reject +any any any Null any any submit to group, G, L +any any any any Null any submit to group, T, L +any any any any any Null submit to group, T, G +Null Null any any any any not valid +any Null Null any any any reject +any any Null Null any any submit to G, L +any any any Null Null any submit to group, L +any any any any Null Null submit to group, T +Null Null Null any any any not valid +any Null Null Null any any submit to G, L +any any Null Null Null any submit to L +any any any Null Null Null submit to group +Null Null Null Null any any not valid +any Null Null Null Null any submit to L +any any Null Null Null Null not valid +Null Null Null Null Null any not valid +any Null Null Null Null Null not valid +Null Null Null Null Null Null not valid + + + +4. Processing Forest + + + + |--Group Root---| + | | + |---Type Root---| + | | +client->------Resolver ->------| |----Authority->--- +return + | | + |--Global Root--| + | | + |--Local Root---| + +Once the resolver has determined which root to send the resolution +request to, each tree should be organized according to an exhaustive +replication of each name string on the route to an authority. For +instance, the Group tree would be organized alphabetically with TLDs +“A” through “Z” initially. 
Because many organizations have
+business names derived from the word “micron”, there might be a
+need to reorganize the “M” TLD to accommodate a “Mic” and a “Mid” TLD.
+Although it would be more efficient to break down each letter according
+to the demands of the system, it would be easier to specify one mask
+for the entire tree. The number of TLDs then becomes a function of the
+number of masked characters and the size of the set of usable
+characters, rather than a select few that are added over time.
+The resolver can cache the TLDs and know when to use them based upon
+the mask for the tree. If a larger mask is needed to further distribute
+the load, the resolvers would have to be updated.
+
+To replicate the current DNS entries under the additional labels
+specified in this proposal, a number of applications and uses would
+have to be accounted for. The ARPA listings would remain unchanged, or
+they could be replicated under each root by recombining telephone
+numbers in a single label under e164 or by padding IP addresses under
+the inverse lookup tables without the periods separating the octets.
+
+Since the NNS uses a forest of processing trees and the current system
+uses only one tree, a conversion process would have to be developed to
+distinguish between DNS requests and NNS requests. This could be
+handled using a number of different methods.
+
+A version flag in the request could accomplish this. This way the
+resolver would be able to determine which searchable labels were used
+and, by standardization, their order of presentation. The resolver
+intelligence would know which labels to use for lookup or in the
+preferred embodiment. The resolver could also reorganize the labels to
+be presented under the correct processor so that the Global label is
+presented at the right of the name string for processing through the
+Global tree. Legacy requests without a version would be sent to the
+Type tree.
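
The version-flag method described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not part of
the proposal: the Request type, its version field, the route function,
and the fixed five-label order (device, group, type, global, local,
with no subgroups) are all hypothetical names chosen for the example.
The sketch sends unversioned requests to the Type tree and, for
versioned requests, moves the label of the chosen tree to the right of
the name string.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed label order, taken from the example name string
# oven.macgowan.private.US2081234567.SOrchard15541 (no subgroups).
LABEL_ORDER = ("device", "group", "type", "global", "local")
TREES = ("group", "type", "global", "local")


@dataclass
class Request:
    name: str
    version: Optional[int] = None  # hypothetical flag; None means legacy DNS


def route(request: Request, preferred: str = "global") -> Tuple[str, str]:
    """Return (tree, name string as presented to that tree)."""
    if request.version is None:
        # Legacy requests carry no version and go to the Type tree unchanged.
        return "type", request.name
    if preferred not in TREES:
        raise ValueError(f"unknown processing tree: {preferred}")
    labels = request.name.split(".")
    if len(labels) != len(LABEL_ORDER):
        raise ValueError("expected exactly one label per position")
    # Move the preferred tree's label to the right end of the string.
    index = LABEL_ORDER.index(preferred)
    reordered = labels[:index] + labels[index + 1:] + [labels[index]]
    return preferred, ".".join(reordered)


if __name__ == "__main__":
    nns = Request("oven.macgowan.private.US2081234567.SOrchard15541", version=1)
    print(route(nns, "global"))
    print(route(Request("www.yahoo.com")))
```

With the example request and the Global tree preferred, the sketch
presents oven.macgowan.private.SOrchard15541.US2081234567 to the Global
tree; a legacy www.yahoo.com request is passed to the Type tree
unchanged.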
+
+Another method could accomplish the goal by combining the labels of the
+request for the processing tree. In the previous example, the request
+oven.macgowan.private.US2081234567.SOrchard15541 could be recombined by
+the submitting processor as
+oven.macgowanUS2081234567SOrchard15541.private to be searched under the
+Type tree. Similarly, it could be recombined as
+oven.macgowanprivateUS2081234567.SOrchard15541 to be searched under the
+Local tree. If a legacy DNS-based system submitted a request for
+www.yahoo.com, it might be extended to www.yahoo.com..... The first “.”
+after com ends the Type label. The second “.” represents the null
+character at the end of the Global label. The third “.” is for the
+Local label. The fourth “.” is for the root. The last “.” is for the
+end of the sentence. If applications are affected by the reservation of
+the “.” for the root, the request could be recreated as
+www.yahoo.com.null.null..
+
+A final method is to create a hidden label. Hidden labels are discussed
+further in section 5, Extended Label Uses.
+
+Once the authority for a label is found within the label, the system
+must also determine if there are Subgroups. Subgroups can be used for a
+number of internal functions and/or divisions within the authority for
+an organization. At this point the system would continue to resolve,
+using subgroup labels as levels as it does under the current system,
+toward the device at the left of the name string.
+
+The remaining searchable labels would be serviced using a similar
+method. The Type tree would be organized as it is in the DNS, with TLDs
+representing each item in the list. Since the items in the list are
+limited by the system, the mask could be set to none. The Global label
+should be organized by a mask that would accommodate at least the
+country and area codes. The Local label would mask the PGS items until
+enough TLDs are derived to equal the processing traffic under the other
+trees.
Provisions should be made for non-distinct items like
+“corporate” that may use characters not reserved for physical
+locations. In addition, a null TLD could be used to organize the
+remainder of name strings that have omitted labels. The null “”
+character or the word “null” could be used to represent legacy DNS
+strings under the new labels until the name strings are updated to
+meet the longer requirements.
+
+The NNS allows an FQDN to be resolved from each searchable label.
+Please refer to the previous example,
+oven.macgowan.private.US2081234567.SOrchard15541. The authority,
+“Macgowan.private.US2081234567.SOrchard15541”, is found using the
+traditional method of the DNS using a Type item of “private” (mask of
+zero). The authority, “Macgowan.private.US2081234567.SOrchard15541”, is
+found through the Group processor under the “Mac” branch using a mask
+of three characters. The “Macgowan.private.US2081234567.SOrchard15541”
+authority is found under “US208” using a mask of five characters within
+the Global processing tree. The
+“Macgowan.private.US2081234567.SOrchard15541” authority is also found
+under the “SOr” branch masked under the Local tree.
+
+5. Extended Label Uses
+
+The NNS is a simple design which can accommodate the future of Internet
+name strings by incorporating additional processing trees and a large
+name space organized by labels with a user-friendly interface. A search
+engine is automatically derived from the organization within labels as
+opposed to across labels. In other words, you send the known pieces of
+the request to the processing tree that will yield the quickest results
+with the least amount of traffic. Once names are bookmarked or selected
+from a list of AutoCompletes, requests can be sent to any processing
+tree to balance the load on the system.
+
+The present proposal also provides an extensible path for future labels
+that may or may not have associated processors.
A “Contact” label
+might always be masked during the request for resolution, but provide
+additional value to the user with a description of the connection or
+a webmaster’s email address. This is especially valuable when a name
+can be resolved but the IP address cannot be reached. In
+addition to adding new labels, a group or association might request a
+new item under the Type label, or a new area code might be added under
+the Global label. Therefore, one result of this system is a combination
+of devices and labels which expands exponentially to meet the demand
+for namespace with an inherent capability to adjust to future needs.
+
+An additional label (mask of all) adjacent to the root could be
+hidden and give information for maintenance of the system and/or the
+listing. The most important consideration is keying the order and
+number of labels in the string. Using this method, a hidden security
+label could help create a firewall between valid requests from users in
+the domain and requests from outsiders, or tie to a public key for the
+destination. The hidden label could also be used to pass a request for
+content delivered in a specific language. With the addition of the
+Local and Global labels it might also be necessary to add a TTL label,
+which could serve as a timer for the registration or the life of a
+bookmark or connection. The client could use this value in a history of
+valid connections to make a request for an updated TTL, a new IP
+address, and/or a trigger for replacing the name with a new string.
+This would allow for a change in address, phone number, area code, etc.
+on the part of the provider. Just as the domain name was an abstraction
+layer over the IP address, the current domain string is an abstraction
+for a future domain string. A routine could prompt a user to change an
+entry in a contact/bookmark list.
Services such as WWW could also
+automatically update links in the content or reflect changes to related
+destinations within the content. In use, the client could compare its
+TTL value to the value at the authority. If the authority has a value
+of zero, the client would update its name and IP address to the new
+pointer returned by the resolver. An electronically updating NNS with
+updating links in content is a product of this system.
+
+This procedure could be applied, for example, to finding the best
+cell phone plan. A user buys a cell plan. The user emails contact links
+to friends and associates. The recipients use their link to dial the
+user. The user determines a new provider would be more advantageous and
+purchases a new plan with a new number. The user sets their old TTL to
+zero in the NNS and creates a new FQDN with the new cell number. Now
+when the recipients use the old string, they are pointed to the new
+string. The string with the new number is updated in each recipient’s
+contact list. The user is not tied to their telephone number and the
+recipients do not need to manually adjust their entries.
+
+Hidden labels and masking would also have to be present at the client.
+A business might have a lot of phone numbers or locations listed on the
+name servers but use a shorter version of the string for making local
+connections. This way all the devices under a group could be combined
+as a single domain name. The future direction of label intelligence and
+the ideas expressed here suggest that there may be numerous ways to
+provide abstraction levels within the label string. Even the IP address
+might be used as an identifier to search for the rest of the domain
+string or an item like the telephone number.
+
+6. IANA Considerations
+
+The focus of the IANA will change considerably. The need to regulate
+name hoarders, to weigh trademark (TM) infringement, and to decide
+whether to implement new TLDs will be greatly reduced.
The IANA might be used to
+determine the relationships between labels as new items are added under
+the requirements that provide for fair and equal addition to the Type
+label.
+
+7. Security Considerations
+
+Spoofing of content is an inherent problem in name resolution, but is
+beyond the scope of this proposal. The suggested ability to update name
+strings at the client also increases the need to provide secure
+communications between the system and the client.
+
+
+References
+
+   [RFC 1034] - "Domain names - concepts and facilities", P.
+   Mockapetris, 11/01/1987.
+
+   [RFC 1035] - "Domain names - implementation and specification", P.
+   Mockapetris, 11/01/1987.
+
+   [RFC 2916] - "E.164 number and DNS", P. Faltstrom, 09/01/2000.
+
+Author's Address
+
+   Michael L. Macgowan Jr.
+   15541 Orchard Ave.
+   Caldwell, ID 83607 USA
+
+
+   Telephone: +1 208.454.1177 (h)
+   FAX: +1 208.455.0439
+   EMail: mmacgowa@yahoo.com
+
+
+Expiration and File Name
+
+   This draft expires in August 2001.
+
+   Its file name is labelmanage.txt
+
+Full Copyright Statement
+
+Copyright (C) The Internet Society (February 2001). All Rights
+Reserved.
+
+This document and translations of it may be copied and furnished to
+others, and derivative works that comment on or otherwise explain it or
+assist in its implementation may be prepared, copied, published and
+distributed, in whole or in part, without restriction of any kind,
+provided that the above copyright notice and this paragraph are
+included on all such copies and derivative works. However, this
+document itself may not be modified in any way, such as by removing the
+copyright notice or references to the Internet Society or other
+Internet organizations, except as needed for the purpose of developing
+Internet standards in which case the procedures for copyrights defined
+in the Internet Standards process must be followed, or as required to
+translate it into languages other than English.
+
+The limited permissions granted above are perpetual and will not be
+revoked by the Internet Society or its successors or assigns. This
+document and the information contained herein is provided on an "AS IS"
+basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE
+DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
+TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT
+INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR
+FITNESS FOR A PARTICULAR PURPOSE.
+Michael L. Macgowan Jr.          February 2001               [Page 17]
+
+Internet Draft     DNS Label Intelligence and Management System
+
+
+