.TH UTF 7
.SH NAME
UTF, Unicode, ASCII, rune \- character set and format
.SH DESCRIPTION
The Plan 9 character set and representation are
based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
The Unicode Standard represents its characters in 16
bits;
.SM UTF-8
represents such
values in an 8-bit byte stream.
Throughout this manual,
.SM UTF-8
is shortened to
.SM UTF.
.PP
In Plan 9, a
.I rune
is a 16-bit quantity representing a Unicode character.
Internally, programs may store characters as runes.
However, any external manifestation of textual information,
in files or at the interface between programs, uses a
machine-independent, byte-stream encoding called
.SM UTF.
.PP
.SM UTF
is designed so that the characters of the 7-bit
.SM ASCII
set (values hexadecimal 00 to 7F)
appear only as themselves
in the encoding.
Runes with values above 7F appear as sequences of two or more
bytes with values only from 80 to FF.
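.PP
One consequence of this design is that a byte-wise search for an
ASCII character cannot match inside a multibyte sequence.
The following is a minimal sketch of that idea;
the function name countascii is hypothetical and is not part of any Plan 9 library.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count occurrences of the ASCII byte c
 * (00 to 7F) in the NUL-terminated UTF string s.  No decoding is
 * needed, because bytes of multibyte sequences are all 80 to FF.
 */
size_t
countascii(const char *s, char c)
{
	size_t n = 0;

	assert(s != NULL && (unsigned char)c < 0x80);
	for (; *s != '\0'; s++)
		if (*s == c)
			n++;
	return n;
}
```

For example, counting '/' in a path whose components contain non-ASCII
runes gives the same answer as counting after conversion to runes.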
.PP
The
.SM UTF
encoding of the Unicode Standard is backward compatible with
.SM ASCII\c
:
programs presented only with
.SM ASCII
work on Plan 9
even if not written to deal with
.SM UTF,
as do
programs that deal with uninterpreted byte streams.
However, programs that perform semantic processing on
.SM ASCII
graphic
characters must convert from
.SM UTF
to runes
in order to work properly with non-\c
.SM ASCII
input.
See
.IR rune (3).
.PP
Letting numbers be binary,
a rune x is converted to a multibyte
.SM UTF
sequence
as follows:
.PP
01. x in [00000000.0bbbbbbb] → 0bbbbbbb
.br
10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
.br
11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
.br
.PP
Conversion 01 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
Conversions 10 and 11 represent higher-valued characters
as sequences of two or three bytes with the high bit set.
Plan 9 does not support the 4-, 5-, and 6-byte sequences proposed by X/Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
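.PP
The three conversions above can be sketched in C as follows.
This is an illustration under stated assumptions, not the
.IR rune (3)
interface: the name utfencode and its signature are hypothetical,
and a rune is modeled as an unsigned short on the assumption that
this type holds 16 bits.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: encode the 16-bit rune x into buf, which
 * must have room for 3 bytes, and return the number of bytes
 * written.  The shortest applicable conversion is always chosen.
 */
int
utfencode(unsigned short x, unsigned char *buf)
{
	assert(buf != NULL);
	if (x <= 0x7F) {
		/* conversion 01: 0bbbbbbb */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if (x <= 0x7FF) {
		/* conversion 10: 110bbbbb, 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	/* conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | (x >> 12));
	buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (x & 0x3F));
	return 3;
}
```

With this sketch, rune hexadecimal 00E9 becomes the two bytes C3 A9,
and rune hexadecimal 20AC becomes the three bytes E2 82 AC.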
.PP
In the inverse mapping,
any sequence except those described above
is incorrect and is converted to rune hexadecimal 0080.
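.PP
A possible shape for the inverse mapping is sketched below.
The name utfdecode, its signature, and the constant name BadRune are hypothetical;
see
.IR rune (3)
for the real library routines.
Two assumptions are made here and marked in the comments:
an incorrect sequence consumes exactly one byte,
and a sequence that is not the shortest encoding of its value is treated as incorrect.

```c
#include <assert.h>
#include <stddef.h>

enum { BadRune = 0x0080 };	/* the rune substituted for incorrect input */

/*
 * Hypothetical helper: decode one rune from the n bytes at s,
 * store it in *r, and return the number of bytes consumed.
 * Assumption: incorrect input yields BadRune and consumes 1 byte.
 */
int
utfdecode(const unsigned char *s, size_t n, unsigned short *r)
{
	unsigned short x;

	assert(s != NULL && r != NULL && n > 0);
	if (s[0] < 0x80) {
		/* 0bbbbbbb */
		*r = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0) {
		/* 110bbbbb, 10bbbbbb */
		if (n >= 2 && (s[1] & 0xC0) == 0x80) {
			x = (unsigned short)(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
			if (x > 0x7F) {	/* assumption: reject non-shortest forms */
				*r = x;
				return 2;
			}
		}
	} else if ((s[0] & 0xF0) == 0xE0) {
		/* 1110bbbb, 10bbbbbb, 10bbbbbb */
		if (n >= 3 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
			x = (unsigned short)(((s[0] & 0x0F) << 12)
				| ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
			if (x > 0x7FF) {	/* assumption: reject non-shortest forms */
				*r = x;
				return 3;
			}
		}
	}
	*r = BadRune;
	return 1;
}
```

Under these assumptions the two-byte sequence C0 80, a longer
encoding of rune 0, decodes to rune hexadecimal 0080,
as does a stray continuation byte or a truncated sequence.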
.SH "SEE ALSO"
.IR ascii (1),
.IR tcs (1),
.IR rune (3),
.IR "The Unicode Standard" .