3 78e51a8c 2005-01-14 devnull UTF, Unicode, ASCII, rune \- character set and format
4 78e51a8c 2005-01-14 devnull .SH DESCRIPTION
5 78e51a8c 2005-01-14 devnull The Plan 9 character set and representation are
6 78e51a8c 2005-01-14 devnull based on the Unicode Standard and on the ISO multibyte
8 78e51a8c 2005-01-14 devnull encoding (Universal Character
9 78e51a8c 2005-01-14 devnull Set Transformation Format, 8 bits wide).
10 78e51a8c 2005-01-14 devnull The Unicode Standard represents its characters in 16
12 78e51a8c 2005-01-14 devnull .SM UTF-8
13 78e51a8c 2005-01-14 devnull represents such
14 78e51a8c 2005-01-14 devnull values in an 8-bit byte stream.
15 78e51a8c 2005-01-14 devnull Throughout this manual,
16 78e51a8c 2005-01-14 devnull .SM UTF-8
17 78e51a8c 2005-01-14 devnull is shortened to
20 78e51a8c 2005-01-14 devnull In Plan 9, a
22 78e51a8c 2005-01-14 devnull is a 16-bit quantity representing a Unicode character.
23 78e51a8c 2005-01-14 devnull Internally, programs may store characters as runes.
24 78e51a8c 2005-01-14 devnull However, any external manifestation of textual information,
25 78e51a8c 2005-01-14 devnull in files or at the interface between programs, uses a
26 78e51a8c 2005-01-14 devnull machine-independent, byte-stream encoding called
30 78e51a8c 2005-01-14 devnull is designed so the 7-bit
31 78e51a8c 2005-01-14 devnull .SM ASCII
32 78e51a8c 2005-01-14 devnull set (values hexadecimal 00 to 7F)
33 78e51a8c 2005-01-14 devnull appears only as itself
34 78e51a8c 2005-01-14 devnull in the encoding.
35 78e51a8c 2005-01-14 devnull Runes with values above 7F appear as sequences of two or more
36 78e51a8c 2005-01-14 devnull bytes with values only from 80 to FF.
40 78e51a8c 2005-01-14 devnull encoding of the Unicode Standard is backward compatible with
41 78e51a8c 2005-01-14 devnull .SM ASCII\c
43 78e51a8c 2005-01-14 devnull programs presented only with
44 78e51a8c 2005-01-14 devnull .SM ASCII
45 78e51a8c 2005-01-14 devnull work on Plan 9
46 78e51a8c 2005-01-14 devnull even if not written to deal with
49 78e51a8c 2005-01-14 devnull programs that deal with uninterpreted byte streams.
50 78e51a8c 2005-01-14 devnull However, programs that perform semantic processing on
51 78e51a8c 2005-01-14 devnull .SM ASCII
53 78e51a8c 2005-01-14 devnull characters must convert from
56 78e51a8c 2005-01-14 devnull in order to work properly with non-\c
57 78e51a8c 2005-01-14 devnull .SM ASCII
60 78e51a8c 2005-01-14 devnull .IR rune (3).
62 78e51a8c 2005-01-14 devnull Letting numbers be binary,
63 78e51a8c 2005-01-14 devnull a rune x is converted to a multibyte
66 78e51a8c 2005-01-14 devnull as follows:
68 78e51a8c 2005-01-14 devnull 01. x in [00000000.0bbbbbbb] → 0bbbbbbb
70 78e51a8c 2005-01-14 devnull 10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
72 78e51a8c 2005-01-14 devnull 11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
75 78e51a8c 2005-01-14 devnull Conversion 01 provides a one-byte sequence that spans the
76 78e51a8c 2005-01-14 devnull .SM ASCII
77 78e51a8c 2005-01-14 devnull character set in a compatible way.
78 78e51a8c 2005-01-14 devnull Conversions 10 and 11 represent higher-valued characters
79 78e51a8c 2005-01-14 devnull as sequences of two or three bytes with the high bit set.
80 78e51a8c 2005-01-14 devnull Plan 9 does not support the 4-, 5-, and 6-byte sequences proposed by X/Open.
81 78e51a8c 2005-01-14 devnull When there are multiple ways to encode a value, for example rune 0,
82 78e51a8c 2005-01-14 devnull the shortest encoding is used.
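The three conversions can be sketched in C as follows.
This is an illustration only, not the library implementation described in
.IR rune (3);
the function name
.B encode_rune
and its calling convention are hypothetical, and the rune is assumed to fit in 16 bits.
Testing the ranges from smallest to largest selects the shortest encoding.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of rune(3)): encode the 16-bit rune x
 * into buf, which must have room for 3 bytes.  Returns the number of
 * bytes written.  Ranges are tested smallest first, so the shortest
 * encoding is always chosen.
 */
static int
encode_rune(unsigned int x, unsigned char *buf)
{
	assert(buf != NULL);
	x &= 0xFFFF;			/* assumption: runes are 16 bits wide */

	if (x <= 0x7F) {		/* conversion 01: 0bbbbbbb */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if (x <= 0x7FF) {		/* conversion 10: 110bbbbb 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	/* conversion 11: 1110bbbb 10bbbbbb 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | (x >> 12));
	buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (x & 0x3F));
	return 3;
}
```

Under this sketch, rune hexadecimal 41 yields the single byte 41,
rune 00E9 yields the two bytes C3 A9,
and rune 20AC yields the three bytes E2 82 AC.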
84 78e51a8c 2005-01-14 devnull In the inverse mapping,
85 78e51a8c 2005-01-14 devnull any sequence except those described above
86 78e51a8c 2005-01-14 devnull is incorrect and is converted to rune hexadecimal 0080.
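A minimal sketch of the inverse mapping in C, under stated assumptions:
the names
.B decode_rune
and
.B BAD_RUNE
are hypothetical, and consuming exactly one byte for an incorrect sequence
is a choice made for this example, not something this page specifies.
Encodings longer than necessary are treated as incorrect,
in keeping with the shortest-encoding rule.

```c
#include <assert.h>
#include <stddef.h>

enum { BAD_RUNE = 0x0080 };	/* hypothetical name for the error rune */

/*
 * Hypothetical helper (not part of rune(3)): decode one rune from the
 * n bytes at s into *r.  Returns the number of bytes consumed.
 * Any sequence other than the three described forms, including an
 * over-long or truncated one, yields BAD_RUNE and consumes one byte
 * (the one-byte skip is an assumption of this sketch).
 */
static int
decode_rune(const unsigned char *s, size_t n, unsigned int *r)
{
	unsigned int c0, c1, x;

	assert(s != NULL && r != NULL && n > 0);
	c0 = s[0];

	if (c0 < 0x80) {			/* 0bbbbbbb */
		*r = c0;
		return 1;
	}
	if (n >= 2 && (s[1] & 0xC0) == 0x80) {
		c1 = s[1] & 0x3F;
		if ((c0 & 0xE0) == 0xC0) {	/* 110bbbbb 10bbbbbb */
			x = ((c0 & 0x1F) << 6) | c1;
			if (x >= 0x80) {	/* reject over-long forms */
				*r = x;
				return 2;
			}
		} else if ((c0 & 0xF0) == 0xE0 && n >= 3
		    && (s[2] & 0xC0) == 0x80) {	/* 1110bbbb 10bbbbbb 10bbbbbb */
			x = ((c0 & 0x0F) << 12) | (c1 << 6) | (s[2] & 0x3F);
			if (x >= 0x800) {	/* reject over-long forms */
				*r = x;
				return 3;
			}
		}
	}
	*r = BAD_RUNE;
	return 1;
}
```

With this sketch the two-byte sequence C0 80, an over-long form of rune 0,
decodes to rune 0080, as does a three-byte sequence cut short after two bytes.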
87 78e51a8c 2005-01-14 devnull .SH "SEE ALSO"
88 78e51a8c 2005-01-14 devnull .IR ascii (1),
89 78e51a8c 2005-01-14 devnull .IR tcs (1),
90 78e51a8c 2005-01-14 devnull .IR rune (3),
91 78e51a8c 2005-01-14 devnull .IR "The Unicode Standard" .