1 76193d7c 2003-09-30 devnull .TH UTF 7
2 76193d7c 2003-09-30 devnull .SH NAME
3 76193d7c 2003-09-30 devnull UTF, Unicode, ASCII, rune \- character set and format
4 76193d7c 2003-09-30 devnull .SH DESCRIPTION
5 76193d7c 2003-09-30 devnull The Plan 9 character set and representation are
6 76193d7c 2003-09-30 devnull based on the Unicode Standard and on the ISO multibyte
7 76193d7c 2003-09-30 devnull .SM UTF-8
8 76193d7c 2003-09-30 devnull encoding (Universal Character
9 76193d7c 2003-09-30 devnull Set Transformation Format, 8 bits wide).
10 76193d7c 2003-09-30 devnull The Unicode Standard represents its characters in 16
11 76193d7c 2003-09-30 devnull bits;
12 76193d7c 2003-09-30 devnull .SM UTF-8
13 76193d7c 2003-09-30 devnull represents such
14 76193d7c 2003-09-30 devnull values in an 8-bit byte stream.
15 76193d7c 2003-09-30 devnull Throughout this manual,
16 76193d7c 2003-09-30 devnull .SM UTF-8
17 76193d7c 2003-09-30 devnull is shortened to
18 76193d7c 2003-09-30 devnull .SM UTF.
19 76193d7c 2003-09-30 devnull .PP
20 76193d7c 2003-09-30 devnull In Plan 9, a
21 76193d7c 2003-09-30 devnull .I rune
22 76193d7c 2003-09-30 devnull is a 16-bit quantity representing a Unicode character.
23 76193d7c 2003-09-30 devnull Internally, programs may store characters as runes.
24 76193d7c 2003-09-30 devnull However, any external manifestation of textual information,
25 76193d7c 2003-09-30 devnull in files or at the interface between programs, uses a
26 76193d7c 2003-09-30 devnull machine-independent, byte-stream encoding called
27 76193d7c 2003-09-30 devnull .SM UTF.
28 76193d7c 2003-09-30 devnull .PP
29 76193d7c 2003-09-30 devnull .SM UTF
30 76193d7c 2003-09-30 devnull is designed so that the 7-bit
31 76193d7c 2003-09-30 devnull .SM ASCII
32 76193d7c 2003-09-30 devnull characters (values hexadecimal 00 to 7F)
33 76193d7c 2003-09-30 devnull appear only as themselves
34 76193d7c 2003-09-30 devnull in the encoding.
35 76193d7c 2003-09-30 devnull Runes with values above 7F appear as sequences of two or more
36 76193d7c 2003-09-30 devnull bytes with values only from 80 to FF.
37 76193d7c 2003-09-30 devnull .PP
38 76193d7c 2003-09-30 devnull The
39 76193d7c 2003-09-30 devnull .SM UTF
40 76193d7c 2003-09-30 devnull encoding of the Unicode Standard is backward compatible with
41 76193d7c 2003-09-30 devnull .SM ASCII\c
42 76193d7c 2003-09-30 devnull :
43 76193d7c 2003-09-30 devnull programs presented only with
44 76193d7c 2003-09-30 devnull .SM ASCII
45 76193d7c 2003-09-30 devnull work on Plan 9
46 76193d7c 2003-09-30 devnull even if not written to deal with
47 76193d7c 2003-09-30 devnull .SM UTF,
48 76193d7c 2003-09-30 devnull as do
49 76193d7c 2003-09-30 devnull programs that deal with uninterpreted byte streams.
50 76193d7c 2003-09-30 devnull However, programs that perform semantic processing on
51 76193d7c 2003-09-30 devnull .SM ASCII
52 76193d7c 2003-09-30 devnull graphic
53 76193d7c 2003-09-30 devnull characters must convert from
54 76193d7c 2003-09-30 devnull .SM UTF
55 76193d7c 2003-09-30 devnull to runes
56 76193d7c 2003-09-30 devnull in order to work properly with non-\c
57 76193d7c 2003-09-30 devnull .SM ASCII
58 76193d7c 2003-09-30 devnull input.
59 76193d7c 2003-09-30 devnull See
60 d32deab1 2020-08-16 rsc .MR rune (3) .
61 76193d7c 2003-09-30 devnull .PP
62 76193d7c 2003-09-30 devnull Letting numbers be binary,
63 76193d7c 2003-09-30 devnull a rune x is converted to a multibyte
64 76193d7c 2003-09-30 devnull .SM UTF
65 76193d7c 2003-09-30 devnull sequence
66 76193d7c 2003-09-30 devnull as follows:
67 76193d7c 2003-09-30 devnull .PP
68 76193d7c 2003-09-30 devnull 01. x in [00000000.0bbbbbbb] → 0bbbbbbb
69 76193d7c 2003-09-30 devnull .br
70 76193d7c 2003-09-30 devnull 10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
71 76193d7c 2003-09-30 devnull .br
72 76193d7c 2003-09-30 devnull 11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
73 76193d7c 2003-09-30 devnull .br
74 76193d7c 2003-09-30 devnull .PP
75 76193d7c 2003-09-30 devnull Conversion 01 provides a one-byte sequence that spans the
76 76193d7c 2003-09-30 devnull .SM ASCII
77 76193d7c 2003-09-30 devnull character set in a compatible way.
78 76193d7c 2003-09-30 devnull Conversions 10 and 11 represent higher-valued characters
79 76193d7c 2003-09-30 devnull as sequences of two or three bytes with the high bit set.
80 76193d7c 2003-09-30 devnull Plan 9 does not support the 4-, 5-, and 6-byte sequences proposed by X/Open.
81 76193d7c 2003-09-30 devnull When there are multiple ways to encode a value, for example rune 0,
82 76193d7c 2003-09-30 devnull the shortest encoding is used.
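.PP
The three conversions can be sketched in C as follows.
This is a minimal illustration under stated assumptions, not the library routine documented in rune(3):
the name rune_to_utf is hypothetical, the rune is passed as an unsigned int
and masked to 16 bits to match the text, and the caller is assumed to supply
a buffer of at least three bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any Plan 9 library): encode one
 * 16-bit rune x into buf and return the number of bytes written.
 * Assumption: buf has room for at least 3 bytes.
 */
int
rune_to_utf(unsigned char *buf, unsigned int x)
{
	assert(buf != NULL);
	x &= 0xFFFF;	/* assumption: runes are 16 bits, as in the text */

	if (x <= 0x7F) {
		/* conversion 01: 7 payload bits, one byte */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if (x <= 0x7FF) {
		/* conversion 10: 11 payload bits, two bytes */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	/* conversion 11: 16 payload bits, three bytes */
	buf[0] = (unsigned char)(0xE0 | (x >> 12));
	buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (x & 0x3F));
	return 3;
}
```

Testing the ranges from smallest to largest selects the shortest encoding, so rune 0 always comes out as the single byte 00.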
83 76193d7c 2003-09-30 devnull .PP
84 76193d7c 2003-09-30 devnull In the inverse mapping,
85 76193d7c 2003-09-30 devnull any sequence except those described above
86 76193d7c 2003-09-30 devnull is incorrect and is converted to rune hexadecimal 0080.
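.PP
The inverse mapping might be sketched in C as shown here.
The name utf_to_rune and the constant BAD_RUNE are hypothetical, and two choices are assumptions
rather than anything stated above: an incorrect sequence consumes exactly one byte,
and encodings longer than the shortest form are treated as incorrect.

```c
#include <assert.h>
#include <stddef.h>

enum { BAD_RUNE = 0x80 };	/* error rune named in the text (hexadecimal 0080) */

/*
 * Hypothetical helper (not the rune(3) routine): decode one rune from
 * the n bytes at s, store it in *r, and return the bytes consumed.
 * Assumption: an incorrect sequence yields BAD_RUNE and consumes 1 byte.
 */
int
utf_to_rune(unsigned int *r, const unsigned char *s, size_t n)
{
	unsigned int c, c1, c2, x;

	assert(r != NULL && s != NULL && n > 0);
	c = s[0];
	if (c < 0x80) {				/* 0bbbbbbb */
		*r = c;
		return 1;
	}
	if (n >= 2 && (s[1] & 0xC0) == 0x80) {
		c1 = s[1] & 0x3F;
		if ((c & 0xE0) == 0xC0) {	/* 110bbbbb, 10bbbbbb */
			x = ((c & 0x1F) << 6) | c1;
			if (x >= 0x80) {	/* reject non-shortest forms */
				*r = x;
				return 2;
			}
		} else if ((c & 0xF0) == 0xE0 && n >= 3 && (s[2] & 0xC0) == 0x80) {
			c2 = s[2] & 0x3F;	/* 1110bbbb, 10bbbbbb, 10bbbbbb */
			x = ((c & 0x0F) << 12) | (c1 << 6) | c2;
			if (x >= 0x800) {	/* reject non-shortest forms */
				*r = x;
				return 3;
			}
		}
	}
	*r = BAD_RUNE;				/* anything else is incorrect */
	return 1;
}
```

A caller would advance its input pointer by the return value and call again until the input is exhausted.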
87 76193d7c 2003-09-30 devnull .SH "SEE ALSO"
88 d32deab1 2020-08-16 rsc .MR ascii (1) ,
89 d32deab1 2020-08-16 rsc .MR tcs (1) ,
90 d32deab1 2020-08-16 rsc .MR rune (3) ,
91 76193d7c 2003-09-30 devnull .IR "The Unicode Standard" .