op public repos

Blob

Date:: Tue Jan 12 09:28:10 2021 UTC
Message:: typos
Actions:: History | Blame | Raw File
1 In one of the recent posts, the one were I was discussing the IRI implementation in gmid, I said that I wasn’t happy with the UTF-8 parser.
2 
3 => /post/iris-are-not-hard.gmi  IRIs are not hard!
4 
5 Since then, I improved the valid_multibyte_utf8 function at least two times, and I’m happy with the current result, but I thought to document here the various “generations” of that function.
6 
7 The purpose of valid_multibyte_utf8 is to tell if a string starts with a valid UTF-8 encoded UNICODE character, and advance the pointer past that glyph.  We’re interested only in U+80 and up, because of the characters in the ASCII range we’ve already taken care of.
8 
9 UTF-8 is a multibyte character encoding for UNICODE text.  Multibyte means that a single UNICODE codepoint can occupy more than one byte; in fact, UTF-8 is also variable-length: a codepoint can span between one and four bytes.
10 
11 UTF-8 is also a nice encoding, easy to parse but, as with everything that is related to UNICODE, with subtle and annoying details.
12 
13 The UNICODE codepoints are encoded as follows:
14 
15 * U+0000 - U+007F: one byte, compatible with ASCII
16 * U+0080 - U+07FF: two bytes
17 * U+0800 - U+D7FF and U+E0000 - U+FFFF: three bytes
18 * U+10000 – U+10FFFF: four bytes
19 
20 Byte-wise, every UTF-8 character starts with either a 0 or with the bit patter 11XXXXXX.  Byte starting with 11 are called UTF-8 “start bytes”, and are followed by up to three byte that starts with 10XXXXXX, called “continuation bytes”.
21 
22 ```
23 one byte:       0.......
24 two bytes:      110..... 10......
25 three bytes:    1110.... 10...... 10......
26 four bytes:     11110... 10...... 10...... 10......
27 ```
28 
29 A first, quick implementation in C of a UTF-8 parser could look something like this:
30 
31 ```
32 #define CONT_BYTE(b) ((b & 0xC0) == 0x80)
33 
34 int
35 valid_multibyte_utf8(struct parser *p)
36 {
37 	/* p->uri is our string */
38 	uint8_t s = *p->uri;
39 
40 	if ((s & 0x80) == 0)
41 		/* return 1 here to accept ASCII */
42 		return 0;
43 
44 	/* 2 bytes seq */
45 	if ((s & 0xE0) == 0xC0)
46 		return CONT_BYTE(*(++p->uri));
47 
48 	/* 3 bytes seq */
49 	if ((s & 0xF0) == 0xE0)
50 		return CONT_BYTE(*(++p->uri))
51 			&& CONT_BYTE(*(++p->uri));
52 
53 	/* 4 bytes seq */
54 	if ((s & 0xF8) == 0xF0)
55 		return CONT_BYTE(*(++p->uri))
56 			&& CONT_BYTE(*(++p->uri))
57 			&& CONT_BYTE(*(++p->uri));
58 
59 	return 0;
60 }
61 ```
62 
63 This reads nice, and seems pretty straightforward.  It checks the first byte, and then the appropriate number of continuation bytes.  It won’t overflow if a codepoint is truncated due to the short-circuit nature of the logical and in C.
64 
65 But it isn’t UNICODE compliant.
66 
67 This parser will happily accept byte sequence that looks like UTF-8 but aren’t valid.  In particular, everything in the range U+D800 - U+DFFF and U+110000 - U+1FFFFF DO NOT contain valid UNICODE codepoints.  But this parser will accept them.
68 
69 An easy “upgrade” could be something like this, that inverts the logic of the ifs and checks that the codepoint we parse is in the correct range:
70 
71 ```
72 int
73 valid_multibyte_utf8(struct parser *p)
74 {
75 	uint32_t c;
76 	uint8_t s;
77 
78 	c = 0;
79 	s = *p->uri;
80 
81 	if ((s & 0xE0) == 0xC0) {
82 		if (!CONT_BYTE(p->uri[1]))
83 			return 0;
84 		c = ((s & 0x1F) << 6) | (p->uri[1] & 0x3F);
85 		p->uri += 1;
86 	} else if ((s & 0xF0) == 0xE0) {
87 		if (!CONT_BYTE(p->uri[1]) ||
88 		    !CONT_BYTE(p->uri[2]))
89 			return 0;
90 		c = (s & 0x0F) << 12
91 			| ((p->uri[1] & 0x3F) << 6)
92 			| ((p->uri[2] & 0x3F));
93 		p->uri += 2;
94 	} else if ((s & 0xF8) == 0xF0) {
95 		if (!CONT_BYTE(p->uri[1]) ||
96 		    !CONT_BYTE(p->uri[2]) ||
97 		    !CONT_BYTE(p->uri[3]))
98 			return 0;
99 		c = (s & 0x07) << 18
100 			| ((p->uri[1] & 0x3F) << 12)
101 			| ((p->uri[2] & 0x3F) << 6)
102 			| ((p->uri[3] & 0x3F));
103 		p->uri += 3;
104 	} else
105 		return 0;
106 
107 	return (((0x080 <= c) && (c <= 0x7FF))
108 	    || (((0x800 <= c) && (c <= 0xFFFF)))
109 	    || (((0x10000 <= c) && (c <= 0x10FFFF))));
110 }
111 ```
112 
113 Oh my, this is starting to become ugly, isn’t it?  Well, at least we can be sure that this handle everything and move on.
114 
115 Except that even this version is not complete.  Sure, we know that we’ve read a valid UNICODE codepoint, but here’s the twist: overlong sequences.
116 
117 In UTF-8 sometimes you can encode the same character in multiple ways.  The classic example, the one that various RFCs mentions, is the case of 0xC080.
118 
119 ``` hexadecimal, binary and decimal visual representations of 0xC080
120  C    0    8    0       hexadecimal
121 1100 0000 1000 0000     binary
122  12   0    8    0       decimal
123 ```
124 
125 This looks like a legit UTF-8 two-byte characters, but it gets encoded to NUL, U+0000.
126 
127 In RFC3629 — “UTF-8, a transformation format of ISO 10646” — explicitly warns implementors about this issue:
128 
129 > Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding invalid sequences may have security consequences or cause other problems.  See Security Considerations (Section 10) below.
130 > — RFC3629, 3. “UTF-8 definition”
131 
132 Well, we could reverse the logic and have the various checks in every if-else branch but, honestly, this is becoming even messier.  Magic numbers everywhere, long checks, etc; if only there were a simpler decoder…
133 
134 Introducing the “Flexible and Economical UTF-8 decoder”, by Björn Höhrmann.
135 
136 => https://bjoern.hoehrmann.de/utf-8/decoder/dfa/  «Flexible and Economical UTF-8 decoder».
137 
138 The decoder goes as follows:
139 ```
140 // Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
141 // See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
142 
143 #define UTF8_ACCEPT 0
144 #define UTF8_REJECT 1
145 
146 static const uint8_t utf8d[] = {
147   /* lots of data */
148 };
149 
150 uint32_t inline
151 decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
152   uint32_t type = utf8d[byte];
153 
154   *codep = (*state != UTF8_ACCEPT) ?
155     (byte & 0x3fu) | (*codep << 6) :
156     (0xff >> type) & (byte);
157 
158   *state = utf8d[256 + *state*16 + type];
159   return *state;
160 }
161 ```
162 
163 The beauty of this decoder lies in the technique used: a state machine.  I love state machines: they’re easy to design and reason about, fun and compact to implement and require only a small fixed amount of resources.
164 
165 valid_multibyte_utf8 can now be built on top of decode easily as follows:
166 ```
167 int
168 valid_multibyte_utf8(struct parser *p)
169 {
170 	uint32_t cp = 0, state = 0;
171 
172         for (; *p->uri; p->uri++)
173 		if (!utf8_decode(&state, &cp, *p->uri))
174 			break;
175 
176 	/* reject also the ASCII range */
177         return !state && cp > 0x7F
178 }
179 ```
180 
181 That’s all about it.  I found interesting to study these various techniques to decode UTF-8.  Also, if you don’t know the story behind how Ken Thompson designed it on a placemat, go read it, it’s fascinating!
182 
183 => https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt UTF-8 history