Blob


1 In one of the recent posts, the one were I was discussing the IRI implementation in gmid, I said that I wasn’t happy with the UTF-8 parser.
3 => /post/iris-are-not-hard.gmi IRIs are not hard!
5 Since then, I improved the valid_multibyte_utf8 function at least two times, and I’m happy with the current result, but I thought to document here the various “generations” of that function.
7 The purpose of valid_multibyte_utf8 is to tell if a string starts with a valid UTF-8 encoded UNICODE character, and advance the pointer past that glyph. We’re interested only in U+80 and up, because of the characters in the ASCII range we’ve already taken care of.
9 UTF-8 is a multibyte character encoding for UNICODE text. Multibyte means that a single UNICODE codepoint can occupy more than one byte; in fact, UTF-8 is also variable-length: a codepoint can span between one and four bytes.
11 UTF-8 is also a nice encoding, easy to parse but, as with everything that is related to UNICODE, with subtle and annoying details.
13 The UNICODE codepoints are encoded as follows:
15 * U+0000 - U+007F: one byte, compatible with ASCII
16 * U+0080 - U+07FF: two bytes
17 * U+0800 - U+D7FF and U+E0000 - U+FFFF: three bytes
18 * U+10000 – U+10FFFF: four bytes
20 Byte-wise, every UTF-8 character starts with either a 0 or with the bit patter 11XXXXXX. Byte starting with 11 are called UTF-8 “start bytes”, and are followed by up to three byte that starts with 10XXXXXX, called “continuation bytes”.
22 ```
23 one byte: 0.......
24 two bytes: 110..... 10......
25 three bytes: 1110.... 10...... 10......
26 four bytes: 11110... 10...... 10...... 10......
27 ```
29 A first, quick implementation in C of a UTF-8 parser could look something like this:
31 ```
32 #define CONT_BYTE(b) ((b & 0xC0) == 0x80)
34 int
35 valid_multibyte_utf8(struct parser *p)
36 {
37 /* p->uri is our string */
38 uint8_t s = *p->uri;
40 if ((s & 0x80) == 0)
41 /* return 1 here to accept ASCII */
42 return 0;
44 /* 2 bytes seq */
45 if ((s & 0xE0) == 0xC0)
46 return CONT_BYTE(*(++p->uri));
48 /* 3 bytes seq */
49 if ((s & 0xF0) == 0xE0)
50 return CONT_BYTE(*(++p->uri))
51 && CONT_BYTE(*(++p->uri));
53 /* 4 bytes seq */
54 if ((s & 0xF8) == 0xF0)
55 return CONT_BYTE(*(++p->uri))
56 && CONT_BYTE(*(++p->uri))
57 && CONT_BYTE(*(++p->uri));
59 return 0;
60 }
61 ```
63 This reads nice, and seems pretty straightforward. It checks the first byte, and then the appropriate number of continuation bytes. It won’t overflow if a codepoint is truncated due to the short-circuit nature of the logical and in C.
65 But it isn’t UNICODE compliant.
67 This parser will happily accept byte sequence that looks like UTF-8 but aren’t valid. In particular, everything in the range U+D800 - U+DFFF and U+110000 - U+1FFFFF DO NOT contain valid UNICODE codepoints. But this parser will accept them.
69 An easy “upgrade” could be something like this, that inverts the logic of the ifs and checks that the codepoint we parse is in the correct range:
71 ```
72 int
73 valid_multibyte_utf8(struct parser *p)
74 {
75 uint32_t c;
76 uint8_t s;
78 c = 0;
79 s = *p->uri;
81 if ((s & 0xE0) == 0xC0) {
82 if (!CONT_BYTE(p->uri[1]))
83 return 0;
84 c = ((s & 0x1F) << 6) | (p->uri[1] & 0x3F);
85 p->uri += 1;
86 } else if ((s & 0xF0) == 0xE0) {
87 if (!CONT_BYTE(p->uri[1]) ||
88 !CONT_BYTE(p->uri[2]))
89 return 0;
90 c = (s & 0x0F) << 12
91 | ((p->uri[1] & 0x3F) << 6)
92 | ((p->uri[2] & 0x3F));
93 p->uri += 2;
94 } else if ((s & 0xF8) == 0xF0) {
95 if (!CONT_BYTE(p->uri[1]) ||
96 !CONT_BYTE(p->uri[2]) ||
97 !CONT_BYTE(p->uri[3]))
98 return 0;
99 c = (s & 0x07) << 18
100 | ((p->uri[1] & 0x3F) << 12)
101 | ((p->uri[2] & 0x3F) << 6)
102 | ((p->uri[3] & 0x3F));
103 p->uri += 3;
104 } else
105 return 0;
107 return (((0x080 <= c) && (c <= 0x7FF))
108 || (((0x800 <= c) && (c <= 0xFFFF)))
109 || (((0x10000 <= c) && (c <= 0x10FFFF))));
111 ```
113 Oh my, this is starting to become ugly, isn’t it? Well, at least we can be sure that this handle everything and move on.
115 Except that even this version is not complete. Sure, we know that we’ve read a valid UNICODE codepoint, but here’s the twist: overlong sequences.
117 In UTF-8 sometimes you can encode the same character in multiple ways. The classic example, the one that various RFCs mentions, is the case of 0xC080.
119 ``` hexadecimal, binary and decimal visual representations of 0xC080
120 C 0 8 0 hexadecimal
121 1100 0000 1000 0000 binary
122 12 0 8 0 decimal
123 ```
125 This looks like a legit UTF-8 two-byte characters, but it gets encoded to NUL, U+0000.
127 In RFC3629 — “UTF-8, a transformation format of ISO 10646” — explicitly warns implementors about this issue:
129 > Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.
130 > — RFC3629, 3. “UTF-8 definition”
132 Well, we could reverse the logic and have the various checks in every if-else branch but, honestly, this is becoming even messier. Magic numbers everywhere, long checks, etc; if only there were a simpler decoder…
134 Introducing the “Flexible and Economical UTF-8 decoder”, by Björn Höhrmann.
136 => https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ «Flexible and Economical UTF-8 decoder».
138 The decoder goes as follows:
139 ```
140 // Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
141 // See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
143 #define UTF8_ACCEPT 0
144 #define UTF8_REJECT 1
146 static const uint8_t utf8d[] = {
147 /* lots of data */
148 };
150 uint32_t inline
151 decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
152 uint32_t type = utf8d[byte];
154 *codep = (*state != UTF8_ACCEPT) ?
155 (byte & 0x3fu) | (*codep << 6) :
156 (0xff >> type) & (byte);
158 *state = utf8d[256 + *state*16 + type];
159 return *state;
161 ```
163 The beauty of this decoder lies in the technique used: a state machine. I love state machines: they’re easy to design and reason about, fun and compact to implement and require only a small fixed amount of resources.
165 valid_multibyte_utf8 can now be built on top of decode easily as follows:
166 ```
167 int
168 valid_multibyte_utf8(struct parser *p)
170 uint32_t cp = 0, state = 0;
172 for (; *p->uri; p->uri++)
173 if (!utf8_decode(&state, &cp, *p->uri))
174 break;
176 /* reject also the ASCII range */
177 return !state && cp > 0x7F
179 ```
181 That’s all about it. I found interesting to study these various techniques to decode UTF-8. Also, if you don’t know the story behind how Ken Thompson designed it on a placemat, go read it, it’s fascinating!
183 => https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt UTF-8 history