op public repos

Blame

Date:: Tue Jan 12 09:28:10 2021 UTC
Message:: typos
Actions:: History | Blob | Raw File
  1 5c5b700d 2021-01-11 op In one of the recent posts, the one were I was discussing the IRI implementation in gmid, I said that I wasn’t happy with the UTF-8 parser.
  2 5c5b700d 2021-01-11 op 
  3 5c5b700d 2021-01-11 op => /post/iris-are-not-hard.gmi  IRIs are not hard!
  4 5c5b700d 2021-01-11 op 
  5 9a959102 2021-01-12 op Since then, I improved the valid_multibyte_utf8 function at least two times, and I’m happy with the current result, but I thought to document here the various “generations” of that function.
  6 5c5b700d 2021-01-11 op 
  7 5c5b700d 2021-01-11 op The purpose of valid_multibyte_utf8 is to tell if a string starts with a valid UTF-8 encoded UNICODE character, and advance the pointer past that glyph.  We’re interested only in U+80 and up, because of the characters in the ASCII range we’ve already taken care of.
  8 5c5b700d 2021-01-11 op 
  9 5c5b700d 2021-01-11 op UTF-8 is a multibyte character encoding for UNICODE text.  Multibyte means that a single UNICODE codepoint can occupy more than one byte; in fact, UTF-8 is also variable-length: a codepoint can span between one and four bytes.
 10 5c5b700d 2021-01-11 op 
 11 5c5b700d 2021-01-11 op UTF-8 is also a nice encoding, easy to parse but, as with everything that is related to UNICODE, with subtle and annoying details.
 12 5c5b700d 2021-01-11 op 
 13 5c5b700d 2021-01-11 op The UNICODE codepoints are encoded as follows:
 14 5c5b700d 2021-01-11 op 
 15 5c5b700d 2021-01-11 op * U+0000 - U+007F: one byte, compatible with ASCII
 16 5c5b700d 2021-01-11 op * U+0080 - U+07FF: two bytes
 17 5c5b700d 2021-01-11 op * U+0800 - U+D7FF and U+E0000 - U+FFFF: three bytes
 18 5c5b700d 2021-01-11 op * U+10000 – U+10FFFF: four bytes
 19 5c5b700d 2021-01-11 op 
 20 5c5b700d 2021-01-11 op Byte-wise, every UTF-8 character starts with either a 0 or with the bit patter 11XXXXXX.  Byte starting with 11 are called UTF-8 “start bytes”, and are followed by up to three byte that starts with 10XXXXXX, called “continuation bytes”.
 21 5c5b700d 2021-01-11 op 
 22 5c5b700d 2021-01-11 op ```
 23 5c5b700d 2021-01-11 op one byte:       0.......
 24 5c5b700d 2021-01-11 op two bytes:      110..... 10......
 25 5c5b700d 2021-01-11 op three bytes:    1110.... 10...... 10......
 26 5c5b700d 2021-01-11 op four bytes:     11110... 10...... 10...... 10......
 27 5c5b700d 2021-01-11 op ```
 28 5c5b700d 2021-01-11 op 
 29 5c5b700d 2021-01-11 op A first, quick implementation in C of a UTF-8 parser could look something like this:
 30 5c5b700d 2021-01-11 op 
 31 5c5b700d 2021-01-11 op ```
 32 5c5b700d 2021-01-11 op #define CONT_BYTE(b) ((b & 0xC0) == 0x80)
 33 5c5b700d 2021-01-11 op 
 34 5c5b700d 2021-01-11 op int
 35 5c5b700d 2021-01-11 op valid_multibyte_utf8(struct parser *p)
 36 5c5b700d 2021-01-11 op {
 37 5c5b700d 2021-01-11 op 	/* p->uri is our string */
 38 5c5b700d 2021-01-11 op 	uint8_t s = *p->uri;
 39 5c5b700d 2021-01-11 op 
 40 5c5b700d 2021-01-11 op 	if ((s & 0x80) == 0)
 41 5c5b700d 2021-01-11 op 		/* return 1 here to accept ASCII */
 42 5c5b700d 2021-01-11 op 		return 0;
 43 5c5b700d 2021-01-11 op 
 44 5c5b700d 2021-01-11 op 	/* 2 bytes seq */
 45 5c5b700d 2021-01-11 op 	if ((s & 0xE0) == 0xC0)
 46 5c5b700d 2021-01-11 op 		return CONT_BYTE(*(++p->uri));
 47 5c5b700d 2021-01-11 op 
 48 5c5b700d 2021-01-11 op 	/* 3 bytes seq */
 49 5c5b700d 2021-01-11 op 	if ((s & 0xF0) == 0xE0)
 50 5c5b700d 2021-01-11 op 		return CONT_BYTE(*(++p->uri))
 51 5c5b700d 2021-01-11 op 			&& CONT_BYTE(*(++p->uri));
 52 5c5b700d 2021-01-11 op 
 53 5c5b700d 2021-01-11 op 	/* 4 bytes seq */
 54 5c5b700d 2021-01-11 op 	if ((s & 0xF8) == 0xF0)
 55 5c5b700d 2021-01-11 op 		return CONT_BYTE(*(++p->uri))
 56 5c5b700d 2021-01-11 op 			&& CONT_BYTE(*(++p->uri))
 57 5c5b700d 2021-01-11 op 			&& CONT_BYTE(*(++p->uri));
 58 5c5b700d 2021-01-11 op 
 59 5c5b700d 2021-01-11 op 	return 0;
 60 5c5b700d 2021-01-11 op }
 61 5c5b700d 2021-01-11 op ```
 62 5c5b700d 2021-01-11 op 
 63 5c5b700d 2021-01-11 op This reads nice, and seems pretty straightforward.  It checks the first byte, and then the appropriate number of continuation bytes.  It won’t overflow if a codepoint is truncated due to the short-circuit nature of the logical and in C.
 64 5c5b700d 2021-01-11 op 
 65 5c5b700d 2021-01-11 op But it isn’t UNICODE compliant.
 66 5c5b700d 2021-01-11 op 
 67 5c5b700d 2021-01-11 op This parser will happily accept byte sequence that looks like UTF-8 but aren’t valid.  In particular, everything in the range U+D800 - U+DFFF and U+110000 - U+1FFFFF DO NOT contain valid UNICODE codepoints.  But this parser will accept them.
 68 5c5b700d 2021-01-11 op 
 69 5c5b700d 2021-01-11 op An easy “upgrade” could be something like this, that inverts the logic of the ifs and checks that the codepoint we parse is in the correct range:
 70 5c5b700d 2021-01-11 op 
 71 5c5b700d 2021-01-11 op ```
 72 5c5b700d 2021-01-11 op int
 73 5c5b700d 2021-01-11 op valid_multibyte_utf8(struct parser *p)
 74 5c5b700d 2021-01-11 op {
 75 5c5b700d 2021-01-11 op 	uint32_t c;
 76 5c5b700d 2021-01-11 op 	uint8_t s;
 77 5c5b700d 2021-01-11 op 
 78 5c5b700d 2021-01-11 op 	c = 0;
 79 5c5b700d 2021-01-11 op 	s = *p->uri;
 80 5c5b700d 2021-01-11 op 
 81 5c5b700d 2021-01-11 op 	if ((s & 0xE0) == 0xC0) {
 82 5c5b700d 2021-01-11 op 		if (!CONT_BYTE(p->uri[1]))
 83 5c5b700d 2021-01-11 op 			return 0;
 84 5c5b700d 2021-01-11 op 		c = ((s & 0x1F) << 6) | (p->uri[1] & 0x3F);
 85 5c5b700d 2021-01-11 op 		p->uri += 1;
 86 5c5b700d 2021-01-11 op 	} else if ((s & 0xF0) == 0xE0) {
 87 5c5b700d 2021-01-11 op 		if (!CONT_BYTE(p->uri[1]) ||
 88 5c5b700d 2021-01-11 op 		    !CONT_BYTE(p->uri[2]))
 89 5c5b700d 2021-01-11 op 			return 0;
 90 5c5b700d 2021-01-11 op 		c = (s & 0x0F) << 12
 91 5c5b700d 2021-01-11 op 			| ((p->uri[1] & 0x3F) << 6)
 92 5c5b700d 2021-01-11 op 			| ((p->uri[2] & 0x3F));
 93 5c5b700d 2021-01-11 op 		p->uri += 2;
 94 5c5b700d 2021-01-11 op 	} else if ((s & 0xF8) == 0xF0) {
 95 5c5b700d 2021-01-11 op 		if (!CONT_BYTE(p->uri[1]) ||
 96 5c5b700d 2021-01-11 op 		    !CONT_BYTE(p->uri[2]) ||
 97 5c5b700d 2021-01-11 op 		    !CONT_BYTE(p->uri[3]))
 98 5c5b700d 2021-01-11 op 			return 0;
 99 5c5b700d 2021-01-11 op 		c = (s & 0x07) << 18
100 5c5b700d 2021-01-11 op 			| ((p->uri[1] & 0x3F) << 12)
101 5c5b700d 2021-01-11 op 			| ((p->uri[2] & 0x3F) << 6)
102 5c5b700d 2021-01-11 op 			| ((p->uri[3] & 0x3F));
103 5c5b700d 2021-01-11 op 		p->uri += 3;
104 5c5b700d 2021-01-11 op 	} else
105 5c5b700d 2021-01-11 op 		return 0;
106 5c5b700d 2021-01-11 op 
107 5c5b700d 2021-01-11 op 	return (((0x080 <= c) && (c <= 0x7FF))
108 5c5b700d 2021-01-11 op 	    || (((0x800 <= c) && (c <= 0xFFFF)))
109 5c5b700d 2021-01-11 op 	    || (((0x10000 <= c) && (c <= 0x10FFFF))));
110 5c5b700d 2021-01-11 op }
111 5c5b700d 2021-01-11 op ```
112 5c5b700d 2021-01-11 op 
113 5c5b700d 2021-01-11 op Oh my, this is starting to become ugly, isn’t it?  Well, at least we can be sure that this handle everything and move on.
114 5c5b700d 2021-01-11 op 
115 9a959102 2021-01-12 op Except that even this version is not complete.  Sure, we know that we’ve read a valid UNICODE codepoint, but here’s the twist: overlong sequences.
116 5c5b700d 2021-01-11 op 
117 5c5b700d 2021-01-11 op In UTF-8 sometimes you can encode the same character in multiple ways.  The classic example, the one that various RFCs mentions, is the case of 0xC080.
118 5c5b700d 2021-01-11 op 
119 5c5b700d 2021-01-11 op ``` hexadecimal, binary and decimal visual representations of 0xC080
120 5c5b700d 2021-01-11 op  C    0    8    0       hexadecimal
121 5c5b700d 2021-01-11 op 1100 0000 1000 0000     binary
122 5c5b700d 2021-01-11 op  12   0    8    0       decimal
123 5c5b700d 2021-01-11 op ```
124 5c5b700d 2021-01-11 op 
125 5c5b700d 2021-01-11 op This looks like a legit UTF-8 two-byte characters, but it gets encoded to NUL, U+0000.
126 5c5b700d 2021-01-11 op 
127 5c5b700d 2021-01-11 op In RFC3629 — “UTF-8, a transformation format of ISO 10646” — explicitly warns implementors about this issue:
128 5c5b700d 2021-01-11 op 
129 5c5b700d 2021-01-11 op > Implementations of the decoding algorithm above MUST protect against decoding invalid sequences.  For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding invalid sequences may have security consequences or cause other problems.  See Security Considerations (Section 10) below.
130 5c5b700d 2021-01-11 op > — RFC3629, 3. “UTF-8 definition”
131 5c5b700d 2021-01-11 op 
132 5c5b700d 2021-01-11 op Well, we could reverse the logic and have the various checks in every if-else branch but, honestly, this is becoming even messier.  Magic numbers everywhere, long checks, etc; if only there were a simpler decoder…
133 5c5b700d 2021-01-11 op 
134 5c5b700d 2021-01-11 op Introducing the “Flexible and Economical UTF-8 decoder”, by Björn Höhrmann.
135 5c5b700d 2021-01-11 op 
136 5c5b700d 2021-01-11 op => https://bjoern.hoehrmann.de/utf-8/decoder/dfa/  «Flexible and Economical UTF-8 decoder».
137 5c5b700d 2021-01-11 op 
138 5c5b700d 2021-01-11 op The decoder goes as follows:
139 5c5b700d 2021-01-11 op ```
140 5c5b700d 2021-01-11 op // Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
141 5c5b700d 2021-01-11 op // See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
142 5c5b700d 2021-01-11 op 
143 5c5b700d 2021-01-11 op #define UTF8_ACCEPT 0
144 5c5b700d 2021-01-11 op #define UTF8_REJECT 1
145 5c5b700d 2021-01-11 op 
146 5c5b700d 2021-01-11 op static const uint8_t utf8d[] = {
147 5c5b700d 2021-01-11 op   /* lots of data */
148 5c5b700d 2021-01-11 op };
149 5c5b700d 2021-01-11 op 
150 5c5b700d 2021-01-11 op uint32_t inline
151 5c5b700d 2021-01-11 op decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
152 5c5b700d 2021-01-11 op   uint32_t type = utf8d[byte];
153 5c5b700d 2021-01-11 op 
154 5c5b700d 2021-01-11 op   *codep = (*state != UTF8_ACCEPT) ?
155 5c5b700d 2021-01-11 op     (byte & 0x3fu) | (*codep << 6) :
156 5c5b700d 2021-01-11 op     (0xff >> type) & (byte);
157 5c5b700d 2021-01-11 op 
158 5c5b700d 2021-01-11 op   *state = utf8d[256 + *state*16 + type];
159 5c5b700d 2021-01-11 op   return *state;
160 5c5b700d 2021-01-11 op }
161 5c5b700d 2021-01-11 op ```
162 5c5b700d 2021-01-11 op 
163 5c5b700d 2021-01-11 op The beauty of this decoder lies in the technique used: a state machine.  I love state machines: they’re easy to design and reason about, fun and compact to implement and require only a small fixed amount of resources.
164 5c5b700d 2021-01-11 op 
165 5c5b700d 2021-01-11 op valid_multibyte_utf8 can now be built on top of decode easily as follows:
166 5c5b700d 2021-01-11 op ```
167 5c5b700d 2021-01-11 op int
168 5c5b700d 2021-01-11 op valid_multibyte_utf8(struct parser *p)
169 5c5b700d 2021-01-11 op {
170 5c5b700d 2021-01-11 op 	uint32_t cp = 0, state = 0;
171 5c5b700d 2021-01-11 op 
172 5c5b700d 2021-01-11 op         for (; *p->uri; p->uri++)
173 5c5b700d 2021-01-11 op 		if (!utf8_decode(&state, &cp, *p->uri))
174 5c5b700d 2021-01-11 op 			break;
175 5c5b700d 2021-01-11 op 
176 5c5b700d 2021-01-11 op 	/* reject also the ASCII range */
177 5c5b700d 2021-01-11 op         return !state && cp > 0x7F
178 5c5b700d 2021-01-11 op }
179 5c5b700d 2021-01-11 op ```
180 5c5b700d 2021-01-11 op 
181 5c5b700d 2021-01-11 op That’s all about it.  I found interesting to study these various techniques to decode UTF-8.  Also, if you don’t know the story behind how Ken Thompson designed it on a placemat, go read it, it’s fascinating!
182 5c5b700d 2021-01-11 op 
183 5c5b700d 2021-01-11 op => https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt UTF-8 history