commit 5c5b700d83bd388c33f0986625d7cfd1d3034eb4
from: Omar Polo
date: Mon Jan 11 19:10:38 2021 UTC

new post: parsing UTF-8

commit - c9a53ab91c1fac15cb5745700a97aae0894e0b3f
commit + 5c5b700d83bd388c33f0986625d7cfd1d3034eb4
blob - /dev/null
blob + 19b419c2e086dee905202c2701ab96faa69fa376 (mode 644)
--- /dev/null
+++ resources/posts/parsing-utf8.gmi
@@ -0,0 +1,183 @@
+In one of the recent posts, the one where I was discussing the IRI implementation in gmid, I said that I wasn’t happy with the UTF-8 parser.
+
+=> /post/iris-are-not-hard.gmi IRIs are not hard!
+
+Since then, I’ve improved the valid_multibyte_utf8 function at least two times, and I’m happy with the current result, but I thought I’d document here the various “generations” of that function.
+
+The purpose of valid_multibyte_utf8 is to tell whether a string starts with a valid UTF-8 encoded UNICODE character, and to advance the pointer past that glyph. We’re interested only in U+0080 and up, because we’ve already taken care of the characters in the ASCII range.
+
+UTF-8 is a multibyte character encoding for UNICODE text. Multibyte means that a single UNICODE codepoint can occupy more than one byte; in fact, UTF-8 is also variable-length: a codepoint can span between one and four bytes.
+
+UTF-8 is also a nice encoding, easy to parse, but, as with everything related to UNICODE, it comes with subtle and annoying details.
+
+The UNICODE codepoints are encoded as follows:
+
+* U+0000 - U+007F: one byte, compatible with ASCII
+* U+0080 - U+07FF: two bytes
+* U+0800 - U+D7FF and U+E000 - U+FFFF: three bytes
+* U+10000 - U+10FFFF: four bytes
+
+Byte-wise, every UTF-8 character starts either with a 0 bit or with the bit pattern 11XXXXXX. Bytes starting with 11 are called UTF-8 “start bytes”, and are followed by up to three bytes that start with 10XXXXXX, called “continuation bytes”.
+
+```
+one byte:     0.......
+two bytes:    110..... 10......
+three bytes:  1110.... 10...... 10......
+four bytes:   11110... 10...... 10...... 10......
+```
+
+A first, quick implementation in C of a UTF-8 parser could look something like this:
+
+```
+#define CONT_BYTE(b) ((b & 0xC0) == 0x80)
+
+int
+valid_multibyte_utf8(struct parser *p)
+{
+	/* p->uri is our string */
+	uint8_t s = *p->uri;
+
+	if ((s & 0x80) == 0)
+		/* return 1 here to accept ASCII */
+		return 0;
+
+	/* 2 bytes seq */
+	if ((s & 0xE0) == 0xC0)
+		return CONT_BYTE(*(++p->uri));
+
+	/* 3 bytes seq */
+	if ((s & 0xF0) == 0xE0)
+		return CONT_BYTE(*(++p->uri))
+		    && CONT_BYTE(*(++p->uri));
+
+	/* 4 bytes seq */
+	if ((s & 0xF8) == 0xF0)
+		return CONT_BYTE(*(++p->uri))
+		    && CONT_BYTE(*(++p->uri))
+		    && CONT_BYTE(*(++p->uri));
+
+	return 0;
+}
+```
+
+This reads nicely and seems pretty straightforward: it checks the first byte, and then the appropriate number of continuation bytes. It won’t overflow if a codepoint is truncated, thanks to the short-circuit nature of the logical AND in C.
+
+But it isn’t UNICODE compliant.
+
+This parser will happily accept byte sequences that look like UTF-8 but aren’t valid. In particular, the ranges U+D800 - U+DFFF and U+110000 - U+1FFFFF DO NOT contain valid UNICODE codepoints. But this parser will accept them.
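+
+For instance, here’s a quick throwaway test (just a sketch: the struct parser definition is made up for the example, and the CONT_BYTE macro plus the first valid_multibyte_utf8 above are assumed to be pasted where the comment says) that feeds the parser the would-be encodings of U+D800 and U+1FFFFF:
+
+```
+#include <stdint.h>
+#include <stdio.h>
+
+/* made-up minimal definition, just enough for this test */
+struct parser { const uint8_t *uri; };
+
+/* paste CONT_BYTE and the first valid_multibyte_utf8 from above here */
+
+int
+main(void)
+{
+	/* ED A0 80 would decode to the surrogate U+D800,
+	 * F7 BF BF BF to U+1FFFFF: neither is a valid UNICODE codepoint */
+	struct parser a = { (const uint8_t *)"\xed\xa0\x80" };
+	struct parser b = { (const uint8_t *)"\xf7\xbf\xbf\xbf" };
+
+	printf("%d %d\n", valid_multibyte_utf8(&a), valid_multibyte_utf8(&b));
+	/* prints "1 1": both bogus sequences are accepted */
+	return 0;
+}
+```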
+
+An easy “upgrade” could be something like this, which inverts the logic of the ifs and checks that the codepoint we parsed is in the correct range:
+
+```
+int
+valid_multibyte_utf8(struct parser *p)
+{
+	uint32_t c;
+	uint8_t s;
+
+	c = 0;
+	s = *p->uri;
+
+	if ((s & 0xE0) == 0xC0) {
+		if (!CONT_BYTE(p->uri[1]))
+			return 0;
+		c = ((s & 0x1F) << 6) | (p->uri[1] & 0x3F);
+		p->uri += 1;
+	} else if ((s & 0xF0) == 0xE0) {
+		if (!CONT_BYTE(p->uri[1]) ||
+		    !CONT_BYTE(p->uri[2]))
+			return 0;
+		c = ((s & 0x0F) << 12)
+		    | ((p->uri[1] & 0x3F) << 6)
+		    | ((p->uri[2] & 0x3F));
+		p->uri += 2;
+	} else if ((s & 0xF8) == 0xF0) {
+		if (!CONT_BYTE(p->uri[1]) ||
+		    !CONT_BYTE(p->uri[2]) ||
+		    !CONT_BYTE(p->uri[3]))
+			return 0;
+		c = ((s & 0x07) << 18)
+		    | ((p->uri[1] & 0x3F) << 12)
+		    | ((p->uri[2] & 0x3F) << 6)
+		    | ((p->uri[3] & 0x3F));
+		p->uri += 3;
+	} else
+		return 0;
+
+	/* the surrogates U+D800 - U+DFFF are not valid codepoints */
+	return (((0x080 <= c) && (c <= 0x7FF))
+	    || ((0x800 <= c) && (c <= 0xD7FF))
+	    || ((0xE000 <= c) && (c <= 0xFFFF))
+	    || ((0x10000 <= c) && (c <= 0x10FFFF)));
+}
+```
+
+Oh my, this is starting to become ugly, isn’t it? Well, at least we can be sure that this handles everything, and move on.
+
+Except that even this version is not complete. Sure, now we know we’ve read a valid UNICODE codepoint, but here’s the twist: overlong sequences.
+
+In UTF-8 you can sometimes encode the same character in multiple ways. The classic example, the one that various RFCs mention, is the case of 0xC080.
+
+``` hexadecimal, binary and decimal visual representations of 0xC080
+   C    0    8    0    hexadecimal
+1100 0000 1000 0000    binary
+  12    0    8    0    decimal
+```
+
+This looks like a legit UTF-8 two-byte character, but it decodes to NUL, U+0000.
+
+RFC3629 — “UTF-8, a transformation format of ISO 10646” — explicitly warns implementors about this issue:
+
+> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems. See Security Considerations (Section 10) below.
+> — RFC3629, 3. “UTF-8 definition”
+
+Well, we could reverse the logic and put the various checks in every if-else branch but, honestly, this is becoming even messier: magic numbers everywhere, long checks, and so on. If only there were a simpler decoder…
+
+Introducing the “Flexible and Economical UTF-8 decoder”, by Björn Höhrmann.
+
+=> https://bjoern.hoehrmann.de/utf-8/decoder/dfa/ «Flexible and Economical UTF-8 decoder».
+
+The decoder goes as follows:
+```
+// Copyright (c) 2008-2009 Bjoern Hoehrmann
+// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
+
+#define UTF8_ACCEPT 0
+#define UTF8_REJECT 1
+
+static const uint8_t utf8d[] = {
+	/* lots of data */
+};
+
+uint32_t inline
+decode(uint32_t* state, uint32_t* codep, uint32_t byte) {
+	uint32_t type = utf8d[byte];
+
+	*codep = (*state != UTF8_ACCEPT) ?
+		(byte & 0x3fu) | (*codep << 6) :
+		(0xff >> type) & (byte);
+
+	*state = utf8d[256 + *state*16 + type];
+	return *state;
+}
+```
+
+The beauty of this decoder lies in the technique used: a state machine. I love state machines: they’re easy to design and reason about, fun and compact to implement, and they require only a small, fixed amount of resources.
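+
+To get a feel for how it behaves (again just a sketch: the feed helper is mine, and the full utf8d table from the page above has to be pasted in place of the “lots of data” placeholder), feeding it the overlong sequence C0 80 ends up in UTF8_REJECT, while the proper two-byte encoding of U+0080, C2 80, ends in UTF8_ACCEPT with the right codepoint:
+
+```
+#include <stdint.h>
+#include <stdio.h>
+
+/* UTF8_ACCEPT, UTF8_REJECT, decode() and the FULL utf8d[] table
+ * from above are assumed to be pasted here */
+
+static void
+feed(const char *label, const uint8_t *s, size_t len)
+{
+	uint32_t state = UTF8_ACCEPT, cp = 0;
+	size_t i;
+
+	/* push the bytes through the state machine one at a time */
+	for (i = 0; i < len; i++)
+		decode(&state, &cp, s[i]);
+
+	if (state == UTF8_ACCEPT)
+		printf("%s -> U+%04X\n", label, (unsigned)cp);
+	else
+		printf("%s -> rejected\n", label);
+}
+
+int
+main(void)
+{
+	feed("C2 80", (const uint8_t *)"\xc2\x80", 2);	/* -> U+0080 */
+	feed("C0 80", (const uint8_t *)"\xc0\x80", 2);	/* -> rejected */
+	return 0;
+}
+```
+
+No range checks and no special cases in the caller: the table itself already knows that 0xC0 can never start a valid sequence.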
+
+valid_multibyte_utf8 can now be built on top of decode easily, as follows:
+```
+int
+valid_multibyte_utf8(struct parser *p)
+{
+	uint32_t cp = 0, state = 0;
+
+	for (; *p->uri; p->uri++)
+		if (!decode(&state, &cp, *p->uri))
+			break;
+
+	/* reject also the ASCII range */
+	return !state && cp > 0x7F;
+}
+```
+
+That’s about it. I found it interesting to study these various techniques to decode UTF-8. Also, if you don’t know the story of how Ken Thompson designed it on a placemat, go read it, it’s fascinating!
+
+=> https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt UTF-8 history
blob - 6a1659efb3a63312b581ab8d8aaab8035e5e26fb
blob + 723764c4847003f799c22a821532c7610d4723a4
--- resources/posts.edn
+++ resources/posts.edn
@@ -1,4 +1,12 @@
-[{:title "Tangle text/gemini files"
+[{:title "Parsing UTF-8"
+  :short "The ugly, the uglier and the finite approach"
+  :slug "parsing-utf8"
+  :date "2021/01/11"
+  :tags #{:c :parsing}
+  :gemtext? true
+  :music {:title "Metropolis Pt. 2: Scenes From a Memory"
+          :by "Dream Theater"}}
+ {:title "Tangle text/gemini files"
   :slug "tangle-text-gemini-files"
   :date "2020/12/30"
   :tags #{:gemini :literate-programming}