Blame


1 3448adb0 2022-11-02 op cat << EOF
2 3448adb0 2022-11-02 op .Dd ${MAN_DATE}
3 3448adb0 2022-11-02 op .Dt LIBGRAPHEME 7
4 3448adb0 2022-11-02 op .Os suckless.org
5 3448adb0 2022-11-02 op .Sh NAME
6 3448adb0 2022-11-02 op .Nm libgrapheme
7 3448adb0 2022-11-02 op .Nd Unicode string library
8 3448adb0 2022-11-02 op .Sh SYNOPSIS
9 3448adb0 2022-11-02 op .In grapheme.h
10 3448adb0 2022-11-02 op .Sh DESCRIPTION
11 3448adb0 2022-11-02 op The
12 3448adb0 2022-11-02 op .Nm
13 3448adb0 2022-11-02 op library provides functions to properly handle Unicode strings according
14 3448adb0 2022-11-02 op to the Unicode specification in regard to character, word, sentence and
15 3448adb0 2022-11-02 op line segmentation and case detection and conversion.
16 3448adb0 2022-11-02 op .Pp
17 3448adb0 2022-11-02 op Unicode strings are made up of user-perceived characters (so-called
18 3448adb0 2022-11-02 op .Dq grapheme clusters ,
19 3448adb0 2022-11-02 op see
20 3448adb0 2022-11-02 op .Sx MOTIVATION )
21 3448adb0 2022-11-02 op that are composed of one or more Unicode codepoints, which in turn
22 3448adb0 2022-11-02 op are encoded in one or more bytes in an encoding like UTF-8.
23 3448adb0 2022-11-02 op .Pp
24 3448adb0 2022-11-02 op There is a widespread misconception that it is enough to simply
25 3448adb0 2022-11-02 op determine codepoints in a string and treat them as user-perceived
26 3448adb0 2022-11-02 op characters to be Unicode compliant.
27 3448adb0 2022-11-02 op While this may work in some cases, this assumption quickly breaks,
28 3448adb0 2022-11-02 op especially for non-Western languages and decomposed Unicode strings
29 3448adb0 2022-11-02 op where user-perceived characters are usually represented using multiple
30 3448adb0 2022-11-02 op codepoints.
31 3448adb0 2022-11-02 op .Pp
32 3448adb0 2022-11-02 op Despite this complicated multilevel structure of Unicode strings,
33 3448adb0 2022-11-02 op .Nm
34 3448adb0 2022-11-02 op provides methods to work with them at the byte-level (i.e. UTF-8
35 3448adb0 2022-11-02 op .Sq char
36 3448adb0 2022-11-02 op arrays) while also offering codepoint-level methods.
37 3448adb0 2022-11-02 op Additionally, it is a
38 3448adb0 2022-11-02 op .Dq freestanding
39 3448adb0 2022-11-02 op library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
40 3448adb0 2022-11-02 op a standard library. This makes it easy to use in bare-metal environments.
41 3448adb0 2022-11-02 op .Pp
42 3448adb0 2022-11-02 op Every documented function's manual page provides a self-contained
43 3448adb0 2022-11-02 op example illustrating the possible usage.
44 3448adb0 2022-11-02 op .Sh SEE ALSO
45 3448adb0 2022-11-02 op .Xr grapheme_decode_utf8 3 ,
46 3448adb0 2022-11-02 op .Xr grapheme_encode_utf8 3 ,
47 3448adb0 2022-11-02 op .Xr grapheme_is_character_break 3 ,
48 3448adb0 2022-11-02 op .Xr grapheme_is_lowercase 3 ,
49 3448adb0 2022-11-02 op .Xr grapheme_is_lowercase_utf8 3 ,
50 3448adb0 2022-11-02 op .Xr grapheme_is_titlecase 3 ,
51 3448adb0 2022-11-02 op .Xr grapheme_is_titlecase_utf8 3 ,
52 3448adb0 2022-11-02 op .Xr grapheme_is_uppercase 3 ,
53 3448adb0 2022-11-02 op .Xr grapheme_is_uppercase_utf8 3 ,
54 3448adb0 2022-11-02 op .Xr grapheme_next_character_break 3 ,
55 3448adb0 2022-11-02 op .Xr grapheme_next_character_break_utf8 3 ,
56 3448adb0 2022-11-02 op .Xr grapheme_next_line_break 3 ,
57 3448adb0 2022-11-02 op .Xr grapheme_next_line_break_utf8 3 ,
58 3448adb0 2022-11-02 op .Xr grapheme_next_sentence_break 3 ,
59 3448adb0 2022-11-02 op .Xr grapheme_next_sentence_break_utf8 3 ,
60 3448adb0 2022-11-02 op .Xr grapheme_next_word_break 3 ,
61 3448adb0 2022-11-02 op .Xr grapheme_next_word_break_utf8 3 ,
62 3448adb0 2022-11-02 op .Xr grapheme_to_lowercase 3 ,
63 3448adb0 2022-11-02 op .Xr grapheme_to_lowercase_utf8 3 ,
64 3448adb0 2022-11-02 op .Xr grapheme_to_titlecase 3 ,
65 3448adb0 2022-11-02 op .Xr grapheme_to_titlecase_utf8 3 ,
66 3448adb0 2022-11-02 op .Xr grapheme_to_uppercase 3 ,
67 3448adb0 2022-11-02 op .Xr grapheme_to_uppercase_utf8 3
68 3448adb0 2022-11-02 op .Sh STANDARDS
69 3448adb0 2022-11-02 op .Nm
70 3448adb0 2022-11-02 op is compliant with the Unicode ${UNICODE_VERSION} specification.
71 3448adb0 2022-11-02 op .Sh MOTIVATION
72 3448adb0 2022-11-02 op The idea behind every character encoding scheme like ASCII or Unicode
73 3448adb0 2022-11-02 op is to express abstract characters (which can be thought of as shapes
74 3448adb0 2022-11-02 op making up a written language). ASCII for instance, which comprises the
75 3448adb0 2022-11-02 op range 0 to 127, assigns the number 65 (0x41) to the abstract character
76 3448adb0 2022-11-02 op .Sq A .
77 3448adb0 2022-11-02 op This number is called a
78 3448adb0 2022-11-02 op .Dq codepoint ,
79 3448adb0 2022-11-02 op and all codepoints of an encoding make up its so-called
80 3448adb0 2022-11-02 op .Dq code space .
81 3448adb0 2022-11-02 op .Pp
82 3448adb0 2022-11-02 op Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
83 3448adb0 2022-11-02 op first 128 codepoints are identical to ASCII's. The additional
84 3448adb0 2022-11-02 op codepoints are needed as Unicode's goal is to express all writing systems
85 3448adb0 2022-11-02 op of the world.
86 3448adb0 2022-11-02 op To give an example, the abstract character
87 3448adb0 2022-11-02 op .Sq \[u00C4]
88 3448adb0 2022-11-02 op is not expressible in ASCII, given no ASCII codepoint has been assigned
89 3448adb0 2022-11-02 op to it.
90 3448adb0 2022-11-02 op It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
91 3448adb0 2022-11-02 op .Pp
92 3448adb0 2022-11-02 op One may assume that this process is straightforward, but as more and
93 3448adb0 2022-11-02 op more codepoints were assigned to abstract characters, the Unicode
94 3448adb0 2022-11-02 op Consortium (which defines the Unicode standard) faced a problem:
95 3448adb0 2022-11-02 op Many (mostly non-European) languages have such a large number of
96 3448adb0 2022-11-02 op abstract characters that it would exhaust the available Unicode code
97 3448adb0 2022-11-02 op space if one tried to assign a codepoint to each abstract character.
98 3448adb0 2022-11-02 op The solution to that problem is best introduced with an example: Consider
99 3448adb0 2022-11-02 op the abstract character
100 3448adb0 2022-11-02 op .Sq \[u01DE] ,
101 3448adb0 2022-11-02 op which is
102 3448adb0 2022-11-02 op .Sq A
103 3448adb0 2022-11-02 op with an umlaut and a macron added to it.
104 3448adb0 2022-11-02 op In this sense, one can consider
105 3448adb0 2022-11-02 op .Sq \[u01DE]
106 3448adb0 2022-11-02 op as a two-fold modification (namely
107 3448adb0 2022-11-02 op .Dq add umlaut
108 3448adb0 2022-11-02 op and
109 3448adb0 2022-11-02 op .Dq add macron )
110 3448adb0 2022-11-02 op of the
111 3448adb0 2022-11-02 op .Dq base character
112 3448adb0 2022-11-02 op .Sq A .
113 3448adb0 2022-11-02 op .Pp
114 3448adb0 2022-11-02 op The Unicode Consortium adopted this idea by assigning codepoints to
115 3448adb0 2022-11-02 op modifications.
116 3448adb0 2022-11-02 op For example, the codepoint 0x308 represents adding an umlaut and 0x304
117 3448adb0 2022-11-02 op represents adding a macron, and thus, the codepoint sequence
118 3448adb0 2022-11-02 op .Dq 0x41 0x308 0x304 ,
119 3448adb0 2022-11-02 op namely the base character
120 3448adb0 2022-11-02 op .Sq A
121 3448adb0 2022-11-02 op followed by the umlaut and macron modifiers, represents the abstract
122 3448adb0 2022-11-02 op character
123 3448adb0 2022-11-02 op .Sq \[u01DE] .
124 3448adb0 2022-11-02 op As a side note, the single codepoint 0x1DE was also assigned to
125 3448adb0 2022-11-02 op .Sq \[u01DE] ,
126 3448adb0 2022-11-02 op which is a good example of the fact that there can be multiple
127 3448adb0 2022-11-02 op representations of a single abstract character in Unicode.
128 3448adb0 2022-11-02 op .Pp
129 3448adb0 2022-11-02 op Expressing a single abstract character with multiple codepoints solved
130 3448adb0 2022-11-02 op the code space exhaustion problem, and the concept has been greatly
131 3448adb0 2022-11-02 op expanded since its first introduction (emojis, joiners, etc.). A sequence
132 3448adb0 2022-11-02 op (which can also have the length 1) of codepoints that belong together
133 3448adb0 2022-11-02 op this way and represent an abstract character is called a
134 3448adb0 2022-11-02 op .Dq grapheme cluster .
135 3448adb0 2022-11-02 op .Pp
136 3448adb0 2022-11-02 op In many applications it is necessary to count the number of
137 3448adb0 2022-11-02 op user-perceived characters, i.e. grapheme clusters, in a string.
138 3448adb0 2022-11-02 op A good example of this is a terminal text editor, which needs to
139 3448adb0 2022-11-02 op properly align characters on a grid.
140 3448adb0 2022-11-02 op This is pretty simple with ASCII strings, where you just count the number
141 3448adb0 2022-11-02 op of bytes (as each byte is a codepoint and each codepoint is a grapheme
142 3448adb0 2022-11-02 op cluster).
143 3448adb0 2022-11-02 op With Unicode strings, it is a common mistake to simply adapt the
144 3448adb0 2022-11-02 op ASCII approach and count the number of codepoints.
145 3448adb0 2022-11-02 op This is wrong, as, for example, the sequence
146 3448adb0 2022-11-02 op .Dq 0x41 0x308 0x304 ,
147 3448adb0 2022-11-02 op while made up of 3 codepoints, is a single grapheme cluster and
148 3448adb0 2022-11-02 op represents the user-perceived character
149 3448adb0 2022-11-02 op .Sq \[u01DE] .
150 3448adb0 2022-11-02 op .Pp
151 3448adb0 2022-11-02 op The proper way to segment a string into user-perceived characters
152 3448adb0 2022-11-02 op is to segment it into its grapheme clusters by applying the Unicode
153 3448adb0 2022-11-02 op grapheme cluster breaking algorithm (UAX #29).
154 3448adb0 2022-11-02 op It is based on a complex ruleset and lookup tables and determines if a
155 3448adb0 2022-11-02 op grapheme cluster ends or is continued between two codepoints.
156 3448adb0 2022-11-02 op Libraries like ICU and libunistring, which also offer this functionality,
157 3448adb0 2022-11-02 op are often bloated, incorrect, difficult to use or not reasonably
158 3448adb0 2022-11-02 op statically linkable.
159 3448adb0 2022-11-02 op .Pp
160 3448adb0 2022-11-02 op Analogously, the standard provides algorithms to separate strings by
161 3448adb0 2022-11-02 op words, sentences and lines, convert cases and compare strings.
162 3448adb0 2022-11-02 op The motivation behind
163 3448adb0 2022-11-02 op .Nm
164 3448adb0 2022-11-02 op is to make Unicode handling suck less and abide by the UNIX philosophy.
165 3448adb0 2022-11-02 op .Sh AUTHORS
166 3448adb0 2022-11-02 op .An Laslo Hunhold Aq Mt dev@frign.de
167 3448adb0 2022-11-02 op EOF