Blame

Date:: Wed Nov 2 20:01:35 2022 UTC
Message:: bundle libgrapheme 2.0.2 in case it's not available
  1 3448adb0 2022-11-02 op cat << EOF
  2 3448adb0 2022-11-02 op .Dd ${MAN_DATE}
  3 3448adb0 2022-11-02 op .Dt LIBGRAPHEME 7
  4 3448adb0 2022-11-02 op .Os suckless.org
  5 3448adb0 2022-11-02 op .Sh NAME
  6 3448adb0 2022-11-02 op .Nm libgrapheme
  7 3448adb0 2022-11-02 op .Nd Unicode string library
  8 3448adb0 2022-11-02 op .Sh SYNOPSIS
  9 3448adb0 2022-11-02 op .In grapheme.h
 10 3448adb0 2022-11-02 op .Sh DESCRIPTION
 11 3448adb0 2022-11-02 op The
 12 3448adb0 2022-11-02 op .Nm
 13 3448adb0 2022-11-02 op library provides functions to properly handle Unicode strings according
 14 3448adb0 2022-11-02 op to the Unicode specification in regard to character, word, sentence and
 15 3448adb0 2022-11-02 op line segmentation and case detection and conversion.
 16 3448adb0 2022-11-02 op .Pp
 17 3448adb0 2022-11-02 op Unicode strings are made up of user-perceived characters (so-called
 18 3448adb0 2022-11-02 op .Dq grapheme clusters ,
 19 3448adb0 2022-11-02 op see
 20 3448adb0 2022-11-02 op .Sx MOTIVATION )
 21 3448adb0 2022-11-02 op that are composed of one or more Unicode codepoints, which in turn
 22 3448adb0 2022-11-02 op are encoded in one or more bytes in an encoding like UTF-8.
 23 3448adb0 2022-11-02 op .Pp
 24 3448adb0 2022-11-02 op There is a widespread misconception that it is enough to simply
 25 3448adb0 2022-11-02 op determine codepoints in a string and treat them as user-perceived
 26 3448adb0 2022-11-02 op characters to be Unicode compliant.
 27 3448adb0 2022-11-02 op While this may work in some cases, this assumption quickly breaks,
 28 3448adb0 2022-11-02 op especially for non-Western languages and decomposed Unicode strings
 29 3448adb0 2022-11-02 op where user-perceived characters are usually represented using multiple
 30 3448adb0 2022-11-02 op codepoints.
 31 3448adb0 2022-11-02 op .Pp
 32 3448adb0 2022-11-02 op Despite this complicated multilevel structure of Unicode strings,
 33 3448adb0 2022-11-02 op .Nm
 34 3448adb0 2022-11-02 op provides methods to work with them at the byte level (i.e. UTF-8
 35 3448adb0 2022-11-02 op .Sq char
 36 3448adb0 2022-11-02 op arrays) while also offering codepoint-level methods.
 37 3448adb0 2022-11-02 op Additionally, it is a
 38 3448adb0 2022-11-02 op .Dq freestanding
 39 3448adb0 2022-11-02 op library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
 40 3448adb0 2022-11-02 op a standard library, which makes it easy to use in bare-metal environments.
 41 3448adb0 2022-11-02 op .Pp
 42 3448adb0 2022-11-02 op Every documented function's manual page provides a self-contained
 43 3448adb0 2022-11-02 op example illustrating the possible usage.
 44 3448adb0 2022-11-02 op .Sh SEE ALSO
 45 3448adb0 2022-11-02 op .Xr grapheme_decode_utf8 3 ,
 46 3448adb0 2022-11-02 op .Xr grapheme_encode_utf8 3 ,
 47 3448adb0 2022-11-02 op .Xr grapheme_is_character_break 3 ,
 48 3448adb0 2022-11-02 op .Xr grapheme_is_lowercase 3 ,
 49 3448adb0 2022-11-02 op .Xr grapheme_is_lowercase_utf8 3 ,
 50 3448adb0 2022-11-02 op .Xr grapheme_is_titlecase 3 ,
 51 3448adb0 2022-11-02 op .Xr grapheme_is_titlecase_utf8 3 ,
 52 3448adb0 2022-11-02 op .Xr grapheme_is_uppercase 3 ,
 53 3448adb0 2022-11-02 op .Xr grapheme_is_uppercase_utf8 3 ,
 54 3448adb0 2022-11-02 op .Xr grapheme_next_character_break 3 ,
 55 3448adb0 2022-11-02 op .Xr grapheme_next_character_break_utf8 3 ,
 56 3448adb0 2022-11-02 op .Xr grapheme_next_line_break 3 ,
 57 3448adb0 2022-11-02 op .Xr grapheme_next_line_break_utf8 3 ,
 58 3448adb0 2022-11-02 op .Xr grapheme_next_sentence_break 3 ,
 59 3448adb0 2022-11-02 op .Xr grapheme_next_sentence_break_utf8 3 ,
 60 3448adb0 2022-11-02 op .Xr grapheme_next_word_break 3 ,
 61 3448adb0 2022-11-02 op .Xr grapheme_next_word_break_utf8 3 ,
 62 3448adb0 2022-11-02 op .Xr grapheme_to_lowercase 3 ,
 63 3448adb0 2022-11-02 op .Xr grapheme_to_lowercase_utf8 3 ,
 64 3448adb0 2022-11-02 op .Xr grapheme_to_titlecase 3 ,
 65 3448adb0 2022-11-02 op .Xr grapheme_to_titlecase_utf8 3 ,
 66 3448adb0 2022-11-02 op .Xr grapheme_to_uppercase 3 ,
 67 3448adb0 2022-11-02 op .Xr grapheme_to_uppercase_utf8 3
 68 3448adb0 2022-11-02 op .Sh STANDARDS
 69 3448adb0 2022-11-02 op .Nm
 70 3448adb0 2022-11-02 op is compliant with the Unicode ${UNICODE_VERSION} specification.
 71 3448adb0 2022-11-02 op .Sh MOTIVATION
 72 3448adb0 2022-11-02 op The idea behind every character encoding scheme like ASCII or Unicode
 73 3448adb0 2022-11-02 op is to express abstract characters (which can be thought of as shapes
 74 3448adb0 2022-11-02 op making up a written language). ASCII, for instance, which comprises the
 75 3448adb0 2022-11-02 op range 0 to 127, assigns the number 65 (0x41) to the abstract character
 76 3448adb0 2022-11-02 op .Sq A .
 77 3448adb0 2022-11-02 op This number is called a
 78 3448adb0 2022-11-02 op .Dq codepoint ,
 79 3448adb0 2022-11-02 op and all codepoints of an encoding make up its so-called
 80 3448adb0 2022-11-02 op .Dq code space .
 81 3448adb0 2022-11-02 op .Pp
 82 3448adb0 2022-11-02 op Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
 83 3448adb0 2022-11-02 op first 128 codepoints are identical to ASCII's. The additional
 84 3448adb0 2022-11-02 op codepoints are needed as Unicode's goal is to express all writing systems
 85 3448adb0 2022-11-02 op of the world.
 86 3448adb0 2022-11-02 op To give an example, the abstract character
 87 3448adb0 2022-11-02 op .Sq \[u00C4]
 88 3448adb0 2022-11-02 op is not expressible in ASCII, given no ASCII codepoint has been assigned
 89 3448adb0 2022-11-02 op to it.
 90 3448adb0 2022-11-02 op It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
 91 3448adb0 2022-11-02 op .Pp
 92 3448adb0 2022-11-02 op One may assume that this process is straightforward, but as more and
 93 3448adb0 2022-11-02 op more codepoints were assigned to abstract characters, the Unicode
 94 3448adb0 2022-11-02 op Consortium (that defines the Unicode standard) was facing a problem:
 95 3448adb0 2022-11-02 op Many (mostly non-European) languages have such a large number of
 96 3448adb0 2022-11-02 op abstract characters that it would exhaust the available Unicode code
 97 3448adb0 2022-11-02 op space if one tried to assign a codepoint to each abstract character.
 98 3448adb0 2022-11-02 op The solution to that problem is best introduced with an example: Consider
 99 3448adb0 2022-11-02 op the abstract character
100 3448adb0 2022-11-02 op .Sq \[u01DE] ,
101 3448adb0 2022-11-02 op which is
102 3448adb0 2022-11-02 op .Sq A
103 3448adb0 2022-11-02 op with an umlaut and a macron added to it.
104 3448adb0 2022-11-02 op In this sense, one can consider
105 3448adb0 2022-11-02 op .Sq \[u01DE]
106 3448adb0 2022-11-02 op as a two-fold modification (namely
107 3448adb0 2022-11-02 op .Dq add umlaut
108 3448adb0 2022-11-02 op and
109 3448adb0 2022-11-02 op .Dq add macron )
110 3448adb0 2022-11-02 op of the
111 3448adb0 2022-11-02 op .Dq base character
112 3448adb0 2022-11-02 op .Sq A .
113 3448adb0 2022-11-02 op .Pp
114 3448adb0 2022-11-02 op The Unicode Consortium adopted this idea by assigning codepoints to
115 3448adb0 2022-11-02 op modifications.
116 3448adb0 2022-11-02 op For example, the codepoint 0x308 represents adding an umlaut and 0x304
117 3448adb0 2022-11-02 op represents adding a macron, and thus, the codepoint sequence
118 3448adb0 2022-11-02 op .Dq 0x41 0x308 0x304 ,
119 3448adb0 2022-11-02 op namely the base character
120 3448adb0 2022-11-02 op .Sq A
121 3448adb0 2022-11-02 op followed by the umlaut and macron modifiers, represents the abstract
122 3448adb0 2022-11-02 op character
123 3448adb0 2022-11-02 op .Sq \[u01DE] .
124 3448adb0 2022-11-02 op As a side-note, the single codepoint 0x1DE was also assigned to
125 3448adb0 2022-11-02 op .Sq \[u01DE] ,
126 3448adb0 2022-11-02 op which is a good example of the fact that there can be multiple
127 3448adb0 2022-11-02 op representations of a single abstract character in Unicode.
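The two representations mentioned above can be compared at the byte level. The following is a minimal sketch of UTF-8 encoding, assuming well-formed input; the helper names sketch_encode_utf8 and sketch_encode_sequence are hypothetical and are not part of the libgrapheme API. It illustrates that the decomposed sequence and the precomposed codepoint yield different byte strings for the same abstract character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not libgrapheme API): encode one codepoint as
 * UTF-8 into buf, which must have room for 4 bytes. Returns the number
 * of bytes written, or 0 for surrogates and values above 0x10FFFF.
 */
size_t
sketch_encode_utf8(uint_least32_t cp, unsigned char *buf)
{
	assert(buf != NULL);

	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	} else if (cp < 0x800) {
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	} else if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0; /* surrogates are not encodable */
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	} else if (cp <= 0x10FFFF) {
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0; /* outside the Unicode code space */
}

/*
 * Hypothetical helper: encode n codepoints back to back into out,
 * which must have room for 4 * n bytes. Returns the total byte count.
 */
size_t
sketch_encode_sequence(const uint_least32_t *cp, size_t n, unsigned char *out)
{
	size_t i, off = 0;

	for (i = 0; i < n; i++)
		off += sketch_encode_utf8(cp[i], out + off);
	return off;
}
```

Encoding the sequence 0x41 0x308 0x304 this way gives the five bytes 41 CC 88 CC 84, whereas the single codepoint 0x1DE gives the two bytes C7 9E, so a plain byte comparison treats the two representations as different strings.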
128 3448adb0 2022-11-02 op .Pp
129 3448adb0 2022-11-02 op Expressing a single abstract character with multiple codepoints solved
130 3448adb0 2022-11-02 op the code space exhaustion problem, and the concept has been greatly
131 3448adb0 2022-11-02 op expanded since its first introduction (emojis, joiners, etc.). A sequence
132 3448adb0 2022-11-02 op (which can also have the length 1) of codepoints that belong together
133 3448adb0 2022-11-02 op this way and represent an abstract character is called a
134 3448adb0 2022-11-02 op .Dq grapheme cluster .
135 3448adb0 2022-11-02 op .Pp
136 3448adb0 2022-11-02 op In many applications it is necessary to count the number of
137 3448adb0 2022-11-02 op user-perceived characters, i.e. grapheme clusters, in a string.
138 3448adb0 2022-11-02 op A good example of this is a terminal text editor, which needs to
139 3448adb0 2022-11-02 op properly align characters on a grid.
140 3448adb0 2022-11-02 op This is pretty simple with ASCII strings, where you just count the number
141 3448adb0 2022-11-02 op of bytes (as each byte is a codepoint and each codepoint is a grapheme
142 3448adb0 2022-11-02 op cluster).
143 3448adb0 2022-11-02 op With Unicode strings, it is a common mistake to simply adopt the
144 3448adb0 2022-11-02 op ASCII approach and count the number of codepoints.
145 3448adb0 2022-11-02 op This is wrong, as, for example, the sequence
146 3448adb0 2022-11-02 op .Dq 0x41 0x308 0x304 ,
147 3448adb0 2022-11-02 op while made up of 3 codepoints, is a single grapheme cluster and
148 3448adb0 2022-11-02 op represents the user-perceived character
149 3448adb0 2022-11-02 op .Sq \[u01DE] .
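The mismatch between the three counts can be made concrete with a minimal sketch. The helper sketch_count_codepoints is hypothetical (not part of the libgrapheme API) and assumes well-formed UTF-8 input: it counts codepoints by counting every byte that is not a UTF-8 continuation byte.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not libgrapheme API): count the codepoints in
 * a well-formed UTF-8 string of len bytes. Continuation bytes have the
 * bit pattern 10xxxxxx, so every other byte starts a new codepoint.
 */
size_t
sketch_count_codepoints(const char *str, size_t len)
{
	size_t i, n = 0;

	assert(str != NULL || len == 0);

	for (i = 0; i < len; i++) {
		if (((unsigned char)str[i] & 0xC0) != 0x80)
			n++;
	}
	return n;
}
```

For the UTF-8 string "A\xCC\x88\xCC\x84", which encodes 0x41 0x308 0x304, the byte count is 5 and this helper returns 3, while a user perceives exactly one character. Neither number is the one a terminal text editor needs; the cluster-level functions listed under SEE ALSO, such as grapheme_next_character_break_utf8(3), are intended for that.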
150 3448adb0 2022-11-02 op .Pp
151 3448adb0 2022-11-02 op The proper way to segment a string into user-perceived characters
152 3448adb0 2022-11-02 op is to segment it into its grapheme clusters by applying the Unicode
153 3448adb0 2022-11-02 op grapheme cluster breaking algorithm (UAX #29).
154 3448adb0 2022-11-02 op It is based on a complex ruleset and lookup tables and determines whether a
155 3448adb0 2022-11-02 op grapheme cluster ends or is continued between two codepoints.
156 3448adb0 2022-11-02 op Libraries like ICU and libunistring, which also offer this functionality,
157 3448adb0 2022-11-02 op are often bloated, incorrect, difficult to use or not reasonably
158 3448adb0 2022-11-02 op statically linkable.
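To give a feel for what deciding between two codepoints means, the following toy sketch implements only one deliberately simplified rule: no break before a codepoint in the Combining Diacritical Marks block (0x300 to 0x36F). This is an assumption made for illustration and is neither the UAX #29 algorithm nor how libgrapheme works; the real ruleset also covers CR LF, Hangul syllables, emoji sequences, joiners and more. All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy property lookup: only the Combining Diacritical Marks block. */
int
sketch_is_combining(uint_least32_t cp)
{
	return cp >= 0x300 && cp <= 0x36F;
}

/*
 * Toy break decision between two adjacent codepoints a and b:
 * the cluster continues if b is a combining mark, otherwise it ends.
 * The real algorithm also inspects a (and sometimes earlier context).
 */
int
sketch_is_break(uint_least32_t a, uint_least32_t b)
{
	(void)a; /* unused by this simplified rule */
	return !sketch_is_combining(b);
}

/* Count clusters in a codepoint array of length n under the toy rule. */
size_t
sketch_count_clusters(const uint_least32_t *cp, size_t n)
{
	size_t i, count;

	assert(cp != NULL || n == 0);

	if (n == 0)
		return 0;
	count = 1;
	for (i = 1; i < n; i++) {
		if (sketch_is_break(cp[i - 1], cp[i]))
			count++;
	}
	return count;
}
```

Under this toy rule the sequence 0x41 0x308 0x304 counts as one cluster, and appending 0x42 yields two. Anything beyond such simple cases needs the full ruleset and property tables, which is the work a dedicated library takes on.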
159 3448adb0 2022-11-02 op .Pp
160 3448adb0 2022-11-02 op Analogously, the standard provides algorithms to separate strings by
161 3448adb0 2022-11-02 op words, sentences and lines, convert cases and compare strings.
162 3448adb0 2022-11-02 op The motivation behind
163 3448adb0 2022-11-02 op .Nm
164 3448adb0 2022-11-02 op is to make Unicode handling suck less and abide by the UNIX philosophy.
165 3448adb0 2022-11-02 op .Sh AUTHORS
166 3448adb0 2022-11-02 op .An Laslo Hunhold Aq Mt dev@frign.de
167 3448adb0 2022-11-02 op EOF