If it’s still not clear by now, I love writing parsers. A parser is a program that, given a stream of characters, builds a data structure: it gives meaning to a stream of bytes! What could be more exciting than writing parsers?
Some time ago I tried to use transducers to parse text/gemini files but, given my ignorance of how transducers work, the resulting code came out more verbose than it really needed to be.
=> /post/parsing-gemtext-with-clojure.gmi Parsing gemtext with clojure
Today I gave myself a second chance at building parsers on top of transducers, and I think the result is much cleaner, and maybe even shorter, than my text/gemini parser, even though the subject has a more complex grammar.
Today’s subject, as you may have guessed from the title of the entry, is PO files.
=> https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html GNU gettext description of PO files.
PO files are commonly used to hold translation data. The format, as described by the link above, is as follows:
``` example of PO file
white-space
# translator-comments
#. extracted-comments
#: reference...
#, flag...
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string
```
Inventing your own translation system almost never has a good outcome, especially when there are formats such as PO that are supported by a variety of tools, including nice GUIs such as poedit. The sad news is that in the Clojure ecosystem I couldn’t find what I personally consider a good option when it comes to managing translations.
There’s Tempura, written by Peter Taoussanis (who, by the way, maintains A LOT of cool libraries), but I don’t particularly like how it works, and I’d have to plug in a hand-written parser from/to PO if I want the translators to use poedit (or similar software).
Another option is Pottery, which I like overall, but:
* multiline translation strings are broken: I’ve had a PR pending since September 2020 to fix it, but no reply as of the time of writing
* they switched to the Hippocratic License, which is NOT free software, so there are ethical implications (ironic, huh?)
=> https://github.com/ptaoussanis/tempura Tempura
=> https://github.com/brightin/pottery Pottery
So here’s why I’m rolling my own. It’s not yet complete, and I’ve just finished the first version of the PO parser/unparser, but I thought I’d write a literate programming-esque post describing how I’m parsing PO files using transducers.
DISCLAIMER: the code hasn’t been heavily tested yet, so it may misbehave. It’s just for demonstration purposes (for the moment).
```clojure
(ns op.rtr.po
  "Utilities to parse PO files."
  (:require
   [clojure.edn :as edn]
   [clojure.string :as str])
  (:import
   (java.io StringWriter)))
```
Well, we’ve got a nice palindrome namespace, which is good, and we’re requiring a few things. clojure.string is quite obvious, since we’re going to play with strings a lot. We’ll also (ab)use clojure.edn during the parsing. StringWriter is imported only to provide a convenience function for working with PO data as strings; it will also come in handy for testing purposes.
The body of this library is the parser transducer, which is built out of a bunch of small functions that each do one simple thing.
```clojure
(def ^:private split-on-blank
  "Transducer that splits on blank lines."
  (partition-by #(= % "")))
```
The split-on-blank transducer will group sequential blank lines and sequential non-blank lines together; this way we can separate each entry in the file.
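To see what it produces, here’s a quick REPL sketch on a made-up handful of lines:

```clojure
;; hypothetical input: two entries separated by blank lines
(into [] split-on-blank
      ["#: a.c:1" "msgid \"a\"" "" "" "#: b.c:2" "msgid \"b\""])
;; => [["#: a.c:1" "msgid \"a\""] ["" ""] ["#: b.c:2" "msgid \"b\""]]
```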
```clojure
(def ^:private remove-empty-lines
  "Transducer that removes groups of empty lines."
  (filter #(not= "" (first %))))
```
remove-empty-lines simply removes the garbage that split-on-blank produces: it gets rid of the groups of empty lines, so we’re left only with sequences of entries.
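Composing the two, only the entry blocks survive (same toy input as above):

```clojure
(into [] (comp split-on-blank remove-empty-lines)
      ["#: a.c:1" "msgid \"a\"" "" "" "#: b.c:2" "msgid \"b\""])
;; => [["#: a.c:1" "msgid \"a\""] ["#: b.c:2" "msgid \"b\""]]
```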
```clojure
(declare parse-comments)
(declare parse-keys)

(def ^:private parse-entries
  (let [comment-line? (fn [line] (str/starts-with? line "#"))]
    (map (fn [lines]
           (let [[comments keys] (partition-by comment-line? lines)]
             {:comments (parse-comments comments)
              :keys (parse-keys keys)})))))
```
Ignoring parse-comments and parse-keys for a bit, this step takes the block of lines that constitutes an entry and parses it into a map of comments and keys, using partition-by to split the entry’s lines in two.
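For instance, on a single (made-up) entry the split looks like this:

```clojure
(partition-by #(str/starts-with? % "#")
              ["#: lib/error.c:116"
               "msgid \"Unknown system error\""
               "msgstr \"Errore sconosciuto del sistema\""])
;; => (("#: lib/error.c:116")
;;     ("msgid \"Unknown system error\"" "msgstr \"Errore sconosciuto del sistema\""))
```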
Now that we have every piece, we can define the parser!
```clojure
(def ^:private parser
  (comp split-on-blank
        remove-empty-lines
        parse-entries))
```
We can provide a nice API to parse PO files from various sources very easily:
```clojure
(defn parse
  "Parse the PO file given as stream of lines `l`."
  [l]
  (transduce parser conj [] l))

(defn parse-from-reader
  "Parse the PO file given in reader `rdr`. `rdr` must implement `java.io.BufferedReader`."
  [rdr]
  (parse (line-seq rdr)))

(defn parse-from-string
  "Parse the PO file given as string."
  [s]
  (parse (str/split-lines s)))
```
And we’re done. That was all for this time. Bye!
Well, no… I still haven’t provided the implementation of parse-comments and parse-keys. To be honest, they’re quite ugly. parse-keys in particular is the ugliest part of the library as of now, but y’know what? We’re in 2021 now: if it runs, ship it!
Jokes aside, I should refactor these into something more manageable, but I will focus on the rest of the library first.
parse-comments takes a block of comment lines and tries to make sense out of it.
```clojure
(defn- parse-comments [comments]
  (into {}
        (for [comment comments]
          (let [len (count comment)
                proper? (>= len 2)
                start (when proper? (subs comment 0 2))
                rest (when proper? (subs comment 2))
                remove-empty #(filter (partial not= "") %)]
            (case start
              "#:" [:reference (remove-empty (str/split rest #" +"))]
              "#," [:flags (remove-empty (str/split rest #" +"))]
              "# " [:translator-comment rest]
              ;; TODO: add other types
              [:unknown-comment comment])))))
```
We simply loop through each line and do some simple pattern matching on its first two characters. We then group all those two-element vectors into a single hash map. I should probably refactor this to use group-by to avoid losing some information: say an entry provides two reference comments, we would lose one of the two.
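Just to sketch that direction (this is not in the library, and parse-comments-grouped is a hypothetical name), a group-by version could look roughly like this, keeping every comment of a given type in a vector:

```clojure
(defn- parse-comments-grouped [comments]
  (->> (for [comment comments]
         (let [proper? (>= (count comment) 2)
               start (when proper? (subs comment 0 2))
               body (when proper? (subs comment 2))
               remove-empty #(filterv (partial not= "") %)]
           (case start
             "#:" [:reference (remove-empty (str/split body #" +"))]
             "#," [:flags (remove-empty (str/split body #" +"))]
             "# " [:translator-comment body]
             [:unknown-comment comment])))
       (group-by first)
       (into {} (map (fn [[type pairs]] [type (mapv second pairs)])))))
```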
To define parse-keys we need a helper: join-sequential-strings.
```clojure
(defn- join-sequential-strings [rf]
  (let [acc (volatile! nil)]
    (fn
      ([] (rf))
      ([res] (if-let [a @acc]
               (do (vreset! acc nil)
                   (rf res (apply str a)))
               (rf res)))
      ([res i]
       (if (string? i)
         ;; conj onto a vector so the strings stay in order
         (do (vswap! acc (fnil conj []) i)
             res)
         (rf (or (when-let [a @acc]
                   (vreset! acc nil)
                   (rf res (apply str a)))
                 res)
             i))))))
```
The thing about this post, compared to the one about text/gemini, is that I’m becoming more comfortable with transducers, and I’m starting to use the standard library more and more. In fact, this is the only hand-written transducer we’ve seen so far.
Like every respectable stateful transducer, it allocates its state using volatile!. rf is the reducing function, and our transducer is the function with three arities inside the let.
The one-arity branch is called to signal the end of the stream: the transduction has reached the end of the sequence and calls us with the accumulated result res. There we flush our accumulator, if we had something accumulated, or simply call the reducing function on the result and end.
The two-arity branch is called on each item that is fed to the transducer. The first argument, res, is the accumulated result, and i is the current item: if it’s a string, we accumulate it into acc; otherwise we drain our accumulator and pass i to rf as-is.
One important thing I learned writing it is that, even if it should be obvious, rf is a pure function: when we call rf no side effects occur. So, to provide two items we can’t simply call rf twice; we have to call rf on the output of rf, and make sure we return it!
In this case, if we’ve accumulated some strings, we reset our accumulator and call rf with their concatenation. Then we call rf on this new result, or on the original res if we hadn’t accumulated anything, passing i.
It may become clearer if we replace rf with conj and res with [] (the empty vector).
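A quick REPL check with a toy input (the symbols stand in for the keywords-to-be):

```clojure
(transduce join-sequential-strings conj []
           ['msgid "" "hello\n" "world" 'msgstr "ciao"])
;; => [msgid "hello\nworld" msgstr "ciao"]
```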
With this, we can finally define parse-keys and end our little parser:
```clojure
(def ^:private keywordize-things
  (map #(if (string? %) % (keyword %))))

(defn- parse-keys [keys]
  (apply hash-map
         (transduce (comp join-sequential-strings
                          keywordize-things)
                    conj
                    []
                    ;; XXX: double hack for double fun!
                    (edn/read-string (str "[" (apply str (interpose " " keys)) "]")))))
```
keywordize-things is another transducer, one that turns everything but strings into keywords, and parse-keys composes these last two transducers to parse the entry; but it does so with a twist, by abusing edn/read-string.
In a PO file, after the comments each entry has a section like this:
```
msgid "message id"
```
that is, a keyword followed by a string. But the string can span multiple lines:
```
msgid ""
"hello\n"
"world"
```
To parse these situations, and to handle things like \n or \" inside the strings, I’m abusing the edn/read-string function. I’m concatenating all the lines, joining them with a space in between, and then wrapping the resulting string in “[” and “]” before calling the edn parser. This way, the edn parser will turn ‘msgid’ (for instance) into a symbol, and read every string for us.
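The gist of the hack, applied to the multi-line msgid above:

```clojure
;; the three raw lines of the multi-line msgid, joined and wrapped in [ ]
(edn/read-string
 (str "[" (apply str (interpose " " ["msgid \"\"" "\"hello\\n\"" "\"world\""])) "]"))
;; => [msgid "" "hello\n" "world"]
```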
Then we use the transducers defined before to join the strings and turn the symbols into keywords, and we have a proper parser. (Well, rewriting this hack will probably be the subject of a future post!)
A quick test:
```clojure
(parse-from-string "
#: lib/error.c:116
msgid \"Unknown system error\"
msgstr \"Errore sconosciuto del sistema\"

#: lib/error.c:116 lib/anothererror.c:134
msgid \"Known system error\"
msgstr \"Errore conosciuto del sistema\"
")
;; =>
;; [{:comments {:reference ("lib/error.c:116")}
;;   :keys {:msgid "Unknown system error"
;;          :msgstr "Errore sconosciuto del sistema"}}
;;  {:comments {:reference ("lib/error.c:116" "lib/anothererror.c:134")}
;;   :keys {:msgid "Known system error"
;;          :msgstr "Errore conosciuto del sistema"}}]
```
Yay! It works!
Writing an unparse function is also pretty easy, and is left as an exercise to the reader, because where I live it’s pretty late now and I want to sleep :P
To conclude, another nice property of parsers is that if you have an “unparse” operation (i.e. turning your data structure back into its textual representation), then the composition of the two should be the identity function. It’s a handy property for testing!
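For reference, here’s one possible sketch of such an unparse-to-string (hypothetical code, handling only the comment types parsed above; not necessarily what will end up in the library):

```clojure
(defn- unparse-comment [[type value]]
  (case type
    :reference          (str "#: " (str/join " " value))
    :flags              (str "#, " (str/join " " value))
    :translator-comment (str "# " value)
    :unknown-comment    value))

(defn- unparse-entry [{:keys [comments keys]}]
  (str/join "\n"
            (concat (map unparse-comment comments)
                    ;; pr-str re-quotes and re-escapes the strings, mirroring
                    ;; the edn/read-string trick used in parse-keys
                    (map (fn [[k v]] (str (name k) " " (pr-str v))) keys))))

(defn unparse-to-string
  "Turn parsed PO entries back into their textual representation."
  [entries]
  (str (str/join "\n\n" (map unparse-entry entries)) "\n"))
```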
```clojure
(let [x [{:comments {:reference '("lib/error.c:116")}
          :keys {:msgid "Unknown system error"
                 :msgstr "Errore sconosciuto del sistema"}}]]
  (= x
     (parse-from-string (unparse-to-string x))))
;; => true
```
That was all for this time! (For real this time.) Thanks for reading.