If it’s still not clear by now, I love writing parsers. A parser is a program that, given a stream of characters, builds a data structure: it gives meaning to a stream of bytes! What could be more exciting than writing parsers?
Some time ago I tried to use transducers to parse text/gemini files but, given my ignorance of how transducers work, the resulting code came out more verbose than it really needed to be.
=> /post/parsing-gemtext-with-clojure.gmi Parsing gemtext with clojure
Today I gave myself a second chance at building parsers on top of transducers, and I think the result is much cleaner, and maybe even shorter, than my text/gemini parser, even though the subject has a more complex grammar.
Today’s subject, as you may have guessed from the title of the entry, is PO files.
=> https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html GNU gettext description of PO files.
PO files are commonly used to hold translation data. The format, as described by the link above, is as follows:
``` example of PO file
white-space
# translator-comments
#. extracted-comments
#: reference...
#, flag...
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string
```
Inventing your own translation system almost never has a good outcome, especially when there are formats such as PO that are supported by a variety of tools, including nice GUIs such as poedit. The sad news is that in the Clojure ecosystem I couldn’t find what I personally consider a good option when it comes to managing translations.
There’s Tempura, written by Peter Taoussanis (who, by the way, maintains A LOT of cool libraries), but I don’t particularly like how it works, and I’d have to plug in a hand-written parser from/to PO if I want the translators to use poedit (or similar software).
Another option is Pottery, which I like overall, but:
* multiline translation strings are broken: I’ve had a PR pending since September 2020 to fix it, but no reply as of the time of writing
* they switched to the Hippocratic License, which is NOT free software, so there are ethical implications (ironic, huh?)
=> https://github.com/ptaoussanis/tempura Tempura
=> https://github.com/brightin/pottery Pottery
So here’s why I’m rolling my own. It’s not yet complete, and I’ve just finished the first version of the PO parser/unparser, but I thought I’d write a literate programming-esque post describing how I’m parsing PO files using transducers.
DISCLAIMER: the code hasn’t been heavily tested yet, so it may misbehave. It’s just for demonstration purposes (for the moment).
```clojure
(ns op.rtr.po
  "Utilities to parse PO files."
  (:require
   [clojure.edn :as edn]
   [clojure.string :as str])
  (:import
   (java.io StringWriter)))
```
Well, we’ve got a nice palindrome namespace, which is good, and we’re requiring a few things. clojure.string is quite obvious, since we’re going to play with strings a lot. We’ll also (ab)use clojure.edn during the parsing. StringWriter is imported only to provide a convenience function for working with PO data as strings; it will also come in handy for testing purposes.
The body of this library is the parser transducer, which is built out of a bunch of small functions that each do one simple thing.
```clojure
(def ^:private split-on-blank
  "Transducer that splits on blank lines."
  (partition-by #(= % "")))
```
The split-on-blank transducer will group sequential blank lines and sequential non-blank lines together; this way we can separate each entry in the file.
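To see what it produces, here’s a quick REPL sketch on a made-up handful of lines:

```clojure
;; hypothetical input: two entries separated by blank lines
(into [] split-on-blank
      ["#: a.c:1" "msgid \"a\"" "" "" "#: b.c:2" "msgid \"b\""])
;; => [["#: a.c:1" "msgid \"a\""] ["" ""] ["#: b.c:2" "msgid \"b\""]]
```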
```clojure
(def ^:private remove-empty-lines
  "Transducer that removes groups of empty lines."
  (filter #(not= "" (first %))))
```
remove-empty-lines simply removes the garbage that split-on-blank produces: it gets rid of the groups of empty lines, so we’re left only with sequences of entries.
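Composing the two, only the entry blocks survive (same toy input as above):

```clojure
(into [] (comp split-on-blank remove-empty-lines)
      ["#: a.c:1" "msgid \"a\"" "" "" "#: b.c:2" "msgid \"b\""])
;; => [["#: a.c:1" "msgid \"a\""] ["#: b.c:2" "msgid \"b\""]]
```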
```clojure
(declare parse-comments)
(declare parse-keys)

(def ^:private parse-entries
  (let [comment-line? (fn [line] (str/starts-with? line "#"))]
    (map (fn [lines]
           (let [[comments keys] (partition-by comment-line? lines)]
             {:comments (parse-comments comments)
              :keys (parse-keys keys)})))))
```
Ignoring parse-comments and parse-keys for a bit, this step takes the block of lines that constitutes an entry and parses it into a map of comments and keys, using partition-by to split the entry’s lines in two.
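For instance, on a single (made-up) entry the split looks like this:

```clojure
(partition-by #(str/starts-with? % "#")
              ["#: lib/error.c:116"
               "msgid \"Unknown system error\""
               "msgstr \"Errore sconosciuto del sistema\""])
;; => (("#: lib/error.c:116")
;;     ("msgid \"Unknown system error\"" "msgstr \"Errore sconosciuto del sistema\""))
```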
Now that we have every piece, we can define the parser!
```clojure
(def ^:private parser
  (comp split-on-blank
        remove-empty-lines
        parse-entries))
```
We can provide a nice API to parse PO files from various sources very easily:
```clojure
(defn parse
  "Parse the PO file given as stream of lines `l`."
  [l]
  (transduce parser conj [] l))

(defn parse-from-reader
  "Parse the PO file given in reader `rdr`. `rdr` must implement `java.io.BufferedReader`."
  [rdr]
  (parse (line-seq rdr)))

(defn parse-from-string
  "Parse the PO file given as string."
  [s]
  (parse (str/split-lines s)))
```
And we’re done. That was all for this time. Bye!
Well, no… I still haven’t provided the implementation of parse-comments and parse-keys. To be honest, they’re quite ugly. parse-keys in particular is the ugliest part of the library as of now, but y’know what? We’re in 2021 now: if it runs, ship it!
Jokes aside, I should refactor these into something more manageable, but I will focus on the rest of the library first.
parse-comments takes a block of comment lines and tries to make sense out of it.
```clojure
(defn- parse-comments [comments]
  (into {}
        (for [comment comments]
          (let [len (count comment)
                proper? (>= len 2)
                start (when proper? (subs comment 0 2))
                rest (when proper? (subs comment 2))
                remove-empty #(filter (partial not= "") %)]
            (case start
              "#:" [:reference (remove-empty (str/split rest #" +"))]
              "#," [:flags (remove-empty (str/split rest #" +"))]
              "# " [:translator-comment rest]
              ;; TODO: add other types
              [:unknown-comment comment])))))
```
We simply loop through each line and do some simple pattern matching on its first two characters. We then group all those two-element vectors into a single hash map. I should probably refactor this to use group-by to avoid losing some information: say an entry provides two reference comments, we would lose one of the two.
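Just to sketch that direction (this is not in the library, and parse-comments-grouped is a hypothetical name), a group-by version could look roughly like this, keeping every comment of a given type in a vector:

```clojure
(defn- parse-comments-grouped [comments]
  (->> (for [comment comments]
         (let [proper? (>= (count comment) 2)
               start (when proper? (subs comment 0 2))
               body (when proper? (subs comment 2))
               remove-empty #(filterv (partial not= "") %)]
           (case start
             "#:" [:reference (remove-empty (str/split body #" +"))]
             "#," [:flags (remove-empty (str/split body #" +"))]
             "# " [:translator-comment body]
             [:unknown-comment comment])))
       (group-by first)
       (into {} (map (fn [[type pairs]] [type (mapv second pairs)])))))
```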
To define parse-keys we need a helper: join-sequential-strings.
```clojure
(defn- join-sequential-strings [rf]
  (let [acc (volatile! nil)]
    (fn
      ([] (rf))
      ([res] (if-let [a @acc]
               (do (vreset! acc nil)
                   (rf res (apply str a)))
               (rf res)))
      ([res i]
       (if (string? i)
         ;; conj onto a vector so the strings stay in order
         (do (vswap! acc (fnil conj []) i)
             res)
         (rf (or (when-let [a @acc]
                   (vreset! acc nil)
                   (rf res (apply str a)))
                 res)
             i))))))
```
The thing about this post, compared to the one about text/gemini, is that I’m becoming more comfortable with transducers, and I’m starting to use the standard library more and more. In fact, this is the only hand-written transducer we’ve seen so far.
Like every respectable stateful transducer, it allocates its state using volatile!. rf is the reducing function, and our transducer is the function with three arities inside the let.
The one-arity branch is called to signal the end of the stream: the transduction has reached the end of the sequence and calls us with the accumulated result res. There we flush our accumulator, if we had something accumulated, or simply call the reducing function on the result and end.
The two-arity branch is called on each item that is fed to the transducer. The first argument, res, is the accumulated result, and i is the current item: if it’s a string, we accumulate it into acc; otherwise we drain our accumulator and pass i to rf as-is.
One important thing I learned writing it is that, even if it should be obvious, rf is a pure function: when we call rf no side effects occur. So, to provide two items we can’t simply call rf twice; we have to call rf on the output of rf, and make sure we return it!
In this case, if we’ve accumulated some strings, we reset our accumulator and call rf with their concatenation. Then we call rf on this new result, or on the original res if we hadn’t accumulated anything, passing i.
It may become clearer if we replace rf with conj and res with [] (the empty vector).
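A quick REPL check with a toy input (the symbols stand in for the keywords-to-be):

```clojure
(transduce join-sequential-strings conj []
           ['msgid "" "hello\n" "world" 'msgstr "ciao"])
;; => [msgid "hello\nworld" msgstr "ciao"]
```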
With this, we can finally define parse-keys and end our little parser:
```clojure
(def ^:private keywordize-things
  (map #(if (string? %) % (keyword %))))

(defn- parse-keys [keys]
  (apply hash-map
         (transduce (comp join-sequential-strings
                          keywordize-things)
                    conj
                    []
                    ;; XXX: double hack for double fun!
                    (edn/read-string (str "[" (apply str (interpose " " keys)) "]")))))
```
keywordize-things is another transducer, one that turns everything but strings into keywords, and parse-keys composes these last two transducers to parse the entry; but it does so with a twist, by abusing edn/read-string.
In a PO file, after the comments each entry has a section like this:
```
msgid "message id"
```
that is, a keyword followed by a string. But the string can span multiple lines:
```
msgid ""
"hello\n"
"world"
```
To parse these situations, and to handle things like \n or \" inside the strings, I’m abusing the edn/read-string function. I’m concatenating all the lines, joining them with a space in between, and then wrapping the resulting string in “[” and “]” before calling the edn parser. This way, the edn parser will turn ‘msgid’ (for instance) into a symbol, and read every string for us.
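The gist of the hack, applied to the multi-line msgid above:

```clojure
;; the three raw lines of the multi-line msgid, joined and wrapped in [ ]
(edn/read-string
 (str "[" (apply str (interpose " " ["msgid \"\"" "\"hello\\n\"" "\"world\""])) "]"))
;; => [msgid "" "hello\n" "world"]
```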
Then we use the transducers defined before to join the strings and turn the symbols into keywords, and we have a proper parser. (Well, rewriting this hack will probably be the subject of a future post!)
A quick test:
```clojure
(parse-from-string "
#: lib/error.c:116
msgid \"Unknown system error\"
msgstr \"Errore sconosciuto del sistema\"

#: lib/error.c:116 lib/anothererror.c:134
msgid \"Known system error\"
msgstr \"Errore conosciuto del sistema\"
")
;; =>
;; [{:comments {:reference ("lib/error.c:116")}
;;   :keys {:msgid "Unknown system error"
;;          :msgstr "Errore sconosciuto del sistema"}}
;;  {:comments {:reference ("lib/error.c:116" "lib/anothererror.c:134")}
;;   :keys {:msgid "Known system error"
;;          :msgstr "Errore conosciuto del sistema"}}]
```
Yay! It works!
Writing an unparse function is also pretty easy, and is left as an exercise to the reader, because where I live it’s pretty late now and I want to sleep :P
To conclude, another nice property of parsers is that if you have an “unparse” operation (i.e. turning your data structure back into its textual representation), then the composition of the two should be the identity function. It’s a handy property for testing!
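For reference, here’s one possible sketch of such an unparse-to-string (hypothetical code, handling only the comment types parsed above; not necessarily what will end up in the library):

```clojure
(defn- unparse-comment [[type value]]
  (case type
    :reference          (str "#: " (str/join " " value))
    :flags              (str "#, " (str/join " " value))
    :translator-comment (str "# " value)
    :unknown-comment    value))

(defn- unparse-entry [{:keys [comments keys]}]
  (str/join "\n"
            (concat (map unparse-comment comments)
                    ;; pr-str re-quotes and re-escapes the strings, mirroring
                    ;; the edn/read-string trick used in parse-keys
                    (map (fn [[k v]] (str (name k) " " (pr-str v))) keys))))

(defn unparse-to-string
  "Turn parsed PO entries back into their textual representation."
  [entries]
  (str (str/join "\n\n" (map unparse-entry entries)) "\n"))
```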
```clojure
(let [x [{:comments {:reference '("lib/error.c:116")}
          :keys {:msgid "Unknown system error"
                 :msgstr "Errore sconosciuto del sistema"}}]]
  (= x
     (parse-from-string (unparse-to-string x))))
;; => true
```
That was all for this time! (For real this time.) Thanks for reading.