If it’s still not clear, I love writing parsers. A parser is a program that, given a stream of characters, builds a data structure: it’s able to give meaning to a stream of bytes! What can be more exciting than writing parsers?

Some time ago, I tried to use transducers to parse text/gemini files but, given my ignorance of how transducers work, the resulting code was more verbose than it really needed to be.

=> /post/parsing-gemtext-with-clojure.gmi Parsing gemtext with clojure

Today, I gave myself a second chance at building parsers on top of transducers, and I think the result is much cleaner and maybe even shorter than my text/gemini parser, even if the subject has a more complex grammar.

Today’s subject, as you may have guessed from the title of the entry, is PO files.

=> https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html GNU gettext description of PO files.

PO files are commonly used to hold translation data. The format, as described by the link above, is as follows:

``` example of PO file
white-space
# translator-comments
#. extracted-comments
#: reference...
#, flag...
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string
```

Inventing your own translation system almost never has a good outcome, especially when there are formats such as PO that are supported by a variety of tools, including nice GUIs such as poedit. The sad news is that in the Clojure ecosystem I couldn’t find what I personally consider a good option when it comes to managing translations.

There’s Tempura, written by Peter Taoussanis (who, by the way, maintains A LOT of cool libraries), but I don’t particularly like how it works, and I’d have to plug in a parser from/to PO by hand if I want the translators to use poedit (or similar software).

Another option is Pottery, which I overall like, but
* multiline translation strings are broken: I have a pending PR since September 2020 to fix it, but no reply as of the time of writing
* they switched to the Hippocratic License, which is NOT free software, so there are ethical implications (ironic, huh?)

=> https://github.com/ptaoussanis/tempura Tempura
=> https://github.com/brightin/pottery Pottery

So that’s why I’m rolling my own. It’s not complete yet, and I’ve just finished the first version of the PO parser/unparser, but I thought I’d write a literate-programming-esque post describing how I’m parsing PO files using transducers.

DISCLAIMER: the code has not been heavily tested yet, so it may misbehave. It’s just for demonstration purposes (for the moment).

```clojure
(ns op.rtr.po
  "Utilities to parse PO files."
  (:require
   [clojure.edn :as edn]
   [clojure.string :as str])
  (:import
   (java.io StringWriter)))
```

Well, we’ve got a nice palindrome namespace, which is good, and we’re requiring a few things. clojure.string is quite obvious, since we’re gonna play with strings a lot. We’ll also (ab)use clojure.edn during the parsing. StringWriter is imported only to provide a convenience function for parsing PO from strings. It will also come in handy for testing purposes.

The body of this library is the parser transducer, which is made of a bunch of small functions that do simple things.

```clojure
(def ^:private split-on-blank
  "Transducer that splits on blank lines."
  (partition-by #(= % "")))
```

The split-on-blank transducer groups sequential blank lines and sequential non-blank lines together; this way we can separate each entry in the file.

```clojure
(def ^:private remove-empty-lines
  "Transducer that removes groups of empty lines."
  (filter #(not= "" (first %))))
```

The remove-empty-lines transducer simply removes the garbage that split-on-blank produces: it gets rid of the blocks of empty lines, so we’re left with only the groups of lines that make up the entries.
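
To see what these two steps do together, here’s a small REPL sketch on a made-up list of lines. The sample data is mine, and the two transducers are inlined so that the snippet runs on its own:

```clojure
;; Made-up input: two entries separated by two blank lines.
(def sample-lines
  ["#: a.c:1" "msgid \"a\"" "" "" "#: b.c:2" "msgid \"b\""])

;; Same shape as split-on-blank and remove-empty-lines, inlined here.
(def blocks
  (into []
        (comp (partition-by #(= % ""))
              (filter #(not= "" (first %))))
        sample-lines))

blocks
;; => [["#: a.c:1" "msgid \"a\""] ["#: b.c:2" "msgid \"b\""]]
```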

```clojure
(declare parse-comments)
(declare parse-keys)

(def ^:private parse-entries
  (let [comment-line? (fn [line] (str/starts-with? line "#"))]
    (map (fn [lines]
           (let [[comments keys] (partition-by comment-line? lines)]
             {:comments (parse-comments comments)
              :keys (parse-keys keys)})))))
```

Ignoring parse-comments and parse-keys for a moment, this step takes a block of lines that constitutes an entry and parses it into a map of comments and keys, using partition-by to split the lines of the entry in two.
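
One caveat: the destructuring assumes that the entry starts with at least one comment line. If it doesn’t, partition-by yields a single group and the keys end up in the comments slot. A possible alternative, sketched here with split-with on a made-up entry (this is not what the library currently does):

```clojure
(require '[clojure.string :as str])

(defn comment-line? [line] (str/starts-with? line "#"))

;; Made-up entry without any comment line.
(def entry ["msgid \"a\"" "msgstr \"b\""])

;; partition-by: the only group lands in `comments`, `keys` is nil.
(let [[comments keys] (partition-by comment-line? entry)]
  [comments keys])
;; => [("msgid \"a\"" "msgstr \"b\"") nil]

;; split-with always returns two parts, so the comments may just be empty.
(let [[comments keys] (split-with comment-line? entry)]
  [comments keys])
;; => [() ("msgid \"a\"" "msgstr \"b\"")]
```

Both versions still assume that the comments come before the keys, which is the layout described in the gettext manual.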

Now that we have every piece, we can define the parser!

```clojure
(def ^:private parser
  (comp split-on-blank
        remove-empty-lines
        parse-entries))
```

We can provide a nice API to parse PO files from various sources very easily:

```clojure
(defn parse
  "Parse the PO file given as stream of lines `l`."
  [l]
  (transduce parser conj [] l))

(defn parse-from-reader
  "Parse the PO file given in reader `rdr`. `rdr` must be a `java.io.BufferedReader`."
  [rdr]
  (parse (line-seq rdr)))

(defn parse-from-string
  "Parse the PO file given as string."
  [s]
  (parse (str/split-lines s)))
```
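
For instance, a suitable reader can be obtained with clojure.java.io/reader, which hands back a BufferedReader. The sketch below reads from a StringReader and uses a trivial stand-in for parse (collect-lines, a name I made up, which just collects the lines) so that it runs on its own; with a real file you would pass its path to io/reader instead. Since transduce is eager, everything is consumed before with-open closes the reader.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io StringReader))

;; Stand-in for `parse`, only for this sketch: collects the lines eagerly.
(defn collect-lines [lines]
  (transduce identity conj [] lines))

(def result
  (with-open [rdr (io/reader (StringReader. "msgid \"a\"\nmsgstr \"b\""))]
    (collect-lines (line-seq rdr))))

result
;; => ["msgid \"a\"" "msgstr \"b\""]
```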

And we’re done. This was all for this time. Bye!

Well, no… I still haven’t provided the implementation for parse-comments and parse-keys. To be honest, they’re quite ugly. parse-keys in particular is the ugliest part of the library as of now, but y’know what? We’re in 2021 now, if it runs, ship it!

Jokes aside, I should refactor these into something more manageable, but I will focus on the rest of the library first.

parse-comments takes a block of comment lines and tries to make sense out of it.

```clojure
(defn- parse-comments [comments]
  (into {}
        (for [comment comments]
          (let [len (count comment)
                proper? (>= len 2)
                start (when proper? (subs comment 0 2))
                rest (when proper? (subs comment 2))
                remove-empty #(filter (partial not= "") %)]
            (case start
              "#:" [:reference (remove-empty (str/split rest #" +"))]
              "#," [:flags (remove-empty (str/split rest #" +"))]
              "# " [:translator-comment rest]
              ;; TODO: add other types
              [:unknown-comment comment])))))
```

We simply loop through each line and do some simple pattern matching on the first two characters of each. We then collect all those two-element vectors into a single hash map. I should probably refactor this to use group-by to avoid losing information: say an entry has two reference comments, we would lose one of the two.
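
A minimal sketch of the problem and of one possible fix, on made-up pairs. It uses merge-with instead of group-by, which is just my choice for the sketch, and it assumes every value is a vector (translator comments, which are plain strings, would need to be wrapped first):

```clojure
;; Made-up output of the `for` above: two reference comments in one entry.
(def pairs
  [[:reference ["a.c:1"]]
   [:reference ["b.c:2"]]
   [:flags ["fuzzy"]]])

;; into {} keeps only the last value seen for each key.
(into {} pairs)
;; => {:reference ["b.c:2"], :flags ["fuzzy"]}

;; Turning each pair into a one-entry map and merging with `into`
;; concatenates the values instead of overwriting them.
(def merged
  (apply merge-with into (map (fn [[k v]] {k v}) pairs)))

merged
;; => {:reference ["a.c:1" "b.c:2"], :flags ["fuzzy"]}
```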

To define parse-keys we need a helper, join-sequential-strings:

```clojure
(defn- join-sequential-strings [rf]
  (let [acc (volatile! nil)]
    (fn
      ([] (rf))
      ([res] (if-let [a @acc]
               (do (vreset! acc nil)
                   ;; flush what's left, then let rf complete
                   (rf (unreduced (rf res (apply str a)))))
               (rf res)))
      ([res i]
       (if (string? i)
         ;; fnil turns the initial nil into a vector, so that conj
         ;; appends and the strings keep their order
         (do (vswap! acc (fnil conj []) i)
             res)
         (rf (or (when-let [a @acc]
                   (vreset! acc nil)
                   (rf res (apply str a)))
                 res)
             i))))))
```

The thing about this post, compared to the one about text/gemini, is that I’m becoming more comfortable with transducers, and I’m starting to use the standard library more and more. In fact, this is the only hand-written transducer we’ve seen so far.
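
For comparison, I believe the same grouping could also be assembled from core transducers alone. This is only a sketch of an alternative on a made-up input, with a name (join-strings-alt) I invented; it is not what the library uses:

```clojure
;; Hypothetical alternative to the hand-written transducer:
;; group runs of strings, then collapse each run into one string.
(def join-strings-alt
  (comp (partition-by string?)
        (mapcat (fn [group]
                  (if (string? (first group))
                    [(apply str group)]
                    group)))))

(def joined
  (into [] join-strings-alt '[msgid "" "hello\n" "world" msgstr "ciao"]))

joined
;; => [msgid "hello\nworld" msgstr "ciao"]
```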

Like every respectable stateful transducer, join-sequential-strings allocates its state using volatile!. rf is the reducing function, and our transducer function is the one with three arities inside the let.

The one-arity branch is called to signal the end of the stream: the transducing process has reached the end of the sequence and calls us with the accumulated result ‘res’. There we flush our accumulator, if we had something accumulated, and call the reducing function on the result to end.

The two-arity branch is called on each item that is fed to the transducer. The first argument, res, is the accumulated result, and i is the current item: if it’s a string, we accumulate it into acc, otherwise we drain our accumulator and pass i to rf as-is.

One important thing I learned writing it is that, even if it should be obvious, rf has to be treated as a pure function: calling rf doesn’t modify res, it returns a new result. So, to provide two items we can’t simply call rf two times: we have to call rf on the output of rf, and make sure we return it!

In this case, if we’ve accumulated some strings, we reset our accumulator and call rf on their concatenation. Then we call rf on this new result, or on the original res if we haven’t accumulated anything, passing i.

It may become clearer if we replace rf with conj and res with [] (the empty vector).
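
Concretely, the non-string branch with those substitutions boils down to something like this (the accumulated strings and the item are made up):

```clojure
;; Pretend we accumulated these strings, then a non-string item arrived.
(def acc ["hello\n" "world"])
(def i 'msgstr)

;; Calling conj twice on [] and ignoring the results changes nothing,
;; because [] is immutable. Each call has to be fed the result of the
;; previous one.
(def res (conj (conj [] (apply str acc)) i))

res
;; => ["hello\nworld" msgstr]
```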

With this, we can finally define parse-keys and finish our little parser:

```clojure
(def ^:private keywordize-things
  (map #(if (string? %) % (keyword %))))

(defn- parse-keys [keys]
  (apply hash-map
         (transduce (comp join-sequential-strings
                          keywordize-things)
                    conj
                    []
                    ;; XXX: double hack for double fun!
                    (edn/read-string (str "[" (apply str (interpose " " keys)) "]")))))
```

keywordize-things is another transducer that turns everything but strings into a keyword, and parse-keys composes these last two transducers to parse the entry; but it does so with a twist, by abusing edn/read-string.

In a PO file, after the comments each entry has a section like this:
```
msgid "message id"
```
that is, a keyword followed by a string. But the string can span multiple lines:
```
msgid ""
"hello\n"
"world"
```

To parse this situation, and to handle things like \n or \" inside the strings, I’m abusing the edn/read-string function. I’m joining all the lines with a space in between, and then wrapping the result in “[” and “]” before calling the edn parser. This way, the edn parser will turn ‘msgid’ (for instance) into a symbol, and read every string for us.
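
Here’s the trick in isolation, on the multi-line example above, as a sketch that runs on its own at a REPL:

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

;; The three lines of the multi-line msgid shown above.
(def key-lines ["msgid \"\"" "\"hello\\n\"" "\"world\""])

;; Join with spaces, wrap in brackets, and let the edn reader do the work.
(def tokens
  (edn/read-string (str "[" (str/join " " key-lines) "]")))

tokens
;; => [msgid "" "hello\n" "world"]
```

The symbol and the three strings come out as separate items: joining the strings and keywordizing the symbol is left to the two transducers.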

Then we use the transducers defined before to join the strings and turn the symbols into keywords, and we have a proper parser. (Well, rewriting this hack will probably be the subject of a future post!)

A quick test:

```clojure
(parse-from-string "

#: lib/error.c:116
msgid \"Unknown system error\"
msgstr \"Errore sconosciuto del sistema\"

#: lib/error.c:116 lib/anothererror.c:134
msgid \"Known system error\"
msgstr \"Errore conosciuto del sistema\"

")
;; =>
;; [{:comments {:reference ("lib/error.c:116")}
;;   :keys {:msgid "Unknown system error"
;;          :msgstr "Errore sconosciuto del sistema"}}
;;  {:comments {:reference ("lib/error.c:116" "lib/anothererror.c:134")}
;;   :keys {:msgid "Known system error"
;;          :msgstr "Errore conosciuto del sistema"}}]
```

Yay! It works!

Writing an unparse function is also pretty easy, and is left as an exercise for the reader, because where I live now it’s pretty late and I want to sleep :P

To conclude, another nice property of parsers is that if you have an “unparse” operation (i.e. turning your data structure back into its textual representation), then the composition of the two should be the identity function. It’s a handy property for testing!

```clojure
(let [x [{:comments {:reference '("lib/error.c:116")}
          :keys {:msgid "Unknown system error"
                 :msgstr "Errore sconosciuto del sistema"}}]]
  (= x
     (parse-from-string (unparse-to-string x))))
;; => true
```
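
Without spoiling the whole exercise, here’s a rough, self-contained sketch of the same round-trip check restricted to the :keys map. Both unparse-keys and parse-keys-sketch are names I made up for this example, and the sketch always writes each string on a single line:

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

;; Hypothetical helper, covering only the :keys map of an entry.
;; pr-str takes care of escaping quotes and newlines.
(defn unparse-keys [m]
  (str/join "\n" (for [[k v] m]
                   (str (name k) " " (pr-str v)))))

;; Minimal reader for the same subset, using the edn trick again.
(defn parse-keys-sketch [s]
  (into {}
        (map (fn [[k v]] [(keyword k) v]))
        (partition 2 (edn/read-string (str "[" s "]")))))

(def round-trip-ok?
  (let [m {:msgid "Unknown \"system\" error\n"
           :msgstr "Errore \"sconosciuto\" del sistema\n"}]
    (= m (parse-keys-sketch (unparse-keys m)))))

round-trip-ok?
;; => true
```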

This was all for this time! (For real this time.) Thanks for reading.