Blame


1 179c4a70 2021-08-03 op EDIT April 2021: I have revisited this parser and published it as a library. The version presented here is not only overly-complex, but also overly-verbose.
2 179c4a70 2021-08-03 op
3 179c4a70 2021-08-03 op => /post/gemtext-clojure.gmi Parsing text/gemini with Clojure revisited
4 179c4a70 2021-08-03 op
5 179c4a70 2021-08-03 op
6 791e96fe 2020-10-03 op I've written various parsers in the past. From simplistic stuff in Korn shell, to hand written recursive descendant parsers in Go and or C, to yacc-based parsers in C. I even played with Alex and Happy in Haskell, but that was ages ago and I don't recall anything.
7 791e96fe 2020-10-03 op
8 791e96fe 2020-10-03 op One thing that I never tried was to write a parser in Clojure. Until now.
9 791e96fe 2020-10-03 op
10 791e96fe 2020-10-03 op This post is an experiment: I'm trying to do some literate programming in gemtext (the Gemini protocol "native" response format, text/gemini). If you extract all the block codes you should end up with the same gemtext parser I'm using.
11 791e96fe 2020-10-03 op
12 791e96fe 2020-10-03 op => https://git.omarpolo.com/blog/tree/src/blog/gemtext.clj?id=965388145931751bf314276404816f631c27d01d gemtext.clj (as at the time of writing)
13 791e96fe 2020-10-03 op => gemini://gemini.circumlunar.space/docs/specification.gmi Gemini specification (over Gemini)
14 791e96fe 2020-10-03 op => https://gemini.circumlunar.space/docs/specification.html Gemini specification (over HTTP)
15 791e96fe 2020-10-03 op
16 791e96fe 2020-10-03 op One last note before the implementation: while reading the code, please keep in mind that I really wanted to play with the transducers. There are probably more efficient or neater way to parse stuff in clojure, but I didn't want to write an efficient or fast parser. I wanted to have some fun with the transducers!
17 791e96fe 2020-10-03 op
18 791e96fe 2020-10-03 op Given that gemtext is a line-oriented protocol, I thought to split the input in into lines and use some transducers magic to end up with a neat hiccup-like data structure.
19 791e96fe 2020-10-03 op
20 791e96fe 2020-10-03 op Now we can begin. The aren't third parties dependencies here:
21 791e96fe 2020-10-03 op
22 791e96fe 2020-10-03 op ```
23 791e96fe 2020-10-03 op (ns blog.gemtext
24 791e96fe 2020-10-03 op (:require
25 791e96fe 2020-10-03 op [clojure.string :as str]
26 791e96fe 2020-10-03 op [clojure.walk :as walk]))
27 791e96fe 2020-10-03 op ```
28 791e96fe 2020-10-03 op
29 791e96fe 2020-10-03 op We'll use only clojure.string and clojure.walk, as well as the standard library.
30 791e96fe 2020-10-03 op
31 791e96fe 2020-10-03 op We also need one helper function, starts-with?: it's a wrapper around clojure.string/starts-with?. The need for such function will be clear later.
32 791e96fe 2020-10-03 op
33 791e96fe 2020-10-03 op ```
34 791e96fe 2020-10-03 op (defn- starts-with?
35 791e96fe 2020-10-03 op "check if `s` starts with `substr`. Return `false` if `s` is not a
36 791e96fe 2020-10-03 op string."
37 791e96fe 2020-10-03 op [s substr]
38 791e96fe 2020-10-03 op (when (string? s)
39 791e96fe 2020-10-03 op (str/starts-with? s substr)))
40 791e96fe 2020-10-03 op ```
41 791e96fe 2020-10-03 op
42 791e96fe 2020-10-03 op Every syntactical element of gemtext will be parsed by its own little transducer. The transducer chain will be fed with a stream of lines, and will return a stream of hiccup-like data structure. Let's begin with the most elaborate one: the match-code-blocks transducer:
43 791e96fe 2020-10-03 op
44 791e96fe 2020-10-03 op ```
45 791e96fe 2020-10-03 op (defn- match-code-blocks []
46 791e96fe 2020-10-03 op (fn [rf]
47 791e96fe 2020-10-03 op (let [acc (volatile! [])
48 791e96fe 2020-10-03 op state (volatile! :normal)]
49 791e96fe 2020-10-03 op (fn
50 791e96fe 2020-10-03 op ([] (rf))
51 791e96fe 2020-10-03 op ([result] (rf result))
52 791e96fe 2020-10-03 op ([result line]
53 791e96fe 2020-10-03 op (let [in-verbatim? (= :verbatim @state)
54 791e96fe 2020-10-03 op marker? (starts-with? line "```")]
55 791e96fe 2020-10-03 op (cond
56 791e96fe 2020-10-03 op (and (not in-verbatim?) marker?) ;go to verbatim
57 791e96fe 2020-10-03 op (do (vreset! state :verbatim)
58 791e96fe 2020-10-03 op result)
59 791e96fe 2020-10-03 op
60 791e96fe 2020-10-03 op ;; return what we've got and go to :normal
61 791e96fe 2020-10-03 op (and in-verbatim? marker?)
62 791e96fe 2020-10-03 op (let [res [:verbatim (str/join "\n" @acc)]]
63 791e96fe 2020-10-03 op (vreset! state :normal)
64 791e96fe 2020-10-03 op (vreset! acc [])
65 791e96fe 2020-10-03 op (rf result res))
66 791e96fe 2020-10-03 op
67 791e96fe 2020-10-03 op in-verbatim?
68 791e96fe 2020-10-03 op (do (vswap! acc conj line)
69 791e96fe 2020-10-03 op result)
70 791e96fe 2020-10-03 op
71 791e96fe 2020-10-03 op :else
72 791e96fe 2020-10-03 op (rf result line))))))))
73 791e96fe 2020-10-03 op ```
74 791e96fe 2020-10-03 op
75 791e96fe 2020-10-03 op Woah, that was a lot. We defined a function, that returns a function that takes a reducing-function rf. This sets up some local state (acc and state variables), and returns another function!
76 791e96fe 2020-10-03 op
77 791e96fe 2020-10-03 op That's a lot of functions.
78 791e96fe 2020-10-03 op
79 791e96fe 2020-10-03 op In the innermost function, only the 2-arity branch is interesting, the other two are just scaffolding. There we check if we've got a line with the ``` marker, and if so we switch between the "capturing state", where we capture all the lines in a code block, and an "identity state", where we pass what we've read unconditionally.
80 791e96fe 2020-10-03 op
81 791e96fe 2020-10-03 op When we switch from capturing to identity state, we also return a vector of `[:verbatim "matched lines"]`.
82 791e96fe 2020-10-03 op
83 791e96fe 2020-10-03 op The important thing here is to return the line we got as-is if it's not a marker or within the two markers.
84 791e96fe 2020-10-03 op
85 791e96fe 2020-10-03 op The rest is similar to this, but maybe simpler. In retrospect, I should have wrote some macro to reduce the boilerplate.
86 791e96fe 2020-10-03 op
87 791e96fe 2020-10-03 op Now, let's parse the headings. gemtext has three types of headings, with a syntax (and meaning) similar to markdown.
88 791e96fe 2020-10-03 op
89 791e96fe 2020-10-03 op ```
90 791e96fe 2020-10-03 op (defn- match-headings []
91 791e96fe 2020-10-03 op (fn [rf]
92 791e96fe 2020-10-03 op (fn
93 791e96fe 2020-10-03 op ([] (rf))
94 791e96fe 2020-10-03 op ([result] (rf result))
95 791e96fe 2020-10-03 op ([result line]
96 791e96fe 2020-10-03 op (rf result
97 791e96fe 2020-10-03 op (cond
98 791e96fe 2020-10-03 op ;; the space character after the # is madatory
99 791e96fe 2020-10-03 op (starts-with? line "# ") [:h1 (subs line 2)]
100 791e96fe 2020-10-03 op (starts-with? line "## ") [:h2 (subs line 3)]
101 791e96fe 2020-10-03 op (starts-with? line "### ") [:h3 (subs line 4)]
102 791e96fe 2020-10-03 op :else line))))))
103 791e96fe 2020-10-03 op ```
104 791e96fe 2020-10-03 op
105 791e96fe 2020-10-03 op There are definitely similarities with the previous example, but here you'll understand why I defined starts-with? instead of using directly clojure.string/starts-with?. The "line" we get here can be a string. Or can be a hiccup form. In fact, every transducer will se as input the output of the transducers that run before him. The helper function saves us from checking every time if the input is a string or not.
106 791e96fe 2020-10-03 op
107 791e96fe 2020-10-03 op Now, some of the other syntactical elements are so similar to parse that I wrote a generic matcher:
108 791e96fe 2020-10-03 op
109 791e96fe 2020-10-03 op ```
110 791e96fe 2020-10-03 op (defn- generic-matcher
111 791e96fe 2020-10-03 op "Return a generic matcher transducer. Will wrap line that starts with
112 791e96fe 2020-10-03 op `start` within `[type line]`."
113 791e96fe 2020-10-03 op [start type]
114 791e96fe 2020-10-03 op (fn [rf]
115 791e96fe 2020-10-03 op (fn
116 791e96fe 2020-10-03 op ([] (rf))
117 791e96fe 2020-10-03 op ([result] (rf result))
118 791e96fe 2020-10-03 op ([result line]
119 791e96fe 2020-10-03 op (rf result
120 791e96fe 2020-10-03 op (if (starts-with? line start)
121 791e96fe 2020-10-03 op [type (subs line (count start))]
122 791e96fe 2020-10-03 op line))))))
123 791e96fe 2020-10-03 op ```
124 791e96fe 2020-10-03 op
125 791e96fe 2020-10-03 op and I've used it to match the lists and the quotes:
126 791e96fe 2020-10-03 op
127 791e96fe 2020-10-03 op ```
128 791e96fe 2020-10-03 op (defn- match-lists [] (generic-matcher "* " :li))
129 791e96fe 2020-10-03 op (defn- match-blockquotes [] (generic-matcher "> " :blockquote))
130 791e96fe 2020-10-03 op ```
131 791e96fe 2020-10-03 op
132 791e96fe 2020-10-03 op In case you didn't know, lists in gemtext starts with a star * followed by at least one space, followed by arbitrary text. The quote are similar, except that they begins with >.
133 791e96fe 2020-10-03 op
134 791e96fe 2020-10-03 op Matching the links is a little bit difficult:
135 791e96fe 2020-10-03 op
136 791e96fe 2020-10-03 op ```
137 791e96fe 2020-10-03 op (defn- match-links []
138 791e96fe 2020-10-03 op (fn [rf]
139 791e96fe 2020-10-03 op (fn
140 791e96fe 2020-10-03 op ([] (rf))
141 791e96fe 2020-10-03 op ([result] (rf result))
142 791e96fe 2020-10-03 op ([result line]
143 791e96fe 2020-10-03 op (let [spaces? #{\space \tab}
144 791e96fe 2020-10-03 op nonblank? (complement spaces?)]
145 791e96fe 2020-10-03 op (rf result
146 791e96fe 2020-10-03 op (if-not (starts-with? line "=>")
147 791e96fe 2020-10-03 op line
148 791e96fe 2020-10-03 op (->> (seq line)
149 791e96fe 2020-10-03 op (drop 2) ; drop the marker
150 791e96fe 2020-10-03 op (drop-while spaces?) ; drop also the optional spaces
151 791e96fe 2020-10-03 op (split-with nonblank?) ; separate URL from (optional) label
152 791e96fe 2020-10-03 op (apply #(vector :link
153 791e96fe 2020-10-03 op (apply str %1)
154 791e96fe 2020-10-03 op (apply str (drop-while spaces? %2))))))))))))
155 791e96fe 2020-10-03 op ```
156 791e96fe 2020-10-03 op
157 791e96fe 2020-10-03 op First of all, unlike in HTML, links on gemini aren't inline entities. Second, their syntax is an "arrow" `=>` eventually followed by spaces, followed by the URL, and an optional description.
158 791e96fe 2020-10-03 op
159 791e96fe 2020-10-03 op In the previous function, if we match a line that starts with "=>" we start split it into the URL and the (optional) description.
160 791e96fe 2020-10-03 op
161 791e96fe 2020-10-03 op (seq line) will return a sequence of character, from which we remove the marker. Then we split this in the URL and description. To keep the implementation easy an URL is just a sequence of bytes other than a space or tab. We also drop any blanks from the description.
162 791e96fe 2020-10-03 op
163 791e96fe 2020-10-03 op How cool are the threading macros? (also note that we used a set as a function for even more style points!)
164 791e96fe 2020-10-03 op
165 791e96fe 2020-10-03 op Anyway, the only missing piece is matching the paragraphs. Given that we match on every syntactical entities present in the specification (as of now at least), every non-matched line is a paragraph.
166 791e96fe 2020-10-03 op
167 791e96fe 2020-10-03 op ```
168 791e96fe 2020-10-03 op (defn match-paragraphs [] (generic-matcher "" :paragraph))
169 791e96fe 2020-10-03 op ```
170 791e96fe 2020-10-03 op
171 791e96fe 2020-10-03 op Done!
172 791e96fe 2020-10-03 op
173 791e96fe 2020-10-03 op Now we only need to chain these transducer together. Keeping in mind that the order is important, here's the chain:
174 791e96fe 2020-10-03 op
175 791e96fe 2020-10-03 op ```
176 791e96fe 2020-10-03 op (def parser
177 791e96fe 2020-10-03 op (comp (match-code-blocks)
178 791e96fe 2020-10-03 op (match-headings)
179 791e96fe 2020-10-03 op (match-lists)
180 791e96fe 2020-10-03 op (match-blockquotes)
181 791e96fe 2020-10-03 op (match-links)
182 791e96fe 2020-10-03 op (match-paragraphs)))
183 791e96fe 2020-10-03 op ```
184 791e96fe 2020-10-03 op
185 791e96fe 2020-10-03 op Lines will flow from the first transducer (match-code-blocks) towards the end, being enriched at every step. Beautiful!
186 791e96fe 2020-10-03 op
187 791e96fe 2020-10-03 op Parsing is thus dead-simple now that we've got every piece:
188 791e96fe 2020-10-03 op
189 791e96fe 2020-10-03 op ```
190 791e96fe 2020-10-03 op (defn parse
191 791e96fe 2020-10-03 op "Given a string representing a gemtext document, parse it into an
192 791e96fe 2020-10-03 op hiccup-like data structure."
193 791e96fe 2020-10-03 op [str]
194 791e96fe 2020-10-03 op (transduce parser conj [] (str/split-lines str)))
195 791e96fe 2020-10-03 op ```
196 791e96fe 2020-10-03 op
197 791e96fe 2020-10-03 op We use our chain (the parser), fed with the lines from str, and conj everything into []. The empty vector is important, because if you use a list or nil, due to how conj works, you'll get the parsed document in reverse order.
198 791e96fe 2020-10-03 op
199 791e96fe 2020-10-03 op Aaaand that's all! As a final bonus, if you reached this point, here's an implementation of unparse, a function that takes our hiccup-like format and returns a string, and to-hiccup to translate our hiccup to HTML-hiccup.
200 791e96fe 2020-10-03 op
201 791e96fe 2020-10-03 op ```
202 791e96fe 2020-10-03 op (defn unparse [thing]
203 791e96fe 2020-10-03 op (let [sw (StringBuilder.)]
204 791e96fe 2020-10-03 op (walk/prewalk
205 791e96fe 2020-10-03 op (fn [t]
206 791e96fe 2020-10-03 op (cond
207 791e96fe 2020-10-03 op (nil? t) nil
208 791e96fe 2020-10-03 op
209 791e96fe 2020-10-03 op (or (seq? t)
210 791e96fe 2020-10-03 op (vector? t))
211 791e96fe 2020-10-03 op (if-not (keyword? (first t))
212 791e96fe 2020-10-03 op t
213 791e96fe 2020-10-03 op (let [[type a b] t]
214 791e96fe 2020-10-03 op (.append sw
215 791e96fe 2020-10-03 op (case type
216 791e96fe 2020-10-03 op :verbatim (str "```\n" a "\n```")
217 791e96fe 2020-10-03 op :h1 (str "# " a)
218 791e96fe 2020-10-03 op :h2 (str "## " a)
219 791e96fe 2020-10-03 op :h3 (str "### " a)
220 791e96fe 2020-10-03 op :li (str "* " a)
221 791e96fe 2020-10-03 op :blockquote (str "> " a)
222 791e96fe 2020-10-03 op :link (str "=> " a " " b)
223 791e96fe 2020-10-03 op :paragraph a))
224 791e96fe 2020-10-03 op (.append sw "\n")
225 791e96fe 2020-10-03 op nil))))
226 791e96fe 2020-10-03 op thing)
227 791e96fe 2020-10-03 op (.toString sw)))
228 791e96fe 2020-10-03 op ```
229 791e96fe 2020-10-03 op
230 791e96fe 2020-10-03 op I've used clojure.walk/prewalk to traverse the input, as it makes easy to navigate inside a nested data structure. The idea is that if we get a sequence whose first element is a keyword, that's a "tag", otherwise we recursively inspect its content. If it's a tag, we convert it to a string, pushing it into a StringBuilder.
231 791e96fe 2020-10-03 op
232 791e96fe 2020-10-03 op The to-hiccup function is practically the same
233 791e96fe 2020-10-03 op
234 791e96fe 2020-10-03 op ```
235 791e96fe 2020-10-03 op (defn to-hiccup [doc]
236 791e96fe 2020-10-03 op (let [l (atom [])]
237 791e96fe 2020-10-03 op (walk/prewalk
238 791e96fe 2020-10-03 op (fn [t]
239 791e96fe 2020-10-03 op (cond
240 791e96fe 2020-10-03 op (nil? t) nil
241 791e96fe 2020-10-03 op
242 791e96fe 2020-10-03 op (or (seq? t)
243 791e96fe 2020-10-03 op (vector? t))
244 791e96fe 2020-10-03 op (if-not (keyword? (first t))
245 791e96fe 2020-10-03 op t
246 791e96fe 2020-10-03 op (let [[type a b] t]
247 791e96fe 2020-10-03 op (swap! l conj
248 791e96fe 2020-10-03 op (case type
249 791e96fe 2020-10-03 op :verbatim [:pre [:code a]]
250 791e96fe 2020-10-03 op :h1 [:h1 a]
251 791e96fe 2020-10-03 op :h2 [:h2 a]
252 791e96fe 2020-10-03 op :h3 [:h3 a]
253 791e96fe 2020-10-03 op :li [:ul [:li a]] ;; TODO!
254 791e96fe 2020-10-03 op :blockquote [:blockquote a]
255 791e96fe 2020-10-03 op :link [:p.link [:a {:href a} b]]
256 791e96fe 2020-10-03 op :paragraph [:p a]))
257 791e96fe 2020-10-03 op nil))))
258 791e96fe 2020-10-03 op doc)
259 791e96fe 2020-10-03 op (seq @l)))
260 791e96fe 2020-10-03 op ```
261 791e96fe 2020-10-03 op
262 791e96fe 2020-10-03 op The only thing that's missing from to-hiccup is the (html/gemini) list handling. That is, while on gemini you only have "list item", in HTML the lists item must be wrapped inside a container tag, ul or ol. Since I'm not using lists that much, I don't care at the moment. One can probably improve it by doing some post-processing on the content of @l and grouping every :li.
263 791e96fe 2020-10-03 op
264 791e96fe 2020-10-03 op But that's really all.
265 791e96fe 2020-10-03 op
266 791e96fe 2020-10-03 op I'm using this code to parse the posts (so that they can be rendered also in HTML), and to build every gemini page.