Blob


1 EDIT April 2021: I have revisited this parser and published it as a library. The version presented here is not only overly-complex, but also overly-verbose.
3 => /post/gemtext-clojure.gmi Parsing text/gemini with Clojure revisited
6 I've written various parsers in the past. From simplistic stuff in Korn shell, to hand written recursive descendant parsers in Go and or C, to yacc-based parsers in C. I even played with Alex and Happy in Haskell, but that was ages ago and I don't recall anything.
8 One thing that I never tried was to write a parser in Clojure. Until now.
10 This post is an experiment: I'm trying to do some literate programming in gemtext (the Gemini protocol "native" response format, text/gemini). If you extract all the block codes you should end up with the same gemtext parser I'm using.
12 => https://git.omarpolo.com/blog/tree/src/blog/gemtext.clj?id=965388145931751bf314276404816f631c27d01d gemtext.clj (as at the time of writing)
13 => gemini://gemini.circumlunar.space/docs/specification.gmi Gemini specification (over Gemini)
14 => https://gemini.circumlunar.space/docs/specification.html Gemini specification (over HTTP)
16 One last note before the implementation: while reading the code, please keep in mind that I really wanted to play with the transducers. There are probably more efficient or neater way to parse stuff in clojure, but I didn't want to write an efficient or fast parser. I wanted to have some fun with the transducers!
18 Given that gemtext is a line-oriented protocol, I thought to split the input in into lines and use some transducers magic to end up with a neat hiccup-like data structure.
20 Now we can begin. The aren't third parties dependencies here:
22 ```
23 (ns blog.gemtext
24 (:require
25 [clojure.string :as str]
26 [clojure.walk :as walk]))
27 ```
29 We'll use only clojure.string and clojure.walk, as well as the standard library.
31 We also need one helper function, starts-with?: it's a wrapper around clojure.string/starts-with?. The need for such function will be clear later.
33 ```
34 (defn- starts-with?
35 "check if `s` starts with `substr`. Return `false` if `s` is not a
36 string."
37 [s substr]
38 (when (string? s)
39 (str/starts-with? s substr)))
40 ```
42 Every syntactical element of gemtext will be parsed by its own little transducer. The transducer chain will be fed with a stream of lines, and will return a stream of hiccup-like data structure. Let's begin with the most elaborate one: the match-code-blocks transducer:
44 ```
45 (defn- match-code-blocks []
46 (fn [rf]
47 (let [acc (volatile! [])
48 state (volatile! :normal)]
49 (fn
50 ([] (rf))
51 ([result] (rf result))
52 ([result line]
53 (let [in-verbatim? (= :verbatim @state)
54 marker? (starts-with? line "```")]
55 (cond
56 (and (not in-verbatim?) marker?) ;go to verbatim
57 (do (vreset! state :verbatim)
58 result)
60 ;; return what we've got and go to :normal
61 (and in-verbatim? marker?)
62 (let [res [:verbatim (str/join "\n" @acc)]]
63 (vreset! state :normal)
64 (vreset! acc [])
65 (rf result res))
67 in-verbatim?
68 (do (vswap! acc conj line)
69 result)
71 :else
72 (rf result line))))))))
73 ```
75 Woah, that was a lot. We defined a function, that returns a function that takes a reducing-function rf. This sets up some local state (acc and state variables), and returns another function!
77 That's a lot of functions.
79 In the innermost function, only the 2-arity branch is interesting, the other two are just scaffolding. There we check if we've got a line with the ``` marker, and if so we switch between the "capturing state", where we capture all the lines in a code block, and an "identity state", where we pass what we've read unconditionally.
81 When we switch from capturing to identity state, we also return a vector of `[:verbatim "matched lines"]`.
83 The important thing here is to return the line we got as-is if it's not a marker or within the two markers.
85 The rest is similar to this, but maybe simpler. In retrospect, I should have wrote some macro to reduce the boilerplate.
87 Now, let's parse the headings. gemtext has three types of headings, with a syntax (and meaning) similar to markdown.
89 ```
90 (defn- match-headings []
91 (fn [rf]
92 (fn
93 ([] (rf))
94 ([result] (rf result))
95 ([result line]
96 (rf result
97 (cond
98 ;; the space character after the # is madatory
99 (starts-with? line "# ") [:h1 (subs line 2)]
100 (starts-with? line "## ") [:h2 (subs line 3)]
101 (starts-with? line "### ") [:h3 (subs line 4)]
102 :else line))))))
103 ```
105 There are definitely similarities with the previous example, but here you'll understand why I defined starts-with? instead of using directly clojure.string/starts-with?. The "line" we get here can be a string. Or can be a hiccup form. In fact, every transducer will se as input the output of the transducers that run before him. The helper function saves us from checking every time if the input is a string or not.
107 Now, some of the other syntactical elements are so similar to parse that I wrote a generic matcher:
109 ```
110 (defn- generic-matcher
111 "Return a generic matcher transducer. Will wrap line that starts with
112 `start` within `[type line]`."
113 [start type]
114 (fn [rf]
115 (fn
116 ([] (rf))
117 ([result] (rf result))
118 ([result line]
119 (rf result
120 (if (starts-with? line start)
121 [type (subs line (count start))]
122 line))))))
123 ```
125 and I've used it to match the lists and the quotes:
127 ```
128 (defn- match-lists [] (generic-matcher "* " :li))
129 (defn- match-blockquotes [] (generic-matcher "> " :blockquote))
130 ```
132 In case you didn't know, lists in gemtext starts with a star * followed by at least one space, followed by arbitrary text. The quote are similar, except that they begins with >.
134 Matching the links is a little bit difficult:
136 ```
137 (defn- match-links []
138 (fn [rf]
139 (fn
140 ([] (rf))
141 ([result] (rf result))
142 ([result line]
143 (let [spaces? #{\space \tab}
144 nonblank? (complement spaces?)]
145 (rf result
146 (if-not (starts-with? line "=>")
147 line
148 (->> (seq line)
149 (drop 2) ; drop the marker
150 (drop-while spaces?) ; drop also the optional spaces
151 (split-with nonblank?) ; separate URL from (optional) label
152 (apply #(vector :link
153 (apply str %1)
154 (apply str (drop-while spaces? %2))))))))))))
155 ```
157 First of all, unlike in HTML, links on gemini aren't inline entities. Second, their syntax is an "arrow" `=>` eventually followed by spaces, followed by the URL, and an optional description.
159 In the previous function, if we match a line that starts with "=>" we start split it into the URL and the (optional) description.
161 (seq line) will return a sequence of character, from which we remove the marker. Then we split this in the URL and description. To keep the implementation easy an URL is just a sequence of bytes other than a space or tab. We also drop any blanks from the description.
163 How cool are the threading macros? (also note that we used a set as a function for even more style points!)
165 Anyway, the only missing piece is matching the paragraphs. Given that we match on every syntactical entities present in the specification (as of now at least), every non-matched line is a paragraph.
167 ```
168 (defn match-paragraphs [] (generic-matcher "" :paragraph))
169 ```
171 Done!
173 Now we only need to chain these transducer together. Keeping in mind that the order is important, here's the chain:
175 ```
176 (def parser
177 (comp (match-code-blocks)
178 (match-headings)
179 (match-lists)
180 (match-blockquotes)
181 (match-links)
182 (match-paragraphs)))
183 ```
185 Lines will flow from the first transducer (match-code-blocks) towards the end, being enriched at every step. Beautiful!
187 Parsing is thus dead-simple now that we've got every piece:
189 ```
190 (defn parse
191 "Given a string representing a gemtext document, parse it into an
192 hiccup-like data structure."
193 [str]
194 (transduce parser conj [] (str/split-lines str)))
195 ```
197 We use our chain (the parser), fed with the lines from str, and conj everything into []. The empty vector is important, because if you use a list or nil, due to how conj works, you'll get the parsed document in reverse order.
199 Aaaand that's all! As a final bonus, if you reached this point, here's an implementation of unparse, a function that takes our hiccup-like format and returns a string, and to-hiccup to translate our hiccup to HTML-hiccup.
201 ```
202 (defn unparse [thing]
203 (let [sw (StringBuilder.)]
204 (walk/prewalk
205 (fn [t]
206 (cond
207 (nil? t) nil
209 (or (seq? t)
210 (vector? t))
211 (if-not (keyword? (first t))
213 (let [[type a b] t]
214 (.append sw
215 (case type
216 :verbatim (str "```\n" a "\n```")
217 :h1 (str "# " a)
218 :h2 (str "## " a)
219 :h3 (str "### " a)
220 :li (str "* " a)
221 :blockquote (str "> " a)
222 :link (str "=> " a " " b)
223 :paragraph a))
224 (.append sw "\n")
225 nil))))
226 thing)
227 (.toString sw)))
228 ```
230 I've used clojure.walk/prewalk to traverse the input, as it makes easy to navigate inside a nested data structure. The idea is that if we get a sequence whose first element is a keyword, that's a "tag", otherwise we recursively inspect its content. If it's a tag, we convert it to a string, pushing it into a StringBuilder.
232 The to-hiccup function is practically the same
234 ```
235 (defn to-hiccup [doc]
236 (let [l (atom [])]
237 (walk/prewalk
238 (fn [t]
239 (cond
240 (nil? t) nil
242 (or (seq? t)
243 (vector? t))
244 (if-not (keyword? (first t))
246 (let [[type a b] t]
247 (swap! l conj
248 (case type
249 :verbatim [:pre [:code a]]
250 :h1 [:h1 a]
251 :h2 [:h2 a]
252 :h3 [:h3 a]
253 :li [:ul [:li a]] ;; TODO!
254 :blockquote [:blockquote a]
255 :link [:p.link [:a {:href a} b]]
256 :paragraph [:p a]))
257 nil))))
258 doc)
259 (seq @l)))
260 ```
262 The only thing that's missing from to-hiccup is the (html/gemini) list handling. That is, while on gemini you only have "list item", in HTML the lists item must be wrapped inside a container tag, ul or ol. Since I'm not using lists that much, I don't care at the moment. One can probably improve it by doing some post-processing on the content of @l and grouping every :li.
264 But that's really all.
266 I'm using this code to parse the posts (so that they can be rendered also in HTML), and to build every gemini page.