1 791e96fe 2020-10-03 op I've written various parsers in the past. From simplistic stuff in Korn shell, to hand written recursive descendant parsers in Go and or C, to yacc-based parsers in C. I even played with Alex and Happy in Haskell, but that was ages ago and I don't recall anything.
3 791e96fe 2020-10-03 op One thing that I never tried was to write a parser in Clojure. Until now.
5 791e96fe 2020-10-03 op This post is an experiment: I'm trying to do some literate programming in gemtext (the Gemini protocol "native" response format, text/gemini). If you extract all the block codes you should end up with the same gemtext parser I'm using.
7 791e96fe 2020-10-03 op => https://git.omarpolo.com/blog/tree/src/blog/gemtext.clj?id=965388145931751bf314276404816f631c27d01d gemtext.clj (as at the time of writing)
8 791e96fe 2020-10-03 op => gemini://gemini.circumlunar.space/docs/specification.gmi Gemini specification (over Gemini)
9 791e96fe 2020-10-03 op => https://gemini.circumlunar.space/docs/specification.html Gemini specification (over HTTP)
11 791e96fe 2020-10-03 op One last note before the implementation: while reading the code, please keep in mind that I really wanted to play with the transducers. There are probably more efficient or neater way to parse stuff in clojure, but I didn't want to write an efficient or fast parser. I wanted to have some fun with the transducers!
13 791e96fe 2020-10-03 op Given that gemtext is a line-oriented protocol, I thought to split the input in into lines and use some transducers magic to end up with a neat hiccup-like data structure.
15 791e96fe 2020-10-03 op Now we can begin. The aren't third parties dependencies here:
18 791e96fe 2020-10-03 op (ns blog.gemtext
20 791e96fe 2020-10-03 op [clojure.string :as str]
21 791e96fe 2020-10-03 op [clojure.walk :as walk]))
24 791e96fe 2020-10-03 op We'll use only clojure.string and clojure.walk, as well as the standard library.
26 791e96fe 2020-10-03 op We also need one helper function, starts-with?: it's a wrapper around clojure.string/starts-with?. The need for such function will be clear later.
29 791e96fe 2020-10-03 op (defn- starts-with?
30 791e96fe 2020-10-03 op "check if `s` starts with `substr`. Return `false` if `s` is not a
33 791e96fe 2020-10-03 op (when (string? s)
34 791e96fe 2020-10-03 op (str/starts-with? s substr)))
37 791e96fe 2020-10-03 op Every syntactical element of gemtext will be parsed by its own little transducer. The transducer chain will be fed with a stream of lines, and will return a stream of hiccup-like data structure. Let's begin with the most elaborate one: the match-code-blocks transducer:
40 791e96fe 2020-10-03 op (defn- match-code-blocks []
42 791e96fe 2020-10-03 op (let [acc (volatile! [])
43 791e96fe 2020-10-03 op state (volatile! :normal)]
46 791e96fe 2020-10-03 op ([result] (rf result))
47 791e96fe 2020-10-03 op ([result line]
48 791e96fe 2020-10-03 op (let [in-verbatim? (= :verbatim @state)
49 791e96fe 2020-10-03 op marker? (starts-with? line "```")]
51 791e96fe 2020-10-03 op (and (not in-verbatim?) marker?) ;go to verbatim
52 791e96fe 2020-10-03 op (do (vreset! state :verbatim)
55 791e96fe 2020-10-03 op ;; return what we've got and go to :normal
56 791e96fe 2020-10-03 op (and in-verbatim? marker?)
57 791e96fe 2020-10-03 op (let [res [:verbatim (str/join "\n" @acc)]]
58 791e96fe 2020-10-03 op (vreset! state :normal)
59 791e96fe 2020-10-03 op (vreset! acc [])
60 791e96fe 2020-10-03 op (rf result res))
63 791e96fe 2020-10-03 op (do (vswap! acc conj line)
67 791e96fe 2020-10-03 op (rf result line))))))))
70 791e96fe 2020-10-03 op Woah, that was a lot. We defined a function, that returns a function that takes a reducing-function rf. This sets up some local state (acc and state variables), and returns another function!
72 791e96fe 2020-10-03 op That's a lot of functions.
74 791e96fe 2020-10-03 op In the innermost function, only the 2-arity branch is interesting, the other two are just scaffolding. There we check if we've got a line with the ``` marker, and if so we switch between the "capturing state", where we capture all the lines in a code block, and an "identity state", where we pass what we've read unconditionally.
76 791e96fe 2020-10-03 op When we switch from capturing to identity state, we also return a vector of `[:verbatim "matched lines"]`.
78 791e96fe 2020-10-03 op The important thing here is to return the line we got as-is if it's not a marker or within the two markers.
80 791e96fe 2020-10-03 op The rest is similar to this, but maybe simpler. In retrospect, I should have wrote some macro to reduce the boilerplate.
82 791e96fe 2020-10-03 op Now, let's parse the headings. gemtext has three types of headings, with a syntax (and meaning) similar to markdown.
85 791e96fe 2020-10-03 op (defn- match-headings []
89 791e96fe 2020-10-03 op ([result] (rf result))
90 791e96fe 2020-10-03 op ([result line]
93 791e96fe 2020-10-03 op ;; the space character after the # is madatory
94 791e96fe 2020-10-03 op (starts-with? line "# ") [:h1 (subs line 2)]
95 791e96fe 2020-10-03 op (starts-with? line "## ") [:h2 (subs line 3)]
96 791e96fe 2020-10-03 op (starts-with? line "### ") [:h3 (subs line 4)]
97 791e96fe 2020-10-03 op :else line))))))
100 791e96fe 2020-10-03 op There are definitely similarities with the previous example, but here you'll understand why I defined starts-with? instead of using directly clojure.string/starts-with?. The "line" we get here can be a string. Or can be a hiccup form. In fact, every transducer will se as input the output of the transducers that run before him. The helper function saves us from checking every time if the input is a string or not.
102 791e96fe 2020-10-03 op Now, some of the other syntactical elements are so similar to parse that I wrote a generic matcher:
105 791e96fe 2020-10-03 op (defn- generic-matcher
106 791e96fe 2020-10-03 op "Return a generic matcher transducer. Will wrap line that starts with
107 791e96fe 2020-10-03 op `start` within `[type line]`."
112 791e96fe 2020-10-03 op ([result] (rf result))
113 791e96fe 2020-10-03 op ([result line]
115 791e96fe 2020-10-03 op (if (starts-with? line start)
116 791e96fe 2020-10-03 op [type (subs line (count start))]
120 791e96fe 2020-10-03 op and I've used it to match the lists and the quotes:
123 791e96fe 2020-10-03 op (defn- match-lists [] (generic-matcher "* " :li))
124 791e96fe 2020-10-03 op (defn- match-blockquotes [] (generic-matcher "> " :blockquote))
127 791e96fe 2020-10-03 op In case you didn't know, lists in gemtext starts with a star * followed by at least one space, followed by arbitrary text. The quote are similar, except that they begins with >.
129 791e96fe 2020-10-03 op Matching the links is a little bit difficult:
132 791e96fe 2020-10-03 op (defn- match-links []
136 791e96fe 2020-10-03 op ([result] (rf result))
137 791e96fe 2020-10-03 op ([result line]
138 791e96fe 2020-10-03 op (let [spaces? #{\space \tab}
139 791e96fe 2020-10-03 op nonblank? (complement spaces?)]
141 791e96fe 2020-10-03 op (if-not (starts-with? line "=>")
143 791e96fe 2020-10-03 op (->> (seq line)
144 791e96fe 2020-10-03 op (drop 2) ; drop the marker
145 791e96fe 2020-10-03 op (drop-while spaces?) ; drop also the optional spaces
146 791e96fe 2020-10-03 op (split-with nonblank?) ; separate URL from (optional) label
147 791e96fe 2020-10-03 op (apply #(vector :link
148 791e96fe 2020-10-03 op (apply str %1)
149 791e96fe 2020-10-03 op (apply str (drop-while spaces? %2))))))))))))
152 791e96fe 2020-10-03 op First of all, unlike in HTML, links on gemini aren't inline entities. Second, their syntax is an "arrow" `=>` eventually followed by spaces, followed by the URL, and an optional description.
154 791e96fe 2020-10-03 op In the previous function, if we match a line that starts with "=>" we start split it into the URL and the (optional) description.
156 791e96fe 2020-10-03 op (seq line) will return a sequence of character, from which we remove the marker. Then we split this in the URL and description. To keep the implementation easy an URL is just a sequence of bytes other than a space or tab. We also drop any blanks from the description.
158 791e96fe 2020-10-03 op How cool are the threading macros? (also note that we used a set as a function for even more style points!)
160 791e96fe 2020-10-03 op Anyway, the only missing piece is matching the paragraphs. Given that we match on every syntactical entities present in the specification (as of now at least), every non-matched line is a paragraph.
163 791e96fe 2020-10-03 op (defn match-paragraphs [] (generic-matcher "" :paragraph))
168 791e96fe 2020-10-03 op Now we only need to chain these transducer together. Keeping in mind that the order is important, here's the chain:
172 791e96fe 2020-10-03 op (comp (match-code-blocks)
173 791e96fe 2020-10-03 op (match-headings)
174 791e96fe 2020-10-03 op (match-lists)
175 791e96fe 2020-10-03 op (match-blockquotes)
176 791e96fe 2020-10-03 op (match-links)
177 791e96fe 2020-10-03 op (match-paragraphs)))
180 791e96fe 2020-10-03 op Lines will flow from the first transducer (match-code-blocks) towards the end, being enriched at every step. Beautiful!
182 791e96fe 2020-10-03 op Parsing is thus dead-simple now that we've got every piece:
186 791e96fe 2020-10-03 op "Given a string representing a gemtext document, parse it into an
187 791e96fe 2020-10-03 op hiccup-like data structure."
189 791e96fe 2020-10-03 op (transduce parser conj [] (str/split-lines str)))
192 791e96fe 2020-10-03 op We use our chain (the parser), fed with the lines from str, and conj everything into []. The empty vector is important, because if you use a list or nil, due to how conj works, you'll get the parsed document in reverse order.
194 791e96fe 2020-10-03 op Aaaand that's all! As a final bonus, if you reached this point, here's an implementation of unparse, a function that takes our hiccup-like format and returns a string, and to-hiccup to translate our hiccup to HTML-hiccup.
197 791e96fe 2020-10-03 op (defn unparse [thing]
198 791e96fe 2020-10-03 op (let [sw (StringBuilder.)]
199 791e96fe 2020-10-03 op (walk/prewalk
206 791e96fe 2020-10-03 op (if-not (keyword? (first t))
208 791e96fe 2020-10-03 op (let [[type a b] t]
211 791e96fe 2020-10-03 op :verbatim (str "```\n" a "\n```")
212 791e96fe 2020-10-03 op :h1 (str "# " a)
213 791e96fe 2020-10-03 op :h2 (str "## " a)
214 791e96fe 2020-10-03 op :h3 (str "### " a)
215 791e96fe 2020-10-03 op :li (str "* " a)
216 791e96fe 2020-10-03 op :blockquote (str "> " a)
217 791e96fe 2020-10-03 op :link (str "=> " a " " b)
218 791e96fe 2020-10-03 op :paragraph a))
219 791e96fe 2020-10-03 op (.append sw "\n")
222 791e96fe 2020-10-03 op (.toString sw)))
225 791e96fe 2020-10-03 op I've used clojure.walk/prewalk to traverse the input, as it makes easy to navigate inside a nested data structure. The idea is that if we get a sequence whose first element is a keyword, that's a "tag", otherwise we recursively inspect its content. If it's a tag, we convert it to a string, pushing it into a StringBuilder.
227 791e96fe 2020-10-03 op The to-hiccup function is practically the same
230 791e96fe 2020-10-03 op (defn to-hiccup [doc]
231 791e96fe 2020-10-03 op (let [l (atom [])]
232 791e96fe 2020-10-03 op (walk/prewalk
239 791e96fe 2020-10-03 op (if-not (keyword? (first t))
241 791e96fe 2020-10-03 op (let [[type a b] t]
242 791e96fe 2020-10-03 op (swap! l conj
244 791e96fe 2020-10-03 op :verbatim [:pre [:code a]]
248 791e96fe 2020-10-03 op :li [:ul [:li a]] ;; TODO!
249 791e96fe 2020-10-03 op :blockquote [:blockquote a]
250 791e96fe 2020-10-03 op :link [:p.link [:a {:href a} b]]
251 791e96fe 2020-10-03 op :paragraph [:p a]))
257 791e96fe 2020-10-03 op The only thing that's missing from to-hiccup is the (html/gemini) list handling. That is, while on gemini you only have "list item", in HTML the lists item must be wrapped inside a container tag, ul or ol. Since I'm not using lists that much, I don't care at the moment. One can probably improve it by doing some post-processing on the content of @l and grouping every :li.
259 791e96fe 2020-10-03 op But that's really all.
261 791e96fe 2020-10-03 op I'm using this code to parse the posts (so that they can be rendered also in HTML), and to build every gemini page.