Blame


1 791e96fe 2020-10-03 op I've written various parsers in the past. From simplistic stuff in Korn shell, to hand written recursive descendant parsers in Go and or C, to yacc-based parsers in C. I even played with Alex and Happy in Haskell, but that was ages ago and I don't recall anything.
2 791e96fe 2020-10-03 op
3 791e96fe 2020-10-03 op One thing that I never tried was to write a parser in Clojure. Until now.
4 791e96fe 2020-10-03 op
5 791e96fe 2020-10-03 op This post is an experiment: I'm trying to do some literate programming in gemtext (the Gemini protocol "native" response format, text/gemini). If you extract all the block codes you should end up with the same gemtext parser I'm using.
6 791e96fe 2020-10-03 op
7 791e96fe 2020-10-03 op => https://git.omarpolo.com/blog/tree/src/blog/gemtext.clj?id=965388145931751bf314276404816f631c27d01d gemtext.clj (as at the time of writing)
8 791e96fe 2020-10-03 op => gemini://gemini.circumlunar.space/docs/specification.gmi Gemini specification (over Gemini)
9 791e96fe 2020-10-03 op => https://gemini.circumlunar.space/docs/specification.html Gemini specification (over HTTP)
10 791e96fe 2020-10-03 op
11 791e96fe 2020-10-03 op One last note before the implementation: while reading the code, please keep in mind that I really wanted to play with the transducers. There are probably more efficient or neater way to parse stuff in clojure, but I didn't want to write an efficient or fast parser. I wanted to have some fun with the transducers!
12 791e96fe 2020-10-03 op
13 791e96fe 2020-10-03 op Given that gemtext is a line-oriented protocol, I thought to split the input in into lines and use some transducers magic to end up with a neat hiccup-like data structure.
14 791e96fe 2020-10-03 op
15 791e96fe 2020-10-03 op Now we can begin. The aren't third parties dependencies here:
16 791e96fe 2020-10-03 op
17 791e96fe 2020-10-03 op ```
18 791e96fe 2020-10-03 op (ns blog.gemtext
19 791e96fe 2020-10-03 op (:require
20 791e96fe 2020-10-03 op [clojure.string :as str]
21 791e96fe 2020-10-03 op [clojure.walk :as walk]))
22 791e96fe 2020-10-03 op ```
23 791e96fe 2020-10-03 op
24 791e96fe 2020-10-03 op We'll use only clojure.string and clojure.walk, as well as the standard library.
25 791e96fe 2020-10-03 op
26 791e96fe 2020-10-03 op We also need one helper function, starts-with?: it's a wrapper around clojure.string/starts-with?. The need for such function will be clear later.
27 791e96fe 2020-10-03 op
28 791e96fe 2020-10-03 op ```
29 791e96fe 2020-10-03 op (defn- starts-with?
30 791e96fe 2020-10-03 op "check if `s` starts with `substr`. Return `false` if `s` is not a
31 791e96fe 2020-10-03 op string."
32 791e96fe 2020-10-03 op [s substr]
33 791e96fe 2020-10-03 op (when (string? s)
34 791e96fe 2020-10-03 op (str/starts-with? s substr)))
35 791e96fe 2020-10-03 op ```
36 791e96fe 2020-10-03 op
37 791e96fe 2020-10-03 op Every syntactical element of gemtext will be parsed by its own little transducer. The transducer chain will be fed with a stream of lines, and will return a stream of hiccup-like data structure. Let's begin with the most elaborate one: the match-code-blocks transducer:
38 791e96fe 2020-10-03 op
39 791e96fe 2020-10-03 op ```
40 791e96fe 2020-10-03 op (defn- match-code-blocks []
41 791e96fe 2020-10-03 op (fn [rf]
42 791e96fe 2020-10-03 op (let [acc (volatile! [])
43 791e96fe 2020-10-03 op state (volatile! :normal)]
44 791e96fe 2020-10-03 op (fn
45 791e96fe 2020-10-03 op ([] (rf))
46 791e96fe 2020-10-03 op ([result] (rf result))
47 791e96fe 2020-10-03 op ([result line]
48 791e96fe 2020-10-03 op (let [in-verbatim? (= :verbatim @state)
49 791e96fe 2020-10-03 op marker? (starts-with? line "```")]
50 791e96fe 2020-10-03 op (cond
51 791e96fe 2020-10-03 op (and (not in-verbatim?) marker?) ;go to verbatim
52 791e96fe 2020-10-03 op (do (vreset! state :verbatim)
53 791e96fe 2020-10-03 op result)
54 791e96fe 2020-10-03 op
55 791e96fe 2020-10-03 op ;; return what we've got and go to :normal
56 791e96fe 2020-10-03 op (and in-verbatim? marker?)
57 791e96fe 2020-10-03 op (let [res [:verbatim (str/join "\n" @acc)]]
58 791e96fe 2020-10-03 op (vreset! state :normal)
59 791e96fe 2020-10-03 op (vreset! acc [])
60 791e96fe 2020-10-03 op (rf result res))
61 791e96fe 2020-10-03 op
62 791e96fe 2020-10-03 op in-verbatim?
63 791e96fe 2020-10-03 op (do (vswap! acc conj line)
64 791e96fe 2020-10-03 op result)
65 791e96fe 2020-10-03 op
66 791e96fe 2020-10-03 op :else
67 791e96fe 2020-10-03 op (rf result line))))))))
68 791e96fe 2020-10-03 op ```
69 791e96fe 2020-10-03 op
70 791e96fe 2020-10-03 op Woah, that was a lot. We defined a function, that returns a function that takes a reducing-function rf. This sets up some local state (acc and state variables), and returns another function!
71 791e96fe 2020-10-03 op
72 791e96fe 2020-10-03 op That's a lot of functions.
73 791e96fe 2020-10-03 op
74 791e96fe 2020-10-03 op In the innermost function, only the 2-arity branch is interesting, the other two are just scaffolding. There we check if we've got a line with the ``` marker, and if so we switch between the "capturing state", where we capture all the lines in a code block, and an "identity state", where we pass what we've read unconditionally.
75 791e96fe 2020-10-03 op
76 791e96fe 2020-10-03 op When we switch from capturing to identity state, we also return a vector of `[:verbatim "matched lines"]`.
77 791e96fe 2020-10-03 op
78 791e96fe 2020-10-03 op The important thing here is to return the line we got as-is if it's not a marker or within the two markers.
79 791e96fe 2020-10-03 op
80 791e96fe 2020-10-03 op The rest is similar to this, but maybe simpler. In retrospect, I should have wrote some macro to reduce the boilerplate.
81 791e96fe 2020-10-03 op
82 791e96fe 2020-10-03 op Now, let's parse the headings. gemtext has three types of headings, with a syntax (and meaning) similar to markdown.
83 791e96fe 2020-10-03 op
84 791e96fe 2020-10-03 op ```
85 791e96fe 2020-10-03 op (defn- match-headings []
86 791e96fe 2020-10-03 op (fn [rf]
87 791e96fe 2020-10-03 op (fn
88 791e96fe 2020-10-03 op ([] (rf))
89 791e96fe 2020-10-03 op ([result] (rf result))
90 791e96fe 2020-10-03 op ([result line]
91 791e96fe 2020-10-03 op (rf result
92 791e96fe 2020-10-03 op (cond
93 791e96fe 2020-10-03 op ;; the space character after the # is madatory
94 791e96fe 2020-10-03 op (starts-with? line "# ") [:h1 (subs line 2)]
95 791e96fe 2020-10-03 op (starts-with? line "## ") [:h2 (subs line 3)]
96 791e96fe 2020-10-03 op (starts-with? line "### ") [:h3 (subs line 4)]
97 791e96fe 2020-10-03 op :else line))))))
98 791e96fe 2020-10-03 op ```
99 791e96fe 2020-10-03 op
100 791e96fe 2020-10-03 op There are definitely similarities with the previous example, but here you'll understand why I defined starts-with? instead of using directly clojure.string/starts-with?. The "line" we get here can be a string. Or can be a hiccup form. In fact, every transducer will se as input the output of the transducers that run before him. The helper function saves us from checking every time if the input is a string or not.
101 791e96fe 2020-10-03 op
102 791e96fe 2020-10-03 op Now, some of the other syntactical elements are so similar to parse that I wrote a generic matcher:
103 791e96fe 2020-10-03 op
104 791e96fe 2020-10-03 op ```
105 791e96fe 2020-10-03 op (defn- generic-matcher
106 791e96fe 2020-10-03 op "Return a generic matcher transducer. Will wrap line that starts with
107 791e96fe 2020-10-03 op `start` within `[type line]`."
108 791e96fe 2020-10-03 op [start type]
109 791e96fe 2020-10-03 op (fn [rf]
110 791e96fe 2020-10-03 op (fn
111 791e96fe 2020-10-03 op ([] (rf))
112 791e96fe 2020-10-03 op ([result] (rf result))
113 791e96fe 2020-10-03 op ([result line]
114 791e96fe 2020-10-03 op (rf result
115 791e96fe 2020-10-03 op (if (starts-with? line start)
116 791e96fe 2020-10-03 op [type (subs line (count start))]
117 791e96fe 2020-10-03 op line))))))
118 791e96fe 2020-10-03 op ```
119 791e96fe 2020-10-03 op
120 791e96fe 2020-10-03 op and I've used it to match the lists and the quotes:
121 791e96fe 2020-10-03 op
122 791e96fe 2020-10-03 op ```
123 791e96fe 2020-10-03 op (defn- match-lists [] (generic-matcher "* " :li))
124 791e96fe 2020-10-03 op (defn- match-blockquotes [] (generic-matcher "> " :blockquote))
125 791e96fe 2020-10-03 op ```
126 791e96fe 2020-10-03 op
127 791e96fe 2020-10-03 op In case you didn't know, lists in gemtext starts with a star * followed by at least one space, followed by arbitrary text. The quote are similar, except that they begins with >.
128 791e96fe 2020-10-03 op
129 791e96fe 2020-10-03 op Matching the links is a little bit difficult:
130 791e96fe 2020-10-03 op
131 791e96fe 2020-10-03 op ```
132 791e96fe 2020-10-03 op (defn- match-links []
133 791e96fe 2020-10-03 op (fn [rf]
134 791e96fe 2020-10-03 op (fn
135 791e96fe 2020-10-03 op ([] (rf))
136 791e96fe 2020-10-03 op ([result] (rf result))
137 791e96fe 2020-10-03 op ([result line]
138 791e96fe 2020-10-03 op (let [spaces? #{\space \tab}
139 791e96fe 2020-10-03 op nonblank? (complement spaces?)]
140 791e96fe 2020-10-03 op (rf result
141 791e96fe 2020-10-03 op (if-not (starts-with? line "=>")
142 791e96fe 2020-10-03 op line
143 791e96fe 2020-10-03 op (->> (seq line)
144 791e96fe 2020-10-03 op (drop 2) ; drop the marker
145 791e96fe 2020-10-03 op (drop-while spaces?) ; drop also the optional spaces
146 791e96fe 2020-10-03 op (split-with nonblank?) ; separate URL from (optional) label
147 791e96fe 2020-10-03 op (apply #(vector :link
148 791e96fe 2020-10-03 op (apply str %1)
149 791e96fe 2020-10-03 op (apply str (drop-while spaces? %2))))))))))))
150 791e96fe 2020-10-03 op ```
151 791e96fe 2020-10-03 op
152 791e96fe 2020-10-03 op First of all, unlike in HTML, links on gemini aren't inline entities. Second, their syntax is an "arrow" `=>` eventually followed by spaces, followed by the URL, and an optional description.
153 791e96fe 2020-10-03 op
154 791e96fe 2020-10-03 op In the previous function, if we match a line that starts with "=>" we start split it into the URL and the (optional) description.
155 791e96fe 2020-10-03 op
156 791e96fe 2020-10-03 op (seq line) will return a sequence of character, from which we remove the marker. Then we split this in the URL and description. To keep the implementation easy an URL is just a sequence of bytes other than a space or tab. We also drop any blanks from the description.
157 791e96fe 2020-10-03 op
158 791e96fe 2020-10-03 op How cool are the threading macros? (also note that we used a set as a function for even more style points!)
159 791e96fe 2020-10-03 op
160 791e96fe 2020-10-03 op Anyway, the only missing piece is matching the paragraphs. Given that we match on every syntactical entities present in the specification (as of now at least), every non-matched line is a paragraph.
161 791e96fe 2020-10-03 op
162 791e96fe 2020-10-03 op ```
163 791e96fe 2020-10-03 op (defn match-paragraphs [] (generic-matcher "" :paragraph))
164 791e96fe 2020-10-03 op ```
165 791e96fe 2020-10-03 op
166 791e96fe 2020-10-03 op Done!
167 791e96fe 2020-10-03 op
168 791e96fe 2020-10-03 op Now we only need to chain these transducer together. Keeping in mind that the order is important, here's the chain:
169 791e96fe 2020-10-03 op
170 791e96fe 2020-10-03 op ```
171 791e96fe 2020-10-03 op (def parser
172 791e96fe 2020-10-03 op (comp (match-code-blocks)
173 791e96fe 2020-10-03 op (match-headings)
174 791e96fe 2020-10-03 op (match-lists)
175 791e96fe 2020-10-03 op (match-blockquotes)
176 791e96fe 2020-10-03 op (match-links)
177 791e96fe 2020-10-03 op (match-paragraphs)))
178 791e96fe 2020-10-03 op ```
179 791e96fe 2020-10-03 op
180 791e96fe 2020-10-03 op Lines will flow from the first transducer (match-code-blocks) towards the end, being enriched at every step. Beautiful!
181 791e96fe 2020-10-03 op
182 791e96fe 2020-10-03 op Parsing is thus dead-simple now that we've got every piece:
183 791e96fe 2020-10-03 op
184 791e96fe 2020-10-03 op ```
185 791e96fe 2020-10-03 op (defn parse
186 791e96fe 2020-10-03 op "Given a string representing a gemtext document, parse it into an
187 791e96fe 2020-10-03 op hiccup-like data structure."
188 791e96fe 2020-10-03 op [str]
189 791e96fe 2020-10-03 op (transduce parser conj [] (str/split-lines str)))
190 791e96fe 2020-10-03 op ```
191 791e96fe 2020-10-03 op
192 791e96fe 2020-10-03 op We use our chain (the parser), fed with the lines from str, and conj everything into []. The empty vector is important, because if you use a list or nil, due to how conj works, you'll get the parsed document in reverse order.
193 791e96fe 2020-10-03 op
194 791e96fe 2020-10-03 op Aaaand that's all! As a final bonus, if you reached this point, here's an implementation of unparse, a function that takes our hiccup-like format and returns a string, and to-hiccup to translate our hiccup to HTML-hiccup.
195 791e96fe 2020-10-03 op
196 791e96fe 2020-10-03 op ```
197 791e96fe 2020-10-03 op (defn unparse [thing]
198 791e96fe 2020-10-03 op (let [sw (StringBuilder.)]
199 791e96fe 2020-10-03 op (walk/prewalk
200 791e96fe 2020-10-03 op (fn [t]
201 791e96fe 2020-10-03 op (cond
202 791e96fe 2020-10-03 op (nil? t) nil
203 791e96fe 2020-10-03 op
204 791e96fe 2020-10-03 op (or (seq? t)
205 791e96fe 2020-10-03 op (vector? t))
206 791e96fe 2020-10-03 op (if-not (keyword? (first t))
207 791e96fe 2020-10-03 op t
208 791e96fe 2020-10-03 op (let [[type a b] t]
209 791e96fe 2020-10-03 op (.append sw
210 791e96fe 2020-10-03 op (case type
211 791e96fe 2020-10-03 op :verbatim (str "```\n" a "\n```")
212 791e96fe 2020-10-03 op :h1 (str "# " a)
213 791e96fe 2020-10-03 op :h2 (str "## " a)
214 791e96fe 2020-10-03 op :h3 (str "### " a)
215 791e96fe 2020-10-03 op :li (str "* " a)
216 791e96fe 2020-10-03 op :blockquote (str "> " a)
217 791e96fe 2020-10-03 op :link (str "=> " a " " b)
218 791e96fe 2020-10-03 op :paragraph a))
219 791e96fe 2020-10-03 op (.append sw "\n")
220 791e96fe 2020-10-03 op nil))))
221 791e96fe 2020-10-03 op thing)
222 791e96fe 2020-10-03 op (.toString sw)))
223 791e96fe 2020-10-03 op ```
224 791e96fe 2020-10-03 op
225 791e96fe 2020-10-03 op I've used clojure.walk/prewalk to traverse the input, as it makes easy to navigate inside a nested data structure. The idea is that if we get a sequence whose first element is a keyword, that's a "tag", otherwise we recursively inspect its content. If it's a tag, we convert it to a string, pushing it into a StringBuilder.
226 791e96fe 2020-10-03 op
227 791e96fe 2020-10-03 op The to-hiccup function is practically the same
228 791e96fe 2020-10-03 op
229 791e96fe 2020-10-03 op ```
230 791e96fe 2020-10-03 op (defn to-hiccup [doc]
231 791e96fe 2020-10-03 op (let [l (atom [])]
232 791e96fe 2020-10-03 op (walk/prewalk
233 791e96fe 2020-10-03 op (fn [t]
234 791e96fe 2020-10-03 op (cond
235 791e96fe 2020-10-03 op (nil? t) nil
236 791e96fe 2020-10-03 op
237 791e96fe 2020-10-03 op (or (seq? t)
238 791e96fe 2020-10-03 op (vector? t))
239 791e96fe 2020-10-03 op (if-not (keyword? (first t))
240 791e96fe 2020-10-03 op t
241 791e96fe 2020-10-03 op (let [[type a b] t]
242 791e96fe 2020-10-03 op (swap! l conj
243 791e96fe 2020-10-03 op (case type
244 791e96fe 2020-10-03 op :verbatim [:pre [:code a]]
245 791e96fe 2020-10-03 op :h1 [:h1 a]
246 791e96fe 2020-10-03 op :h2 [:h2 a]
247 791e96fe 2020-10-03 op :h3 [:h3 a]
248 791e96fe 2020-10-03 op :li [:ul [:li a]] ;; TODO!
249 791e96fe 2020-10-03 op :blockquote [:blockquote a]
250 791e96fe 2020-10-03 op :link [:p.link [:a {:href a} b]]
251 791e96fe 2020-10-03 op :paragraph [:p a]))
252 791e96fe 2020-10-03 op nil))))
253 791e96fe 2020-10-03 op doc)
254 791e96fe 2020-10-03 op (seq @l)))
255 791e96fe 2020-10-03 op ```
256 791e96fe 2020-10-03 op
257 791e96fe 2020-10-03 op The only thing that's missing from to-hiccup is the (html/gemini) list handling. That is, while on gemini you only have "list item", in HTML the lists item must be wrapped inside a container tag, ul or ol. Since I'm not using lists that much, I don't care at the moment. One can probably improve it by doing some post-processing on the content of @l and grouping every :li.
258 791e96fe 2020-10-03 op
259 791e96fe 2020-10-03 op But that's really all.
260 791e96fe 2020-10-03 op
261 791e96fe 2020-10-03 op I'm using this code to parse the posts (so that they can be rendered also in HTML), and to build every gemini page.