author Omar Polo <op@omarpolo.com> 2021-02-13 12:10:56 +0100
committer Omar Polo <op@omarpolo.com> 2021-02-13 12:10:56 +0100
commit c418ae420922e6b00caa7955a533e8d19fa28ba1
tree 57cd355ebc855747fe7a130c4183211c514af661
parent 69c096377636a12b0239a435c6559454fb0e2885
new posts
-rw-r--r-- resources/posts/daemon-reload.gmi | 123
-rw-r--r-- resources/posts/emacs-macros.gmi | 41
-rw-r--r-- resources/posts/fun-with-seccomp.gmi | 35
-rw-r--r-- resources/posts/gmid-1.5.gmi | 50
-rw-r--r-- resources/posts/gmid-sandbox.gmi | 214
-rw-r--r-- resources/posts/parsing-po.gmi | 248
6 files changed, 711 insertions, 0 deletions
diff --git a/resources/posts/daemon-reload.gmi b/resources/posts/daemon-reload.gmi
new file mode 100644
index 0000000..14dfad5
--- /dev/null
+++ b/resources/posts/daemon-reload.gmi
@@ -0,0 +1,123 @@
+EDIT 2021/02/05: typos
+
+Some daemons are able to restart themselves. I mean, a real in-place restart, not a naïve external stop+re-exec.
+
+Why would you care whether a daemon can restart in place? Well, it depends. For some daemons it’s almost a necessary feature (think of sshd: would you be happy if restarting the daemon shut down every ongoing connection? I wouldn’t), for others it’s a nice-to-have (httpd, for instance), while in some cases it’s an unnecessary complication.
+
+Generally speaking, and with varying degrees of importance, being able to restart in place is a good thing for network-related daemons. It means that you (the server administrator) can adjust things while the daemon is running, and this is almost invisible to the outside world: ongoing connections are preserved and new connections are subject to the new set of rules.
+
+I just implemented something similar for gmid, my Gemini server, but I guess the overall design can be used in various kinds of daemons.
+
+=> gemini://gemini.omarpolo.com/pages/gmid.gmi gmid
+
+The solution I chose was to keep a parent process that, on SIGHUP, re-reads the configuration and fork(2)s to execute the real daemon code. The other processes, on SIGHUP, simply stop accepting new connections and finish processing what they have.
+
+Doing it this way simplifies things when you consider that the daemon may want to chroot itself, apply some other kind of sandbox, drop privileges and so on, since the main process remains outside the chroot/sandbox with the original privileges. It also isn’t a security concern, since all it does is wait for a signal (in other words, it cannot be influenced by the outside world.)
+
+One thing to be wary of is race conditions induced by signal handlers. Consider this bit of code:
+
+```
+/* 1 when SIGHUP is received, 0 otherwise.
+ * This var is shared with the children. */
+volatile sig_atomic_t hupped;
+
+/* … */
+
+for (;;) {
+ hupped = 0;
+
+ switch (fork()) {
+ case 0:
+ return daemon_main();
+ }
+
+ wait_sighup();
+ /* after this point hupped is 1 */
+ reload_config();
+}
+```
+
+You see the problem?
+
+(spoiler: the reload_config call is there only to trick you)
+
+We set ‘hupped’ to 0 before we fork, so our child starts with hupped set to 0, then we fork and wait. But what if we receive a SIGHUP after we set the variable to 0, but before the fork? Or right before wait_sighup? The children would exit and the main process would get stuck waiting for a SIGHUP that was already delivered.
+
+Oh, and guarding the wait_sighup won’t work either:
+
+```
+if (!hupped) {
+ /* what happens if SIGHUP gets delivered
+ * here, before the wait? */
+ wait_sighup();
+}
+```
+
+Fortunately, we can block signals with sigprocmask and wait for specific signals with sigwait.
+
+=> gemini://gemini.omarpolo.com/cgi/man?sigprocmask sigprocmask(2)
+=> gemini://gemini.omarpolo.com/cgi/man?sigwait sigwait(2)
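
The reason this combination closes the race is that a signal arriving while it is blocked is not lost: it stays pending until it is unblocked or consumed by sigwait. Here is a minimal sketch to convince yourself of that. The function name is made up for illustration, and raise(3) stands in for an external `kill -HUP`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if a SIGHUP raised while blocked was
 * kept pending and then picked up by sigwait, 0 otherwise.
 */
int
blocked_sighup_stays_pending(void)
{
	sigset_t hup, old, pending;
	int signo = 0, ok;

	sigemptyset(&hup);
	sigaddset(&hup, SIGHUP);
	if (sigprocmask(SIG_BLOCK, &hup, &old) == -1)
		return 0;

	/* the signal is generated but not delivered, since it's blocked */
	raise(SIGHUP);

	sigemptyset(&pending);
	sigpending(&pending);
	ok = sigismember(&pending, SIGHUP) == 1;

	/* sigwait returns immediately: the signal is already pending */
	if (sigwait(&hup, &signo) != 0 || signo != SIGHUP)
		ok = 0;

	/* restore the original mask */
	sigprocmask(SIG_SETMASK, &old, NULL);
	return ok;
}
```

So a SIGHUP that lands between the fork and the wait simply sits in the pending set until the parent gets around to calling sigwait.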
+
+Frankly, I had never used these “advanced” signal APIs before, as the “simplified” interfaces were usually enough, but it’s nice to learn new stuff.
+
+The right order should be
+* block all signals
+* fork
+* in the child, re-enable signals
+* in the parent, wait for sighup
+* re-enable signals
+* repeat
+
+or, if you prefer some real code, something along the lines of
+
+```C
+sigset_t set;
+
+void
+block_signals(void)
+{
+ sigset_t new;
+
+ sigemptyset(&new);
+ sigaddset(&new, SIGHUP);
+ sigprocmask(SIG_BLOCK, &new, &set);
+}
+
+void
+unblock_signals(void)
+{
+ sigprocmask(SIG_SETMASK, &set, NULL);
+}
+
+void
+wait_sighup(void)
+{
+ sigset_t mask;
+ int signo;
+
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGHUP);
+ sigwait(&mask, &signo);
+}
+
+/* … */
+
+volatile sig_atomic_t hupped;
+
+/* … */
+
+for (;;) {
+ block_signals();
+ hupped = 0;
+
+ switch (fork()) {
+ case 0:
+ unblock_signals();
+ return daemon_main();
+ }
+
+ wait_sighup();
+ unblock_signals();
+ reload_config();
+}
+```
diff --git a/resources/posts/emacs-macros.gmi b/resources/posts/emacs-macros.gmi
new file mode 100644
index 0000000..ed0c74d
--- /dev/null
+++ b/resources/posts/emacs-macros.gmi
@@ -0,0 +1,41 @@
+I just recalled how cool macros are. I was helping to convert a manuscript for a book from LibreOffice to LaTeX, and to speed the conversion we used pandoc. The problem was that pandoc added a lot of noise to the generated code. Take for instance this bit:
+
+```
+\hypertarget{foo-bar-baz}{%
+\subsubsection[foo bar baz]{\texorpdfstring{\protect\hypertarget{anchor-92}{}{}Foo Bar Baz}{Foo Bar Baz}}\label{foo-bar-baz}}
+```
+
+that needs to be converted to something like
+
+```
+\section{Foo Bar Baz}
+```
+
+i.e. subsubsection → section, minus some cruft. If there were only a handful of those, I could have done it by hand, but given that there were around 700 instances, that was infeasible.
+
+My first idea was to fire up sam and play with it a bit. Unfortunately, it’s been a while since I’ve used it extensively, so I’m a bit rusty. The plan was to use the command x to select every paragraph, then g to filter that type of paragraph and then something with s, but I failed.
+
+Given that I didn’t want to spend too much time on this, I tried to use an Emacs macro. Oh boy, it went so well.
+
+The macro goes along the lines of
+```
+C-s ;; isearch-forward
+\ ;; self-insert-command
+section ;; self-insert-command * 7
+RET ;; newline
+C-a ;; move-beginning-of-line
+2*C-M-SPC ;; mark-sexp
+C-w ;; kill-region
+C-p ;; previous-line
+2*C-o ;; open-line
+C-y ;; yank
+C-n ;; next-line
+M-h ;; mark-paragraph
+C-w ;; kill-region
+```
+
+It goes to the next \section, selects the LaTeX command and the text within the square brackets (using mark-sexp twice), moves that bit above and then kills the paragraph.
+
+Then ‘C-- C-x e’ and the whole file was fixed.
+
+The bottom line is, I guess, use your editor and learn it, or something along those lines.
diff --git a/resources/posts/fun-with-seccomp.gmi b/resources/posts/fun-with-seccomp.gmi
new file mode 100644
index 0000000..edcf552
--- /dev/null
+++ b/resources/posts/fun-with-seccomp.gmi
@@ -0,0 +1,35 @@
+While debugging something unrelated, I noticed that on linux gmid’s server process would crash upon SIGHUP. I had never noticed it before because (unfortunately) ‘ulimit -c 0’ seems to be the default on various systems (i.e. no core files) and I started testing on-the-fly reconfiguration only recently.
+
+What was particularly strange was that I got no logging whatsoever. I have a compile-time switch for seccomp to raise a catchable SIGSYS, dump the number of the forbidden system call and exit, but in this case my server processes were killed by a SIGSYS without any debugging info.
+
+My first theory was that during the process shutdown (the server process gracefully shuts down after a SIGHUP) an unwanted syscall was made, maybe after stderr was flushed and closed, and thus my signal handler wasn’t able to print any info. But that didn’t seem to be the case.
+
+On OpenBSD I have used in the past ktrace(1) to trace the system calls done by a process, so I searched for something similar for linux. Turns out, strace is quite flexible.
+
+I attached strace to the server process:
+
+```
+-bash-5.1# strace -p 30232
+strace: Process 30232 attached
+epoll_pwait(6,
+```
+
+Good, the server process is waiting on epoll as it should, let’s send it a SIGHUP:
+
+```
+-bash-5.1# strace -p 30232
+strace: Process 30232 attached
+epoll_pwait(6, 0x55724496a0, 32, -1, NULL, 8) = -1 EINTR (Interrupted system call)
+--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=30251, si_uid=1000} ---
+write(8, "\1", 1) = 1
+rt_sigreturn({mask=[]}) = ?
++++ killed by SIGSYS +++
+```
+
+Oh, what do we have here. rt_sigreturn(2)! (the write is libevent handling the signal)
+
+After a signal handler is called, linux injects a call to rt_sigreturn to restore the program stack. If that syscall gets blocked by the BPF filter, the process can’t handle the SIGSYS caused by the filter itself and just crashes.
+
+But why does it crash for SIGHUP and not for the catchable SIGSYS I was using for debugging? Well, the SIGSYS handler calls _exit directly, so we never reach the sigreturn.
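
The post doesn't show the fix. A plausible one, assuming an allow-list filter like the one described in the sandboxing post, is to add rt_sigreturn to the allowed syscalls so that returning from a signal handler doesn't trip the filter. The sketch below is Linux-only; the SC_ALLOW macro follows the shape used in gmid, while the array name is hypothetical and the entries would normally sit inside the full filter, after the instruction that loads the syscall number.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* if the loaded syscall number equals __NR_<nr>, allow it;
 * otherwise skip to the next rule */
#define SC_ALLOW(nr) \
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_##nr, 0, 1), \
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* hypothetical fragment of an allow-list: the kernel injects
 * rt_sigreturn when a signal handler returns */
static const struct sock_filter signal_return_rules[] = {
	SC_ALLOW(rt_sigreturn),
};
```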
+
+This is just a daily reminder of how low-level seccomp is, and how sometimes it just leaves you clueless, but also a nice opportunity to learn how signals are implemented on linux.
diff --git a/resources/posts/gmid-1.5.gmi b/resources/posts/gmid-1.5.gmi
new file mode 100644
index 0000000..42adeda
--- /dev/null
+++ b/resources/posts/gmid-1.5.gmi
@@ -0,0 +1,50 @@
+These last twenty days were pretty productive on the gmid front: I ended up doing way more things than I had planned for this v1.5 release.
+
+The headlines are the automatic sandboxing on OpenBSD, FreeBSD and linux and the introduction of the configuration file, but you’ll find the whole change log at the end of this entry.
+
+On OpenBSD pledge and unveil were already in place, but their usage has been improved during this release cycle: the daemon was split into two processes that run with different pledges. This also enabled the usage of capsicum on FreeBSD and seccomp on linux. In the same spirit, support for chroot and privilege dropping has been added, so it’s safe to start the daemon with root privileges.
+
+=> /post/gmid-sandbox.gmi Read “Comparing sandboxing techniques” for more information.
+
+With this release gmid has two modes: a daemon mode and a config-less mode. The config-less mode is similar to how gmid operated until now (i.e. running from the command line) and has been improved with automatic certificate generation, while the daemon mode is more akin to “normal” network daemons and needs a configuration file.
+
+The configuration file syntax was inspired by OpenBSD’s httpd and is quite flexible. It supports a wide range of customizable parameters and location blocks to alter the behaviour per matching path.
+
+
+
+## v1.5 “Interstellar Overdrive” Changelog
+
+### New features
+
+* vhost support
+* configuration file
+* sandboxed by default on OpenBSD, FreeBSD and linux
+* customize the accepted TLS version
+* customizable default type
+* customizable mime mappings
+* provide a dockerfile
+* provide a lang parameter when serving text/gemini files
+* added a ‘configure’ script
+* customizable directory index
+* directory listings (disabled by default)
+* [config] location blocks support
+* chroot support
+* punycode support
+
+### Improvements
+
+* log ip, port, full request and response code (even for CGI scripts)
+* host name matching with globbing rules
+* automatically generate TLS certificates when running in config-less mode and no certificate was found
+
+### Bugfixes
+
+* [IRI] normalize scheme
+* [IRI] normalize hostnames
+* [IRI] accept a wider range of codepoints in hostnames
+* set SERVER_NAME when executing CGI scripts
+
+### Breaking changes
+
+* removed -C, -K flags
+* -d changed meaning: the directory to serve is now given as positional parameter and -d is used to specify the directory for the TLS certificates (either autogenerated or not.)
diff --git a/resources/posts/gmid-sandbox.gmi b/resources/posts/gmid-sandbox.gmi
new file mode 100644
index 0000000..80471da
--- /dev/null
+++ b/resources/posts/gmid-sandbox.gmi
@@ -0,0 +1,214 @@
+I had the opportunity to implement a sandbox and I'd like to write about the differences between the various sandboxing techniques available on three different operating systems: FreeBSD, Linux and OpenBSD.
+
+The scope of this entry is sandboxing gmid. gmid is a single-threaded server for the Gemini protocol; it serves static files and optionally executes CGI scripts.
+
+=> /pages/gmid.gmi gmid
+
+Before, the daemon was a single process listening on port 1965 and forking when needed to execute CGI scripts, all of this managed by a poll(2)-based event loop.
+
+Now, the daemon is split into two processes: the listener, as the name suggests, listens on port 1965 and is sandboxed, while the "executor" process stays out of the sandbox to execute the CGI scripts on demand on behalf of the listener process. This separation makes it possible to execute arbitrary CGI scripts while still keeping the benefits of a sandboxed network process.
+
+I want to focus on the sandboxing techniques used to limit the listener process on the various operating systems.
+
+
+## Capsicum
+
+It's probably the easiest of the three to understand, but also the least flexible. Capsicum allows a process to enter a sandbox where only certain operations are allowed: for instance, after cap_enter, open(2) is disabled, and one can only open new files using openat(2). Openat itself is restricted so that you cannot open files outside the given directory (i.e. you cannot openat(“..”) and escape), like some sort of chroot(2).
+
+The “disabled” syscalls won't kill the program, as happens with pledge or seccomp, but will return an error instead. This can be both an advantage and a disadvantage, as it may lead the program to execute a code path that wasn't thoroughly tested, and possibly expose bugs because of it.
+
+Using capsicum isn't hard, but requires some preparation: the general rule you have to follow is to pre-emptively open every resource you might need before entering the sandbox.
+
+Sandboxing gmid with capsicum required almost no changes to the code: except for the execution of CGI scripts, the daemon was only using openat and accept to obtain new file descriptors, so adding capsicum support was only a matter of calling cap_enter before the main loop. Splitting the daemon into two processes was needed to allow the execution of CGI scripts, but it turned out to be useful for pledge and seccomp too.
+
+
+## Pledge and unveil
+
+Pledge and unveil are two syscalls provided by the OpenBSD kernel to limit what a process can do and see. They aren't really a sandboxing technique, but they are so closely related to the topic that they are usually considered one.
+
+With pledge(2), a process tells the kernel that from that moment onwards it will only do certain categories of things. For instance, the cat program on OpenBSD, before the main loop, has a pledge of “stdio rpath” that means: «from now on I will only do I/O on already opened files (“stdio”) and open new files as read-only (“rpath”)». If a pledge gets violated, the kernel kills the program with SIGABRT and logs the pledge violation.
+
+One key feature of pledge is that it is possible to drop pledges as you go. For example, you can start with pledges “A B C” and after a while make another pledge call for “A C”, effectively dropping the capability B. However, you cannot gain new capabilities.
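
A minimal sketch of dropping promises as you go. The function name and the particular promise strings are chosen for illustration and are not taken from gmid. The pledge calls are compiled only on OpenBSD, so on other systems the function is a no-op that returns 0.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: pledge broadly for the setup phase, then
 * narrow the promises before the main loop.  Returns 0 on success
 * and -1 on error.
 */
int
drop_to_main_loop_promises(void)
{
#ifdef __OpenBSD__
	/* setup phase: stdio, read-only file access and networking */
	if (pledge("stdio rpath inet", NULL) == -1)
		return -1;

	/* ... read configuration, open what is needed ... */

	/* main loop: "rpath" is dropped and cannot be regained;
	 * a later attempt to pledge it again is expected to fail */
	if (pledge("stdio inet", NULL) == -1)
		return -1;
#endif
	return 0;
}
```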
+
+Unveil is a natural complement to pledge, as it is used to limit the portion of the filesystem a process can access.
+
+One important aspect of both pledge and unveil is that they are reset upon exec: this is why I’m not going to strictly categorise them as sandboxing methods. Nevertheless, this aspect is, in my opinion, one big demonstration of pragmatism and the reason pledge and unveil are so widespread, even in software not developed with OpenBSD in mind.
+
+On UNIX we have various programs that are, or act like, shells. We constantly fork(2) to exec(2) other programs that do stuff that we don’t want to do. Also, most programs follow, or can be easily modified to do, an initialisation phase where they require access to various places on the filesystem and a lot of capabilities, and a “main-loop” phase where they only do a couple of things. This means that it’s actually impossible to sandbox certain programs with capsicum(4) or with seccomp(2), while they’re dead-easy to pledge(2).
+
+Take a shell for instance. You cannot capsicum(4) csh. You can’t seccomp(2) bash. But you can pledge(2) ksh:
+
+```
+; grep 'if (pledge' /usr/src/bin/ksh/ -RinH
+/usr/src/bin/ksh/main.c:150: if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
+/usr/src/bin/ksh/main.c:156: if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
+/usr/src/bin/ksh/misc.c:303: if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
+```
+
+OpenBSD is the only OS where BOTH the gmid processes, the listener and executor, are sandboxed. The listener runs with the “stdio recvfd rpath inet” pledges and can only see the directories that it serves, and the executor runs with “stdio sendfd proc exec”.
+
+To conclude, pledge is more like a complement to the compiler, a sort of runtime check that you really do what you promised to, than a sandboxing technique.
+
+
+## Seccomp
+
+Seccomp is huge. It’s the most flexible and complex method of sandboxing I know of. It was also the least pleasant one to work with, but was fun nevertheless.
+
+Seccomp allows you to write a script in a particular language, BPF, that gets executed (in the kernel) before EVERY syscall. The script can decide to allow or disallow the system call, to kill the program or to return an error: it can control the behaviour of your program. Oh, and filters are inherited by the children of your program, so you can control them too.
+
+BPF programs are designed to be “secure” to run kernel-side: they aren’t Turing-complete, as they have conditional jumps but you can only jump forward, and there is a maximum allowed size, so you know for certain that a BPF program (from now on called a filter) will complete and take at worst n time. BPF programs are also validated to ensure that every possible code path ends with a return.
+
+These filters can access the system call number and the parameters. One important restriction is that the filter can read the parameters but not dereference pointers: that means that you cannot disallow open(2) if the first argument is “/tmp”, but you can allow ioctl(2) only on the file descriptors 1, 5 and 27.
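
As an illustration of the ioctl example, here is a sketch of a filter fragment that inspects the first argument. It is not part of gmid and the array name is made up. It assumes a little-endian machine (so the low 32 bits of the fd sit at the start of args[0]), omits the architecture check a real filter needs, and ends in a kill where a real filter would carry on with its other rules.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* hypothetical fragment: allow ioctl(2) only on fds 1, 5 and 27 */
static const struct sock_filter ioctl_fd_rules[] = {
	/* 0: A = syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
	    offsetof(struct seccomp_data, nr)),
	/* 1: not ioctl? jump 5 instructions ahead, to the kill */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 5),
	/* 2: A = low 32 bits of the first argument, the fd */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
	    offsetof(struct seccomp_data, args[0])),
	/* 3-5: on a match jump forward to the allow at index 6 */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 2, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 5, 1, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 27, 0, 1),
	/* 6: ioctl on one of the three fds */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	/* 7: everything else */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};
```

Note that every jump only goes forward, and that the fd is compared as a plain number: there is no way to ask what file that descriptor refers to.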
+
+So, what is it like to write a filter? Well, I hope you like C macros :)
+
+```C
+struct sock_filter filter[] = {
+ /* load the *current* architecture */
+ BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+ (offsetof(struct seccomp_data, arch))),
+ /* ensure it's the same that we've been compiled on */
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
+ SECCOMP_AUDIT_ARCH, 1, 0),
+ /* if not, kill the program */
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
+
+ /* load the syscall number */
+ BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+ (offsetof(struct seccomp_data, nr))),
+
+ /* allow write */
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
+
+ /* … */
+};
+
+struct sock_fprog prog = {
+ .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
+ .filter = filter,
+};
+```
+
+and later load it with prctl:
+
+```C
+if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
+ fprintf(stderr, "%s: prctl(PR_SET_NO_NEW_PRIVS): %s\n",
+ __func__, strerror(errno));
+ exit(1);
+}
+
+if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1) {
+ fprintf(stderr, "%s: prctl(PR_SET_SECCOMP): %s\n",
+ __func__, strerror(errno));
+ exit(1);
+}
+```
+
+To make things a little bit readable I have defined a SC_ALLOW macro as:
+
+```C
+/* make the filter more readable */
+#define SC_ALLOW(nr) \
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_##nr, 0, 1), \
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
+```
+
+so you can write things like
+
+```C
+ /* … */
+ SC_ALLOW(accept),
+ SC_ALLOW(read),
+ SC_ALLOW(openat),
+ SC_ALLOW(fstat),
+ SC_ALLOW(close),
+ SC_ALLOW(lseek),
+ SC_ALLOW(brk),
+ SC_ALLOW(mmap),
+ SC_ALLOW(munmap),
+ /* … */
+```
+
+As you can see, BPF looks like assembly, and in fact one talks about BPF bytecode. I’m not going to teach BPF here, but it’s fairly easy to learn if you have some previous experience with assembly or with the bytecode of some virtual machine.
+
+Debugging seccomp is also quite difficult. When you violate a pledge, the OpenBSD kernel makes the program abort and logs something like
+
+```
+Jan 22 21:38:38 venera /bsd: foo[43103]: pledge "stdio", syscall 5
+```
+
+so you know a) what pledge you’re missing, “stdio” in this case, and b) what syscall you tried to issue, 5 in this example. You also get a core dump, so you can check the stacktrace to understand what’s going on.
+
+With BPF, your filter can do basically three things:
+* kill the program with an un-catchable SIGSYS
+* send a catchable SIGSYS
+* don’t execute the syscall and return an error (you can choose which)
+
+so if you want to debug things you have to implement your debugging strategy by yourself. I’m doing something similar to what OpenSSH does: a compile-time switch makes the filter raise a catchable SIGSYS and installs a handler for it.
+
+```C
+/* uncomment to enable debugging. ONLY FOR DEVELOPMENT */
+/* #define SC_DEBUG */
+
+#ifdef SC_DEBUG
+# define SC_FAIL SECCOMP_RET_TRAP
+#else
+# define SC_FAIL SECCOMP_RET_KILL
+#endif
+
+static void
+sandbox_seccomp_violation(int signum, siginfo_t *info, void *ctx)
+{
+ fprintf(stderr, "%s: unexpected system call (arch:0x%x,syscall:%d @ %p)\n",
+ __func__, info->si_arch, info->si_syscall, info->si_call_addr);
+ _exit(1);
+}
+
+static void
+sandbox_seccomp_catch_sigsys(void)
+{
+ struct sigaction act;
+ sigset_t mask;
+
+ memset(&act, 0, sizeof(act));
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGSYS);
+
+ act.sa_sigaction = &sandbox_seccomp_violation;
+ act.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSYS, &act, NULL) == -1) {
+ fprintf(stderr, "%s: sigaction(SIGSYS): %s\n",
+ __func__, strerror(errno));
+ exit(1);
+ }
+ if (sigprocmask(SIG_UNBLOCK, &mask, NULL) == -1) {
+ fprintf(stderr, "%s: sigprocmask(SIGSYS): %s\n",
+ __func__, strerror(errno));
+ exit(1);
+ }
+}
+
+/* … */
+#ifdef SC_DEBUG
+ sandbox_seccomp_catch_sigsys();
+#endif
+```
+
+This way you can at least know what forbidden syscall you tried to run.
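
Of the three outcomes listed above, the error return is the one not shown so far. A sketch of a deny rule in the same style as SC_ALLOW follows: the SC_DENY name mirrors the macro OpenSSH uses for this purpose, while the array and the choice of execve and EACCES are just examples. The caller sees the syscall fail with the given errno instead of being killed.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* if the loaded syscall number equals __NR_<nr>, don't run the
 * syscall and make it fail with errno set to err */
#define SC_DENY(nr, err) \
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_##nr, 0, 1), \
	BPF_STMT(BPF_RET | BPF_K, \
	    SECCOMP_RET_ERRNO | ((err) & SECCOMP_RET_DATA))

/* hypothetical fragment: execve fails with EACCES */
static const struct sock_filter deny_rules[] = {
	SC_DENY(execve, EACCES),
};
```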
+
+
+## Wrapping up
+
+I’m not a security expert, so you should take my words with a huge grain of salt, but I think that if we want to build secure systems, we should try to make these important security mechanisms as easy as possible without defeating their purposes.
+
+If a security mechanism is easy enough to understand, to apply and to debug, we can expect it to be picked up by a large number of people, and everyone benefits from it. This is what I like about the OpenBSD system: over the years they tried to come up with simpler solutions to common problems, so now you have things like reallocarray, strlcat & strlcpy, strtonum, etc. Small things that make errors harder to write.
+
+You may criticise pledge(2) and unveil(2), but one important (and objective) point to note is how easy they are to add to a pre-existing program. You have window managers, shells, servers and utilities that run under pledge, but I don’t know of a single window manager that runs under seccomp(2).
+
+Talking about linux in particular, the current seccomp implementation in gmid alone is half as many lines of code as the first version of the daemon itself.
+
+Just as you cannot achieve security through obscurity, you cannot realise it with complexity either: at the end of the day, there isn’t really a considerable difference between obscurity and complexity.
+
+Anyway, thanks for reading! It was a really fun journey: I learned a lot and I had a blast. If you want to report something, please do so by sending me a mail at <op at omarpolo dot com>, or by sending a message to op2 on freenode. Bye and see you next time!
diff --git a/resources/posts/parsing-po.gmi b/resources/posts/parsing-po.gmi
new file mode 100644
index 0000000..81f6631
--- /dev/null
+++ b/resources/posts/parsing-po.gmi
@@ -0,0 +1,248 @@
+If it’s still not clear, I love writing parsers. A parser is a program that, given a stream of characters, builds a data structure: it’s able to give meaning to a stream of bytes! What can be more exciting than writing parsers?
+
+Some time ago, I tried to use transducers to parse text/gemini files but, given my ignorance of how transducers work, the resulting code is more verbose than it really needs to be.
+
+=> /post/parsing-gemtext-with-clojure.gmi Parsing gemtext with clojure
+
+Today, I gave myself a second chance at building parsers on top of transducers, and I think the result is way cleaner and maybe even shorter than my text/gemini parser, even if the subject has a more complex grammar.
+
+Today’s subject, as you may have guessed from the title of the entry, is PO files.
+
+=> https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html GNU gettext description of PO files.
+
+PO files are commonly used to hold translation data. The format, as described by the link above, is as follows:
+
+``` example of PO file
+white-space
+# translator-comments
+#. extracted-comments
+#: reference...
+#, flag...
+#| msgid previous-untranslated-string
+msgid untranslated-string
+msgstr translated-string
+```
+
+Inventing your own translation system almost never has a good outcome, especially when there are formats such as PO that are supported by a variety of tools, including nice GUIs such as poedit. The sad news is that in the Clojure ecosystem I couldn’t find what I personally consider a good option when it comes to managing translations.
+
+There’s Tempura, written by Peter Taoussanis (who, by the way, maintains A LOT of cool libraries), but I don’t particularly like how it works, and I would have to plug in a parser from/to PO by hand if I want the translators to use poedit (or similar software.)
+
+Another option is Pottery, which I overall like, but
+* multiline translation strings are broken: I have had a PR pending since September 2020 to fix it, but no reply as of the time of writing
+* they switched to the hippocratic license, which is NOT free software, so there are ethical implications (ironic, uh?)
+
+=> https://github.com/ptaoussanis/tempura Tempura
+=> https://github.com/brightin/pottery Pottery
+
+So here’s why I’m rolling my own. It’s not yet complete, and I’ve just finished the first version of the PO parser/unparser, but I thought I’d write a literate-programming-esque post describing how I’m parsing PO files using transducers.
+
+DISCLAIMER: the code was not heavily tested yet, so it may mis-behave. It’s just for demonstration purposes (for the moment.)
+
+```clojure
+(ns op.rtr.po
+ "Utilities to parse PO files."
+ (:require
+ [clojure.edn :as edn]
+ [clojure.string :as str])
+ (:import
+ (java.io StringWriter)))
+```
+
+Well, we’ve got a nice palindrome namespace, which is good, and we’re requiring a few things. clojure.string is quite obvious, since we’re gonna play with strings a lot. We’ll also (ab)use clojure.edn during the parsing. StringWriter is imported only to provide a convenience function for parsing PO from strings. It will also come in handy for testing purposes.
+
+The body of this library is the parser transducer, which is made up of a bunch of small functions that do simple things.
+
+```clojure
+(def ^:private split-on-blank
+ "Transducer that splits on blank lines."
+ (partition-by #(= % "")))
+```
+
+The split-on-blank transducer will group sequential blank lines and sequential non-blank lines together, this way we can separate each entry in the file.
+
+```clojure
+(def ^:private remove-empty-lines
+ "Transducer that remove groups of empty lines."
+ (filter #(not= "" (first %))))
+```
+
+The remove-empty-lines will simply remove the garbage that split-on-blank produces: it will get rid of the block of empty lines, so we only have sequences of entries.
+
+```clojure
+(declare parse-comments)
+(declare parse-keys)
+
+(def ^:private parse-entries
+ (let [comment-line? (fn [line] (str/starts-with? line "#"))]
+ (map (fn [lines]
+ (let [[comments keys] (partition-by comment-line? lines)]
+ {:comments (parse-comments comments)
+ :keys (parse-keys keys)})))))
+```
+
+Ignoring for a bit parse-comments and parse-keys, this step will take a block of lines that constitute an entry, and parse it into a map of comments and keys, by using partition-by to split the lines of the entries into two.
+
+Now that we have every piece, we can define a parser!
+
+```clojure
+(def ^:private parser
+ (comp split-on-blank
+ remove-empty-lines
+ parse-entries))
+```
+
+We can provide a nice API to parse PO files from various sources very easily:
+
+```clojure
+(defn parse
+ "Parse the PO file given as stream of lines `l`."
+ [l]
+ (transduce parser conj [] l))
+
+(defn parse-from-reader
+ "Parse the PO file given in reader `rdr`. `rdr` must implement `java.io.BufferedReader`."
+ [rdr]
+ (parse (line-seq rdr)))
+
+(defn parse-from-string
+ "Parse the PO file given as string."
+ [s]
+ (parse (str/split-lines s)))
+```
+
+And we’re done. This was all for this time. Bye!
+
+Well, no… I still haven’t provided the implementation for parse-comments and parse-keys. To be honest, they’re quite ugly. parse-keys in particular is the ugliest part of the library as of now, but y’know what? We’re in 2021 now, if it runs, ship it!
+
+Jokes aside, I should refactor these into something more manageable, but I will focus on the rest of the library first.
+
+parse-comments takes a block of comment lines and tries to make sense out of it.
+
+```clojure
+(defn- parse-comments [comments]
+ (into {}
+ (for [comment comments]
+ (let [len (count comment)
+ proper? (>= len 2)
+ start (when proper? (subs comment 0 2))
+ rest (when proper? (subs comment 2))
+ remove-empty #(filter (partial not= "") %)]
+ (case start
+ "#:" [:reference (remove-empty (str/split rest #" +"))]
+ "#," [:flags (remove-empty (str/split rest #" +"))]
+ "# " [:translator-comment rest]
+ ;; TODO: add other types
+ [:unknown-comment comment])))))
+```
+
+We simply loop through each line and do some simple pattern matching on the first two characters of each. We then group all those two-element vectors into a single hash map. I should probably refactor this to use group-by to avoid losing some information: say one provides two reference comments, we would lose one of the two.
+
+To define parse-keys we need a helper: join-sequential-strings
+
+```clojure
+(defn- join-sequential-strings [rf]
+ (let [acc (volatile! nil)]
+ (fn
+ ([] (rf))
+ ([res] (if-let [a @acc]
+ (do (vreset! acc nil)
+ (rf res (apply str a)))
+ (rf res)))
+ ([res i]
+ (if (string? i)
+ (do (vswap! acc (fnil conj []) i) ; start from a vector: conj on nil builds a list and reverses the order
+ res)
+ (rf (or (when-let [a @acc]
+ (vreset! acc nil)
+ (rf res (apply str a)))
+ res)
+ i))))))
+```
+
+The thing about this post, compared to the one about text/gemini, is that I’m becoming more comfortable with transducers, and I’m starting to use the standard library more and more. In fact, this is the only transducer written by hand we’ve seen so far.
+
+Like every respectable stateful transducer, it allocates its state using volatile!. rf is the reducing function, and our transducer function is the one with three arities inside the let.
+
+The one-arity branch is called to signal the end of the stream. The transducer has reached the end of the sequence and calls us with the accumulated result ‘res’. There we flush our accumulator, if we had something accumulated, or call the reducing function on the result and end.
+
+The two-arity branch is called on each item that was fed to the transducer. The first argument, res, is the accumulated result, and i is the current item: if it’s a string, we accumulate it into acc, otherwise we drain our accumulator and pass i to rf as-is.
+
+One important thing I learned writing it is that, even if it should be obvious, rf is a pure function. When we call rf no side effects occur. So, to provide two items we can’t simply call rf two times: we have to call rf on the output of rf, and make sure we return it!
+
+In this case, if we’ve accumulated some strings, we reset our accumulator and call rf on the concatenation of them. Then we call rf on this new result, or on the original res if we haven’t accumulated anything, passing i.
+
+It may become clearer if we replace rf with conj and res with [] (the empty vector).
+
+With this, we can finally define parse-keys and end our little parser:
+
+```clojure
+(def ^:private keywordize-things
+ (map #(if (string? %) % (keyword %))))
+
+(defn- parse-keys [keys]
+ (apply hash-map
+ (transduce (comp join-sequential-strings
+ keywordize-things)
+ conj
+ []
+ ;; XXX: double hack for double fun!
+ (edn/read-string (str "[" (apply str (interpose " " keys)) "]")))))
+```
+
+keywordize-things is another transducer, one that turns everything but strings into a keyword, and parse-keys composes these last two transducers to parse the entry; but it does so with a twist, by abusing edn/read-string.
+
+In a PO file, after the comments, each entry has a section like this:
+```
+msgid "message id"
+…
+```
+that is, a keyword followed by a string. But the string can span multiple lines:
+```
+msgid ""
+"hello\n"
+"world"
+```
+
+To parse these situations, and to handle things like \n or \" inside the strings, I’m abusing the edn/read-string function. I’m concatenating every line by joining them with a space in between, and then wrapping the string into “[” and “]”, before calling the edn parser. This way, the edn parser will turn ‘msgid’ (for instance) into a symbol, and read every string for us.
+
+Then we use the transducers defined before to join the strings and turn the symbols into keywords, and we have a proper parser. (Well, rewriting this hack will probably be the subject of a future post!)
+
+A quick test:
+
+```clojure
+(parse-from-string "
+
+#: lib/error.c:116
+msgid \"Unknown system error\"
+msgstr \"Errore sconosciuto del sistema\"
+
+#: lib/error.c:116 lib/anothererror.c:134
+msgid \"Known system error\"
+msgstr \"Errore conosciuto del sistema\"
+
+")
+;; =>
+;; [{:comments {:reference ("lib/error.c:116")}
+;; :keys {:msgid "Unknown system error"
+;; :msgstr "Errore sconosciuto del sistema"}}
+;; {:comments {:reference ("lib/error.c:116" "lib/anothererror.c:134")}
+;; :keys {:msgid "Known system error"
+;; :msgstr "Errore conosciuto del sistema"}}]
+```
+
+Yay! It works!
+
+Writing an unparse function is also pretty easy, and is left as an exercise to the reader, because where I live it’s pretty late now and I want to sleep :P
+
+To conclude, another nice property of parsers is that if you have an “unparse” operation (i.e. turning your data structure back into its textual representation), then the composition of these two should be the identity function. It’s a handy property for testing!
+
+```clojure
+(let [x [{:comments {:reference '("lib/error.c:116")}
+ :keys {:msgid "Unknown system error"
+ :msgstr "Errore sconosciuto del sistema"}}]]
+ (= x
+ (parse-from-string (unparse-to-string x))))
+;; => true
+```
+
+This was all for this time! (For real this time.) Thanks for reading.