I had the opportunity to implement a sandbox, and I'd like to write about the differences between the sandboxing techniques available on three different operating systems: FreeBSD, Linux and OpenBSD.

The scope of this entry is sandboxing gmid. gmid is a single-threaded server for the Gemini protocol; it serves static files and optionally executes CGI scripts.

=> /pages/gmid.gmi gmid

Before, the daemon was a single process listening on port 1965 and forking when needed to execute CGI scripts, all of this managed by a poll(2)-based event loop.

Now the daemon is split into two processes: the listener, as the name suggests, listens on port 1965 and is sandboxed, while the "executor" process stays out of the sandbox and executes the CGI scripts on demand on behalf of the listener. This separation makes it possible to execute arbitrary CGI scripts while still keeping the benefits of a sandboxed network process.
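
The post doesn't show how the listener and the executor talk to each other. The pledge promises listed further down (“sendfd” for the executor, “recvfd” for the listener) suggest that file descriptors travel over a UNIX socket, so here is a minimal sketch of that mechanism, assuming a socketpair(2) shared by the two processes. The names send_fd, recv_fd and fd_passing_demo are made up for illustration and are not taken from gmid.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#include <string.h>
#include <unistd.h>

/* Control-message buffer big enough for one descriptor, suitably aligned. */
union fd_cmsg {
	struct cmsghdr	hdr;
	char		buf[CMSG_SPACE(sizeof(int))];
};

/* Fill in msg so that it carries one dummy byte plus room for one fd. */
static void
prepare_msg(struct msghdr *msg, struct iovec *iov, char *byte,
    union fd_cmsg *c)
{
	memset(msg, 0, sizeof(*msg));
	memset(c, 0, sizeof(*c));
	iov->iov_base = byte;
	iov->iov_len = 1;
	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = c->buf;
	msg->msg_controllen = sizeof(c->buf);
}

/* Sending side (hypothetically the executor): pass fd to the peer. */
int
send_fd(int sock, int fd)
{
	struct msghdr	 msg;
	struct iovec	 iov;
	struct cmsghdr	*cmsg;
	union fd_cmsg	 c;
	char		 byte = 0;

	prepare_msg(&msg, &iov, &byte, &c);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receiving side (hypothetically the listener): returns the fd or -1. */
int
recv_fd(int sock)
{
	struct msghdr	 msg;
	struct iovec	 iov;
	struct cmsghdr	*cmsg;
	union fd_cmsg	 c;
	char		 byte;
	int		 fd;

	prepare_msg(&msg, &iov, &byte, &c);
	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}

/* Round trip in a single process: pass the read end of a pipe, use it. */
int
fd_passing_demo(void)
{
	int	sp[2], p[2], got;
	char	ch = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) == -1 || pipe(p) == -1)
		return -1;
	if (send_fd(sp[0], p[0]) == -1 || (got = recv_fd(sp[1])) == -1)
		return -1;
	if (write(p[1], "x", 1) != 1 || read(got, &ch, 1) != 1)
		return -1;
	close(got); close(p[0]); close(p[1]); close(sp[0]); close(sp[1]);
	return ch == 'x' ? 0 : -1;
}
```

In a real two-process setup the socketpair would be created before fork(2), with each process keeping one end; the demo function stays in one process only to keep the sketch short.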

I want to focus on the sandboxing techniques used to limit the listener process on the various operating systems.


## Capsicum

It's probably the easiest of the three to understand, but also the least flexible. Capsicum allows a process to enter a sandbox where only certain operations are allowed: for instance, after cap_enter, open(2) is disabled, and one can only open new files using openat(2). openat itself is restricted so that you cannot open files outside the given directory (i.e. you cannot openat(“..”) and escape), a bit like some sort of chroot(2).

The “disabled” syscalls won't kill the program, as happens with pledge or seccomp, but will return an error instead. This can be both an advantage and a disadvantage, as it may lead the program down a code path that wasn't thoroughly tested, and possibly expose bugs because of it.

Using capsicum isn't hard, but requires some preparation: the general rule is to pre-emptively open every resource you might need before entering the sandbox.

Sandboxing gmid with capsicum required almost no changes to the code: except for the execution of CGI scripts, the daemon was only using openat and accept to obtain new file descriptors, so adding capsicum support was only a matter of calling cap_enter before the main loop. Splitting the daemon into two processes was needed to allow the execution of CGI scripts, but it turned out to be useful for pledge and seccomp too.


## Pledge and unveil

Pledge and unveil are two syscalls provided by the OpenBSD kernel to limit what a process can do and see. They aren't really a sandboxing technique, but they are so closely related to the topic that they are usually considered one.

With pledge(2), a process tells the kernel that from that moment onwards it will only do certain categories of things. For instance, the cat program on OpenBSD, before the main loop, has a pledge of “stdio rpath” that means: «from now on I will only do I/O on already opened files (“stdio”) and open new files as read-only (“rpath”)». If a pledge gets violated, the kernel kills the program with SIGABRT and logs the pledge violation.

One key feature of pledge is that it is possible to drop pledges as you go. For example, you can start with pledges “A B C” and after a while make another pledge call for “A C”, effectively dropping the capability B. However, you cannot gain new capabilities.
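
To make the “drop, never gain” rule concrete, here is a small user-space model of it. It is only an illustration of the rule, not how the kernel implements it, and promises_only_drop is a made-up name: it returns 1 when every promise in the new set is already present in the current one.

```c
#include <string.h>

/* Return 1 if the word [w, w + len) appears in the space-separated list. */
static int
has_promise(const char *list, const char *w, size_t len)
{
	const char *p = list;

	while (*p != '\0') {
		size_t n;

		p += strspn(p, " ");	/* skip separators */
		n = strcspn(p, " ");	/* length of the next word */
		if (n == len && n != 0 && memcmp(p, w, n) == 0)
			return 1;
		p += n;
	}
	return 0;
}

/*
 * Model of the rule: moving from the promise set `cur` to `next` is
 * fine only if `next` adds nothing that `cur` doesn't already have.
 */
int
promises_only_drop(const char *cur, const char *next)
{
	const char *p = next;

	while (*p != '\0') {
		size_t n;

		p += strspn(p, " ");
		n = strcspn(p, " ");
		if (n != 0 && !has_promise(cur, p, n))
			return 0;
		p += n;
	}
	return 1;
}
```

With this model, going from “stdio rpath inet” to “stdio inet” is accepted, while going from “stdio” to “stdio rpath” is not. On OpenBSD the real call is pledge(promises, NULL), and an attempt to widen the set is reported by pledge(2) as an EPERM error.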

Unveil is a natural complement to pledge, as it is used to limit the portion of the filesystem a process can access.

One important aspect of both pledge and unveil is that they are reset upon exec: this is why I’m not going to strictly categorise them as sandboxing methods. Nevertheless, this aspect is, in my opinion, a big demonstration of pragmatism and the reason pledge and unveil are so widespread, even in software not developed with OpenBSD in mind.

On UNIX we have various programs that are, or act like, shells. We constantly fork(2) to exec(2) other programs that do stuff that we don’t want to do ourselves. Also, most programs have, or can be easily modified to have, an initialisation phase where they require access to various places on the filesystem and a lot of capabilities, and a “main-loop” phase where they only do a couple of things. This means that it’s actually impossible to sandbox certain programs with capsicum(4) or with seccomp(2), while they’re dead easy to pledge(2).

Take a shell for instance. You cannot capsicum(4) csh. You can’t seccomp(2) bash. But you can pledge(2) ksh:

```
; grep 'if (pledge' /usr/src/bin/ksh/ -RinH
/usr/src/bin/ksh/main.c:150: if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
/usr/src/bin/ksh/main.c:156: if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
/usr/src/bin/ksh/misc.c:303: if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
```

OpenBSD is the only OS where BOTH gmid processes, the listener and the executor, are sandboxed. The listener runs with the “stdio recvfd rpath inet” pledges and can only see the directories that it serves, and the executor runs with “stdio sendfd proc exec”.
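
A minimal sketch of what the listener's setup described above could look like, assuming a single served directory. The promise string is the one quoted in the post; the function name sandbox_listener and the single-directory assumption are mine, and on systems other than OpenBSD the body is compiled out so the sketch still builds.

```c
#include <unistd.h>

/*
 * Restrict the calling process to a read-only view of `dir` and to the
 * listener's promises.  Returns 0 on success, -1 on error.
 * On systems other than OpenBSD this does nothing.
 */
int
sandbox_listener(const char *dir)
{
#ifdef __OpenBSD__
	/* only `dir` stays visible, and only for reading */
	if (unveil(dir, "r") == -1)
		return -1;
	/* no further unveil calls are accepted after this */
	if (unveil(NULL, NULL) == -1)
		return -1;
	if (pledge("stdio recvfd rpath inet", NULL) == -1)
		return -1;
#else
	(void)dir;
#endif
	return 0;
}
```

The call would go right before the main loop, once the listening sockets are already open.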

To conclude, pledge is more like a complement to the compiler, a sort of runtime check that you really do what you promised to do, than a sandboxing technique.


## Seccomp

Seccomp is huge. It’s the most flexible and complex method of sandboxing I know of. It was also the least pleasant one to work with, but it was fun nevertheless.

Seccomp allows you to write a script in a particular language, BPF, that gets executed (in the kernel) before EVERY syscall. The script can decide to allow or disallow the system call, to kill the program or to return an error: it can control the behaviour of your program. Oh, and these scripts are inherited by the children of your program, so you can control them too.

BPF programs are designed to be “secure” to run kernel-side: they aren’t Turing-complete, as they have conditional jumps but you can only jump forward, and they have a maximum allowed size, so you know for certain that a BPF program (from now on called a filter) will complete and take at worst n time. BPF programs are also validated to ensure that every possible code path ends with a return.

These filters can access the system call number and the parameters. One important restriction is that the filter can read the parameters but cannot dereference pointers: that means that you cannot disallow open(2) if the first argument is “/tmp”, but you can allow ioctl(2) only on the file descriptors 1, 5 and 27.

So, what is it like to write a filter? Well, I hope you like C macros :)

```C
struct sock_filter filter[] = {
	/* load the *current* architecture */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
	    (offsetof(struct seccomp_data, arch))),
	/* ensure it's the same that we've been compiled on */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
	    SECCOMP_AUDIT_ARCH, 1, 0),
	/* if not, kill the program */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

	/* load the syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
	    (offsetof(struct seccomp_data, nr))),

	/* allow write */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

	/* … */
};

struct sock_fprog prog = {
	.len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
	.filter = filter,
};
```

and later load it with prctl:

```C
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
	fprintf(stderr, "%s: prctl(PR_SET_NO_NEW_PRIVS): %s\n",
	    __func__, strerror(errno));
	exit(1);
}

if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1) {
	fprintf(stderr, "%s: prctl(PR_SET_SECCOMP): %s\n",
	    __func__, strerror(errno));
	exit(1);
}
```

To make things a little more readable I have defined a SC_ALLOW macro as:

```C
/* make the filter more readable */
#define SC_ALLOW(nr)						\
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_##nr, 0, 1),	\
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
```

so you can write things like

```C
	/* … */
	SC_ALLOW(accept),
	SC_ALLOW(read),
	SC_ALLOW(openat),
	SC_ALLOW(fstat),
	SC_ALLOW(close),
	SC_ALLOW(lseek),
	SC_ALLOW(brk),
	SC_ALLOW(mmap),
	SC_ALLOW(munmap),
	/* … */
```

As you can see, BPF looks like assembly, and in fact people talk about BPF bytecode. I’m not going to teach BPF here, but it’s fairly easy to learn if you have some previous experience with assembly or with the bytecode of some virtual machine.
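
The restriction on arguments described earlier (values can be compared, pointers cannot be followed) can be sketched with a filter that allows ioctl(2) only on file descriptor 1. Since jump offsets are easy to get wrong, the sketch also carries a toy user-space evaluator that understands just the three opcodes used here, so the filter can be dry-run without installing it. None of this comes from gmid: ioctl_filter, ARG_LO, toy_run and toy_check are hypothetical names, the architecture check is left out for brevity, and only the low 32 bits of the argument are compared.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* offset of the low 32 bits of syscall argument n (arguments are 64-bit) */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
# define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#else
# define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n) + 4)
#endif

/* architecture check omitted for brevity; a real filter needs it first */
static struct sock_filter ioctl_filter[] = {
	/* 0: load the syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* 1: not ioctl? skip 3 instructions and land on 5 */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 3),
	/* 2: load the low half of the first argument, the fd */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
	/* 3: fd != 1? skip 1 instruction and land on 5 */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 1),
	/* 4: ioctl on fd 1 */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	/* 5: everything else */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};

/* Toy evaluator: supports only absolute word loads, JEQ and RET. */
uint32_t
toy_run(const struct sock_filter *f, size_t len,
    const struct seccomp_data *sd)
{
	uint32_t acc = 0;
	size_t pc = 0;

	while (pc < len) {
		const struct sock_filter *i = &f[pc++];

		switch (i->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (i->k > sizeof(*sd) - sizeof(acc))
				return SECCOMP_RET_KILL;
			memcpy(&acc, (const char *)sd + i->k, sizeof(acc));
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			/* offsets are relative to the next instruction */
			pc += (acc == i->k) ? i->jt : i->jf;
			break;
		case BPF_RET | BPF_K:
			return i->k;
		default:
			return SECCOMP_RET_KILL;	/* unsupported opcode */
		}
	}
	return SECCOMP_RET_KILL;	/* fell off the end */
}

/* Dry-run ioctl_filter for syscall nr with first argument arg0. */
uint32_t
toy_check(int nr, uint64_t arg0)
{
	struct seccomp_data sd;

	memset(&sd, 0, sizeof(sd));
	sd.nr = nr;
	sd.args[0] = arg0;
	return toy_run(ioctl_filter,
	    sizeof(ioctl_filter) / sizeof(ioctl_filter[0]), &sd);
}
```

For instance, toy_check(__NR_ioctl, 1) evaluates to SECCOMP_RET_ALLOW, while an ioctl on any other descriptor, or any other syscall, ends at the final kill instruction.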

Debugging seccomp is also quite difficult. When you violate a pledge, the OpenBSD kernel makes the program abort and logs something like

```
Jan 22 21:38:38 venera /bsd: foo[43103]: pledge "stdio", syscall 5
```

so you know a) what pledge you’re missing, “stdio” in this case, and b) what syscall you tried to issue, 5 in this example. You also get a core dump, so you can check the stack trace to understand what’s going on.

With BPF, your filter can do basically three things:
* kill the program with an un-catchable SIGSYS
* send a catchable SIGSYS
* don’t execute the syscall and return an error (you can choose which)

so if you want to debug things you have to implement your debugging strategy yourself. I’m doing something similar to what OpenSSH does: a compile-time switch makes the filter raise a catchable SIGSYS, and a handler is installed for it.

```C
/* uncomment to enable debugging. ONLY FOR DEVELOPMENT */
/* #define SC_DEBUG */

#ifdef SC_DEBUG
# define SC_FAIL SECCOMP_RET_TRAP
#else
# define SC_FAIL SECCOMP_RET_KILL
#endif

/*
 * SIGSYS handler for debug builds: report the offending syscall and exit.
 * fprintf isn't async-signal-safe; tolerable here because it's debug only.
 */
static void
sandbox_seccomp_violation(int signum, siginfo_t *info, void *ctx)
{
	fprintf(stderr, "%s: unexpected system call (arch:0x%x,syscall:%d @ %p)\n",
	    __func__, info->si_arch, info->si_syscall, info->si_call_addr);
	_exit(1);
}

/* install the handler above and make sure SIGSYS isn't blocked */
static void
sandbox_seccomp_catch_sigsys(void)
{
	struct sigaction act;
	sigset_t mask;

	memset(&act, 0, sizeof(act));
	sigemptyset(&mask);
	sigaddset(&mask, SIGSYS);

	act.sa_sigaction = &sandbox_seccomp_violation;
	act.sa_flags = SA_SIGINFO;
	if (sigaction(SIGSYS, &act, NULL) == -1) {
		fprintf(stderr, "%s: sigaction(SIGSYS): %s\n",
		    __func__, strerror(errno));
		exit(1);
	}
	if (sigprocmask(SIG_UNBLOCK, &mask, NULL) == -1) {
		fprintf(stderr, "%s: sigprocmask(SIGSYS): %s\n",
		    __func__, strerror(errno));
		exit(1);
	}
}

/* … */
#ifdef SC_DEBUG
	sandbox_seccomp_catch_sigsys();
#endif
```

This way you can at least know which forbidden syscall you tried to run.
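
The third option from the list above, failing the syscall with an errno of your choice, isn't shown in the snippets. A minimal sketch, assuming the syscall number has already been loaded as in the filter shown earlier; SC_ERRNO, SC_DENY and deny_fragment are hypothetical names modelled on SC_ALLOW, and the choice of syscalls is arbitrary.

```c
#include <errno.h>
#include <stdint.h>

#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* filter return value that fails the syscall with errno `e` */
#define SC_ERRNO(e)	(SECCOMP_RET_ERRNO | ((e) & SECCOMP_RET_DATA))

/* like SC_ALLOW, but the syscall fails with `e` instead of running */
#define SC_DENY(nr, e)						\
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_##nr, 0, 1),	\
	BPF_STMT(BPF_RET | BPF_K, SC_ERRNO(e))

/* fragment of a filter: two syscalls that fail softly */
static const struct sock_filter deny_fragment[] = {
	SC_DENY(execve, EACCES),
	SC_DENY(ptrace, EPERM),
};
```

Compared with killing the process, this behaves much like capsicum: the program keeps running and sees an ordinary error, with the same risk of ending up on a code path that was never tested.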


## Wrapping up

I’m not a security expert, so you should take my words with a huge grain of salt, but I think that if we want to build secure systems, we should try to make these important security mechanisms as easy as possible without defeating their purpose.

If a security mechanism is easy enough to understand, to apply and to debug, we can expect it to be picked up by a large number of people, and everyone benefits from it. This is what I like about the OpenBSD system: over the years they have tried to come up with simpler solutions to common problems, so now you have things like reallocarray, strlcat & strlcpy, strtonum, etc. Small things that make errors harder to write.

You may criticise pledge(2) and unveil(2), but one important (and objective) point to note is how easy they are to add to a pre-existing program. You have window managers, shells, servers and utilities that run under pledge, but I don’t know of a single window manager that runs under seccomp(2).

Talking about Linux in particular, the current seccomp implementation in gmid alone is half as many lines of code as the first version of the daemon itself.

Just as you cannot achieve security through obscurity, you cannot achieve it through complexity either: at the end of the day, there isn’t really a considerable difference between obscurity and complexity.

Anyway, thanks for reading! It was a really fun journey: I learned a lot and I had a blast. If you want to report something, please do so by sending me a mail at <op at omarpolo dot com>, or by sending a message to op2 on freenode. Bye and see you next time!