Date:: Sat Feb 13 11:10:56 2021 UTC
I had the opportunity to implement a sandbox and I'd like to write about the differences between the various sandboxing techniques available on three different operating systems: FreeBSD, Linux and OpenBSD.

The scope of this entry is sandboxing gmid.  gmid is a single-threaded server for the Gemini protocol; it serves static files and optionally executes CGI scripts.

=> /pages/gmid.gmi  gmid

Before, the daemon was a single process listening on port 1965 and forking when needed to execute CGI scripts, all of this managed by a poll(2)-based event loop.

Now, the daemon is split into two processes: the listener, as the name suggests, listens on port 1965 and is sandboxed, while the "executor" process stays out of the sandbox to execute the CGI scripts on demand on behalf of the listener process.  This separation makes it possible to execute arbitrary CGI scripts while still keeping the benefits of a sandboxed network process.
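
The post doesn't show how the listener and the executor talk to each other.  As a minimal sketch, assuming the two share a UNIX socket pair created before forking, a file descriptor can be handed from one process to the other with sendmsg(2) and SCM_RIGHTS.  The names send_fd and recv_fd are made up for illustration and are not taken from gmid.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* properly aligned buffer for one control message carrying one int */
union fd_cmsg {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
};

/* hypothetical helper: send descriptor fd over the UNIX socket sock */
int
send_fd(int sock, int fd)
{
        struct msghdr msg;
        struct cmsghdr *cmsg;
        struct iovec iov;
        union fd_cmsg cmsgbuf;
        char c = 0;

        memset(&msg, 0, sizeof(msg));
        memset(&cmsgbuf, 0, sizeof(cmsgbuf));
        iov.iov_base = &c;      /* at least one byte of real data */
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgbuf.buf;
        msg.msg_controllen = sizeof(cmsgbuf.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

        return sendmsg(sock, &msg, 0) == -1 ? -1 : 0;
}

/* hypothetical helper: receive a descriptor from sock, or -1 on error */
int
recv_fd(int sock)
{
        struct msghdr msg;
        struct cmsghdr *cmsg;
        struct iovec iov;
        union fd_cmsg cmsgbuf;
        char c;
        int fd;

        memset(&msg, 0, sizeof(msg));
        iov.iov_base = &c;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsgbuf.buf;
        msg.msg_controllen = sizeof(cmsgbuf.buf);

        if (recvmsg(sock, &msg, 0) <= 0)
                return -1;
        cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
            cmsg->cmsg_type != SCM_RIGHTS ||
            cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
                return -1;
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
        return fd;
}
```

With something like this, the unsandboxed process can do the fork and exec and pass back only the descriptor connected to the script, so the sandboxed process never needs the ability to execute anything itself.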

I want to focus on the sandboxing techniques used to limit the listener process on the various operating systems.


## Capsicum

It's probably the easiest of the three to understand, but also the least flexible.  Capsicum allows a process to enter a sandbox where only certain operations are allowed: for instance, after cap_enter, open(2) is disabled, and one can only open new files using openat(2).  openat itself is restricted so that you cannot open files outside the given directory (i.e. you cannot openat(“..”) and escape), like some sort of chroot(2).

The “disabled” syscalls won't kill the program, as happens with pledge or seccomp, but will return an error instead.  This can be both an advantage and a disadvantage, as it may lead the program down a code path that wasn't thoroughly tested, and possibly expose bugs because of it.

Using capsicum isn't hard, but requires some preparation: the general rule is to pre-emptively open every resource you might need before entering the sandbox.

Sandboxing gmid with capsicum required almost no changes to the code: except for the execution of CGI scripts, the daemon was only using openat and accept to obtain new file descriptors, so adding capsicum support was only a matter of calling cap_enter before the main loop.  Splitting the daemon into two processes was needed to allow the execution of CGI scripts, but it turned out to be useful for pledge and seccomp too.


## Pledge and unveil

Pledge and unveil are two syscalls provided by the OpenBSD kernel to limit what a process can do and see.  They aren't really a sandboxing technique, but are so closely related to the topic that they are usually considered one.

With pledge(2), a process tells the kernel that from that moment onwards it will only do certain categories of things.  For instance, the cat program on OpenBSD, before the main loop, has a pledge of “stdio rpath” that means: «from now on I will only do I/O on already opened files (“stdio”) and open new files as read-only (“rpath”)».  If a pledge gets violated, the kernel kills the program with SIGABRT and logs the pledge violation.

One key feature of pledge is that it's possible to drop pledges as you go.  For example, you can start with pledges “A B C” and after a while make another pledge call for “A C”, effectively dropping the capability B.  However, you cannot gain new capabilities.
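
As a rough sketch of dropping a pledge along the way, assuming a program that only needs to read files during its initialisation.  The function name is invented for illustration, and the pledge calls are compiled only on OpenBSD, so on other systems the function does nothing.

```c
#include <unistd.h>

/* hypothetical helper showing pledges being narrowed over time */
int
init_then_restrict(void)
{
#ifdef __OpenBSD__
        /* initialisation phase: stdio plus read-only file access */
        if (pledge("stdio rpath", NULL) == -1)
                return -1;

        /* ... read the configuration here ... */

        /* main-loop phase: drop "rpath", keep only "stdio" */
        if (pledge("stdio", NULL) == -1)
                return -1;

        /* asking for "rpath" back is refused: pledges only shrink */
        if (pledge("stdio rpath", NULL) != -1)
                return -1;
#endif
        return 0;
}
```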

Unveil is a natural complement of pledge, as it is used to limit the portion of the filesystem a process can access.

One important aspect of both pledge and unveil is that they are reset upon exec: this is why I’m not going to strictly categorise them as sandboxing methods.  Nevertheless, this aspect is, in my opinion, one big demonstration of pragmatism and the reason pledge and unveil are so widespread, even in software not developed with OpenBSD in mind.

On UNIX we have various programs that are, or act like, shells.  We constantly fork(2) to exec(2) other programs that do stuff that we don’t want to do.  Also, most programs have, or can easily be modified to have, an initialisation phase where they require access to various places on the filesystem and a lot of capabilities, and a “main-loop” phase where they only do a couple of things.  This means that it’s actually impossible to sandbox certain programs with capsicum(4) or with seccomp(2), while they’re dead-easy to pledge(2).

Take a shell for instance.  You cannot capsicum(4) csh.  You can’t seccomp(2) bash.  But you can pledge(2) ksh:

```
; grep 'if (pledge' /usr/src/bin/ksh/ -RinH
/usr/src/bin/ksh/main.c:150:            if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
/usr/src/bin/ksh/main.c:156:            if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
/usr/src/bin/ksh/misc.c:303:            if (pledge("stdio rpath wpath cpath fattr flock getpw proc "
```

OpenBSD is the only OS where BOTH gmid processes, the listener and the executor, are sandboxed.  The listener runs with the “stdio recvfd rpath inet” pledges and can only see the directories that it serves, and the executor runs with “stdio sendfd proc exec”.
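
A minimal sketch of what restricting the listener could look like, assuming a single served directory.  The promise string is the one quoted above; the function name and the docroot parameter are invented for illustration, and on systems other than OpenBSD the function compiles to a no-op.

```c
#include <unistd.h>

/* hypothetical helper: limit what the listener can see and do */
int
sandbox_listener(const char *docroot)
{
#ifdef __OpenBSD__
        /* only the served directory stays visible, read-only */
        if (unveil(docroot, "r") == -1)
                return -1;

        /*
         * "unveil" is not among the promises, so no further
         * unveil(2) calls are possible after this point.
         */
        if (pledge("stdio recvfd rpath inet", NULL) == -1)
                return -1;
#else
        (void)docroot;          /* nothing to do on other systems */
#endif
        return 0;
}
```

Two calls placed right before the main loop are the whole integration, which is the point the rest of this section makes about pledge and unveil.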

To conclude, pledge is more like a complement of the compiler, a sort of runtime check that you really do what you promised, than a sandboxing technique.


## Seccomp

Seccomp is huge.  It’s the most flexible and complex method of sandboxing I know of.  It was also the least pleasant one to work with, but was fun nevertheless.

Seccomp allows you to write a script in a particular language, BPF, that gets executed (in the kernel) before EVERY syscall.  The script can decide to allow or disallow the system call, to kill the program or to return an error: it can control the behaviour of your program.  Oh, and these scripts are inherited by the children of your program, so you can control them too.

BPF programs are designed to be “secure” to run kernel-side: they aren’t Turing-complete, as they have conditional jumps but can only jump forward, and they have a maximum allowed size, so you know for certain that a BPF program, from now on called a filter, will complete and take at worst n time.  BPF programs are also validated to ensure that every possible code path ends with a return.

These filters can access the system call number and the parameters.  One important restriction is that the filter can read the parameters but not dereference pointers: that means that you cannot disallow open(2) if the first argument is “/tmp”, but you can allow ioctl(2) only on the file descriptors 1, 5 and 27.
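
To make the ioctl(2) example concrete, here is a hedged sketch of a filter that inspects the first argument.  It is a demo, not gmid's filter: it allows every other syscall, skips the architecture check, returns EPERM instead of killing, and assumes a little-endian machine because it only reads the low 32 bits of the argument.  Both function names are invented for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* hypothetical helper: ioctl(2) is allowed only on fds 1, 5 and 27 */
int
install_ioctl_fd_filter(void)
{
        struct sock_filter filter[] = {
                /* load the syscall number */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                    offsetof(struct seccomp_data, nr)),
                /* ioctl?  skip the next instruction */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 1, 0),
                /* anything else is allowed (demo only!) */
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
                /* load the low word of the first argument, the fd */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                    offsetof(struct seccomp_data, args[0])),
                /* a match jumps forward to the ALLOW below */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 2, 0),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 5, 1, 0),
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 27, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
                /* no match: fail the call with EPERM */
                BPF_STMT(BPF_RET | BPF_K,
                    SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        };
        struct sock_fprog prog = {
                .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
                .filter = filter,
        };

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
                return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* try the filter in a child so that the caller stays unfiltered */
int
ioctl_filter_works(void)
{
        pid_t pid;
        int status, n;

        if ((pid = fork()) == -1)
                return 0;
        if (pid == 0) {
                if (install_ioctl_fd_filter() == -1)
                        _exit(2);
                /* fd 0 is not on the list: must fail with EPERM */
                if (ioctl(0, FIONREAD, &n) != -1 || errno != EPERM)
                        _exit(1);
                /* fd 1 is on the list: may fail, but not with EPERM */
                errno = 0;
                if (ioctl(1, FIONREAD, &n) == -1 && errno == EPERM)
                        _exit(1);
                _exit(0);
        }
        if (waitpid(pid, &status, 0) == -1)
                return 0;
        return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Note that the comparison is on the integer value of the descriptor only: the filter has no idea what file that descriptor refers to.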

So, what is it like to write a filter?  Well, I hope you like C macros :)

```C
struct sock_filter filter[] = {
        /* load the *current* architecture */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
            (offsetof(struct seccomp_data, arch))),
        /* ensure it's the same that we've been compiled on */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
            SECCOMP_AUDIT_ARCH, 1, 0),
        /* if not, kill the program */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

        /* load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
            (offsetof(struct seccomp_data, nr))),

        /* allow write */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

        /* … */
};

struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
};
```

and later load it with prctl:

```C
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
        fprintf(stderr, "%s: prctl(PR_SET_NO_NEW_PRIVS): %s\n",
            __func__, strerror(errno));
        exit(1);
}

if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1) {
        fprintf(stderr, "%s: prctl(PR_SET_SECCOMP): %s\n",
            __func__, strerror(errno));
        exit(1);
}
```

To make things a little more readable I have defined a SC_ALLOW macro as:

```C
/* make the filter more readable */
#define SC_ALLOW(nr)                                            \
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_##nr, 0, 1),   \
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
```

so you can write things like

```C
        /* … */
        SC_ALLOW(accept),
        SC_ALLOW(read),
        SC_ALLOW(openat),
        SC_ALLOW(fstat),
        SC_ALLOW(close),
        SC_ALLOW(lseek),
        SC_ALLOW(brk),
        SC_ALLOW(mmap),
        SC_ALLOW(munmap),
        /* … */
```

As you can see, BPF looks like assembly, and in fact one speaks of BPF bytecode.  I’m not going to teach BPF here, but it’s fairly easy to learn if you have some previous experience with assembly or with the bytecode of some virtual machine.

Debugging seccomp is also quite difficult.  When you violate a pledge the OpenBSD kernel aborts the program and logs something like

```
Jan 22 21:38:38 venera /bsd: foo[43103]: pledge "stdio", syscall 5
```

so you know a) what pledge you’re missing, “stdio” in this case, and b) what syscall you tried to issue, 5 in this example.  You also get a core dump, so you can check the stacktrace to understand what’s going on.

With BPF, your filter can do basically three things:
* kill the program with an un-catchable SIGSYS
* send a catchable SIGSYS
* don’t execute the syscall and return an error (you can choose which)

so if you want to debug things you have to implement your debugging strategy by yourself.  I’m doing something similar to what OpenSSH does: a compile-time switch makes the filter raise a catchable SIGSYS, and a handler is installed for it.

```C
/* uncomment to enable debugging.  ONLY FOR DEVELOPMENT */
/* #define SC_DEBUG */

#ifdef SC_DEBUG
# define SC_FAIL SECCOMP_RET_TRAP
#else
# define SC_FAIL SECCOMP_RET_KILL
#endif

static void
sandbox_seccomp_violation(int signum, siginfo_t *info, void *ctx)
{
        fprintf(stderr, "%s: unexpected system call (arch:0x%x,syscall:%d @ %p)\n",
            __func__, info->si_arch, info->si_syscall, info->si_call_addr);
        _exit(1);
}

static void
sandbox_seccomp_catch_sigsys(void)
{
        struct sigaction act;
        sigset_t mask;

        memset(&act, 0, sizeof(act));
        sigemptyset(&mask);
        sigaddset(&mask, SIGSYS);

        act.sa_sigaction = &sandbox_seccomp_violation;
        act.sa_flags = SA_SIGINFO;
        if (sigaction(SIGSYS, &act, NULL) == -1) {
                fprintf(stderr, "%s: sigaction(SIGSYS): %s\n",
                    __func__, strerror(errno));
                exit(1);
        }
        if (sigprocmask(SIG_UNBLOCK, &mask, NULL) == -1) {
                fprintf(stderr, "%s: sigprocmask(SIGSYS): %s\n",
                    __func__, strerror(errno));
                exit(1);
        }
}

/* … */
#ifdef SC_DEBUG
        sandbox_seccomp_catch_sigsys();
#endif
```

This way you can at least know what forbidden syscall you tried to run.


## Wrapping up

I’m not a security expert, so you should take my words with a huge grain of salt, but I think that if we want to build secure systems, we should try to make these important security mechanisms as easy as possible without defeating their purposes.

If a security mechanism is easy enough to understand, to apply and to debug, we can expect it to be picked up by a large number of people, and everyone benefits from it.  This is what I like about the OpenBSD system: over the years they tried to come up with simpler solutions to common problems, so now you have things like reallocarray, strlcat & strlcpy, strtonum, etc.  Small things that make errors harder to write.

You may criticise pledge(2) and unveil(2), but one important, and objective, point to note is how easy they are to add to a pre-existing program.  You have window managers, shells, servers and utilities that run under pledge, but I don’t know of a single window manager that runs under seccomp(2).

Talking about Linux in particular, the current seccomp implementation in gmid alone amounts to half the lines of code of the first version of the daemon itself.

Just as you cannot achieve security through obscurity, you cannot realise it with complexity either: at the end of the day, there isn’t really a considerable difference between obscurity and complexity.

Anyway, thanks for reading!  It was a really fun journey: I learned a lot and I had a blast.  If you want to report something, please do so by sending me a mail at <op at omarpolo dot com>, or by sending a message to op2 on freenode.  Bye and see you next time!