=> /post/inspecting-zips.gmi The first part “Inspecting zip files”

=> //git.omarpolo.com/zip-view/ The code for the whole series; see ‘zipview.c’ for this post in particular.

Edit 2021/08/21: Stefan Sperling (thanks!) noticed an error in the ‘next’ function. After that I found that a wrong check in ‘next’ caused an invalid memory access. The ‘next’ and ‘ls’ functions were corrected.

Now that we know how to navigate inside a zip file, let’s see how to extract files from it. But before looking into the decompression routines (spoiler: we’ll need zlib, so make sure it’s installed) we need to do a bit of refactoring; the reason will be clear in a second.
The ‘next’ function returns a pointer to the next file record in the central directory, or NULL if none is found:

```
void *
next(uint8_t *zip, size_t len, uint8_t *entry)
{
	uint16_t	 flen, xlen, clen;
	uint8_t		*next, *end;

	/* lengths of the file name, extra field and comment */
	memcpy(&flen, entry + 28, sizeof(flen));
	memcpy(&xlen, entry + 28 + 2, sizeof(xlen));
	memcpy(&clen, entry + 28 + 2 + 2, sizeof(clen));

	flen = le16toh(flen);
	xlen = le16toh(xlen);
	clen = le16toh(clen);

	/* skip the 46-byte fixed part and the variable-width fields */
	next = entry + 46 + flen + xlen + clen;
	end = zip + len;
	if (next >= end - 46 ||
	    memcmp(next, "\x50\x4b\x01\x02", 4) != 0)
		return NULL;
	return next;
}
```

It’s very similar to the code we had in the ‘ls’ function. It computes the pointer to the next entry and does a bit of validation.
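The memcpy plus le16toh pairs rely on <endian.h> (or <sys/endian.h>, depending on the system). As a side note, here’s a minimal sketch of an alternative that decodes the little-endian fields one byte at a time. ‘read_le16’ and ‘read_le32’ are hypothetical names of mine, they’re not part of ‘zipview.c’.

```c
#include <stdint.h>

/*
 * Hypothetical helpers: decode little-endian integers byte by byte,
 * so the result doesn't depend on the byte order of the host.
 */
uint16_t
read_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t
read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

With something like this, reading the file name length of an entry would become ‘read_le16(entry + 28)’.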
The ‘filename’ function extracts the filename given a pointer to a file record in the central directory:

```
void
filename(uint8_t *zip, size_t len, uint8_t *entry, char *buf,
    size_t size)
{
	uint16_t	flen;
	size_t		s;

	memcpy(&flen, entry + 28, sizeof(flen));
	flen = le16toh(flen);

	/* truncate the name if it doesn't fit, leaving room for the NUL */
	s = MIN(size-1, flen);
	memcpy(buf, entry + 46, s);
	buf[s] = '\0';
}
```
With these two functions we can now rewrite the ‘ls’ function more easily as:

```
void
ls(uint8_t *zip, size_t len, uint8_t *cd)
{
	char	name[PATH_MAX];

	do {
		filename(zip, len, cd, name, sizeof(name));
		printf("%s\n", name);
	} while ((cd = next(zip, len, cd)) != NULL);
}
```
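Given that a wrong check in this kind of loop can end up reading invalid memory, it may be worth exercising it on a hand-made buffer. What follows is only a sketch: ‘fake_entry’ and ‘count_entries’ are hypothetical helpers, not taken from ‘zipview.c’. The first writes a minimal central directory entry, the second walks the entries the same way ‘ls’ does, but using offsets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CD_FIXED 46	/* fixed part of a central directory entry */

/*
 * Hypothetical test helper: write a minimal central directory entry
 * for `name' at `dst' and return the number of bytes written.
 */
size_t
fake_entry(uint8_t *dst, const char *name)
{
	size_t	flen = strlen(name);

	memset(dst, 0, CD_FIXED);
	memcpy(dst, "\x50\x4b\x01\x02", 4);	/* signature */
	dst[28] = flen & 0xff;			/* file name length, */
	dst[29] = (flen >> 8) & 0xff;		/* little-endian */
	memcpy(dst + CD_FIXED, name, flen);
	return CD_FIXED + flen;
}

/* Count the entries found in the first `len' bytes of `cd'. */
size_t
count_entries(const uint8_t *cd, size_t len)
{
	size_t	n = 0, off = 0, flen, xlen, clen;

	while (off + CD_FIXED <= len &&
	    memcmp(cd + off, "\x50\x4b\x01\x02", 4) == 0) {
		flen = cd[off + 28] | (cd[off + 29] << 8);
		xlen = cd[off + 30] | (cd[off + 31] << 8);
		clen = cd[off + 32] | (cd[off + 33] << 8);
		off += CD_FIXED + flen + xlen + clen;
		n++;
	}
	return n;
}
```

Two entries written back to back with ‘fake_entry’ should be counted as two, while a buffer shorter than 46 bytes should yield zero.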
I also want to modify the main a bit:

```
int
main(int argc, char **argv)
{
	int	 i, fd;
	void	*zip, *cd;
	size_t	 len;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s archive.zip [files...]\n",
		    *argv);
		return 1;
	}

	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "can't open %s", argv[1]);

	zip = map_file(fd, &len);

#ifdef __OpenBSD__
	if (pledge("stdio", NULL) == -1)
		err(1, "pledge");
#endif

	if ((cd = find_central_directory(zip, len)) == NULL)
		errx(1, "can't find the central directory");

	if (argc == 2)
		ls(zip, len, cd);
	else {
		for (i = 2; i < argc; ++i)
			extract_file(zip, len, cd, argv[i]);
	}

	munmap(zip, len);
	close(fd);

	return 0;
}
```

The difference is that now it accepts a variable number of files to extract after the name of the archive.
Since I’m a bit of an OpenBSD fanboy myself, I’ve added a call to pledge(2) right before the main logic of the program: this way, even if we open a faulty zip file that tricks us into doing nasty stuff, the kernel will only allow us to write to *already* opened files and nothing more. On FreeBSD a call to capsicum(4) would be more or less the same in this case. On Linux you can waste some hours writing a seccomp(2) filter hoping it doesn’t break on weird architectures or libc implementations :P

(I’ve said already that I’m a bit of an OpenBSD fanboy myself, right?)

=> https://man.openbsd.org/pledge pledge(2) manpage
=> https://www.freebsd.org/cgi/man.cgi?capsicum capsicum(4) manpage
=> /posts/gmid-sandbox.gmi Comparing sandboxing techniques
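To make the comparison a bit more concrete, here’s a rough sketch of how the sandboxing call could be wrapped so that the same main works on both systems. ‘sandbox’ is a hypothetical helper of mine; I’m assuming that entering capability mode with cap_enter(2) is enough on FreeBSD, since the archive is already mapped and we only write to stdout. On other systems this sketch does nothing at all.

```c
#if defined(__OpenBSD__)
#include <unistd.h>		/* pledge(2) */
#elif defined(__FreeBSD__)
#include <sys/capsicum.h>	/* cap_enter(2) */
#endif

/*
 * Hypothetical helper: restrict the process once every file we need
 * is already open.  Returns 0 on success and -1 on failure.
 */
int
sandbox(void)
{
#if defined(__OpenBSD__)
	/* only stdio-like operations on already opened descriptors */
	return pledge("stdio", NULL);
#elif defined(__FreeBSD__)
	/* capability mode: no access to global namespaces anymore */
	return cap_enter();
#else
	/* assumption: no sandbox available, just carry on */
	return 0;
#endif
}
```

In main it would replace the #ifdef block with something like ‘if (sandbox() == -1) err(1, "sandbox");’.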
To implement ‘extract_file’ I’ve used a small helper function called ‘find_file’ that given a file name returns the pointer to its file entry in the central directory. It’s very similar to ‘ls’:

```
void *
find_file(uint8_t *zip, size_t len, uint8_t *cd, const char *target)
{
	char	name[PATH_MAX];

	do {
		filename(zip, len, cd, name, sizeof(name));
		if (!strcmp(name, target))
			return cd;
	} while ((cd = next(zip, len, cd)) != NULL);

	return NULL;
}
```
Then ‘extract_file’ is really easy:

```
int
extract_file(uint8_t *zip, size_t len, uint8_t *cd, const char *target)
{
	if ((cd = find_file(zip, len, cd, target)) == NULL)
		return -1;

	unzip(zip, len, cd);
	return 0;
}
```

OK, I’ve cheated a bit: this isn’t the real decompression routine, ‘extract_file’ only finds the correct entry and calls ‘unzip’. Initially I hooked ‘unzip’ into ‘ls’ but it was a bit messy, hence the refactor.
Small recap of the last post: in a zip file the file entry in the central directory contains a pointer to the file record inside the zip. The file record is a header followed by the (usually) compressed data. The interesting thing about zip files is that several compression algorithms (including none at all) can be used to compress files inside the same archive. You may have file A stored as-is, file B compressed with deflate and file C compressed with God knows what.

The good news is that most zip applications use deflate, and that’s all we care about here. Also, given that it’s easy, I’m going to support files stored without compression too. I have yet to find a zip with uncompressed files though, so that code path is completely untested.

Here are the two constants for the compression methods:

```
#define COMPRESSION_NONE	0x00
#define COMPRESSION_DEFLATE	0x08
```

The other algorithms and their codes are described at length in the zip documentation.
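Just to give an idea of what else is out there, here’s a small sketch that maps a few of the method codes listed in the zip documentation (APPNOTE.TXT) to a name. ‘method_name’ is a hypothetical helper, and the list is far from complete.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map a zip compression method code to a human
 * readable name.  Only a handful of the documented codes are listed.
 */
const char *
method_name(uint16_t method)
{
	switch (method) {
	case 0:		return "stored";
	case 8:		return "deflate";
	case 9:		return "deflate64";
	case 12:	return "bzip2";
	case 14:	return "lzma";
	case 93:	return "zstd";
	case 95:	return "xz";
	default:	return "unknown";
	}
}
```

It could make the error for an unsupported method a bit friendlier than a bare hexadecimal code.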
The ‘unzip’ function takes the zip and the pointer to the file entry in the central directory, then finds the offset of the file record inside the zip and computes the pointer to the start of the actual data. The file record header has a variable width: it’s made of 30 bytes followed by two variable-width fields, “file name” and “extra field”.

To know the compression method we need to read the compression field, an integer two bytes long starting at offset 8 (see the previous post or the official documentation for the structure of the headers).
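As a quick reference, the layout of the file record header can be sketched as a list of byte offsets. The values come from the zip documentation, while the names (‘LFH_*’, ‘lfh_data_offset’) are hypothetical and made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offsets inside a file record (local file) header. */
enum {
	LFH_SIGNATURE	= 0,	/* 4 bytes, "PK\x03\x04" */
	LFH_METHOD	= 8,	/* 2 bytes, compression method */
	LFH_CRC32	= 14,	/* 4 bytes */
	LFH_CSIZE	= 18,	/* 4 bytes, compressed size */
	LFH_USIZE	= 22,	/* 4 bytes, uncompressed size */
	LFH_FLEN	= 26,	/* 2 bytes, file name length */
	LFH_XLEN	= 28,	/* 2 bytes, extra field length */
	LFH_SIZE	= 30	/* fixed part; name and extra field follow */
};

/*
 * Hypothetical helper: distance between the start of the header and
 * the start of the data.  The caller must make sure that at least
 * LFH_SIZE bytes are readable at `hdr'.
 */
size_t
lfh_data_offset(const uint8_t *hdr)
{
	uint16_t	flen, xlen;

	/* both lengths are stored little-endian */
	flen = hdr[LFH_FLEN] | (hdr[LFH_FLEN + 1] << 8);
	xlen = hdr[LFH_XLEN] | (hdr[LFH_XLEN + 1] << 8);
	return (size_t)LFH_SIZE + flen + xlen;
}
```

For a header with a 5-byte file name and no extra field, the data starts 35 bytes after the signature.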
```
void
unzip(uint8_t *zip, size_t len, uint8_t *entry)
{
	uint32_t	 size, crc, off;
	uint16_t	 compression;
	uint16_t	 flen, xlen;
	uint8_t		*data, *offset;

	/* read the offset of the file record */
	memcpy(&off, entry + 42, sizeof(off));
	offset = zip + le32toh(off);

	if (offset > zip + len - 46 ||
	    memcmp(offset, "\x50\x4b\x03\x04", 4) != 0)
		errx(1, "invalid offset or file header signature");

	/* compression method, from the file record header */
	memcpy(&compression, offset + 8, sizeof(compression));
	compression = le16toh(compression);

	/* CRC32 and compressed size, from the central directory */
	memcpy(&crc, entry + 16, sizeof(crc));
	memcpy(&size, entry + 20, sizeof(size));

	crc = le32toh(crc);
	size = le32toh(size);

	memcpy(&flen, offset + 26, sizeof(flen));
	memcpy(&xlen, offset + 28, sizeof(xlen));

	flen = le16toh(flen);
	xlen = le16toh(xlen);

	/* the data follows the header, the file name and the extra field */
	data = offset + 30 + flen + xlen;
	if (data + size > zip + len)
		errx(1, "corrupted zip, offset out of file");

	switch (compression) {
	case COMPRESSION_NONE:
		unzip_none(data, size, crc);
		break;
	case COMPRESSION_DEFLATE:
		unzip_deflate(data, size, crc);
		break;
	default:
		errx(1, "unknown compression method 0x%02x",
		    compression);
	}
}
```
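A note on the bounds checks in ‘unzip’: they compare pointers computed from fields we can’t trust, and in C merely forming a pointer past the end of the mapping is undefined behaviour. A possible alternative, sketched here with a hypothetical ‘range_ok’ helper that isn’t in ‘zipview.c’, is to do the checks on offsets and to use only subtractions so that nothing can wrap around.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the `n' bytes starting at offset
 * `off' lie entirely inside a buffer of `len' bytes, 0 otherwise.
 * Written with subtractions only, so no intermediate sum can wrap.
 */
int
range_ok(size_t len, size_t off, size_t n)
{
	return off <= len && n <= len - off;
}
```

The check on the data would then look like ‘range_ok(len, data_off, size)’, with ‘data_off’ being the offset of the data from the start of the zip.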
‘unzip_none’ handles the case of a file stored as-is, without compression. It just copies the data to stdout and checks the CRC32.

CRC stands for “Cyclic Redundancy Check” and is widely used to guard against accidental corruption. The math behind it is really interesting: it uses Galois fields and has some really cool properties. It’s also easy to compute, even by hand, but since we’re already using zlib I’ll leave the handling of that to the ‘crc32’ function provided by the library.

=> https://en.wikipedia.org/wiki/Cyclic_redundancy_check “Cyclic Redundancy Check” at Wikipedia
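To back the “easy to compute” claim, here’s a minimal bit-by-bit sketch of the CRC-32 variant used by zip. It’s much slower than the table-driven code in zlib and ‘crc32_bitwise’ is a hypothetical name, but it mimics the calling convention of zlib’s ‘crc32’: start with 0 and feed the previous result back in to continue.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
 * Hypothetical stand-in for zlib's crc32(): pass 0 as the initial
 * value, or the result of a previous call to continue from there.
 */
uint32_t
crc32_bitwise(uint32_t crc, const uint8_t *buf, size_t len)
{
	size_t	i;
	int	bit;

	crc = ~crc;	/* undo the final inversion of the previous call */
	for (i = 0; i < len; ++i) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; ++bit) {
			if (crc & 1)
				crc = (crc >> 1) ^ 0xEDB88320u;
			else
				crc >>= 1;
		}
	}
	return ~crc;
}
```

The usual check value for this CRC is 0xCBF43926 for the ASCII string “123456789”, and the same value comes out if the string is fed in two pieces.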
```the implementation of the unzip_none procedure
void
unzip_none(uint8_t *data, size_t size, unsigned long ocrc)
{
	unsigned long	crc = 0;

	fwrite(data, 1, size, stdout);

	crc = crc32(0, data, size);
	if (crc != ocrc)
		errx(1, "CRC mismatch");
}
```
‘unzip_deflate’ handles the case of a deflate-compressed file, and I’m going to rely on zlib to decompress the deflated stream.

At least for the decompression, zlib doesn’t seem too bad to use. (I don’t know why but I’ve always had this impression that zlib had terrible APIs… While they’re not the prettiest, they’re not *exaggeratedly* bad either.)

We need to prepare a z_stream “object” with ‘inflateInit’, then run the decompression loop by repeatedly calling ‘inflate’ and finally free the storage with ‘inflateEnd’.

To get back to what I was blabbing about APIs, zlib has a weird way to convey some bits of information. A bare ‘inflateInit’ will assume a zlib stream, while zip archives store a bare deflate stream. The way to inform zlib about this is to call ‘inflateInit2’ instead and pass a negative number in the -15…-8 range for the sliding window size parameter. Yep, a negative window size means a raw deflate stream. (The way to require a gz header is also cool: add 16 to the desired sliding window size…)
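To keep these conventions straight, I find it handy to spell them out as code. This is only a sketch based on my reading of the comments in zlib.h: ‘describe_window_bits’ is a hypothetical function, not something zlib provides, and it doesn’t need zlib to compile.

```c
/*
 * Hypothetical helper: describe how inflateInit2() interprets its
 * windowBits argument, following the conventions documented in zlib.h.
 */
const char *
describe_window_bits(int wbits)
{
	if (wbits >= -15 && wbits <= -8)
		return "raw deflate";	/* what zip archives store */
	if (wbits == 0)
		return "zlib, window size taken from the stream header";
	if (wbits >= 8 && wbits <= 15)
		return "zlib";		/* what a bare inflateInit() expects */
	if (wbits >= 8 + 16 && wbits <= 15 + 16)
		return "gzip";
	if (wbits >= 8 + 32 && wbits <= 15 + 32)
		return "zlib or gzip, detected automatically";
	return "something else, check zlib.h";
}
```

So -15 reads as “raw deflate with the biggest window”, while 15 + 16 would ask for a gzip stream.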
When writing this function I was stuck on this issue for a while, as it’s not exactly intuitive in my opinion.

Anyway, the question now becomes what sliding window size to choose. From what I’ve understood, it should be computed as

```pseudo code to compute the sliding window size
size = log2(file_size)
if (size < 8)
	size = 8
if (size > 15)
	size = 15
return -1 * size
```

But for the zip file I’m using as a test, this doesn’t work. I found that unconditionally using -15 seems to work in all cases: it should use a bit more memory but it’s also the default value so it isn’t a bad choice I guess. (According to zlib.h the window used to inflate must be at least as big as the one used to deflate, and the zip headers don’t record it, which would explain why the maximum always works.)

If you happen to know more about the subject, feel free to correct me so I can update the post.
```
void
unzip_deflate(uint8_t *data, size_t size, unsigned long ocrc)
{
	z_stream	stream;
	size_t		have;
	unsigned long	crc = 0;
	uint8_t		buf[BUFSIZ];
	int		ret;

	stream.zalloc = Z_NULL;
	stream.zfree = Z_NULL;
	stream.opaque = Z_NULL;
	stream.next_in = data;
	stream.avail_in = size;
	stream.next_out = Z_NULL;
	stream.avail_out = 0;
	/* negative window size: raw deflate, no zlib or gz header */
	if (inflateInit2(&stream, -15) != Z_OK)
		errx(1, "inflateInit2 failed");

	do {
		stream.next_out = buf;
		stream.avail_out = sizeof(buf);

		/*
		 * Z_BLOCK would make inflate return at every deflate
		 * block boundary; Z_NO_FLUSH lets it fill the buffer.
		 */
		ret = inflate(&stream, Z_NO_FLUSH);
		switch (ret) {
		case Z_STREAM_ERROR:
			errx(1, "stream error");
		case Z_NEED_DICT:
			errx(1, "need dict");
		case Z_DATA_ERROR:
			errx(1, "data error: %s", stream.msg);
		case Z_MEM_ERROR:
			errx(1, "memory error");
		case Z_BUF_ERROR:
			/* no progress possible: the input ended too early */
			errx(1, "truncated deflate stream");
		}

		have = sizeof(buf) - stream.avail_out;
		fwrite(buf, 1, have, stdout);
		crc = crc32(crc, buf, have);
	} while (ret != Z_STREAM_END);

	inflateEnd(&stream);

	if (crc != ocrc)
		errx(1, "CRC mismatch");
}
```
Also note the beauty of the CRC: it can be computed chunk by chunk! The downside is that we don’t know whether the CRC matches or not until we’ve extracted all the file contents. We could probably run the loop twice, but it would be a waste of computing time, especially for big files.

Now, to test all the code written so far:

```
% cc zipview.c -o zipview -lz
% ./zipview star_maker_olaf_stapledon.gpub metadata.txt
title: Star Maker
author: William Olaf Stapledon
published: 1937
language: en
gpubVersion: 0.0.1
```

yay! it works!

In the next post I’ll add proper support for the ZIP64 spec and some final considerations.