=> /post/inspecting-zips.gmi The first part “Inspecting zip files”

=> //git.omarpolo.com/zip-utils/ The code for the whole series; see ‘zipview.c’ for this post in particular.

Edit 2021/08/21: Stefan Sperling (thanks!) noticed an error in the ‘next’ function. After that I found that a wrong check in ‘next’ caused an invalid memory access. The ‘next’ and ‘ls’ functions were corrected.

Now that we know how to navigate inside a zip file, let’s see how to extract files from it. But before looking into the decompression routines (spoiler: we’ll need zlib, so make sure it’s installed) we need to do a bit of refactoring; the reason will be clear in a second.

The ‘next’ function returns a pointer to the next file record in the central directory, or NULL if there is none:

```
void *
next(uint8_t *zip, size_t len, uint8_t *entry)
{
	uint16_t	 flen, xlen, clen;
	uint8_t		*next, *end;

	/* lengths of the file name, extra field and comment */
	memcpy(&flen, entry + 28, sizeof(flen));
	memcpy(&xlen, entry + 28 + 2, sizeof(xlen));
	memcpy(&clen, entry + 28 + 2 + 2, sizeof(clen));

	flen = le16toh(flen);
	xlen = le16toh(xlen);
	clen = le16toh(clen);

	/* the fixed part of a central directory record is 46 bytes long */
	next = entry + 46 + flen + xlen + clen;
	end = zip + len;
	if (next >= end - 46 ||
	    memcmp(next, "\x50\x4b\x01\x02", 4) != 0)
		return NULL;
	return next;
}
```

It’s very similar to the code we had in the ‘ls’ function. It computes the pointer to the next entry and does a bit of validation.
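
A side note on ‘le16toh’: the header that provides it differs between systems (<endian.h> with glibc, <sys/endian.h> on FreeBSD). As a minimal sketch, the same fields could be read byte by byte instead. ‘read_le16’ and ‘read_le32’ are hypothetical helpers and are not part of ‘zipview.c’:

```c
#include <stdint.h>

/* Hypothetical helpers, not taken from zipview.c. */

/* Read a 16-bit little-endian integer starting at p. */
static uint16_t
read_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit little-endian integer starting at p. */
static uint32_t
read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

With something like this, ‘flen = read_le16(entry + 28)’ would stand in for the memcpy plus le16toh pair, regardless of the byte order or alignment requirements of the host.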

The ‘filename’ function extracts the filename given a pointer to a file record in the central directory:

```
void
filename(uint8_t *zip, size_t len, uint8_t *entry, char *buf,
    size_t size)
{
	uint16_t	flen;
	size_t		s;

	memcpy(&flen, entry + 28, sizeof(flen));
	flen = le16toh(flen);

	/* truncate the name if it doesn't fit, leaving room for the NUL */
	s = MIN(size-1, flen);
	memcpy(buf, entry + 46, s);
	buf[s] = '\0';
}
```

With these two functions we can now rewrite the ‘ls’ function more easily as:

```
void
ls(uint8_t *zip, size_t len, uint8_t *cd)
{
	char	name[PATH_MAX];

	do {
		filename(zip, len, cd, name, sizeof(name));
		printf("%s\n", name);
	} while ((cd = next(zip, len, cd)) != NULL);
}
```

I also want to modify ‘main’ a bit:

```
int
main(int argc, char **argv)
{
	int	 i, fd;
	void	*zip, *cd;
	size_t	 len;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s archive.zip [files...]\n",
		    *argv);
		return 1;
	}

	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "can't open %s", argv[1]);

	zip = map_file(fd, &len);

#ifdef __OpenBSD__
	if (pledge("stdio", NULL) == -1)
		err(1, "pledge");
#endif

	if ((cd = find_central_directory(zip, len)) == NULL)
		errx(1, "can't find the central directory");

	if (argc == 2)
		ls(zip, len, cd);
	else {
		for (i = 2; i < argc; ++i)
			extract_file(zip, len, cd, argv[i]);
	}

	munmap(zip, len);
	close(fd);

	return 0;
}
```

The difference is that now it accepts a variable number of files to extract after the name of the archive.

Since I’m a bit of an OpenBSD fanboy myself, I’ve added a call to pledge(2) right before the main logic of the program: this way, even if we open a faulty zip file that tricks us into doing nasty stuff, the kernel will only allow us to write to *already* opened files and nothing more. On FreeBSD a call to capsicum(4) would be more or less the same in this case. On Linux you can waste some hours writing a seccomp(2) filter hoping it doesn’t break on weird architectures or libc implementations :P

(I’ve said already that I’m a bit of an OpenBSD fanboy myself, right?)

=> https://man.openbsd.org/pledge pledge(2) manpage
=> https://www.freebsd.org/cgi/man.cgi?capsicum capsicum(4) manpage
=> /post/gmid-sandbox.gmi Comparing sandboxing techniques
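
To make the pledge/capsicum comparison a bit more concrete, here is a rough sketch of how the sandboxing could be wrapped in one function. ‘sandbox’ is a hypothetical helper that doesn’t exist in ‘zipview.c’, and doing nothing on the other systems is my assumption, not a recommendation:

```c
#if defined(__OpenBSD__)
#include <unistd.h>
#elif defined(__FreeBSD__)
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical helper, not part of zipview.c: restrict the process
 * once the archive is mapped.  Returns 0 on success, -1 on failure.
 */
static int
sandbox(void)
{
#if defined(__OpenBSD__)
	/* only allow I/O on already opened descriptors */
	return pledge("stdio", NULL);
#elif defined(__FreeBSD__)
	/* enter capability mode: no more access to global namespaces */
	return cap_enter();
#else
	/* assumption: no sandbox elsewhere, carry on unprotected */
	return 0;
#endif
}
```

In ‘main’ it would go in the same spot as the pledge call, right after ‘map_file’, bailing out with ‘err’ when it returns -1.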

To implement ‘extract_file’ I’ve used a small helper function called ‘find_file’ that given a file name returns the pointer to its file entry in the central directory. It’s very similar to ‘ls’:

```
void *
find_file(uint8_t *zip, size_t len, uint8_t *cd, const char *target)
{
	char	name[PATH_MAX];

	do {
		filename(zip, len, cd, name, sizeof(name));
		if (!strcmp(name, target))
			return cd;
	} while ((cd = next(zip, len, cd)) != NULL);

	return NULL;
}
```

Then ‘extract_file’ is really easy:

```
int
extract_file(uint8_t *zip, size_t len, uint8_t *cd, const char *target)
{
	if ((cd = find_file(zip, len, cd, target)) == NULL)
		return -1;

	unzip(zip, len, cd);
	return 0;
}
```

OK, I’ve cheated a bit: this isn’t the real decompression routine. ‘extract_file’ only finds the correct entry and calls ‘unzip’. Initially I hooked ‘unzip’ into ‘ls’ but it was a bit messy, hence the refactoring.

Small recap of the last post: in a zip file, the file entry in the central directory contains a pointer to the file record inside the zip. The file record is a header followed by the (usually) compressed data. The interesting thing about zip files is that several compression algorithms (including none at all) can be used to compress files inside the same archive. You may have file A stored as-is, file B compressed with deflate and file C compressed with God knows what.

The good news is that most zip applications use deflate, and that’s all we care about here. Also, given that it’s easy, I’m going to support files stored without compression too. I have yet to find a zip with uncompressed files though, so that code path is completely untested.

Edit 2021/08/22: nytpu (thanks!) pointed out that the EPUB specification mandates that the first file in the archive is an uncompressed one called “mimetype”. I’ve tested with some epubs I had around and it seems to work as intended.

=> https://www.w3.org/publishing/epub3/epub-ocf.html#sec-zip-container-mime The Epub Specification

Here are the two constants for the compression methods:

```
#define COMPRESSION_NONE	0x00
#define COMPRESSION_DEFLATE	0x08
```

The other algorithms and their codes are described at length in the zip documentation.

The ‘unzip’ function takes the zip and the pointer to the file entry in the central directory, then finds the offset of the file record inside the archive and computes the pointer to the start of the actual data. The file record header has a variable width: it’s made of 30 fixed bytes followed by two variable-width fields, “file name” and “extra field”.

To know the compression method we need to read the compression field, a two-byte integer starting at offset 8 (see the previous post or the official documentation for the structure of the headers).
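
The layout of the file record header can be sketched as a small parser with its own bounds checks. This is only an illustration under my own naming: ‘struct local_header’, ‘le16’ and ‘parse_local_header’ are hypothetical and don’t appear in ‘zipview.c’, which reads the fields inline.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct, not from zipview.c: the fields unzip cares about. */
struct local_header {
	uint16_t	 compression;	/* offset 8 */
	uint16_t	 flen;		/* offset 26, file name length */
	uint16_t	 xlen;		/* offset 28, extra field length */
	const uint8_t	*data;		/* first byte after the header */
};

static uint16_t
le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Parse the file record header at zip + off.
 * Returns 0 on success, -1 if it is out of bounds or has a bad signature.
 */
static int
parse_local_header(const uint8_t *zip, size_t len, size_t off,
    struct local_header *h)
{
	const uint8_t	*p;
	size_t		 hlen;

	/* the fixed part is 30 bytes long */
	if (len < 30 || off > len - 30)
		return -1;
	p = zip + off;
	if (memcmp(p, "\x50\x4b\x03\x04", 4) != 0)
		return -1;

	h->compression = le16(p + 8);
	h->flen = le16(p + 26);
	h->xlen = le16(p + 28);

	/* the variable part must fit in the file too */
	hlen = (size_t)30 + h->flen + h->xlen;
	if (hlen > len - off)
		return -1;
	h->data = p + hlen;
	return 0;
}
```

Comparing offsets instead of pointers keeps the checks free of out-of-range pointer arithmetic, which is the main reason I’d write it this way.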

```
void
unzip(uint8_t *zip, size_t len, uint8_t *entry)
{
	uint32_t	 size, crc, off;
	uint16_t	 compression;
	uint16_t	 flen, xlen;
	uint8_t		*data, *offset;

	/* read the offset of the file record */
	memcpy(&off, entry + 42, sizeof(off));
	offset = zip + le32toh(off);

	/* the fixed part of the file record header is 30 bytes long */
	if (offset > zip + len - 30 ||
	    memcmp(offset, "\x50\x4b\x03\x04", 4) != 0)
		errx(1, "invalid offset or file header signature");

	memcpy(&compression, offset + 8, sizeof(compression));
	compression = le16toh(compression);

	/* CRC and compressed size, taken from the central directory */
	memcpy(&crc, entry + 16, sizeof(crc));
	memcpy(&size, entry + 20, sizeof(size));

	crc = le32toh(crc);
	size = le32toh(size);

	memcpy(&flen, offset + 26, sizeof(flen));
	memcpy(&xlen, offset + 28, sizeof(xlen));

	flen = le16toh(flen);
	xlen = le16toh(xlen);

	/* skip the header to reach the (compressed) data */
	data = offset + 30 + flen + xlen;
	if (data + size > zip + len)
		errx(1, "corrupted zip, offset out of file");

	switch (compression) {
	case COMPRESSION_NONE:
		unzip_none(data, size, crc);
		break;
	case COMPRESSION_DEFLATE:
		unzip_deflate(data, size, crc);
		break;
	default:
		errx(1, "unknown compression method 0x%02x",
		    compression);
	}
}
```

‘unzip_none’ handles the case of a file stored as-is, without compression. It just copies the data to stdout and checks the CRC32.

CRC stands for “Cyclic Redundancy Check” and is widely used to guard against accidental corruption. The math behind it is really interesting: it uses Galois fields and has some really cool properties. It’s also easy to compute, even by hand, but since we’re already using zlib I’ll leave the handling of that to the ‘crc32’ function provided by the library.

=> https://en.wikipedia.org/wiki/Cyclic_redundancy_check “Cyclic Redundancy Check” at Wikipedia

```the implementation of the unzip_none procedure
void
unzip_none(uint8_t *data, size_t size, unsigned long ocrc)
{
	unsigned long	crc;

	fwrite(data, 1, size, stdout);

	/* compare against the CRC recorded in the central directory */
	crc = crc32(0, data, size);
	if (crc != ocrc)
		errx(1, "CRC mismatch");
}
```

‘unzip_deflate’ handles the case of a deflate-compressed file, and I’m going to rely on zlib to decompress the deflated stream.

At least for the decompression, zlib doesn’t seem too bad to use. (I don’t know why, but I’ve always had the impression that zlib had terrible APIs… While they’re not the prettiest, they’re not *exaggeratedly* bad either.)

We need to prepare a z_stream “object” with ‘inflateInit’, then run the decompression loop by repeatedly calling ‘inflate’ and finally free the storage with ‘inflateEnd’.

To get back to what I was blabbing about before regarding APIs, zlib has a weird way to convey some bits of information. A bare ‘inflateInit’ expects a zlib stream (a deflate stream wrapped in a zlib header and trailer), while zip archives store a raw deflate stream. The way to inform zlib about this is to call ‘inflateInit2’ instead and pass a negative number in the -15…-8 range for the sliding window size parameter. Yep, a negative window size means a raw deflate stream. (The way to require a gz header is also cool: add 16 to the desired sliding window size…)

When writing this function I was stuck on this issue for a while, as it’s not exactly intuitive in my opinion.
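
To keep the ‘inflateInit2’ conventions straight, they can be written down as a tiny helper. The enum and the ‘window_bits’ function are names I made up for this sketch; only the numeric rules (negate for raw deflate, add 16 for gzip, add 32 for automatic zlib/gzip detection) come from the zlib manual.

```c
/* Hypothetical names; only the numeric conventions come from zlib. */
enum stream_format {
	FORMAT_RAW,	/* bare deflate, as stored in zip archives */
	FORMAT_ZLIB,	/* deflate with zlib header and trailer */
	FORMAT_GZIP,	/* deflate with gzip header and trailer */
	FORMAT_AUTO	/* let inflate detect zlib or gzip */
};

/*
 * Turn a format and a window size (log2, in the 8..15 range) into the
 * value expected as second argument by inflateInit2.
 */
static int
window_bits(enum stream_format fmt, int bits)
{
	switch (fmt) {
	case FORMAT_RAW:
		return -bits;
	case FORMAT_GZIP:
		return bits + 16;
	case FORMAT_AUTO:
		return bits + 32;
	case FORMAT_ZLIB:
	default:
		return bits;
	}
}
```

For a zip member one would then call something like ‘inflateInit2(&stream, window_bits(FORMAT_RAW, 15))’, which is just a wordier way to pass -15.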

Anyway, the question now becomes which sliding window size to choose. From what I’ve understood, it should be computed as

```pseudo code to compute the sliding window size
size = log2(file_size)
if (size < 8)
	size = 8
if (size > 15)
	size = 15
return -1 * size
```

But for the zip file I’m using as a test, this doesn’t work. I found that unconditionally using -15 seems to work in all cases: it should use a bit more memory, but it’s also the default value, so it isn’t a bad choice I guess. (The zlib manual says the window must be at least as large as the one used when compressing; a raw deflate stream doesn’t record that value, which would explain why only the maximum is safe.)

If you happen to know more about the subject, feel free to correct me so I can update the post.

```
void
unzip_deflate(uint8_t *data, size_t size, unsigned long ocrc)
{
	z_stream	stream;
	size_t		have;
	unsigned long	crc = 0;
	uint8_t		buf[BUFSIZ];
	int		ret;

	stream.zalloc = Z_NULL;
	stream.zfree = Z_NULL;
	stream.opaque = Z_NULL;
	stream.next_in = data;
	stream.avail_in = size;
	stream.next_out = Z_NULL;
	stream.avail_out = 0;
	/* negative window size: raw deflate; zlib doesn't set errno */
	if (inflateInit2(&stream, -15) != Z_OK)
		errx(1, "inflateInit2 failed");

	do {
		stream.next_out = buf;
		stream.avail_out = sizeof(buf);

		/* Z_NO_FLUSH: keep going across deflate block boundaries */
		switch (ret = inflate(&stream, Z_NO_FLUSH)) {
		case Z_STREAM_ERROR:
			errx(1, "stream error");
		case Z_NEED_DICT:
			errx(1, "need dict");
		case Z_DATA_ERROR:
			errx(1, "data error: %s", stream.msg);
		case Z_MEM_ERROR:
			errx(1, "memory error");
		case Z_BUF_ERROR:
			/* no progress possible: the input is truncated */
			errx(1, "truncated deflate stream");
		}

		have = sizeof(buf) - stream.avail_out;
		fwrite(buf, 1, have, stdout);
		crc = crc32(crc, buf, have);
	} while (ret != Z_STREAM_END);

	inflateEnd(&stream);

	if (crc != ocrc)
		errx(1, "CRC mismatch");
}
```

Also note the beauty of the CRC: it can be computed chunk by chunk! The downside is that we don’t know whether the CRC matches or not until we’ve extracted all the file contents. We could probably run the loop twice, but it would be a waste of computing, especially for big files.
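
To show the chunk-by-chunk property without pulling in zlib, here’s a naive bitwise CRC-32 written from scratch. It’s a sketch for illustration, not zlib’s (much faster) implementation, and ‘crc32_update’ is a name I picked; it follows the same calling convention as zlib’s ‘crc32’, though: start from 0 and feed the previous result back in.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 using the reflected polynomial 0xEDB88320.
 * Pass 0 as crc for the first chunk, then the previous return value.
 */
static uint32_t
crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	size_t	i;
	int	k;

	/* undo the final inversion of the previous call */
	crc = ~crc;
	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (k = 0; k < 8; k++) {
			if (crc & 1)
				crc = (crc >> 1) ^ 0xEDB88320u;
			else
				crc >>= 1;
		}
	}
	return ~crc;
}
```

Feeding “1234” and then “56789” gives the same value as feeding “123456789” in one go, namely 0xCBF43926, the usual check value for this CRC.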

Now, to test all the code written so far:

```
% cc zipview.c -o zipview -lz
% ./zipview star_maker_olaf_stapledon.gpub metadata.txt
title: Star Maker
author: William Olaf Stapledon
published: 1937
language: en
gpubVersion: 0.0.1
%
```

Yay! It works!

In the next post I’ll add proper support for the ZIP64 spec and share some final considerations.