Date:: Wed Aug 25 09:20:54 2021 UTC
=> /post/inspecting-zips.gmi The first part “Inspecting zip files”

=> //git.omarpolo.com/zip-utils/ The code for the whole series; see ‘zipview.c’ for this post in particular.

Edit 2021/08/21: Stefan Sperling (thanks!) noticed an error in the ‘next’ function.  After that I found that a wrong check in ‘next’ caused an invalid memory access.  The ‘next’ and ‘ls’ functions were corrected.

Now that we know how to navigate inside a zip file let’s see how to extract files from it.  But before looking into the decompression routines (spoiler: we’ll need zlib, so make sure it’s installed) we need to do a bit of refactoring; the reason will be clear in a second.

The ‘next’ function returns a pointer to the next file record in the central directory, or NULL if none is found:

```
void *
next(uint8_t *zip, size_t len, uint8_t *entry)
{
	uint16_t	 flen, xlen, clen;
	uint8_t		*next, *end;

	/* lengths of the file name, extra field and comment */
	memcpy(&flen, entry + 28, sizeof(flen));
	memcpy(&xlen, entry + 28 + 2, sizeof(xlen));
	memcpy(&clen, entry + 28 + 2 + 2, sizeof(clen));

	flen = le16toh(flen);
	xlen = le16toh(xlen);
	clen = le16toh(clen);

	/* skip the 46-byte fixed part and the variable-width fields */
	next = entry + 46 + flen + xlen + clen;
	end = zip + len;
	if (next >= end - 46 ||
	    memcmp(next, "\x50\x4b\x01\x02", 4) != 0)
		return NULL;
	return next;
}
```

It’s very similar to the code we had in the ‘ls’ function.  It computes the pointer to the next entry and does a bit of validation.
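
As a side note, the offset arithmetic used by ‘next’ can be sketched in isolation.  The names ‘rd16’ and ‘cd_entry_len’ are made up for this illustration and are not part of zipview.c; ‘rd16’ assembles the value byte by byte, which is one possible way to avoid depending on le16toh, assuming that portability to systems without it matters to you.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a little-endian 16-bit integer. */
static uint16_t
rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical helper: total size of a central directory record,
 * i.e. the 46-byte fixed part plus the three variable-width fields
 * whose lengths live at offsets 28, 30 and 32.
 */
static size_t
cd_entry_len(const uint8_t *entry)
{
	return 46 + (size_t)rd16(entry + 28) + rd16(entry + 30) +
	    rd16(entry + 32);
}
```

With something like this, the candidate pointer in ‘next’ would simply be ‘entry + cd_entry_len(entry)’; the validation against the end of the file stays the same.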

The ‘filename’ function extracts the filename given a pointer to a file record in the central directory:

```
void
filename(uint8_t *zip, size_t len, uint8_t *entry, char *buf,
    size_t size)
{
	uint16_t	flen;
	size_t		s;

	memcpy(&flen, entry + 28, sizeof(flen));
	flen = le16toh(flen);

	/* truncate the name if it doesn't fit in buf */
	s = MIN(size-1, flen);
	memcpy(buf, entry + 46, s);
	buf[s] = '\0';
}
```
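
One thing ‘filename’ takes on trust is that the name actually fits inside the mapped file.  A more defensive variant could look like the sketch below; ‘filename_checked’ and its return convention are assumptions of mine for the sake of the example, not code from the repository.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical variant of filename(): returns 0 on success, -1 if
 * the record (fixed part plus name) doesn't fit in the mapped file
 * or if buf can't even hold the terminating NUL.
 */
int
filename_checked(const uint8_t *zip, size_t len, const uint8_t *entry,
    char *buf, size_t size)
{
	size_t	off, flen, s;

	if (size == 0 || entry < zip)
		return -1;
	off = (size_t)(entry - zip);
	if (off > len || len - off < 46)
		return -1;

	/* little-endian 16-bit file name length at offset 28 */
	flen = (size_t)entry[28] | ((size_t)entry[29] << 8);
	if (len - off - 46 < flen)
		return -1;

	/* truncate silently if the caller's buffer is too small */
	s = flen < size - 1 ? flen : size - 1;
	memcpy(buf, entry + 46, s);
	buf[s] = '\0';
	return 0;
}
```

The checks are written as subtractions on offsets so that no pointer past the end of the mapping is ever computed.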

With these two functions we can now rewrite the ‘ls’ function more easily as:

```
void
ls(uint8_t *zip, size_t len, uint8_t *cd)
{
	char	name[PATH_MAX];

	do {
		filename(zip, len, cd, name, sizeof(name));
		printf("%s\n", name);
	} while ((cd = next(zip, len, cd)) != NULL);
}
```

I also want to modify the main a bit:

```
int
main(int argc, char **argv)
{
	int	 i, fd;
	void	*zip, *cd;
	size_t	 len;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s archive.zip [files...]\n",
		    *argv);
		return 1;
	}

	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "can't open %s", argv[1]);

	zip = map_file(fd, &len);

#ifdef __OpenBSD__
	if (pledge("stdio", NULL) == -1)
		err(1, "pledge");
#endif

	if ((cd = find_central_directory(zip, len)) == NULL)
		errx(1, "can't find the central directory");

	if (argc == 2)
		ls(zip, len, cd);
	else {
		for (i = 2; i < argc; ++i) {
			/* extract_file returns -1 if the name isn't found */
			if (extract_file(zip, len, cd, argv[i]) == -1)
				warnx("%s: not found", argv[i]);
		}
	}

	munmap(zip, len);
	close(fd);

	return 0;
}
```

The difference is that now it accepts a variable number of files to extract after the name of the archive.

Since I’m a bit of an OpenBSD fanboy myself, I’ve added a call to pledge(2) right before the main logic of the program: this way, even if we open a faulty zip file that tricks us into doing nasty stuff, the kernel will only allow us to write to *already* opened files and nothing more.  On FreeBSD, capsicum(4) would be more or less the same in this case.  On Linux you can waste some hours writing a seccomp(2) filter hoping it doesn’t break on weird architectures or libc implementations :P

(I’ve said already that I’m a bit of an OpenBSD fanboy myself, right?)

=> https://man.openbsd.org/pledge		pledge(2) manpage
=> https://www.freebsd.org/cgi/man.cgi?capsicum	capsicum(4) manpage
=> /post/gmid-sandbox.gmi			Comparing sandboxing techniques

To implement ‘extract_file’ I’ve used a small helper function called ‘find_file’ that, given a file name, returns the pointer to its file entry in the central directory.  It’s very similar to ‘ls’:

```
void *
find_file(uint8_t *zip, size_t len, uint8_t *cd, const char *target)
{
	char	name[PATH_MAX];

	do {
		filename(zip, len, cd, name, sizeof(name));
		if (!strcmp(name, target))
			return cd;
	} while ((cd = next(zip, len, cd)) != NULL);

	return NULL;
}
```

Then ‘extract_file’ is really easy:

```
int
extract_file(uint8_t *zip, size_t len, uint8_t *cd, const char *target)
{
	if ((cd = find_file(zip, len, cd, target)) == NULL)
		return -1;

	unzip(zip, len, cd);
	return 0;
}
```

OK, I’ve cheated a bit: this isn’t the real decompression routine, ‘extract_file’ only finds the correct entry and calls ‘unzip’.  Initially I hooked ‘unzip’ into ‘ls’ but it was a bit messy, hence the refactor.

Small recap of the last post: in a zip file the file entry in the central directory contains a pointer to the file record inside the zip.  The file record is a header followed by the (usually) compressed data.  The interesting thing about zip files is that several compression algorithms (including none at all) can be used to compress files inside the same archive.  You may have file A stored as-is, file B compressed with deflate and file C compressed with God knows what.

The good news is that most zip applications use deflate, and that’s all we care about here.  Also, given that it’s easy, I’m also going to support files stored without compression.  I have yet to find a zip with uncompressed files though, so that code path is completely untested.

Edit 2021/08/22: nytpu (thanks!) pointed out that the EPUB specification mandates that the first file in the archive is an uncompressed one called “mimetype”.  I’ve tested with some epubs I had around and it seems to work as intended.

=> https://www.w3.org/publishing/epub3/epub-ocf.html#sec-zip-container-mime  The Epub Specification

Here are the two constants for the compression methods:

```
#define COMPRESSION_NONE	0x00
#define COMPRESSION_DEFLATE	0x08
```

The other algorithms and their codes are described at length in the zip documentation.
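
To give an idea of what else is out there, here’s a small sketch that maps a handful of method codes to a human-readable name.  The function ‘compression_name’ is hypothetical and not part of zipview.c, and the list is deliberately partial: it only includes a few codes I’m confident about from the zip documentation.

```c
#include <stdint.h>

/*
 * Hypothetical helper: map a few method codes from the zip
 * specification to a name.  Only 0 and 8 are handled by the code
 * in this post; the others are listed purely for diagnostics.
 */
const char *
compression_name(uint16_t method)
{
	switch (method) {
	case 0:		return "stored";
	case 8:		return "deflate";
	case 9:		return "deflate64";
	case 12:	return "bzip2";
	case 14:	return "lzma";
	default:	return "unknown";
	}
}
```

Something like this could make the “unknown compression method” error a bit friendlier, by telling the user which algorithm the archive wants.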

The ‘unzip’ function takes the zip and the pointer to the file entry in the central directory, then finds the offset of the file record inside the zip and computes the pointer to the start of the actual data.  The file record header has a variable width: it’s made of 30 bytes followed by two variable-width fields, “file name” and “extra field”.

To know the compression method we need to read the compression field of the file record header, an integer two bytes long starting at offset 8.  (See the previous post or the official documentation for the structure of the headers.)

```
void
unzip(uint8_t *zip, size_t len, uint8_t *entry)
{
	uint32_t	 size, crc, off;
	uint16_t	 compression;
	uint16_t	 flen, xlen;
	uint8_t		*data, *offset;

	/* read the offset of the file record */
	memcpy(&off, entry + 42, sizeof(off));
	offset = zip + le32toh(off);

	/* the fixed part of the file record header is 30 bytes */
	if (offset > zip + len - 30 ||
	    memcmp(offset, "\x50\x4b\x03\x04", 4) != 0)
		errx(1, "invalid offset or file header signature");

	memcpy(&compression, offset + 8, sizeof(compression));
	compression = le16toh(compression);

	/* CRC and compressed size, from the central directory entry */
	memcpy(&crc, entry + 16, sizeof(crc));
	memcpy(&size, entry + 20, sizeof(size));

	crc = le32toh(crc);
	size = le32toh(size);

	memcpy(&flen, offset + 26, sizeof(flen));
	memcpy(&xlen, offset + 28, sizeof(xlen));

	flen = le16toh(flen);
	xlen = le16toh(xlen);

	/* the data follows the header and its variable-width fields */
	data = offset + 30 + flen + xlen;
	if (data + size > zip + len)
		errx(1, "corrupted zip, offset out of file");

	switch (compression) {
	case COMPRESSION_NONE:
		unzip_none(data, size, crc);
		break;
	case COMPRESSION_DEFLATE:
		unzip_deflate(data, size, crc);
		break;
	default:
		errx(1, "unknown compression method 0x%02x",
		    compression);
	}
}
```
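
The bounds checks in ‘unzip’ compare pointers, and a corrupted archive can make those pointers land far outside the mapping before the comparison happens.  A possible alternative, sketched here under the assumption that working with offsets is acceptable, is to do all the arithmetic on ‘size_t’ values instead.  ‘local_data_offset’ is a hypothetical name and is not part of zipview.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: given the offset of a file record header,
 * store in *out the offset of the file data that follows it.
 * Returns 0 on success, -1 if anything falls outside the archive.
 */
int
local_data_offset(const uint8_t *zip, size_t len, size_t hdr,
    size_t size, size_t *out)
{
	size_t	flen, xlen, data;

	/* the fixed part of the header is 30 bytes */
	if (hdr > len || len - hdr < 30)
		return -1;
	if (memcmp(zip + hdr, "\x50\x4b\x03\x04", 4) != 0)
		return -1;

	/* little-endian lengths of file name and extra field */
	flen = (size_t)zip[hdr + 26] | ((size_t)zip[hdr + 27] << 8);
	xlen = (size_t)zip[hdr + 28] | ((size_t)zip[hdr + 29] << 8);

	/* the data must fit between the header and the end of file */
	data = hdr + 30 + flen + xlen;
	if (data > len || len - data < size)
		return -1;

	*out = data;
	return 0;
}
```

The caller would then use ‘zip + *out’ as the data pointer, knowing that ‘size’ bytes starting there are inside the mapped file.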

‘unzip_none’ handles the case of a file stored as-is, without compression.  It just copies the data to stdout and checks the CRC32.

CRC stands for “Cyclic Redundancy Check” and is widely used to guard against accidental corruption.  The math behind it is really interesting: it uses Galois fields and has some really cool properties.  It’s also easy to compute, even by hand, but since we’re already using zlib I’ll leave the handling of that to the ‘crc32’ function provided by the library.

=> https://en.wikipedia.org/wiki/Cyclic_redundancy_check “Cyclic Redundancy Check” at Wikipedia

```the implementation of the unzip_none procedure
void
unzip_none(uint8_t *data, size_t size, unsigned long ocrc)
{
	unsigned long crc;

	/* stored files are copied verbatim */
	fwrite(data, 1, size, stdout);

	crc = crc32(0, data, size);
	if (crc != ocrc)
		errx(1, "CRC mismatch");
}
```

‘unzip_deflate’ handles the case of a deflate-compressed file, and I’m going to rely on zlib to decompress the deflated stream.

At least for the decompression, zlib doesn’t seem too bad to use.  (I don’t know why but I’ve always had the impression that zlib had terrible APIs…  While they’re not the prettiest, they’re not *exaggeratedly* bad either.)

We need to prepare a z_stream “object” with ‘inflateInit’, then run the decompression loop by repeatedly calling ‘inflate’ and finally free the storage with ‘inflateEnd’.

To get back to what I was blabbing about before regarding APIs, zlib has a weird way to convey some bits of information.  A bare ‘inflateInit’ assumes a zlib-wrapped stream, while zip archives store a raw deflate stream.  The way to inform zlib about this is to call ‘inflateInit2’ instead and pass a negative number in the -15…-8 range for the sliding window size parameter.  Yep, a negative window size means a raw deflate stream.  (The way to require a gz header is also cool: add 16 to the desired sliding window size…)

When writing this function I was stuck on this issue for a while, as it’s not exactly intuitive in my opinion.
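
To make the convention a bit more explicit, here’s a tiny sketch that encodes the framing and the window size into the single integer that ‘inflateInit2’ expects.  The ‘framing’ enum and the ‘window_bits’ function are names I made up for the example; zlib itself offers nothing like this, it only takes the resulting number.

```c
/* The three framings discussed above. */
enum framing { FRAMING_RAW, FRAMING_ZLIB, FRAMING_GZIP };

/*
 * Hypothetical helper: encode a window size exponent (8..15) and a
 * framing into the windowBits argument of inflateInit2().
 * Returns 0 if the exponent is out of range.
 */
int
window_bits(enum framing f, int exponent)
{
	if (exponent < 8 || exponent > 15)
		return 0;

	switch (f) {
	case FRAMING_RAW:	/* what zip archives contain */
		return -exponent;
	case FRAMING_ZLIB:	/* what a bare inflateInit assumes */
		return exponent;
	case FRAMING_GZIP:	/* gz header required */
		return exponent + 16;
	}
	return 0;
}
```

For a zip archive the call would then read something like ‘inflateInit2(&stream, window_bits(FRAMING_RAW, 15))’, which is just a wordier way to pass -15.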

Anyway, the question now becomes which sliding window size to choose.  From what I’ve understood, it should be computed as

```pseudo code to compute the sliding window size
size = log2(file_size)
if (size < 8)
	size = 8
if (size > 15)
	size = 15
return -1 * size
```

But for the zip file I’m using as a test, this doesn’t work.  I found that unconditionally using -15 seems to work in all cases: it should use a bit more memory, but it’s also the default value, so it isn’t a bad choice I guess.  (As far as I can tell from the zlib manual, the window used to inflate must be at least as large as the one used to deflate, which would explain why the largest one is always safe.)

If you happen to know more about the subject, feel free to correct me so I can update the post.

```
void
unzip_deflate(uint8_t *data, size_t size, unsigned long ocrc)
{
	z_stream	stream;
	size_t		have;
	unsigned long	crc = 0;
	uint8_t		buf[BUFSIZ];
	int		ret;

	stream.zalloc = Z_NULL;
	stream.zfree = Z_NULL;
	stream.opaque = Z_NULL;
	stream.next_in = data;
	stream.avail_in = size;
	stream.next_out = Z_NULL;
	stream.avail_out = 0;
	/* negative window size: raw deflate, no zlib or gz header */
	if (inflateInit2(&stream, -15) != Z_OK)
		errx(1, "inflateInit2 failed");

	do {
		stream.next_out = buf;
		stream.avail_out = sizeof(buf);

		/* the whole input is in memory: go on until Z_STREAM_END */
		switch (ret = inflate(&stream, Z_NO_FLUSH)) {
		case Z_STREAM_ERROR:
			errx(1, "stream error");
		case Z_NEED_DICT:
			errx(1, "need dict");
		case Z_DATA_ERROR:
			errx(1, "data error: %s", stream.msg);
		case Z_MEM_ERROR:
			errx(1, "memory error");
		case Z_BUF_ERROR:
			/* no progress possible: the input ended too early */
			errx(1, "truncated deflate stream");
		}

		have = sizeof(buf) - stream.avail_out;
		fwrite(buf, 1, have, stdout);
		crc = crc32(crc, buf, have);
	} while (ret != Z_STREAM_END);

	inflateEnd(&stream);

	if (crc != ocrc)
		errx(1, "CRC mismatch");
}
```

Also note the beauty of the CRC: it can be computed chunk by chunk!  The downside is that we don’t know whether the CRC matches or not until we’ve extracted all the file contents.  We could probably run the loop twice, but it would be a waste of computing, especially for big files.
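
For the curious, here’s a sketch of how a CRC-32 with the same “running value” interface could be computed without zlib.  It’s the slow bit-by-bit variant with the reflected polynomial 0xEDB88320; ‘crc32_update’ is a name I picked to avoid clashing with zlib’s ‘crc32’, and real code should keep using the library, which is much faster.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-by-bit CRC-32 (reflected polynomial 0xEDB88320).  Pass 0 as
 * crc for the first chunk and the previous return value afterwards.
 */
uint32_t
crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	size_t	i;
	int	k;

	/* undo the final inversion of the previous call */
	crc = ~crc;
	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		/* process one bit at a time, least significant first */
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}
```

Feeding “1234” and then “56789” gives the same value as feeding “123456789” in one go, which is exactly the property the decompression loop relies on.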

Now, to test all the code written so far:

```
% cc zipview.c -o zipview -lz
% ./zipview star_maker_olaf_stapledon.gpub metadata.txt
title: Star Maker
author: William Olaf Stapledon
published: 1937
language: en
gpubVersion: 0.0.1
%
```

Yay!  It works!

In the next post I’ll add proper support for the ZIP64 spec and share some final considerations.