=> /post/extracting-from-zips.gmi Part two: “Extracting files from zips”

=> //git.omarpolo.com/zip-utils/ The code for the whole series; see ‘zipls.c’ for this post in particular.

Edit 2021/08/20: some edits to improve the code and the commentary.
Edit 2021/08/21: stricter while condition for ‘ls’ and added links to the code

Disclaimer: before today I didn’t know anything about how zip files are structured, so take everything here with a huge grain of salt. The good news is that the code I wrote seems to be coherent with what I’ve read online and to actually work against some zip files I had around.

Background: I’d like to add support for gempubs to Telescope, the Gemini client I’m writing. gempubs are basically a directory of text/gemini files plus other assets (metadata.txt and images presumably) all zipped in a single archive.

=> https://codeberg.org/oppenlab/gempub gempub: a new eBook format based on text/gemini
=> //telescope.omarpolo.com Telescope

There are a lot of libraries to handle zip files, but I decided to give it a shot at writing something from scratch. After all, I don’t need to edit zips or do fancy stuff, I only need to read files from the archive, that’s all.

To start, in this entry we’ll only see how to dump the list of files in a zip archive. Maybe future entries will deal with more zip stuff.

From what I’ve gathered from APPNOTE.TXT and other sources, a zip file is a sequence of file “records” (a header followed by the file content) and a trailing “central directory” that holds the information about all the files.

=> https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT APPNOTE.TXT
=> https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip.html The structure of a PKZip file
=> https://en.wikipedia.org/wiki/ZIP_(file_format) ZIP (Wikipedia)

Having the central directory at the end of the file instead of at the beginning seems to be a choice to waste people’s time^W^W^W allow embedding zips into other file formats, such as GIFs or EXEs. I guess in some cases this may be an invaluable property, I just fail to see where, but anyway.

Edit 2021/08/20: Another advantage of having the central directory at the end is that it is probably possible to build up a zip on-the-fly, maybe outputting to standard output or to a similar non-seekable device, without having to build the whole zip in memory first.

One may think that it’s possible to scan a zip by reading these “records”, but unfortunately that’s not the case: the only source of truth for the actual files stored in the archive is the central directory. Applications that modify the zip may reuse or leave dummy file headers around, especially if they delete or replace files.

To aggravate the situation, it’s not obvious how to find the start of the central directory. Zips are truly wonderful, huh? I guess that adding a trailing 4-byte offset that points to the start of the central directory wouldn’t be bad, but we’re a bit too late.

The central directory is a sequence of records that identify the files in the archive, followed by a digital signature, two ZIP64 fields and the end of central directory record. I still haven’t wrapped my head around the digital signature and the ZIP64 fields, but they don’t seem necessary to access the list of files.

The last part of the central directory, the end record, contains a handy pointer to the start of the central directory. Unfortunately, it also contains a trailing variable-width comment area that complicates things a bit.

But enough talk, let’s jump to the code. Since Telescope is written in C, the small toy program that is the subject of this entry will also be written in C. The main function is pretty straightforward:

```main function
int
main(int argc, char **argv)
{
	int fd;
	void *zip, *cd;
	size_t len;

	if (argc != 2)
		errx(1, "missing file to inspect");

	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open %s", argv[1]);

	zip = map_file(fd, &len);
	if ((cd = find_central_directory(zip, len)) == NULL)
		errx(1, "can't find central directory");

	ls(zip, len, cd);

	munmap(zip, len);
	close(fd);

	return 0;
}
```

I think it would be easier for us to just mmap(2) the file into memory rather than moving back and forth by means of lseek(2). map_file is a thin wrapper around mmap(2):

```implementation of the map_file function
void *
map_file(int fd, size_t *len)
{
	off_t jump;
	void *addr;

	if ((jump = lseek(fd, 0, SEEK_END)) == -1)
		err(1, "lseek");

	if (lseek(fd, 0, SEEK_SET) == -1)
		err(1, "lseek");

	if ((addr = mmap(NULL, jump, PROT_READ, MAP_PRIVATE, fd, 0))
	    == MAP_FAILED)
		err(1, "mmap");

	*len = jump;
	return addr;
}
```

Just as we were discussing before, to locate the central directory we must first locate the “end of central directory record”. Its structure is as follows (the numbers inside the brackets indicate the byte count):

```structure of the end of central directory record
signature[4] disk_number[2] disk_cd_number[2] disk_entries[2]
total_entries[2] central_directory_size[4] cd_offset[4]
comment_len[2] comment…
```
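
As an aside, the layout above can be decoded field by field. What follows is only a sketch, not code from ‘zipls.c’: ‘struct eocd’, ‘rd16’, ‘rd32’ and ‘parse_eocd’ are names made up for the example. It assumes the fields are little-endian and reads them one byte at a time, so no endian helpers are needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct: the fixed 22-byte part of the end record. */
struct eocd {
	uint16_t disk_number;
	uint16_t disk_cd_number;
	uint16_t disk_entries;
	uint16_t total_entries;
	uint32_t cd_size;
	uint32_t cd_offset;
	uint16_t comment_len;
};

/* Read a little-endian 16-bit value regardless of the host byte order. */
static uint16_t
rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a little-endian 32-bit value regardless of the host byte order. */
static uint32_t
rd32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Decode the record at p, where len is the number of readable bytes.
 * Returns 0 on success, -1 if the record is too short or the
 * signature doesn't match.
 */
int
parse_eocd(const uint8_t *p, size_t len, struct eocd *out)
{
	if (len < 22 || memcmp(p, "\x50\x4b\x05\x06", 4) != 0)
		return -1;

	out->disk_number = rd16(p + 4);
	out->disk_cd_number = rd16(p + 6);
	out->disk_entries = rd16(p + 8);
	out->total_entries = rd16(p + 10);
	out->cd_size = rd32(p + 12);
	out->cd_offset = rd32(p + 16);
	out->comment_len = rd16(p + 20);
	return 0;
}
```

The caller still has to check that comment_len is consistent with the distance from the end of the file, since this function only looks at the fixed part of the record.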

The signature is always “\x50\x4b\x05\x06”, which helps in finding the record. We still need to be careful, since I haven’t seen anywhere that the signature MUST NOT appear inside the comment.

To be sure that we’ve actually found the real start of the end record, there’s an explicit check: the comment length plus the size of the non-variable part of the header must be equal to how far we have travelled from the end of the file. Granted, this is not completely bulletproof, since a specially-crafted comment may look like a proper end of central directory record, but I’m not sure what we could do better to protect against faulty files.

Side note: as always, I’m treating these files as untrusted and doing all the possible checks. You don’t want a malformed file to crash your program, do you?

One last thing: I’m totally fine with a very light and sparse usage of gotos. In find_central_directory I’m using a ‘goto again’ when we find a false signature inside a comment. A while loop would also do that, but it’d be a bit uglier.

```the find_central_directory procedure
void *
find_central_directory(uint8_t *addr, size_t len)
{
	uint32_t offset;
	uint16_t clen;
	uint8_t *p, *end;

	/*
	 * At -22 bytes from the end there is the end of the central
	 * directory assuming an empty comment.  It's a sensible place
	 * from which to start.
	 */
	if (len < 22)
		return NULL;
	end = addr + len;
	p = end - 22;

again:
	for (; p > addr; --p)
		if (memcmp(p, "\x50\x4b\x05\x06", 4) == 0)
			break;

	if (p == addr)
		return NULL;

	/* read comment length */
	memcpy(&clen, p + 20, sizeof(clen));
	clen = le16toh(clen);

	/* false signature inside a comment? */
	if (clen + 22 != end - p) {
		p--;
		goto again;
	}

	/* read the offset for the central directory */
	memcpy(&offset, p + 16, sizeof(offset));
	offset = le32toh(offset);

	if (addr + offset > p)
		return NULL;

	return addr + offset;
}
```

Edit 2021/08/20: there’s room for a little optimisation: the end record MUST be in the last 64 KiB (plus some bytes), so for big files there’s no need to keep searching all the way back to the start. Why 64 KiB? The comment length is a 16-bit integer, so the biggest possible end record is 22 bytes plus 65535 bytes of comment.
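
A possible way to write that bounded search is sketched below. It’s an illustration rather than the code in the repository: ‘find_end_record’, ‘EOCD_MIN’ and ‘COMMENT_MAX’ are hypothetical names, and the function returns the end record itself instead of the central directory. Unlike find_central_directory above, it also tests the very first byte of the file, so an archive made of the end record alone is accepted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EOCD_MIN	22	/* fixed part of the end record */
#define COMMENT_MAX	65535	/* biggest value of a 16-bit comment length */

/*
 * Search backwards for the end of central directory record, looking
 * at most EOCD_MIN + COMMENT_MAX bytes from the end of the file.
 * Returns a pointer to the record, or NULL if there is none.
 */
const uint8_t *
find_end_record(const uint8_t *addr, size_t len)
{
	const uint8_t *p, *end, *stop;
	uint16_t clen;

	if (len < EOCD_MIN)
		return NULL;

	end = addr + len;
	if (len > EOCD_MIN + COMMENT_MAX)
		stop = end - (EOCD_MIN + COMMENT_MAX);
	else
		stop = addr;

	for (p = end - EOCD_MIN; ; --p) {
		if (memcmp(p, "\x50\x4b\x05\x06", 4) == 0) {
			/* little-endian comment length at offset 20 */
			clen = (uint16_t)(p[20] | (p[21] << 8));

			/* the comment must end exactly at the end of the file */
			if ((size_t)(end - p) == (size_t)clen + EOCD_MIN)
				return p;
		}
		if (p == stop)
			return NULL;
	}
}
```

With this bound the cost of a failed search no longer depends on the size of the archive, which matters mostly for big files that aren’t zips at all.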

If everything went well, we’ve found the pointer to the start of the central directory. It’s made up of a sequence of file header records:

```
signature[4] version[2] vers_needed[2] flags[2] compression[2]
mod_time[2] mod_date[2] crc32[4]
compressed_size[4] uncompressed_size[4]
filename_len[2] extra_field_len[2] file_comment_len[2]
disk_number[2] internal_attrs[2] external_attrs[4] offset[4]
filename… extra_field… file_comment…
```

The signature field is always "\x50\x4b\x01\x02", which fortunately is different from the end record and the other records. To list the files we just have to read the file header records until we find one with a different signature:

```ls: traverse the file headers and print the filenames
void
ls(uint8_t *zip, size_t len, uint8_t *cd)
{
	uint32_t offset;
	uint16_t flen, xlen, clen;
	uint8_t *end;
	char filename[PATH_MAX];

	end = zip + len;
	while (cd < end - 46 && memcmp(cd, "\x50\x4b\x01\x02", 4) == 0) {
		memcpy(&flen, cd + 28, sizeof(flen));
		memcpy(&xlen, cd + 28 + 2, sizeof(xlen));
		memcpy(&clen, cd + 28 + 2 + 2, sizeof(clen));

		flen = le16toh(flen);
		xlen = le16toh(xlen);
		clen = le16toh(clen);

		memcpy(&offset, cd + 42, sizeof(offset));
		offset = le32toh(offset);

		memset(filename, 0, sizeof(filename));
		memcpy(filename, cd + 46, MIN(sizeof(filename)-1, flen));

		printf("%s [%u]\n", filename, (unsigned int)offset);

		cd += 46 + flen + xlen + clen;
	}
}
```

As always, there are some magic numbers hardcoded; a real program would probably define some constants, but for this simple toy program I’m fine with things as they are. Also, note the pedantry in the while condition to ensure we don’t end up reading out of bounds: I don’t want faulty zip files to cause invalid memory access.
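
For what it’s worth, the magic numbers could be given names along these lines. This is a sketch under assumptions, not the code from ‘zipls.c’: the CDFH_* constants, ‘rd16’ and ‘cd_entry_size’ are invented for the example. It also goes one step further than the while condition in ls, which covers the 46 fixed bytes, and checks that the filename, extra field and comment fit before the end of the mapping too.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the offsets used by ls. */
#define CDFH_SIZE	46	/* fixed part of a file header record */
#define CDFH_FLEN	28	/* offset of filename_len */
#define CDFH_XLEN	30	/* offset of extra_field_len */
#define CDFH_CLEN	32	/* offset of file_comment_len */

/* Read a little-endian 16-bit value regardless of the host byte order. */
static uint16_t
rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Return the total size of the file header record at cd, or 0 if it
 * has the wrong signature or doesn't fit entirely before end.
 */
size_t
cd_entry_size(const uint8_t *cd, const uint8_t *end)
{
	size_t avail, need;

	if (cd > end)
		return 0;
	avail = (size_t)(end - cd);

	if (avail < CDFH_SIZE || memcmp(cd, "\x50\x4b\x01\x02", 4) != 0)
		return 0;

	/* fixed part plus the three variable-length fields */
	need = CDFH_SIZE + (size_t)rd16(cd + CDFH_FLEN) +
	    rd16(cd + CDFH_XLEN) + rd16(cd + CDFH_CLEN);

	return need <= avail ? need : 0;
}
```

The loop in ls could then advance by the returned size and stop as soon as it gets 0.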

Now, to compile it and run:

```
% cc zipls.c -o zipls && ./zipls star_maker_olaf_stapledon.gpub
0_preface.gmi [0]
chapter_1_1_the_starting_point.gmi [2957]
chapter_1_2_earth_among_the_stars.gmi [6932]
chapter_2_1_interstellar_travel.gmi [11041]
chapter_3_1_on_the_other_earth.gmi [20382]
```

And voilà, it works!

To conclude this entry, one of the things that I’m still not sure about is the endianness of the numbers. I’m guessing they should be little-endian, but is that always the case, or is it only because the zip files were produced on a little-endian machine?

Edit 2021/08/20: The majority of the numbers are stored in little-endian. There are some exceptions, so check the documentation, but it’s mostly for fields like the MSDOS-like time and date and stuff like that. The code was updated with the calls to leXYtoh() from ‘endian.h’.
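
As for the MS-DOS time and date, they are two 16-bit values with several fields packed into the bits. Here is a minimal sketch of how they could be unpacked, assuming the usual MS-DOS layout (seconds stored divided by two, years counted from 1980). ‘struct dos_stamp’ and ‘dos_decode’ are names invented for the example, and the two inputs are expected to be already converted to host byte order.

```c
#include <stdint.h>

/* Hypothetical struct holding the unpacked timestamp. */
struct dos_stamp {
	int year, month, day;
	int hour, min, sec;
};

/*
 * Unpack an MS-DOS date and time pair.
 * date: bits 15-9 years since 1980, bits 8-5 month, bits 4-0 day.
 * time: bits 15-11 hours, bits 10-5 minutes, bits 4-0 seconds / 2.
 */
void
dos_decode(uint16_t date, uint16_t time, struct dos_stamp *out)
{
	out->year = 1980 + (date >> 9);
	out->month = (date >> 5) & 0x0f;
	out->day = date & 0x1f;

	out->hour = time >> 11;
	out->min = (time >> 5) & 0x3f;
	out->sec = (time & 0x1f) * 2;
}
```

In a file header record the two values would come from mod_date and mod_time, at offsets 14 and 12 according to the layout shown earlier.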

Otherwise I’m pretty happy with the result. In a short time I went from knowing nothing about zips to being able to at least inspect them, using only the C standard library (well, assuming POSIX). I’ll leave decoding the files for next time.