.TH VENTI 7
.SH NAME
venti \- archival storage server
.SH DESCRIPTION
Venti is a block storage server intended for archival data.
In a Venti server, the SHA1 hash of a block's contents acts
as the block identifier for read and write operations.
This approach enforces a write-once policy, preventing
accidental or malicious destruction of data. In addition,
duplicate copies of a block are coalesced, reducing the
consumption of storage and simplifying the implementation
of clients.
.PP
This manual page documents the basic concepts of
block storage using Venti as well as the Venti network protocol.
.PP
.MR Venti (1)
documents some simple clients.
.MR Vac (1) ,
.MR vacfs (4) ,
and
.MR vbackup (8)
are more complex clients.
.PP
.MR Venti (3)
describes a C library interface for accessing
Venti servers and manipulating Venti data structures.
.PP
.MR Venti (8)
describes the programs used to run a Venti server.
.PP
.SS "Scores"
The SHA1 hash that identifies a block is called its
.IR score .
The score of the zero-length block is called the
.IR "zero score" .
.PP
Scores may have an optional
.IB label :
prefix, typically used to
describe the format of the data.
For example,
.MR vac (1)
uses a
.B vac:
prefix, while
.MR vbackup (8)
uses prefixes corresponding to the file system
types:
.BR ext2: ,
.BR ffs: ,
and so on.
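.PP
As a rough illustration of the definitions above, the following Python
sketch computes a score and the zero score with the standard hashlib module.
The helper names score and with_label are hypothetical
and are not part of any Venti library.

```python
import hashlib

def score(block: bytes) -> str:
    """Return the score of a block: the SHA1 hash of its contents, in hex."""
    return hashlib.sha1(block).hexdigest()

def with_label(label: str, block: bytes) -> str:
    """Return the score with an optional 'label:' prefix (hypothetical helper)."""
    return f"{label}:{score(block)}"

# The zero score is the score of the zero-length block.
ZERO_SCORE = score(b"")

print(ZERO_SCORE)
print(with_label("vac", b"example block"))
```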
.SS "Files and Directories"
Venti accepts blocks up to 56 kilobytes in size.
By convention, Venti clients use hash trees of blocks to
represent arbitrary-size data
.IR files .
The data to be stored is split into fixed-size
blocks and written to the server, producing a list
of scores.
The resulting list of scores is split into fixed-size pointer
blocks (using only an integral number of scores per block)
and written to the server, producing a smaller list
of scores.
The process continues, eventually ending with the
score for the hash tree's top-most block.
Each file stored this way is summarized by
a
.B VtEntry
structure recording the top-most score, the depth
of the tree, the data block size, and the pointer block size.
One or more
.B VtEntry
structures can be concatenated
and stored as a special file called a
.IR directory .
In this
manner, arbitrary trees of files can be constructed
and stored.
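.PP
The hash tree construction described above can be sketched in Python as follows.
This is a simplified model, not the code used by any real client:
a dictionary stands in for the server, the names write_block and write_file
are hypothetical, zero truncation is omitted, and the way depth is counted here
is an assumption that may not match the VtEntry encoding.

```python
import hashlib

SCORE_SIZE = 20  # a SHA1 score is 20 bytes

def write_block(store: dict, block: bytes) -> bytes:
    """Stand-in for a Venti write: file the block under its SHA1 score."""
    s = hashlib.sha1(block).digest()
    store[s] = block  # identical blocks coalesce under the same key
    return s

def write_file(store: dict, data: bytes, dsize: int = 8192, psize: int = 8192):
    """Write data as a hash tree; return (top-most score, number of pointer levels)."""
    per_ptr = psize // SCORE_SIZE  # integral number of scores per pointer block
    if per_ptr < 2:
        raise ValueError("pointer block must hold at least two scores")
    # Level 0: fixed-size data blocks (an empty file becomes one empty block).
    chunks = [data[i:i + dsize] for i in range(0, len(data), dsize)] or [b""]
    scores = [write_block(store, c) for c in chunks]
    depth = 0
    # Repeatedly pack the score list into pointer blocks until one score remains.
    while len(scores) > 1:
        blocks = [b"".join(scores[i:i + per_ptr])
                  for i in range(0, len(scores), per_ptr)]
        scores = [write_block(store, b) for b in blocks]
        depth += 1
    return scores[0], depth
```
.PP
With 10-byte data blocks and 40-byte pointer blocks (two scores each),
a 100-byte file yields 10 data blocks and four levels of pointer blocks.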
.PP
Scores passed between programs conventionally refer
to
.B VtRoot
blocks, which contain descriptive information
as well as the score of a directory block containing a small number
of directory entries.
.PP
Conventionally, programs do not mix data and directory entries
in the same file. Instead, they keep two separate files, one with
directory entries and one with metadata referencing those
entries by position.
Keeping this parallel representation is a minor annoyance
but makes it possible for general programs like
.I venti/copy
(see
.MR venti (1) )
to traverse the block tree without knowing the specific details
of any particular program's data.
.SS "Block Types"
To allow programs to traverse these structures without
needing to understand their higher-level meanings,
Venti tags each block with a type. The types are:
.PP
.nf
.ft L
VtDataType      000   \fRdata\fL
VtDataType+1    001   \fRscores of \fLVtDataType\fR blocks\fL
VtDataType+2    002   \fRscores of \fLVtDataType+1\fR blocks\fL
\fR\&...\fL
VtDirType       010   VtEntry\fR structures\fL
VtDirType+1     011   \fRscores of \fLVtDirType\fR blocks\fL
VtDirType+2     012   \fRscores of \fLVtDirType+1\fR blocks\fL
\fR\&...\fL
VtRootType      020   VtRoot\fR structure\fL
.fi
.PP
The octal numbers listed are the type numbers used
by the commands below.
(For historical reasons, the type numbers used on
disk and on the wire are different from the above.
They do not distinguish
.BI VtDataType+ n
blocks from
.BI VtDirType+ n
blocks.)
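.PP
One way to read the octal type numbers in the table is as a base type
plus a pointer depth in the low octal digit.
The Python sketch below makes that reading explicit.
The function name describe is hypothetical, and treating the low three bits
as the depth is an assumption drawn from the table layout,
not a statement about the on-disk or wire encoding.

```python
# Octal type numbers as listed in the table above.
VT_DATA_TYPE = 0o00
VT_DIR_TYPE = 0o10
VT_ROOT_TYPE = 0o20

NAMES = {
    VT_DATA_TYPE: "VtDataType",
    VT_DIR_TYPE: "VtDirType",
    VT_ROOT_TYPE: "VtRootType",
}

def describe(t: int):
    """Split a type number into (base type name, pointer depth n)."""
    base = t & ~0o7   # high part selects data, directory, or root
    depth = t & 0o7   # low octal digit is the n in VtDataType+n (assumed)
    if base not in NAMES:
        raise ValueError(f"unknown block type {t:03o}")
    return NAMES[base], depth

print(describe(0o012))
```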
.SS "Zero Truncation"
To avoid storing the same short data blocks padded with
differing numbers of zeros, Venti clients working with fixed-size
blocks conventionally
`zero truncate' the blocks before writing them to the server.
For example, if a 1024-byte data block contains the
11-byte string
.RB ` hello " " world '
followed by 1013 zero bytes,
a client would store only the 11-byte block.
When the client later read the block from the server,
it would append zero bytes to the end as necessary to
reach the expected size.
.PP
When truncating pointer blocks
.RB ( VtDataType+ \fIn
and
.BI VtDirType+ n
blocks),
trailing zero scores are removed
instead of trailing zero bytes.
.PP
Because of the truncation convention,
any file consisting entirely of zero bytes,
no matter what its length, will be represented by the zero score:
the data blocks contain all zeros and are thus truncated
to the empty block, and the pointer blocks contain all zero scores
and are thus also truncated to the empty block,
and so on up the hash tree.
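.PP
The truncation rules above might be sketched in Python like this.
The function names are hypothetical, and the sketch covers only data blocks
and pointer blocks; it assumes a pointer block is a plain concatenation
of 20-byte scores.

```python
import hashlib

SCORE_SIZE = 20
ZERO_SCORE = hashlib.sha1(b"").digest()  # score of the zero-length block

def truncate_data(block: bytes) -> bytes:
    """Drop trailing zero bytes from a data block before writing it."""
    return block.rstrip(b"\x00")

def pad_data(block: bytes, size: int) -> bytes:
    """Restore a data block read from the server to its expected size."""
    return block.ljust(size, b"\x00")

def truncate_pointers(block: bytes) -> bytes:
    """Drop trailing zero scores (not zero bytes) from a pointer block."""
    n = len(block) - len(block) % SCORE_SIZE  # whole scores only
    while n >= SCORE_SIZE and block[n - SCORE_SIZE:n] == ZERO_SCORE:
        n -= SCORE_SIZE
    return block[:n]

stored = truncate_data(b"hello world" + bytes(1013))
print(len(stored), len(pad_data(stored, 1024)))
```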
.SS "Network Protocol"
A Venti session begins when a
.I client
connects to the network address served by a Venti
.IR server ;
the conventional address is
.BI tcp! server !venti
(the
.B venti
port is 17034).
Both client and server begin by sending a version
string of the form
.BI venti- versions - comment \en \fR.
The
.I versions
field is a list of acceptable versions separated by
colons.
The protocol described here is version
.BR 02 .
The client is responsible for choosing a common
version and sending it in the
.B VtThello
message, described below.
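.PP
A minimal Python sketch of the version exchange described above follows.
The function names are hypothetical, and picking the first of the client's
versions that the server also lists is an assumed policy;
the text only requires that the client choose a common version.

```python
def make_version_string(versions, comment: str) -> bytes:
    """Build 'venti-versions-comment\\n' with colon-separated versions."""
    return ("venti-" + ":".join(versions) + "-" + comment + "\n").encode("ascii")

def parse_version_string(line: bytes):
    """Return (list of versions, comment) from a peer's version string."""
    text = line.decode("ascii").rstrip("\n")
    prefix, versions, comment = text.split("-", 2)  # comment may contain '-'
    if prefix != "venti":
        raise ValueError("not a venti version string")
    return versions.split(":"), comment

def choose_version(ours, theirs):
    """Pick the first of our versions that the peer accepts (assumed policy)."""
    for v in ours:
        if v in theirs:
            return v
    return None

server_versions, _ = parse_version_string(b"venti-02-example server\n")
print(choose_version(["04", "02"], server_versions))
```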
.PP
After the initial version exchange, the client transmits
.I requests
.RI ( T-messages )
to the server, which subsequently returns
.I replies
.RI ( R-messages )
to the client.
The combined act of transmitting (receiving) a request
of a particular type, and receiving (transmitting) its reply
is called a
.I transaction
of that type.
.PP
Each message consists of a sequence of bytes.
Two-byte fields hold unsigned integers represented
in big-endian order (most significant byte first).
Data items of variable lengths are represented by
a one-byte field specifying a count,
.IR n ,
followed by
.I n
bytes of data.
Text strings are represented similarly,
using a two-byte count with
the text itself stored as a UTF-8-encoded sequence
of Unicode characters (see
.MR utf (7) ).
Text strings are not
.SM NUL\c
-terminated:
.I n
counts the bytes of UTF-8 data, which include no final
zero byte.
The
.SM NUL
character is illegal in text strings in the Venti protocol.
The maximum string length in Venti is 1024 bytes.
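.PP
The field encodings above can be illustrated with Python's struct module.
The helper names pack_string, unpack_string, and pack_var are hypothetical;
the sketch simply follows the rules stated in the text.

```python
import struct

MAX_STRING = 1024  # maximum string length in bytes, per the text above

def pack_string(s: str) -> bytes:
    """Encode a text string: two-byte big-endian count, then UTF-8 bytes."""
    raw = s.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("NUL is illegal in Venti strings")
    if len(raw) > MAX_STRING:
        raise ValueError("string too long")
    return struct.pack(">H", len(raw)) + raw

def unpack_string(buf: bytes, off: int = 0):
    """Decode a string at offset off; return (text, offset past the string)."""
    (n,) = struct.unpack_from(">H", buf, off)
    raw = buf[off + 2:off + 2 + n]
    if len(raw) != n:
        raise ValueError("short buffer")
    return raw.decode("utf-8"), off + 2 + n

def pack_var(data: bytes) -> bytes:
    """Encode a variable-length item: one-byte count, then the data."""
    if len(data) > 255:
        raise ValueError("item too long for a one-byte count")
    return bytes([len(data)]) + data

print(pack_string("venti"))
```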
.PP
Each Venti message begins with a two-byte size field
specifying the length in bytes of the message,
not including the length field itself.
The next byte is the message type, one of the constants
in the enumeration in the include file
.BR <venti.h> .
The next byte is an identifying
.IR tag ,
used to match responses to requests.
The remaining bytes are parameters of different sizes.
In the message descriptions, the number of bytes in a field
is given in brackets after the field name.
The notation
.IR parameter [ n ]
where
.I n
is not a constant represents a variable-length parameter:
.IR n [1]
followed by
.I n
bytes of data forming the
.IR parameter .
The notation
.IR string [ s ]
(using a literal
.I s
character)
is shorthand for
.IR s [2]
followed by
.I s
bytes of UTF-8 text.
The notation
.IR parameter []
where
.I parameter
is the last field in the message represents a
variable-length field that comprises all remaining
bytes in the message.
.PP
All Venti RPC messages are prefixed with a field
.IR size [2]
giving the length of the message that follows
(not including the
.I size
field itself).
The message bodies are:
.ta \w'\fLVtTgoodbye 'u
.IP
.ne 2v
.B VtThello
.IR tag [1]
.IR version [ s ]
.IR uid [ s ]
.IR strength [1]
.IR crypto [ n ]
.IR codec [ n ]
.br
.B VtRhello
.IR tag [1]
.IR sid [ s ]
.IR rcrypto [1]
.IR rcodec [1]
.IP
.ne 2v
.B VtTping
.IR tag [1]
.br
.B VtRping
.IR tag [1]
.IP
.ne 2v
.B VtTread
.IR tag [1]
.IR score [20]
.IR type [1]
.IR pad [1]
.IR count [2]
.br
.B VtRread
.IR tag [1]
.IR data []
.IP
.ne 2v
.B VtTwrite
.IR tag [1]
.IR type [1]
.IR pad [3]
.IR data []
.br
.B VtRwrite
.IR tag [1]
.IR score [20]
.IP
.ne 2v
.B VtTsync
.IR tag [1]
.br
.B VtRsync
.IR tag [1]
.IP
.ne 2v
.B VtRerror
.IR tag [1]
.IR error [ s ]
.IP
.ne 2v
.B VtTgoodbye
.IR tag [1]
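.PP
To make the layouts above concrete, the following Python sketch packs a few
version 02 T-messages.
The function names are hypothetical, and the numeric message types
(2, 12, and 14) are assumptions that should be checked against <venti.h>.
The wire_type argument is the type as used on the wire, which differs
from the octal numbers listed under Block Types.

```python
import struct

# Assumed message type numbers; confirm against <venti.h> before relying on them.
VT_TPING = 2
VT_TREAD = 12
VT_TWRITE = 14

def frame(body: bytes) -> bytes:
    """Prefix a message with size[2], which excludes the size field itself."""
    return struct.pack(">H", len(body)) + body

def pack_tping(tag: int) -> bytes:
    """VtTping tag[1]"""
    return frame(bytes([VT_TPING, tag]))

def pack_tread(tag: int, score: bytes, wire_type: int, count: int) -> bytes:
    """VtTread tag[1] score[20] type[1] pad[1] count[2]"""
    if len(score) != 20:
        raise ValueError("score must be 20 bytes")
    body = bytes([VT_TREAD, tag]) + score + bytes([wire_type, 0])
    return frame(body + struct.pack(">H", count))

def pack_twrite(tag: int, wire_type: int, data: bytes) -> bytes:
    """VtTwrite tag[1] type[1] pad[3] data[]"""
    return frame(bytes([VT_TWRITE, tag, wire_type, 0, 0, 0]) + data)

print(pack_tping(7).hex())
```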
.PP
Each T-message has a one-byte
.I tag
field, chosen and used by the client to identify the message.
The server will echo the request's
.I tag
field in the reply.
Clients should arrange that no two outstanding
messages have the same tag field so that responses
can be distinguished.
.PP
The type of an R-message will either be one greater than
the type of the corresponding T-message or
.BR VtRerror ,
indicating that the request failed.
In the latter case, the
.I error
field contains a string describing the reason for failure.
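.PP
A client might check a reply against its request along these lines.
This Python sketch is illustrative only: parse_reply and VentiError are
hypothetical names, the value 1 for VtRerror is an assumption to be checked
against <venti.h>, and body is the message with the size field already removed.

```python
VT_RERROR = 1  # assumed value; confirm against <venti.h>

class VentiError(Exception):
    """Raised when the server answers with VtRerror (hypothetical class)."""

def parse_reply(ttype: int, tag: int, body: bytes) -> bytes:
    """Validate a reply to a T-message of type ttype; return its parameters."""
    rtype, rtag = body[0], body[1]
    if rtag != tag:
        raise ValueError("reply tag does not match request tag")
    if rtype == VT_RERROR:
        # VtRerror tag[1] error[s]: two-byte count, then UTF-8 text.
        n = int.from_bytes(body[2:4], "big")
        raise VentiError(body[4:4 + n].decode("utf-8"))
    if rtype != ttype + 1:
        raise ValueError("unexpected reply type")
    return body[2:]

print(parse_reply(12, 5, bytes([13, 5]) + b"data"))
```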
.PP
Venti connections must begin with a
.B hello
transaction.
The
.B VtThello
message contains the protocol
.I version
that the client has chosen to use.
The fields
.IR strength ,
.IR crypto ,
and
.I codec
could be used to add authentication, encryption,
and compression to the Venti session
but are currently ignored.
The
.I rcrypto
and
.I rcodec
fields in the
.B VtRhello
response are similarly ignored.
The
.I uid
and
.I sid
fields are intended to be the identity
of the client and server but, given the lack of
authentication, should be treated only as advisory.
The initial
.B hello
should be the only
.B hello
transaction during the session.
.PP
The
.B ping
message has no effect and
is used mainly for debugging.
Servers should respond immediately to pings.
.PP
The
.B read
message requests a block with the given
.I score
and
.IR type .
Use
.I vttodisktype
and
.I vtfromdisktype
(see
.MR venti (3) )
to convert between a block type enumeration value
.RB ( VtDataType ,
etc.)
and the
.I type
used on disk and in the protocol.
The
.I count
field specifies the maximum expected size
of the block.
The
.I data
in the reply is the block's contents.
.PP
The
.B write
message writes a new block of the given
.I type
with contents
.I data
to the server.
The response includes the
.I score
to use to read the block,
which should be the SHA1 hash of
.IR data .
.PP
The Venti server may buffer written blocks in memory,
waiting until after responding to the
.B write
message before writing them to
permanent storage.
The server will delay the response to a
.B sync
message until after all blocks in earlier
.B write
messages have been written to permanent storage.
.PP
The
.B goodbye
message ends a session. There is no
.BR VtRgoodbye :
upon receiving the
.B VtTgoodbye
message, the server terminates the connection.
.PP
Version
.B 04
of the Venti protocol is similar to version
.B 02
(described above)
but has two changes to accommodate larger payloads.
First, it replaces the leading 2-byte packet size with
a 4-byte size.
Second, the
.I count
in the
.B VtTread
packet may be either 2 or 4 bytes;
the total packet length distinguishes the two cases.
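.PP
The two version 04 changes could be handled as in the Python sketch below.
The function names are hypothetical, and the sketch assumes that the 4-byte
fields are big-endian like the 2-byte fields described earlier.
Here body is a message without its size prefix,
beginning with the message type byte.

```python
import struct

def frame(body: bytes, version: str = "02") -> bytes:
    """Prefix body with a 2-byte (version 02) or 4-byte (version 04) size."""
    if version == "04":
        return struct.pack(">I", len(body)) + body
    if len(body) > 0xFFFF:
        raise ValueError("message too large for a 2-byte size")
    return struct.pack(">H", len(body)) + body

def tread_count(body: bytes) -> int:
    """Extract count from a VtTread body, using its length to pick the width."""
    fixed = 1 + 1 + 20 + 1 + 1  # message type, tag, score, type, pad
    rest = len(body) - fixed
    if rest == 2:
        return struct.unpack_from(">H", body, fixed)[0]
    if rest == 4:
        return struct.unpack_from(">I", body, fixed)[0]
    raise ValueError("malformed VtTread")

print(frame(b"ab", "04").hex())
```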
.SH SEE ALSO
.MR venti (1) ,
.MR venti (3) ,
.MR venti (8)
.br
Sean Quinlan and Sean Dorward,
``Venti: a new approach to archival storage'',
.I "Usenix Conference on File and Storage Technologies" ,
2002.