comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/share/doc/xz/xz-file-format.txt @ 68:5028fdace37b
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 16:23:26 -0400 |
parents | |
children |
67:0e9998148a16 | 68:5028fdace37b |
---|---|
1 | |
2 The .xz File Format | |
3 =================== | |
4 | |
5 Version 1.2.1 (2024-04-08) | |
6 | |
7 | |
8 0. Preface | |
9 0.1. Notices and Acknowledgements | |
10 0.2. Getting the Latest Version | |
11 0.3. Version History | |
12 1. Conventions | |
13 1.1. Byte and Its Representation | |
14 1.2. Multibyte Integers | |
15 2. Overall Structure of .xz File | |
16 2.1. Stream | |
17 2.1.1. Stream Header | |
18 2.1.1.1. Header Magic Bytes | |
19 2.1.1.2. Stream Flags | |
20 2.1.1.3. CRC32 | |
21 2.1.2. Stream Footer | |
22 2.1.2.1. CRC32 | |
23 2.1.2.2. Backward Size | |
24 2.1.2.3. Stream Flags | |
25 2.1.2.4. Footer Magic Bytes | |
26 2.2. Stream Padding | |
27 3. Block | |
28 3.1. Block Header | |
29 3.1.1. Block Header Size | |
30 3.1.2. Block Flags | |
31 3.1.3. Compressed Size | |
32 3.1.4. Uncompressed Size | |
33 3.1.5. List of Filter Flags | |
34 3.1.6. Header Padding | |
35 3.1.7. CRC32 | |
36 3.2. Compressed Data | |
37 3.3. Block Padding | |
38 3.4. Check | |
39 4. Index | |
40 4.1. Index Indicator | |
41 4.2. Number of Records | |
42 4.3. List of Records | |
43 4.3.1. Unpadded Size | |
44 4.3.2. Uncompressed Size | |
45 4.4. Index Padding | |
46 4.5. CRC32 | |
47 5. Filter Chains | |
48 5.1. Alignment | |
49 5.2. Security | |
50 5.3. Filters | |
51 5.3.1. LZMA2 | |
52 5.3.2. Branch/Call/Jump Filters for Executables | |
53 5.3.3. Delta | |
54 5.3.3.1. Format of the Encoded Output | |
55 5.4. Custom Filter IDs | |
56 5.4.1. Reserved Custom Filter ID Ranges | |
57 6. Cyclic Redundancy Checks | |
58 7. References | |
59 | |
60 | |
61 0. Preface | |
62 | |
63 This document describes the .xz file format (filename suffix | |
64 ".xz", MIME type "application/x-xz"). It is intended that this | |
65 format replace the old .lzma format used by LZMA SDK and | |
66 LZMA Utils. | |
67 | |
68 | |
69 0.1. Notices and Acknowledgements | |
70 | |
71 This file format was designed by Lasse Collin | |
72 <lasse.collin@tukaani.org> and Igor Pavlov. | |
73 | |
74 Special thanks for helping with this document go to | |
75 Ville Koskinen. Thanks for helping with this document also go to | |
76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius. | |
77 | |
78 This document has been put into the public domain. | |
79 | |
80 | |
81 0.2. Getting the Latest Version | |
82 | |
83 The latest official version of this document can be downloaded | |
84 from <https://tukaani.org/xz/xz-file-format.txt>. | |
85 | |
86 Specific versions of this document have a filename | |
87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number. | |
88 For example, the version 1.0.0 of this document is available | |
89 at <https://tukaani.org/xz/xz-file-format-1.0.0.txt>. | |
90 | |
91 | |
92 0.3. Version History | |
93 | |
94 Version Date Description | |
95 | |
96 1.2.1 2024-04-08 The URLs of this specification and | |
97 XZ Utils were changed back to the | |
98 original ones in Sections 0.2 and 7. | |
99 | |
100 1.2.0 2024-01-19 Added RISC-V filter and updated URLs in | |
101 Sections 0.2 and 7. The URL of this | |
102 specification was changed. | |
103 | |
104 1.1.0 2022-12-11 Added ARM64 filter and clarified 32-bit | |
105 ARM endianness in Section 5.3.2, | |
106 language improvements in Section 5.4 | |
107 | |
108 1.0.4 2009-08-27 Language improvements in Sections 1.2, | |
109 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1 | |
110 | |
111 1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4 | |
112 | |
113 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1 | |
114 | |
115 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor | |
116 clarifications to Sections 2, 2.2, | |
117 3.3, 4.4, and 5.3.2 | |
118 | |
119 1.0.0 2009-01-14 The first official version | |
120 | |
121 | |
122 1. Conventions | |
123 | |
124 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", | |
125 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |
126 document are to be interpreted as described in [RFC-2119]. | |
127 | |
128 Indicating a warning means displaying a message, returning | |
129 appropriate exit status, or doing something else to let the | |
130 user know that something worth a warning occurred. The operation | |
131 SHOULD still finish if a warning is indicated. | |
132 | |
133 Indicating an error means displaying a message, returning | |
134 appropriate exit status, or doing something else to let the | |
135 user know that something prevented successfully finishing the | |
136 operation. The operation MUST be aborted once an error has | |
137 been indicated. | |
138 | |
139 | |
140 1.1. Byte and Its Representation | |
141 | |
142 In this document, a byte is always 8 bits. | |
143 | |
144 A "null byte" has all bits unset. That is, the value of a null | |
145 byte is 0x00. | |
146 | |
147 To represent byte blocks, this document uses notation that | |
148 is similar to the notation used in [RFC-1952]: | |
149 | |
150 +-------+ | |
151 | Foo | One byte. | |
152 +-------+ | |
153 | |
154 +---+---+ | |
155 | Foo | Two bytes; that is, some of the vertical bars | |
156 +---+---+ can be missing. | |
157 | |
158 +=======+ | |
159 | Foo | Zero or more bytes. | |
160 +=======+ | |
161 | |
162 In this document, a boxed byte or a byte sequence declared | |
163 using this notation is called "a field". The example field | |
164 above would be called "the Foo field" or plain "Foo". | |
165 | |
166 If there are many fields, they may be split across multiple lines. | |
167 This is indicated with an arrow ("--->"): | |
168 | |
169 +=====+ | |
170 | Foo | | |
171 +=====+ | |
172 | |
173 +=====+ | |
174 ---> | Bar | | |
175 +=====+ | |
176 | |
177 The above is equivalent to this: | |
178 | |
179 +=====+=====+ | |
180 | Foo | Bar | | |
181 +=====+=====+ | |
182 | |
183 | |
184 1.2. Multibyte Integers | |
185 | |
186 Multibyte integers of static length, such as CRC values, | |
187 are stored in little endian byte order (least significant | |
188 byte first). | |
189 | |
190 When smaller values are more likely than bigger values (for | |
191 example file sizes), multibyte integers are encoded in a | |
192 variable-length representation: | |
193 - Numbers in the range [0, 127] are copied as is, and take | |
194 one byte of space. | |
195 - Bigger numbers will occupy two or more bytes. All but the | |
196 last byte of the multibyte representation have the highest | |
197 (eighth) bit set. | |
198 | |
199 For now, the value of the variable-length integers is limited | |
200 to 63 bits, which limits the encoded size of the integer to | |
201 nine bytes. These limits may be increased in the future if | |
202 needed. | |
203 | |
204 The following C code illustrates encoding and decoding of | |
205 variable-length integers. The functions return the number of | |
206 bytes occupied by the integer (1-9), or zero on error. | |
207 | |
208 #include <stddef.h> | |
209 #include <inttypes.h> | |
210 | |
211 size_t | |
212 encode(uint8_t buf[static 9], uint64_t num) | |
213 { | |
214 if (num > UINT64_MAX / 2) | |
215 return 0; | |
216 | |
217 size_t i = 0; | |
218 | |
219 while (num >= 0x80) { | |
220 buf[i++] = (uint8_t)(num) | 0x80; | |
221 num >>= 7; | |
222 } | |
223 | |
224 buf[i++] = (uint8_t)(num); | |
225 | |
226 return i; | |
227 } | |
228 | |
229 size_t | |
230 decode(const uint8_t buf[], size_t size_max, uint64_t *num) | |
231 { | |
232 if (size_max == 0) | |
233 return 0; | |
234 | |
235 if (size_max > 9) | |
236 size_max = 9; | |
237 | |
238 *num = buf[0] & 0x7F; | |
239 size_t i = 0; | |
240 | |
241 while (buf[i++] & 0x80) { | |
242 if (i >= size_max || buf[i] == 0x00) | |
243 return 0; | |
244 | |
245 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7); | |
246 } | |
247 | |
248 return i; | |
249 } | |
250 | |
251 | |
252 2. Overall Structure of .xz File | |
253 | |
254 A standalone .xz file consists of one or more Streams which may | |
255 have Stream Padding between or after them: | |
256 | |
257 +========+================+========+================+ | |
258 | Stream | Stream Padding | Stream | Stream Padding | ... | |
259 +========+================+========+================+ | |
260 | |
261 The sizes of Stream and Stream Padding are always multiples | |
262 of four bytes, thus the size of every valid .xz file MUST be | |
263 a multiple of four bytes. | |
264 | |
265 While a typical file contains only one Stream and no Stream | |
266 Padding, a decoder handling standalone .xz files SHOULD support | |
267 files that have more than one Stream or Stream Padding. | |
268 | |
269 In contrast to standalone .xz files, when the .xz file format | |
270 is used as an internal part of some other file format or | |
271 communication protocol, it usually is expected that the decoder | |
272 stops after the first Stream, and doesn't look for Stream | |
273 Padding or possibly other Streams. | |
274 | |
275 | |
276 2.1. Stream | |
277 | |
278 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ | |
279 | Stream Header | Block | Block | ... | Block | | |
280 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ | |
281 | |
282 +=======+-+-+-+-+-+-+-+-+-+-+-+-+ | |
283 ---> | Index | Stream Footer | | |
284 +=======+-+-+-+-+-+-+-+-+-+-+-+-+ | |
285 | |
286 All the above fields have a size that is a multiple of four. If | |
287 Stream is used as an internal part of another file format, it | |
288 is RECOMMENDED to make the Stream start at an offset that is | |
289 a multiple of four bytes. | |
290 | |
291 Stream Header, Index, and Stream Footer are always present in | |
292 a Stream. The maximum size of the Index field is 16 GiB (2^34). | |
293 | |
294 There are zero or more Blocks. The maximum number of Blocks is | |
295 limited only by the maximum size of the Index field. | |
296 | |
297 Total size of a Stream MUST be less than 8 EiB (2^63 bytes). | |
298 The same limit applies to the total amount of uncompressed | |
299 data stored in a Stream. | |
300 | |
301 If an implementation supports handling .xz files with multiple | |
302 concatenated Streams, it MAY apply the above limits to the file | |
303 as a whole instead of applying them on a per-Stream basis. | |
304 | |
305 | |
306 2.1.1. Stream Header | |
307 | |
308 +---+---+---+---+---+---+-------+------+--+--+--+--+ | |
309 | Header Magic Bytes | Stream Flags | CRC32 | | |
310 +---+---+---+---+---+---+-------+------+--+--+--+--+ | |
311 | |
312 | |
313 2.1.1.1. Header Magic Bytes | |
314 | |
315 The first six (6) bytes of the Stream are the so-called Header | |
316 Magic Bytes. They can be used to identify the file type. | |
317 | |
318 Using a C array and ASCII: | |
319 const uint8_t HEADER_MAGIC[6] | |
320 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 }; | |
321 | |
322 In plain hexadecimal: | |
323 FD 37 7A 58 5A 00 | |
324 | |
325 Notes: | |
326 - The first byte (0xFD) was chosen so that the files cannot | |
327 be erroneously detected as being in .lzma format, in which | |
328 the first byte is in the range [0x00, 0xE0]. | |
329 - The sixth byte (0x00) was chosen to prevent applications | |
330 from misdetecting the file as a text file. | |
331 | |
332 If the Header Magic Bytes don't match, the decoder MUST | |
333 indicate an error. | |
334 | |
335 | |
336 2.1.1.2. Stream Flags | |
337 | |
338 The first byte of Stream Flags is always a null byte. In the | |
339 future, this byte may be used to indicate a new Stream version | |
340 or other Stream properties. | |
341 | |
342 The second byte of Stream Flags is a bit field: | |
343 | |
344 Bit(s) Mask Description | |
345 0-3 0x0F Type of Check (see Section 3.4): | |
346 ID Size Check name | |
347 0x00 0 bytes None | |
348 0x01 4 bytes CRC32 | |
349 0x02 4 bytes (Reserved) | |
350 0x03 4 bytes (Reserved) | |
351 0x04 8 bytes CRC64 | |
352 0x05 8 bytes (Reserved) | |
353 0x06 8 bytes (Reserved) | |
354 0x07 16 bytes (Reserved) | |
355 0x08 16 bytes (Reserved) | |
356 0x09 16 bytes (Reserved) | |
357 0x0A 32 bytes SHA-256 | |
358 0x0B 32 bytes (Reserved) | |
359 0x0C 32 bytes (Reserved) | |
360 0x0D 64 bytes (Reserved) | |
361 0x0E 64 bytes (Reserved) | |
362 0x0F 64 bytes (Reserved) | |
363 4-7 0xF0 Reserved for future use; MUST be zero for now. | |
364 | |
365 Implementations SHOULD support at least the Check IDs 0x00 | |
366 (None) and 0x01 (CRC32). Supporting other Check IDs is | |
367 OPTIONAL. If an unsupported Check is used, the decoder SHOULD | |
368 indicate a warning or error. | |
369 | |
370 If any reserved bit is set, the decoder MUST indicate an error. | |
371 It is possible that there is a new field present which the | |
372 decoder is not aware of, and can thus parse the Stream Header | |
373 incorrectly. | |
374 | |
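The Stream Flags table above can be turned into a small lookup. The following is only a minimal sketch: the helper name stream_flags_check_size is hypothetical, and treating a non-null first byte as an error is an assumption based on the statement that this byte is always a null byte in the current format.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the Check field for each Check ID (0x00-0x0F),
 * copied from the table in Section 2.1.1.2. */
static const uint8_t check_sizes[16] = {
	0, 4, 4, 4, 8, 8, 8, 16, 16, 16, 32, 32, 32, 64, 64, 64
};

/* Hypothetical helper: stores the size of the Check field in
 * *check_size and returns 0, or returns -1 if the two-byte
 * Stream Flags field is not valid for this version of the format. */
int
stream_flags_check_size(const uint8_t flags[2], size_t *check_size)
{
	/* Assumption: a non-null first byte is rejected. */
	if (flags[0] != 0x00)
		return -1;

	/* Bits 4-7 of the second byte are reserved and MUST be zero. */
	if (flags[1] & 0xF0)
		return -1;

	*check_size = check_sizes[flags[1] & 0x0F];
	return 0;
}
```

Because the size is defined for every Check ID, including the reserved ones, a decoder that does not support a particular Check can still skip over the Check field of each Block.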
375 | |
376 2.1.1.3. CRC32 | |
377 | |
378 The CRC32 is calculated from the Stream Flags field. It is | |
379 stored as an unsigned 32-bit little endian integer. If the | |
380 calculated value does not match the stored one, the decoder | |
381 MUST indicate an error. | |
382 | |
383 The idea is that Stream Flags would always be two bytes, even | |
384 if new features are needed. This way old decoders will be able | |
385 to verify the CRC32 calculated from Stream Flags, and thus | |
386 distinguish between corrupt files (CRC32 doesn't match) and | |
387 files that the decoder doesn't support (CRC32 matches but | |
388 Stream Flags has reserved bits set). | |
389 | |
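The checks described in Sections 2.1.1.1 to 2.1.1.3 could be combined roughly as follows. This is an illustrative sketch, not a reference implementation: the names crc32_simple and stream_header_verify and the return codes are hypothetical, and the bitwise CRC32 assumes the common reflected polynomial 0xEDB88320 (see Section 6 for the normative definition).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC32 using the reflected polynomial 0xEDB88320.
 * Slow but short; a table-driven version is usual in practice. */
static uint32_t
crc32_simple(const uint8_t *buf, size_t size)
{
	uint32_t crc = UINT32_C(0xFFFFFFFF);

	for (size_t i = 0; i < size; ++i) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; ++bit) {
			if (crc & 1)
				crc = (crc >> 1) ^ UINT32_C(0xEDB88320);
			else
				crc >>= 1;
		}
	}

	return ~crc;
}

/* Hypothetical helper for the 12-byte Stream Header. Returns
 *    0  header is valid and supported
 *   -1  Header Magic Bytes do not match
 *   -2  CRC32 mismatch (corrupt file)
 *   -3  CRC32 matches but Stream Flags are not supported */
int
stream_header_verify(const uint8_t hdr[12])
{
	static const uint8_t magic[6]
			= { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

	if (memcmp(hdr, magic, sizeof(magic)) != 0)
		return -1;

	/* The stored CRC32 is little endian. */
	uint32_t stored = (uint32_t)hdr[8]
			| ((uint32_t)hdr[9] << 8)
			| ((uint32_t)hdr[10] << 16)
			| ((uint32_t)hdr[11] << 24);

	/* The CRC32 covers only the two Stream Flags bytes. */
	if (crc32_simple(hdr + 6, 2) != stored)
		return -2;

	if (hdr[6] != 0x00 || (hdr[7] & 0xF0))
		return -3;

	return 0;
}
```

Checking the CRC32 before inspecting the reserved bits is what lets the sketch tell a corrupt header (-2) apart from an unsupported one (-3), as the rationale above describes.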
390 | |
391 2.1.2. Stream Footer | |
392 | |
393 +-+-+-+-+---+---+---+---+-------+------+----------+---------+ | |
394 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes | | |
395 +-+-+-+-+---+---+---+---+-------+------+----------+---------+ | |
396 | |
397 | |
398 2.1.2.1. CRC32 | |
399 | |
400 The CRC32 is calculated from the Backward Size and Stream Flags | |
401 fields. It is stored as an unsigned 32-bit little endian | |
402 integer. If the calculated value does not match the stored one, | |
403 the decoder MUST indicate an error. | |
404 | |
405 The reason to have the CRC32 field before the Backward Size and | |
406 Stream Flags fields is to keep the four-byte fields aligned to | |
407 a multiple of four bytes. | |
408 | |
409 | |
410 2.1.2.2. Backward Size | |
411 | |
412 Backward Size is stored as a 32-bit little endian integer, | |
413 which indicates the size of the Index field as a multiple of | |
414 four bytes, minimum value being four bytes: | |
415 | |
416 real_backward_size = (stored_backward_size + 1) * 4; | |
417 | |
418 If the stored value does not match the real size of the Index | |
419 field, the decoder MUST indicate an error. | |
420 | |
421 Using a fixed-size integer to store Backward Size makes | |
422 it slightly simpler to parse the Stream Footer when the | |
423 application needs to parse the Stream backwards. | |
424 | |
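As a minimal sketch of the formula above, the following reads the four little endian bytes of the Backward Size field and returns the real size of the Index in bytes. The function name backward_size_decode is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: converts the stored Backward Size (four
 * little endian bytes) into the real size of the Index in bytes. */
uint64_t
backward_size_decode(const uint8_t field[4])
{
	uint32_t stored = (uint32_t)field[0]
			| ((uint32_t)field[1] << 8)
			| ((uint32_t)field[2] << 16)
			| ((uint32_t)field[3] << 24);

	/* Widen before adding one so that 0xFFFFFFFF does not wrap. */
	return ((uint64_t)stored + 1) * 4;
}
```

The largest stored value, 0xFFFFFFFF, yields 2^34 bytes, which agrees with the 16 GiB limit on the Index given in Section 2.1.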
425 | |
426 2.1.2.3. Stream Flags | |
427 | |
428 This is a copy of the Stream Flags field from the Stream | |
429 Header. The information stored in Stream Flags is needed | |
430 when parsing the Stream backwards. The decoder MUST compare | |
431 the Stream Flags fields in both Stream Header and Stream | |
432 Footer, and indicate an error if they are not identical. | |
433 | |
434 | |
435 2.1.2.4. Footer Magic Bytes | |
436 | |
437 As the last step of the decoding process, the decoder MUST | |
438 verify the existence of Footer Magic Bytes. If they don't | |
439 match, an error MUST be indicated. | |
440 | |
441 Using a C array and ASCII: | |
442 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' }; | |
443 | |
444 In hexadecimal: | |
445 59 5A | |
446 | |
447 The primary reason to have Footer Magic Bytes is to make | |
448 it easier to detect incomplete files quickly, without | |
449 uncompressing. If the file does not end with Footer Magic Bytes | |
450 (excluding Stream Padding described in Section 2.2), it cannot | |
451 be undamaged, unless someone has intentionally appended garbage | |
452 after the end of the Stream. | |
453 | |
454 | |
455 2.2. Stream Padding | |
456 | |
457 Only the decoders that support decoding of concatenated Streams | |
458 MUST support Stream Padding. | |
459 | |
460 Stream Padding MUST contain only null bytes. To preserve the | |
461 four-byte alignment of consecutive Streams, the size of Stream | |
462 Padding MUST be a multiple of four bytes. Empty Stream Padding | |
463 is allowed. If these requirements are not met, the decoder MUST | |
464 indicate an error. | |
465 | |
466 Note that non-empty Stream Padding is allowed at the end of the | |
467 file; there doesn't need to be a new Stream after non-empty | |
468 Stream Padding. This can be convenient in certain situations | |
469 [GNU-tar]. | |
470 | |
471 The possibility of Stream Padding MUST be taken into account | |
472 when designing an application that parses Streams backwards | |
473 and supports concatenated Streams. | |
474 | |
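The two requirements on Stream Padding can be checked with a few lines of C. This is a sketch under the assumption that the caller has already located the padding bytes; the name stream_padding_verify is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 0 if buf[0..size-1] is valid
 * Stream Padding, or -1 if the decoder should indicate an error. */
int
stream_padding_verify(const uint8_t *buf, size_t size)
{
	/* The size MUST be a multiple of four bytes (zero is allowed). */
	if (size % 4 != 0)
		return -1;

	/* Stream Padding MUST contain only null bytes. */
	for (size_t i = 0; i < size; ++i)
		if (buf[i] != 0x00)
			return -1;

	return 0;
}
```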
475 | |
476 3. Block | |
477 | |
478 +==============+=================+===============+=======+ | |
479 | Block Header | Compressed Data | Block Padding | Check | | |
480 +==============+=================+===============+=======+ | |
481 | |
482 | |
483 3.1. Block Header | |
484 | |
485 +-------------------+-------------+=================+ | |
486 | Block Header Size | Block Flags | Compressed Size | | |
487 +-------------------+-------------+=================+ | |
488 | |
489 +===================+======================+ | |
490 ---> | Uncompressed Size | List of Filter Flags | | |
491 +===================+======================+ | |
492 | |
493 +================+--+--+--+--+ | |
494 ---> | Header Padding | CRC32 | | |
495 +================+--+--+--+--+ | |
496 | |
497 | |
498 3.1.1. Block Header Size | |
499 | |
500 This field overlaps with the Index Indicator field (see | |
501 Section 4.1). | |
502 | |
503 This field contains the size of the Block Header field, | |
504 including the Block Header Size field itself. Valid values are | |
505 in the range [0x01, 0xFF], which indicate the size of the Block | |
506 Header as multiples of four bytes, minimum size being eight | |
507 bytes: | |
508 | |
509 real_header_size = (encoded_header_size + 1) * 4; | |
510 | |
511 If a Block Header bigger than 1024 bytes is needed in the | |
512 future, a new field can be added between the Block Header and | |
513 Compressed Data fields. The presence of this new field would | |
514 be indicated in the Block Header field. | |
515 | |
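The size calculation above can be sketched as follows. The name block_header_size_decode is hypothetical, and returning zero for the value 0x00 is a convention chosen for this example to signal that the byte is the Index Indicator rather than a Block Header Size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the real size of the Block Header
 * in bytes, or 0 if the byte is 0x00 (Index Indicator). */
size_t
block_header_size_decode(uint8_t encoded_header_size)
{
	if (encoded_header_size == 0x00)
		return 0;

	return ((size_t)encoded_header_size + 1) * 4;
}
```

The valid range [0x01, 0xFF] therefore maps to real sizes from 8 to 1024 bytes.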
516 | |
517 3.1.2. Block Flags | |
518 | |
519 The Block Flags field is a bit field: | |
520 | |
521 Bit(s) Mask Description | |
522 0-1 0x03 Number of filters (1-4) | |
523 2-5 0x3C Reserved for future use; MUST be zero for now. | |
524 6 0x40 The Compressed Size field is present. | |
525 7 0x80 The Uncompressed Size field is present. | |
526 | |
527 If any reserved bit is set, the decoder MUST indicate an error. | |
528 It is possible that there is a new field present which the | |
529 decoder is not aware of, and can thus parse the Block Header | |
530 incorrectly. | |
531 | |
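Decoding the Block Flags bit field might look like the sketch below. The struct and function names are hypothetical, and mapping the two-bit value 0-3 to a filter count of 1-4 by adding one is an assumption that follows from the "(1-4)" range in the table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded Block Flags. */
struct block_flags {
	unsigned filter_count;        /* 1-4 */
	bool has_compressed_size;     /* bit 6 */
	bool has_uncompressed_size;   /* bit 7 */
};

/* Hypothetical helper: returns 0 on success, or -1 if any
 * reserved bit (mask 0x3C) is set. */
int
block_flags_decode(uint8_t flags, struct block_flags *out)
{
	if (flags & 0x3C)
		return -1;

	/* Assumption: the stored value is the filter count minus one. */
	out->filter_count = (flags & 0x03) + 1u;
	out->has_compressed_size = (flags & 0x40) != 0;
	out->has_uncompressed_size = (flags & 0x80) != 0;
	return 0;
}
```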
532 | |
533 3.1.3. Compressed Size | |
534 | |
535 This field is present only if the appropriate bit is set in | |
536 the Block Flags field (see Section 3.1.2). | |
537 | |
538 The Compressed Size field contains the size of the Compressed | |
539 Data field, which MUST be non-zero. Compressed Size is stored | |
540 using the encoding described in Section 1.2. If the Compressed | |
541 Size doesn't match the size of the Compressed Data field, the | |
542 decoder MUST indicate an error. | |
543 | |
544 | |
545 3.1.4. Uncompressed Size | |
546 | |
547 This field is present only if the appropriate bit is set in | |
548 the Block Flags field (see Section 3.1.2). | |
549 | |
550 The Uncompressed Size field contains the size of the Block | |
551 after uncompressing. Uncompressed Size is stored using the | |
552 encoding described in Section 1.2. If the Uncompressed Size | |
553 does not match the real uncompressed size, the decoder MUST | |
554 indicate an error. | |
555 | |
556 Storing the Compressed Size and Uncompressed Size fields serves | |
557 several purposes: | |
558 - The decoder knows how much memory it needs to allocate | |
559 for a temporary buffer in multithreaded mode. | |
560 - Simple error detection: wrong size indicates a broken file. | |
561 - Seeking forwards to a specific location in streamed mode. | |
562 | |
563 It should be noted that the only reliable way to determine | |
564 the real uncompressed size is to uncompress the Block, | |
565 because the Block Header and Index fields may contain | |
566 (intentionally or unintentionally) invalid information. | |
567 | |
568 | |
569 3.1.5. List of Filter Flags | |
570 | |
571 +================+================+ +================+ | |
572 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags | | |
573 +================+================+ +================+ | |
574 | |
575 The number of Filter Flags fields is stored in the Block Flags | |
576 field (see Section 3.1.2). | |
577 | |
578 The format of each Filter Flags field is as follows: | |
579 | |
580 +===========+====================+===================+ | |
581 | Filter ID | Size of Properties | Filter Properties | | |
582 +===========+====================+===================+ | |
583 | |
584 Both Filter ID and Size of Properties are stored using the | |
585 encoding described in Section 1.2. Size of Properties indicates | |
586 the size of the Filter Properties field as bytes. The list of | |
587 officially defined Filter IDs and the formats of their Filter | |
588 Properties are described in Section 5.3. | |
589 | |
590 Filter IDs greater than or equal to 0x4000_0000_0000_0000 | |
591 (2^62) are reserved for implementation-specific internal use. | |
592 These Filter IDs MUST never be used in List of Filter Flags. | |
593 | |
594 | |
595 3.1.6. Header Padding | |
596 | |
597 This field contains as many null bytes as are needed to make | |
598 the Block Header have the size specified in Block Header Size. | |
599 If any of the bytes are not null bytes, the decoder MUST | |
600 indicate an error. It is possible that there is a new field | |
601 present which the decoder is not aware of, and can thus parse | |
602 the Block Header incorrectly. | |
603 | |
604 | |
605 3.1.7. CRC32 | |
606 | |
607 The CRC32 is calculated over everything in the Block Header | |
608 field except the CRC32 field itself. It is stored as an | |
609 unsigned 32-bit little endian integer. If the calculated | |
610 value does not match the stored one, the decoder MUST indicate | |
611 an error. | |
612 | |
613 Verifying the CRC32 of the Block Header before parsing the | |
614 actual contents allows the decoder to distinguish between | |
615 corrupt and unsupported files. | |
616 | |
617 | |
618 3.2. Compressed Data | |
619 | |
620 The format of Compressed Data depends on Block Flags and List | |
621 of Filter Flags. Excluding the descriptions of the simplest | |
622 filters in Section 5.3, the format of the filter-specific | |
623 encoded data is out of scope of this document. | |
624 | |
625 | |
626 3.3. Block Padding | |
627 | |
628 Block Padding MUST contain 0-3 null bytes to make the size of | |
629 the Block a multiple of four bytes. This can be needed when | |
630 the size of Compressed Data is not a multiple of four. If any | |
631 of the bytes in Block Padding are not null bytes, the decoder | |
632 MUST indicate an error. | |
633 | |
634 | |
635 3.4. Check | |
636 | |
637 The type and size of the Check field depend on which bits | |
638 are set in the Stream Flags field (see Section 2.1.1.2). | |
639 | |
640 The Check, when used, is calculated from the original | |
641 uncompressed data. If the calculated Check does not match the | |
642 stored one, the decoder MUST indicate an error. If the selected | |
643 type of Check is not supported by the decoder, it SHOULD | |
644 indicate a warning or error. | |
645 | |
646 | |
647 4. Index | |
648 | |
649 +-----------------+===================+ | |
650 | Index Indicator | Number of Records | | |
651 +-----------------+===================+ | |
652 | |
653 +=================+===============+-+-+-+-+ | |
654 ---> | List of Records | Index Padding | CRC32 | | |
655 +=================+===============+-+-+-+-+ | |
656 | |
657 Index serves several purposes. Using it, one can | |
658 - verify that all Blocks in a Stream have been processed; | |
659 - find out the uncompressed size of a Stream; and | |
660 - quickly access the beginning of any Block (random access). | |
661 | |
662 | |
663 4.1. Index Indicator | |
664 | |
665 This field overlaps with the Block Header Size field (see | |
666 Section 3.1.1). The value of Index Indicator is always 0x00. | |
667 | |
668 | |
669 4.2. Number of Records | |
670 | |
671 This field indicates how many Records there are in the List | |
672 of Records field, and thus how many Blocks there are in the | |
673 Stream. The value is stored using the encoding described in | |
674 Section 1.2. If the decoder has decoded all the Blocks of the | |
675 Stream, and then notices that the Number of Records doesn't | |
676 match the real number of Blocks, the decoder MUST indicate an | |
677 error. | |
678 | |
679 | |
680 4.3. List of Records | |
681 | |
682 List of Records consists of as many Records as indicated by the | |
683 Number of Records field: | |
684 | |
685 +========+========+ | |
686 | Record | Record | ... | |
687 +========+========+ | |
688 | |
689 Each Record contains information about one Block: | |
690 | |
691 +===============+===================+ | |
692 | Unpadded Size | Uncompressed Size | | |
693 +===============+===================+ | |
694 | |
695 If the decoder has decoded all the Blocks of the Stream, it | |
696 MUST verify that the contents of the Records match the real | |
697 Unpadded Size and Uncompressed Size of the respective Blocks. | |
698 | |
699 Implementation hint: It is possible to verify the Index with | |
700 constant memory usage by calculating for example SHA-256 of | |
701 both the real size values and the List of Records, then | |
702 comparing the hash values. Implementing this using | |
703 non-cryptographic hash like CRC32 SHOULD be avoided unless | |
704 small code size is important. | |
705 | |
706 If the decoder supports random-access reading, it MUST verify | |
707 that Unpadded Size and Uncompressed Size of every completely | |
708 decoded Block match the sizes stored in the Index. If only | |
709 a partial Block is decoded, the decoder MUST verify that the | |
710 processed sizes don't exceed the sizes stored in the Index. | |
711 | |
712 | |
713 4.3.1. Unpadded Size | |
714 | |
715 This field indicates the size of the Block excluding the Block | |
716 Padding field. That is, Unpadded Size is the size of the Block | |
717 Header, Compressed Data, and Check fields. Unpadded Size is | |
718 stored using the encoding described in Section 1.2. The value | |
719 MUST never be zero; with the current structure of Blocks, the | |
720 actual minimum value for Unpadded Size is five. | |
721 | |
722 Implementation note: Because the size of the Block Padding | |
723 field is not included in Unpadded Size, calculating the total | |
724 size of a Stream or doing random-access reading requires | |
725 calculating the actual size of the Blocks by rounding Unpadded | |
726 Sizes up to the next multiple of four. | |
727 | |
728 The reason to exclude Block Padding from Unpadded Size is to | |
729 ease making a raw copy of Compressed Data without Block | |
730 Padding. This can be useful, for example, if someone wants | |
731 to convert Streams to some other file format quickly. | |
732 | |
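The rounding mentioned in the implementation note above can be sketched as follows. Both function names are hypothetical; the sketch assumes the Unpadded Size has already been decoded and is within the 63-bit limit of Section 1.2, so the addition cannot overflow.

```c
#include <stdint.h>

/* Hypothetical helper: total size of a Block in the Stream, that
 * is, Unpadded Size rounded up to the next multiple of four. */
uint64_t
block_total_size(uint64_t unpadded_size)
{
	return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Hypothetical helper: size of the Block Padding field (0-3). */
unsigned
block_padding_size(uint64_t unpadded_size)
{
	return (unsigned)((4 - unpadded_size % 4) % 4);
}
```

Summing block_total_size() over all Records gives the combined size of the Blocks, which is what a reader doing random access needs in order to locate the start of a given Block.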
733 | |
734 4.3.2. Uncompressed Size | |
735 | |
736 This field indicates the Uncompressed Size of the respective | |
737 Block as bytes. The value is stored using the encoding | |
738 described in Section 1.2. | |
739 | |
740 | |
741 4.4. Index Padding | |
742 | |
743 This field MUST contain 0-3 null bytes to pad the Index to | |
744 a multiple of four bytes. If any of the bytes are not null | |
745 bytes, the decoder MUST indicate an error. | |
746 | |
747 | |
748 4.5. CRC32 | |
749 | |
750 The CRC32 is calculated over everything in the Index field | |
751 except the CRC32 field itself. The CRC32 is stored as an | |
752 unsigned 32-bit little endian integer. If the calculated | |
753 value does not match the stored one, the decoder MUST indicate | |
754 an error. | |
755 | |
756 | |
757 5. Filter Chains | |
758 | |
759 The Block Flags field defines how many filters are used. When | |
760 more than one filter is used, the filters are chained; that is, | |
761 the output of one filter is the input of another filter. The | |
762 following figure illustrates the direction of data flow. | |
763 | |
764 v Uncompressed Data ^ | |
765 | Filter 0 | | |
766 Encoder | Filter 1 | Decoder | |
767 | Filter n | | |
768 v Compressed Data ^ | |
769 | |
770 | |
771 5.1. Alignment | |
772 | |
773 Alignment of uncompressed input data is usually the job of | |
774 the application producing the data. For example, to get the | |
775 best results, an archiver tool should make sure that all | |
776 PowerPC executable files in the archive stream start at | |
777 offsets that are multiples of four bytes. | |
778 | |
779 Some filters, for example LZMA2, can be configured to take | |
780 advantage of specified alignment of input data. Note that | |
781 taking advantage of aligned input can be beneficial also when | |
782 a filter is not the first filter in the chain. For example, | |
783 if you compress PowerPC executables, you may want to use the | |
784 PowerPC filter and chain that with the LZMA2 filter. Because | |
785 not only the input but also the output alignment of the PowerPC | |
786 filter is four bytes, it is now beneficial to set LZMA2 | |
787 settings so that the LZMA2 encoder can take advantage of its | |
788 four-byte-aligned input data. | |
789 | |
790 The output of the last filter in the chain is stored to the | |
791 Compressed Data field, which is guaranteed to be aligned | |
792 to a multiple of four bytes relative to the beginning of the | |
793 Stream. This can increase | |
794 - speed, if the filtered data is handled multiple bytes at | |
795 a time by the filter-specific encoder and decoder, | |
796 because accessing aligned data in computer memory is | |
797 usually faster; and | |
798 - compression ratio, if the output data is later compressed | |
799 with an external compression tool. | |
800 | |
801 | |
802 5.2. Security | |
803 | |
804 If filters were allowed to be chained freely, it would be | |
805 possible to create malicious files that would be very slow to | |
806 decode. Such files could be used to create denial of service | |
807 attacks. | |
808 | |
809 Slow files could occur when multiple filters are chained: | |
810 | |
811 v Compressed input data | |
812 | Filter 1 decoder (last filter) | |
813 | Filter 0 decoder (non-last filter) | |
814 v Uncompressed output data | |
815 | |
816 The decoder of the last filter in the chain produces a lot of | |
817 output from little input. Another filter in the chain takes the | |
818 output of the last filter, and produces very little output | |
819 while consuming a lot of input. As a result, a lot of data is | |
820 moved inside the filter chain, but the filter chain as a whole | |
821 gets very little work done. | |
822 | |
823 To prevent such slow files, there are restrictions on | |
824 how the filters can be chained. These restrictions MUST be | |
825 taken into account when designing new filters. | |
826 | |
827 The maximum number of filters in the chain has been limited to | |
828 four, thus there can be at most three non-last filters. | |
829 Of these three non-last filters, only two are allowed to change | |
830 the size of the data. | |
831 | |
832 Non-last filters that change the size of the data MUST | |
833 have a limit on how much the decoder can compress the data: the | |
834 decoder SHOULD produce at least n bytes of output when the | |
835 filter is given 2n bytes of input. This limit is not | |
836 absolute, but significant deviations MUST be avoided. | |
837 | |
838 The above limitations guarantee that if the last filter in the | |
839 chain produces 4n bytes of output, the chain as a whole will | |
840 produce at least n bytes of output. | |
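
The chaining restrictions above can be sketched as a small validity
check. This is only an illustration: struct filter_info and
chain_is_valid() are hypothetical names, not part of the file format
or of any library, and rejecting an empty chain is an assumption
made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-filter description, mirroring the property
 * lists given for each filter in Section 5.3. */
struct filter_info {
    bool changes_size;    /* "Changes size of data" */
    bool allow_non_last;  /* "Allow as a non-last filter" */
    bool allow_last;      /* "Allow as the last filter" */
};

/* Returns true if the chain obeys the restrictions described
 * above: at most four filters, and at most two size-changing
 * filters among the non-last ones. chain[count - 1] is the
 * last filter in the chain. */
bool
chain_is_valid(const struct filter_info *chain, size_t count)
{
    assert(chain != NULL || count == 0);

    /* Assumption: an empty chain is treated as invalid. */
    if (count < 1 || count > 4)
        return false;

    size_t size_changing_non_last = 0;
    for (size_t i = 0; i + 1 < count; ++i) {
        if (!chain[i].allow_non_last)
            return false;
        if (chain[i].changes_size)
            ++size_changing_non_last;
    }

    if (!chain[count - 1].allow_last)
        return false;

    return size_changing_non_last <= 2;
}
```

With two size-changing non-last filters that each at most halve
the data, 4n bytes from the last filter become at least 2n and then
at least n bytes, which is where the factor of four comes from.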
841 | |
842 | |
843 5.3. Filters | |
844 | |
845 5.3.1. LZMA2 | |
846 | |
847 LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose | |
848 compression algorithm with high compression ratio and fast | |
849 decompression. LZMA is based on LZ77 and range coding | |
850 algorithms. | |
851 | |
852 LZMA2 is an extension on top of the original LZMA. LZMA2 uses | |
853 LZMA internally, but adds support for flushing the encoder and | |
854 for uncompressed chunks, eases stateful decoder implementations, | |
855 and improves support for multithreading. Thus, the plain LZMA | |
856 will not be supported in this file format. | |
857 | |
858 Filter ID: 0x21 | |
859 Size of Filter Properties: 1 byte | |
860 Changes size of data: Yes | |
861 Allow as a non-last filter: No | |
862 Allow as the last filter: Yes | |
863 | |
864 Preferred alignment: | |
865 Input data: Adjustable to 1/2/4/8/16 byte(s) | |
866 Output data: 1 byte | |
867 | |
868 The format of the one-byte Filter Properties field is as | |
869 follows: | |
870 | |
871 Bits Mask Description | |
872 0-5 0x3F Dictionary Size | |
873 6-7 0xC0 Reserved for future use; MUST be zero for now. | |
874 | |
875 Dictionary Size is encoded with one-bit mantissa and five-bit | |
876 exponent. The smallest dictionary size is 4 KiB and the biggest | |
877 is 4 GiB minus one byte. | |
878 | |
879 Raw value Mantissa Exponent Dictionary size | |
880 0 2 11 4 KiB | |
881 1 3 11 6 KiB | |
882 2 2 12 8 KiB | |
883 3 3 12 12 KiB | |
884 4 2 13 16 KiB | |
885 5 3 13 24 KiB | |
886 6 2 14 32 KiB | |
887 ... ... ... ... | |
888 35 3 28 768 MiB | |
889 36 2 29 1024 MiB | |
890 37 3 29 1536 MiB | |
891 38 2 30 2048 MiB | |
892 39 3 30 3072 MiB | |
893 40 2 31 4096 MiB - 1 B | |
894 | |
895 Instead of having a table in the decoder, the dictionary size | |
896 can be decoded using the following C code: | |
897 | |
898 const uint8_t bits = get_dictionary_flags() & 0x3F; | |
899 if (bits > 40) | |
900 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB | |
901 | |
902 uint32_t dictionary_size; | |
903 if (bits == 40) { | |
904 dictionary_size = UINT32_MAX; | |
905 } else { | |
906 dictionary_size = 2 | (bits & 1); | |
907 dictionary_size <<= bits / 2 + 11; | |
908 } | |
909 | |
910 | |
911 5.3.2. Branch/Call/Jump Filters for Executables | |
912 | |
913 These filters convert relative branch, call, and jump | |
914 instructions to their absolute counterparts in executable | |
915 files. This conversion increases redundancy and thus | |
916 compression ratio. | |
917 | |
918 Size of Filter Properties: 0 or 4 bytes | |
919 Changes size of data: No | |
920 Allow as a non-last filter: Yes | |
921 Allow as the last filter: No | |
922 | |
923 Below is the list of filters in this category. The alignment | |
924 is the same for both input and output data. | |
925 | |
926 Filter ID Alignment Description | |
927 0x04 1 byte x86 filter (BCJ) | |
928 0x05 4 bytes PowerPC (big endian) filter | |
929 0x06 16 bytes IA64 filter | |
930 0x07 4 bytes ARM filter [1] | |
931 0x08 2 bytes ARM Thumb filter [1] | |
932 0x09 4 bytes SPARC filter | |
933 0x0A 4 bytes ARM64 filter [2] | |
934 0x0B 2 bytes RISC-V filter | |
935 | |
936 [1] These are for little endian instruction encoding. | |
937 This must not be confused with data endianness. | |
938 A processor configured for big endian data access | |
939 may still use little endian instruction encoding. | |
940 The filters don't care about the data endianness. | |
941 | |
942 [2] 4096-byte alignment gives the best results | |
943 because the address in the ADRP instruction | |
944 is a multiple of 4096 bytes. | |
945 | |
946 If the size of Filter Properties is four bytes, the Filter | |
947 Properties field contains the start offset used for address | |
948 conversions. It is stored as an unsigned 32-bit little endian | |
949 integer. The start offset MUST be a multiple of the alignment | |
950 of the filter as listed in the table above; if it isn't, the | |
951 decoder MUST indicate an error. If the size of Filter | |
952 Properties is zero, the start offset is zero. | |
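
The handling of the start offset described above could look roughly
like the sketch below. The function name bcj_start_offset() and its
parameters are hypothetical; the caller is assumed to pass the
alignment of the filter from the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parses the Filter Properties of a Branch/Call/Jump filter.
 * Returns false if the size is neither 0 nor 4 bytes, or if
 * the start offset is not a multiple of the alignment. */
bool
bcj_start_offset(const uint8_t *props, size_t props_size,
        uint32_t alignment, uint32_t *start_offset)
{
    assert(start_offset != NULL && alignment != 0);

    /* No properties: the start offset is zero. */
    if (props_size == 0) {
        *start_offset = 0;
        return true;
    }

    if (props_size != 4 || props == NULL)
        return false;

    /* Unsigned 32-bit little endian integer. */
    const uint32_t offset = (uint32_t)props[0]
            | ((uint32_t)props[1] << 8)
            | ((uint32_t)props[2] << 16)
            | ((uint32_t)props[3] << 24);

    /* A misaligned start offset MUST be reported as an error. */
    if (offset % alignment != 0)
        return false;

    *start_offset = offset;
    return true;
}
```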
953 | |
954 Setting the start offset may be useful if an executable has | |
955 multiple sections, and there are many cross-section calls. | |
956 Taking advantage of this feature usually requires usage of | |
957 the Subblock filter, whose design is not complete yet. | |
958 | |
959 | |
960 5.3.3. Delta | |
961 | |
962 The Delta filter may increase compression ratio when the value | |
963 of the next byte correlates with the value of an earlier byte | |
964 at a specified distance. | |
965 | |
966 Filter ID: 0x03 | |
967 Size of Filter Properties: 1 byte | |
968 Changes size of data: No | |
969 Allow as a non-last filter: Yes | |
970 Allow as the last filter: No | |
971 | |
972 Preferred alignment: | |
973 Input data: 1 byte | |
974 Output data: Same as the original input data | |
975 | |
976 The Properties byte indicates the delta distance, which can be | |
977 1-256 bytes backwards from the current byte: 0x00 indicates | |
978 distance of 1 byte and 0xFF distance of 256 bytes. | |
979 | |
980 | |
981 5.3.3.1. Format of the Encoded Output | |
982 | |
983 The code below illustrates both encoding and decoding with | |
984 the Delta filter. | |
985 | |
986 // Distance is in the range [1, 256]. | |
987 const unsigned int distance = get_properties_byte() + 1; | |
988 uint8_t pos = 0; | |
989 uint8_t delta[256]; | |
990 | |
991 memset(delta, 0, sizeof(delta)); | |
992 | |
993 while (1) { | |
994 const int byte = read_byte(); | |
995 if (byte == EOF) | |
996 break; | |
997 | |
998 uint8_t tmp = delta[(uint8_t)(distance + pos)]; | |
999 if (is_encoder) { | |
1000 tmp = (uint8_t)(byte) - tmp; | |
1001 delta[pos] = (uint8_t)(byte); | |
1002 } else { | |
1003 tmp = (uint8_t)(byte) + tmp; | |
1004 delta[pos] = tmp; | |
1005 } | |
1006 | |
1007 write_byte(tmp); | |
1008 --pos; | |
1009 } | |
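
When the whole input is available in one buffer, the same
transformation can be written without the 256-byte history. The
sketch below is only an illustration under that assumption; the
names delta_encode() and delta_decode() are hypothetical. Bytes
that have no earlier byte at the given distance are left unchanged,
which matches a zero-initialized history.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes buf in place. Walking backwards ensures that
 * buf[i - distance] still holds the original value. */
void
delta_encode(uint8_t *buf, size_t size, unsigned int distance)
{
    assert(buf != NULL || size == 0);
    assert(distance >= 1 && distance <= 256);

    for (size_t i = size; i-- > distance; )
        buf[i] = (uint8_t)(buf[i] - buf[i - distance]);
}

/* Decodes buf in place. Walking forwards ensures that
 * buf[i - distance] has already been decoded. */
void
delta_decode(uint8_t *buf, size_t size, unsigned int distance)
{
    assert(buf != NULL || size == 0);
    assert(distance >= 1 && distance <= 256);

    for (size_t i = distance; i < size; ++i)
        buf[i] = (uint8_t)(buf[i] + buf[i - distance]);
}
```

For example, encoding the bytes 10, 20, 30, 40 with distance 1
gives 10, 10, 10, 10, and decoding restores the original bytes.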
1010 | |
1011 | |
1012 5.4. Custom Filter IDs | |
1013 | |
1014 If a developer wants to use custom Filter IDs, there are two | |
1015 choices. The first choice is to contact Lasse Collin and ask | |
1016 him to allocate a range of IDs for the developer. | |
1017 | |
1018 The second choice is to generate a 40-bit random integer | |
1019 which the developer can use as a personal Developer ID. | |
1020 To minimize the risk of collisions, the Developer ID has to be | |
1021 a randomly generated integer, not a manually selected "hex word". | |
1022 The following command, which works on many free operating | |
1023 systems, can be used to generate a Developer ID: | |
1024 | |
1025 dd if=/dev/urandom bs=5 count=1 | hexdump | |
1026 | |
1027 The developer can then use the Developer ID to create unique | |
1028 (well, hopefully unique) Filter IDs. | |
1029 | |
1030 Bits Mask Description | |
1031 0-15 0x0000_0000_0000_FFFF Filter ID | |
1032 16-55 0x00FF_FFFF_FFFF_0000 Developer ID | |
1033 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F | |
1034 | |
1035 The resulting 63-bit integer will use 9 bytes of space when | |
1036 stored using the encoding described in Section 1.2. To get | |
1037 a shorter ID, see the beginning of this Section for how to | |
1038 request a custom ID range. | |
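
The bit layout above can be sketched in C as follows. The names
custom_filter_id() and multibyte_size() are hypothetical, and the
size calculation assumes the encoding of Section 1.2 stores seven
bits of the integer in each byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combines a 40-bit Developer ID and a 16-bit developer-chosen
 * Filter ID with the static prefix 0x3F in bits 56-62. */
uint64_t
custom_filter_id(uint64_t developer_id, uint16_t local_id)
{
    assert(developer_id <= UINT64_C(0xFFFFFFFFFF));

    return (UINT64_C(0x3F) << 56)
            | (developer_id << 16)
            | local_id;
}

/* Number of bytes needed to store value, assuming seven bits
 * of payload per byte as in Section 1.2. */
size_t
multibyte_size(uint64_t value)
{
    assert(value <= UINT64_MAX / 2);  /* at most 63 bits */

    size_t size = 1;
    while ((value >>= 7) != 0)
        ++size;

    return size;
}
```

For example, the Developer ID 0x0123456789 with Filter ID 0x0001
gives 0x3F01_2345_6789_0001, which needs nine bytes.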
1039 | |
1040 | |
1041 5.4.1. Reserved Custom Filter ID Ranges | |
1042 | |
1043 Range Description | |
1044 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility | |
1045 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility | |
1046 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility | |
1047 | |
1048 | |
1049 6. Cyclic Redundancy Checks | |
1050 | |
1051 There are several incompatible variations to calculate CRC32 | |
1052 and CRC64. For simplicity and clarity, complete examples are | |
1053 provided to calculate the checks as they are used in this file | |
1054 format. Implementations MAY use different code as long as it | |
1055 gives identical results. | |
1056 | |
1057 The program below reads data from standard input, calculates | |
1058 the CRC32 and CRC64 values, and prints the calculated values | |
1059 as big endian hexadecimal strings to standard output. | |
1060 | |
1061 #include <stddef.h> | |
1062 #include <inttypes.h> | |
1063 #include <stdio.h> | |
1064 | |
1065 uint32_t crc32_table[256]; | |
1066 uint64_t crc64_table[256]; | |
1067 | |
1068 void | |
1069 init(void) | |
1070 { | |
1071 static const uint32_t poly32 = UINT32_C(0xEDB88320); | |
1072 static const uint64_t poly64 | |
1073 = UINT64_C(0xC96C5795D7870F42); | |
1074 | |
1075 for (size_t i = 0; i < 256; ++i) { | |
1076 uint32_t crc32 = i; | |
1077 uint64_t crc64 = i; | |
1078 | |
1079 for (size_t j = 0; j < 8; ++j) { | |
1080 if (crc32 & 1) | |
1081 crc32 = (crc32 >> 1) ^ poly32; | |
1082 else | |
1083 crc32 >>= 1; | |
1084 | |
1085 if (crc64 & 1) | |
1086 crc64 = (crc64 >> 1) ^ poly64; | |
1087 else | |
1088 crc64 >>= 1; | |
1089 } | |
1090 | |
1091 crc32_table[i] = crc32; | |
1092 crc64_table[i] = crc64; | |
1093 } | |
1094 } | |
1095 | |
1096 uint32_t | |
1097 crc32(const uint8_t *buf, size_t size, uint32_t crc) | |
1098 { | |
1099 crc = ~crc; | |
1100 for (size_t i = 0; i < size; ++i) | |
1101 crc = crc32_table[buf[i] ^ (crc & 0xFF)] | |
1102 ^ (crc >> 8); | |
1103 return ~crc; | |
1104 } | |
1105 | |
1106 uint64_t | |
1107 crc64(const uint8_t *buf, size_t size, uint64_t crc) | |
1108 { | |
1109 crc = ~crc; | |
1110 for (size_t i = 0; i < size; ++i) | |
1111 crc = crc64_table[buf[i] ^ (crc & 0xFF)] | |
1112 ^ (crc >> 8); | |
1113 return ~crc; | |
1114 } | |
1115 | |
1116 int | |
1117 main() | |
1118 { | |
1119 init(); | |
1120 | |
1121 uint32_t value32 = 0; | |
1122 uint64_t value64 = 0; | |
1123 uint64_t total_size = 0; | |
1124 uint8_t buf[8192]; | |
1125 | |
1126 while (1) { | |
1127 const size_t buf_size | |
1128 = fread(buf, 1, sizeof(buf), stdin); | |
1129 if (buf_size == 0) | |
1130 break; | |
1131 | |
1132 total_size += buf_size; | |
1133 value32 = crc32(buf, buf_size, value32); | |
1134 value64 = crc64(buf, buf_size, value64); | |
1135 } | |
1136 | |
1137 printf("Bytes: %" PRIu64 "\n", total_size); | |
1138 printf("CRC-32: 0x%08" PRIX32 "\n", value32); | |
1139 printf("CRC-64: 0x%016" PRIX64 "\n", value64); | |
1140 | |
1141 return 0; | |
1142 } | |
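
Since implementations MAY use different code, the sketch below
shows a table-free, bit-by-bit variant that is meant to give the
same results as the program above. The names crc32_bitwise() and
crc64_bitwise() are hypothetical. A common sanity check is the
nine-byte ASCII string "123456789", for which the widely published
check values of these CRC variants are 0xCBF43926 (CRC32) and
0x995DC9BBDF1939FA (CRC64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC32 using the same reflected polynomial as
 * the table-based code. Pass crc = 0 for the first call. */
uint32_t
crc32_bitwise(const uint8_t *buf, size_t size, uint32_t crc)
{
    assert(buf != NULL || size == 0);

    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j) {
            if (crc & 1)
                crc = (crc >> 1) ^ UINT32_C(0xEDB88320);
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

/* Bit-by-bit CRC64 using the same reflected polynomial as
 * the table-based code. Pass crc = 0 for the first call. */
uint64_t
crc64_bitwise(const uint8_t *buf, size_t size, uint64_t crc)
{
    assert(buf != NULL || size == 0);

    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j) {
            if (crc & 1)
                crc = (crc >> 1)
                        ^ UINT64_C(0xC96C5795D7870F42);
            else
                crc >>= 1;
        }
    }
    return ~crc;
}
```

As with the table-based functions, the value returned for one
buffer can be passed as the crc argument when processing the next
buffer of the same data.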
1143 | |
1144 | |
1145 7. References | |
1146 | |
1147 LZMA SDK - The original LZMA implementation | |
1148 https://7-zip.org/sdk.html | |
1149 | |
1150 LZMA Utils - LZMA adapted to POSIX-like systems | |
1151 https://tukaani.org/lzma/ | |
1152 | |
1153 XZ Utils - The next generation of LZMA Utils | |
1154 https://tukaani.org/xz/ | |
1155 | |
1156 [RFC-1952] | |
1157 GZIP file format specification version 4.3 | |
1158 https://www.ietf.org/rfc/rfc1952.txt | |
1159 - Notation of byte boxes in section "2.1. Overall conventions" | |
1160 | |
1161 [RFC-2119] | |
1162 Key words for use in RFCs to Indicate Requirement Levels | |
1163 https://www.ietf.org/rfc/rfc2119.txt | |
1164 | |
1165 [GNU-tar] | |
1166 GNU tar 1.35 manual | |
1167 https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html | |
1168 - Node 9.4.2 "Blocking Factor", paragraph that begins | |
1169 "gzip will complain about trailing garbage" | |
1170 - Note that this URL points to the latest version of the | |
1171 manual, and may some day not contain the note which is in | |
1172 1.35. For the exact version of the manual, download GNU | |
1173 tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz | |
1174 |