jpayne@68
|
1
|
jpayne@68
|
2 The .xz File Format
|
jpayne@68
|
3 ===================
|
jpayne@68
|
4
|
jpayne@68
|
5 Version 1.2.1 (2024-04-08)
|
jpayne@68
|
6
|
jpayne@68
|
7
|
jpayne@68
|
8 0. Preface
|
jpayne@68
|
9 0.1. Notices and Acknowledgements
|
jpayne@68
|
10 0.2. Getting the Latest Version
|
jpayne@68
|
11 0.3. Version History
|
jpayne@68
|
12 1. Conventions
|
jpayne@68
|
13 1.1. Byte and Its Representation
|
jpayne@68
|
14 1.2. Multibyte Integers
|
jpayne@68
|
15 2. Overall Structure of .xz File
|
jpayne@68
|
16 2.1. Stream
|
jpayne@68
|
17 2.1.1. Stream Header
|
jpayne@68
|
18 2.1.1.1. Header Magic Bytes
|
jpayne@68
|
19 2.1.1.2. Stream Flags
|
jpayne@68
|
20 2.1.1.3. CRC32
|
jpayne@68
|
21 2.1.2. Stream Footer
|
jpayne@68
|
22 2.1.2.1. CRC32
|
jpayne@68
|
23 2.1.2.2. Backward Size
|
jpayne@68
|
24 2.1.2.3. Stream Flags
|
jpayne@68
|
25 2.1.2.4. Footer Magic Bytes
|
jpayne@68
|
26 2.2. Stream Padding
|
jpayne@68
|
27 3. Block
|
jpayne@68
|
28 3.1. Block Header
|
jpayne@68
|
29 3.1.1. Block Header Size
|
jpayne@68
|
30 3.1.2. Block Flags
|
jpayne@68
|
31 3.1.3. Compressed Size
|
jpayne@68
|
32 3.1.4. Uncompressed Size
|
jpayne@68
|
33 3.1.5. List of Filter Flags
|
jpayne@68
|
34 3.1.6. Header Padding
|
jpayne@68
|
35 3.1.7. CRC32
|
jpayne@68
|
36 3.2. Compressed Data
|
jpayne@68
|
37 3.3. Block Padding
|
jpayne@68
|
38 3.4. Check
|
jpayne@68
|
39 4. Index
|
jpayne@68
|
40 4.1. Index Indicator
|
jpayne@68
|
41 4.2. Number of Records
|
jpayne@68
|
42 4.3. List of Records
|
jpayne@68
|
43 4.3.1. Unpadded Size
|
jpayne@68
|
44 4.3.2. Uncompressed Size
|
jpayne@68
|
45 4.4. Index Padding
|
jpayne@68
|
46 4.5. CRC32
|
jpayne@68
|
47 5. Filter Chains
|
jpayne@68
|
48 5.1. Alignment
|
jpayne@68
|
49 5.2. Security
|
jpayne@68
|
50 5.3. Filters
|
jpayne@68
|
51 5.3.1. LZMA2
|
jpayne@68
|
52 5.3.2. Branch/Call/Jump Filters for Executables
|
jpayne@68
|
53 5.3.3. Delta
|
jpayne@68
|
54 5.3.3.1. Format of the Encoded Output
|
jpayne@68
|
55 5.4. Custom Filter IDs
|
jpayne@68
|
56 5.4.1. Reserved Custom Filter ID Ranges
|
jpayne@68
|
57 6. Cyclic Redundancy Checks
|
jpayne@68
|
58 7. References
|
jpayne@68
|
59
|
jpayne@68
|
60
|
jpayne@68
|
61 0. Preface
|
jpayne@68
|
62
|
jpayne@68
|
63 This document describes the .xz file format (filename suffix
|
jpayne@68
|
64 ".xz", MIME type "application/x-xz"). It is intended that this
|
jpayne@68
|
65 format replace the old .lzma format used by LZMA SDK and
|
jpayne@68
|
66 LZMA Utils.
|
jpayne@68
|
67
|
jpayne@68
|
68
|
jpayne@68
|
69 0.1. Notices and Acknowledgements
|
jpayne@68
|
70
|
jpayne@68
|
71 This file format was designed by Lasse Collin
|
jpayne@68
|
72 <lasse.collin@tukaani.org> and Igor Pavlov.
|
jpayne@68
|
73
|
jpayne@68
|
74 Special thanks for helping with this document go to
|
jpayne@68
|
75 Ville Koskinen. Thanks for helping with this document go to
|
jpayne@68
|
76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
|
jpayne@68
|
77
|
jpayne@68
|
78 This document has been put into the public domain.
|
jpayne@68
|
79
|
jpayne@68
|
80
|
jpayne@68
|
81 0.2. Getting the Latest Version
|
jpayne@68
|
82
|
jpayne@68
|
83 The latest official version of this document can be downloaded
|
jpayne@68
|
84 from <https://tukaani.org/xz/xz-file-format.txt>.
|
jpayne@68
|
85
|
jpayne@68
|
86 Specific versions of this document have a filename
|
jpayne@68
|
87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
|
jpayne@68
|
88 For example, the version 1.0.0 of this document is available
|
jpayne@68
|
89 at <https://tukaani.org/xz/xz-file-format-1.0.0.txt>.
|
jpayne@68
|
90
|
jpayne@68
|
91
|
jpayne@68
|
92 0.3. Version History
|
jpayne@68
|
93
|
jpayne@68
|
94 Version Date Description
|
jpayne@68
|
95
|
jpayne@68
|
96 1.2.1 2024-04-08 The URLs of this specification and
|
jpayne@68
|
97 XZ Utils were changed back to the
|
jpayne@68
|
98 original ones in Sections 0.2 and 7.
|
jpayne@68
|
99
|
jpayne@68
|
100 1.2.0 2024-01-19 Added RISC-V filter and updated URLs in
|
jpayne@68
|
101 Sections 0.2 and 7. The URL of this
|
jpayne@68
|
102 specification was changed.
|
jpayne@68
|
103
|
jpayne@68
|
104 1.1.0 2022-12-11 Added ARM64 filter and clarified 32-bit
|
jpayne@68
|
105 ARM endianness in Section 5.3.2,
|
jpayne@68
|
106 language improvements in Section 5.4
|
jpayne@68
|
107
|
jpayne@68
|
108 1.0.4 2009-08-27 Language improvements in Sections 1.2,
|
jpayne@68
|
109 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
|
jpayne@68
|
110
|
jpayne@68
|
111 1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4
|
jpayne@68
|
112
|
jpayne@68
|
113 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1
|
jpayne@68
|
114
|
jpayne@68
|
115 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor
|
jpayne@68
|
116 clarifications to Sections 2, 2.2,
|
jpayne@68
|
117 3.3, 4.4, and 5.3.2
|
jpayne@68
|
118
|
jpayne@68
|
119 1.0.0 2009-01-14 The first official version
|
jpayne@68
|
120
|
jpayne@68
|
121
|
jpayne@68
|
122 1. Conventions
|
jpayne@68
|
123
|
jpayne@68
|
124 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
|
jpayne@68
|
125 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
|
jpayne@68
|
126 document are to be interpreted as described in [RFC-2119].
|
jpayne@68
|
127
|
jpayne@68
|
128 Indicating a warning means displaying a message, returning
|
jpayne@68
|
129 appropriate exit status, or doing something else to let the
|
jpayne@68
|
130 user know that something worth warning about occurred. The operation
|
jpayne@68
|
131 SHOULD still finish if a warning is indicated.
|
jpayne@68
|
132
|
jpayne@68
|
133 Indicating an error means displaying a message, returning
|
jpayne@68
|
134 appropriate exit status, or doing something else to let the
|
jpayne@68
|
135 user know that something prevented successfully finishing the
|
jpayne@68
|
136 operation. The operation MUST be aborted once an error has
|
jpayne@68
|
137 been indicated.
|
jpayne@68
|
138
|
jpayne@68
|
139
|
jpayne@68
|
140 1.1. Byte and Its Representation
|
jpayne@68
|
141
|
jpayne@68
|
142 In this document, a byte is always 8 bits.
|
jpayne@68
|
143
|
jpayne@68
|
144 A "null byte" has all bits unset. That is, the value of a null
|
jpayne@68
|
145 byte is 0x00.
|
jpayne@68
|
146
|
jpayne@68
|
147 To represent byte blocks, this document uses notation that
|
jpayne@68
|
148 is similar to the notation used in [RFC-1952]:
|
jpayne@68
|
149
|
jpayne@68
|
150 +-------+
|
jpayne@68
|
151 | Foo | One byte.
|
jpayne@68
|
152 +-------+
|
jpayne@68
|
153
|
jpayne@68
|
154 +---+---+
|
jpayne@68
|
155 | Foo | Two bytes; that is, some of the vertical bars
|
jpayne@68
|
156 +---+---+ can be missing.
|
jpayne@68
|
157
|
jpayne@68
|
158 +=======+
|
jpayne@68
|
159 | Foo | Zero or more bytes.
|
jpayne@68
|
160 +=======+
|
jpayne@68
|
161
|
jpayne@68
|
162 In this document, a boxed byte or a byte sequence declared
|
jpayne@68
|
163 using this notation is called "a field". The example field
|
jpayne@68
|
164 above would be called "the Foo field" or plain "Foo".
|
jpayne@68
|
165
|
jpayne@68
|
166 If there are many fields, they may be split across multiple lines.
|
jpayne@68
|
167 This is indicated with an arrow ("--->"):
|
jpayne@68
|
168
|
jpayne@68
|
169 +=====+
|
jpayne@68
|
170 | Foo |
|
jpayne@68
|
171 +=====+
|
jpayne@68
|
172
|
jpayne@68
|
173 +=====+
|
jpayne@68
|
174 ---> | Bar |
|
jpayne@68
|
175 +=====+
|
jpayne@68
|
176
|
jpayne@68
|
177 The above is equivalent to this:
|
jpayne@68
|
178
|
jpayne@68
|
179 +=====+=====+
|
jpayne@68
|
180 | Foo | Bar |
|
jpayne@68
|
181 +=====+=====+
|
jpayne@68
|
182
|
jpayne@68
|
183
|
jpayne@68
|
184 1.2. Multibyte Integers
|
jpayne@68
|
185
|
jpayne@68
|
186 Multibyte integers of static length, such as CRC values,
|
jpayne@68
|
187 are stored in little endian byte order (least significant
|
jpayne@68
|
188 byte first).
|
jpayne@68
|
189
|
jpayne@68
|
190 When smaller values are more likely than bigger values (for
|
jpayne@68
|
191 example file sizes), multibyte integers are encoded in a
|
jpayne@68
|
192 variable-length representation:
|
jpayne@68
|
193 - Numbers in the range [0, 127] are copied as is, and take
|
jpayne@68
|
194 one byte of space.
|
jpayne@68
|
195 - Bigger numbers will occupy two or more bytes. All but the
|
jpayne@68
|
196 last byte of the multibyte representation have the highest
|
jpayne@68
|
197 (eighth) bit set.
|
jpayne@68
|
198
|
jpayne@68
|
199 For now, the value of the variable-length integers is limited
|
jpayne@68
|
200 to 63 bits, which limits the encoded size of the integer to
|
jpayne@68
|
201 nine bytes. These limits may be increased in the future if
|
jpayne@68
|
202 needed.
|
jpayne@68
|
203
|
jpayne@68
|
204 The following C code illustrates encoding and decoding of
|
jpayne@68
|
205 variable-length integers. The functions return the number of
|
jpayne@68
|
206 bytes occupied by the integer (1-9), or zero on error.
|
jpayne@68
|
207
|
jpayne@68
|
208 #include <stddef.h>
|
jpayne@68
|
209 #include <inttypes.h>
|
jpayne@68
|
210
|
jpayne@68
|
211 size_t
|
jpayne@68
|
212 encode(uint8_t buf[static 9], uint64_t num)
|
jpayne@68
|
213 {
|
jpayne@68
|
214 if (num > UINT64_MAX / 2)
|
jpayne@68
|
215 return 0;
|
jpayne@68
|
216
|
jpayne@68
|
217 size_t i = 0;
|
jpayne@68
|
218
|
jpayne@68
|
219 while (num >= 0x80) {
|
jpayne@68
|
220 buf[i++] = (uint8_t)(num) | 0x80;
|
jpayne@68
|
221 num >>= 7;
|
jpayne@68
|
222 }
|
jpayne@68
|
223
|
jpayne@68
|
224 buf[i++] = (uint8_t)(num);
|
jpayne@68
|
225
|
jpayne@68
|
226 return i;
|
jpayne@68
|
227 }
|
jpayne@68
|
228
|
jpayne@68
|
229 size_t
|
jpayne@68
|
230 decode(const uint8_t buf[], size_t size_max, uint64_t *num)
|
jpayne@68
|
231 {
|
jpayne@68
|
232 if (size_max == 0)
|
jpayne@68
|
233 return 0;
|
jpayne@68
|
234
|
jpayne@68
|
235 if (size_max > 9)
|
jpayne@68
|
236 size_max = 9;
|
jpayne@68
|
237
|
jpayne@68
|
238 *num = buf[0] & 0x7F;
|
jpayne@68
|
239 size_t i = 0;
|
jpayne@68
|
240
|
jpayne@68
|
241 while (buf[i++] & 0x80) {
|
jpayne@68
|
242 if (i >= size_max || buf[i] == 0x00)
|
jpayne@68
|
243 return 0;
|
jpayne@68
|
244
|
jpayne@68
|
245 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
|
jpayne@68
|
246 }
|
jpayne@68
|
247
|
jpayne@68
|
248 return i;
|
jpayne@68
|
249 }
|
jpayne@68
|
250
|
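As a worked example of the variable-length encoding described above,
the value 300 is stored in two bytes, 0xAC 0x02: first the low seven
bits (0x2C) with the highest bit set, then the remaining bits (0x02).
The following sketch computes only the encoded size of a value. The
name vli_size is hypothetical and is not part of this specification.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the number of bytes (1-9) that the
 * variable-length encoding of num occupies, or zero if num does
 * not fit in 63 bits.
 */
size_t
vli_size(uint64_t num)
{
	if (num > UINT64_MAX / 2)
		return 0;

	size_t size = 1;

	/* Each additional byte carries seven more bits. */
	while (num >= 0x80) {
		num >>= 7;
		++size;
	}

	return size;
}
```

Knowing the size in advance can be handy when an encoder needs to
reserve space for a field before writing it.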
jpayne@68
|
251
|
jpayne@68
|
252 2. Overall Structure of .xz File
|
jpayne@68
|
253
|
jpayne@68
|
254 A standalone .xz file consists of one or more Streams which may
|
jpayne@68
|
255 have Stream Padding between or after them:
|
jpayne@68
|
256
|
jpayne@68
|
257 +========+================+========+================+
|
jpayne@68
|
258 | Stream | Stream Padding | Stream | Stream Padding | ...
|
jpayne@68
|
259 +========+================+========+================+
|
jpayne@68
|
260
|
jpayne@68
|
261 The sizes of Stream and Stream Padding are always multiples
|
jpayne@68
|
262 of four bytes, thus the size of every valid .xz file MUST be
|
jpayne@68
|
263 a multiple of four bytes.
|
jpayne@68
|
264
|
jpayne@68
|
265 While a typical file contains only one Stream and no Stream
|
jpayne@68
|
266 Padding, a decoder handling standalone .xz files SHOULD support
|
jpayne@68
|
267 files that have more than one Stream or Stream Padding.
|
jpayne@68
|
268
|
jpayne@68
|
269 In contrast to standalone .xz files, when the .xz file format
|
jpayne@68
|
270 is used as an internal part of some other file format or
|
jpayne@68
|
271 communication protocol, it usually is expected that the decoder
|
jpayne@68
|
272 stops after the first Stream, and doesn't look for Stream
|
jpayne@68
|
273 Padding or possibly other Streams.
|
jpayne@68
|
274
|
jpayne@68
|
275
|
jpayne@68
|
276 2.1. Stream
|
jpayne@68
|
277
|
jpayne@68
|
278 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
|
jpayne@68
|
279 | Stream Header | Block | Block | ... | Block |
|
jpayne@68
|
280 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
|
jpayne@68
|
281
|
jpayne@68
|
282 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
|
jpayne@68
|
283 ---> | Index | Stream Footer |
|
jpayne@68
|
284 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
|
jpayne@68
|
285
|
jpayne@68
|
286 All the above fields have a size that is a multiple of four. If
|
jpayne@68
|
287 Stream is used as an internal part of another file format, it
|
jpayne@68
|
288 is RECOMMENDED to make the Stream start at an offset that is
|
jpayne@68
|
289 a multiple of four bytes.
|
jpayne@68
|
290
|
jpayne@68
|
291 Stream Header, Index, and Stream Footer are always present in
|
jpayne@68
|
292 a Stream. The maximum size of the Index field is 16 GiB (2^34).
|
jpayne@68
|
293
|
jpayne@68
|
294 There are zero or more Blocks. The maximum number of Blocks is
|
jpayne@68
|
295 limited only by the maximum size of the Index field.
|
jpayne@68
|
296
|
jpayne@68
|
297 Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
|
jpayne@68
|
298 The same limit applies to the total amount of uncompressed
|
jpayne@68
|
299 data stored in a Stream.
|
jpayne@68
|
300
|
jpayne@68
|
301 If an implementation supports handling .xz files with multiple
|
jpayne@68
|
302 concatenated Streams, it MAY apply the above limits to the file
|
jpayne@68
|
303 as a whole instead of on a per-Stream basis.
|
jpayne@68
|
304
|
jpayne@68
|
305
|
jpayne@68
|
306 2.1.1. Stream Header
|
jpayne@68
|
307
|
jpayne@68
|
308 +---+---+---+---+---+---+-------+------+--+--+--+--+
|
jpayne@68
|
309 | Header Magic Bytes | Stream Flags | CRC32 |
|
jpayne@68
|
310 +---+---+---+---+---+---+-------+------+--+--+--+--+
|
jpayne@68
|
311
|
jpayne@68
|
312
|
jpayne@68
|
313 2.1.1.1. Header Magic Bytes
|
jpayne@68
|
314
|
jpayne@68
|
315 The first six (6) bytes of the Stream are the so-called Header
|
jpayne@68
|
316 Magic Bytes. They can be used to identify the file type.
|
jpayne@68
|
317
|
jpayne@68
|
318 Using a C array and ASCII:
|
jpayne@68
|
319 const uint8_t HEADER_MAGIC[6]
|
jpayne@68
|
320 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
|
jpayne@68
|
321
|
jpayne@68
|
322 In plain hexadecimal:
|
jpayne@68
|
323 FD 37 7A 58 5A 00
|
jpayne@68
|
324
|
jpayne@68
|
325 Notes:
|
jpayne@68
|
326 - The first byte (0xFD) was chosen so that the files cannot
|
jpayne@68
|
327 be erroneously detected as being in .lzma format, in which
|
jpayne@68
|
328 the first byte is in the range [0x00, 0xE0].
|
jpayne@68
|
329 - The sixth byte (0x00) was chosen to prevent applications
|
jpayne@68
|
330 from misdetecting the file as a text file.
|
jpayne@68
|
331
|
jpayne@68
|
332 If the Header Magic Bytes don't match, the decoder MUST
|
jpayne@68
|
333 indicate an error.
|
jpayne@68
|
334
|
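A minimal sketch of the magic byte comparison described above. The
function name has_header_magic is hypothetical; a real decoder would
go on to validate Stream Flags and CRC32 as well.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const uint8_t HEADER_MAGIC[6]
		= { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

/*
 * Hypothetical helper: return true if buf holds at least six bytes
 * and begins with the Header Magic Bytes.
 */
bool
has_header_magic(const uint8_t *buf, size_t size)
{
	return size >= sizeof(HEADER_MAGIC)
			&& memcmp(buf, HEADER_MAGIC,
				sizeof(HEADER_MAGIC)) == 0;
}
```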
jpayne@68
|
335
|
jpayne@68
|
336 2.1.1.2. Stream Flags
|
jpayne@68
|
337
|
jpayne@68
|
338 The first byte of Stream Flags is always a null byte. In the
|
jpayne@68
|
339 future, this byte may be used to indicate a new Stream version
|
jpayne@68
|
340 or other Stream properties.
|
jpayne@68
|
341
|
jpayne@68
|
342 The second byte of Stream Flags is a bit field:
|
jpayne@68
|
343
|
jpayne@68
|
344 Bit(s) Mask Description
|
jpayne@68
|
345 0-3 0x0F Type of Check (see Section 3.4):
|
jpayne@68
|
346 ID Size Check name
|
jpayne@68
|
347 0x00 0 bytes None
|
jpayne@68
|
348 0x01 4 bytes CRC32
|
jpayne@68
|
349 0x02 4 bytes (Reserved)
|
jpayne@68
|
350 0x03 4 bytes (Reserved)
|
jpayne@68
|
351 0x04 8 bytes CRC64
|
jpayne@68
|
352 0x05 8 bytes (Reserved)
|
jpayne@68
|
353 0x06 8 bytes (Reserved)
|
jpayne@68
|
354 0x07 16 bytes (Reserved)
|
jpayne@68
|
355 0x08 16 bytes (Reserved)
|
jpayne@68
|
356 0x09 16 bytes (Reserved)
|
jpayne@68
|
357 0x0A 32 bytes SHA-256
|
jpayne@68
|
358 0x0B 32 bytes (Reserved)
|
jpayne@68
|
359 0x0C 32 bytes (Reserved)
|
jpayne@68
|
360 0x0D 64 bytes (Reserved)
|
jpayne@68
|
361 0x0E 64 bytes (Reserved)
|
jpayne@68
|
362 0x0F 64 bytes (Reserved)
|
jpayne@68
|
363 4-7 0xF0 Reserved for future use; MUST be zero for now.
|
jpayne@68
|
364
|
jpayne@68
|
365 Implementations SHOULD support at least the Check IDs 0x00
|
jpayne@68
|
366 (None) and 0x01 (CRC32). Supporting other Check IDs is
|
jpayne@68
|
367 OPTIONAL. If an unsupported Check is used, the decoder SHOULD
|
jpayne@68
|
368 indicate a warning or error.
|
jpayne@68
|
369
|
jpayne@68
|
370 If any reserved bit is set, the decoder MUST indicate an error.
|
jpayne@68
|
371 It is possible that there is a new field present which the
|
jpayne@68
|
372 decoder is not aware of, and can thus parse the Stream Header
|
jpayne@68
|
373 incorrectly.
|
jpayne@68
|
374
|
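The Stream Flags rules above can be sketched as follows. The names
parse_stream_flags and CHECK_SIZES are hypothetical. Rejecting a
non-null first byte is an assumption of this sketch: the text above
says only that the byte is always null for now.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the Check field in bytes for each Check ID (0x00-0x0F). */
static const uint8_t CHECK_SIZES[16] = {
	0, 4, 4, 4, 8, 8, 8, 16, 16, 16, 32, 32, 32, 64, 64, 64
};

/*
 * Hypothetical helper: parse the two-byte Stream Flags field.
 * Returns 0 on success, or -1 if the first byte is not a null byte
 * (an assumption of this sketch) or if a reserved bit is set.
 */
int
parse_stream_flags(const uint8_t flags[2],
		unsigned *check_id, size_t *check_size)
{
	if (flags[0] != 0x00 || (flags[1] & 0xF0) != 0)
		return -1;

	*check_id = flags[1] & 0x0F;
	*check_size = CHECK_SIZES[*check_id];

	return 0;
}
```

Because the size of every Check ID is fixed, a decoder can skip the
Check field of a Block even if it cannot verify that type of Check.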
jpayne@68
|
375
|
jpayne@68
|
376 2.1.1.3. CRC32
|
jpayne@68
|
377
|
jpayne@68
|
378 The CRC32 is calculated from the Stream Flags field. It is
|
jpayne@68
|
379 stored as an unsigned 32-bit little endian integer. If the
|
jpayne@68
|
380 calculated value does not match the stored one, the decoder
|
jpayne@68
|
381 MUST indicate an error.
|
jpayne@68
|
382
|
jpayne@68
|
383 The idea is that Stream Flags would always be two bytes, even
|
jpayne@68
|
384 if new features are needed. This way old decoders will be able
|
jpayne@68
|
385 to verify the CRC32 calculated from Stream Flags, and thus
|
jpayne@68
|
386 distinguish between corrupt files (CRC32 doesn't match) and
|
jpayne@68
|
387 files that the decoder doesn't support (CRC32 matches but
|
jpayne@68
|
388 Stream Flags has reserved bits set).
|
jpayne@68
|
389
|
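The following sketch verifies the CRC32 of a 12-byte Stream Header.
It assumes that the CRC32 in question is the common IEEE-802.3
variant (reflected polynomial 0xEDB88320); Section 6 is the
authoritative definition. The function names are hypothetical, and
the bitwise loop favors small code size over speed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32, assuming the IEEE-802.3 polynomial (reflected). */
uint32_t
crc32_calc(const uint8_t *buf, size_t size)
{
	uint32_t crc = 0xFFFFFFFF;

	for (size_t i = 0; i < size; ++i) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; ++bit)
			crc = (crc >> 1)
				^ (0xEDB88320 & (0U - (crc & 1)));
	}

	return ~crc;
}

/*
 * Hypothetical helper: return true if the CRC32 stored in bytes
 * 8-11 of the Stream Header matches the CRC32 calculated from the
 * Stream Flags field in bytes 6-7.
 */
bool
stream_header_crc_ok(const uint8_t header[12])
{
	uint32_t stored = (uint32_t)header[8]
			| ((uint32_t)header[9] << 8)
			| ((uint32_t)header[10] << 16)
			| ((uint32_t)header[11] << 24);

	return crc32_calc(header + 6, 2) == stored;
}
```

A decoder would run this check before looking at the reserved bits,
so that a corrupt header is not misreported as an unsupported one.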
jpayne@68
|
390
|
jpayne@68
|
391 2.1.2. Stream Footer
|
jpayne@68
|
392
|
jpayne@68
|
393 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
|
jpayne@68
|
394 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
|
jpayne@68
|
395 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
|
jpayne@68
|
396
|
jpayne@68
|
397
|
jpayne@68
|
398 2.1.2.1. CRC32
|
jpayne@68
|
399
|
jpayne@68
|
400 The CRC32 is calculated from the Backward Size and Stream Flags
|
jpayne@68
|
401 fields. It is stored as an unsigned 32-bit little endian
|
jpayne@68
|
402 integer. If the calculated value does not match the stored one,
|
jpayne@68
|
403 the decoder MUST indicate an error.
|
jpayne@68
|
404
|
jpayne@68
|
405 The reason to have the CRC32 field before the Backward Size and
|
jpayne@68
|
406 Stream Flags fields is to keep the four-byte fields aligned to
|
jpayne@68
|
407 a multiple of four bytes.
|
jpayne@68
|
408
|
jpayne@68
|
409
|
jpayne@68
|
410 2.1.2.2. Backward Size
|
jpayne@68
|
411
|
jpayne@68
|
412 Backward Size is stored as a 32-bit little endian integer,
|
jpayne@68
|
413 which indicates the size of the Index field as a multiple of
|
jpayne@68
|
414 four bytes, minimum value being four bytes:
|
jpayne@68
|
415
|
jpayne@68
|
416 real_backward_size = (stored_backward_size + 1) * 4;
|
jpayne@68
|
417
|
jpayne@68
|
418 If the stored value does not match the real size of the Index
|
jpayne@68
|
419 field, the decoder MUST indicate an error.
|
jpayne@68
|
420
|
jpayne@68
|
421 Using a fixed-size integer to store Backward Size makes
|
jpayne@68
|
422 it slightly simpler to parse the Stream Footer when the
|
jpayne@68
|
423 application needs to parse the Stream backwards.
|
jpayne@68
|
424
|
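As an illustration of the formula above, the following sketch reads
Backward Size from a 12-byte Stream Footer and returns the size of
the Index field in bytes. The function names are hypothetical. The
result is widened to 64 bits because the largest stored value
(0xFFFFFFFF) corresponds to 2^34 bytes.

```c
#include <stdint.h>

/* Read an unsigned 32-bit little endian integer. */
uint32_t
read_le32(const uint8_t buf[4])
{
	return (uint32_t)buf[0]
			| ((uint32_t)buf[1] << 8)
			| ((uint32_t)buf[2] << 16)
			| ((uint32_t)buf[3] << 24);
}

/*
 * Hypothetical helper: size of the Index field in bytes. Backward
 * Size occupies bytes 4-7 of the Stream Footer, after the CRC32.
 */
uint64_t
index_size_from_footer(const uint8_t footer[12])
{
	return ((uint64_t)read_le32(footer + 4) + 1) * 4;
}
```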
jpayne@68
|
425
|
jpayne@68
|
426 2.1.2.3. Stream Flags
|
jpayne@68
|
427
|
jpayne@68
|
428 This is a copy of the Stream Flags field from the Stream
|
jpayne@68
|
429 Header. The information stored in Stream Flags is needed
|
jpayne@68
|
430 when parsing the Stream backwards. The decoder MUST compare
|
jpayne@68
|
431 the Stream Flags fields in both Stream Header and Stream
|
jpayne@68
|
432 Footer, and indicate an error if they are not identical.
|
jpayne@68
|
433
|
jpayne@68
|
434
|
jpayne@68
|
435 2.1.2.4. Footer Magic Bytes
|
jpayne@68
|
436
|
jpayne@68
|
437 As the last step of the decoding process, the decoder MUST
|
jpayne@68
|
438 verify the existence of Footer Magic Bytes. If they don't
|
jpayne@68
|
439 match, an error MUST be indicated.
|
jpayne@68
|
440
|
jpayne@68
|
441 Using a C array and ASCII:
|
jpayne@68
|
442 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
|
jpayne@68
|
443
|
jpayne@68
|
444 In hexadecimal:
|
jpayne@68
|
445 59 5A
|
jpayne@68
|
446
|
jpayne@68
|
447 The primary reason to have Footer Magic Bytes is to make
|
jpayne@68
|
448 it easier to detect incomplete files quickly, without
|
jpayne@68
|
449 uncompressing. If the file does not end with Footer Magic Bytes
|
jpayne@68
|
450 (excluding Stream Padding described in Section 2.2), it cannot
|
jpayne@68
|
451 be undamaged, unless someone has intentionally appended garbage
|
jpayne@68
|
452 after the end of the Stream.
|
jpayne@68
|
453
|
jpayne@68
|
454
|
jpayne@68
|
455 2.2. Stream Padding
|
jpayne@68
|
456
|
jpayne@68
|
457 Only the decoders that support decoding of concatenated Streams
|
jpayne@68
|
458 MUST support Stream Padding.
|
jpayne@68
|
459
|
jpayne@68
|
460 Stream Padding MUST contain only null bytes. To preserve the
|
jpayne@68
|
461 four-byte alignment of consecutive Streams, the size of Stream
|
jpayne@68
|
462 Padding MUST be a multiple of four bytes. Empty Stream Padding
|
jpayne@68
|
463 is allowed. If these requirements are not met, the decoder MUST
|
jpayne@68
|
464 indicate an error.
|
jpayne@68
|
465
|
jpayne@68
|
466 Note that non-empty Stream Padding is allowed at the end of the
|
jpayne@68
|
467 file; there doesn't need to be a new Stream after non-empty
|
jpayne@68
|
468 Stream Padding. This can be convenient in certain situations
|
jpayne@68
|
469 [GNU-tar].
|
jpayne@68
|
470
|
jpayne@68
|
471 The possibility of Stream Padding MUST be taken into account
|
jpayne@68
|
472 when designing an application that parses Streams backwards,
|
jpayne@68
|
473 and that also supports concatenated Streams.
|
jpayne@68
|
474
|
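An application that parses a file backwards could locate the end of
the last Stream with something like the following sketch. Because
Footer Magic Bytes are never null bytes, the null bytes at the end
of the file are exactly the Stream Padding. The function name is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count the null bytes at the end of buf.
 * Returns 0 and stores the count in *padding if the count is a
 * multiple of four; otherwise returns -1, which a decoder would
 * report as an error.
 */
int
trailing_stream_padding(const uint8_t *buf, size_t size,
		size_t *padding)
{
	size_t count = 0;

	while (count < size && buf[size - 1 - count] == 0x00)
		++count;

	if (count % 4 != 0)
		return -1;

	*padding = count;
	return 0;
}
```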
jpayne@68
|
475
|
jpayne@68
|
476 3. Block
|
jpayne@68
|
477
|
jpayne@68
|
478 +==============+=================+===============+=======+
|
jpayne@68
|
479 | Block Header | Compressed Data | Block Padding | Check |
|
jpayne@68
|
480 +==============+=================+===============+=======+
|
jpayne@68
|
481
|
jpayne@68
|
482
|
jpayne@68
|
483 3.1. Block Header
|
jpayne@68
|
484
|
jpayne@68
|
485 +-------------------+-------------+=================+
|
jpayne@68
|
486 | Block Header Size | Block Flags | Compressed Size |
|
jpayne@68
|
487 +-------------------+-------------+=================+
|
jpayne@68
|
488
|
jpayne@68
|
489 +===================+======================+
|
jpayne@68
|
490 ---> | Uncompressed Size | List of Filter Flags |
|
jpayne@68
|
491 +===================+======================+
|
jpayne@68
|
492
|
jpayne@68
|
493 +================+--+--+--+--+
|
jpayne@68
|
494 ---> | Header Padding | CRC32 |
|
jpayne@68
|
495 +================+--+--+--+--+
|
jpayne@68
|
496
|
jpayne@68
|
497
|
jpayne@68
|
498 3.1.1. Block Header Size
|
jpayne@68
|
499
|
jpayne@68
|
500 This field overlaps with the Index Indicator field (see
|
jpayne@68
|
501 Section 4.1).
|
jpayne@68
|
502
|
jpayne@68
|
503 This field contains the size of the Block Header field,
|
jpayne@68
|
504 including the Block Header Size field itself. Valid values are
|
jpayne@68
|
505 in the range [0x01, 0xFF], which indicate the size of the Block
|
jpayne@68
|
506 Header as multiples of four bytes, minimum size being eight
|
jpayne@68
|
507 bytes:
|
jpayne@68
|
508
|
jpayne@68
|
509 real_header_size = (encoded_header_size + 1) * 4;
|
jpayne@68
|
510
|
jpayne@68
|
511 If a Block Header bigger than 1024 bytes is needed in the
|
jpayne@68
|
512 future, a new field can be added between the Block Header and
|
jpayne@68
|
513 Compressed Data fields. The presence of this new field would
|
jpayne@68
|
514 be indicated in the Block Header field.
|
jpayne@68
|
515
|
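Since this field shares its position with Index Indicator, a decoder
reading the first byte after a Block can tell the two apart with a
check like the one sketched below. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the real Block Header size in bytes
 * (8-1024), or zero if the byte is 0x00, in which case it is the
 * Index Indicator and not a Block Header Size.
 */
unsigned
block_header_real_size(uint8_t encoded_header_size)
{
	if (encoded_header_size == 0x00)
		return 0;

	return ((unsigned)encoded_header_size + 1) * 4;
}
```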
jpayne@68
|
516
|
jpayne@68
|
517 3.1.2. Block Flags
|
jpayne@68
|
518
|
jpayne@68
|
519 The Block Flags field is a bit field:
|
jpayne@68
|
520
|
jpayne@68
|
521 Bit(s) Mask Description
|
jpayne@68
|
522 0-1 0x03 Number of filters (1-4)
|
jpayne@68
|
523 2-5 0x3C Reserved for future use; MUST be zero for now.
|
jpayne@68
|
524 6 0x40 The Compressed Size field is present.
|
jpayne@68
|
525 7 0x80 The Uncompressed Size field is present.
|
jpayne@68
|
526
|
jpayne@68
|
527 If any reserved bit is set, the decoder MUST indicate an error.
|
jpayne@68
|
528 It is possible that there is a new field present which the
|
jpayne@68
|
529 decoder is not aware of, and can thus parse the Block Header
|
jpayne@68
|
530 incorrectly.
|
jpayne@68
|
531
|
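The bit field above can be decoded as in the following sketch. The
struct and function names are hypothetical. Note that bits 0-1 store
the number of filters minus one, so the stored value 0 means one
filter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded Block Flags. */
struct block_flags {
	unsigned filter_count;          /* 1-4 */
	bool has_compressed_size;
	bool has_uncompressed_size;
};

/*
 * Hypothetical helper: decode the Block Flags byte. Returns 0 on
 * success, or -1 if any reserved bit (mask 0x3C) is set.
 */
int
parse_block_flags(uint8_t flags, struct block_flags *out)
{
	if (flags & 0x3C)
		return -1;

	out->filter_count = (flags & 0x03) + 1;
	out->has_compressed_size = (flags & 0x40) != 0;
	out->has_uncompressed_size = (flags & 0x80) != 0;

	return 0;
}
```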
jpayne@68
|
532
|
jpayne@68
|
533 3.1.3. Compressed Size
|
jpayne@68
|
534
|
jpayne@68
|
535 This field is present only if the appropriate bit is set in
|
jpayne@68
|
536 the Block Flags field (see Section 3.1.2).
|
jpayne@68
|
537
|
jpayne@68
|
538 The Compressed Size field contains the size of the Compressed
|
jpayne@68
|
539 Data field, which MUST be non-zero. Compressed Size is stored
|
jpayne@68
|
540 using the encoding described in Section 1.2. If the Compressed
|
jpayne@68
|
541 Size doesn't match the size of the Compressed Data field, the
|
jpayne@68
|
542 decoder MUST indicate an error.
|
jpayne@68
|
543
|
jpayne@68
|
544
|
jpayne@68
|
545 3.1.4. Uncompressed Size
|
jpayne@68
|
546
|
jpayne@68
|
547 This field is present only if the appropriate bit is set in
|
jpayne@68
|
548 the Block Flags field (see Section 3.1.2).
|
jpayne@68
|
549
|
jpayne@68
|
550 The Uncompressed Size field contains the size of the Block
|
jpayne@68
|
551 after uncompressing. Uncompressed Size is stored using the
|
jpayne@68
|
552 encoding described in Section 1.2. If the Uncompressed Size
|
jpayne@68
|
553 does not match the real uncompressed size, the decoder MUST
|
jpayne@68
|
554 indicate an error.
|
jpayne@68
|
555
|
jpayne@68
|
556 Storing the Compressed Size and Uncompressed Size fields serves
|
jpayne@68
|
557 several purposes:
|
jpayne@68
|
558 - The decoder knows how much memory it needs to allocate
|
jpayne@68
|
559 for a temporary buffer in multithreaded mode.
|
jpayne@68
|
560 - Simple error detection: wrong size indicates a broken file.
|
jpayne@68
|
561 - Seeking forwards to a specific location in streamed mode.
|
jpayne@68
|
562
|
jpayne@68
|
563 It should be noted that the only reliable way to determine
|
jpayne@68
|
564 the real uncompressed size is to uncompress the Block,
|
jpayne@68
|
565 because the Block Header and Index fields may contain
|
jpayne@68
|
566 (intentionally or unintentionally) invalid information.
|
jpayne@68
|
567
|
jpayne@68
|
568
|
jpayne@68
|
569 3.1.5. List of Filter Flags
|
jpayne@68
|
570
|
jpayne@68
|
571 +================+================+ +================+
|
jpayne@68
|
572 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
|
jpayne@68
|
573 +================+================+ +================+
|
jpayne@68
|
574
|
jpayne@68
|
575 The number of Filter Flags fields is stored in the Block Flags
|
jpayne@68
|
576 field (see Section 3.1.2).
|
jpayne@68
|
577
|
jpayne@68
|
578 The format of each Filter Flags field is as follows:
|
jpayne@68
|
579
|
jpayne@68
|
580 +===========+====================+===================+
|
jpayne@68
|
581 | Filter ID | Size of Properties | Filter Properties |
|
jpayne@68
|
582 +===========+====================+===================+
|
jpayne@68
|
583
|
jpayne@68
|
584 Both Filter ID and Size of Properties are stored using the
|
jpayne@68
|
585 encoding described in Section 1.2. Size of Properties indicates
|
jpayne@68
|
586 the size of the Filter Properties field as bytes. The list of
|
jpayne@68
|
587 officially defined Filter IDs and the formats of their Filter
|
jpayne@68
|
588 Properties are described in Section 5.3.
|
jpayne@68
|
589
|
jpayne@68
|
590 Filter IDs greater than or equal to 0x4000_0000_0000_0000
|
jpayne@68
|
591 (2^62) are reserved for implementation-specific internal use.
|
jpayne@68
|
592 These Filter IDs MUST never be used in List of Filter Flags.
|
jpayne@68
|
593
|
jpayne@68
|
594
|
jpayne@68
|
595 3.1.6. Header Padding
|
jpayne@68
|
596
|
jpayne@68
|
597 This field contains as many null bytes as are needed to make
|
jpayne@68
|
598 the Block Header have the size specified in Block Header Size.
|
jpayne@68
|
599 If any of the bytes are not null bytes, the decoder MUST
|
jpayne@68
|
600 indicate an error. It is possible that there is a new field
|
jpayne@68
|
601 present which the decoder is not aware of, and can thus parse
|
jpayne@68
|
602 the Block Header incorrectly.
|
jpayne@68
|
603
|
jpayne@68
|
604
|
jpayne@68
|
605 3.1.7. CRC32
|
jpayne@68
|
606
|
jpayne@68
|
607 The CRC32 is calculated over everything in the Block Header
|
jpayne@68
|
608 field except the CRC32 field itself. It is stored as an
|
jpayne@68
|
609 unsigned 32-bit little endian integer. If the calculated
|
jpayne@68
|
610 value does not match the stored one, the decoder MUST indicate
|
jpayne@68
|
611 an error.
|
jpayne@68
|
612
|
jpayne@68
|
613 Verifying the CRC32 of the Block Header before parsing the
|
jpayne@68
|
614 actual contents allows the decoder to distinguish between
|
jpayne@68
|
615 corrupt and unsupported files.
|
jpayne@68
|
616
|
jpayne@68
|
617
|
jpayne@68
|
618 3.2. Compressed Data
|
jpayne@68
|
619
|
jpayne@68
|
620 The format of Compressed Data depends on Block Flags and List
|
jpayne@68
|
621 of Filter Flags. Excluding the descriptions of the simplest
|
jpayne@68
|
622 filters in Section 5.3, the format of the filter-specific
|
jpayne@68
|
623 encoded data is out of scope of this document.
|
jpayne@68
|
624
|
jpayne@68
|
625
|
jpayne@68
|
626 3.3. Block Padding
|
jpayne@68
|
627
|
jpayne@68
|
628 Block Padding MUST contain 0-3 null bytes to make the size of
|
jpayne@68
|
629 the Block a multiple of four bytes. This can be needed when
|
jpayne@68
|
630 the size of Compressed Data is not a multiple of four. If any
|
jpayne@68
|
631 of the bytes in Block Padding are not null bytes, the decoder
|
jpayne@68
|
632 MUST indicate an error.
|
jpayne@68
|
633
|
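The sizes of Block Header and of every Check type are already
multiples of four, so the amount of Block Padding depends only on
the size of Compressed Data. A minimal sketch, with a hypothetical
function name:

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of null bytes (0-3) needed after
 * Compressed Data to reach a multiple of four bytes.
 */
unsigned
block_padding_size(uint64_t compressed_size)
{
	return (unsigned)((4 - (compressed_size & 3)) & 3);
}
```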
jpayne@68
|
634
|
jpayne@68
|
635 3.4. Check
|
jpayne@68
|
636
|
jpayne@68
|
637 The type and size of the Check field depend on which bits
|
jpayne@68
|
638 are set in the Stream Flags field (see Section 2.1.1.2).
|
jpayne@68
|
639
|
jpayne@68
|
640 The Check, when used, is calculated from the original
|
jpayne@68
|
641 uncompressed data. If the calculated Check does not match the
|
jpayne@68
|
642 stored one, the decoder MUST indicate an error. If the selected
|
jpayne@68
|
643 type of Check is not supported by the decoder, it SHOULD
|
jpayne@68
|
644 indicate a warning or error.
|
jpayne@68
|
645
|
jpayne@68
|
646
|
jpayne@68
|
647 4. Index
|
jpayne@68
|
648
|
jpayne@68
|
649 +-----------------+===================+
|
jpayne@68
|
650 | Index Indicator | Number of Records |
|
jpayne@68
|
651 +-----------------+===================+
|
jpayne@68
|
652
|
jpayne@68
|
653 +=================+===============+-+-+-+-+
|
jpayne@68
|
654 ---> | List of Records | Index Padding | CRC32 |
|
jpayne@68
|
655 +=================+===============+-+-+-+-+
|
jpayne@68
|
656
|
jpayne@68
|
657 Index serves several purposes. Using it, one can
|
jpayne@68
|
658 - verify that all Blocks in a Stream have been processed;
|
jpayne@68
|
659 - find out the uncompressed size of a Stream; and
|
jpayne@68
|
660 - quickly access the beginning of any Block (random access).
|
jpayne@68
|
661
|
jpayne@68
|
662
|
jpayne@68
|
663 4.1. Index Indicator
|
jpayne@68
|
664
|
jpayne@68
|
665 This field overlaps with the Block Header Size field (see
|
jpayne@68
|
666 Section 3.1.1). The value of Index Indicator is always 0x00.
|
jpayne@68
|
667
|
jpayne@68
|
668
|
jpayne@68
|
669 4.2. Number of Records
|
jpayne@68
|
670
|
jpayne@68
|
671 This field indicates how many Records there are in the List
|
jpayne@68
|
672 of Records field, and thus how many Blocks there are in the
|
jpayne@68
|
673 Stream. The value is stored using the encoding described in
|
jpayne@68
|
674 Section 1.2. If the decoder has decoded all the Blocks of the
|
jpayne@68
|
675 Stream, and then notices that the Number of Records doesn't
|
jpayne@68
|
676 match the real number of Blocks, the decoder MUST indicate an
|
jpayne@68
|
677 error.
|
jpayne@68
|


4.3. List of Records

List of Records consists of as many Records as indicated by the
Number of Records field:

    +========+========+
    | Record | Record | ...
    +========+========+

Each Record contains information about one Block:

    +===============+===================+
    | Unpadded Size | Uncompressed Size |
    +===============+===================+

If the decoder has decoded all the Blocks of the Stream, it
MUST verify that the contents of the Records match the real
Unpadded Size and Uncompressed Size of the respective Blocks.

Implementation hint: It is possible to verify the Index with
constant memory usage by calculating, for example, the SHA-256
of both the real size values and the List of Records, then
comparing the hash values. Implementing this using a
non-cryptographic hash like CRC32 SHOULD be avoided unless
small code size is important.

If the decoder supports random-access reading, it MUST verify
that Unpadded Size and Uncompressed Size of every completely
decoded Block match the sizes stored in the Index. If only
part of a Block is decoded, the decoder MUST verify that the
processed sizes don't exceed the sizes stored in the Index.
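
The constant-memory idea from the implementation hint can be
sketched as follows. This is only an illustration of the structure:
to stay self-contained it uses 64-bit FNV-1a as a stand-in hash,
which is NOT cryptographic, so a real decoder following the hint
would substitute SHA-256 or similar. All names (size_digest and its
functions) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Running summary of a sequence of (Unpadded Size, Uncompressed
 * Size) pairs. Memory use does not depend on the number of Blocks. */
struct size_digest {
    uint64_t count; /* number of pairs seen so far */
    uint64_t hash;  /* stand-in hash state (FNV-1a, not cryptographic) */
};

void
size_digest_init(struct size_digest *d)
{
    d->count = 0;
    d->hash = UINT64_C(0xCBF29CE484222325); /* FNV-1a offset basis */
}

/* Feed the eight bytes of v into the hash, least significant first. */
static void
mix_u64(struct size_digest *d, uint64_t v)
{
    for (int i = 0; i < 8; ++i) {
        d->hash ^= (v >> (8 * i)) & 0xFF;
        d->hash *= UINT64_C(0x100000001B3); /* FNV-1a prime */
    }
}

void
size_digest_add(struct size_digest *d,
        uint64_t unpadded_size, uint64_t uncompressed_size)
{
    ++d->count;
    mix_u64(d, unpadded_size);
    mix_u64(d, uncompressed_size);
}

/* Returns 1 if both digests summarize the same sequence (up to
 * hash collisions), 0 otherwise. */
int
size_digest_equal(const struct size_digest *a,
        const struct size_digest *b)
{
    return a->count == b->count && a->hash == b->hash;
}
```

A decoder would update one digest after finishing each Block and a
second digest while reading each Record, then compare the two once
the whole Index has been read.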


4.3.1. Unpadded Size

This field indicates the size of the Block excluding the Block
Padding field. That is, Unpadded Size is the size of the Block
Header, Compressed Data, and Check fields. Unpadded Size is
stored using the encoding described in Section 1.2. The value
MUST never be zero; with the current structure of Blocks, the
actual minimum value for Unpadded Size is five.

Implementation note: Because the size of the Block Padding
field is not included in Unpadded Size, calculating the total
size of a Stream or doing random-access reading requires
calculating the actual size of the Blocks by rounding Unpadded
Sizes up to the next multiple of four.

The reason to exclude Block Padding from Unpadded Size is to
ease making a raw copy of Compressed Data without Block
Padding. This can be useful, for example, if someone wants
to convert Streams to some other file format quickly.
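
The rounding described in the implementation note can be sketched
as below. The function names are hypothetical; the 12-byte Stream
Header size is the sum of the Header Magic Bytes, Stream Flags, and
CRC32 fields of Section 2.1.1.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of a Block as stored in the Stream, including Block Padding:
 * Unpadded Size rounded up to the next multiple of four. */
uint64_t
block_total_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Offset of Block number `target` (counting from zero) from the
 * beginning of the Stream, given the Unpadded Sizes of all the
 * preceding Blocks as read from the Index. */
uint64_t
block_offset(const uint64_t *unpadded_sizes, size_t target)
{
    uint64_t offset = 12; /* Stream Header */
    for (size_t i = 0; i < target; ++i)
        offset += block_total_size(unpadded_sizes[i]);
    return offset;
}
```

For example, Blocks with Unpadded Sizes 5, 8, and 9 occupy 8, 8, and
12 bytes, so the third Block starts at offset 12 + 8 + 8 = 28.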


4.3.2. Uncompressed Size

This field indicates the Uncompressed Size of the respective
Block in bytes. The value is stored using the encoding
described in Section 1.2.


4.4. Index Padding

This field MUST contain 0-3 null bytes to pad the Index to
a multiple of four bytes. If any of the bytes are not null
bytes, the decoder MUST indicate an error.
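
A small sketch of the padding rule, with hypothetical function
names. Because the CRC32 field that follows is itself four bytes,
padding the preceding fields to a multiple of four also makes the
whole Index a multiple of four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of null bytes needed after `unpadded_index_size` bytes of
 * Index Indicator, Number of Records, and List of Records. */
size_t
index_padding_size(uint64_t unpadded_index_size)
{
    return (size_t)((4 - (unpadded_index_size & 3)) & 3);
}

/* Returns 1 if Index Padding of `count` bytes is valid (at most
 * three bytes, all of them null), 0 otherwise. */
int
index_padding_is_valid(const uint8_t *padding, size_t count)
{
    if (count > 3)
        return 0;

    for (size_t i = 0; i < count; ++i)
        if (padding[i] != 0x00)
            return 0;

    return 1;
}
```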


4.5. CRC32

The CRC32 is calculated over everything in the Index field
except the CRC32 field itself. The CRC32 is stored as an
unsigned 32-bit little endian integer. If the calculated
value does not match the stored one, the decoder MUST indicate
an error.
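
The comparison could look roughly like this, assuming the caller
has already calculated the CRC32 of all Index bytes except the last
four. The function names are hypothetical. The minimum size of
eight bytes corresponds to an Index with no Records: Index
Indicator, a one-byte Number of Records, two bytes of Index
Padding, and the CRC32.

```c
#include <stddef.h>
#include <stdint.h>

/* Read an unsigned 32-bit little endian integer. */
uint32_t
read_le32(const uint8_t *buf)
{
    return (uint32_t)buf[0]
        | ((uint32_t)buf[1] << 8)
        | ((uint32_t)buf[2] << 16)
        | ((uint32_t)buf[3] << 24);
}

/* `index` points to a complete Index field of `size` bytes and
 * `computed` is the CRC32 of its first size - 4 bytes.
 * Returns 1 if the stored CRC32 matches, 0 otherwise. */
int
index_crc32_matches(const uint8_t *index, size_t size,
        uint32_t computed)
{
    if (size < 8 || (size & 3) != 0)
        return 0;

    return read_le32(index + size - 4) == computed;
}
```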


5. Filter Chains

The Block Flags field defines how many filters are used. When
more than one filter is used, the filters are chained; that is,
the output of one filter is the input of another filter. The
following figure illustrates the direction of data flow.

            v   Uncompressed Data  ^
            |       Filter 0       |
    Encoder |       Filter 1       | Decoder
            |       Filter n       |
            v    Compressed Data   ^


5.1. Alignment

Alignment of uncompressed input data is usually the job of
the application producing the data. For example, to get the
best results, an archiver tool should make sure that all
PowerPC executable files in the archive stream start at
offsets that are multiples of four bytes.

Some filters, for example LZMA2, can be configured to take
advantage of specified alignment of input data. Note that
taking advantage of aligned input can also be beneficial when
a filter is not the first filter in the chain. For example,
if you compress PowerPC executables, you may want to use the
PowerPC filter and chain that with the LZMA2 filter. Because
not only the input but also the output alignment of the PowerPC
filter is four bytes, it is now beneficial to set LZMA2
settings so that the LZMA2 encoder can take advantage of its
four-byte-aligned input data.

The output of the last filter in the chain is stored to the
Compressed Data field, which is guaranteed to be aligned
to a multiple of four bytes relative to the beginning of the
Stream. This can increase
  - speed, if the filtered data is handled multiple bytes at
    a time by the filter-specific encoder and decoder,
    because accessing aligned data in computer memory is
    usually faster; and
  - compression ratio, if the output data is later compressed
    with an external compression tool.


5.2. Security

If filters were allowed to be chained freely, it would be
possible to create malicious files that would be very slow to
decode. Such files could be used to create denial of service
attacks.

Slow files could occur when multiple filters are chained:

    v   Compressed input data
    |   Filter 1 decoder (last filter)
    |   Filter 0 decoder (non-last filter)
    v   Uncompressed output data

The decoder of the last filter in the chain produces a lot of
output from little input. Another filter in the chain takes the
output of the last filter, and produces very little output
while consuming a lot of input. As a result, a lot of data is
moved inside the filter chain, but the filter chain as a whole
gets very little work done.

To prevent this kind of slow file, there are restrictions on
how the filters can be chained. These restrictions MUST be
taken into account when designing new filters.

The maximum number of filters in the chain has been limited to
four, thus there can be a maximum of three non-last filters.
Of these three non-last filters, only two are allowed to change
the size of the data.

The non-last filters that change the size of the data MUST
have a limit on how much the decoder can compress the data: the
decoder SHOULD produce at least n bytes of output when the
filter is given 2n bytes of input. This limit is not
absolute, but significant deviations MUST be avoided.

The above limitations guarantee that if the last filter in the
chain produces 4n bytes of output, the chain as a whole will
produce at least n bytes of output.
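
The chaining restrictions above can be expressed as a simple check.
The sketch below is not taken from any implementation; struct
filter_info and chain_is_valid are hypothetical names, and the
three flags mirror the "Changes size of data", "Allow as a non-last
filter", and "Allow as the last filter" properties listed for each
filter in Section 5.3. The 4n-to-n guarantee follows because at
most two non-last decoders may shrink the data, each by at most
half: 4n bytes become at least 2n, then at least n.

```c
#include <stddef.h>

/* Hypothetical description of one filter in a chain. */
struct filter_info {
    int changes_size; /* nonzero if the filter changes the data size */
    int non_last_ok;  /* nonzero if allowed as a non-last filter */
    int last_ok;      /* nonzero if allowed as the last filter */
};

/* `chain` lists the filters in encoder order (Filter 0 first).
 * Returns 1 if the chain obeys the restrictions, 0 otherwise. */
int
chain_is_valid(const struct filter_info *chain, size_t count)
{
    /* At most four filters, so at most three non-last filters. */
    if (count < 1 || count > 4)
        return 0;

    size_t size_changing = 0;
    for (size_t i = 0; i + 1 < count; ++i) {
        if (!chain[i].non_last_ok)
            return 0;
        if (chain[i].changes_size)
            ++size_changing;
    }

    /* Only two non-last filters may change the size of the data. */
    if (size_changing > 2)
        return 0;

    return chain[count - 1].last_ok != 0;
}
```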


5.3. Filters

5.3.1. LZMA2

LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
compression algorithm with high compression ratio and fast
decompression. LZMA is based on LZ77 and range coding
algorithms.

LZMA2 is an extension on top of the original LZMA. LZMA2 uses
LZMA internally, but adds support for flushing the encoder and
for uncompressed chunks, eases stateful decoder implementations,
and improves support for multithreading. Thus, the plain LZMA
will not be supported in this file format.

    Filter ID:                  0x21
    Size of Filter Properties:  1 byte
    Changes size of data:       Yes
    Allow as a non-last filter: No
    Allow as the last filter:   Yes

    Preferred alignment:
        Input data:             Adjustable to 1/2/4/8/16 byte(s)
        Output data:            1 byte

The format of the one-byte Filter Properties field is as
follows:

    Bits   Mask   Description
    0-5    0x3F   Dictionary Size
    6-7    0xC0   Reserved for future use; MUST be zero for now.

Dictionary Size is encoded with one-bit mantissa and five-bit
exponent. The smallest dictionary size is 4 KiB and the biggest
is 4 GiB.

    Raw value   Mantissa   Exponent   Dictionary size
        0           2         11         4 KiB
        1           3         11         6 KiB
        2           2         12         8 KiB
        3           3         12        12 KiB
        4           2         13        16 KiB
        5           3         13        24 KiB
        6           2         14        32 KiB
       ...         ...        ...      ...
       35           3         28       768 MiB
       36           2         29      1024 MiB
       37           3         29      1536 MiB
       38           2         30      2048 MiB
       39           3         30      3072 MiB
       40           2         31      4096 MiB - 1 B

Instead of having a table in the decoder, the dictionary size
can be decoded using the following C code:

    const uint8_t bits = get_dictionary_flags() & 0x3F;
    if (bits > 40)
        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

    uint32_t dictionary_size;
    if (bits == 40) {
        // 4 GiB doesn't fit in 32 bits, so use 4 GiB - 1 B.
        dictionary_size = UINT32_MAX;
    } else {
        // Mantissa is 2 or 3; the exponent grows by one
        // for every second raw value.
        dictionary_size = 2 | (bits & 1);
        dictionary_size <<= bits / 2 + 11;
    }
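
For the encoder side, one possible approach (a sketch, not a
prescribed method) is to search for the smallest raw value whose
dictionary size is at least the size the encoder wants to use. The
function names dict_size_decode and dict_size_encode are
hypothetical; dict_size_decode repeats the decoding logic above and
expects a raw value in the range 0-40.

```c
#include <stdint.h>

/* Dictionary size for a raw value in the range [0, 40]. */
uint32_t
dict_size_decode(uint8_t bits)
{
    if (bits == 40)
        return UINT32_MAX;

    const uint32_t mantissa = 2 | (bits & 1);
    return mantissa << (bits / 2 + 11);
}

/* Smallest raw value whose dictionary size is at least `wanted`.
 * Sizes below 4 KiB map to raw value 0. */
uint8_t
dict_size_encode(uint32_t wanted)
{
    uint8_t bits = 0;
    while (bits < 40 && dict_size_decode(bits) < wanted)
        ++bits;
    return bits;
}
```

With at most 41 candidates, the linear search is cheap; an encoder
that cares could compute the value directly from the position of
the highest set bit instead.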


5.3.2. Branch/Call/Jump Filters for Executables

These filters convert relative branch, call, and jump
instructions to their absolute counterparts in executable
files. This conversion increases redundancy and thus
compression ratio.

    Size of Filter Properties:  0 or 4 bytes
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

Below is the list of filters in this category. The alignment
is the same for both input and output data.

    Filter ID   Alignment   Description
      0x04       1 byte     x86 filter (BCJ)
      0x05       4 bytes    PowerPC (big endian) filter
      0x06      16 bytes    IA64 filter
      0x07       4 bytes    ARM filter [1]
      0x08       2 bytes    ARM Thumb filter [1]
      0x09       4 bytes    SPARC filter
      0x0A       4 bytes    ARM64 filter [2]
      0x0B       2 bytes    RISC-V filter

    [1] These are for little endian instruction encoding.
        This must not be confused with data endianness.
        A processor configured for big endian data access
        may still use little endian instruction encoding.
        The filters don't care about the data endianness.

    [2] 4096-byte alignment gives the best results
        because the address in the ADRP instruction
        is a multiple of 4096 bytes.

If the size of Filter Properties is four bytes, the Filter
Properties field contains the start offset used for address
conversions. It is stored as an unsigned 32-bit little endian
integer. The start offset MUST be a multiple of the alignment
of the filter as listed in the table above; if it isn't, the
decoder MUST indicate an error. If the size of Filter
Properties is zero, the start offset is zero.

Setting the start offset may be useful if an executable has
multiple sections, and there are many cross-section calls.
Taking advantage of this feature usually requires use of
the Subblock filter, whose design is not complete yet.
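
The start offset rules can be sketched as follows. The function
name bcj_start_offset and its error convention are hypothetical;
the caller is assumed to pass the alignment from the table above
for the Filter ID in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Parse the Filter Properties of a Branch/Call/Jump filter.
 * Returns 0 on success and stores the start offset, or returns -1
 * if the decoder should indicate an error. */
int
bcj_start_offset(const uint8_t *props, size_t props_size,
        uint32_t alignment, uint32_t *start_offset)
{
    /* No properties means a start offset of zero. */
    if (props_size == 0) {
        *start_offset = 0;
        return 0;
    }

    if (props_size != 4 || alignment == 0)
        return -1;

    /* Unsigned 32-bit little endian integer. */
    const uint32_t offset = (uint32_t)props[0]
        | ((uint32_t)props[1] << 8)
        | ((uint32_t)props[2] << 16)
        | ((uint32_t)props[3] << 24);

    /* The offset MUST be a multiple of the filter's alignment. */
    if (offset % alignment != 0)
        return -1;

    *start_offset = offset;
    return 0;
}
```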


5.3.3. Delta

The Delta filter may increase compression ratio when the value
of the next byte correlates with the value of an earlier byte
at a specified distance.

    Filter ID:                  0x03
    Size of Filter Properties:  1 byte
    Changes size of data:       No
    Allow as a non-last filter: Yes
    Allow as the last filter:   No

    Preferred alignment:
        Input data:             1 byte
        Output data:            Same as the original input data

The Properties byte indicates the delta distance, which can be
1-256 bytes backwards from the current byte: 0x00 indicates
a distance of 1 byte and 0xFF a distance of 256 bytes.


5.3.3.1. Format of the Encoded Output

The code below illustrates both encoding and decoding with
the Delta filter.

    // Distance is in the range [1, 256].
    const unsigned int distance = get_properties_byte() + 1;
    uint8_t pos = 0;
    uint8_t delta[256];

    memset(delta, 0, sizeof(delta));

    while (1) {
        const int byte = read_byte();
        if (byte == EOF)
            break;

        // The byte that was seen `distance` bytes earlier.
        uint8_t tmp = delta[(uint8_t)(distance + pos)];
        if (is_encoder) {
            tmp = (uint8_t)(byte) - tmp;
            delta[pos] = (uint8_t)(byte);
        } else {
            tmp = (uint8_t)(byte) + tmp;
            delta[pos] = tmp;
        }

        write_byte(tmp);
        --pos; // Wraps around modulo 256.
    }
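
The same logic can be arranged to work on a buffer in place, which
is closer to how a filter would typically be called. This is a
sketch under that assumption; struct delta_state, delta_init, and
delta_run are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Filter state. The history buffer holds the last 256 bytes of
 * unfiltered data. */
struct delta_state {
    unsigned int distance; /* in the range [1, 256] */
    uint8_t pos;
    uint8_t history[256];
};

void
delta_init(struct delta_state *s, uint8_t properties_byte)
{
    s->distance = (unsigned int)properties_byte + 1;
    s->pos = 0;
    memset(s->history, 0, sizeof(s->history));
}

/* Encode (is_encoder != 0) or decode `size` bytes in place. The
 * state carries over between calls, so data can be processed in
 * pieces of any size. */
void
delta_run(struct delta_state *s, uint8_t *buf, size_t size,
        int is_encoder)
{
    for (size_t i = 0; i < size; ++i) {
        const uint8_t old
            = s->history[(uint8_t)(s->distance + s->pos)];
        uint8_t out;

        if (is_encoder) {
            out = (uint8_t)(buf[i] - old);
            s->history[s->pos] = buf[i];
        } else {
            out = (uint8_t)(buf[i] + old);
            s->history[s->pos] = out;
        }

        buf[i] = out;
        --s->pos; /* wraps around modulo 256 */
    }
}
```

With a distance of 1 (Properties byte 0x00), the input bytes
10, 12, 15, 15 encode to 10, 2, 3, 0, and decoding that output with
a freshly initialized state restores the original bytes.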


5.4. Custom Filter IDs

If a developer wants to use custom Filter IDs, there are two
choices. The first choice is to contact Lasse Collin and ask
him to allocate a range of IDs for the developer.

The second choice is to generate a 40-bit random integer
which the developer can use as a personal Developer ID.
To minimize the risk of collisions, the Developer ID has to be
a randomly generated integer, not a manually selected "hex word".
The following command, which works on many free operating
systems, can be used to generate a Developer ID:

    dd if=/dev/urandom bs=5 count=1 | hexdump

The developer can then use the Developer ID to create unique
(well, hopefully unique) Filter IDs.

    Bits    Mask                     Description
     0-15   0x0000_0000_0000_FFFF    Filter ID
    16-55   0x00FF_FFFF_FFFF_0000    Developer ID
    56-62   0x3F00_0000_0000_0000    Static prefix: 0x3F

The resulting 63-bit integer will use 9 bytes of space when
stored using the encoding described in Section 1.2. To get
a shorter ID, see the beginning of this section for how to
request a custom ID range.
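
Putting the three bit fields together could look like this. The
function name custom_filter_id is hypothetical, and returning zero
for an out-of-range Developer ID is just a convention chosen for
this sketch.

```c
#include <stdint.h>

/* Build a custom Filter ID from a 40-bit Developer ID and a 16-bit
 * filter number chosen by that developer. Returns 0 if the
 * Developer ID does not fit in 40 bits. */
uint64_t
custom_filter_id(uint64_t developer_id, uint16_t filter_number)
{
    if (developer_id >> 40)
        return 0;

    return (UINT64_C(0x3F) << 56)   /* static prefix, bits 56-62 */
        | (developer_id << 16)      /* Developer ID, bits 16-55 */
        | filter_number;            /* Filter ID, bits 0-15 */
}
```

For example, Developer ID 0x0123456789 and filter number 0x0001
give the Filter ID 0x3F01_2345_6789_0001.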


5.4.1. Reserved Custom Filter ID Ranges

    Range                       Description
    0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
    0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
    0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility


6. Cyclic Redundancy Checks

There are several incompatible variants of CRC32 and CRC64.
For simplicity and clarity, complete examples are
provided to calculate the checks as they are used in this file
format. Implementations MAY use different code as long as it
gives identical results.

The program below reads data from standard input, calculates
the CRC32 and CRC64 values, and prints the calculated values
as big endian hexadecimal strings to standard output.

    #include <stddef.h>
    #include <inttypes.h>
    #include <stdio.h>

    uint32_t crc32_table[256];
    uint64_t crc64_table[256];

    // Fill the lookup tables; call once before crc32() or crc64().
    void
    init(void)
    {
        static const uint32_t poly32 = UINT32_C(0xEDB88320);
        static const uint64_t poly64
                = UINT64_C(0xC96C5795D7870F42);

        for (size_t i = 0; i < 256; ++i) {
            uint32_t crc32 = i;
            uint64_t crc64 = i;

            for (size_t j = 0; j < 8; ++j) {
                if (crc32 & 1)
                    crc32 = (crc32 >> 1) ^ poly32;
                else
                    crc32 >>= 1;

                if (crc64 & 1)
                    crc64 = (crc64 >> 1) ^ poly64;
                else
                    crc64 >>= 1;
            }

            crc32_table[i] = crc32;
            crc64_table[i] = crc64;
        }
    }

    // Pass 0 as crc for the first call and the previous return
    // value when continuing with more data.
    uint32_t
    crc32(const uint8_t *buf, size_t size, uint32_t crc)
    {
        crc = ~crc;
        for (size_t i = 0; i < size; ++i)
            crc = crc32_table[buf[i] ^ (crc & 0xFF)]
                    ^ (crc >> 8);
        return ~crc;
    }

    uint64_t
    crc64(const uint8_t *buf, size_t size, uint64_t crc)
    {
        crc = ~crc;
        for (size_t i = 0; i < size; ++i)
            crc = crc64_table[buf[i] ^ (crc & 0xFF)]
                    ^ (crc >> 8);
        return ~crc;
    }

    int
    main(void)
    {
        init();

        uint32_t value32 = 0;
        uint64_t value64 = 0;
        uint64_t total_size = 0;
        uint8_t buf[8192];

        while (1) {
            const size_t buf_size
                    = fread(buf, 1, sizeof(buf), stdin);
            if (buf_size == 0)
                break;

            total_size += buf_size;
            value32 = crc32(buf, buf_size, value32);
            value64 = crc64(buf, buf_size, value64);
        }

        printf("Bytes:  %" PRIu64 "\n", total_size);
        printf("CRC-32: 0x%08" PRIX32 "\n", value32);
        printf("CRC-64: 0x%016" PRIX64 "\n", value64);

        return 0;
    }
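
When writing an independent implementation, it can help to compare
against a table-free variant and a known input. The sketch below
processes one bit at a time using the same polynomials as the
program above; the names crc32_bitwise and crc64_bitwise are
hypothetical. For the nine ASCII bytes "123456789", the commonly
published check values for these CRC variants are 0xCBF43926
(CRC32) and 0x995DC9BBDF1939FA (CRC64).

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32. Pass 0 as crc for the first call and the
 * previous return value when continuing with more data. */
uint32_t
crc32_bitwise(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc >> 1)
                ^ ((crc & 1) ? UINT32_C(0xEDB88320) : 0);
    }
    return ~crc;
}

/* Bit-at-a-time CRC64 with the same calling convention. */
uint64_t
crc64_bitwise(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc >> 1)
                ^ ((crc & 1) ? UINT64_C(0xC96C5795D7870F42) : 0);
    }
    return ~crc;
}
```

The bitwise form is much slower than the table-driven one, so it is
mainly useful as a cross-check or where code size matters more than
speed.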


7. References

LZMA SDK - The original LZMA implementation
https://7-zip.org/sdk.html

LZMA Utils - LZMA adapted to POSIX-like systems
https://tukaani.org/lzma/

XZ Utils - The next generation of LZMA Utils
https://tukaani.org/xz/

[RFC-1952]
GZIP file format specification version 4.3
https://www.ietf.org/rfc/rfc1952.txt
  - Notation of byte boxes in section "2.1. Overall conventions"

[RFC-2119]
Key words for use in RFCs to Indicate Requirement Levels
https://www.ietf.org/rfc/rfc2119.txt

[GNU-tar]
GNU tar 1.35 manual
https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
  - Node 9.4.2 "Blocking Factor", paragraph that begins
    "gzip will complain about trailing garbage"
  - Note that this URL points to the latest version of the
    manual, and may some day not contain the note which is in
    1.35. For the exact version of the manual, download GNU
    tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz
