annotate CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/share/doc/xz/xz-file-format.txt @ 68:5028fdace37b

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 16:23:26 -0400
parents
children
rev   line source
jpayne@68 1
jpayne@68 2 The .xz File Format
jpayne@68 3 ===================
jpayne@68 4
jpayne@68 5 Version 1.2.1 (2024-04-08)
jpayne@68 6
jpayne@68 7
jpayne@68 8 0. Preface
jpayne@68 9 0.1. Notices and Acknowledgements
jpayne@68 10 0.2. Getting the Latest Version
jpayne@68 11 0.3. Version History
jpayne@68 12 1. Conventions
jpayne@68 13 1.1. Byte and Its Representation
jpayne@68 14 1.2. Multibyte Integers
jpayne@68 15 2. Overall Structure of .xz File
jpayne@68 16 2.1. Stream
jpayne@68 17 2.1.1. Stream Header
jpayne@68 18 2.1.1.1. Header Magic Bytes
jpayne@68 19 2.1.1.2. Stream Flags
jpayne@68 20 2.1.1.3. CRC32
jpayne@68 21 2.1.2. Stream Footer
jpayne@68 22 2.1.2.1. CRC32
jpayne@68 23 2.1.2.2. Backward Size
jpayne@68 24 2.1.2.3. Stream Flags
jpayne@68 25 2.1.2.4. Footer Magic Bytes
jpayne@68 26 2.2. Stream Padding
jpayne@68 27 3. Block
jpayne@68 28 3.1. Block Header
jpayne@68 29 3.1.1. Block Header Size
jpayne@68 30 3.1.2. Block Flags
jpayne@68 31 3.1.3. Compressed Size
jpayne@68 32 3.1.4. Uncompressed Size
jpayne@68 33 3.1.5. List of Filter Flags
jpayne@68 34 3.1.6. Header Padding
jpayne@68 35 3.1.7. CRC32
jpayne@68 36 3.2. Compressed Data
jpayne@68 37 3.3. Block Padding
jpayne@68 38 3.4. Check
jpayne@68 39 4. Index
jpayne@68 40 4.1. Index Indicator
jpayne@68 41 4.2. Number of Records
jpayne@68 42 4.3. List of Records
jpayne@68 43 4.3.1. Unpadded Size
jpayne@68 44 4.3.2. Uncompressed Size
jpayne@68 45 4.4. Index Padding
jpayne@68 46 4.5. CRC32
jpayne@68 47 5. Filter Chains
jpayne@68 48 5.1. Alignment
jpayne@68 49 5.2. Security
jpayne@68 50 5.3. Filters
jpayne@68 51 5.3.1. LZMA2
jpayne@68 52 5.3.2. Branch/Call/Jump Filters for Executables
jpayne@68 53 5.3.3. Delta
jpayne@68 54 5.3.3.1. Format of the Encoded Output
jpayne@68 55 5.4. Custom Filter IDs
jpayne@68 56 5.4.1. Reserved Custom Filter ID Ranges
jpayne@68 57 6. Cyclic Redundancy Checks
jpayne@68 58 7. References
jpayne@68 59
jpayne@68 60
jpayne@68 61 0. Preface
jpayne@68 62
jpayne@68 63 This document describes the .xz file format (filename suffix
jpayne@68 64 ".xz", MIME type "application/x-xz"). It is intended that this
jpayne@68 65 this format replace the old .lzma format used by LZMA SDK and
jpayne@68 66 LZMA Utils.
jpayne@68 67
jpayne@68 68
jpayne@68 69 0.1. Notices and Acknowledgements
jpayne@68 70
jpayne@68 71 This file format was designed by Lasse Collin
jpayne@68 72 <lasse.collin@tukaani.org> and Igor Pavlov.
jpayne@68 73
jpayne@68 74 Special thanks for helping with this document goes to
jpayne@68 75 Ville Koskinen. Thanks for helping with this document goes to
jpayne@68 76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
jpayne@68 77
jpayne@68 78 This document has been put into the public domain.
jpayne@68 79
jpayne@68 80
jpayne@68 81 0.2. Getting the Latest Version
jpayne@68 82
jpayne@68 83 The latest official version of this document can be downloaded
jpayne@68 84 from <https://tukaani.org/xz/xz-file-format.txt>.
jpayne@68 85
jpayne@68 86 Specific versions of this document have a filename
jpayne@68 87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
jpayne@68 88 For example, the version 1.0.0 of this document is available
jpayne@68 89 at <https://tukaani.org/xz/xz-file-format-1.0.0.txt>.
jpayne@68 90
jpayne@68 91
jpayne@68 92 0.3. Version History
jpayne@68 93
jpayne@68 94 Version Date Description
jpayne@68 95
jpayne@68 96 1.2.1 2024-04-08 The URLs of this specification and
jpayne@68 97 XZ Utils were changed back to the
jpayne@68 98 original ones in Sections 0.2 and 7.
jpayne@68 99
jpayne@68 100 1.2.0 2024-01-19 Added RISC-V filter and updated URLs in
jpayne@68 101 Sections 0.2 and 7. The URL of this
jpayne@68 102 specification was changed.
jpayne@68 103
jpayne@68 104 1.1.0 2022-12-11 Added ARM64 filter and clarified 32-bit
jpayne@68 105 ARM endianness in Section 5.3.2,
jpayne@68 106 language improvements in Section 5.4
jpayne@68 107
jpayne@68 108 1.0.4 2009-08-27 Language improvements in Sections 1.2,
jpayne@68 109 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
jpayne@68 110
jpayne@68 111 1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4
jpayne@68 112
jpayne@68 113 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1
jpayne@68 114
jpayne@68 115 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor
jpayne@68 116 clarifications to Sections 2, 2.2,
jpayne@68 117 3.3, 4.4, and 5.3.2
jpayne@68 118
jpayne@68 119 1.0.0 2009-01-14 The first official version
jpayne@68 120
jpayne@68 121
jpayne@68 122 1. Conventions
jpayne@68 123
jpayne@68 124 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
jpayne@68 125 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
jpayne@68 126 document are to be interpreted as described in [RFC-2119].
jpayne@68 127
jpayne@68 128 Indicating a warning means displaying a message, returning
jpayne@68 129 appropriate exit status, or doing something else to let the
jpayne@68 130 user know that something worth warning occurred. The operation
jpayne@68 131 SHOULD still finish if a warning is indicated.
jpayne@68 132
jpayne@68 133 Indicating an error means displaying a message, returning
jpayne@68 134 appropriate exit status, or doing something else to let the
jpayne@68 135 user know that something prevented successfully finishing the
jpayne@68 136 operation. The operation MUST be aborted once an error has
jpayne@68 137 been indicated.
jpayne@68 138
jpayne@68 139
jpayne@68 140 1.1. Byte and Its Representation
jpayne@68 141
jpayne@68 142 In this document, byte is always 8 bits.
jpayne@68 143
jpayne@68 144 A "null byte" has all bits unset. That is, the value of a null
jpayne@68 145 byte is 0x00.
jpayne@68 146
jpayne@68 147 To represent byte blocks, this document uses notation that
jpayne@68 148 is similar to the notation used in [RFC-1952]:
jpayne@68 149
jpayne@68 150 +-------+
jpayne@68 151 | Foo | One byte.
jpayne@68 152 +-------+
jpayne@68 153
jpayne@68 154 +---+---+
jpayne@68 155 | Foo | Two bytes; that is, some of the vertical bars
jpayne@68 156 +---+---+ can be missing.
jpayne@68 157
jpayne@68 158 +=======+
jpayne@68 159 | Foo | Zero or more bytes.
jpayne@68 160 +=======+
jpayne@68 161
jpayne@68 162 In this document, a boxed byte or a byte sequence declared
jpayne@68 163 using this notation is called "a field". The example field
jpayne@68 164 above would be called "the Foo field" or plain "Foo".
jpayne@68 165
jpayne@68 166 If there are many fields, they may be split to multiple lines.
jpayne@68 167 This is indicated with an arrow ("--->"):
jpayne@68 168
jpayne@68 169 +=====+
jpayne@68 170 | Foo |
jpayne@68 171 +=====+
jpayne@68 172
jpayne@68 173 +=====+
jpayne@68 174 ---> | Bar |
jpayne@68 175 +=====+
jpayne@68 176
jpayne@68 177 The above is equivalent to this:
jpayne@68 178
jpayne@68 179 +=====+=====+
jpayne@68 180 | Foo | Bar |
jpayne@68 181 +=====+=====+
jpayne@68 182
jpayne@68 183
jpayne@68 184 1.2. Multibyte Integers
jpayne@68 185
jpayne@68 186 Multibyte integers of static length, such as CRC values,
jpayne@68 187 are stored in little endian byte order (least significant
jpayne@68 188 byte first).
jpayne@68 189
jpayne@68 190 When smaller values are more likely than bigger values (for
jpayne@68 191 example file sizes), multibyte integers are encoded in a
jpayne@68 192 variable-length representation:
jpayne@68 193 - Numbers in the range [0, 127] are copied as is, and take
jpayne@68 194 one byte of space.
jpayne@68 195 - Bigger numbers will occupy two or more bytes. All but the
jpayne@68 196 last byte of the multibyte representation have the highest
jpayne@68 197 (eighth) bit set.
jpayne@68 198
jpayne@68 199 For now, the value of the variable-length integers is limited
jpayne@68 200 to 63 bits, which limits the encoded size of the integer to
jpayne@68 201 nine bytes. These limits may be increased in the future if
jpayne@68 202 needed.
jpayne@68 203
jpayne@68 204 The following C code illustrates encoding and decoding of
jpayne@68 205 variable-length integers. The functions return the number of
jpayne@68 206 bytes occupied by the integer (1-9), or zero on error.
jpayne@68 207
jpayne@68 208 #include <stddef.h>
jpayne@68 209 #include <inttypes.h>
jpayne@68 210
jpayne@68 211 size_t
jpayne@68 212 encode(uint8_t buf[static 9], uint64_t num)
jpayne@68 213 {
jpayne@68 214 if (num > UINT64_MAX / 2)
jpayne@68 215 return 0;
jpayne@68 216
jpayne@68 217 size_t i = 0;
jpayne@68 218
jpayne@68 219 while (num >= 0x80) {
jpayne@68 220 buf[i++] = (uint8_t)(num) | 0x80;
jpayne@68 221 num >>= 7;
jpayne@68 222 }
jpayne@68 223
jpayne@68 224 buf[i++] = (uint8_t)(num);
jpayne@68 225
jpayne@68 226 return i;
jpayne@68 227 }
jpayne@68 228
jpayne@68 229 size_t
jpayne@68 230 decode(const uint8_t buf[], size_t size_max, uint64_t *num)
jpayne@68 231 {
jpayne@68 232 if (size_max == 0)
jpayne@68 233 return 0;
jpayne@68 234
jpayne@68 235 if (size_max > 9)
jpayne@68 236 size_max = 9;
jpayne@68 237
jpayne@68 238 *num = buf[0] & 0x7F;
jpayne@68 239 size_t i = 0;
jpayne@68 240
jpayne@68 241 while (buf[i++] & 0x80) {
jpayne@68 242 if (i >= size_max || buf[i] == 0x00)
jpayne@68 243 return 0;
jpayne@68 244
jpayne@68 245 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
jpayne@68 246 }
jpayne@68 247
jpayne@68 248 return i;
jpayne@68 249 }
jpayne@68 250
jpayne@68 251
jpayne@68 252 2. Overall Structure of .xz File
jpayne@68 253
jpayne@68 254 A standalone .xz files consist of one or more Streams which may
jpayne@68 255 have Stream Padding between or after them:
jpayne@68 256
jpayne@68 257 +========+================+========+================+
jpayne@68 258 | Stream | Stream Padding | Stream | Stream Padding | ...
jpayne@68 259 +========+================+========+================+
jpayne@68 260
jpayne@68 261 The sizes of Stream and Stream Padding are always multiples
jpayne@68 262 of four bytes, thus the size of every valid .xz file MUST be
jpayne@68 263 a multiple of four bytes.
jpayne@68 264
jpayne@68 265 While a typical file contains only one Stream and no Stream
jpayne@68 266 Padding, a decoder handling standalone .xz files SHOULD support
jpayne@68 267 files that have more than one Stream or Stream Padding.
jpayne@68 268
jpayne@68 269 In contrast to standalone .xz files, when the .xz file format
jpayne@68 270 is used as an internal part of some other file format or
jpayne@68 271 communication protocol, it usually is expected that the decoder
jpayne@68 272 stops after the first Stream, and doesn't look for Stream
jpayne@68 273 Padding or possibly other Streams.
jpayne@68 274
jpayne@68 275
jpayne@68 276 2.1. Stream
jpayne@68 277
jpayne@68 278 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
jpayne@68 279 | Stream Header | Block | Block | ... | Block |
jpayne@68 280 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
jpayne@68 281
jpayne@68 282 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
jpayne@68 283 ---> | Index | Stream Footer |
jpayne@68 284 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
jpayne@68 285
jpayne@68 286 All the above fields have a size that is a multiple of four. If
jpayne@68 287 Stream is used as an internal part of another file format, it
jpayne@68 288 is RECOMMENDED to make the Stream start at an offset that is
jpayne@68 289 a multiple of four bytes.
jpayne@68 290
jpayne@68 291 Stream Header, Index, and Stream Footer are always present in
jpayne@68 292 a Stream. The maximum size of the Index field is 16 GiB (2^34).
jpayne@68 293
jpayne@68 294 There are zero or more Blocks. The maximum number of Blocks is
jpayne@68 295 limited only by the maximum size of the Index field.
jpayne@68 296
jpayne@68 297 Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
jpayne@68 298 The same limit applies to the total amount of uncompressed
jpayne@68 299 data stored in a Stream.
jpayne@68 300
jpayne@68 301 If an implementation supports handling .xz files with multiple
jpayne@68 302 concatenated Streams, it MAY apply the above limits to the file
jpayne@68 303 as a whole instead of limiting per Stream basis.
jpayne@68 304
jpayne@68 305
jpayne@68 306 2.1.1. Stream Header
jpayne@68 307
jpayne@68 308 +---+---+---+---+---+---+-------+------+--+--+--+--+
jpayne@68 309 | Header Magic Bytes | Stream Flags | CRC32 |
jpayne@68 310 +---+---+---+---+---+---+-------+------+--+--+--+--+
jpayne@68 311
jpayne@68 312
jpayne@68 313 2.1.1.1. Header Magic Bytes
jpayne@68 314
jpayne@68 315 The first six (6) bytes of the Stream are so called Header
jpayne@68 316 Magic Bytes. They can be used to identify the file type.
jpayne@68 317
jpayne@68 318 Using a C array and ASCII:
jpayne@68 319 const uint8_t HEADER_MAGIC[6]
jpayne@68 320 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
jpayne@68 321
jpayne@68 322 In plain hexadecimal:
jpayne@68 323 FD 37 7A 58 5A 00
jpayne@68 324
jpayne@68 325 Notes:
jpayne@68 326 - The first byte (0xFD) was chosen so that the files cannot
jpayne@68 327 be erroneously detected as being in .lzma format, in which
jpayne@68 328 the first byte is in the range [0x00, 0xE0].
jpayne@68 329 - The sixth byte (0x00) was chosen to prevent applications
jpayne@68 330 from misdetecting the file as a text file.
jpayne@68 331
jpayne@68 332 If the Header Magic Bytes don't match, the decoder MUST
jpayne@68 333 indicate an error.
jpayne@68 334
jpayne@68 335
jpayne@68 336 2.1.1.2. Stream Flags
jpayne@68 337
jpayne@68 338 The first byte of Stream Flags is always a null byte. In the
jpayne@68 339 future, this byte may be used to indicate a new Stream version
jpayne@68 340 or other Stream properties.
jpayne@68 341
jpayne@68 342 The second byte of Stream Flags is a bit field:
jpayne@68 343
jpayne@68 344 Bit(s) Mask Description
jpayne@68 345 0-3 0x0F Type of Check (see Section 3.4):
jpayne@68 346 ID Size Check name
jpayne@68 347 0x00 0 bytes None
jpayne@68 348 0x01 4 bytes CRC32
jpayne@68 349 0x02 4 bytes (Reserved)
jpayne@68 350 0x03 4 bytes (Reserved)
jpayne@68 351 0x04 8 bytes CRC64
jpayne@68 352 0x05 8 bytes (Reserved)
jpayne@68 353 0x06 8 bytes (Reserved)
jpayne@68 354 0x07 16 bytes (Reserved)
jpayne@68 355 0x08 16 bytes (Reserved)
jpayne@68 356 0x09 16 bytes (Reserved)
jpayne@68 357 0x0A 32 bytes SHA-256
jpayne@68 358 0x0B 32 bytes (Reserved)
jpayne@68 359 0x0C 32 bytes (Reserved)
jpayne@68 360 0x0D 64 bytes (Reserved)
jpayne@68 361 0x0E 64 bytes (Reserved)
jpayne@68 362 0x0F 64 bytes (Reserved)
jpayne@68 363 4-7 0xF0 Reserved for future use; MUST be zero for now.
jpayne@68 364
jpayne@68 365 Implementations SHOULD support at least the Check IDs 0x00
jpayne@68 366 (None) and 0x01 (CRC32). Supporting other Check IDs is
jpayne@68 367 OPTIONAL. If an unsupported Check is used, the decoder SHOULD
jpayne@68 368 indicate a warning or error.
jpayne@68 369
jpayne@68 370 If any reserved bit is set, the decoder MUST indicate an error.
jpayne@68 371 It is possible that there is a new field present which the
jpayne@68 372 decoder is not aware of, and can thus parse the Stream Header
jpayne@68 373 incorrectly.
jpayne@68 374
jpayne@68 375
jpayne@68 376 2.1.1.3. CRC32
jpayne@68 377
jpayne@68 378 The CRC32 is calculated from the Stream Flags field. It is
jpayne@68 379 stored as an unsigned 32-bit little endian integer. If the
jpayne@68 380 calculated value does not match the stored one, the decoder
jpayne@68 381 MUST indicate an error.
jpayne@68 382
jpayne@68 383 The idea is that Stream Flags would always be two bytes, even
jpayne@68 384 if new features are needed. This way old decoders will be able
jpayne@68 385 to verify the CRC32 calculated from Stream Flags, and thus
jpayne@68 386 distinguish between corrupt files (CRC32 doesn't match) and
jpayne@68 387 files that the decoder doesn't support (CRC32 matches but
jpayne@68 388 Stream Flags has reserved bits set).
jpayne@68 389
jpayne@68 390
jpayne@68 391 2.1.2. Stream Footer
jpayne@68 392
jpayne@68 393 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
jpayne@68 394 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
jpayne@68 395 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
jpayne@68 396
jpayne@68 397
jpayne@68 398 2.1.2.1. CRC32
jpayne@68 399
jpayne@68 400 The CRC32 is calculated from the Backward Size and Stream Flags
jpayne@68 401 fields. It is stored as an unsigned 32-bit little endian
jpayne@68 402 integer. If the calculated value does not match the stored one,
jpayne@68 403 the decoder MUST indicate an error.
jpayne@68 404
jpayne@68 405 The reason to have the CRC32 field before the Backward Size and
jpayne@68 406 Stream Flags fields is to keep the four-byte fields aligned to
jpayne@68 407 a multiple of four bytes.
jpayne@68 408
jpayne@68 409
jpayne@68 410 2.1.2.2. Backward Size
jpayne@68 411
jpayne@68 412 Backward Size is stored as a 32-bit little endian integer,
jpayne@68 413 which indicates the size of the Index field as multiple of
jpayne@68 414 four bytes, minimum value being four bytes:
jpayne@68 415
jpayne@68 416 real_backward_size = (stored_backward_size + 1) * 4;
jpayne@68 417
jpayne@68 418 If the stored value does not match the real size of the Index
jpayne@68 419 field, the decoder MUST indicate an error.
jpayne@68 420
jpayne@68 421 Using a fixed-size integer to store Backward Size makes
jpayne@68 422 it slightly simpler to parse the Stream Footer when the
jpayne@68 423 application needs to parse the Stream backwards.
jpayne@68 424
jpayne@68 425
jpayne@68 426 2.1.2.3. Stream Flags
jpayne@68 427
jpayne@68 428 This is a copy of the Stream Flags field from the Stream
jpayne@68 429 Header. The information stored to Stream Flags is needed
jpayne@68 430 when parsing the Stream backwards. The decoder MUST compare
jpayne@68 431 the Stream Flags fields in both Stream Header and Stream
jpayne@68 432 Footer, and indicate an error if they are not identical.
jpayne@68 433
jpayne@68 434
jpayne@68 435 2.1.2.4. Footer Magic Bytes
jpayne@68 436
jpayne@68 437 As the last step of the decoding process, the decoder MUST
jpayne@68 438 verify the existence of Footer Magic Bytes. If they don't
jpayne@68 439 match, an error MUST be indicated.
jpayne@68 440
jpayne@68 441 Using a C array and ASCII:
jpayne@68 442 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
jpayne@68 443
jpayne@68 444 In hexadecimal:
jpayne@68 445 59 5A
jpayne@68 446
jpayne@68 447 The primary reason to have Footer Magic Bytes is to make
jpayne@68 448 it easier to detect incomplete files quickly, without
jpayne@68 449 uncompressing. If the file does not end with Footer Magic Bytes
jpayne@68 450 (excluding Stream Padding described in Section 2.2), it cannot
jpayne@68 451 be undamaged, unless someone has intentionally appended garbage
jpayne@68 452 after the end of the Stream.
jpayne@68 453
jpayne@68 454
jpayne@68 455 2.2. Stream Padding
jpayne@68 456
jpayne@68 457 Only the decoders that support decoding of concatenated Streams
jpayne@68 458 MUST support Stream Padding.
jpayne@68 459
jpayne@68 460 Stream Padding MUST contain only null bytes. To preserve the
jpayne@68 461 four-byte alignment of consecutive Streams, the size of Stream
jpayne@68 462 Padding MUST be a multiple of four bytes. Empty Stream Padding
jpayne@68 463 is allowed. If these requirements are not met, the decoder MUST
jpayne@68 464 indicate an error.
jpayne@68 465
jpayne@68 466 Note that non-empty Stream Padding is allowed at the end of the
jpayne@68 467 file; there doesn't need to be a new Stream after non-empty
jpayne@68 468 Stream Padding. This can be convenient in certain situations
jpayne@68 469 [GNU-tar].
jpayne@68 470
jpayne@68 471 The possibility of Stream Padding MUST be taken into account
jpayne@68 472 when designing an application that parses Streams backwards,
jpayne@68 473 and the application supports concatenated Streams.
jpayne@68 474
jpayne@68 475
jpayne@68 476 3. Block
jpayne@68 477
jpayne@68 478 +==============+=================+===============+=======+
jpayne@68 479 | Block Header | Compressed Data | Block Padding | Check |
jpayne@68 480 +==============+=================+===============+=======+
jpayne@68 481
jpayne@68 482
jpayne@68 483 3.1. Block Header
jpayne@68 484
jpayne@68 485 +-------------------+-------------+=================+
jpayne@68 486 | Block Header Size | Block Flags | Compressed Size |
jpayne@68 487 +-------------------+-------------+=================+
jpayne@68 488
jpayne@68 489 +===================+======================+
jpayne@68 490 ---> | Uncompressed Size | List of Filter Flags |
jpayne@68 491 +===================+======================+
jpayne@68 492
jpayne@68 493 +================+--+--+--+--+
jpayne@68 494 ---> | Header Padding | CRC32 |
jpayne@68 495 +================+--+--+--+--+
jpayne@68 496
jpayne@68 497
jpayne@68 498 3.1.1. Block Header Size
jpayne@68 499
jpayne@68 500 This field overlaps with the Index Indicator field (see
jpayne@68 501 Section 4.1).
jpayne@68 502
jpayne@68 503 This field contains the size of the Block Header field,
jpayne@68 504 including the Block Header Size field itself. Valid values are
jpayne@68 505 in the range [0x01, 0xFF], which indicate the size of the Block
jpayne@68 506 Header as multiples of four bytes, minimum size being eight
jpayne@68 507 bytes:
jpayne@68 508
jpayne@68 509 real_header_size = (encoded_header_size + 1) * 4;
jpayne@68 510
jpayne@68 511 If a Block Header bigger than 1024 bytes is needed in the
jpayne@68 512 future, a new field can be added between the Block Header and
jpayne@68 513 Compressed Data fields. The presence of this new field would
jpayne@68 514 be indicated in the Block Header field.
jpayne@68 515
jpayne@68 516
jpayne@68 517 3.1.2. Block Flags
jpayne@68 518
jpayne@68 519 The Block Flags field is a bit field:
jpayne@68 520
jpayne@68 521 Bit(s) Mask Description
jpayne@68 522 0-1 0x03 Number of filters (1-4)
jpayne@68 523 2-5 0x3C Reserved for future use; MUST be zero for now.
jpayne@68 524 6 0x40 The Compressed Size field is present.
jpayne@68 525 7 0x80 The Uncompressed Size field is present.
jpayne@68 526
jpayne@68 527 If any reserved bit is set, the decoder MUST indicate an error.
jpayne@68 528 It is possible that there is a new field present which the
jpayne@68 529 decoder is not aware of, and can thus parse the Block Header
jpayne@68 530 incorrectly.
jpayne@68 531
jpayne@68 532
jpayne@68 533 3.1.3. Compressed Size
jpayne@68 534
jpayne@68 535 This field is present only if the appropriate bit is set in
jpayne@68 536 the Block Flags field (see Section 3.1.2).
jpayne@68 537
jpayne@68 538 The Compressed Size field contains the size of the Compressed
jpayne@68 539 Data field, which MUST be non-zero. Compressed Size is stored
jpayne@68 540 using the encoding described in Section 1.2. If the Compressed
jpayne@68 541 Size doesn't match the size of the Compressed Data field, the
jpayne@68 542 decoder MUST indicate an error.
jpayne@68 543
jpayne@68 544
jpayne@68 545 3.1.4. Uncompressed Size
jpayne@68 546
jpayne@68 547 This field is present only if the appropriate bit is set in
jpayne@68 548 the Block Flags field (see Section 3.1.2).
jpayne@68 549
jpayne@68 550 The Uncompressed Size field contains the size of the Block
jpayne@68 551 after uncompressing. Uncompressed Size is stored using the
jpayne@68 552 encoding described in Section 1.2. If the Uncompressed Size
jpayne@68 553 does not match the real uncompressed size, the decoder MUST
jpayne@68 554 indicate an error.
jpayne@68 555
jpayne@68 556 Storing the Compressed Size and Uncompressed Size fields serves
jpayne@68 557 several purposes:
jpayne@68 558 - The decoder knows how much memory it needs to allocate
jpayne@68 559 for a temporary buffer in multithreaded mode.
jpayne@68 560 - Simple error detection: wrong size indicates a broken file.
jpayne@68 561 - Seeking forwards to a specific location in streamed mode.
jpayne@68 562
jpayne@68 563 It should be noted that the only reliable way to determine
jpayne@68 564 the real uncompressed size is to uncompress the Block,
jpayne@68 565 because the Block Header and Index fields may contain
jpayne@68 566 (intentionally or unintentionally) invalid information.
jpayne@68 567
jpayne@68 568
jpayne@68 569 3.1.5. List of Filter Flags
jpayne@68 570
jpayne@68 571 +================+================+ +================+
jpayne@68 572 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
jpayne@68 573 +================+================+ +================+
jpayne@68 574
jpayne@68 575 The number of Filter Flags fields is stored in the Block Flags
jpayne@68 576 field (see Section 3.1.2).
jpayne@68 577
jpayne@68 578 The format of each Filter Flags field is as follows:
jpayne@68 579
jpayne@68 580 +===========+====================+===================+
jpayne@68 581 | Filter ID | Size of Properties | Filter Properties |
jpayne@68 582 +===========+====================+===================+
jpayne@68 583
jpayne@68 584 Both Filter ID and Size of Properties are stored using the
jpayne@68 585 encoding described in Section 1.2. Size of Properties indicates
jpayne@68 586 the size of the Filter Properties field as bytes. The list of
jpayne@68 587 officially defined Filter IDs and the formats of their Filter
jpayne@68 588 Properties are described in Section 5.3.
jpayne@68 589
jpayne@68 590 Filter IDs greater than or equal to 0x4000_0000_0000_0000
jpayne@68 591 (2^62) are reserved for implementation-specific internal use.
jpayne@68 592 These Filter IDs MUST never be used in List of Filter Flags.
jpayne@68 593
jpayne@68 594
jpayne@68 595 3.1.6. Header Padding
jpayne@68 596
jpayne@68 597 This field contains as many null byte as it is needed to make
jpayne@68 598 the Block Header have the size specified in Block Header Size.
jpayne@68 599 If any of the bytes are not null bytes, the decoder MUST
jpayne@68 600 indicate an error. It is possible that there is a new field
jpayne@68 601 present which the decoder is not aware of, and can thus parse
jpayne@68 602 the Block Header incorrectly.
jpayne@68 603
jpayne@68 604
jpayne@68 605 3.1.7. CRC32
jpayne@68 606
jpayne@68 607 The CRC32 is calculated over everything in the Block Header
jpayne@68 608 field except the CRC32 field itself. It is stored as an
jpayne@68 609 unsigned 32-bit little endian integer. If the calculated
jpayne@68 610 value does not match the stored one, the decoder MUST indicate
jpayne@68 611 an error.
jpayne@68 612
jpayne@68 613 By verifying the CRC32 of the Block Header before parsing the
jpayne@68 614 actual contents allows the decoder to distinguish between
jpayne@68 615 corrupt and unsupported files.
jpayne@68 616
jpayne@68 617
jpayne@68 618 3.2. Compressed Data
jpayne@68 619
jpayne@68 620 The format of Compressed Data depends on Block Flags and List
jpayne@68 621 of Filter Flags. Excluding the descriptions of the simplest
jpayne@68 622 filters in Section 5.3, the format of the filter-specific
jpayne@68 623 encoded data is out of scope of this document.
jpayne@68 624
jpayne@68 625
jpayne@68 626 3.3. Block Padding
jpayne@68 627
jpayne@68 628 Block Padding MUST contain 0-3 null bytes to make the size of
jpayne@68 629 the Block a multiple of four bytes. This can be needed when
jpayne@68 630 the size of Compressed Data is not a multiple of four. If any
jpayne@68 631 of the bytes in Block Padding are not null bytes, the decoder
jpayne@68 632 MUST indicate an error.
jpayne@68 633
jpayne@68 634
jpayne@68 635 3.4. Check
jpayne@68 636
jpayne@68 637 The type and size of the Check field depends on which bits
jpayne@68 638 are set in the Stream Flags field (see Section 2.1.1.2).
jpayne@68 639
jpayne@68 640 The Check, when used, is calculated from the original
jpayne@68 641 uncompressed data. If the calculated Check does not match the
jpayne@68 642 stored one, the decoder MUST indicate an error. If the selected
jpayne@68 643 type of Check is not supported by the decoder, it SHOULD
jpayne@68 644 indicate a warning or error.
jpayne@68 645
jpayne@68 646
jpayne@68 647 4. Index
jpayne@68 648
jpayne@68 649 +-----------------+===================+
jpayne@68 650 | Index Indicator | Number of Records |
jpayne@68 651 +-----------------+===================+
jpayne@68 652
jpayne@68 653 +=================+===============+-+-+-+-+
jpayne@68 654 ---> | List of Records | Index Padding | CRC32 |
jpayne@68 655 +=================+===============+-+-+-+-+
jpayne@68 656
jpayne@68 657 Index serves several purposes. Using it, one can
jpayne@68 658 - verify that all Blocks in a Stream have been processed;
jpayne@68 659 - find out the uncompressed size of a Stream; and
jpayne@68 660 - quickly access the beginning of any Block (random access).
jpayne@68 661
jpayne@68 662
jpayne@68 663 4.1. Index Indicator
jpayne@68 664
jpayne@68 665 This field overlaps with the Block Header Size field (see
jpayne@68 666 Section 3.1.1). The value of Index Indicator is always 0x00.
jpayne@68 667
jpayne@68 668
jpayne@68 669 4.2. Number of Records
jpayne@68 670
jpayne@68 671 This field indicates how many Records there are in the List
jpayne@68 672 of Records field, and thus how many Blocks there are in the
jpayne@68 673 Stream. The value is stored using the encoding described in
jpayne@68 674 Section 1.2. If the decoder has decoded all the Blocks of the
jpayne@68 675 Stream, and then notices that the Number of Records doesn't
jpayne@68 676 match the real number of Blocks, the decoder MUST indicate an
jpayne@68 677 error.
jpayne@68 678
jpayne@68 679
jpayne@68 680 4.3. List of Records
jpayne@68 681
jpayne@68 682 List of Records consists of as many Records as indicated by the
jpayne@68 683 Number of Records field:
jpayne@68 684
jpayne@68 685 +========+========+
jpayne@68 686 | Record | Record | ...
jpayne@68 687 +========+========+
jpayne@68 688
jpayne@68 689 Each Record contains information about one Block:
jpayne@68 690
jpayne@68 691 +===============+===================+
jpayne@68 692 | Unpadded Size | Uncompressed Size |
jpayne@68 693 +===============+===================+
jpayne@68 694
jpayne@68 695 If the decoder has decoded all the Blocks of the Stream, it
jpayne@68 696 MUST verify that the contents of the Records match the real
jpayne@68 697 Unpadded Size and Uncompressed Size of the respective Blocks.
jpayne@68 698
jpayne@68 699 Implementation hint: It is possible to verify the Index with
jpayne@68 700 constant memory usage by calculating for example SHA-256 of
jpayne@68 701 both the real size values and the List of Records, then
jpayne@68 702 comparing the hash values. Implementing this using
jpayne@68 703 non-cryptographic hash like CRC32 SHOULD be avoided unless
jpayne@68 704 small code size is important.
jpayne@68 705
jpayne@68 706 If the decoder supports random-access reading, it MUST verify
jpayne@68 707 that Unpadded Size and Uncompressed Size of every completely
jpayne@68 708 decoded Block match the sizes stored in the Index. If only
jpayne@68 709 partial Block is decoded, the decoder MUST verify that the
jpayne@68 710 processed sizes don't exceed the sizes stored in the Index.
jpayne@68 711
jpayne@68 712
jpayne@68 713 4.3.1. Unpadded Size
jpayne@68 714
jpayne@68 715 This field indicates the size of the Block excluding the Block
jpayne@68 716 Padding field. That is, Unpadded Size is the size of the Block
jpayne@68 717 Header, Compressed Data, and Check fields. Unpadded Size is
jpayne@68 718 stored using the encoding described in Section 1.2. The value
jpayne@68 719 MUST never be zero; with the current structure of Blocks, the
jpayne@68 720 actual minimum value for Unpadded Size is five.
jpayne@68 721
jpayne@68 722 Implementation note: Because the size of the Block Padding
jpayne@68 723 field is not included in Unpadded Size, calculating the total
jpayne@68 724 size of a Stream or doing random-access reading requires
jpayne@68 725 calculating the actual size of the Blocks by rounding Unpadded
jpayne@68 726 Sizes up to the next multiple of four.
jpayne@68 727
jpayne@68 728 The reason to exclude Block Padding from Unpadded Size is to
jpayne@68 729 ease making a raw copy of Compressed Data without Block
jpayne@68 730 Padding. This can be useful, for example, if someone wants
jpayne@68 731 to convert Streams to some other file format quickly.
jpayne@68 732
jpayne@68 733
jpayne@68 734 4.3.2. Uncompressed Size
jpayne@68 735
jpayne@68 736 This field indicates the Uncompressed Size of the respective
jpayne@68 737 Block as bytes. The value is stored using the encoding
jpayne@68 738 described in Section 1.2.
jpayne@68 739
jpayne@68 740
jpayne@68 741 4.4. Index Padding
jpayne@68 742
jpayne@68 743 This field MUST contain 0-3 null bytes to pad the Index to
jpayne@68 744 a multiple of four bytes. If any of the bytes are not null
jpayne@68 745 bytes, the decoder MUST indicate an error.
jpayne@68 746
jpayne@68 747
jpayne@68 748 4.5. CRC32
jpayne@68 749
jpayne@68 750 The CRC32 is calculated over everything in the Index field
jpayne@68 751 except the CRC32 field itself. The CRC32 is stored as an
jpayne@68 752 unsigned 32-bit little endian integer. If the calculated
jpayne@68 753 value does not match the stored one, the decoder MUST indicate
jpayne@68 754 an error.
jpayne@68 755
jpayne@68 756
jpayne@68 757 5. Filter Chains
jpayne@68 758
jpayne@68 759 The Block Flags field defines how many filters are used. When
jpayne@68 760 more than one filter is used, the filters are chained; that is,
jpayne@68 761 the output of one filter is the input of another filter. The
jpayne@68 762 following figure illustrates the direction of data flow.
jpayne@68 763
jpayne@68 764 v Uncompressed Data ^
jpayne@68 765 | Filter 0 |
jpayne@68 766 Encoder | Filter 1 | Decoder
jpayne@68 767 | Filter n |
jpayne@68 768 v Compressed Data ^
jpayne@68 769
jpayne@68 770
jpayne@68 771 5.1. Alignment
jpayne@68 772
jpayne@68 773 Alignment of uncompressed input data is usually the job of
jpayne@68 774 the application producing the data. For example, to get the
jpayne@68 775 best results, an archiver tool should make sure that all
jpayne@68 776 PowerPC executable files in the archive stream start at
jpayne@68 777 offsets that are multiples of four bytes.
jpayne@68 778
jpayne@68 779 Some filters, for example LZMA2, can be configured to take
jpayne@68 780 advantage of specified alignment of input data. Note that
jpayne@68 781 taking advantage of aligned input can be beneficial also when
jpayne@68 782 a filter is not the first filter in the chain. For example,
jpayne@68 783 if you compress PowerPC executables, you may want to use the
jpayne@68 784 PowerPC filter and chain that with the LZMA2 filter. Because
jpayne@68 785 not only the input but also the output alignment of the PowerPC
jpayne@68 786 filter is four bytes, it is now beneficial to set LZMA2
jpayne@68 787 settings so that the LZMA2 encoder can take advantage of its
jpayne@68 788 four-byte-aligned input data.
jpayne@68 789
jpayne@68 790 The output of the last filter in the chain is stored to the
jpayne@68 791 Compressed Data field, which is is guaranteed to be aligned
jpayne@68 792 to a multiple of four bytes relative to the beginning of the
jpayne@68 793 Stream. This can increase
jpayne@68 794 - speed, if the filtered data is handled multiple bytes at
jpayne@68 795 a time by the filter-specific encoder and decoder,
jpayne@68 796 because accessing aligned data in computer memory is
jpayne@68 797 usually faster; and
jpayne@68 798 - compression ratio, if the output data is later compressed
jpayne@68 799 with an external compression tool.
jpayne@68 800
jpayne@68 801
jpayne@68 802 5.2. Security
jpayne@68 803
jpayne@68 804 If filters would be allowed to be chained freely, it would be
jpayne@68 805 possible to create malicious files, that would be very slow to
jpayne@68 806 decode. Such files could be used to create denial of service
jpayne@68 807 attacks.
jpayne@68 808
jpayne@68 809 Slow files could occur when multiple filters are chained:
jpayne@68 810
jpayne@68 811 v Compressed input data
jpayne@68 812 | Filter 1 decoder (last filter)
jpayne@68 813 | Filter 0 decoder (non-last filter)
jpayne@68 814 v Uncompressed output data
jpayne@68 815
jpayne@68 816 The decoder of the last filter in the chain produces a lot of
jpayne@68 817 output from little input. Another filter in the chain takes the
jpayne@68 818 output of the last filter, and produces very little output
jpayne@68 819 while consuming a lot of input. As a result, a lot of data is
jpayne@68 820 moved inside the filter chain, but the filter chain as a whole
jpayne@68 821 gets very little work done.
jpayne@68 822
jpayne@68 823 To prevent this kind of slow files, there are restrictions on
jpayne@68 824 how the filters can be chained. These restrictions MUST be
jpayne@68 825 taken into account when designing new filters.
jpayne@68 826
jpayne@68 827 The maximum number of filters in the chain has been limited to
jpayne@68 828 four, thus there can be at maximum of three non-last filters.
jpayne@68 829 Of these three non-last filters, only two are allowed to change
jpayne@68 830 the size of the data.
jpayne@68 831
jpayne@68 832 The non-last filters, that change the size of the data, MUST
jpayne@68 833 have a limit how much the decoder can compress the data: the
jpayne@68 834 decoder SHOULD produce at least n bytes of output when the
jpayne@68 835 filter is given 2n bytes of input. This limit is not
jpayne@68 836 absolute, but significant deviations MUST be avoided.
jpayne@68 837
jpayne@68 838 The above limitations guarantee that if the last filter in the
jpayne@68 839 chain produces 4n bytes of output, the chain as a whole will
jpayne@68 840 produce at least n bytes of output.
jpayne@68 841
jpayne@68 842
jpayne@68 843 5.3. Filters
jpayne@68 844
jpayne@68 845 5.3.1. LZMA2
jpayne@68 846
jpayne@68 847 LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
jpayne@68 848 compression algorithm with high compression ratio and fast
jpayne@68 849 decompression. LZMA is based on LZ77 and range coding
jpayne@68 850 algorithms.
jpayne@68 851
jpayne@68 852 LZMA2 is an extension on top of the original LZMA. LZMA2 uses
jpayne@68 853 LZMA internally, but adds support for flushing the encoder,
jpayne@68 854 uncompressed chunks, eases stateful decoder implementations,
jpayne@68 855 and improves support for multithreading. Thus, the plain LZMA
jpayne@68 856 will not be supported in this file format.
jpayne@68 857
jpayne@68 858 Filter ID: 0x21
jpayne@68 859 Size of Filter Properties: 1 byte
jpayne@68 860 Changes size of data: Yes
jpayne@68 861 Allow as a non-last filter: No
jpayne@68 862 Allow as the last filter: Yes
jpayne@68 863
jpayne@68 864 Preferred alignment:
jpayne@68 865 Input data: Adjustable to 1/2/4/8/16 byte(s)
jpayne@68 866 Output data: 1 byte
jpayne@68 867
jpayne@68 868 The format of the one-byte Filter Properties field is as
jpayne@68 869 follows:
jpayne@68 870
jpayne@68 871 Bits Mask Description
jpayne@68 872 0-5 0x3F Dictionary Size
jpayne@68 873 6-7 0xC0 Reserved for future use; MUST be zero for now.
jpayne@68 874
jpayne@68 875 Dictionary Size is encoded with one-bit mantissa and five-bit
jpayne@68 876 exponent. The smallest dictionary size is 4 KiB and the biggest
jpayne@68 877 is 4 GiB.
jpayne@68 878
jpayne@68 879 Raw value Mantissa Exponent Dictionary size
jpayne@68 880 0 2 11 4 KiB
jpayne@68 881 1 3 11 6 KiB
jpayne@68 882 2 2 12 8 KiB
jpayne@68 883 3 3 12 12 KiB
jpayne@68 884 4 2 13 16 KiB
jpayne@68 885 5 3 13 24 KiB
jpayne@68 886 6 2 14 32 KiB
jpayne@68 887 ... ... ... ...
jpayne@68 888 35 3 27 768 MiB
jpayne@68 889 36 2 28 1024 MiB
jpayne@68 890 37 3 29 1536 MiB
jpayne@68 891 38 2 30 2048 MiB
jpayne@68 892 39 3 30 3072 MiB
jpayne@68 893 40 2 31 4096 MiB - 1 B
jpayne@68 894
jpayne@68 895 Instead of having a table in the decoder, the dictionary size
jpayne@68 896 can be decoded using the following C code:
jpayne@68 897
jpayne@68 898 const uint8_t bits = get_dictionary_flags() & 0x3F;
jpayne@68 899 if (bits > 40)
jpayne@68 900 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
jpayne@68 901
jpayne@68 902 uint32_t dictionary_size;
jpayne@68 903 if (bits == 40) {
jpayne@68 904 dictionary_size = UINT32_MAX;
jpayne@68 905 } else {
jpayne@68 906 dictionary_size = 2 | (bits & 1);
jpayne@68 907 dictionary_size <<= bits / 2 + 11;
jpayne@68 908 }
jpayne@68 909
jpayne@68 910
jpayne@68 911 5.3.2. Branch/Call/Jump Filters for Executables
jpayne@68 912
jpayne@68 913 These filters convert relative branch, call, and jump
jpayne@68 914 instructions to their absolute counterparts in executable
jpayne@68 915 files. This conversion increases redundancy and thus
jpayne@68 916 compression ratio.
jpayne@68 917
jpayne@68 918 Size of Filter Properties: 0 or 4 bytes
jpayne@68 919 Changes size of data: No
jpayne@68 920 Allow as a non-last filter: Yes
jpayne@68 921 Allow as the last filter: No
jpayne@68 922
jpayne@68 923 Below is the list of filters in this category. The alignment
jpayne@68 924 is the same for both input and output data.
jpayne@68 925
jpayne@68 926 Filter ID Alignment Description
jpayne@68 927 0x04 1 byte x86 filter (BCJ)
jpayne@68 928 0x05 4 bytes PowerPC (big endian) filter
jpayne@68 929 0x06 16 bytes IA64 filter
jpayne@68 930 0x07 4 bytes ARM filter [1]
jpayne@68 931 0x08 2 bytes ARM Thumb filter [1]
jpayne@68 932 0x09 4 bytes SPARC filter
jpayne@68 933 0x0A 4 bytes ARM64 filter [2]
jpayne@68 934 0x0B 2 bytes RISC-V filter
jpayne@68 935
jpayne@68 936 [1] These are for little endian instruction encoding.
jpayne@68 937 This must not be confused with data endianness.
jpayne@68 938 A processor configured for big endian data access
jpayne@68 939 may still use little endian instruction encoding.
jpayne@68 940 The filters don't care about the data endianness.
jpayne@68 941
jpayne@68 942 [2] 4096-byte alignment gives the best results
jpayne@68 943 because the address in the ADRP instruction
jpayne@68 944 is a multiple of 4096 bytes.
jpayne@68 945
jpayne@68 946 If the size of Filter Properties is four bytes, the Filter
jpayne@68 947 Properties field contains the start offset used for address
jpayne@68 948 conversions. It is stored as an unsigned 32-bit little endian
jpayne@68 949 integer. The start offset MUST be a multiple of the alignment
jpayne@68 950 of the filter as listed in the table above; if it isn't, the
jpayne@68 951 decoder MUST indicate an error. If the size of Filter
jpayne@68 952 Properties is zero, the start offset is zero.
jpayne@68 953
jpayne@68 954 Setting the start offset may be useful if an executable has
jpayne@68 955 multiple sections, and there are many cross-section calls.
jpayne@68 956 Taking advantage of this feature usually requires usage of
jpayne@68 957 the Subblock filter, whose design is not complete yet.
jpayne@68 958
jpayne@68 959
jpayne@68 960 5.3.3. Delta
jpayne@68 961
jpayne@68 962 The Delta filter may increase compression ratio when the value
jpayne@68 963 of the next byte correlates with the value of an earlier byte
jpayne@68 964 at specified distance.
jpayne@68 965
jpayne@68 966 Filter ID: 0x03
jpayne@68 967 Size of Filter Properties: 1 byte
jpayne@68 968 Changes size of data: No
jpayne@68 969 Allow as a non-last filter: Yes
jpayne@68 970 Allow as the last filter: No
jpayne@68 971
jpayne@68 972 Preferred alignment:
jpayne@68 973 Input data: 1 byte
jpayne@68 974 Output data: Same as the original input data
jpayne@68 975
jpayne@68 976 The Properties byte indicates the delta distance, which can be
jpayne@68 977 1-256 bytes backwards from the current byte: 0x00 indicates
jpayne@68 978 distance of 1 byte and 0xFF distance of 256 bytes.
jpayne@68 979
jpayne@68 980
jpayne@68 981 5.3.3.1. Format of the Encoded Output
jpayne@68 982
jpayne@68 983 The code below illustrates both encoding and decoding with
jpayne@68 984 the Delta filter.
jpayne@68 985
jpayne@68 986 // Distance is in the range [1, 256].
jpayne@68 987 const unsigned int distance = get_properties_byte() + 1;
jpayne@68 988 uint8_t pos = 0;
jpayne@68 989 uint8_t delta[256];
jpayne@68 990
jpayne@68 991 memset(delta, 0, sizeof(delta));
jpayne@68 992
jpayne@68 993 while (1) {
jpayne@68 994 const int byte = read_byte();
jpayne@68 995 if (byte == EOF)
jpayne@68 996 break;
jpayne@68 997
jpayne@68 998 uint8_t tmp = delta[(uint8_t)(distance + pos)];
jpayne@68 999 if (is_encoder) {
jpayne@68 1000 tmp = (uint8_t)(byte) - tmp;
jpayne@68 1001 delta[pos] = (uint8_t)(byte);
jpayne@68 1002 } else {
jpayne@68 1003 tmp = (uint8_t)(byte) + tmp;
jpayne@68 1004 delta[pos] = tmp;
jpayne@68 1005 }
jpayne@68 1006
jpayne@68 1007 write_byte(tmp);
jpayne@68 1008 --pos;
jpayne@68 1009 }
jpayne@68 1010
jpayne@68 1011
jpayne@68 1012 5.4. Custom Filter IDs
jpayne@68 1013
jpayne@68 1014 If a developer wants to use custom Filter IDs, there are two
jpayne@68 1015 choices. The first choice is to contact Lasse Collin and ask
jpayne@68 1016 him to allocate a range of IDs for the developer.
jpayne@68 1017
jpayne@68 1018 The second choice is to generate a 40-bit random integer
jpayne@68 1019 which the developer can use as a personal Developer ID.
jpayne@68 1020 To minimize the risk of collisions, Developer ID has to be
jpayne@68 1021 a randomly generated integer, not manually selected "hex word".
jpayne@68 1022 The following command, which works on many free operating
jpayne@68 1023 systems, can be used to generate Developer ID:
jpayne@68 1024
jpayne@68 1025 dd if=/dev/urandom bs=5 count=1 | hexdump
jpayne@68 1026
jpayne@68 1027 The developer can then use the Developer ID to create unique
jpayne@68 1028 (well, hopefully unique) Filter IDs.
jpayne@68 1029
jpayne@68 1030 Bits Mask Description
jpayne@68 1031 0-15 0x0000_0000_0000_FFFF Filter ID
jpayne@68 1032 16-55 0x00FF_FFFF_FFFF_0000 Developer ID
jpayne@68 1033 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F
jpayne@68 1034
jpayne@68 1035 The resulting 63-bit integer will use 9 bytes of space when
jpayne@68 1036 stored using the encoding described in Section 1.2. To get
jpayne@68 1037 a shorter ID, see the beginning of this Section how to
jpayne@68 1038 request a custom ID range.
jpayne@68 1039
jpayne@68 1040
jpayne@68 1041 5.4.1. Reserved Custom Filter ID Ranges
jpayne@68 1042
jpayne@68 1043 Range Description
jpayne@68 1044 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility
jpayne@68 1045 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility
jpayne@68 1046 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility
jpayne@68 1047
jpayne@68 1048
jpayne@68 1049 6. Cyclic Redundancy Checks
jpayne@68 1050
jpayne@68 1051 There are several incompatible variations to calculate CRC32
jpayne@68 1052 and CRC64. For simplicity and clarity, complete examples are
jpayne@68 1053 provided to calculate the checks as they are used in this file
jpayne@68 1054 format. Implementations MAY use different code as long as it
jpayne@68 1055 gives identical results.
jpayne@68 1056
jpayne@68 1057 The program below reads data from standard input, calculates
jpayne@68 1058 the CRC32 and CRC64 values, and prints the calculated values
jpayne@68 1059 as big endian hexadecimal strings to standard output.
jpayne@68 1060
jpayne@68 1061 #include <stddef.h>
jpayne@68 1062 #include <inttypes.h>
jpayne@68 1063 #include <stdio.h>
jpayne@68 1064
jpayne@68 1065 uint32_t crc32_table[256];
jpayne@68 1066 uint64_t crc64_table[256];
jpayne@68 1067
jpayne@68 1068 void
jpayne@68 1069 init(void)
jpayne@68 1070 {
jpayne@68 1071 static const uint32_t poly32 = UINT32_C(0xEDB88320);
jpayne@68 1072 static const uint64_t poly64
jpayne@68 1073 = UINT64_C(0xC96C5795D7870F42);
jpayne@68 1074
jpayne@68 1075 for (size_t i = 0; i < 256; ++i) {
jpayne@68 1076 uint32_t crc32 = i;
jpayne@68 1077 uint64_t crc64 = i;
jpayne@68 1078
jpayne@68 1079 for (size_t j = 0; j < 8; ++j) {
jpayne@68 1080 if (crc32 & 1)
jpayne@68 1081 crc32 = (crc32 >> 1) ^ poly32;
jpayne@68 1082 else
jpayne@68 1083 crc32 >>= 1;
jpayne@68 1084
jpayne@68 1085 if (crc64 & 1)
jpayne@68 1086 crc64 = (crc64 >> 1) ^ poly64;
jpayne@68 1087 else
jpayne@68 1088 crc64 >>= 1;
jpayne@68 1089 }
jpayne@68 1090
jpayne@68 1091 crc32_table[i] = crc32;
jpayne@68 1092 crc64_table[i] = crc64;
jpayne@68 1093 }
jpayne@68 1094 }
jpayne@68 1095
jpayne@68 1096 uint32_t
jpayne@68 1097 crc32(const uint8_t *buf, size_t size, uint32_t crc)
jpayne@68 1098 {
jpayne@68 1099 crc = ~crc;
jpayne@68 1100 for (size_t i = 0; i < size; ++i)
jpayne@68 1101 crc = crc32_table[buf[i] ^ (crc & 0xFF)]
jpayne@68 1102 ^ (crc >> 8);
jpayne@68 1103 return ~crc;
jpayne@68 1104 }
jpayne@68 1105
jpayne@68 1106 uint64_t
jpayne@68 1107 crc64(const uint8_t *buf, size_t size, uint64_t crc)
jpayne@68 1108 {
jpayne@68 1109 crc = ~crc;
jpayne@68 1110 for (size_t i = 0; i < size; ++i)
jpayne@68 1111 crc = crc64_table[buf[i] ^ (crc & 0xFF)]
jpayne@68 1112 ^ (crc >> 8);
jpayne@68 1113 return ~crc;
jpayne@68 1114 }
jpayne@68 1115
jpayne@68 1116 int
jpayne@68 1117 main()
jpayne@68 1118 {
jpayne@68 1119 init();
jpayne@68 1120
jpayne@68 1121 uint32_t value32 = 0;
jpayne@68 1122 uint64_t value64 = 0;
jpayne@68 1123 uint64_t total_size = 0;
jpayne@68 1124 uint8_t buf[8192];
jpayne@68 1125
jpayne@68 1126 while (1) {
jpayne@68 1127 const size_t buf_size
jpayne@68 1128 = fread(buf, 1, sizeof(buf), stdin);
jpayne@68 1129 if (buf_size == 0)
jpayne@68 1130 break;
jpayne@68 1131
jpayne@68 1132 total_size += buf_size;
jpayne@68 1133 value32 = crc32(buf, buf_size, value32);
jpayne@68 1134 value64 = crc64(buf, buf_size, value64);
jpayne@68 1135 }
jpayne@68 1136
jpayne@68 1137 printf("Bytes: %" PRIu64 "\n", total_size);
jpayne@68 1138 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
jpayne@68 1139 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
jpayne@68 1140
jpayne@68 1141 return 0;
jpayne@68 1142 }
jpayne@68 1143
jpayne@68 1144
jpayne@68 1145 7. References
jpayne@68 1146
jpayne@68 1147 LZMA SDK - The original LZMA implementation
jpayne@68 1148 https://7-zip.org/sdk.html
jpayne@68 1149
jpayne@68 1150 LZMA Utils - LZMA adapted to POSIX-like systems
jpayne@68 1151 https://tukaani.org/lzma/
jpayne@68 1152
jpayne@68 1153 XZ Utils - The next generation of LZMA Utils
jpayne@68 1154 https://tukaani.org/xz/
jpayne@68 1155
jpayne@68 1156 [RFC-1952]
jpayne@68 1157 GZIP file format specification version 4.3
jpayne@68 1158 https://www.ietf.org/rfc/rfc1952.txt
jpayne@68 1159 - Notation of byte boxes in section "2.1. Overall conventions"
jpayne@68 1160
jpayne@68 1161 [RFC-2119]
jpayne@68 1162 Key words for use in RFCs to Indicate Requirement Levels
jpayne@68 1163 https://www.ietf.org/rfc/rfc2119.txt
jpayne@68 1164
jpayne@68 1165 [GNU-tar]
jpayne@68 1166 GNU tar 1.35 manual
jpayne@68 1167 https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
jpayne@68 1168 - Node 9.4.2 "Blocking Factor", paragraph that begins
jpayne@68 1169 "gzip will complain about trailing garbage"
jpayne@68 1170 - Note that this URL points to the latest version of the
jpayne@68 1171 manual, and may some day not contain the note which is in
jpayne@68 1172 1.35. For the exact version of the manual, download GNU
jpayne@68 1173 tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz
jpayne@68 1174