jpayne@68: 
jpayne@68: The .xz File Format
jpayne@68: ===================
jpayne@68: 
jpayne@68: Version 1.2.1 (2024-04-08)
jpayne@68: 
jpayne@68: 
jpayne@68:         0. Preface
jpayne@68:            0.1. Notices and Acknowledgements
jpayne@68:            0.2. Getting the Latest Version
jpayne@68:            0.3. Version History
jpayne@68:         1. Conventions
jpayne@68:            1.1. Byte and Its Representation
jpayne@68:            1.2. Multibyte Integers
jpayne@68:         2. Overall Structure of .xz File
jpayne@68:            2.1. Stream
jpayne@68:                 2.1.1. Stream Header
jpayne@68:                        2.1.1.1. Header Magic Bytes
jpayne@68:                        2.1.1.2. Stream Flags
jpayne@68:                        2.1.1.3. CRC32
jpayne@68:                 2.1.2. Stream Footer
jpayne@68:                        2.1.2.1. CRC32
jpayne@68:                        2.1.2.2. Backward Size
jpayne@68:                        2.1.2.3. Stream Flags
jpayne@68:                        2.1.2.4. Footer Magic Bytes
jpayne@68:            2.2. Stream Padding
jpayne@68:         3. Block
jpayne@68:            3.1. Block Header
jpayne@68:                 3.1.1. Block Header Size
jpayne@68:                 3.1.2. Block Flags
jpayne@68:                 3.1.3. Compressed Size
jpayne@68:                 3.1.4. Uncompressed Size
jpayne@68:                 3.1.5. List of Filter Flags
jpayne@68:                 3.1.6. Header Padding
jpayne@68:                 3.1.7. CRC32
jpayne@68:            3.2. Compressed Data
jpayne@68:            3.3. Block Padding
jpayne@68:            3.4. Check
jpayne@68:         4. Index
jpayne@68:            4.1. Index Indicator
jpayne@68:            4.2. Number of Records
jpayne@68:            4.3. List of Records
jpayne@68:                 4.3.1. Unpadded Size
jpayne@68:                 4.3.2. Uncompressed Size
jpayne@68:            4.4. Index Padding
jpayne@68:            4.5. CRC32
jpayne@68:         5. Filter Chains
jpayne@68:            5.1. Alignment
jpayne@68:            5.2. Security
jpayne@68:            5.3. Filters
jpayne@68:                 5.3.1. LZMA2
jpayne@68:                 5.3.2. Branch/Call/Jump Filters for Executables
jpayne@68:                 5.3.3. Delta
jpayne@68:                        5.3.3.1. Format of the Encoded Output
jpayne@68:            5.4. Custom Filter IDs
jpayne@68:                 5.4.1. Reserved Custom Filter ID Ranges
jpayne@68:         6. Cyclic Redundancy Checks
jpayne@68:         7. References
jpayne@68: 
jpayne@68: 
jpayne@68: 0. Preface
jpayne@68: 
jpayne@68:         This document describes the .xz file format (filename suffix
jpayne@68:         ".xz", MIME type "application/x-xz"). It is intended that
jpayne@68:         this format replace the old .lzma format used by LZMA SDK and
jpayne@68:         LZMA Utils.
jpayne@68: 
jpayne@68: 
jpayne@68: 0.1. Notices and Acknowledgements
jpayne@68: 
jpayne@68:         This file format was designed by Lasse Collin
jpayne@68:         <lasse.collin@tukaani.org> and Igor Pavlov.
jpayne@68: 
jpayne@68:         Special thanks for helping with this document go to
jpayne@68:         Ville Koskinen. Thanks for helping with this document go to
jpayne@68:         Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
jpayne@68: 
jpayne@68:         This document has been put into the public domain.
jpayne@68: 
jpayne@68: 
jpayne@68: 0.2. Getting the Latest Version
jpayne@68: 
jpayne@68:         The latest official version of this document can be downloaded
jpayne@68:         from <https://tukaani.org/xz/xz-file-format.txt>.
jpayne@68: 
jpayne@68:         Specific versions of this document have a filename
jpayne@68:         xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
jpayne@68:         For example, the version 1.0.0 of this document is available
jpayne@68:         at <https://tukaani.org/xz/xz-file-format-1.0.0.txt>.
jpayne@68: 
jpayne@68: 
jpayne@68: 0.3. Version History
jpayne@68: 
jpayne@68:         Version   Date          Description
jpayne@68: 
jpayne@68:         1.2.1     2024-04-08    The URLs of this specification and
jpayne@68:                                 XZ Utils were changed back to the
jpayne@68:                                 original ones in Sections 0.2 and 7.
jpayne@68: 
jpayne@68:         1.2.0     2024-01-19    Added RISC-V filter and updated URLs in
jpayne@68:                                 Sections 0.2 and 7. The URL of this
jpayne@68:                                 specification was changed.
jpayne@68: 
jpayne@68:         1.1.0     2022-12-11    Added ARM64 filter and clarified 32-bit
jpayne@68:                                 ARM endianness in Section 5.3.2,
jpayne@68:                                 language improvements in Section 5.4
jpayne@68: 
jpayne@68:         1.0.4     2009-08-27    Language improvements in Sections 1.2,
jpayne@68:                                 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
jpayne@68: 
jpayne@68:         1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
jpayne@68: 
jpayne@68:         1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
jpayne@68: 
jpayne@68:         1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
jpayne@68:                                 clarifications to Sections 2, 2.2,
jpayne@68:                                 3.3, 4.4, and 5.3.2
jpayne@68: 
jpayne@68:         1.0.0     2009-01-14    The first official version
jpayne@68: 
jpayne@68: 
jpayne@68: 1. Conventions
jpayne@68: 
jpayne@68:         The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
jpayne@68:         "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
jpayne@68:         document are to be interpreted as described in [RFC-2119].
jpayne@68: 
jpayne@68:         Indicating a warning means displaying a message, returning
jpayne@68:         appropriate exit status, or doing something else to let the
jpayne@68:         user know that something worth warning about occurred. The
jpayne@68:         operation SHOULD still finish if a warning is indicated.
jpayne@68: 
jpayne@68:         Indicating an error means displaying a message, returning
jpayne@68:         appropriate exit status, or doing something else to let the
jpayne@68:         user know that something prevented successfully finishing the
jpayne@68:         operation. The operation MUST be aborted once an error has
jpayne@68:         been indicated.
jpayne@68: 
jpayne@68: 
jpayne@68: 1.1. Byte and Its Representation
jpayne@68: 
jpayne@68:         In this document, a byte is always 8 bits.
jpayne@68: 
jpayne@68:         A "null byte" has all bits unset. That is, the value of a null
jpayne@68:         byte is 0x00.
jpayne@68: 
jpayne@68:         To represent byte blocks, this document uses notation that
jpayne@68:         is similar to the notation used in [RFC-1952]:
jpayne@68: 
jpayne@68:             +-------+
jpayne@68:             |  Foo  |   One byte.
jpayne@68:             +-------+
jpayne@68: 
jpayne@68:             +---+---+
jpayne@68:             |  Foo  |   Two bytes; that is, some of the vertical bars
jpayne@68:             +---+---+   can be missing.
jpayne@68: 
jpayne@68:             +=======+
jpayne@68:             |  Foo  |   Zero or more bytes.
jpayne@68:             +=======+
jpayne@68: 
jpayne@68:         In this document, a boxed byte or a byte sequence declared
jpayne@68:         using this notation is called "a field". The example field
jpayne@68:         above would be called "the Foo field" or plain "Foo".
jpayne@68: 
jpayne@68:         If there are many fields, they may be split across multiple
jpayne@68:         lines. This is indicated with an arrow ("--->"):
jpayne@68: 
jpayne@68:             +=====+
jpayne@68:             | Foo |
jpayne@68:             +=====+
jpayne@68: 
jpayne@68:                  +=====+
jpayne@68:             ---> | Bar |
jpayne@68:                  +=====+
jpayne@68: 
jpayne@68:         The above is equivalent to this:
jpayne@68: 
jpayne@68:             +=====+=====+
jpayne@68:             | Foo | Bar |
jpayne@68:             +=====+=====+
jpayne@68: 
jpayne@68: 
jpayne@68: 1.2. Multibyte Integers
jpayne@68: 
jpayne@68:         Multibyte integers of static length, such as CRC values,
jpayne@68:         are stored in little endian byte order (least significant
jpayne@68:         byte first).
jpayne@68: 
jpayne@68:         When smaller values are more likely than bigger values (for
jpayne@68:         example file sizes), multibyte integers are encoded in a
jpayne@68:         variable-length representation:
jpayne@68:           - Numbers in the range [0, 127] are copied as is, and take
jpayne@68:             one byte of space.
jpayne@68:           - Bigger numbers will occupy two or more bytes. All but the
jpayne@68:             last byte of the multibyte representation have the highest
jpayne@68:             (eighth) bit set.
jpayne@68: 
jpayne@68:         For now, the value of the variable-length integers is limited
jpayne@68:         to 63 bits, which limits the encoded size of the integer to
jpayne@68:         nine bytes. These limits may be increased in the future if
jpayne@68:         needed.
jpayne@68: 
jpayne@68:         The following C code illustrates encoding and decoding of
jpayne@68:         variable-length integers. The functions return the number of
jpayne@68:         bytes occupied by the integer (1-9), or zero on error.
jpayne@68: 
jpayne@68:             #include <stddef.h>
jpayne@68:             #include <inttypes.h>
jpayne@68: 
jpayne@68:             size_t
jpayne@68:             encode(uint8_t buf[static 9], uint64_t num)
jpayne@68:             {
jpayne@68:                 /* Values wider than 63 bits cannot be encoded. */
jpayne@68:                 if (num > UINT64_MAX / 2)
jpayne@68:                     return 0;
jpayne@68: 
jpayne@68:                 size_t i = 0;
jpayne@68: 
jpayne@68:                 /* Emit seven bits per byte, least significant
jpayne@68:                    group first; the highest bit marks that more
jpayne@68:                    bytes follow. */
jpayne@68:                 while (num >= 0x80) {
jpayne@68:                     buf[i++] = (uint8_t)(num) | 0x80;
jpayne@68:                     num >>= 7;
jpayne@68:                 }
jpayne@68: 
jpayne@68:                 /* The last byte has the highest bit unset. */
jpayne@68:                 buf[i++] = (uint8_t)(num);
jpayne@68: 
jpayne@68:                 return i;
jpayne@68:             }
jpayne@68: 
jpayne@68:             size_t
jpayne@68:             decode(const uint8_t buf[], size_t size_max, uint64_t *num)
jpayne@68:             {
jpayne@68:                 if (size_max == 0)
jpayne@68:                     return 0;
jpayne@68: 
jpayne@68:                 /* An encoded integer never exceeds nine bytes. */
jpayne@68:                 if (size_max > 9)
jpayne@68:                     size_max = 9;
jpayne@68: 
jpayne@68:                 *num = buf[0] & 0x7F;
jpayne@68:                 size_t i = 0;
jpayne@68: 
jpayne@68:                 while (buf[i++] & 0x80) {
jpayne@68:                     /* Reject truncated input and non-minimal
jpayne@68:                        encodings that end with a null byte. */
jpayne@68:                     if (i >= size_max || buf[i] == 0x00)
jpayne@68:                         return 0;
jpayne@68: 
jpayne@68:                     *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
jpayne@68:                 }
jpayne@68: 
jpayne@68:                 return i;
jpayne@68:             }
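
As a worked example of the encoding in Section 1.2, the value 300
(0x12C) is stored as the two bytes 0xAC 0x02: the low seven bits
(0x2C) get the highest bit set, and the remaining bits (0x02) follow
as the last byte. The sketch below is not part of the format. It
introduces a hypothetical helper, vli_size(), which only computes how
many bytes such an encoding would occupy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not defined by this document): returns the
 * number of bytes the variable-length encoding of num occupies (1-9),
 * or zero if num does not fit in 63 bits. */
size_t
vli_size(uint64_t num)
{
    if (num > UINT64_MAX / 2)
        return 0;

    size_t size = 1;

    /* Every encoded byte carries seven bits of the value. */
    while (num >= 0x80) {
        num >>= 7;
        ++size;
    }

    return size;
}
```

A helper like this can be useful for sizing a buffer before encoding
a header field.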
jpayne@68: 
jpayne@68: 
jpayne@68: 2. Overall Structure of .xz File
jpayne@68: 
jpayne@68:         A standalone .xz file consists of one or more Streams which may
jpayne@68:         have Stream Padding between or after them:
jpayne@68: 
jpayne@68:             +========+================+========+================+
jpayne@68:             | Stream | Stream Padding | Stream | Stream Padding | ...
jpayne@68:             +========+================+========+================+
jpayne@68: 
jpayne@68:         The sizes of Stream and Stream Padding are always multiples
jpayne@68:         of four bytes, thus the size of every valid .xz file MUST be
jpayne@68:         a multiple of four bytes.
jpayne@68: 
jpayne@68:         While a typical file contains only one Stream and no Stream
jpayne@68:         Padding, a decoder handling standalone .xz files SHOULD support
jpayne@68:         files that have more than one Stream or Stream Padding.
jpayne@68: 
jpayne@68:         In contrast to standalone .xz files, when the .xz file format
jpayne@68:         is used as an internal part of some other file format or
jpayne@68:         communication protocol, it usually is expected that the decoder
jpayne@68:         stops after the first Stream, and doesn't look for Stream
jpayne@68:         Padding or possibly other Streams.
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1. Stream
jpayne@68: 
jpayne@68:         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
jpayne@68:         |     Stream Header     | Block | Block | ... | Block |
jpayne@68:         +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
jpayne@68: 
jpayne@68:              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
jpayne@68:         ---> | Index |     Stream Footer     |
jpayne@68:              +=======+-+-+-+-+-+-+-+-+-+-+-+-+
jpayne@68: 
jpayne@68:         All the above fields have a size that is a multiple of four. If
jpayne@68:         Stream is used as an internal part of another file format, it
jpayne@68:         is RECOMMENDED to make the Stream start at an offset that is
jpayne@68:         a multiple of four bytes.
jpayne@68: 
jpayne@68:         Stream Header, Index, and Stream Footer are always present in
jpayne@68:         a Stream. The maximum size of the Index field is 16 GiB
jpayne@68:         (2^34 bytes).
jpayne@68: 
jpayne@68:         There are zero or more Blocks. The maximum number of Blocks is
jpayne@68:         limited only by the maximum size of the Index field.
jpayne@68: 
jpayne@68:         The total size of a Stream MUST be less than 8 EiB (2^63
jpayne@68:         bytes). The same limit applies to the total amount of
jpayne@68:         uncompressed data stored in a Stream.
jpayne@68: 
jpayne@68:         If an implementation supports handling .xz files with multiple
jpayne@68:         concatenated Streams, it MAY apply the above limits to the file
jpayne@68:         as a whole instead of applying them to each Stream separately.
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.1. Stream Header
jpayne@68: 
jpayne@68:         +---+---+---+---+---+---+-------+------+--+--+--+--+
jpayne@68:         |  Header Magic Bytes   | Stream Flags |   CRC32   |
jpayne@68:         +---+---+---+---+---+---+-------+------+--+--+--+--+
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.1.1. Header Magic Bytes
jpayne@68: 
jpayne@68:         The first six (6) bytes of the Stream are the so-called Header
jpayne@68:         Magic Bytes. They can be used to identify the file type.
jpayne@68: 
jpayne@68:             Using a C array and ASCII:
jpayne@68:             const uint8_t HEADER_MAGIC[6]
jpayne@68:                     = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
jpayne@68: 
jpayne@68:             In plain hexadecimal:
jpayne@68:             FD 37 7A 58 5A 00
jpayne@68: 
jpayne@68:         Notes:
jpayne@68:           - The first byte (0xFD) was chosen so that the files cannot
jpayne@68:             be erroneously detected as being in .lzma format, in which
jpayne@68:             the first byte is in the range [0x00, 0xE0].
jpayne@68:           - The sixth byte (0x00) was chosen to prevent applications
jpayne@68:             from misdetecting the file as a text file.
jpayne@68: 
jpayne@68:         If the Header Magic Bytes don't match, the decoder MUST
jpayne@68:         indicate an error.
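
The magic check described above can be sketched as follows. The
function name has_header_magic() is an illustrative assumption; only
the byte values come from this document.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

const uint8_t HEADER_MAGIC[6]
        = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

/* Hypothetical helper: returns true if buf begins with the Header
 * Magic Bytes. A caller would indicate an error when it is false. */
bool
has_header_magic(const uint8_t *buf, size_t size)
{
    return size >= sizeof(HEADER_MAGIC)
            && memcmp(buf, HEADER_MAGIC, sizeof(HEADER_MAGIC)) == 0;
}
```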
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.1.2. Stream Flags
jpayne@68: 
jpayne@68:         The first byte of Stream Flags is always a null byte. In the
jpayne@68:         future, this byte may be used to indicate a new Stream version
jpayne@68:         or other Stream properties.
jpayne@68: 
jpayne@68:         The second byte of Stream Flags is a bit field:
jpayne@68: 
jpayne@68:             Bit(s)  Mask  Description
jpayne@68:              0-3    0x0F  Type of Check (see Section 3.4):
jpayne@68:                               ID    Size      Check name
jpayne@68:                               0x00   0 bytes  None
jpayne@68:                               0x01   4 bytes  CRC32
jpayne@68:                               0x02   4 bytes  (Reserved)
jpayne@68:                               0x03   4 bytes  (Reserved)
jpayne@68:                               0x04   8 bytes  CRC64
jpayne@68:                               0x05   8 bytes  (Reserved)
jpayne@68:                               0x06   8 bytes  (Reserved)
jpayne@68:                               0x07  16 bytes  (Reserved)
jpayne@68:                               0x08  16 bytes  (Reserved)
jpayne@68:                               0x09  16 bytes  (Reserved)
jpayne@68:                               0x0A  32 bytes  SHA-256
jpayne@68:                               0x0B  32 bytes  (Reserved)
jpayne@68:                               0x0C  32 bytes  (Reserved)
jpayne@68:                               0x0D  64 bytes  (Reserved)
jpayne@68:                               0x0E  64 bytes  (Reserved)
jpayne@68:                               0x0F  64 bytes  (Reserved)
jpayne@68:              4-7    0xF0  Reserved for future use; MUST be zero for now.
jpayne@68: 
jpayne@68:         Implementations SHOULD support at least the Check IDs 0x00
jpayne@68:         (None) and 0x01 (CRC32). Supporting other Check IDs is
jpayne@68:         OPTIONAL. If an unsupported Check is used, the decoder SHOULD
jpayne@68:         indicate a warning or error.
jpayne@68: 
jpayne@68:         If any reserved bit is set, the decoder MUST indicate an error.
jpayne@68:         It is possible that there is a new field present which the
jpayne@68:         decoder is not aware of, and can thus parse the Stream Header
jpayne@68:         incorrectly.
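
The table above follows a regular pattern: after ID 0x00, every group
of three IDs doubles the Check size. The sketch below relies on that
observation. The names parse_stream_flags() and check_size() are
hypothetical, and rejecting a non-null first byte is an assumption of
this sketch, since the text above only states that the byte is always
a null byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: validates the two Stream Flags bytes and
 * stores the Check ID. Returns false if the first byte is not a null
 * byte (assumption: treated as unsupported) or a reserved bit is set. */
bool
parse_stream_flags(const uint8_t flags[static 2], uint8_t *check_id)
{
    if (flags[0] != 0x00 || (flags[1] & 0xF0) != 0)
        return false;

    *check_id = flags[1] & 0x0F;
    return true;
}

/* Size of the Check field in bytes for a Check ID in the range
 * [0x00, 0x0F], derived from the table in this section. */
size_t
check_size(uint8_t check_id)
{
    if (check_id == 0x00)
        return 0;

    /* IDs 1-3 are 4 bytes, 4-6 are 8, 7-9 are 16, and so on. */
    return (size_t)4 << ((check_id - 1) / 3);
}
```

Knowing the Check size for reserved IDs lets a decoder skip an
unsupported Check instead of failing to find the next Block.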
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.1.3. CRC32
jpayne@68: 
jpayne@68:         The CRC32 is calculated from the Stream Flags field. It is
jpayne@68:         stored as an unsigned 32-bit little endian integer. If the
jpayne@68:         calculated value does not match the stored one, the decoder
jpayne@68:         MUST indicate an error.
jpayne@68: 
jpayne@68:         The idea is that Stream Flags would always be two bytes, even
jpayne@68:         if new features are needed. This way old decoders will be able
jpayne@68:         to verify the CRC32 calculated from Stream Flags, and thus
jpayne@68:         distinguish between corrupt files (CRC32 doesn't match) and
jpayne@68:         files that the decoder doesn't support (CRC32 matches but
jpayne@68:         Stream Flags has reserved bits set).
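
The distinction between corrupt and unsupported files can be sketched
as below. This is a minimal illustration, not a reference
implementation: the names crc32_calc(), check_stream_header(), and
the status enumeration are hypothetical, and the CRC32 is assumed to
be the common IEEE 802.3 variant (reflected polynomial 0xEDB88320).
The bitwise loop is chosen for brevity; a table-driven version would
be faster.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC32, assuming the IEEE 802.3 polynomial. */
uint32_t
crc32_calc(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFF;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }

    return crc ^ 0xFFFFFFFF;
}

enum header_status {
    HEADER_OK,
    HEADER_BAD_MAGIC,
    HEADER_CORRUPT,      /* CRC32 doesn't match */
    HEADER_UNSUPPORTED   /* CRC32 matches, unknown Stream Flags */
};

/* Hypothetical helper: checks a 12-byte Stream Header. */
enum header_status
check_stream_header(const uint8_t hdr[static 12])
{
    static const uint8_t magic[6] = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

    if (memcmp(hdr, magic, sizeof(magic)) != 0)
        return HEADER_BAD_MAGIC;

    /* The stored CRC32 is little endian and covers only the two
     * Stream Flags bytes. */
    const uint32_t stored = (uint32_t)hdr[8]
            | ((uint32_t)hdr[9] << 8)
            | ((uint32_t)hdr[10] << 16)
            | ((uint32_t)hdr[11] << 24);

    if (crc32_calc(hdr + 6, 2) != stored)
        return HEADER_CORRUPT;

    /* The CRC32 matched, so unexpected bits mean a newer format. */
    if (hdr[6] != 0x00 || (hdr[7] & 0xF0) != 0)
        return HEADER_UNSUPPORTED;

    return HEADER_OK;
}
```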
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.2. Stream Footer
jpayne@68: 
jpayne@68:         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
jpayne@68:         | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
jpayne@68:         +-+-+-+-+---+---+---+---+-------+------+----------+---------+
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.2.1. CRC32
jpayne@68: 
jpayne@68:         The CRC32 is calculated from the Backward Size and Stream Flags
jpayne@68:         fields. It is stored as an unsigned 32-bit little endian
jpayne@68:         integer. If the calculated value does not match the stored one,
jpayne@68:         the decoder MUST indicate an error.
jpayne@68: 
jpayne@68:         The reason to have the CRC32 field before the Backward Size and
jpayne@68:         Stream Flags fields is to keep the four-byte fields aligned to
jpayne@68:         a multiple of four bytes.
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.2.2. Backward Size
jpayne@68: 
jpayne@68:         Backward Size is stored as a 32-bit little endian integer,
jpayne@68:         which indicates the size of the Index field as a multiple of
jpayne@68:         four bytes, the minimum size being four bytes:
jpayne@68: 
jpayne@68:             real_backward_size = (stored_backward_size + 1) * 4;
jpayne@68: 
jpayne@68:         If the stored value does not match the real size of the Index
jpayne@68:         field, the decoder MUST indicate an error.
jpayne@68: 
jpayne@68:         Using a fixed-size integer to store Backward Size makes
jpayne@68:         it slightly simpler to parse the Stream Footer when the
jpayne@68:         application needs to parse the Stream backwards.
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.2.3. Stream Flags
jpayne@68: 
jpayne@68:         This is a copy of the Stream Flags field from the Stream
jpayne@68:         Header. The information stored to Stream Flags is needed
jpayne@68:         when parsing the Stream backwards. The decoder MUST compare
jpayne@68:         the Stream Flags fields in both Stream Header and Stream
jpayne@68:         Footer, and indicate an error if they are not identical.
jpayne@68: 
jpayne@68: 
jpayne@68: 2.1.2.4. Footer Magic Bytes
jpayne@68: 
jpayne@68:         As the last step of the decoding process, the decoder MUST
jpayne@68:         verify the existence of Footer Magic Bytes. If they don't
jpayne@68:         match, an error MUST be indicated.
jpayne@68: 
jpayne@68:             Using a C array and ASCII:
jpayne@68:             const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
jpayne@68: 
jpayne@68:             In hexadecimal:
jpayne@68:             59 5A
jpayne@68: 
jpayne@68:         The primary reason to have Footer Magic Bytes is to make
jpayne@68:         it easier to detect incomplete files quickly, without
jpayne@68:         uncompressing. If the file does not end with Footer Magic Bytes
jpayne@68:         (excluding Stream Padding described in Section 2.2), it cannot
jpayne@68:         be undamaged, unless someone has intentionally appended garbage
jpayne@68:         after the end of the Stream.
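
The Stream Footer fields of Sections 2.1.2.1 through 2.1.2.4 can be
parsed together, as in the following sketch. The names
parse_stream_footer(), read32le(), and crc32_calc() are hypothetical,
and the CRC32 is assumed to be the common IEEE 802.3 variant. Checking
the Footer Magic Bytes first is a choice that suits backward parsing;
a forward decoder would reach them last.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32, assuming the IEEE 802.3 polynomial. */
uint32_t
crc32_calc(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFF;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }

    return crc ^ 0xFFFFFFFF;
}

/* Reads an unsigned 32-bit little endian integer. */
uint32_t
read32le(const uint8_t *b)
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8)
            | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Hypothetical helper: validates a 12-byte Stream Footer. On success
 * stores the real size of the Index field in bytes and the Check ID. */
bool
parse_stream_footer(const uint8_t ftr[static 12],
        uint64_t *index_size, uint8_t *check_id)
{
    /* Footer Magic Bytes */
    if (ftr[10] != 'Y' || ftr[11] != 'Z')
        return false;

    /* CRC32 covers Backward Size (4 bytes) and Stream Flags (2). */
    if (crc32_calc(ftr + 4, 6) != read32le(ftr))
        return false;

    /* Stream Flags: same rules as in the Stream Header. */
    if (ftr[8] != 0x00 || (ftr[9] & 0xF0) != 0)
        return false;

    /* real_backward_size = (stored_backward_size + 1) * 4 */
    *index_size = ((uint64_t)read32le(ftr + 4) + 1) * 4;
    *check_id = ftr[9] & 0x0F;
    return true;
}
```

With the largest stored value, 0xFFFFFFFF, the formula gives 2^34
bytes, which agrees with the 16 GiB limit on the Index field stated
in Section 2.1. The caller is still expected to compare the Stream
Flags with those of the Stream Header.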
jpayne@68: 
jpayne@68: 
jpayne@68: 2.2. Stream Padding
jpayne@68: 
jpayne@68:         Only the decoders that support decoding of concatenated Streams
jpayne@68:         MUST support Stream Padding.
jpayne@68: 
jpayne@68:         Stream Padding MUST contain only null bytes. To preserve the
jpayne@68:         four-byte alignment of consecutive Streams, the size of Stream
jpayne@68:         Padding MUST be a multiple of four bytes. Empty Stream Padding
jpayne@68:         is allowed. If these requirements are not met, the decoder MUST
jpayne@68:         indicate an error.
jpayne@68: 
jpayne@68:         Note that non-empty Stream Padding is allowed at the end of the
jpayne@68:         file; there doesn't need to be a new Stream after non-empty
jpayne@68:         Stream Padding. This can be convenient in certain situations
jpayne@68:         [GNU-tar].
jpayne@68: 
jpayne@68:         The possibility of Stream Padding MUST be taken into account
jpayne@68:         when designing an application that parses Streams backwards
jpayne@68:         and supports concatenated Streams.
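
One way to account for Stream Padding when parsing backwards is
sketched below, assuming the whole file is available in memory. The
function name skip_stream_padding() is hypothetical. The sketch relies
on the fact that the Stream Footer ends with the byte 'Z', which is
never a null byte, so every trailing null byte belongs to Stream
Padding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: finds the offset just past the last Stream by
 * skipping trailing null bytes. Returns false if nothing but null
 * bytes was found or if the size of the Stream Padding is not a
 * multiple of four bytes. */
bool
skip_stream_padding(const uint8_t *file, size_t file_size,
        size_t *stream_end)
{
    size_t end = file_size;

    while (end > 0 && file[end - 1] == 0x00)
        --end;

    if (end == 0 || (file_size - end) % 4 != 0)
        return false;

    *stream_end = end;
    return true;
}
```

The twelve bytes before the returned offset would then be parsed as a
Stream Footer, and the same step can be repeated in front of each
earlier Stream.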
jpayne@68: 
jpayne@68: 
jpayne@68: 3. Block
jpayne@68: 
jpayne@68:         +==============+=================+===============+=======+
jpayne@68:         | Block Header | Compressed Data | Block Padding | Check |
jpayne@68:         +==============+=================+===============+=======+
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1. Block Header
jpayne@68: 
jpayne@68:         +-------------------+-------------+=================+
jpayne@68:         | Block Header Size | Block Flags | Compressed Size |
jpayne@68:         +-------------------+-------------+=================+
jpayne@68: 
jpayne@68:              +===================+======================+
jpayne@68:         ---> | Uncompressed Size | List of Filter Flags |
jpayne@68:              +===================+======================+
jpayne@68: 
jpayne@68:              +================+--+--+--+--+
jpayne@68:         ---> | Header Padding |   CRC32   |
jpayne@68:              +================+--+--+--+--+
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1.1. Block Header Size
jpayne@68: 
jpayne@68:         This field overlaps with the Index Indicator field (see
jpayne@68:         Section 4.1).
jpayne@68: 
jpayne@68:         This field contains the size of the Block Header field,
jpayne@68:         including the Block Header Size field itself. Valid values are
jpayne@68:         in the range [0x01, 0xFF], which indicate the size of the Block
jpayne@68:         Header as multiples of four bytes, minimum size being eight
jpayne@68:         bytes:
jpayne@68: 
jpayne@68:             real_header_size = (encoded_header_size + 1) * 4;
jpayne@68: 
jpayne@68:         If a Block Header bigger than 1024 bytes is needed in the
jpayne@68:         future, a new field can be added between the Block Header and
jpayne@68:         Compressed Data fields. The presence of this new field would
jpayne@68:         be indicated in the Block Header field.
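
Because this byte overlaps with the Index Indicator, a decoder can use
it to tell a Block from the Index. A minimal sketch, with the
hypothetical name block_header_size():

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the real size of the Block Header in
 * bytes (8-1024), or zero if the byte is 0x00, which means that the
 * Index begins here instead of a Block. */
size_t
block_header_size(uint8_t encoded_header_size)
{
    if (encoded_header_size == 0x00)
        return 0;

    /* real_header_size = (encoded_header_size + 1) * 4 */
    return ((size_t)encoded_header_size + 1) * 4;
}
```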
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1.2. Block Flags
jpayne@68: 
jpayne@68:         The Block Flags field is a bit field:
jpayne@68: 
jpayne@68:             Bit(s)  Mask  Description
jpayne@68:              0-1    0x03  Number of filters (1-4)
jpayne@68:              2-5    0x3C  Reserved for future use; MUST be zero for now.
jpayne@68:               6     0x40  The Compressed Size field is present.
jpayne@68:               7     0x80  The Uncompressed Size field is present.
jpayne@68: 
jpayne@68:         If any reserved bit is set, the decoder MUST indicate an error.
jpayne@68:         It is possible that there is a new field present which the
jpayne@68:         decoder is not aware of, and can thus parse the Block Header
jpayne@68:         incorrectly.
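
The bit field above can be decoded as in the following sketch. The
structure and function names are hypothetical. The two lowest bits are
interpreted as the number of filters minus one, which is the reading
under which two bits cover the range 1-4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded Block Flags. */
struct block_flags {
    unsigned filter_count;        /* 1-4 */
    bool has_compressed_size;
    bool has_uncompressed_size;
};

/* Hypothetical helper: returns false if a reserved bit is set, in
 * which case the decoder would indicate an error. */
bool
parse_block_flags(uint8_t flags, struct block_flags *out)
{
    if ((flags & 0x3C) != 0)
        return false;

    out->filter_count = (flags & 0x03) + 1u;
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return true;
}
```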
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1.3. Compressed Size
jpayne@68: 
jpayne@68:         This field is present only if the appropriate bit is set in
jpayne@68:         the Block Flags field (see Section 3.1.2).
jpayne@68: 
jpayne@68:         The Compressed Size field contains the size of the Compressed
jpayne@68:         Data field, which MUST be non-zero. Compressed Size is stored
jpayne@68:         using the encoding described in Section 1.2. If the Compressed
jpayne@68:         Size doesn't match the size of the Compressed Data field, the
jpayne@68:         decoder MUST indicate an error.
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1.4. Uncompressed Size
jpayne@68: 
jpayne@68:         This field is present only if the appropriate bit is set in
jpayne@68:         the Block Flags field (see Section 3.1.2).
jpayne@68: 
jpayne@68:         The Uncompressed Size field contains the size of the Block
jpayne@68:         after uncompressing. Uncompressed Size is stored using the
jpayne@68:         encoding described in Section 1.2. If the Uncompressed Size
jpayne@68:         does not match the real uncompressed size, the decoder MUST
jpayne@68:         indicate an error.
jpayne@68: 
jpayne@68:         Storing the Compressed Size and Uncompressed Size fields serves
jpayne@68:         several purposes:
jpayne@68:           - The decoder knows how much memory it needs to allocate
jpayne@68:             for a temporary buffer in multithreaded mode.
jpayne@68:           - Simple error detection: wrong size indicates a broken file.
jpayne@68:           - Seeking forwards to a specific location in streamed mode.
jpayne@68: 
jpayne@68:         It should be noted that the only reliable way to determine
jpayne@68:         the real uncompressed size is to uncompress the Block,
jpayne@68:         because the Block Header and Index fields may contain
jpayne@68:         (intentionally or unintentionally) invalid information.
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1.5. List of Filter Flags
jpayne@68: 
jpayne@68:         +================+================+     +================+
jpayne@68:         | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
jpayne@68:         +================+================+     +================+
jpayne@68: 
jpayne@68:         The number of Filter Flags fields is stored in the Block Flags
jpayne@68:         field (see Section 3.1.2).
jpayne@68: 
jpayne@68:         The format of each Filter Flags field is as follows:
jpayne@68: 
jpayne@68:             +===========+====================+===================+
jpayne@68:             | Filter ID | Size of Properties | Filter Properties |
jpayne@68:             +===========+====================+===================+
jpayne@68: 
jpayne@68:         Both Filter ID and Size of Properties are stored using the
jpayne@68:         encoding described in Section 1.2. Size of Properties indicates
jpayne@68:         the size of the Filter Properties field in bytes. The list of
jpayne@68:         officially defined Filter IDs and the formats of their Filter
jpayne@68:         Properties are described in Section 5.3.
jpayne@68: 
jpayne@68:         Filter IDs greater than or equal to 0x4000_0000_0000_0000
jpayne@68:         (2^62) are reserved for implementation-specific internal use.
jpayne@68:         These Filter IDs MUST never be used in List of Filter Flags.
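
Parsing a single Filter Flags field can be sketched as follows. The
names vli_decode(), struct filter_flags, and parse_filter_flags() are
hypothetical; vli_decode() repeats the decode() function of Section
1.2 so that the example is self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Same logic as decode() in Section 1.2. */
size_t
vli_decode(const uint8_t buf[], size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;

    if (size_max > 9)
        size_max = 9;

    *num = buf[0] & 0x7F;
    size_t i = 0;

    while (buf[i++] & 0x80) {
        if (i >= size_max || buf[i] == 0x00)
            return 0;

        *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
    }

    return i;
}

/* Hypothetical container for one decoded Filter Flags field. */
struct filter_flags {
    uint64_t id;
    const uint8_t *props;   /* points into the input buffer */
    size_t props_size;
};

/* Hypothetical helper: returns the number of bytes consumed from
 * buf, or zero on error. */
size_t
parse_filter_flags(const uint8_t *buf, size_t size,
        struct filter_flags *out)
{
    uint64_t props_size;

    size_t pos = vli_decode(buf, size, &out->id);

    /* IDs of 2^62 and above are reserved for internal use. */
    if (pos == 0 || out->id >= UINT64_C(0x4000000000000000))
        return 0;

    const size_t used = vli_decode(buf + pos, size - pos, &props_size);
    if (used == 0)
        return 0;

    pos += used;

    /* Filter Properties must fit in the remaining input. */
    if (props_size > size - pos)
        return 0;

    out->props = buf + pos;
    out->props_size = (size_t)props_size;
    return pos + out->props_size;
}
```

For instance, the bytes 21 01 16 decode to Filter ID 0x21 with one
byte of Filter Properties (0x16). A caller would invoke the function
once per filter, as many times as Block Flags indicates.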
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1.6. Header Padding
jpayne@68: 
jpayne@68:         This field contains as many null bytes as are needed to make
jpayne@68:         the Block Header have the size specified in Block Header Size.
jpayne@68:         If any of the bytes are not null bytes, the decoder MUST
jpayne@68:         indicate an error. It is possible that there is a new field
jpayne@68:         present which the decoder is not aware of, and can thus parse
jpayne@68:         the Block Header incorrectly.
jpayne@68: 
jpayne@68: 
jpayne@68: 3.1.7. CRC32
jpayne@68: 
jpayne@68:         The CRC32 is calculated over everything in the Block Header
jpayne@68:         field except the CRC32 field itself. It is stored as an
jpayne@68:         unsigned 32-bit little endian integer. If the calculated
jpayne@68:         value does not match the stored one, the decoder MUST indicate
jpayne@68:         an error.
jpayne@68: 
jpayne@68:         Verifying the CRC32 of the Block Header before parsing the
jpayne@68:         actual contents allows the decoder to distinguish between
jpayne@68:         corrupt and unsupported files.
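
A minimal sketch of that verification step follows. The names
crc32_calc() and block_header_crc_ok() are hypothetical, and the CRC32
is assumed to be the common IEEE 802.3 variant. Only after this check
succeeds would a decoder go on to interpret Block Flags and the other
fields.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32, assuming the IEEE 802.3 polynomial. */
uint32_t
crc32_calc(const uint8_t *buf, size_t size)
{
    uint32_t crc = 0xFFFFFFFF;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }

    return crc ^ 0xFFFFFFFF;
}

/* Hypothetical helper: hdr points to the Block Header Size byte and
 * avail is the number of readable bytes. Returns true if the whole
 * Block Header is available and its CRC32 matches. */
bool
block_header_crc_ok(const uint8_t *hdr, size_t avail)
{
    /* 0x00 would be the Index Indicator, not a Block Header. */
    if (avail == 0 || hdr[0] == 0x00)
        return false;

    const size_t size = ((size_t)hdr[0] + 1) * 4;
    if (size > avail)
        return false;

    /* The CRC32 field is the last four bytes, little endian. */
    const uint8_t *p = hdr + size - 4;
    const uint32_t stored = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    return crc32_calc(hdr, size - 4) == stored;
}
```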
jpayne@68: 
jpayne@68: 
jpayne@68: 3.2. Compressed Data
jpayne@68: 
jpayne@68:         The format of Compressed Data depends on Block Flags and List
jpayne@68:         of Filter Flags. Excluding the descriptions of the simplest
jpayne@68:         filters in Section 5.3, the format of the filter-specific
jpayne@68:         encoded data is out of scope of this document.
jpayne@68: 
jpayne@68: 
jpayne@68: 3.3. Block Padding
jpayne@68: 
jpayne@68:         Block Padding MUST contain 0-3 null bytes to make the size of
jpayne@68:         the Block a multiple of four bytes. This can be needed when
jpayne@68:         the size of Compressed Data is not a multiple of four. If any
jpayne@68:         of the bytes in Block Padding are not null bytes, the decoder
jpayne@68:         MUST indicate an error.
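
Since the sizes of the Block Header and of every Check type are
multiples of four bytes, the amount of Block Padding follows from the
size of Compressed Data alone. The sketch below uses the hypothetical
names block_padding_size() and block_padding_ok().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of null bytes (0-3) needed after
 * Compressed Data of the given size. */
size_t
block_padding_size(uint64_t compressed_size)
{
    return (size_t)((4 - compressed_size % 4) % 4);
}

/* Hypothetical helper: returns true if the padding consists of at
 * most three bytes and all of them are null bytes. */
bool
block_padding_ok(const uint8_t *padding, size_t size)
{
    if (size > 3)
        return false;

    for (size_t i = 0; i < size; ++i) {
        if (padding[i] != 0x00)
            return false;
    }

    return true;
}
```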
jpayne@68: 
jpayne@68: 
jpayne@68: 3.4. Check
jpayne@68: 
jpayne@68:         The type and size of the Check field depend on which bits
jpayne@68:         are set in the Stream Flags field (see Section 2.1.1.2).
jpayne@68: 
jpayne@68:         The Check, when used, is calculated from the original
jpayne@68:         uncompressed data. If the calculated Check does not match the
jpayne@68:         stored one, the decoder MUST indicate an error. If the selected
jpayne@68:         type of Check is not supported by the decoder, it SHOULD
jpayne@68:         indicate a warning or error.
jpayne@68: 
jpayne@68: 
jpayne@68: 4. Index
jpayne@68: 
jpayne@68:         +-----------------+===================+
jpayne@68:         | Index Indicator | Number of Records |
jpayne@68:         +-----------------+===================+
jpayne@68: 
jpayne@68:              +=================+===============+-+-+-+-+
jpayne@68:         ---> | List of Records | Index Padding | CRC32 |
jpayne@68:              +=================+===============+-+-+-+-+
jpayne@68: 
jpayne@68:         Index serves several purposes. Using it, one can
jpayne@68:           - verify that all Blocks in a Stream have been processed;
jpayne@68:           - find out the uncompressed size of a Stream; and
jpayne@68:           - quickly access the beginning of any Block (random access).
jpayne@68: 
jpayne@68: 
jpayne@68: 4.1. Index Indicator
jpayne@68: 
jpayne@68:         This field overlaps with the Block Header Size field (see
jpayne@68:         Section 3.1.1). The value of Index Indicator is always 0x00.
jpayne@68: 
jpayne@68: 
jpayne@68: 4.2. Number of Records
jpayne@68: 
jpayne@68:         This field indicates how many Records there are in the List
jpayne@68:         of Records field, and thus how many Blocks there are in the
jpayne@68:         Stream. The value is stored using the encoding described in
jpayne@68:         Section 1.2. If the decoder has decoded all the Blocks of the
jpayne@68:         Stream, and then notices that the Number of Records doesn't
jpayne@68:         match the real number of Blocks, the decoder MUST indicate an
jpayne@68:         error.
jpayne@68: 
jpayne@68: 
jpayne@68: 4.3. List of Records
jpayne@68: 
jpayne@68:         List of Records consists of as many Records as indicated by the
jpayne@68:         Number of Records field:
jpayne@68: 
jpayne@68:             +========+========+
jpayne@68:             | Record | Record | ...
jpayne@68:             +========+========+
jpayne@68: 
jpayne@68:         Each Record contains information about one Block:
jpayne@68: 
jpayne@68:             +===============+===================+
jpayne@68:             | Unpadded Size | Uncompressed Size |
jpayne@68:             +===============+===================+
jpayne@68: 
jpayne@68:         If the decoder has decoded all the Blocks of the Stream, it
jpayne@68:         MUST verify that the contents of the Records match the real
jpayne@68:         Unpadded Size and Uncompressed Size of the respective Blocks.
jpayne@68: 
jpayne@68:         Implementation hint: It is possible to verify the Index with
jpayne@68:         constant memory usage by calculating, for example, SHA-256 of
jpayne@68:         both the real size values and the List of Records, then
jpayne@68:         comparing the hash values. Implementing this using a
jpayne@68:         non-cryptographic hash like CRC32 SHOULD be avoided unless
jpayne@68:         small code size is important.
jpayne@68: 
jpayne@68:         If the decoder supports random-access reading, it MUST verify
jpayne@68:         that Unpadded Size and Uncompressed Size of every completely
jpayne@68:         decoded Block match the sizes stored in the Index. If a Block
jpayne@68:         is only partially decoded, the decoder MUST verify that the
jpayne@68:         processed sizes don't exceed the sizes stored in the Index.
jpayne@68: 
jpayne@68: 
jpayne@68: 4.3.1. Unpadded Size
jpayne@68: 
jpayne@68:         This field indicates the size of the Block excluding the Block
jpayne@68:         Padding field. That is, Unpadded Size is the size of the Block
jpayne@68:         Header, Compressed Data, and Check fields. Unpadded Size is
jpayne@68:         stored using the encoding described in Section 1.2. The value
jpayne@68:         MUST never be zero; with the current structure of Blocks, the
jpayne@68:         actual minimum value for Unpadded Size is five.
jpayne@68: 
jpayne@68:         Implementation note: Because the size of the Block Padding
jpayne@68:         field is not included in Unpadded Size, calculating the total
jpayne@68:         size of a Stream or doing random-access reading requires
jpayne@68:         calculating the actual size of the Blocks by rounding Unpadded
jpayne@68:         Sizes up to the next multiple of four.
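
        The rounding described in the note above can be sketched as
        follows, together with a simple linear lookup for random-access
        reading. The struct and function names are illustrative
        assumptions, offsets are measured from the start of the first
        Block in the Stream, and overflow checks are omitted for
        brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* One Record of the Index, already decoded. */
struct index_record {
    uint64_t unpadded_size;
    uint64_t uncompressed_size;
};

/* Actual size of a Block: Unpadded Size rounded up to a multiple
 * of four, which accounts for the Block Padding field. */
uint64_t block_total_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Finds the Block that contains the uncompressed offset `target`.
 * On success, returns the Record number and stores the offset of the
 * Block (relative to the first Block) and the uncompressed offset at
 * which the Block begins. Returns (size_t)-1 if `target` is at or
 * past the end of the uncompressed data. */
size_t locate_block(const struct index_record *records, size_t count,
        uint64_t target, uint64_t *block_offset,
        uint64_t *uncompressed_start)
{
    uint64_t offset = 0;
    uint64_t uncompressed = 0;

    for (size_t i = 0; i < count; ++i) {
        if (target < uncompressed + records[i].uncompressed_size) {
            *block_offset = offset;
            *uncompressed_start = uncompressed;
            return i;
        }

        offset += block_total_size(records[i].unpadded_size);
        uncompressed += records[i].uncompressed_size;
    }

    return (size_t)-1;
}
```

        A real implementation would likely precompute cumulative
        offsets and use a binary search instead of scanning the
        Records on every lookup.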
jpayne@68: 
jpayne@68:         The reason to exclude Block Padding from Unpadded Size is to
jpayne@68:         ease making a raw copy of Compressed Data without Block
jpayne@68:         Padding. This can be useful, for example, if someone wants
jpayne@68:         to convert Streams to some other file format quickly.
jpayne@68: 
jpayne@68: 
jpayne@68: 4.3.2. Uncompressed Size
jpayne@68: 
jpayne@68:         This field indicates the Uncompressed Size of the respective
jpayne@68:         Block in bytes. The value is stored using the encoding
jpayne@68:         described in Section 1.2.
jpayne@68: 
jpayne@68: 
jpayne@68: 4.4. Index Padding
jpayne@68: 
jpayne@68:         This field MUST contain 0-3 null bytes to pad the Index to
jpayne@68:         a multiple of four bytes. If any of the bytes are not null
jpayne@68:         bytes, the decoder MUST indicate an error.
jpayne@68: 
jpayne@68: 
jpayne@68: 4.5. CRC32
jpayne@68: 
jpayne@68:         The CRC32 is calculated over everything in the Index field
jpayne@68:         except the CRC32 field itself. The CRC32 is stored as an
jpayne@68:         unsigned 32-bit little endian integer. If the calculated
jpayne@68:         value does not match the stored one, the decoder MUST indicate
jpayne@68:         an error.
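
        The checks on Index Padding and CRC32 could be combined roughly
        as in the sketch below. It assumes the caller has the complete
        Index field in memory and already knows the offset just past
        the last Record; the function names and this interface are
        illustrative assumptions. The CRC32 here is a table-free
        variant of the algorithm shown in Section 6.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 with the same polynomial as in Section 6. */
uint32_t index_crc32(const uint8_t *buf, size_t size)
{
    uint32_t crc = ~UINT32_C(0);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j) {
            if (crc & 1)
                crc = (crc >> 1) ^ UINT32_C(0xEDB88320);
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

/* Returns 1 if Index Padding and CRC32 are valid, 0 otherwise.
 * `padding_start` is the offset just past the last Record. */
int index_tail_is_valid(const uint8_t *index, size_t index_size,
        size_t padding_start)
{
    /* Smallest Index: Indicator, Number of Records, two padding
     * bytes, and CRC32. The total is always a multiple of four. */
    if (index_size < 8 || (index_size & 3) != 0)
        return 0;

    const size_t crc_pos = index_size - 4;
    if (padding_start > crc_pos || crc_pos - padding_start > 3)
        return 0;

    for (size_t i = padding_start; i < crc_pos; ++i) {
        if (index[i] != 0x00)
            return 0;
    }

    /* The stored CRC32 is little endian. */
    const uint32_t stored = (uint32_t)index[crc_pos]
            | ((uint32_t)index[crc_pos + 1] << 8)
            | ((uint32_t)index[crc_pos + 2] << 16)
            | ((uint32_t)index[crc_pos + 3] << 24);

    return index_crc32(index, crc_pos) == stored;
}
```

        For an Index with zero Records, the covered bytes are
        00 00 00 00 and the stored CRC32 bytes are 1C DF 44 21.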
jpayne@68: 
jpayne@68: 
jpayne@68: 5. Filter Chains
jpayne@68: 
jpayne@68:         The Block Flags field defines how many filters are used. When
jpayne@68:         more than one filter is used, the filters are chained; that is,
jpayne@68:         the output of one filter is the input of another filter. The
jpayne@68:         following figure illustrates the direction of data flow.
jpayne@68: 
jpayne@68:                     v   Uncompressed Data   ^
jpayne@68:                     |       Filter 0        |
jpayne@68:             Encoder |       Filter 1        | Decoder
jpayne@68:                     |       Filter n        |
jpayne@68:                     v    Compressed Data    ^
jpayne@68: 
jpayne@68: 
jpayne@68: 5.1. Alignment
jpayne@68: 
jpayne@68:         Alignment of uncompressed input data is usually the job of
jpayne@68:         the application producing the data. For example, to get the
jpayne@68:         best results, an archiver tool should make sure that all
jpayne@68:         PowerPC executable files in the archive stream start at
jpayne@68:         offsets that are multiples of four bytes.
jpayne@68: 
jpayne@68:         Some filters, for example LZMA2, can be configured to take
jpayne@68:         advantage of specified alignment of input data. Note that
jpayne@68:         taking advantage of aligned input can be beneficial also when
jpayne@68:         a filter is not the first filter in the chain. For example,
jpayne@68:         if you compress PowerPC executables, you may want to use the
jpayne@68:         PowerPC filter and chain that with the LZMA2 filter. Because
jpayne@68:         not only the input but also the output alignment of the PowerPC
jpayne@68:         filter is four bytes, it is now beneficial to set LZMA2
jpayne@68:         settings so that the LZMA2 encoder can take advantage of its
jpayne@68:         four-byte-aligned input data.
jpayne@68: 
jpayne@68:         The output of the last filter in the chain is stored to the
jpayne@68:         Compressed Data field, which is guaranteed to be aligned
jpayne@68:         to a multiple of four bytes relative to the beginning of the
jpayne@68:         Stream. This can increase
jpayne@68:           - speed, if the filtered data is handled multiple bytes at
jpayne@68:             a time by the filter-specific encoder and decoder,
jpayne@68:             because accessing aligned data in computer memory is
jpayne@68:             usually faster; and
jpayne@68:           - compression ratio, if the output data is later compressed
jpayne@68:             with an external compression tool.
jpayne@68: 
jpayne@68: 
jpayne@68: 5.2. Security
jpayne@68: 
jpayne@68:         If filters were allowed to be chained freely, it would be
jpayne@68:         possible to create malicious files that would be very slow to
jpayne@68:         decode. Such files could be used to mount denial-of-service
jpayne@68:         attacks.
jpayne@68: 
jpayne@68:         Slow files could occur when multiple filters are chained:
jpayne@68: 
jpayne@68:             v   Compressed input data
jpayne@68:             |   Filter 1 decoder (last filter)
jpayne@68:             |   Filter 0 decoder (non-last filter)
jpayne@68:             v   Uncompressed output data
jpayne@68: 
jpayne@68:         The decoder of the last filter in the chain produces a lot of
jpayne@68:         output from little input. Another filter in the chain takes the
jpayne@68:         output of the last filter, and produces very little output
jpayne@68:         while consuming a lot of input. As a result, a lot of data is
jpayne@68:         moved inside the filter chain, but the filter chain as a whole
jpayne@68:         gets very little work done.
jpayne@68: 
jpayne@68:         To prevent such slow files, there are restrictions on how
jpayne@68:         the filters can be chained. These restrictions MUST be
jpayne@68:         taken into account when designing new filters.
jpayne@68: 
jpayne@68:         The maximum number of filters in the chain is limited to
jpayne@68:         four, so there can be at most three non-last filters.
jpayne@68:         Of these three non-last filters, only two are allowed to change
jpayne@68:         the size of the data.
jpayne@68: 
jpayne@68:         The non-last filters that change the size of the data MUST
jpayne@68:         have a limit on how much the decoder can compress the data:
jpayne@68:         the decoder SHOULD produce at least n bytes of output when
jpayne@68:         the filter is given 2n bytes of input. This limit is not
jpayne@68:         absolute, but significant deviations MUST be avoided.
jpayne@68: 
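jpayne@68:         The above limitations guarantee that if the last filter in the
jpayne@68:         chain produces 4n bytes of output, the chain as a whole will
jpayne@68:         produce at least n bytes of output: each of the at most two
jpayne@68:         size-changing non-last filters can halve the data at most
jpayne@68:         once (4n -> 2n -> n).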
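
        The chaining restrictions can be checked before any data is
        decoded. The sketch below is one possible way to do it, using
        the filter properties listed in Section 5.3. The table layout
        and function names are illustrative assumptions, and for
        simplicity an unknown Filter ID is rejected the same way as an
        invalid chain. Filter IDs are given in the order Filter 0
        first, last filter last.

```c
#include <stddef.h>
#include <stdint.h>

struct filter_info {
    uint64_t id;
    int changes_size;
    int allow_non_last;
    int allow_last;
};

static const struct filter_info known_filters[] = {
    { 0x03, 0, 1, 0 }, /* Delta */
    { 0x04, 0, 1, 0 }, /* x86 (BCJ) */
    { 0x05, 0, 1, 0 }, /* PowerPC */
    { 0x06, 0, 1, 0 }, /* IA64 */
    { 0x07, 0, 1, 0 }, /* ARM */
    { 0x08, 0, 1, 0 }, /* ARM Thumb */
    { 0x09, 0, 1, 0 }, /* SPARC */
    { 0x0A, 0, 1, 0 }, /* ARM64 */
    { 0x0B, 0, 1, 0 }, /* RISC-V */
    { 0x21, 1, 0, 1 }, /* LZMA2 */
};

const struct filter_info *find_filter(uint64_t id)
{
    const size_t n = sizeof(known_filters) / sizeof(known_filters[0]);
    for (size_t i = 0; i < n; ++i) {
        if (known_filters[i].id == id)
            return &known_filters[i];
    }
    return NULL;
}

/* Returns 1 if the chain obeys the restrictions, 0 otherwise. */
int filter_chain_is_valid(const uint64_t *ids, size_t count)
{
    size_t size_changing_non_last = 0;

    if (count < 1 || count > 4)
        return 0;

    for (size_t i = 0; i < count; ++i) {
        const struct filter_info *info = find_filter(ids[i]);
        const int is_last = (i + 1 == count);

        if (info == NULL)
            return 0;
        if (is_last ? !info->allow_last : !info->allow_non_last)
            return 0;

        /* At most two non-last filters may change the data size. */
        if (!is_last && info->changes_size
                && ++size_changing_non_last > 2)
            return 0;
    }

    return 1;
}
```

        With the filters defined so far, no non-last filter changes
        the size of the data, so the limit of two only becomes
        relevant when new filters are added to the table.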
jpayne@68: 
jpayne@68: 
jpayne@68: 5.3. Filters
jpayne@68: 
jpayne@68: 5.3.1. LZMA2
jpayne@68: 
jpayne@68:         LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
jpayne@68:         compression algorithm with high compression ratio and fast
jpayne@68:         decompression. LZMA is based on LZ77 and range coding
jpayne@68:         algorithms.
jpayne@68: 
jpayne@68:         LZMA2 is an extension on top of the original LZMA. LZMA2 uses
jpayne@68:         LZMA internally, but adds support for flushing the encoder
jpayne@68:         and for uncompressed chunks, eases stateful decoder
jpayne@68:         implementations, and improves support for multithreading.
jpayne@68:         Thus, the plain LZMA will not be supported in this file format.
jpayne@68: 
jpayne@68:             Filter ID:                  0x21
jpayne@68:             Size of Filter Properties:  1 byte
jpayne@68:             Changes size of data:       Yes
jpayne@68:             Allow as a non-last filter: No
jpayne@68:             Allow as the last filter:   Yes
jpayne@68: 
jpayne@68:             Preferred alignment:
jpayne@68:                 Input data:             Adjustable to 1/2/4/8/16 byte(s)
jpayne@68:                 Output data:            1 byte
jpayne@68: 
jpayne@68:         The format of the one-byte Filter Properties field is as
jpayne@68:         follows:
jpayne@68: 
jpayne@68:             Bits   Mask   Description
jpayne@68:             0-5    0x3F   Dictionary Size
jpayne@68:             6-7    0xC0   Reserved for future use; MUST be zero for now.
jpayne@68: 
jpayne@68:         Dictionary Size is encoded with one-bit mantissa and five-bit
jpayne@68:         exponent. The smallest dictionary size is 4 KiB and the biggest
jpayne@68:         is 4 GiB - 1 B.
jpayne@68: 
jpayne@68:             Raw value   Mantissa   Exponent   Dictionary size
jpayne@68:                 0           2         11         4 KiB
jpayne@68:                 1           3         11         6 KiB
jpayne@68:                 2           2         12         8 KiB
jpayne@68:                 3           3         12        12 KiB
jpayne@68:                 4           2         13        16 KiB
jpayne@68:                 5           3         13        24 KiB
jpayne@68:                 6           2         14        32 KiB
jpayne@68:               ...         ...        ...      ...
jpayne@68:                35           3         28       768 MiB
jpayne@68:                36           2         29      1024 MiB
jpayne@68:                37           3         29      1536 MiB
jpayne@68:                38           2         30      2048 MiB
jpayne@68:                39           3         30      3072 MiB
jpayne@68:                40           2         31      4096 MiB - 1 B
jpayne@68: 
jpayne@68:         Instead of having a table in the decoder, the dictionary size
jpayne@68:         can be decoded using the following C code:
jpayne@68: 
jpayne@68:             const uint8_t bits = get_dictionary_flags() & 0x3F;
jpayne@68:             if (bits > 40)
jpayne@68:                 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
jpayne@68: 
jpayne@68:             uint32_t dictionary_size;
jpayne@68:             if (bits == 40) {
jpayne@68:                 dictionary_size = UINT32_MAX;
jpayne@68:             } else {
jpayne@68:                 dictionary_size = 2 | (bits & 1);
jpayne@68:                 dictionary_size <<= bits / 2 + 11;
jpayne@68:             }
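
        The fragment above can be wrapped into a self-contained
        function, and an encoder also needs the inverse mapping. The
        sketch below shows one way to do both. The function names and
        return conventions are illustrative assumptions, and the
        encoder simply picks the smallest raw value whose dictionary
        size is at least the requested size.

```c
#include <stdint.h>

/* Decodes the one-byte Filter Properties of LZMA2.
 * Returns 0 on success, -1 if the byte is not valid. */
int lzma2_dict_size_decode(uint8_t props, uint32_t *dict_size)
{
    /* The two reserved bits MUST be zero. */
    if (props & 0xC0)
        return -1;

    const uint8_t bits = props & 0x3F;
    if (bits > 40)
        return -1;

    if (bits == 40)
        *dict_size = UINT32_MAX;
    else
        *dict_size = (UINT32_C(2) | (bits & 1)) << (bits / 2 + 11);

    return 0;
}

/* Returns the smallest raw value whose dictionary size is at
 * least `wanted` bytes. */
uint8_t lzma2_dict_size_encode(uint32_t wanted)
{
    for (uint8_t bits = 0; bits < 40; ++bits) {
        uint32_t size;
        if (lzma2_dict_size_decode(bits, &size) == 0 && size >= wanted)
            return bits;
    }
    return 40;
}
```

        For example, a requested dictionary of 8 MiB maps to raw
        value 22, and raw value 35 decodes to 768 MiB.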
jpayne@68: 
jpayne@68: 
jpayne@68: 5.3.2. Branch/Call/Jump Filters for Executables
jpayne@68: 
jpayne@68:         These filters convert relative branch, call, and jump
jpayne@68:         instructions to their absolute counterparts in executable
jpayne@68:         files. This conversion increases redundancy and thus
jpayne@68:         compression ratio.
jpayne@68: 
jpayne@68:             Size of Filter Properties:  0 or 4 bytes
jpayne@68:             Changes size of data:       No
jpayne@68:             Allow as a non-last filter: Yes
jpayne@68:             Allow as the last filter:   No
jpayne@68: 
jpayne@68:         Below is the list of filters in this category. The alignment
jpayne@68:         is the same for both input and output data.
jpayne@68: 
jpayne@68:             Filter ID   Alignment   Description
jpayne@68:               0x04       1 byte     x86 filter (BCJ)
jpayne@68:               0x05       4 bytes    PowerPC (big endian) filter
jpayne@68:               0x06      16 bytes    IA64 filter
jpayne@68:               0x07       4 bytes    ARM filter [1]
jpayne@68:               0x08       2 bytes    ARM Thumb filter [1]
jpayne@68:               0x09       4 bytes    SPARC filter
jpayne@68:               0x0A       4 bytes    ARM64 filter [2]
jpayne@68:               0x0B       2 bytes    RISC-V filter
jpayne@68: 
jpayne@68:               [1] These are for little endian instruction encoding.
jpayne@68:                   This must not be confused with data endianness.
jpayne@68:                   A processor configured for big endian data access
jpayne@68:                   may still use little endian instruction encoding.
jpayne@68:                   The filters don't care about the data endianness.
jpayne@68: 
jpayne@68:               [2] 4096-byte alignment gives the best results
jpayne@68:                   because the address in the ADRP instruction
jpayne@68:                   is a multiple of 4096 bytes.
jpayne@68: 
jpayne@68:         If the size of Filter Properties is four bytes, the Filter
jpayne@68:         Properties field contains the start offset used for address
jpayne@68:         conversions. It is stored as an unsigned 32-bit little endian
jpayne@68:         integer. The start offset MUST be a multiple of the alignment
jpayne@68:         of the filter as listed in the table above; if it isn't, the
jpayne@68:         decoder MUST indicate an error. If the size of Filter
jpayne@68:         Properties is zero, the start offset is zero.
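
        Parsing and validating the start offset could look roughly
        like the sketch below. The alignments come from the table
        above; the function names and return conventions are
        illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Alignment in bytes of a BCJ filter, or 0 if the Filter ID is
 * not one of the filters in this category. */
uint32_t bcj_alignment(uint64_t filter_id)
{
    switch (filter_id) {
    case 0x04: return 1;  /* x86 */
    case 0x05: return 4;  /* PowerPC (big endian) */
    case 0x06: return 16; /* IA64 */
    case 0x07: return 4;  /* ARM */
    case 0x08: return 2;  /* ARM Thumb */
    case 0x09: return 4;  /* SPARC */
    case 0x0A: return 4;  /* ARM64 */
    case 0x0B: return 2;  /* RISC-V */
    default:   return 0;
    }
}

/* Reads the start offset from Filter Properties.
 * Returns 0 on success, -1 if the decoder should indicate an error. */
int bcj_start_offset(uint64_t filter_id, const uint8_t *props,
        size_t props_size, uint32_t *start_offset)
{
    const uint32_t alignment = bcj_alignment(filter_id);
    if (alignment == 0)
        return -1;

    /* Empty Filter Properties mean a start offset of zero. */
    if (props_size == 0) {
        *start_offset = 0;
        return 0;
    }

    if (props_size != 4)
        return -1;

    /* Unsigned 32-bit little endian integer */
    const uint32_t offset = (uint32_t)props[0]
            | ((uint32_t)props[1] << 8)
            | ((uint32_t)props[2] << 16)
            | ((uint32_t)props[3] << 24);

    if (offset % alignment != 0)
        return -1;

    *start_offset = offset;
    return 0;
}
```

        With the PowerPC filter, for example, a start offset of 4096
        is accepted while a start offset of 2 is rejected.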
jpayne@68: 
jpayne@68:         Setting the start offset may be useful if an executable has
jpayne@68:         multiple sections, and there are many cross-section calls.
jpayne@68:         Taking advantage of this feature usually requires usage of
jpayne@68:         the Subblock filter, whose design is not complete yet.
jpayne@68: 
jpayne@68: 
jpayne@68: 5.3.3. Delta
jpayne@68: 
jpayne@68:         The Delta filter may increase compression ratio when the value
jpayne@68:         of the next byte correlates with the value of an earlier byte
jpayne@68:         at a specified distance.
jpayne@68: 
jpayne@68:             Filter ID:                  0x03
jpayne@68:             Size of Filter Properties:  1 byte
jpayne@68:             Changes size of data:       No
jpayne@68:             Allow as a non-last filter: Yes
jpayne@68:             Allow as the last filter:   No
jpayne@68: 
jpayne@68:             Preferred alignment:
jpayne@68:                 Input data:             1 byte
jpayne@68:                 Output data:            Same as the original input data
jpayne@68: 
jpayne@68:         The Properties byte indicates the delta distance, which can be
jpayne@68:         1-256 bytes backwards from the current byte: 0x00 indicates
jpayne@68:         a distance of 1 byte and 0xFF a distance of 256 bytes.
jpayne@68: 
jpayne@68: 
jpayne@68: 5.3.3.1. Format of the Encoded Output
jpayne@68: 
jpayne@68:         The code below illustrates both encoding and decoding with
jpayne@68:         the Delta filter.
jpayne@68: 
jpayne@68:             // Distance is in the range [1, 256].
jpayne@68:             const unsigned int distance = get_properties_byte() + 1;
jpayne@68:             uint8_t pos = 0;
jpayne@68:             uint8_t delta[256];
jpayne@68: 
jpayne@68:             memset(delta, 0, sizeof(delta));
jpayne@68: 
jpayne@68:             while (1) {
jpayne@68:                 const int byte = read_byte();
jpayne@68:                 if (byte == EOF)
jpayne@68:                     break;
jpayne@68: 
jpayne@68:                 uint8_t tmp = delta[(uint8_t)(distance + pos)];
jpayne@68:                 if (is_encoder) {
jpayne@68:                     tmp = (uint8_t)(byte) - tmp;
jpayne@68:                     delta[pos] = (uint8_t)(byte);
jpayne@68:                 } else {
jpayne@68:                     tmp = (uint8_t)(byte) + tmp;
jpayne@68:                     delta[pos] = tmp;
jpayne@68:                 }
jpayne@68: 
jpayne@68:                 write_byte(tmp);
jpayne@68:                 --pos;
jpayne@68:             }
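
        The same logic can be applied to a buffer in memory instead of
        a byte stream. The sketch below is a buffer-based adaptation of
        the code above; the struct and function names are illustrative
        assumptions. Keeping the history in a state struct allows the
        data to be processed in several calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct delta_state {
    unsigned int distance; /* In the range [1, 256] */
    uint8_t pos;
    uint8_t history[256];
};

void delta_init(struct delta_state *s, uint8_t props_byte)
{
    s->distance = (unsigned int)props_byte + 1;
    s->pos = 0;
    memset(s->history, 0, sizeof(s->history));
}

/* Encodes (is_encoder != 0) or decodes the buffer in place. */
void delta_run(struct delta_state *s, uint8_t *buf, size_t size,
        int is_encoder)
{
    for (size_t i = 0; i < size; ++i) {
        /* Unfiltered byte from `distance` bytes earlier */
        const uint8_t prev
                = s->history[(uint8_t)(s->distance + s->pos)];
        const uint8_t in = buf[i];
        uint8_t out;

        if (is_encoder) {
            out = (uint8_t)(in - prev);
            s->history[s->pos] = in;
        } else {
            out = (uint8_t)(in + prev);
            s->history[s->pos] = out;
        }

        buf[i] = out;
        --s->pos; /* Wraps from 0 to 255. */
    }
}
```

        With a distance of 1, the bytes 10, 20, 30, 40 are encoded as
        10, 10, 10, 10, and decoding that output with a freshly
        initialized state restores the original bytes.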
jpayne@68: 
jpayne@68: 
jpayne@68: 5.4. Custom Filter IDs
jpayne@68: 
jpayne@68:         If a developer wants to use custom Filter IDs, there are two
jpayne@68:         choices. The first choice is to contact Lasse Collin and ask
jpayne@68:         him to allocate a range of IDs for the developer.
jpayne@68: 
jpayne@68:         The second choice is to generate a 40-bit random integer
jpayne@68:         which the developer can use as a personal Developer ID.
jpayne@68:         To minimize the risk of collisions, the Developer ID has to
jpayne@68:         be a randomly generated integer, not a manually selected
jpayne@68:         "hex word". The following command, which works on many free
jpayne@68:         operating systems, can be used to generate a Developer ID:
jpayne@68: 
jpayne@68:             dd if=/dev/urandom bs=5 count=1 | hexdump
jpayne@68: 
jpayne@68:         The developer can then use the Developer ID to create unique
jpayne@68:         (well, hopefully unique) Filter IDs.
jpayne@68: 
jpayne@68:             Bits    Mask                    Description
jpayne@68:              0-15   0x0000_0000_0000_FFFF   Filter ID
jpayne@68:             16-55   0x00FF_FFFF_FFFF_0000   Developer ID
jpayne@68:             56-62   0x3F00_0000_0000_0000   Static prefix: 0x3F
jpayne@68: 
jpayne@68:         The resulting 63-bit integer will use 9 bytes of space when
jpayne@68:         stored using the encoding described in Section 1.2. To get
jpayne@68:         a shorter ID, see the beginning of this Section for how to
jpayne@68:         request a custom ID range.
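
        Combining the three fields from the table above can be
        sketched as follows. The function names are illustrative
        assumptions, and the Developer ID used in the usage note is a
        made-up example value, not an allocated ID. The size
        calculation assumes the Section 1.2 encoding stores seven bits
        of the value per byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Builds a custom Filter ID from a 40-bit Developer ID and a
 * 16-bit developer-specific Filter ID. */
uint64_t custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
    developer_id &= UINT64_C(0xFFFFFFFFFF); /* Keep 40 bits. */
    return (UINT64_C(0x3F) << 56) | (developer_id << 16) | filter_id;
}

/* Number of bytes needed to store `value` when each byte carries
 * seven bits of the value. */
size_t multibyte_size(uint64_t value)
{
    size_t size = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++size;
    }
    return size;
}
```

        With the example Developer ID 0x0123456789 and Filter ID
        0x0001, the result is 0x3F01234567890001, which takes 9 bytes
        when encoded.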
jpayne@68: 
jpayne@68: 
jpayne@68: 5.4.1. Reserved Custom Filter ID Ranges
jpayne@68: 
jpayne@68:         Range                       Description
jpayne@68:         0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
jpayne@68:         0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
jpayne@68:         0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
jpayne@68: 
jpayne@68: 
jpayne@68: 6. Cyclic Redundancy Checks
jpayne@68: 
jpayne@68:         There are several incompatible variations to calculate CRC32
jpayne@68:         and CRC64. For simplicity and clarity, complete examples are
jpayne@68:         provided to calculate the checks as they are used in this file
jpayne@68:         format. Implementations MAY use different code as long as it
jpayne@68:         gives identical results.
jpayne@68: 
jpayne@68:         The program below reads data from standard input, calculates
jpayne@68:         the CRC32 and CRC64 values, and prints the calculated values
jpayne@68:         as big endian hexadecimal strings to standard output.
jpayne@68: 
jpayne@68:             #include <stddef.h>
jpayne@68:             #include <inttypes.h>
jpayne@68:             #include <stdio.h>
jpayne@68: 
jpayne@68:             uint32_t crc32_table[256];
jpayne@68:             uint64_t crc64_table[256];
jpayne@68: 
jpayne@68:             void
jpayne@68:             init(void)
jpayne@68:             {
jpayne@68:                 static const uint32_t poly32 = UINT32_C(0xEDB88320);
jpayne@68:                 static const uint64_t poly64
jpayne@68:                         = UINT64_C(0xC96C5795D7870F42);
jpayne@68: 
jpayne@68:                 for (size_t i = 0; i < 256; ++i) {
jpayne@68:                     uint32_t crc32 = i;
jpayne@68:                     uint64_t crc64 = i;
jpayne@68: 
jpayne@68:                     for (size_t j = 0; j < 8; ++j) {
jpayne@68:                         if (crc32 & 1)
jpayne@68:                             crc32 = (crc32 >> 1) ^ poly32;
jpayne@68:                         else
jpayne@68:                             crc32 >>= 1;
jpayne@68: 
jpayne@68:                         if (crc64 & 1)
jpayne@68:                             crc64 = (crc64 >> 1) ^ poly64;
jpayne@68:                         else
jpayne@68:                             crc64 >>= 1;
jpayne@68:                     }
jpayne@68: 
jpayne@68:                     crc32_table[i] = crc32;
jpayne@68:                     crc64_table[i] = crc64;
jpayne@68:                 }
jpayne@68:             }
jpayne@68: 
jpayne@68:             uint32_t
jpayne@68:             crc32(const uint8_t *buf, size_t size, uint32_t crc)
jpayne@68:             {
jpayne@68:                 crc = ~crc;
jpayne@68:                 for (size_t i = 0; i < size; ++i)
jpayne@68:                     crc = crc32_table[buf[i] ^ (crc & 0xFF)]
jpayne@68:                             ^ (crc >> 8);
jpayne@68:                 return ~crc;
jpayne@68:             }
jpayne@68: 
jpayne@68:             uint64_t
jpayne@68:             crc64(const uint8_t *buf, size_t size, uint64_t crc)
jpayne@68:             {
jpayne@68:                 crc = ~crc;
jpayne@68:                 for (size_t i = 0; i < size; ++i)
jpayne@68:                     crc = crc64_table[buf[i] ^ (crc & 0xFF)]
jpayne@68:                             ^ (crc >> 8);
jpayne@68:                 return ~crc;
jpayne@68:             }
jpayne@68: 
jpayne@68:             int
jpayne@68:             main(void)
jpayne@68:             {
jpayne@68:                 init();
jpayne@68: 
jpayne@68:                 uint32_t value32 = 0;
jpayne@68:                 uint64_t value64 = 0;
jpayne@68:                 uint64_t total_size = 0;
jpayne@68:                 uint8_t buf[8192];
jpayne@68: 
jpayne@68:                 while (1) {
jpayne@68:                     const size_t buf_size
jpayne@68:                             = fread(buf, 1, sizeof(buf), stdin);
jpayne@68:                     if (buf_size == 0)
jpayne@68:                         break;
jpayne@68: 
jpayne@68:                     total_size += buf_size;
jpayne@68:                     value32 = crc32(buf, buf_size, value32);
jpayne@68:                     value64 = crc64(buf, buf_size, value64);
jpayne@68:                 }
jpayne@68: 
jpayne@68:                 printf("Bytes:  %" PRIu64 "\n", total_size);
jpayne@68:                 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
jpayne@68:                 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
jpayne@68: 
jpayne@68:                 return 0;
jpayne@68:             }
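
        Where small code size matters more than speed, the lookup
        tables can be avoided. The sketch below is a table-free
        variant of the functions above that uses the same polynomials
        and the same calling convention; the function names are
        illustrative assumptions. It is expected to give results
        identical to the table-driven code.

```c
#include <stddef.h>
#include <stdint.h>

uint32_t crc32_bitwise(const uint8_t *buf, size_t size, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j) {
            if (crc & 1)
                crc = (crc >> 1) ^ UINT32_C(0xEDB88320);
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

uint64_t crc64_bitwise(const uint8_t *buf, size_t size, uint64_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j) {
            if (crc & 1)
                crc = (crc >> 1) ^ UINT64_C(0xC96C5795D7870F42);
            else
                crc >>= 1;
        }
    }
    return ~crc;
}
```

        A convenient self-test is the ASCII string "123456789", for
        which the CRC32 is 0xCBF43926 and the CRC64 is
        0x995DC9BBDF1939FA. As with the table-driven functions, data
        can be fed in pieces by passing the previous return value as
        the crc argument, starting from zero.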
jpayne@68: 
jpayne@68: 
jpayne@68: 7. References
jpayne@68: 
jpayne@68:         LZMA SDK - The original LZMA implementation
jpayne@68:         https://7-zip.org/sdk.html
jpayne@68: 
jpayne@68:         LZMA Utils - LZMA adapted to POSIX-like systems
jpayne@68:         https://tukaani.org/lzma/
jpayne@68: 
jpayne@68:         XZ Utils - The next generation of LZMA Utils
jpayne@68:         https://tukaani.org/xz/
jpayne@68: 
jpayne@68:         [RFC-1952]
jpayne@68:         GZIP file format specification version 4.3
jpayne@68:         https://www.ietf.org/rfc/rfc1952.txt
jpayne@68:           - Notation of byte boxes in section "2.1. Overall conventions"
jpayne@68: 
jpayne@68:         [RFC-2119]
jpayne@68:         Key words for use in RFCs to Indicate Requirement Levels
jpayne@68:         https://www.ietf.org/rfc/rfc2119.txt
jpayne@68: 
jpayne@68:         [GNU-tar]
jpayne@68:         GNU tar 1.35 manual
jpayne@68:         https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
jpayne@68:           - Node 9.4.2 "Blocking Factor", paragraph that begins
jpayne@68:             "gzip will complain about trailing garbage"
jpayne@68:           - Note that this URL points to the latest version of the
jpayne@68:             manual, and may some day not contain the note which is in
jpayne@68:             1.35. For the exact version of the manual, download GNU
jpayne@68:             tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz
jpayne@68: