csp2: CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/share/doc/xz/xz-file-format.txt comparison

comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/share/doc/xz/xz-file-format.txt @ 68:5028fdace37b

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d

author	jpayne
date	Tue, 18 Mar 2025 16:23:26 -0400
parents
children


-:0e9998148a16
+:5028fdace37b
+The .xz File Format
+===================
+Version 1.2.1 (2024-04-08)
+0. Preface
+0.1. Notices and Acknowledgements
+0.2. Getting the Latest Version
+0.3. Version History
+1. Conventions
+1.1. Byte and Its Representation
+1.2. Multibyte Integers
+2. Overall Structure of .xz File
+2.1. Stream
+2.1.1. Stream Header
+2.1.1.1. Header Magic Bytes
+2.1.1.2. Stream Flags
+2.1.1.3. CRC32
+2.1.2. Stream Footer
+2.1.2.1. CRC32
+2.1.2.2. Backward Size
+2.1.2.3. Stream Flags
+2.1.2.4. Footer Magic Bytes
+2.2. Stream Padding
+3. Block
+3.1. Block Header
+3.1.1. Block Header Size
+3.1.2. Block Flags
+3.1.3. Compressed Size
+3.1.4. Uncompressed Size
+3.1.5. List of Filter Flags
+3.1.6. Header Padding
+3.1.7. CRC32
+3.2. Compressed Data
+3.3. Block Padding
+3.4. Check
+4. Index
+4.1. Index Indicator
+4.2. Number of Records
+4.3. List of Records
+4.3.1. Unpadded Size
+4.3.2. Uncompressed Size
+4.4. Index Padding
+4.5. CRC32
+5. Filter Chains
+5.1. Alignment
+5.2. Security
+5.3. Filters
+5.3.1. LZMA2
+5.3.2. Branch/Call/Jump Filters for Executables
+5.3.3. Delta
+5.3.3.1. Format of the Encoded Output
+5.4. Custom Filter IDs
+5.4.1. Reserved Custom Filter ID Ranges
+6. Cyclic Redundancy Checks
+7. References
+0. Preface
+This document describes the .xz file format (filename suffix
+".xz", MIME type "application/x-xz"). It is intended that
+this format replace the old .lzma format used by LZMA SDK and
+LZMA Utils.
+0.1. Notices and Acknowledgements
+This file format was designed by Lasse Collin
+<lasse.collin@tukaani.org> and Igor Pavlov.
+Special thanks for helping with this document go to
+Ville Koskinen. Thanks for helping with this document go to
+Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
+This document has been put into the public domain.
+0.2. Getting the Latest Version
+The latest official version of this document can be downloaded
+from <https://tukaani.org/xz/xz-file-format.txt>.
+Specific versions of this document have a filename
+xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
+For example, the version 1.0.0 of this document is available
+at <https://tukaani.org/xz/xz-file-format-1.0.0.txt>.
+0.3. Version History
+    Version   Date          Description
+    1.2.1     2024-04-08    The URLs of this specification and
+                            XZ Utils were changed back to the
+                            original ones in Sections 0.2 and 7.
+    1.2.0     2024-01-19    Added RISC-V filter and updated URLs in
+                            Sections 0.2 and 7. The URL of this
+                            specification was changed.
+    1.1.0     2022-12-11    Added ARM64 filter and clarified 32-bit
+                            ARM endianness in Section 5.3.2,
+                            language improvements in Section 5.4
+    1.0.4     2009-08-27    Language improvements in Sections 1.2,
+                            2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
+    1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4
+    1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1
+    1.0.1     2009-06-01    Typo fix in Section 0.3 and minor
+                            clarifications to Sections 2, 2.2,
+                            3.3, 4.4, and 5.3.2
+    1.0.0     2009-01-14    The first official version
+1. Conventions
+The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
+"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
+document are to be interpreted as described in [RFC-2119].
+Indicating a warning means displaying a message, returning
+appropriate exit status, or doing something else to let the
+user know that something worth warning occurred. The operation
+SHOULD still finish if a warning is indicated.
+Indicating an error means displaying a message, returning
+appropriate exit status, or doing something else to let the
+user know that something prevented successfully finishing the
+operation. The operation MUST be aborted once an error has
+been indicated.
+1.1. Byte and Its Representation
+In this document, byte is always 8 bits.
+A "null byte" has all bits unset. That is, the value of a null
+byte is 0x00.
+To represent byte blocks, this document uses notation that
+is similar to the notation used in [RFC-1952]:
++-------+
+|  Foo  |   One byte.
++-------+
++---+---+
+|  Foo  |   Two bytes; that is, some of the vertical bars
++---+---+   can be missing.
++=======+
+|  Foo  |   Zero or more bytes.
++=======+
+In this document, a boxed byte or a byte sequence declared
+using this notation is called "a field". The example field
+above would be called "the Foo field" or plain "Foo".
+If there are many fields, they may be split to multiple lines.
+This is indicated with an arrow ("--->"):
++=====+
+| Foo |
++=====+
+     +=====+
+---> | Bar |
+     +=====+
+The above is equivalent to this:
++=====+=====+
+| Foo | Bar |
++=====+=====+
+1.2. Multibyte Integers
+Multibyte integers of static length, such as CRC values,
+are stored in little endian byte order (least significant
+byte first).
+When smaller values are more likely than bigger values (for
+example file sizes), multibyte integers are encoded in a
+variable-length representation:
+- Numbers in the range [0, 127] are copied as is, and take
+one byte of space.
+- Bigger numbers will occupy two or more bytes. All but the
+last byte of the multibyte representation have the highest
+(eighth) bit set.
+For now, the value of the variable-length integers is limited
+to 63 bits, which limits the encoded size of the integer to
+nine bytes. These limits may be increased in the future if
+needed.
+The following C code illustrates encoding and decoding of
+variable-length integers. The functions return the number of
+bytes occupied by the integer (1-9), or zero on error.
+    #include <stddef.h>
+    #include <inttypes.h>
+
+    size_t
+    encode(uint8_t buf[static 9], uint64_t num)
+    {
+        if (num > UINT64_MAX / 2)
+            return 0;
+        size_t i = 0;
+        while (num >= 0x80) {
+            buf[i++] = (uint8_t)(num) | 0x80;
+            num >>= 7;
+        }
+        buf[i++] = (uint8_t)(num);
+        return i;
+    }
+
+    size_t
+    decode(const uint8_t buf[], size_t size_max, uint64_t *num)
+    {
+        if (size_max == 0)
+            return 0;
+        if (size_max > 9)
+            size_max = 9;
+        *num = buf[0] & 0x7F;
+        size_t i = 0;
+        while (buf[i++] & 0x80) {
+            if (i >= size_max || buf[i] == 0x00)
+                return 0;
+            *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
+        }
+        return i;
+    }
+2. Overall Structure of .xz File
+Standalone .xz files consist of one or more Streams which may
+have Stream Padding between or after them:
++========+================+========+================+
+| Stream | Stream Padding | Stream | Stream Padding | ...
++========+================+========+================+
+The sizes of Stream and Stream Padding are always multiples
+of four bytes, thus the size of every valid .xz file MUST be
+a multiple of four bytes.
+While a typical file contains only one Stream and no Stream
+Padding, a decoder handling standalone .xz files SHOULD support
+files that have more than one Stream or Stream Padding.
+In contrast to standalone .xz files, when the .xz file format
+is used as an internal part of some other file format or
+communication protocol, it usually is expected that the decoder
+stops after the first Stream, and doesn't look for Stream
+Padding or possibly other Streams.
+2.1. Stream
++-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
+|     Stream Header     | Block | Block | ... | Block |
++-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+
+     +=======+-+-+-+-+-+-+-+-+-+-+-+-+
+---> | Index |     Stream Footer     |
+     +=======+-+-+-+-+-+-+-+-+-+-+-+-+
+All the above fields have a size that is a multiple of four. If
+Stream is used as an internal part of another file format, it
+is RECOMMENDED to make the Stream start at an offset that is
+a multiple of four bytes.
+Stream Header, Index, and Stream Footer are always present in
+a Stream. The maximum size of the Index field is 16 GiB (2^34).
+There are zero or more Blocks. The maximum number of Blocks is
+limited only by the maximum size of the Index field.
+Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
+The same limit applies to the total amount of uncompressed
+data stored in a Stream.
+If an implementation supports handling .xz files with multiple
+concatenated Streams, it MAY apply the above limits to the file
+as a whole instead of on a per-Stream basis.
+2.1.1. Stream Header
++---+---+---+---+---+---+-------+------+--+--+--+--+
+|  Header Magic Bytes   | Stream Flags |   CRC32   |
++---+---+---+---+---+---+-------+------+--+--+--+--+
+2.1.1.1. Header Magic Bytes
+The first six (6) bytes of the Stream are the so-called Header
+Magic Bytes. They can be used to identify the file type.
+Using a C array and ASCII:
+const uint8_t HEADER_MAGIC[6]
+= { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
+In plain hexadecimal:
+FD 37 7A 58 5A 00
+Notes:
+- The first byte (0xFD) was chosen so that the files cannot
+be erroneously detected as being in .lzma format, in which
+the first byte is in the range [0x00, 0xE0].
+- The sixth byte (0x00) was chosen to prevent applications
+from misdetecting the file as a text file.
+If the Header Magic Bytes don't match, the decoder MUST
+indicate an error.
+2.1.1.2. Stream Flags
+The first byte of Stream Flags is always a null byte. In the
+future, this byte may be used to indicate a new Stream version
+or other Stream properties.
+The second byte of Stream Flags is a bit field:
+    Bit(s)  Mask  Description
+     0-3    0x0F  Type of Check (see Section 3.4):
+                      ID    Size      Check name
+                      0x00   0 bytes  None
+                      0x01   4 bytes  CRC32
+                      0x02   4 bytes  (Reserved)
+                      0x03   4 bytes  (Reserved)
+                      0x04   8 bytes  CRC64
+                      0x05   8 bytes  (Reserved)
+                      0x06   8 bytes  (Reserved)
+                      0x07  16 bytes  (Reserved)
+                      0x08  16 bytes  (Reserved)
+                      0x09  16 bytes  (Reserved)
+                      0x0A  32 bytes  SHA-256
+                      0x0B  32 bytes  (Reserved)
+                      0x0C  32 bytes  (Reserved)
+                      0x0D  64 bytes  (Reserved)
+                      0x0E  64 bytes  (Reserved)
+                      0x0F  64 bytes  (Reserved)
+     4-7    0xF0  Reserved for future use; MUST be zero for now.
+Implementations SHOULD support at least the Check IDs 0x00
+(None) and 0x01 (CRC32). Supporting other Check IDs is
+OPTIONAL. If an unsupported Check is used, the decoder SHOULD
+indicate a warning or error.
+If any reserved bit is set, the decoder MUST indicate an error.
+It is possible that there is a new field present which the
+decoder is not aware of, and can thus parse the Stream Header
+incorrectly.
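The Check ID table above follows a regular pattern (three IDs per size, each size doubling), so the size of the Check field can be computed instead of looked up. The following is a minimal sketch of that idea; the function name `check_size_from_stream_flags` is hypothetical, and treating a non-null first byte as unsupported is an assumption of this sketch rather than wording from this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns the size of the Check field in bytes for a two-byte Stream
 * Flags field, or -1 if the flags are not understood (non-null first
 * byte, which is an assumption of this sketch, or a reserved bit set).
 */
int
check_size_from_stream_flags(const uint8_t flags[2])
{
    assert(flags != NULL);
    if (flags[0] != 0x00 || (flags[1] & 0xF0) != 0)
        return -1;

    const unsigned int id = flags[1] & 0x0F;
    if (id == 0)
        return 0; /* Check type "None" */

    /* IDs 1-3 -> 4 bytes, 4-6 -> 8, 7-9 -> 16, 10-12 -> 32, 13-15 -> 64 */
    return 4 << ((id - 1) / 3);
}
```

A caller would still need to decide separately whether it can actually verify the selected Check type; the size alone is enough to skip over the field.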
+2.1.1.3. CRC32
+The CRC32 is calculated from the Stream Flags field. It is
+stored as an unsigned 32-bit little endian integer. If the
+calculated value does not match the stored one, the decoder
+MUST indicate an error.
+The idea is that Stream Flags would always be two bytes, even
+if new features are needed. This way old decoders will be able
+to verify the CRC32 calculated from Stream Flags, and thus
+distinguish between corrupt files (CRC32 doesn't match) and
+files that the decoder doesn't support (CRC32 matches but
+Stream Flags has reserved bits set).
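The distinction between corrupt and unsupported files can be sketched as follows. This is an illustrative outline only: the names `check_stream_header`, `crc32_bitwise`, and the `header_status` values are hypothetical, the CRC32 is a slow bitwise variant using the common reflected polynomial 0xEDB88320, and treating a non-null first flags byte as unsupported is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC32 (reflected polynomial 0xEDB88320); compact but slow. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = UINT32_C(0xFFFFFFFF);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }
    return crc ^ UINT32_C(0xFFFFFFFF);
}

enum header_status {
    HEADER_OK,
    HEADER_BAD_MAGIC,   /* not an .xz Stream */
    HEADER_CORRUPT,     /* CRC32 mismatch */
    HEADER_UNSUPPORTED, /* CRC32 matches but flags are not understood */
};

/* Validates the 12-byte Stream Header: magic, CRC32, then flags. */
enum header_status
check_stream_header(const uint8_t hdr[12])
{
    static const uint8_t magic[6] = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
    assert(hdr != NULL);

    if (memcmp(hdr, magic, sizeof(magic)) != 0)
        return HEADER_BAD_MAGIC;

    /* The CRC32 covers only the two Stream Flags bytes. */
    const uint32_t stored = (uint32_t)hdr[8]
            | ((uint32_t)hdr[9] << 8)
            | ((uint32_t)hdr[10] << 16)
            | ((uint32_t)hdr[11] << 24);
    if (crc32_bitwise(hdr + 6, 2) != stored)
        return HEADER_CORRUPT;

    if (hdr[6] != 0x00 || (hdr[7] & 0xF0) != 0)
        return HEADER_UNSUPPORTED;

    return HEADER_OK;
}
```

Checking the CRC32 before interpreting the flags is what lets the caller report "corrupt" and "unsupported" as separate conditions.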
+2.1.2. Stream Footer
++-+-+-+-+---+---+---+---+-------+------+----------+---------+
+| CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
++-+-+-+-+---+---+---+---+-------+------+----------+---------+
+2.1.2.1. CRC32
+The CRC32 is calculated from the Backward Size and Stream Flags
+fields. It is stored as an unsigned 32-bit little endian
+integer. If the calculated value does not match the stored one,
+the decoder MUST indicate an error.
+The reason to have the CRC32 field before the Backward Size and
+Stream Flags fields is to keep the four-byte fields aligned to
+a multiple of four bytes.
+2.1.2.2. Backward Size
+Backward Size is stored as a 32-bit little endian integer,
+which indicates the size of the Index field as a multiple of
+four bytes, the minimum value being four bytes:
+real_backward_size = (stored_backward_size + 1) * 4;
+If the stored value does not match the real size of the Index
+field, the decoder MUST indicate an error.
+Using a fixed-size integer to store Backward Size makes
+it slightly simpler to parse the Stream Footer when the
+application needs to parse the Stream backwards.
+2.1.2.3. Stream Flags
+This is a copy of the Stream Flags field from the Stream
+Header. The information stored to Stream Flags is needed
+when parsing the Stream backwards. The decoder MUST compare
+the Stream Flags fields in both Stream Header and Stream
+Footer, and indicate an error if they are not identical.
+2.1.2.4. Footer Magic Bytes
+As the last step of the decoding process, the decoder MUST
+verify the existence of Footer Magic Bytes. If they don't
+match, an error MUST be indicated.
+Using a C array and ASCII:
+const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
+In hexadecimal:
+59 5A
+The primary reason to have Footer Magic Bytes is to make
+it easier to detect incomplete files quickly, without
+uncompressing. If the file does not end with Footer Magic Bytes
+(excluding Stream Padding described in Section 2.2), it cannot
+be undamaged, unless someone has intentionally appended garbage
+after the end of the Stream.
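A backward-parsing application would typically start from the 12-byte Stream Footer. The sketch below shows one possible way to validate it and recover the real size of the Index; the names `index_size_from_footer`, `crc32_bitwise`, and `read_le32` are hypothetical, and checking the magic bytes first is a choice made for this sketch, not an order required by the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32 (reflected polynomial 0xEDB88320); compact but slow. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size)
{
    uint32_t crc = UINT32_C(0xFFFFFFFF);
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ UINT32_C(0xEDB88320)
                            : crc >> 1;
    }
    return crc ^ UINT32_C(0xFFFFFFFF);
}

/* Reads an unsigned 32-bit little endian integer. */
static uint32_t
read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
            | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns the real size of the Index in bytes, or 0 if the footer is
 * invalid or its Stream Flags differ from those of the Stream Header.
 * Zero is free to use as an error value because the minimum real size
 * is four bytes.
 */
uint64_t
index_size_from_footer(const uint8_t footer[12],
        const uint8_t header_flags[2])
{
    assert(footer != NULL && header_flags != NULL);

    /* Footer Magic Bytes */
    if (footer[10] != 'Y' || footer[11] != 'Z')
        return 0;

    /* CRC32 covers Backward Size (4 bytes) and Stream Flags (2 bytes). */
    if (crc32_bitwise(footer + 4, 6) != read_le32(footer))
        return 0;

    /* Stream Flags must be identical in header and footer. */
    if (footer[8] != header_flags[0] || footer[9] != header_flags[1])
        return 0;

    /* 64-bit arithmetic: the largest real size is 2^34 bytes. */
    return ((uint64_t)read_le32(footer + 4) + 1) * 4;
}
```

The returned size tells the application how far to seek backwards from the start of the Stream Footer to reach the Index.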
+2.2. Stream Padding
+Only the decoders that support decoding of concatenated Streams
+MUST support Stream Padding.
+Stream Padding MUST contain only null bytes. To preserve the
+four-byte alignment of consecutive Streams, the size of Stream
+Padding MUST be a multiple of four bytes. Empty Stream Padding
+is allowed. If these requirements are not met, the decoder MUST
+indicate an error.
+Note that non-empty Stream Padding is allowed at the end of the
+file; there doesn't need to be a new Stream after non-empty
+Stream Padding. This can be convenient in certain situations
+[GNU-tar].
+The possibility of Stream Padding MUST be taken into account
+when designing an application that parses Streams backwards
+and supports concatenated Streams.
+3. Block
++==============+=================+===============+=======+
+| Block Header | Compressed Data | Block Padding | Check |
++==============+=================+===============+=======+
+3.1. Block Header
++-------------------+-------------+=================+
+| Block Header Size | Block Flags | Compressed Size |
++-------------------+-------------+=================+
+     +===================+======================+
+---> | Uncompressed Size | List of Filter Flags |
+     +===================+======================+
+     +================+--+--+--+--+
+---> | Header Padding |   CRC32   |
+     +================+--+--+--+--+
+3.1.1. Block Header Size
+This field overlaps with the Index Indicator field (see
+Section 4.1).
+This field contains the size of the Block Header field,
+including the Block Header Size field itself. Valid values are
+in the range [0x01, 0xFF], which indicate the size of the Block
+Header as multiples of four bytes, minimum size being eight
+bytes:
+real_header_size = (encoded_header_size + 1) * 4;
+If a Block Header bigger than 1024 bytes is needed in the
+future, a new field can be added between the Block Header and
+Compressed Data fields. The presence of this new field would
+be indicated in the Block Header field.
+3.1.2. Block Flags
+The Block Flags field is a bit field:
+    Bit(s)  Mask  Description
+     0-1    0x03  Number of filters (1-4)
+     2-5    0x3C  Reserved for future use; MUST be zero for now.
+      6     0x40  The Compressed Size field is present.
+      7     0x80  The Uncompressed Size field is present.
+If any reserved bit is set, the decoder MUST indicate an error.
+It is possible that there is a new field present which the
+decoder is not aware of, and can thus parse the Block Header
+incorrectly.
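As an illustration, the Block Flags byte could be unpacked as shown below. The structure and function names are hypothetical. Because two bits must express a count in the range 1-4, the sketch adds one to the stored two-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct block_flags {
    unsigned int filter_count;  /* 1-4 */
    bool has_compressed_size;   /* Compressed Size field follows */
    bool has_uncompressed_size; /* Uncompressed Size field follows */
};

/* Returns false if any reserved bit (mask 0x3C) is set. */
bool
parse_block_flags(uint8_t flags, struct block_flags *out)
{
    assert(out != NULL);
    if ((flags & 0x3C) != 0)
        return false;

    out->filter_count = (flags & 0x03) + 1u;
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return true;
}
```

A `false` return corresponds to the case where the decoder must indicate an error because a reserved bit is set.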
+3.1.3. Compressed Size
+This field is present only if the appropriate bit is set in
+the Block Flags field (see Section 3.1.2).
+The Compressed Size field contains the size of the Compressed
+Data field, which MUST be non-zero. Compressed Size is stored
+using the encoding described in Section 1.2. If the Compressed
+Size doesn't match the size of the Compressed Data field, the
+decoder MUST indicate an error.
+3.1.4. Uncompressed Size
+This field is present only if the appropriate bit is set in
+the Block Flags field (see Section 3.1.2).
+The Uncompressed Size field contains the size of the Block
+after uncompressing. Uncompressed Size is stored using the
+encoding described in Section 1.2. If the Uncompressed Size
+does not match the real uncompressed size, the decoder MUST
+indicate an error.
+Storing the Compressed Size and Uncompressed Size fields serves
+several purposes:
+- The decoder knows how much memory it needs to allocate
+for a temporary buffer in multithreaded mode.
+- Simple error detection: wrong size indicates a broken file.
+- Seeking forwards to a specific location in streamed mode.
+It should be noted that the only reliable way to determine
+the real uncompressed size is to uncompress the Block,
+because the Block Header and Index fields may contain
+(intentionally or unintentionally) invalid information.
+3.1.5. List of Filter Flags
++================+================+     +================+
+| Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
++================+================+     +================+
+The number of Filter Flags fields is stored in the Block Flags
+field (see Section 3.1.2).
+The format of each Filter Flags field is as follows:
++===========+====================+===================+
+| Filter ID | Size of Properties | Filter Properties |
++===========+====================+===================+
+Both Filter ID and Size of Properties are stored using the
+encoding described in Section 1.2. Size of Properties indicates
+the size of the Filter Properties field as bytes. The list of
+officially defined Filter IDs and the formats of their Filter
+Properties are described in Section 5.3.
+Filter IDs greater than or equal to 0x4000_0000_0000_0000
+(2^62) are reserved for implementation-specific internal use.
+These Filter IDs MUST never be used in List of Filter Flags.
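A single Filter Flags field could be parsed along the following lines. This is a sketch under stated assumptions: `parse_filter_flags`, `vli_decode`, and `struct filter_flags` are hypothetical names, and `vli_decode` simply restates the decoder from Section 1.2 so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Variable-length integer decoder, as described in Section 1.2. */
static size_t
vli_decode(const uint8_t *buf, size_t size_max, uint64_t *num)
{
    if (size_max == 0)
        return 0;
    if (size_max > 9)
        size_max = 9;
    *num = buf[0] & 0x7F;
    size_t i = 0;
    while (buf[i++] & 0x80) {
        if (i >= size_max || buf[i] == 0x00)
            return 0;
        *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
    }
    return i;
}

struct filter_flags {
    uint64_t id;          /* Filter ID */
    uint64_t props_size;  /* Size of Properties in bytes */
    const uint8_t *props; /* Filter Properties; points into the input */
};

/* Returns the number of bytes consumed, or 0 on error. */
size_t
parse_filter_flags(const uint8_t *buf, size_t size,
        struct filter_flags *out)
{
    assert(buf != NULL && out != NULL);

    size_t pos = vli_decode(buf, size, &out->id);
    /* IDs >= 2^62 are reserved for implementation-internal use. */
    if (pos == 0 || out->id >= (UINT64_C(1) << 62))
        return 0;

    const size_t n = vli_decode(buf + pos, size - pos, &out->props_size);
    if (n == 0)
        return 0;
    pos += n;

    /* The properties must fit in what is left of the buffer. */
    if (out->props_size > size - pos)
        return 0;

    out->props = buf + pos;
    return pos + (size_t)out->props_size;
}
```

For example, the three bytes `21 01 16` would parse as Filter ID 0x21 with a one-byte Filter Properties field containing 0x16.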
+3.1.6. Header Padding
+This field contains as many null bytes as are needed to make
+the Block Header have the size specified in Block Header Size.
+If any of the bytes are not null bytes, the decoder MUST
+indicate an error. It is possible that there is a new field
+present which the decoder is not aware of, and can thus parse
+the Block Header incorrectly.
+3.1.7. CRC32
+The CRC32 is calculated over everything in the Block Header
+field except the CRC32 field itself. It is stored as an
+unsigned 32-bit little endian integer. If the calculated
+value does not match the stored one, the decoder MUST indicate
+an error.
+Verifying the CRC32 of the Block Header before parsing the
+actual contents allows the decoder to distinguish between
+corrupt and unsupported files.
+3.2. Compressed Data
+The format of Compressed Data depends on Block Flags and List
+of Filter Flags. Excluding the descriptions of the simplest
+filters in Section 5.3, the format of the filter-specific
+encoded data is out of scope of this document.
+3.3. Block Padding
+Block Padding MUST contain 0-3 null bytes to make the size of
+the Block a multiple of four bytes. This can be needed when
+the size of Compressed Data is not a multiple of four. If any
+of the bytes in Block Padding are not null bytes, the decoder
+MUST indicate an error.
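Since the Block Header size is a multiple of four and every Check size listed in Section 2.1.1.2 is also a multiple of four, the amount of Block Padding depends only on the size of Compressed Data. A minimal sketch, with the hypothetical function name `block_padding_size`:

```c
#include <assert.h>
#include <stdint.h>

/* Number of null bytes (0-3) that follow Compressed Data. */
unsigned int
block_padding_size(uint64_t compressed_size)
{
    /* The size of Compressed Data is required to be non-zero. */
    assert(compressed_size > 0);
    return (unsigned int)((4 - (compressed_size & 3)) & 3);
}
```

For instance, five bytes of Compressed Data would be followed by three null bytes, and eight bytes by none.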
+3.4. Check
+The type and size of the Check field depend on which bits
+are set in the Stream Flags field (see Section 2.1.1.2).
+The Check, when used, is calculated from the original
+uncompressed data. If the calculated Check does not match the
+stored one, the decoder MUST indicate an error. If the selected
+type of Check is not supported by the decoder, it SHOULD
+indicate a warning or error.
+4. Index
++-----------------+===================+
+| Index Indicator | Number of Records |
++-----------------+===================+
+     +=================+===============+-+-+-+-+
+---> | List of Records | Index Padding | CRC32 |
+     +=================+===============+-+-+-+-+
+Index serves several purposes. Using it, one can
+- verify that all Blocks in a Stream have been processed;
+- find out the uncompressed size of a Stream; and
+- quickly access the beginning of any Block (random access).
+4.1. Index Indicator
+This field overlaps with the Block Header Size field (see
+Section 3.1.1). The value of Index Indicator is always 0x00.
+4.2. Number of Records
+This field indicates how many Records there are in the List
+of Records field, and thus how many Blocks there are in the
+Stream. The value is stored using the encoding described in
+Section 1.2. If the decoder has decoded all the Blocks of the
+Stream, and then notices that the Number of Records doesn't
+match the real number of Blocks, the decoder MUST indicate an
+error.
+4.3. List of Records
+List of Records consists of as many Records as indicated by the
+Number of Records field:
++========+========+
+| Record | Record | ...
++========+========+
+Each Record contains information about one Block:
++===============+===================+
+| Unpadded Size | Uncompressed Size |
++===============+===================+
+If the decoder has decoded all the Blocks of the Stream, it
+MUST verify that the contents of the Records match the real
+Unpadded Size and Uncompressed Size of the respective Blocks.
+Implementation hint: It is possible to verify the Index with
+constant memory usage by calculating for example SHA-256 of
+both the real size values and the List of Records, then
+comparing the hash values. Implementing this using
+non-cryptographic hash like CRC32 SHOULD be avoided unless
+small code size is important.
+If the decoder supports random-access reading, it MUST verify
+that Unpadded Size and Uncompressed Size of every completely
+decoded Block match the sizes stored in the Index. If only
+part of a Block is decoded, the decoder MUST verify that the
+processed sizes don't exceed the sizes stored in the Index.
+4.3.1. Unpadded Size
+This field indicates the size of the Block excluding the Block
+Padding field. That is, Unpadded Size is the size of the Block
+Header, Compressed Data, and Check fields. Unpadded Size is
+stored using the encoding described in Section 1.2. The value
+MUST never be zero; with the current structure of Blocks, the
+actual minimum value for Unpadded Size is five.
+Implementation note: Because the size of the Block Padding
+field is not included in Unpadded Size, calculating the total
+size of a Stream or doing random-access reading requires
+calculating the actual size of the Blocks by rounding Unpadded
+Sizes up to the next multiple of four.
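The rounding described in the implementation note can be sketched as a lookup that finds which Block holds a given uncompressed offset. All names here (`find_block`, `padded_block_size`, `struct index_record`, `STREAM_HEADER_SIZE`) are hypothetical, the Records are assumed to be already decoded and verified, and overflow checks on the running sums are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct index_record {
    uint64_t unpadded_size;
    uint64_t uncompressed_size;
};

/* Header Magic Bytes (6) + Stream Flags (2) + CRC32 (4) */
#define STREAM_HEADER_SIZE 12

/* Rounds Unpadded Size up to the next multiple of four. */
static uint64_t
padded_block_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/*
 * Finds the Block that contains the given uncompressed offset. On
 * success, stores the position of the Block relative to the start of
 * the Stream in *block_offset and returns the Record number. Returns
 * count if the offset is past the end of the uncompressed data.
 */
size_t
find_block(const struct index_record *records, size_t count,
        uint64_t uncompressed_offset, uint64_t *block_offset)
{
    assert(block_offset != NULL);
    uint64_t compressed_pos = STREAM_HEADER_SIZE;
    uint64_t uncompressed_pos = 0;

    for (size_t i = 0; i < count; ++i) {
        /* uncompressed_pos never exceeds uncompressed_offset here. */
        if (uncompressed_offset - uncompressed_pos
                < records[i].uncompressed_size) {
            *block_offset = compressed_pos;
            return i;
        }
        compressed_pos += padded_block_size(records[i].unpadded_size);
        uncompressed_pos += records[i].uncompressed_size;
    }
    return count;
}
```

A linear scan is used only to keep the sketch short; an application with many Blocks would probably precompute the cumulative sums and search them.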
+The reason to exclude Block Padding from Unpadded Size is to
+ease making a raw copy of Compressed Data without Block
+Padding. This can be useful, for example, if someone wants
+to convert Streams to some other file format quickly.
+4.3.2. Uncompressed Size
+This field indicates the Uncompressed Size of the respective
+Block as bytes. The value is stored using the encoding
+described in Section 1.2.
+4.4. Index Padding
+This field MUST contain 0-3 null bytes to pad the Index to
+a multiple of four bytes. If any of the bytes are not null
+bytes, the decoder MUST indicate an error.
+4.5. CRC32
+The CRC32 is calculated over everything in the Index field
+except the CRC32 field itself. The CRC32 is stored as an
+unsigned 32-bit little endian integer. If the calculated
+value does not match the stored one, the decoder MUST indicate
+an error.
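Putting the fields of Section 4 together, an encoder could compute the total size of the Index, and from it the value to store in Backward Size (Section 2.1.2.2), roughly as follows. The names `index_size`, `stored_backward_size`, `vli_size`, and `struct index_record` are hypothetical, and the values are assumed to fit the 63-bit limit of Section 1.2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct index_record {
    uint64_t unpadded_size;
    uint64_t uncompressed_size;
};

/* Bytes occupied by the variable-length encoding of Section 1.2. */
static size_t
vli_size(uint64_t num)
{
    size_t size = 1;
    while (num >= 0x80) {
        num >>= 7;
        ++size;
    }
    return size;
}

/* Total size of the Index field in bytes, including padding and CRC32. */
uint64_t
index_size(const struct index_record *records, size_t count)
{
    assert(records != NULL || count == 0);

    /* Index Indicator (1 byte) + Number of Records */
    uint64_t size = 1 + vli_size(count);

    /* List of Records */
    for (size_t i = 0; i < count; ++i)
        size += vli_size(records[i].unpadded_size)
                + vli_size(records[i].uncompressed_size);

    /* Index Padding: round up to a multiple of four. */
    size = (size + 3) & ~UINT64_C(3);

    /* CRC32 */
    return size + 4;
}

/* Inverse of real_backward_size = (stored_backward_size + 1) * 4 */
uint32_t
stored_backward_size(uint64_t real_index_size)
{
    assert(real_index_size >= 4 && real_index_size % 4 == 0);
    return (uint32_t)(real_index_size / 4 - 1);
}
```

An Index with no Records occupies eight bytes under this calculation, which gives a stored Backward Size of 1.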
+5. Filter Chains
+The Block Flags field defines how many filters are used. When
+more than one filter is used, the filters are chained; that is,
+the output of one filter is the input of another filter. The
+following figure illustrates the direction of data flow.
+            v   Uncompressed Data   ^
+            |       Filter 0        |
+    Encoder |       Filter 1        | Decoder
+            |       Filter n        |
+            v    Compressed Data    ^
+5.1. Alignment
+Alignment of uncompressed input data is usually the job of
+the application producing the data. For example, to get the
+best results, an archiver tool should make sure that all
+PowerPC executable files in the archive stream start at
+offsets that are multiples of four bytes.
+Some filters, for example LZMA2, can be configured to take
+advantage of specified alignment of input data. Note that
+taking advantage of aligned input can be beneficial also when
+a filter is not the first filter in the chain. For example,
+if you compress PowerPC executables, you may want to use the
+PowerPC filter and chain that with the LZMA2 filter. Because
+not only the input but also the output alignment of the PowerPC
+filter is four bytes, it is now beneficial to set LZMA2
+settings so that the LZMA2 encoder can take advantage of its
+four-byte-aligned input data.
+The output of the last filter in the chain is stored to the
+Compressed Data field, which is guaranteed to be aligned
+to a multiple of four bytes relative to the beginning of the
+Stream. This can increase
+- speed, if the filtered data is handled multiple bytes at
+a time by the filter-specific encoder and decoder,
+because accessing aligned data in computer memory is
+usually faster; and
+- compression ratio, if the output data is later compressed
+with an external compression tool.
+5.2. Security
+If filters could be chained freely, it would be possible to
+create malicious files that would be very slow to decode. Such
+files could be used to mount denial-of-service attacks.
+Slow files could occur when multiple filters are chained:
+v   Compressed input data
+|   Filter 1 decoder (last filter)
+|   Filter 0 decoder (non-last filter)
+v   Uncompressed output data
+The decoder of the last filter in the chain produces a lot of
+output from little input. Another filter in the chain takes the
+output of the last filter, and produces very little output
+while consuming a lot of input. As a result, a lot of data is
+moved inside the filter chain, but the filter chain as a whole
+gets very little work done.
+To prevent these kinds of slow files, there are restrictions on
+how the filters can be chained. These restrictions MUST be
+taken into account when designing new filters.
+The maximum number of filters in the chain has been limited to
+four, thus there can be a maximum of three non-last filters.
+Of these three non-last filters, only two are allowed to change
+the size of the data.
+The non-last filters that change the size of the data MUST
+have a limit on how much the decoder can compress the data: the
+decoder SHOULD produce at least n bytes of output when the
+filter is given 2n bytes of input. This limit is not
+absolute, but significant deviations MUST be avoided.
+The above limitations guarantee that if the last filter in the
+chain produces 4n bytes of output, the chain as a whole will
+produce at least n bytes of output.
+5.3. Filters
+5.3.1. LZMA2
+LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
+compression algorithm with high compression ratio and fast
+decompression. LZMA is based on LZ77 and range coding
+algorithms.
+LZMA2 is an extension on top of the original LZMA. LZMA2 uses
+LZMA internally, but adds support for flushing the encoder,
+uncompressed chunks, eases stateful decoder implementations,
+and improves support for multithreading. Thus, the plain LZMA
+will not be supported in this file format.
+Filter ID:                  0x21
+Size of Filter Properties:  1 byte
+Changes size of data:       Yes
+Allow as a non-last filter: No
+Allow as the last filter:   Yes
+Preferred alignment:
+    Input data:             Adjustable to 1/2/4/8/16 byte(s)
+    Output data:            1 byte
+The format of the one-byte Filter Properties field is as
+follows:
+Bits   Mask   Description
+0-5    0x3F   Dictionary Size
+6-7    0xC0   Reserved for future use; MUST be zero for now.
+Dictionary Size is encoded with one-bit mantissa and five-bit
+exponent. The smallest dictionary size is 4 KiB and the biggest
+is 4 GiB.
+    Raw value   Mantissa   Exponent   Dictionary size
+        0           2         11         4 KiB
+        1           3         11         6 KiB
+        2           2         12         8 KiB
+        3           3         12        12 KiB
+        4           2         13        16 KiB
+        5           3         13        24 KiB
+        6           2         14        32 KiB
+       ...         ...        ...      ...
+       35           3         28       768 MiB
+       36           2         29      1024 MiB
+       37           3         29      1536 MiB
+       38           2         30      2048 MiB
+       39           3         30      3072 MiB
+       40           2         31      4096 MiB - 1 B
+Instead of having a table in the decoder, the dictionary size
+can be decoded using the following C code:
+    const uint8_t bits = get_dictionary_flags() & 0x3F;
+    if (bits > 40)
+        return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
+    uint32_t dictionary_size;
+    if (bits == 40) {
+        dictionary_size = UINT32_MAX;
+    } else {
+        dictionary_size = 2 | (bits & 1);
+        dictionary_size <<= bits / 2 + 11;
+    }
+5.3.2. Branch/Call/Jump Filters for Executables
+These filters convert relative branch, call, and jump
+instructions to their absolute counterparts in executable
+files. This conversion increases redundancy and thus
+compression ratio.
+Size of Filter Properties:  0 or 4 bytes
+Changes size of data:       No
+Allow as a non-last filter: Yes
+Allow as the last filter:   No
+Below is the list of filters in this category. The alignment
+is the same for both input and output data.
+    Filter ID   Alignment   Description
+      0x04       1 byte     x86 filter (BCJ)
+      0x05       4 bytes    PowerPC (big endian) filter
+      0x06      16 bytes    IA64 filter
+      0x07       4 bytes    ARM filter [1]
+      0x08       2 bytes    ARM Thumb filter [1]
+      0x09       4 bytes    SPARC filter
+      0x0A       4 bytes    ARM64 filter [2]
+      0x0B       2 bytes    RISC-V filter
+    [1] These are for little endian instruction encoding.
+        This must not be confused with data endianness.
+        A processor configured for big endian data access
+        may still use little endian instruction encoding.
+        The filters don't care about the data endianness.
+    [2] 4096-byte alignment gives the best results
+        because the address in the ADRP instruction
+        is a multiple of 4096 bytes.
+If the size of Filter Properties is four bytes, the Filter
+Properties field contains the start offset used for address
+conversions. It is stored as an unsigned 32-bit little endian
+integer. The start offset MUST be a multiple of the alignment
+of the filter as listed in the table above; if it isn't, the
+decoder MUST indicate an error. If the size of Filter
+Properties is zero, the start offset is zero.
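The rules for the start offset could be checked as in the sketch below. The function name `parse_bcj_start_offset` and the `bcj_alignment` array are hypothetical; the alignment values are copied from the table above in Filter ID order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Alignment in bytes for Filter IDs 0x04-0x0B, in table order. */
static const uint32_t bcj_alignment[8] = { 1, 4, 16, 4, 2, 4, 4, 2 };

/*
 * Decodes the Filter Properties of a Branch/Call/Jump filter.
 * Returns false if the Filter ID is not in this category, the size of
 * the properties is neither 0 nor 4, or the start offset is misaligned.
 */
bool
parse_bcj_start_offset(uint64_t filter_id, const uint8_t *props,
        size_t props_size, uint32_t *start_offset)
{
    assert(start_offset != NULL);
    if (filter_id < 0x04 || filter_id > 0x0B)
        return false;

    /* Empty properties mean a start offset of zero. */
    if (props_size == 0) {
        *start_offset = 0;
        return true;
    }
    if (props_size != 4 || props == NULL)
        return false;

    /* Unsigned 32-bit little endian integer */
    const uint32_t offset = (uint32_t)props[0]
            | ((uint32_t)props[1] << 8)
            | ((uint32_t)props[2] << 16)
            | ((uint32_t)props[3] << 24);

    if (offset % bcj_alignment[filter_id - 0x04] != 0)
        return false;

    *start_offset = offset;
    return true;
}
```

With this check, a start offset of 2 would be accepted for the ARM Thumb filter (2-byte alignment) but rejected for the PowerPC filter (4-byte alignment).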
+Setting the start offset may be useful if an executable has
+multiple sections, and there are many cross-section calls.
+Taking advantage of this feature usually requires usage of
+the Subblock filter, whose design is not complete yet.
+5.3.3. Delta
+The Delta filter may increase compression ratio when the value
+of the next byte correlates with the value of an earlier byte
+at specified distance.
+Filter ID:                  0x03
+Size of Filter Properties:  1 byte
+Changes size of data:       No
+Allow as a non-last filter: Yes
+Allow as the last filter:   No
+Preferred alignment:
+    Input data:             1 byte
+    Output data:            Same as the original input data
+The Properties byte indicates the delta distance, which can be
+1-256 bytes backwards from the current byte: 0x00 indicates
+distance of 1 byte and 0xFF distance of 256 bytes.
+5.3.3.1. Format of the Encoded Output
+The code below illustrates both encoding and decoding with
+the Delta filter.
+    // Distance is in the range [1, 256].
+    const unsigned int distance = get_properties_byte() + 1;
+    uint8_t pos = 0;
+    uint8_t delta[256];
+    memset(delta, 0, sizeof(delta));
+    while (1) {
+        const int byte = read_byte();
+        if (byte == EOF)
+            break;
+        uint8_t tmp = delta[(uint8_t)(distance + pos)];
+        if (is_encoder) {
+            tmp = (uint8_t)(byte) - tmp;
+            delta[pos] = (uint8_t)(byte);
+        } else {
+            tmp = (uint8_t)(byte) + tmp;
+            delta[pos] = tmp;
+        }
+        write_byte(tmp);
+        --pos;
+    }
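As a rough illustration, the same loop can be wrapped in a buffer-based function so that a round trip is easy to try. The name delta_code and its parameters are hypothetical and chosen only for this sketch. The logic follows the loop above, with the stream I/O replaced by input and output buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical buffer-based variant of the loop above.
 * props_byte is the Filter Properties byte (distance minus one).
 * Set is_encoder to nonzero to encode, zero to decode.
 */
static void
delta_code(uint8_t props_byte, int is_encoder,
        const uint8_t *in, uint8_t *out, size_t size)
{
    const unsigned int distance = (unsigned int)props_byte + 1;
    assert(distance >= 1 && distance <= 256);

    uint8_t pos = 0;
    uint8_t delta[256];
    memset(delta, 0, sizeof(delta));

    for (size_t i = 0; i < size; ++i) {
        /* History byte that is "distance" bytes back (zero at start). */
        uint8_t tmp = delta[(uint8_t)(distance + pos)];
        if (is_encoder) {
            tmp = (uint8_t)(in[i] - tmp);
            delta[pos] = in[i];     /* remember the original byte */
        } else {
            tmp = (uint8_t)(in[i] + tmp);
            delta[pos] = tmp;       /* remember the decoded byte */
        }
        out[i] = tmp;
        --pos;                      /* wraps from 0 to 255 */
    }
}
```

With a Properties byte of 0x00 (distance 1), encoding the bytes 10, 12, 15, 15 gives 10, 2, 3, 0, and decoding that output with the same Properties byte restores the original bytes.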
+5.4. Custom Filter IDs
+If a developer wants to use custom Filter IDs, there are two
+choices. The first choice is to contact Lasse Collin and ask
+him to allocate a range of IDs for the developer.
+The second choice is to generate a 40-bit random integer
+which the developer can use as a personal Developer ID.
+To minimize the risk of collisions, the Developer ID has to be
+a randomly generated integer, not a manually selected "hex word".
+The following command, which works on many free operating
+systems, can be used to generate a Developer ID:
+    dd if=/dev/urandom bs=5 count=1 | hexdump
+The developer can then use the Developer ID to create unique
+(well, hopefully unique) Filter IDs.
+ Bits    Mask                    Description
+ 0-15    0x0000_0000_0000_FFFF   Filter ID
+16-55    0x00FF_FFFF_FFFF_0000   Developer ID
+56-62    0x3F00_0000_0000_0000   Static prefix: 0x3F
+The resulting 63-bit integer will use 9 bytes of space when
+stored using the encoding described in Section 1.2. To get
+a shorter ID, see the beginning of this section for how to
+request a custom ID range.
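The bit layout in the table above can be sketched in C as shown below. The function names make_custom_filter_id and multibyte_size are hypothetical and exist only for this illustration. The size calculation assumes the encoding of Section 1.2 stores seven bits of the integer per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine a 40-bit Developer ID and a 16-bit
 * per-developer Filter ID with the static prefix 0x3F.
 */
static uint64_t
make_custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
    /* The Developer ID occupies bits 16-55, so it must fit in 40 bits. */
    assert(developer_id < (UINT64_C(1) << 40));

    return (UINT64_C(0x3F) << 56)
            | (developer_id << 16)
            | (uint64_t)filter_id;
}

/*
 * Hypothetical helper: number of bytes needed for a multibyte integer,
 * assuming seven bits of payload per byte.
 */
static size_t
multibyte_size(uint64_t num)
{
    size_t size = 1;
    while (num >= 0x80) {
        num >>= 7;
        ++size;
    }
    return size;
}
```

For example, a Developer ID of 0x123456789A combined with Filter ID 0x0001 gives 0x3F12_3456_789A_0001. Because the static prefix always sets bits above bit 55, multibyte_size() returns 9 for every ID built this way.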
+5.4.1. Reserved Custom Filter ID Ranges
+Range                       Description
+0x0000_0300 - 0x0000_04FF   Reserved to ease .7z compatibility
+0x0002_0000 - 0x0007_FFFF   Reserved to ease .7z compatibility
+0x0200_0000 - 0x07FF_FFFF   Reserved to ease .7z compatibility
+6. Cyclic Redundancy Checks
+There are several incompatible ways to calculate CRC32
+and CRC64. For simplicity and clarity, complete examples are
+provided to calculate the checks as they are used in this file
+format. Implementations MAY use different code as long as it
+gives identical results.
+The program below reads data from standard input, calculates
+the CRC32 and CRC64 values, and prints the calculated values
+as big endian hexadecimal strings to standard output.
+#include <stddef.h>
+#include <inttypes.h>
+#include <stdio.h>
+
+uint32_t crc32_table[256];
+uint64_t crc64_table[256];
+
+void
+init(void)
+{
+    static const uint32_t poly32 = UINT32_C(0xEDB88320);
+    static const uint64_t poly64
+            = UINT64_C(0xC96C5795D7870F42);
+
+    for (size_t i = 0; i < 256; ++i) {
+        uint32_t crc32 = i;
+        uint64_t crc64 = i;
+
+        for (size_t j = 0; j < 8; ++j) {
+            if (crc32 & 1)
+                crc32 = (crc32 >> 1) ^ poly32;
+            else
+                crc32 >>= 1;
+
+            if (crc64 & 1)
+                crc64 = (crc64 >> 1) ^ poly64;
+            else
+                crc64 >>= 1;
+        }
+
+        crc32_table[i] = crc32;
+        crc64_table[i] = crc64;
+    }
+}
+
+uint32_t
+crc32(const uint8_t *buf, size_t size, uint32_t crc)
+{
+    crc = ~crc;
+    for (size_t i = 0; i < size; ++i)
+        crc = crc32_table[buf[i] ^ (crc & 0xFF)]
+                ^ (crc >> 8);
+    return ~crc;
+}
+
+uint64_t
+crc64(const uint8_t *buf, size_t size, uint64_t crc)
+{
+    crc = ~crc;
+    for (size_t i = 0; i < size; ++i)
+        crc = crc64_table[buf[i] ^ (crc & 0xFF)]
+                ^ (crc >> 8);
+    return ~crc;
+}
+
+int
+main(void)
+{
+    init();
+
+    uint32_t value32 = 0;
+    uint64_t value64 = 0;
+    uint64_t total_size = 0;
+    uint8_t buf[8192];
+
+    while (1) {
+        const size_t buf_size
+                = fread(buf, 1, sizeof(buf), stdin);
+        if (buf_size == 0)
+            break;
+
+        total_size += buf_size;
+        value32 = crc32(buf, buf_size, value32);
+        value64 = crc64(buf, buf_size, value64);
+    }
+
+    printf("Bytes:  %" PRIu64 "\n", total_size);
+    printf("CRC-32: 0x%08" PRIX32 "\n", value32);
+    printf("CRC-64: 0x%016" PRIX64 "\n", value64);
+
+    return 0;
+}
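Since implementations MAY use different code, a table-free variant can serve as a cross-check. The sketch below is not part of the format definition. The function names crc32_bitwise and crc64_bitwise are hypothetical, and the code processes one bit at a time using the same polynomials and the same initial and final inversion as the program above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit-at-a-time CRC32, same parameters as crc32() above. */
static uint32_t
crc32_bitwise(const uint8_t *buf, size_t size, uint32_t crc)
{
    assert(buf != NULL || size == 0);

    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc >> 1)
                    ^ ((crc & 1) ? UINT32_C(0xEDB88320) : 0);
    }
    return ~crc;
}

/* Hypothetical bit-at-a-time CRC64, same parameters as crc64() above. */
static uint64_t
crc64_bitwise(const uint8_t *buf, size_t size, uint64_t crc)
{
    assert(buf != NULL || size == 0);

    crc = ~crc;
    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int j = 0; j < 8; ++j)
            crc = (crc >> 1)
                    ^ ((crc & 1) ? UINT64_C(0xC96C5795D7870F42) : 0);
    }
    return ~crc;
}
```

A common sanity check is the nine ASCII bytes "123456789" with an initial crc argument of zero: the widely published check values for these parameter sets are 0xCBF43926 for CRC32 and 0x995DC9BBDF1939FA for CRC64. As with the table-driven functions, data can be fed in several pieces by passing the previous return value as the crc argument.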
+7. References
+LZMA SDK - The original LZMA implementation
+https://7-zip.org/sdk.html
+LZMA Utils - LZMA adapted to POSIX-like systems
+https://tukaani.org/lzma/
+XZ Utils - The next generation of LZMA Utils
+https://tukaani.org/xz/
+[RFC-1952]
+GZIP file format specification version 4.3
+https://www.ietf.org/rfc/rfc1952.txt
+- Notation of byte boxes in section "2.1. Overall conventions"
+[RFC-2119]
+Key words for use in RFCs to Indicate Requirement Levels
+https://www.ietf.org/rfc/rfc2119.txt
+[GNU-tar]
+GNU tar 1.35 manual
+https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
+- Node 9.4.2 "Blocking Factor", paragraph that begins
+"gzip will complain about trailing garbage"
+- Note that this URL points to the latest version of the
+manual, and may some day not contain the note which is in
+1.35. For the exact version of the manual, download GNU
+tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz
