Difference between revisions of "Compression"

From Stunts Wiki
(Created page with "Stunts uses a custom data compression format for code and resource files to save disk space and, more importantly, the number of floppy disks used to...")
 
(RLE intro, article category, typo fixes.)
Line 23: Line 23:
  
 
  <span class="hlP">struct</span> RLE <span class="hlB">{</span>
 
  <span class="hlP">struct</span> RLE <span class="hlB">{</span>
     <span class="hlP">uint8</span> unknown
+
     <span class="hlP">uint8</span> unknown <span class="hlC">// Always 0x00, value is ignored by the game</span>
 
     <span class="hlP">uint8</span> escape_length : 7
 
     <span class="hlP">uint8</span> escape_length : 7
 
     <span class="hlP">uint8</span> no_sequence : 1
 
     <span class="hlP">uint8</span> no_sequence : 1
Line 30: Line 30:
 
     <span class="hlP">uint8</span> data<span class="hlB">[</span><span class="hlC">...</span><span class="hlB">]</span>
 
     <span class="hlP">uint8</span> data<span class="hlB">[</span><span class="hlC">...</span><span class="hlB">]</span>
 
  <span class="hlB">}</span>
 
  <span class="hlB">}</span>
 +
 +
RLE is used to reduce the length of consecutively repeated bytes by using an escape code followed by a counter to signify how many times the next byte should be repeated. It is particularly useful for compressing images, as large areas with a single colour would be represented by many consecutive identical bytes.
  
 
The RLE stage can itself have two passes if <tt>no_sequence</tt> is not set. The first pass will then expand sequences of bytes in the data. <tt>escape_codes[0]</tt> will mark the beginning and the end of a byte sequence. When the escape code is encountered in the data, the following bytes are considered a sequence until the escape code is hit again. After the end-marking escape code, a one byte length counter tells how many times the preceding byte sequence should be repeated in the output buffer. Because the escape code cannot itself be escaped, sequential byte runs can only be applied in files that have unused byte values to use as the first escape code.
 
The RLE stage can itself have two passes if <tt>no_sequence</tt> is not set. The first pass will then expand sequences of bytes in the data. <tt>escape_codes[0]</tt> will mark the beginning and the end of a byte sequence. When the escape code is encountered in the data, the following bytes are considered a sequence until the escape code is hit again. After the end-marking escape code, a one byte length counter tells how many times the preceding byte sequence should be repeated in the output buffer. Because the escape code cannot itself be escaped, sequential byte runs can only be applied in files that have unused byte values to use as the first escape code.
Line 67: Line 69:
 
With Huffman codes being paths in a binary tree, they have the property of being prefix-free, meaning that one leaf node's code cannot be the prefix of another node's code. This allows for a decoding optimisation technique known as ''prefix tables'' or ''Huffman tables''. By creating arrays of symbols and code widths where the indices are the Huffman codes, the codes can be looked up directly without traversing the tree. Every array index for non-leaf nodes between two leaf node entries are filled with data for the preceding leaf node. This way there's no need to filter out bits that belong to the next Huffman code, at the expense of common symbols taking up multiple entries in the lookup table during decompression.
 
With Huffman codes being paths in a binary tree, they have the property of being prefix-free, meaning that one leaf node's code cannot be the prefix of another node's code. This allows for a decoding optimisation technique known as ''prefix tables'' or ''Huffman tables''. By creating arrays of symbols and code widths where the indices are the Huffman codes, the codes can be looked up directly without traversing the tree. Every array index for non-leaf nodes between two leaf node entries are filled with data for the preceding leaf node. This way there's no need to filter out bits that belong to the next Huffman code, at the expense of common symbols taking up multiple entries in the lookup table during decompression.
  
Stunts utilises this optimisation for Huffman codes that are up to 8 bits long. The first byte of the code is checked in the <tt>widths[256]</tt> array. If the code's width is 8 bits or less, the same code is looked up in the <tt>symbols[256]</tt> array and written to the output buffer. To find the start of the next Huffman code, the source buffer is advanced by the code's bit width found in the <tt>widths</tt> array.
+
Stunts utilises this optimisation for Huffman codes that are up to 8 bits wide. The first byte of the code is checked in the <tt>widths[256]</tt> array. If the code's width is 8 bits or less, the same code is looked up in the <tt>symbols[256]</tt> array and written to the output buffer. To find the start of the next Huffman code, the source buffer is advanced by the code's bit width found in the <tt>widths</tt> array.
  
 
==== Offset table ====
 
==== Offset table ====
Line 73: Line 75:
  
 
=== Version differences ===
 
=== Version differences ===
The Huffman decompression in the game code of Stunts 1.0 differs from the one in its LOAD.EXE and later versions of the game. The Huffman codes are written with the least significant bit first, as opposed to the other versions that has the codes written with the opposite orientation. The game code of Stunts 1.0 also does not use the prefix tables, and the bit-stream orientation was likely changed with the introduction of this optimisation.
+
The Huffman decompression in the game code of Stunts 1.0 differs from the one in its LOAD.EXE and later versions of the game. The Huffman codes are written with the least significant bit first, as opposed to the other versions that has the codes written with the opposite orientation. The game code of Stunts 1.0 also does not use a prefix table, and the bit-stream orientation was likely changed with the introduction of this optimisation.
 +
 
 +
[[Category:Internals]]

Revision as of 06:56, 16 June 2023

Stunts uses a custom data compression format for code and resource files to save disk space and, more importantly, the number of floppy disks used to distribute the game. The data format supports both run-length encoding (RLE) and Huffman coding. They can be combined in multiple passes. The data format and the decompression code don't require any particular order or combination of algorithms. Every compressed file shipped with the game use Huffman coding, and when combined with RLE, the RLE pass is applied before Huffman.

File header

enum CompressionType {
    RLE     = 1
    Huffman = 2
}

struct Header {
    union {
        uint8 passes : 7
        uint8 compression_type : 7
    }
    uint8  multi_pass : 1
    uint32 uncompressed_size : 24
}


If the most significant bit of the first byte of the file is set, this is a multi-pass file and the remaining 7 bits are the number of passes to be processed recursively. If the multi_pass flag is not set, we are only doing a single pass and the first byte is the compression type. The next three bytes make up the uncompressed size. If the multi_pass flag was set, this is the final uncompressed size. When the multi_pass flag is not set, this is the uncompressed size of the current pass. The remaining data is processed recursively. The compression_type field decides how it is decoded.

Run length coding

struct RLE {
    uint8 unknown // Always 0x00, value is ignored by the game
    uint8 escape_length : 7
    uint8 no_sequence : 1
    uint8 escape_codes[escape_length]

    uint8 data[...]
}

RLE is used to reduce the length of consecutively repeated bytes by using an escape code followed by a counter to signify how many times the next byte should be repeated. It is particularly useful for compressing images, as large areas with a single colour would be represented by many consecutive identical bytes.

The RLE stage can itself have two passes if no_sequence is not set. The first pass will then expand sequences of bytes in the data. escape_codes[0] will mark the beginning and the end of a byte sequence. When the escape code is encountered in the data, the following bytes are considered a sequence until the escape code is hit again. After the end-marking escape code, a one byte length counter tells how many times the preceding byte sequence should be repeated in the output buffer. Because the escape code cannot itself be escaped, sequential byte runs can only be applied in files that have unused byte values to use as the first escape code.

The main RLE pass is a traditional expansion of single-byte runs, but with a twist that allows for multiple types of escape codes. The escape code's position in the header's escape code list affects its properties. When the first escape code is encountered in the source buffer, the following byte has the number of repetitions, and the next byte will be copied that many times to the output buffer. When the third escape code is encountered, the number of repetitions is stored in the following two bytes as a 16-bit integer. Then the following byte is copied that many times. All other escape codes use its position in the escape code list as the number of repetitions, and have no additional counter bytes before the data byte to be repeated. Escape codes can themselves be escaped in single-byte runs by having them expanded to only one copy.

All repetition counters have an off-by-one surplus, as the data byte itself counts as the first copy. For optimal compression, escape codes should be chosen from byte values that occur least frequently, or not at all, in the source data.

RLE decoding optimisation

Escape codes are put in an escape[256] array, where the code's byte value is the index and the escape code's position is the entry's value. This array is used to test incoming bytes for escape codes. Raw bytes will have the value 0, while other values corresponds to the escape code's property. This optimisation makes the decoding run O(1), as opposed to O(n), where n is the number of escape codes.

Huffman coding

struct Huffman {
    uint8 tree_levels : 7
    uint8 delta_coding : 1
    uint8 symbols_per_level[tree_levels]
    uint8 alphabet[SUM(symbols_per_level)]

    uint16 codes : var [...]
}


Huffman coding uses a binary tree to store data. When Huffman codes are read, each bit tells which branch to take when traversing the tree. If a leaf node is hit, we have found the code's symbol and can proceed with the next code. Frequently used symbols have shorter paths and thus takes up fewer bits in the source file.

The tree is stored by having two lists, one with symbols per level, and one with the alphabet. The alphabet is every possible symbol in the order they appear in the tree. The tree is built by going through each symbols_per_level. If there are symbols at the current level, consume them from alphabet and create a leaf nodes. If there are no symbols, and there still are more levels to create, proceed to create left and right children nodes for the next level.

The remaining data is the Huffman codes. Bit by bit is read to find leaf nodes in the Huffman tree whose symbols are appended to the output buffer. This is repeated until the output buffer has reached uncompressed_size.

The maximum Huffman tree height is 16. While the data format allows for up to 127 levels, Stunts' decompression function uses arrays with a fixed capacity of 16 for its level offset table.

Delta coding

If the flag for delta coding is set, each symbol is treated as the difference from the previous output byte instead of an independent final value. The last byte written to the output buffer is remembered and added to the next expanded symbol. While regular Huffman coding assigns shorter codes to frequently occurring values, this delta variant assigns shorter codes to differences that occurs frequently. This feature is not used by any shipped files, but the code is present in all versions that implements Huffman decoding. One can speculate that delta coding was outperformed by RLE combined with regular Huffman coding, making delta coding unnecessary.

Huffman decoding optimisations

Prefix table

With Huffman codes being paths in a binary tree, they have the property of being prefix-free, meaning that one leaf node's code cannot be the prefix of another node's code. This allows for a decoding optimisation technique known as prefix tables or Huffman tables. By creating arrays of symbols and code widths where the indices are the Huffman codes, the codes can be looked up directly without traversing the tree. Every array index for non-leaf nodes between two leaf node entries are filled with data for the preceding leaf node. This way there's no need to filter out bits that belong to the next Huffman code, at the expense of common symbols taking up multiple entries in the lookup table during decompression.

Stunts utilises this optimisation for Huffman codes that are up to 8 bits wide. The first byte of the code is checked in the widths[256] array. If the code's width is 8 bits or less, the same code is looked up in the symbols[256] array and written to the output buffer. To find the start of the next Huffman code, the source buffer is advanced by the code's bit width found in the widths array.

Offset table

As Stunts limits the prefix tables to 8 bit wide Huffman codes to save RAM, another novel optimisation is applied for codes that exceeds 8 bits. Prefixes for longer codes have the value 0x40 in the widths array, indicating that the current code is a prefix that needs further expansion. For this purpose Stunts uses two additional helper arrays: total_codes[tree_levels], which has the total number of leaf nodes that exists in the tree up to each level, and code_offsets[tree_levels], which for each level has a value that can be added to a code with the same level in order to get its index in the alphabet. This works because the codes are arranged canonically. As a code's bit width corresponds to the code's level in the Huffman tree, the width can be used to compare the current code with the level's total by getting total_codes[current_width] for each extra bit read. If the code is smaller than the total number of codes encountered, it is a leaf node, and we get the symbol by adding the level's code offset to the code with alphabet[code + code_offsets[width]]. This is more laborious than the use of prefix tables for direct lookup, but it only applies to codes that are occurring less frequently in the input file.

Version differences

The Huffman decompression in the game code of Stunts 1.0 differs from the one in its LOAD.EXE and later versions of the game. The Huffman codes are written with the least significant bit first, as opposed to the other versions that has the codes written with the opposite orientation. The game code of Stunts 1.0 also does not use a prefix table, and the bit-stream orientation was likely changed with the introduction of this optimisation.