add explanation for newer compression features

Jeff Epler 2023-08-31 13:27:16 -05:00
parent 4d8b354c13
commit 5c23e28208
1 changed file with 15 additions and 0 deletions

@@ -53,6 +53,13 @@
// speaking, words. They're just spans of code points that frequently
// occur together. They are ordered shortest to longest.
//
// - If the translation uses a lot of code points or widely spaced code points,
// then the Huffman table entries are UTF-16 code points. But if the translation
// uses only ASCII 7-bit code points plus a SMALL range of higher code points that
// still fit in 8 bits, translation_offset and translation_offstart are used to
// renumber the code points so that they still fit within 8 bits. (It's very beneficial
// for mchar_t to be 8 bits instead of 16!)
//
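Reviewer note: the renumbering described in the bullet above can be sketched as follows. This is a minimal illustration, not the project's actual code. The constants `EXAMPLE_OFFSTART` and `EXAMPLE_OFFSET`, the helper names, and the rule "stored values at or above the start value are shifted by the offset" are all assumptions made for this example; only the idea of an 8-bit `mchar_t` comes from the comment.

```c
#include <stdint.h>

/* Assumed 8-bit storage type, as the comment describes. */
typedef uint8_t mchar_t;

/* Hypothetical values: stored bytes 0x80.. map onto code points 0x0400.. */
enum {
    EXAMPLE_OFFSTART = 0x80,          /* first stored value that is remapped */
    EXAMPLE_OFFSET = 0x0400 - 0x80    /* amount added back when decoding */
};

/* Decode: turn a stored 8-bit value back into a full code point. */
static uint32_t expand_code_point(mchar_t stored) {
    uint32_t u = stored;
    if (u >= EXAMPLE_OFFSTART) {
        u += EXAMPLE_OFFSET;
    }
    return u;
}

/* Encode: squeeze a code point into 8 bits. The caller must pass either
 * an ASCII code point or one inside the small remapped range. */
static mchar_t shrink_code_point(uint32_t u) {
    if (u >= (uint32_t)EXAMPLE_OFFSTART + EXAMPLE_OFFSET) {
        u -= EXAMPLE_OFFSET;
    }
    return (mchar_t)u;
}
```

With these example values, ASCII passes through unchanged, while a code point such as U+0410 is stored as the single byte 0x90.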
// - dictionary entries are non-overlapping, and the _ending_ index of each
// entry is stored in an array. A count of words of each length, from
// minlen to maxlen, is given in the array called wlencount. From
@@ -60,6 +67,14 @@
// calculated by an efficient, small loop. (A bit of time is traded
// to reduce the size of this table indicating lengths)
//
// - Value 1 ('\1') is used to indicate that a QSTR number follows. The
// QSTR is encoded as a fixed number of bits (translation_qstr_bits), e.g.,
// 10 bits if the highest core qstr is from 512 to 1023 inclusive.
// (maketranslationdata uses a simple heuristic where any qstr >= 3
// characters long is encoded in this way; this is simple but probably not
// optimal. In fact, the rule of >= 2 characters is better for SOME languages
// on SOME boards.)
//
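Reviewer note: the fixed-width QSTR encoding in the bullet above can be illustrated with a small sketch. The helper names `qstr_bits_for` and `read_bits`, the `bit_reader_t` type, and the most-significant-bit-first order are assumptions made for this example; the comment only states that the QSTR number occupies a fixed number of bits (`translation_qstr_bits`).

```c
#include <stdint.h>

/* Smallest bit width able to represent every value in 0..highest_qstr. */
static unsigned qstr_bits_for(uint32_t highest_qstr) {
    unsigned bits = 0;
    while (highest_qstr != 0) {
        bits++;
        highest_qstr >>= 1;
    }
    return bits;
}

/* Hypothetical bit reader over a byte buffer, MSB first (assumed order). */
typedef struct {
    const uint8_t *data;
    uint32_t bitpos;
} bit_reader_t;

/* Read nbits bits as an unsigned number and advance the reader. */
static uint32_t read_bits(bit_reader_t *r, unsigned nbits) {
    uint32_t value = 0;
    for (unsigned i = 0; i < nbits; i++) {
        uint32_t byte = r->data[r->bitpos >> 3];
        uint32_t bit = (byte >> (7u - (r->bitpos & 7u))) & 1u;
        value = (value << 1) | bit;
        r->bitpos++;
    }
    return value;
}
```

Under these assumptions, a decoder that meets the marker value 1 would call `read_bits` with the fixed width to obtain the QSTR number. `qstr_bits_for` returns 10 for any highest QSTR from 512 to 1023, matching the example in the comment.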
// The "data" / "tail" construct is so that the struct's last member is a
// "flexible array". However, the _only_ member is not permitted to be
// a flexible member, so we have to declare the first byte as a separate