translations don't always use unicode code points now

Jeff Epler 2023-08-10 09:22:10 -05:00
parent 337b800ceb
commit 17015b48ad
GPG Key ID: D5BF15AB975AB4DE
1 changed files with 7 additions and 1 deletions
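
The comment added by this commit describes an 8-bit representation in which stored values 0..127 stand for themselves and 160..255 stand for another range of Unicode selected by translation_offset and translation_offstart. A minimal sketch of that expansion follows, assuming that translation_offstart is the first remapped stored value and that translation_offset is simply added to it. The SKETCH_* constants and sketch_expand are hypothetical names with made-up values (a Cyrillic window), not taken from the source; stored values 128..159 are not covered by the description and are passed through unchanged here.

```c
#include <stdint.h>

// Hypothetical stand-ins for translation_offstart and translation_offset.
// The values are illustrative only: they place stored values 160..255 onto
// U+0410..U+046F. A real build would generate them per translation.
#define SKETCH_OFFSTART 160u
#define SKETCH_OFFSET (0x0410u - 160u)

// Expand one stored 8-bit value into a Unicode code point.
// Assumption: values below SKETCH_OFFSTART represent themselves, and values
// at or above it are shifted by SKETCH_OFFSET.
static uint32_t sketch_expand(uint8_t stored) {
    uint32_t cp = stored;
    if (cp >= SKETCH_OFFSTART) {
        cp += SKETCH_OFFSET;
    }
    return cp;
}
```

Under these assumed values, sketch_expand('A') stays 'A' while sketch_expand(160) becomes U+0410. A translation whose non-ASCII code points do not fit in one 96-entry window would need the uint16_t fallback that the comment mentions.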


@@ -38,12 +38,18 @@
 // 9 in some translations sometime in the future. This length excludes
 // the trailing NUL, though notably decompress_length includes it.
 //
-// - followed by the huffman encoding of the individual UTF-16 code
+// - followed by the huffman encoding of the individual code
 // points that make up the string. The trailing "\0" is not
 // represented by a huffman code, but is implied by the length.
 // (building the huffman encoding on UTF-16 code points gave better
 // compression than building it on UTF-8 bytes)
 //
+// - If possible, the code points are represented as uint8_t values, with
+// 0..127 representing themselves and 160..255 representing another range
+// of Unicode, controlled by translation_offset and translation_offstart.
+// If this is not possible, uint16_t values are used. At present, no translation
+// requires code points not in the BMP, so this is adequate.
+//
 // - code points starting at 128 (word_start) and potentially extending
 // to 255 (word_end) (but never interfering with the target
 // language's used code points) stand for dictionary entries in a