makeqstrdata: don't print "compression increased length" messages
This check as implemented is misleading, because it compares the compressed size in bytes (including the length indication) with the source string length in Unicode code points. For English this is approximately fair, but for Japanese it is quite unfair and produces an excess of "increased length" messages.

This message might have existed for one of two reasons:
 * to alert to an improperly functioning Huffman compression
 * to call attention to a need for a "string is stored uncompressed" case

We know by now that the Huffman compression is functioning as designed and is effective in general.

Just to be on the safe side, I did some back-of-the-envelope estimates. I considered these three replacements for "the true source string size, in bytes":

+        decompressed_len_utf8 = len(decompressed.encode('utf-8'))
+        decompressed_len_utf16 = len(decompressed.encode('utf-16be'))
+        decompressed_len_bitsize = ((1+len(decompressed)) * math.ceil(math.log(1+len(values), 2)) + 7) // 8

The third counts how many bits each character requires (fewer than 128 characters in the source character set = 7, fewer than 256 = 8, fewer than 512 = 9, etc., adding a string-terminating value) and is in some way representative of the best way we would be able to store "uncompressed strings".

The Japanese translation (largest as of writing) has just a few strings which increase by this metric. However, the loss due to expansion in those cases is outweighed by the cost of adding 1 bit per string to indicate whether it's compressed or not.

For instance, in the BOARD=trinket_m0 TRANSLATION=ja build the loss is 47 bytes over 300 strings. Adding 1 bit to each of 300 strings will cost about 37 bytes, leaving just 10 bytes, or 5 Thumb instructions, to implement the code to check and decode "uncompressed" strings in order to break even.
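The estimates above can be reproduced with a small standalone sketch. It is not part of makeqstrdata: the function names `size_estimates` and `break_even_bytes` are hypothetical, `values` is assumed to be the collection of distinct characters in the source character set, and a Thumb instruction is assumed to be 2 bytes.

```python
import math


def size_estimates(decompressed, values):
    """Return (utf8, utf16, bitsize) byte estimates for one source string.

    `values` is assumed to hold the distinct characters of the source
    character set; only its length is used.
    """
    utf8 = len(decompressed.encode('utf-8'))
    utf16 = len(decompressed.encode('utf-16be'))
    # Bits per character, with one extra value reserved as a terminator.
    bits_per_char = math.ceil(math.log(1 + len(values), 2))
    # One extra character slot for the terminator, rounded up to bytes.
    bitsize = ((1 + len(decompressed)) * bits_per_char + 7) // 8
    return utf8, utf16, bitsize


def break_even_bytes(loss_bytes, num_strings):
    """Bytes left for decode logic after paying 1 flag bit per string."""
    flag_cost = num_strings // 8  # whole bytes, matching "about 37 bytes"
    return loss_bytes - flag_cost


if __name__ == "__main__":
    # Hypothetical 300-character set, so each character needs 9 bits.
    charset = range(300)
    print(size_estimates("こんにちは", charset))
    # Figures from the trinket_m0 / ja build quoted above.
    remaining = break_even_bytes(47, 300)
    print(remaining, "bytes, i.e.", remaining // 2, "Thumb instructions")
```

With the figures quoted for the trinket_m0 Japanese build, `break_even_bytes(47, 300)` gives the 10-byte budget that the message describes as 5 Thumb instructions.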
parent ac15726e13
commit 08ed09acc6
@@ -259,8 +259,6 @@ def compress(encoding_table, decompressed, encoded_length_bits, len_translation_
                 current_bit -= 1
     if current_bit != 7:
         current_byte += 1
-    if current_byte > len(decompressed):
-        print("Note: compression increased length", repr(decompressed), len(decompressed), current_byte, file=sys.stderr)
     return enc[:current_byte]

 def qstr_escape(qst):