makeqstrdata: don't print "compression incrased length" messages

This check as implemented is misleading, because it compares the
compressed size in bytes (including the length indication) with the source
string length in Unicode code points.  For English this is approximately
fair, but for Japanese this is quite unfair and produces an excess of
"increased length" messages.

This message might have existed for one of two reasons:
 * to alert to an improperly function huffman compression
 * to call attention to a need for a "string is stored uncompressed" case
We know by now that the huffman compression is functioning as designed and
effective in general.

Just to be on the safe side, I did some back-of-the-envelope estimates.
I considered these three replacements for "the true source string size, in bytes":
+    decompressed_len_utf8 = len(decompressed.encode('utf-8'))
+    decompressed_len_utf16 = len(decompressed.encode('utf-16be'))
+    decompressed_len_bitsize = ((1+len(decompressed)) * math.ceil(math.log(1+len(values), 2)) + 7) // 8

The third counts how many bits each character requires (fewer than 128
characters in the source character set = 7, fewer than 256 = 8, fewer than 512
= 9, etc, adding a string-terminating value) and is in some way representative
of the best way we would be able to store "uncompressed strings".  The Japanese
translation (largest as of writing) has just a few strings which increase by
this metric.  However, the amount of loss due to expansion in those cases is
outweighed by the cost of adding 1 bit per string to indicate whether it's
compressed or not.  For instance, in the BOARD=trinket_m0 TRANSLATION=ja build
the loss is 47 bytes over 300 strings.  Adding 1 bit to each of 300 strings will
cost about 37 bytes, leaving just 5 Thumb instructions to implement the code to
check and decode "uncompressed" strings in order to break even.
This commit is contained in:
Jeff Epler 2020-08-16 20:38:05 -05:00
parent ac15726e13
commit 08ed09acc6

View File

@ -259,8 +259,6 @@ def compress(encoding_table, decompressed, encoded_length_bits, len_translation_
current_bit -= 1 current_bit -= 1
if current_bit != 7: if current_bit != 7:
current_byte += 1 current_byte += 1
if current_byte > len(decompressed):
print("Note: compression increased length", repr(decompressed), len(decompressed), current_byte, file=sys.stderr)
return enc[:current_byte] return enc[:current_byte]
def qstr_escape(qst): def qstr_escape(qst):