Most users and the CI system are running in configurations where Python
configures stdout and stderr in UTF-8 mode. However, Windows is different,
setting values like CP1252. This led to a build failure on Windows, because
makeqstrdata printed Unicode strings to its stdout, expecting them to be
encoded as UTF-8.
This script is writing (stdout) to a compiler input file and potentially
printing messages (stderr) to a log or console. Explicitly configure stdout to
use utf-8 to get consistent behavior on all platforms, and configure stderr so
that if any log/diagnostic messages are printed that cannot be displayed
correctly, they are still displayed instead of creating an error while trying
to print the diagnostic information.
I considered setting the encodings both to ascii, but this would just be
occasionally inconvenient to developers like me who want to show diagnostic
info on stderr and in comments while working with the compression code.
Closes: #3408
Massive savings. Thanks so much @ciscorn for providing the initial
code for choosing the dictionary.
This adds a bit of time to the build, both to find the dictionary
but also because (for reasons I don't fully understand), the binary
search in the compress() function no longer worked and had to be
replaced with a linear search.
I think this is because the intended invariant is that for codebook
entries that encode to the same number of bits, the entries are ordered
in ascending value. However, I mis-placed the transition from "words"
to "byte/char values" so the codebook entries for words are in word-order
rather than their code order.
Because this price is only paid at build time, I didn't care to determine
exactly where the correct fix was.
I also commented out a line to produce the "estimated total memory size"
-- at least on the unix build with TRANSLATION=ja, this led to a build
time KeyError trying to compute the codebook size for all the strings.
I think this occurs because some single unicode code point ('ァ') is
no longer present as itself in the compressed strings, due to always
being replaced by a word.
As promised, this seems to save hundreds of bytes in the German translation
on the trinket m0.
Testing performed:
- built trinket_m0 in several languages
- built and ran unix port in several languages (en, de_DE, ja) and ran
simple error-producing codes like ./micropython -c '1/0'
Two problems: The lead byte for 3-byte sequences was wrong, and one
mid-byte was not even filled in due to a missing "++"!
Apparently this was broken ever since the first "Compress as unicode,
not bytes" commit, but I believed I'd "tested" it by running on the
Pinyin translation.
This rendered at least the Korean and Japanese translations completely
illegible, affecting 5.0 and all later releases.