Commit Graph

56 Commits

Author SHA1 Message Date
Taku Fukada
79a3796b1c Calculate the Huffman codebook without MP_QSTRs 2020-08-18 23:21:14 +09:00
Jeff Epler
08ed09acc6 makeqstrdata: don't print "compression incrased length" messages
This check as implemented is misleading, because it compares the
compressed size in bytes (including the length indication) with the source
string length in Unicode code points.  For English this is approximately
fair, but for Japanese this is quite unfair and produces an excess of
"increased length" messages.

This message might have existed for one of two reasons:
 * to alert to an improperly function huffman compression
 * to call attention to a need for a "string is stored uncompressed" case
We know by now that the huffman compression is functioning as designed and
effective in general.

Just to be on the safe side, I did some back-of-the-envelope estimates.
I considered these three replacements for "the true source string size, in bytes":
+    decompressed_len_utf8 = len(decompressed.encode('utf-8'))
+    decompressed_len_utf16 = len(decompressed.encode('utf-16be'))
+    decompressed_len_bitsize = ((1+len(decompressed)) * math.ceil(math.log(1+len(values), 2)) + 7) // 8

The third counts how many bits each character requires (fewer than 128
characters in the source character set = 7, fewer than 256 = 8, fewer than 512
= 9, etc, adding a string-terminating value) and is in some way representative
of the best way we would be able to store "uncompressed strings".  The Japanese
translation (largest as of writing) has just a few strings which increase by
this metric.  However, the amount of loss due to expansion in those cases is
outweighed by the cost of adding 1 bit per string to indicate whether it's
compressed or not.  For instance, in the BOARD=trinket_m0 TRANSLATION=ja build
the loss is 47 bytes over 300 strings.  Adding 1 bit to each of 300 strings will
cost about 37 bytes, leaving just 5 Thumb instructions to implement the code to
check and decode "uncompressed" strings in order to break even.
2020-08-16 20:50:48 -05:00
Jeff Epler
d0f9b5901e translations: document the compressed format 2020-05-28 11:30:46 -05:00
Jeff Epler
fe3e8d1589 string compression: save a few bits per string
Length was stored as a 16-bit number always.  Most translations have
a max length far less.  For example, US English translation lengths
always fit in just 8 bits.  probably all languages fit in 9 bits.

This also has the side effect of reducing the alignment of
compressed_string_t from 2 bytes to 1.

testing performed: ran in german and english on pyruler, printed messages
looked right.

Firmware size, en_US
Before: 3044 bytes free in flash
After: 3408 bytes free in flash

Firmware size, de_DE (with #2967 merged to restore translations)
Before: 1236 bytes free in flash
After: 1600 bytes free in flash
2020-05-28 08:36:08 -05:00
Jeff Epler
1a0dcb5caa makeqstrdata: reclaim some more bytes on some translations
If a translation only has unicode code points 255 and below, the "values"
array can be 8 bits instead of 16 bits.  This reclaims some code size,
e.g., in a local build, trinket_m0 / en_US reclaimed 112 bytes and de_DE
reclaimed 104 bytes.  However, languages like zh_Latn_pinyin, which use
code points above 255, did not benefit.
2019-12-02 14:49:23 -06:00
Jeff Epler
879e1041c9 makeqstrdata: fix printing of 'increased length' message 2019-12-02 10:18:48 -06:00
Jeff Epler
e06a3bbceb translation: Compress as unicode, not bytes
By treating each unicode code-point as a single entity for huffman
compression, the overall compression rate can be somewhat improved
without changing the algorithm.  On the decompression side, when
compressed values above 127 are encountered, they need to be
converted from a 16-bit Unicode code point into a UTF-8 byte
sequence.

Doing this returns approximately 1.5kB of flash storage with the
zh_Latn_pinyin translation. (292 -> 1768 bytes remaining in my build
of trinket_m0)

Other "more ASCII" translations benefit less, and in fact
zh_Latn_pinyin is no longer the most constrained translation!
(de_DE 1156 -> 1384 bytes free in flash, I didn't check others
before pushing for CI)

English is slightly pessimized, 2840 -> 2788 bytes, probably mostly
because the "values" array was changed from uint8_t to uint16_t,
which is strictly not required for an all-ASCII translation.  This
could probably be avoided in this case, but as English is not the
most constrained translation it doesn't really matter.

Testing performed: built for feather nRF52840 express and trinket m0
in English and zh_Latn_pinyin; ran and verified the localized
messages such as
    Àn xià rènhé jiàn jìnrù REPL. Shǐyòng CTRL-D chóngxīn jiāzài.
and
    Press any key to enter the REPL. Use CTRL-D to reload.
were properly displayed.
2019-12-02 09:46:46 -06:00
Jeff Epler
c4f3a02b3b makeqstrdata: permit longer "compressed" outputs
It is possible for this routine to expand some inputs, and in fact
it does for certan strings in the proposed Korean translation of
CircuitPython (#1858).  I did not determine what the maximum
expansion is -- it's probably modest, like len()/7+2 bytes or
something -- so I tried to just make enc[] an adequate
over-allocation, and then ensured that all the strings in the
proposed ko.po now worked.  The worst actual expansion seems to be a
string that goes from 65 UTF-8-encoded bytes to 68 compressed bytes
(+4.6%).  Only a few out of all strings are reported as
non-compressed.
2019-08-06 07:39:09 -05:00
Scott Shawcroft
355abc835e
Fix output overflow and make help translatable 2018-11-09 16:41:08 -08:00
Scott Shawcroft
137a30ad75
fix mpy-cross 2018-08-16 17:40:57 -07:00
Scott Shawcroft
de5a9d72dc
Compress all translated strings with Huffman coding.
This saves code space in builds which use link-time optimization.
The optimization drops the untranslated strings and replaces them
with a compressed_string_t struct. It can then be decompressed to
a c string.

Builds without LTO work as well but include both untranslated
strings and compressed strings.

This work could be expanded to include QSTRs and loaded strings if
a compress method is added to C. Its tracked in #531.
2018-08-16 17:40:57 -07:00
Scott Shawcroft
4513bd6ea3
Fix translation newlines
Escape table was incorrect
2018-08-10 16:17:03 -07:00
Scott Shawcroft
24e53ad591
Rework escaping and fix ESP build. 2018-08-09 15:58:45 -07:00
Scott Shawcroft
96ebf5bc3f
Two fixes and translate more strings.
* Fix finding translations with escaped characters.
* Add back \r to translations since its needed by screen.
2018-08-09 13:29:30 -07:00
Scott Shawcroft
933add6cd8
Support internationalisation. 2018-08-07 14:58:57 -07:00
Damien George
3678a6bdc6 py/modbuiltins: Make built-in dir support the __dir__ special method.
If MICROPY_PY_ALL_SPECIAL_METHODS is enabled then dir() will now delegate
to the special method __dir__ if the object it is listing has this method.
2018-05-10 23:14:23 +10:00
Paul Sokolovsky
9956fd0710 py/objtype: Fit qstrs for special methods in byte type.
Update makeqstrdata.py to sort strings starting with "__" to the beginning
of qstr list, so they get low qstr id's, guaranteedly fitting in 8 bits.
Then use this property to further compact op_id => qstr mapping arrays.
2017-10-21 11:06:32 +03:00
Damien George
f127bef3e4 py/makeqstrdata.py: Compute the qstr hash from bytes, not characters. 2016-09-02 14:32:47 +10:00
Damien George
202d5acd06 py/makeqstrdata.py: Allow to have double-quote characters in qstrs.
When rendering the qstr for a C header file, the double-quate character
must be escaped.
2016-05-23 15:18:55 +01:00
Damien George
a649d72606 py/makeqstrdata: Add special case to handle \n qstr. 2016-04-14 15:22:36 +01:00
Damien George
2243d68345 py/makeqstrdata: Reinstate Python2 compatibility. 2016-04-14 14:37:04 +01:00
Damien George
49bb04ee64 py/makeqstrdata: Fix rendering of qstrs that have non-printable ASCII.
The qstr data needs to be turned into a proper C string so non-ASCII
chars must be properly escaped according to C rules.
2016-04-14 14:20:25 +01:00
Damien George
f30b6f0af5 py/makeqstrdata: Add more names for escaped chars and esc non-printable.
Non-printable characters are escaped as 0xXX, where XX are the hex
digits of the character value.
2016-04-13 22:12:39 +01:00
Damien George
594fa73411 py/makeqstrdata: Factor out some code to functions that can be reused. 2016-04-13 16:05:43 +01:00
Paul Sokolovsky
53ca6ae1f3 py/makeqstrdata.py: Catch and report case of empty input file.
The usual cause would be that a cross-compiler for a port is not in PATH.
2015-10-11 11:09:57 +03:00
Tony Abboud
8d8fdcb4be stmhal: add option to query for the current usb mode
Fetch the current usb mode and return a string representation when
pyb.usb_mode() is called with no args. The possible string values are interned
as qstr's. None will be returned if an incorrect mode is set.
2015-09-03 23:30:43 +01:00
Paul Sokolovsky
3a2fb201a5 makeqstrdata.py: Typo fix in comment. 2015-07-31 14:58:14 +03:00
Damien George
c3bd9415cc py: Make qstr hash size configurable, defaults to 2 bytes.
This patch makes configurable, via MICROPY_QSTR_BYTES_IN_HASH, the
number of bytes used for a qstr hash.  It was originally fixed at 2
bytes, and now defaults to 2 bytes.  Setting it to 1 byte will save
ROM and RAM at a small expense of hash collisions.
2015-07-20 11:03:13 +00:00
Damien George
26b512ea1b py: Get makeqstrdata.py and makeversionhdr.py running under Python 2.6.
These scripts should run under as wide a range of Python versions as
possible.
2015-05-30 23:11:16 +01:00
Paul Sokolovsky
f88eec0de2 makeqstrdata.py: Add support for strings with backslash escapes. 2015-04-02 01:10:11 +03:00
Damien George
99ab64ffd4 py/makeqstrdata.py: Make it work again with both Python2 and Python3. 2015-01-11 22:40:38 +00:00
Damien George
95836f8439 py: Add MICROPY_QSTR_BYTES_IN_LEN config option, defaulting to 1.
This new config option sets how many fixed-number-of-bytes to use to
store the length of each qstr.  Previously this was hard coded to 2,
but, as per issue #1056, this is considered overkill since no-one
needs identifiers longer than 255 bytes.

With this patch the number of bytes for the length is configurable, and
defaults to 1 byte.  The configuration option filters through to the
makeqstrdata.py script.

Code size savings going from 2 to 1 byte:
- unix x64 down by 592 bytes
- stmhal down by 1148 bytes
- bare-arm down by 284 bytes

Also has RAM savings, and will be slightly more efficient in execution.
2015-01-11 22:27:30 +00:00
Damien George
6942f80a8f py: Add qstr cfg capability; generate QSTR_NULL and QSTR_ from script. 2015-01-11 22:06:53 +00:00
Damien George
56e1f99ca1 py/makeqstrdata.py: Add more allowed qstr characters; escape quot. 2015-01-11 14:16:24 +00:00
Damien George
e191d42188 py: Use % str formatting instead of {} in makeqstrdata.py.
Script is equivalent, but now also runs under ancient Python 2.6.
Goes part way to addressing issue #847.
2014-09-05 13:16:19 +01:00
Chris Angelico
de09caaa37 Bring the C and Python compute_hash functions into consistency 2014-06-07 07:06:18 +10:00
stijn
1dc7f0427b More relaxed parsing of preprocessed qstr header
The original parsing would error out on any C declarations that are not typedefs
or extern variables. This limits what can go in mpconfig.h and mpconfigport.h,
as they are included in qstr.h. For instance even a function declaration would be
rejected and including system headers is a complete no-go.
That seems too limiting for a global config header, so makeqstrdata now
ignores everything that does not match a qstr definition.
2014-05-03 10:26:31 +02:00
Damien George
708c073250 py: Add '*' qstr for 'import *'; use blank qstr for comprehension arg. 2014-04-27 19:23:46 +01:00
Damien George
897fe0c0d0 py: Add builtin functions bin and oct, and some tests for them. 2014-04-15 22:03:55 +01:00
Damien George
b013aea809 py: Fix builtin hex to print prefix.
I was too hasty.  Still a one-liner though.
2014-04-15 12:50:21 +01:00
Damien George
5805111732 py: Add hex builtin function.
A one-liner, added especially for @pfalcon :)
2014-04-15 12:42:52 +01:00
Damien George
3683789207 py: Clean up and add comments to makeqstrdata. 2014-04-14 23:38:37 +01:00
Damien George
5bb7d99175 py: Modify makeqstrdata to recognise better the output of CPP. 2014-04-13 13:16:51 +01:00
Paul Sokolovsky
73b7027b83 objstr: Add str.encode() and bytes.decode() methods.
These largely duplicate str() & bytes() constructors' functionality,
but can be used to achieve Python2 compatibility.
2014-04-13 06:45:02 +03:00
Paul Sokolovsky
a925cb54f1 py: Preprocess qstrdefs.h before feeding to makeqstrdata.py.
This is alternative implementation of supporting conditionals in qstrdefs.h,
hard to say if it's much cleaner than munging #ifdef's in Python code...
2014-04-12 00:39:55 +03:00
Paul Sokolovsky
6ea0e928d8 Revert "makeqstrdata.py: Add support for conditionally defined qstrs."
This reverts commit acb133d1b1.

Conditionals will be suported using C preprocessor.
2014-04-12 00:39:54 +03:00
Paul Sokolovsky
acb133d1b1 makeqstrdata.py: Add support for conditionally defined qstrs.
Syntax is usual C #if*/#endif, but each qstr must be wrapped individually.
2014-04-10 03:38:42 +03:00
Damien George
6e628c49ca py: Replace naive and teribble hash function with djb2. 2014-03-25 15:27:15 +00:00
Dave Hylands
0308f964a0 Fix makeqstrdata.py to work in Python 2.7 2014-03-10 00:07:35 -07:00
Damien George
fdf0da5436 makeqstrdata: print error to stderr. 2014-03-08 15:03:25 +00:00