Commit Graph

3805 Commits

Author SHA1 Message Date
Jeff Epler a8e98cda83 makeqstrdata: comment my understanding of @ciscorn's code 2020-09-16 08:28:15 -05:00
Taku Fukada d18d79ac47 Small improvements to the dictionary compression 2020-09-14 01:50:01 +09:00
Jeff Epler 15964a4750 makeqstrdata: Avoid encoding problems
Most users and the CI system are running in configurations where Python
configures stdout and stderr in UTF-8 mode.  However, Windows is different,
setting values like CP1252.  This led to a build failure on Windows, because
makeqstrdata printed Unicode strings to its stdout, expecting them to be
encoded as UTF-8.

This script is writing (stdout) to a compiler input file and potentially
printing messages (stderr) to a log or console.  Explicitly configure stdout to
use utf-8 to get consistent behavior on all platforms, and configure stderr so
that if any log/diagnostic messages are printed that cannot be displayed
correctly, they are still displayed instead of creating an error while trying
to print the diagnostic information.
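
A minimal sketch of the stream setup described above. It wraps in-memory buffers instead of the real stdout and stderr so that it behaves the same on every platform. The helper name `make_streams` and the choice of CP1252 for the diagnostic stream are assumptions for illustration. On the real streams, Python 3.7+ provides `sys.stdout.reconfigure(encoding="utf-8")` and `sys.stderr.reconfigure(errors="backslashreplace")`.

```python
import io

def make_streams(raw_out, raw_err):
    """Wrap two binary streams as a UTF-8 output and a tolerant log stream."""
    # Generated output: always UTF-8, regardless of the platform default.
    out = io.TextIOWrapper(raw_out, encoding="utf-8", newline="\n")
    # Diagnostics: keep a legacy encoding, but escape any character it
    # cannot represent instead of raising UnicodeEncodeError.
    err = io.TextIOWrapper(raw_err, encoding="cp1252",
                           errors="backslashreplace", newline="\n")
    return out, err

raw_out, raw_err = io.BytesIO(), io.BytesIO()
out, err = make_streams(raw_out, raw_err)
print("ァ", file=out)   # encoded as UTF-8 bytes
print("ァ", file=err)   # not representable in CP1252, so it is escaped
out.flush()
err.flush()
```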

I considered setting the encodings both to ascii, but this would just be
occasionally inconvenient to developers like me who want to show diagnostic
info on stderr and in comments while working with the compression code.

Closes: #3408
2020-09-12 19:43:08 -05:00
Jeff Epler 40ab5c6b21 compression: Implement ciscorn's dictionary approach
Massive savings.  Thanks so much @ciscorn for providing the initial
code for choosing the dictionary.

This adds a bit of time to the build, both to find the dictionary
and because (for reasons I don't fully understand) the binary
search in the compress() function no longer worked and had to be
replaced with a linear search.

I think this is because the intended invariant is that for codebook
entries that encode to the same number of bits, the entries are ordered
in ascending value.  However, I mis-placed the transition from "words"
to "byte/char values" so the codebook entries for words are in word-order
rather than their code order.

Because this price is only paid at build time, I didn't care to determine
exactly where the correct fix was.

I also commented out a line to produce the "estimated total memory size"
-- at least on the unix build with TRANSLATION=ja, this led to a build
time KeyError trying to compute the codebook size for all the strings.
I think this occurs because some single unicode code point ('ァ') is
no longer present as itself in the compressed strings, due to always
being replaced by a word.

As promised, this seems to save hundreds of bytes in the German translation
on the trinket m0.

Testing performed:
 - built trinket_m0 in several languages
 - built and ran unix port in several languages (en, de_DE, ja) and ran
   simple error-producing codes like ./micropython -c '1/0'
2020-09-12 10:10:45 -05:00
Scott Shawcroft 1ba28b3edc
Merge pull request #3370 from jepler/compression-bigrams
add bigram compression to makeqstrdata (save ~100 bytes on trinket m0 de_DE)
2020-09-10 11:44:56 -07:00
Scott Shawcroft 683462c1b1
Merge pull request #3326 from tannewt/native_wifi
Add native wifi API with ESP32S2 support
2020-09-10 11:20:44 -07:00
Jeff Epler bdb07adfcc translations: Make decompression clearer
Now this gets filled in with values e.g., 128 (0x80) and 159 (0x9f).
2020-09-08 19:07:53 -05:00
Jeff Epler 73858ea682 circuitpy_mpconfig: enable 3-arg pow() with CIRCUITPY_FULL_BUILD
This is needed for a port of python3's decimal.py module.
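
For reference, a short illustration of how three-argument pow() behaves in standard Python. The numbers are arbitrary and are not taken from decimal.py.

```python
# pow(base, exp, mod) computes (base ** exp) % mod without having to
# build the full power as an intermediate value.
base, exp, mod = 7, 128, 13
result = pow(base, exp, mod)
print(result == (base ** exp) % mod)
```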
2020-09-06 10:07:57 -05:00
Jeff Epler 20c2dd0c08 core: add int.bit_length() when MICROPY_CPYTHON_COMPAT is enabled
This method of integer objects is needed for a port of python3's
decimal.py module.

MICROPY_CPYTHON_COMPAT is enabled by CIRCUITPY_FULL_BUILD.
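
For reference, a short illustration of int.bit_length() as it behaves in standard Python. The sample values are arbitrary.

```python
# int.bit_length() returns the number of bits needed to represent the
# absolute value of an integer, excluding the sign and leading zeros.
samples = [0, 1, 255, 256, -1024]
lengths = [n.bit_length() for n in samples]
print(lengths)
```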
2020-09-06 09:53:16 -05:00
Scott Shawcroft 96cf60fbbd
Merge remote-tracking branch 'adafruit/main' into native_wifi 2020-09-03 16:34:56 -07:00
Scott Shawcroft 0b94638aeb
Changes based on Dan's feedback 2020-09-03 16:32:12 -07:00
Jeff Epler cbfd38d1ce Rename functions to encode_ngrams / decode_ngrams 2020-09-02 19:09:23 -05:00
Jeff Epler c34cb82ecb makeqstrdata: correct range of low code points to 0x80..0x9f inclusive
The previous range was unintentionally big and overlaps some characters
we'd like to use (and also 0xa0, which we don't intentionally use)
2020-09-02 15:52:02 -05:00
Jeff Epler 07740d19f3 add bigram compression to makeqstrdata
Compress common unicode bigrams by making code points in the range
0x80 - 0xbf (inclusive) represent them.  Then, they can be greedily
encoded and the substituted code points handled by the existing Huffman
compression.  Normally code points in the range 0x80-0xbf are not used
in Unicode, so we stake our own claim.  Using the more arguably correct
"Private Use Area" (PUA) would mean that for scripts that only use
code points under 256 we would use more memory for the "values" table.

bigram means "two letters", and is also sometimes called a "digram".
It's nothing to do with "big RAM".  For our purposes, a bigram represents
two successive unicode code points, so for instance in our build on
trinket m0 for english the most frequent are:
['t ', 'e ', 'in', 'd ', ...].
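
A rough sketch of the substitution step described above, not the actual makeqstrdata code. The function names, the sample corpus, and the use of sequential `str.replace` calls as the "greedy" encoder are assumptions for illustration. It also assumes the source strings contain no code points in the substituted range.

```python
from collections import Counter

START = 0x80  # first substituted code point, as in the commit message

def top_bigrams(corpus, count):
    """Return the `count` most frequent pairs of adjacent code points."""
    counts = Counter(s[i:i + 2] for s in corpus for i in range(len(s) - 1))
    return [bigram for bigram, _ in counts.most_common(count)]

def encode_bigrams(text, bigrams):
    """Replace each bigram, most frequent first, with a single code point."""
    for i, bigram in enumerate(bigrams):
        text = text.replace(bigram, chr(START + i))
    return text

def decode_bigrams(text, bigrams):
    """Expand substituted code points back into their bigrams."""
    out = []
    for ch in text:
        index = ord(ch) - START
        out.append(bigrams[index] if 0 <= index < len(bigrams) else ch)
    return "".join(out)

corpus = ["can't add", "index out of range", "invalid syntax"]
bigrams = top_bigrams(corpus, 4)
encoded = [encode_bigrams(s, bigrams) for s in corpus]
decoded = [decode_bigrams(s, bigrams) for s in encoded]
```

The encoded strings would then be passed to the existing Huffman stage, which treats the substituted code points like any other symbol.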

The bigrams are selected based on frequency in the corpus, but the
selection is not necessarily optimal, for these reasons I can think of:
 * Suppose the corpus was just "tea" repeated 100 times.  The
   top bigrams would be "te", and "ea".  However, due to
   overlap, "te" could never be used.  Thus, some bigrams might actually
   waste space.
    * I _assume_ this has to be why e.g., bigram 0x86 "s " is more
      frequent than bigram 0x85 " a" in English for Trinket M0, because
      sequences like "can't add" would get the "t " digram and then
      be unable to use the " a" digram.

 * And generally, if a bigram is frequent then so are its constituents.
   Say that "i" and "n" both encode to just 5 or 6 bits, then the huffman
   code for "in" had better compress to 10 or fewer bits or it's a net
   loss!
    * I checked though!  "i" is 5 bits, "n" is 6 bits (lucky guess)
      but the bigram 0x83 is also just 6 bits, so this one is a win of
      5 bits for every "in" minus overhead.  Yay, this round goes to team
      compression.
    * On the other hand, the least frequent bigram 0x9d " n" is 10 bits
      long and its constituent code points are 4+6 bits so there's no
      savings, but there is the cost of the table entry.
    * and somehow 0x9f 'an' is never used at all!
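
The bit accounting in the bullets above can be written out as a small calculation. The function names are hypothetical, and the 2-byte table entry is the lower bound mentioned in this message. The occurrence count of 10 is an arbitrary example, not a measured value.

```python
def bigram_saving_bits(first_bits, second_bits, bigram_bits):
    """Bits saved each time a bigram code replaces its two constituents."""
    return first_bits + second_bits - bigram_bits

def net_saving_bits(occurrences, saving_per_use, table_entry_bytes=2):
    """Total saving after paying for the bigram's table entry."""
    return occurrences * saving_per_use - 8 * table_entry_bytes

in_saving = bigram_saving_bits(5, 6, 6)        # "i" + "n" versus bigram 0x83
space_n_saving = bigram_saving_bits(4, 6, 10)  # " " + "n" versus bigram 0x9d

print(in_saving, net_saving_bits(10, in_saving))
print(space_n_saving, net_saving_bits(10, space_n_saving))
```

A bigram with zero saving per use still costs its table entry, so its net contribution is negative no matter how often it appears.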

With or without accounting for overlaps, there is some optimum number
of bigrams.  Adding one more bigram uses at least 2 bytes (for the
entry in the bigram table; 4 bytes if code points >255 are in the
source text) and also needs a slot in the Huffman dictionary, so
adding bigrams beyond the optimum number makes compression worse again.

If it's an improvement, the fact that it's not guaranteed optimal
doesn't seem to matter too much.  It just leaves a little more fruit
for the next sweep to pick up.  Perhaps try adding the most frequent
bigram not yet present, until it doesn't improve compression overall.
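
A sketch of the "keep adding the most frequent bigram until it stops helping" idea, under a deliberately simplified cost model: one unit per remaining code point plus 2 units per table entry. That model, the function names, and the sequential `str.replace` encoder are assumptions for illustration. A real implementation would measure the Huffman-compressed size instead.

```python
from collections import Counter

def encode_bigrams(text, bigrams, start=0x80):
    """Replace each bigram, in list order, with a single code point."""
    for i, bigram in enumerate(bigrams):
        text = text.replace(bigram, chr(start + i))
    return text

def estimated_size(corpus, bigrams, entry_cost=2):
    """Toy cost: encoded length plus a fixed cost per table entry."""
    encoded = sum(len(encode_bigrams(s, bigrams)) for s in corpus)
    return encoded + entry_cost * len(bigrams)

def choose_bigrams(corpus, max_count=32):
    """Add bigrams by descending frequency while the toy cost improves."""
    counts = Counter(s[i:i + 2] for s in corpus for i in range(len(s) - 1))
    chosen = []
    best = estimated_size(corpus, chosen)
    for bigram, _ in counts.most_common(max_count):
        size = estimated_size(corpus, chosen + [bigram])
        if size >= best:
            break  # this bigram no longer pays for its table entry
        chosen.append(bigram)
        best = size
    return chosen

chosen = choose_bigrams(["tea" * 100])
print(chosen, estimated_size(["tea" * 100], chosen))
```

With the "tea" corpus, only one of the two overlapping top bigrams is kept: once the first has been substituted, the second never matches and would only add table cost.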

Right now, de_DE is again the "fullest" build on trinket_m0.  (It's
reclaimed that spot from the ja translation somehow)  This change saves
104 bytes there, increasing free space about 6.8%.  In the larger
(but not critically full) pyportal build it saves 324 bytes.

The specific number of bigrams used (32) was chosen as it is the max
number that fit within the 0x80..0xbf range.  Larger tables would
require the use of 16 bit code points in the de_DE build, losing savings
overall.

(Side note: The most frequent letters in English have been said
to be: ETA OIN SHRDLU; but we have UAC EIL MOPRST in our corpus)
2020-09-01 17:12:22 -05:00
Scott Shawcroft f0e60da51f
Merge pull request #3310 from dhalbert/ble_hci
_bleio HCI implementation
2020-09-01 11:28:05 -07:00
Dan Halbert 6dbd369272 merge from upstream 2020-08-30 14:39:03 -04:00
Dan Halbert b27d511251 address review; use constructor for HCI Adapter 2020-08-30 14:06:48 -04:00
Jeff Epler 455226ffde builtinimport: Fix a crash with 'import ulab.linalg' on unix port only
A crash like the following occurs in the unix port:
```
Program received signal SIGSEGV, Segmentation fault.
0x00005555555a2d7a in mp_obj_module_set_globals (self_in=0x55555562c860 <ulab_user_cmodule>, globals=0x55555562c840 <mp_module_ulab_globals>) at ../../py/objmodule.c:145
145	    self->globals = globals;
(gdb) up
#1  0x00005555555b2781 in mp_builtin___import__ (n_args=5, args=0x7fffffffdbb0) at ../../py/builtinimport.c:496
496	                mp_obj_module_set_globals(outer_module_obj,
(gdb)
#2  0x00005555555940c9 in mp_import_name (name=824, fromlist=0x555555621f10 <mp_const_none_obj>, level=0x1) at ../../py/runtime.c:1392
1392	    return mp_builtin___import__(5, args);
```

I don't understand how it doesn't happen on the embedded ports, because
the module object should reside in ROM and the assignment of self->globals
should trigger a Hard Fault.

By checking VERIFY_PTR, we know that the pointed-to data is on the heap
so we can do things like mutate it.
2020-08-30 11:09:49 -05:00
Scott Shawcroft 767ca5c3dc
Merge remote-tracking branch 'adafruit/main' into native_wifi 2020-08-27 11:42:31 -07:00
Jeff Epler 2e0a109331
Merge pull request #3318 from jepler/interrupt-serial-rx
supervisor: check for interrupt during rx_chr
2020-08-25 21:01:33 -05:00
Scott Shawcroft 8b71e26abd
Merge remote-tracking branch 'adafruit/main' into native_wifi 2020-08-25 16:39:23 -07:00
Jeff Epler c0753c1afb mp_obj_print_helper: Handle a ctrl-c that comes in during printing
In #2689, hitting ctrl-c during the printing of an object with a lot of sub-objects could cause the screen to stop updating (without showing a KeyboardInterrupt).  This makes the printing of such objects actually interruptable, and also correctly handles the KeyboardInterrupt:

```
>>> l = ["a" * 100] * 200
>>> l
['aaaaaaaaaaaaaaaaaaaaaa...aaaaaaaaaaa', Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyboardInterrupt:
>>>
```
2020-08-25 11:47:50 -05:00
Scott Shawcroft 701e80a025
Make socket reads interruptable 2020-08-21 11:00:02 -07:00
Dan Halbert 0e30dd8bcc merge from upstream; working; includes debug_out code for debugging via Saleae for posterity 2020-08-20 20:29:57 -04:00
Scott Shawcroft eb8b42aff1
Add basic error handling 2020-08-19 14:23:28 -07:00
Scott Shawcroft 1034cc1217
Add espidf module. 2020-08-19 14:23:28 -07:00
Scott Shawcroft 430530c74b
SSL works until it runs out of memory 2020-08-19 14:23:28 -07:00
Scott Shawcroft c9ece21c28
SocketPool stubbed out 2020-08-19 14:22:13 -07:00
Scott Shawcroft 3860991111
Ping work and start to add socketpool 2020-08-19 14:22:13 -07:00
Scott Shawcroft c53a72d3f5
Fix ipaddress import and parse ipv4 strings 2020-08-19 14:22:13 -07:00
Scott Shawcroft c62ab6e09a
Add ipaddress 2020-08-19 14:22:12 -07:00
Scott Shawcroft 1a6f4e0fe0
Scanning WIP. Need to sort out supervisor memory 2020-08-19 14:22:12 -07:00
Scott Shawcroft c5b8401a15
First crack at native wifi API 2020-08-19 14:21:59 -07:00
Scott Shawcroft 6857f98426
Split pulseio.PWMOut into pwmio
This gives us better granularity when implementing new ports because
PWMOut is commonly implemented before PulseIn and PulseOut.

Fixes #3211
2020-08-18 13:08:33 -07:00
Scott Shawcroft 24ca5c0218
Merge pull request #3295 from tannewt/turn_off_terminalio
Turn off terminalio for ja and ko
2020-08-18 12:10:31 -07:00
Taku Fukada 79a3796b1c Calculate the Huffman codebook without MP_QSTRs 2020-08-18 23:21:14 +09:00
Scott Shawcroft d01f5dc0bd
Turn off terminalio for ja and ko
The font is missing many characters and the build needs the space.
We can optimize font storage when we get a good font.

The serial output will work as usual.
2020-08-17 17:17:59 -07:00
Jeff Epler 08ed09acc6 makeqstrdata: don't print "compression increased length" messages
This check as implemented is misleading, because it compares the
compressed size in bytes (including the length indication) with the source
string length in Unicode code points.  For English this is approximately
fair, but for Japanese this is quite unfair and produces an excess of
"increased length" messages.

This message might have existed for one of two reasons:
 * to alert to an improperly functioning huffman compression
 * to call attention to a need for a "string is stored uncompressed" case
We know by now that the huffman compression is functioning as designed and
effective in general.

Just to be on the safe side, I did some back-of-the-envelope estimates.
I considered these three replacements for "the true source string size, in bytes":
+    decompressed_len_utf8 = len(decompressed.encode('utf-8'))
+    decompressed_len_utf16 = len(decompressed.encode('utf-16be'))
+    decompressed_len_bitsize = ((1+len(decompressed)) * math.ceil(math.log(1+len(values), 2)) + 7) // 8

The third counts how many bits each character requires (fewer than 128
characters in the source character set = 7, fewer than 256 = 8, fewer than 512
= 9, etc, adding a string-terminating value) and is in some way representative
of the best way we would be able to store "uncompressed strings".  The Japanese
translation (largest as of writing) has just a few strings which increase by
this metric.  However, the amount of loss due to expansion in those cases is
outweighed by the cost of adding 1 bit per string to indicate whether it's
compressed or not.  For instance, in the BOARD=trinket_m0 TRANSLATION=ja build
the loss is 47 bytes over 300 strings.  Adding 1 bit to each of 300 strings will
cost about 37 bytes, leaving just 5 Thumb instructions to implement the code to
check and decode "uncompressed" strings in order to break even.
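
A runnable version of the three estimates above, followed by the break-even arithmetic. The function name `uncompressed_sizes`, the sample string, and the 100-entry `values` list are assumptions for illustration. The 2-byte instruction size assumes 16-bit Thumb encodings.

```python
import math

def uncompressed_sizes(decompressed, values):
    """Return (utf-8 bytes, utf-16 bytes, bit-packed bytes) for a string."""
    utf8 = len(decompressed.encode("utf-8"))
    utf16 = len(decompressed.encode("utf-16be"))
    # Bits needed per character, with one extra code for the terminator.
    bits_per_char = math.ceil(math.log(1 + len(values), 2))
    bitsize = ((1 + len(decompressed)) * bits_per_char + 7) // 8
    return utf8, utf16, bitsize

values = [chr(c) for c in range(0x20, 0x20 + 100)]  # 100 distinct characters
sizes = uncompressed_sizes("abc", values)

# Break-even estimate using the figures quoted in the message.
loss_bytes = 47                     # expansion loss over 300 strings
flag_cost_bytes = 300 // 8          # 1 flag bit per string, about 37 bytes
remaining_bytes = loss_bytes - flag_cost_bytes
thumb_instructions = remaining_bytes // 2
print(sizes, remaining_bytes, thumb_instructions)
```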
2020-08-16 20:50:48 -05:00
Jeff Epler cff448205f Don't define SHARPDISPLAY when !DISPLAYIO
.. even if FULL_BUILD
2020-08-12 07:39:28 -05:00
Jeff Epler c1400bae9b sharpmemory: Implement support for Sharp Memory Displays in framebufferio 2020-08-12 07:32:18 -05:00
Jeff Epler 93b373d617 "pop from empty %q"
Saves 12 bytes code on trinket m0
2020-08-04 18:42:09 -05:00
Jeff Epler 65e26f4a06 py: mp_obj_get_type_qstr as macro saves 24 bytes 2020-08-04 14:45:45 -05:00
Jeff Epler 024c8da578 Combine some "can't convert" messages 2020-08-04 14:45:45 -05:00
Jeff Epler c849b781c0 Combine 'index out of range' messages 2020-08-04 14:45:45 -05:00
Jeff Epler 89797fd3f9 various: Use mp_obj_get_type_qstr more widely
This removes runtime allocations of the cstring version of the qstring.

It is not a size improvement
2020-08-04 14:45:45 -05:00
Jeff Epler c37a25f0e5 Use qstrs to save an additional 4 bytes 2020-08-04 14:45:45 -05:00
Jeff Epler 92917b84f1 fix exception type for pop from empty set 2020-08-04 13:58:29 -05:00
Jeff Epler 67eb93fc98 py: introduce, use mp_raise_msg_vlist
This saves a very small amount of flash, 8 bytes on trinket_m0
2020-08-04 13:34:29 -05:00
Jeff Epler dddd25a776 Combine similar strings to reduce size of translations
This is a slight trade-off with code size, in places where a "_varg"
mp_raise variant is now used.  The net savings on trinket_m0 is
just 32 bytes.

It also means that the translation will include the original English
text, and cannot be translated.  These are usually names of Python
types such as int, set, or dict or special values such as "inf" or
"Nan".
2020-08-04 13:34:29 -05:00
Dan Halbert 0a60aee3e4 wip: compiles 2020-08-02 11:36:38 -04:00