Compress common unicode bigrams by making code points in the range
0x80 - 0xbf (inclusive) represent them. Then, they can be greedily
encoded and the substituted code points handled by the existing Huffman
compression. Normally code points in the range 0x80-0xbf are not used
in Unicode, so we stake our own claim. Using the more arguably correct
"Private Use Area" (PUA) would mean that for scripts that only use
code points under 256 we would use more memory for the "values" table.
bigram means "two letters", and is also sometimes called a "digram".
It's nothing to do with "big RAM". For our purposes, a bigram represents
two successive unicode code points, so for instance in our build on
trinket m0 for english the most frequent are:
['t ', 'e ', 'in', 'd ', ...].
The bigrams are selected based on frequency in the corpus, but the
selection is not necessarily optimal, for these reasons I can think of:
* Suppose the corpus was just "tea" repeated 100 times. The
top bigrams would be "te", and "ea". However,
overlap, "te" could never be used. Thus, some bigrams might actually
waste space
* I _assume_ this has to be why e.g., bigram 0x86 "s " is more
frequent than bigram 0x85 " a" in English for Trinket M0, because
sequences like "can't add" would get the "t " digram and then
be unable to use the " a" digram.
* And generally, if a bigram is frequent then so are its constituents.
Say that "i" and "n" both encode to just 5 or 6 bits, then the huffman
code for "in" had better compress to 10 or fewer bits or it's a net
loss!
* I checked though! "i" is 5 bits, "n" is 6 bits (lucky guess)
but the bigram 0x83 also just 6 bits, so this one is a win of
5 bits for every "it" minus overhead. Yay, this round goes to team
compression.
* On the other hand, the least frequent bigram 0x9d " n" is 10 bits
long and its constituent code points are 4+6 bits so there's no
savings, but there is the cost of the table entry.
* and somehow 0x9f 'an' is never used at all!
With or without accounting for overlaps, there is some optimum number
of bigrams. Adding one more bigram uses at least 2 bytes (for the
entry in the bigram table; 4 bytes if code points >255 are in the
source text) and also needs a slot in the Huffman dictionary, so
adding bigrams beyond the optimim number makes compression worse again.
If it's an improvement, the fact that it's not guaranteed optimal
doesn't seem to matter too much. It just leaves a little more fruit
for the next sweep to pick up. Perhaps try adding the most frequent
bigram not yet present, until it doesn't improve compression overall.
Right now, de_DE is again the "fullest" build on trinket_m0. (It's
reclaimed that spot from the ja translation somehow) This change saves
104 bytes there, increasing free space about 6.8%. In the larger
(but not critically full) pyportal build it saves 324 bytes.
The specific number of bigrams used (32) was chosen as it is the max
number that fit within the 0x80..0xbf range. Larger tables would
require the use of 16 bit code points in the de_DE build, losing savings
overall.
(Side note: The most frequent letters in English have been said
to be: ETA OIN SHRDLU; but we have UAC EIL MOPRST in our corpus)
The expectations of displayio.Display and frambufferio.FramebufferDisplay
are different when it comes to rotation.
In displayio.Display, if you call core_construct with a WxH = 64x32
and rotation=90, you get something that is 32 pixels wide and 64 pixels
tall in the LCD's coordinate system.
This is fine, as the existing definitions were written to work like this.
With framebuffer displays, however, the underlying framebuffer (such as
RGBMatrix) says "I am WxH pixels wide in my coordinate system" and the
constructor is given a rotation; when the rotation indicates a transpose
that means "exchange rows and columns, so that to the Groups displayed
on it, there is an effectively HxW pixel region for use".
Happily, we already have a set_rotation method. Thus (modulo the time
spent debugging things anyway:) the fix is simple: Always request no
rotation from core_construct, then immediately fix up the rotation
to match what was requested.
Testing performed: 32x16 RGBMatrix on Metro M4 Express (but using
the Airlift firmware, as this is the configuration the error was reported
on):
* initially construct display at 0, 90, 180, 270 degrees
* later change angle to 0, 90, 180, 270 degrees
* no garbled display
* no safe mode crashes
e.g., allocating a 192x32x6bpp matrix would be enough to trigger this
reliably on a Metro M4 Express using the "memory hogging" layout.
Allocating 64x32x6bpp could trigger it, but somewhat unreliably.
There are several things going on here:
* we make the failing call with interrupts off
* we were throwing an exception with interrupts off
* protomatter failed badly in _PM_free when it was partially-initialized
Incorporate the fix from protomatter, switch to a non-throwing malloc
variant, and ensure that interrupts get turned back on.
This decreases the quality of the MemoryError (it cannot report the size
of the failed allocation) but allows CircuitPython to survive, rather
than faulting.
Otherwise, out of range writes would occur in tilegrid_set_tile, causing a safe mode reset.
```
Hardware watchpoint 6: -location *stack_alloc->ptr
Old value = 24652061
New value = 24641565
0x000444f2 in common_hal_displayio_tilegrid_set_tile (self=0x200002c8 <supervisor_terminal_text_grid>, x=1, y=1, tile_index=0 '\000')
at ../../shared-module/displayio/TileGrid.c:236
236 if (!self->partial_change) {
(gdb)
```
Builds of the esp32s2 targets frequently fail:
```
-- Found Git: /usr/bin/git (found version "2.28.0")
-- Initialising new submodule components/asio/asio...
warning: could not look up configuration 'remote.origin.url'. Assuming this repository is its own authoritative upstream.
Submodule 'components/asio/asio' (/home/runner/work/circuitpython/circuitpython/ports/espressif/asio.git) registered for path 'components/asio/asio'
fatal: repository '/home/runner/work/circuitpython/circuitpython/ports/espressif/asio.git' does not exist
fatal: clone of '/home/runner/work/circuitpython/circuitpython/ports/espressif/asio.git' into submodule path '/home/runner/work/circuitpython/circuitpython/ports/esp32s2/esp-idf/components/asio/asio' failed
Failed to clone 'components/asio/asio'. Retry scheduled
fatal: repository '/home/runner/work/circuitpython/circuitpython/ports/espressif/asio.git' does not exist
fatal: clone of '/home/runner/work/circuitpython/circuitpython/ports/espressif/asio.git' into submodule path '/home/runner/work/circuitpython/circuitpython/ports/esp32s2/esp-idf/components/asio/asio' failed
Failed to clone 'components/asio/asio' a second time, aborting
CMake Error at esp-idf/tools/cmake/git_submodules.cmake:48 (message):
Git submodule init failed for components/asio/asio
Call Stack (most recent call first):
esp-idf/tools/cmake/build.cmake:78 (git_submodule_check)
esp-idf/tools/cmake/build.cmake:160 (__build_get_idf_git_revision)
esp-idf/tools/cmake/idf.cmake:49 (__build_init)
esp-idf/tools/cmake/project.cmake:7 (include)
CMakeLists.txt:8 (include)
```
It's not clear how/why this happens--is it something to do with our
multithreaded build?. Attempt to clear it up by manually checking out these
submodules ourselves.
A crash like the following occurs in the unix port:
```
Program received signal SIGSEGV, Segmentation fault.
0x00005555555a2d7a in mp_obj_module_set_globals (self_in=0x55555562c860 <ulab_user_cmodule>, globals=0x55555562c840 <mp_module_ulab_globals>) at ../../py/objmodule.c:145
145 self->globals = globals;
(gdb) up
#1 0x00005555555b2781 in mp_builtin___import__ (n_args=5, args=0x7fffffffdbb0) at ../../py/builtinimport.c:496
496 mp_obj_module_set_globals(outer_module_obj,
(gdb)
#2 0x00005555555940c9 in mp_import_name (name=824, fromlist=0x555555621f10 <mp_const_none_obj>, level=0x1) at ../../py/runtime.c:1392
1392 return mp_builtin___import__(5, args);
```
I don't understand how it doesn't happen on the embedded ports, because
the module object should reside in ROM and the assignment of self->globals
should trigger a Hard Fault.
By checking VERIFY_PTR, we know that the pointed-to data is on the heap
so we can do things like mutate it.
.. hard-coding ulab for now.
It also fixes a problem where board_name was unassigned when
use_branded_name was False, which only happened at release-building
time.
Trying to change this caused multiple problems in the release process.