This allows configuring the pre-allocated size of sys.modules dict, in
order to prevent unwanted reallocations at run-time (3 sys-modules is
really not quite enough for a larger project).
When building with STATIC undefined (e.g., -DSTATIC=), there are two
instances of mp_type_code that collide at link time: in profile.c and in
builtinevex.c. This patch resolves the collision by renaming one of them.
This allows the compiler to merge strings: e.g. "update",
"difference_update" and "symmetric_difference_update"
will all point to the same memory.
Shaves ~1KB off the image size, and potentially allows
bigger savings if qstr attrs are initialized in qstr_init(),
and not stored in the image.
The parts that are generic are added to py/ so they can be used by other
ports that use CMake.
py/usermod.cmake:
* Creates a usermod target to hang user C/CXX modules from.
* Gathers sources from user C/CXX modules and libs for QSTR scan.
ports/rp2/CMakeLists.txt:
* Includes py/usermod.cmake.
* Links the resulting usermod library to the MicroPython target.
py/mkrules.cmake:
Add cxxflags to qstr.i.last custom command for CXX modules:
* MICROPY_CPP_FLAGS so CXX modules will find includes.
* -DNO_QSTR to fix fatal error missing "genhdr/qstrdefs.generated.h".
Usage:
The rp2 port can be linked against user C modules by running:
make USER_C_MODULES=/path/to/module/micropython.cmake
CMake will print a list of included modules.
Co-authored-by: Graham Sanderson <graham.sanderson@raspberrypi.org>
Co-authored-by: Michael O'Cleirigh <michael.ocleirigh@rivulet.ca>
Signed-off-by: Phil Howard <phil@pimoroni.com>
mp_printf should be used to print the prefix because it's also used in
mp_bytecode_print2 (otherwise, depending on the system, different output
streams may be used).
Also print the current thread state when threading is enabled to easily see
which thread executes what opcode.
Signed-off-by: Damien George <damien@micropython.org>
This adds some additional code in mkfs which doesn't seem necessary;
disabling it saves 172 bytes of flash.
Testing performed: Using a Feather M0 Adalogger, checked that
* an sdcard could still be mounted (using adafruit_sdcard)
* os.listdir() of "/" and "/sd" worked
* CIRCUITPY still mounted
This also adds a bit of code everywhere we DISPATCH(), but the net is
+232 bytes free on Feather M0 Adalogger.
Key assumption: All of the offsets in mp_execute_bytecode fit in 16 bits;
it is not clear whether the compiler will verify this assumption (e.g.,
by warning that a constant will be truncated)
* Always clear the peripheral interrupt so we don't hang when full
* Store the ringbuf in the object so it gets collected when we're alive
* Make UART objects have a finaliser so they are deinited when their
memory is freed
* Copy bytes into the ringbuf from the FIFO after we read to ensure
the interrupt is enabled ASAP
* Copy bytes into the ringbuf from the FIFO before measuring our
rx available because the interrupt is based on a threshold (not
> 0). For example, a single byte won't trigger an interrupt.
This allows a port to specify a custom qstrdefsport.h file, the same as the
QSTR_DEFS variable in a Makefile.
Signed-off-by: Damien George <damien@micropython.org>
The core cmake rules use custom commands to invoke qstr processing
scripts. For the zephyr port, it's possible that list arguments to these
commands may contain generator expressions, therefore we need to expand
them properly.
Signed-off-by: Maureen Helm <maureen.helm@nxp.com>
For certain operands to mpn_div, the existing code path for
`DIG_SIZE == MPZ_DBL_DIG_SIZE / 2` had a bug in it where borrow could still
overflow in the `(x >= *n || *n - x <= borrow)` branch, ie
`borrow + x - (mpz_dbl_dig_t)*n` overflows the borrow variable. In such
cases the subsequent right-shift of borrow would not bring in the overflow
bit, leading to an error in the result. An example division that had
overflow when MPZ_DIG_SIZE = 16 is `(2 ** 48 - 1) ** 2 // (2 ** 48 - 1)`.
This is fixed in this commit by simplifying the code and handling the low
digits of borrow first, and then the upper bits (to shift down) separately.
There is no longer a distinction between `DIG_SIZE < MPZ_DBL_DIG_SIZE / 2`
and `DIG_SIZE == MPZ_DBL_DIG_SIZE / 2`.
This commit also simplifies the second part of the calculation so that
borrow does not need to be negated (instead the code just works knowing
that borrow is negative and using + instead of - in calculations involving
borrow).
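The example operands above can be sanity-checked on a host Python. This is
only a sketch: the helper to_digits() is hypothetical and simply splits the
values into 16-bit digits the way an MPZ_DIG_SIZE = 16 build would store
them; it does not reproduce the mpn_div code path itself.

```python
DIG_SIZE = 16  # matches the MPZ_DIG_SIZE = 16 case mentioned above
DIG_MASK = (1 << DIG_SIZE) - 1


def to_digits(value):
    """Split a non-negative int into little-endian DIG_SIZE-bit digits."""
    digits = []
    while value:
        digits.append(value & DIG_MASK)
        value >>= DIG_SIZE
    return digits or [0]


divisor = 2 ** 48 - 1
dividend = divisor ** 2
quotient = dividend // divisor  # a correct implementation gives back the divisor

divisor_digits = to_digits(divisor)    # three digits, all 0xFFFF
dividend_digits = to_digits(dividend)  # six digits
```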
Fixes #6777.
Signed-off-by: Damien George <damien@micropython.org>
The "word" referred to by BYTES_PER_WORD is actually the size of mp_obj_t
which is not always the same as the size of a pointer on the target
architecture. So rename this config value to better reflect what it
measures, and also prefix it with MP_.
For uses of BYTES_PER_WORD in setting the stack limit this has been
changed to sizeof(void *), because the stack usually grows with
machine-word sized values (eg an nlr_buf_t has many machine words in it).
Signed-off-by: Damien George <damien@micropython.org>
It's only used in one location, to test if << or >> will overflow when
shifting mp_uint_t. For such a test it's clearer to use sizeof(lhs_val),
which will be valid even if the type of lhs_val changes.
Signed-off-by: Damien George <damien@micropython.org>
This environment variable, if defined during the build process,
indicates a fixed time that should be used in place of "now" when
such a time is explicitly referenced.
This allows for reproducible builds of MicroPython.
See https://reproducible-builds.org/specs/source-date-epoch/
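A minimal sketch of how a build script might honour the variable. The
function name build_date() and its environ parameter are assumptions made
for illustration; this is not the code added by the commit.

```python
import datetime
import os


def build_date(environ=os.environ):
    """Return the date to embed in the build, honouring SOURCE_DATE_EPOCH."""
    epoch = environ.get("SOURCE_DATE_EPOCH")
    if epoch is not None:
        # The value is a count of seconds since the Unix epoch, read as UTC.
        stamp = datetime.datetime.fromtimestamp(int(epoch), tz=datetime.timezone.utc)
        return stamp.date()
    # Fall back to "now" when the variable is not set.
    return datetime.date.today()
```

With SOURCE_DATE_EPOCH=946684800 in the environment the embedded date is
2000-01-01 regardless of when the build runs.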
Signed-off-by: iTitou <moiandme@gmail.com>
This should be enabled when the mp_raw_code_save_file function is needed.
It is enabled for mpy-cross, and a check for defined(__APPLE__) is added to
cover Mac M1 systems.
It practically does the same as qstr_from_str and was only used in one
place, which should actually use the compile-time MP_QSTR_XXX form for
consistency; qstr_from_str is for runtime strings only.
Adds a new compile-time option MICROPY_EMIT_THUMB_ARMV7M which is enabled
by default (to get existing behaviour) and which should be disabled (set to
0) when building native emitter support (@micropython.native) on ARMv6M
targets.
This returns a reference to the globals dict associated with the function,
ie the global scope that the function was defined in. This attribute is
read-only but the dict itself is modifiable, per CPython behaviour.
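The CPython behaviour being matched can be seen with a short host-side
sketch (the function bump() and the variable names are made up for the
example):

```python
counter = 0


def bump():
    global counter
    counter += 1
    return counter


# The dict behind __globals__ can be modified...
bump.__globals__["counter"] = 41
result = bump()

# ...but the attribute itself cannot be rebound.
readonly_error = None
try:
    bump.__globals__ = {}
except (AttributeError, TypeError) as exc:
    readonly_error = type(exc).__name__
```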
Signed-off-by: Damien George <damien@micropython.org>
The RP2040 is a new microcontroller from Raspberry Pi that features
two Cortex-M0+ cores and eight PIO state machines that are good for
crunching lots of data. It has 264k RAM and a built in UF2
bootloader too.
Datasheet: https://pico.raspberrypi.org/files/rp2040_datasheet.pdf
Several issues have been found in the implementation. While they're
unresolved, it may be better to disable the built-in module. (This
means that to work on fixing the module, it'll be necessary to
revert this commit)
* Better messaging when code is stopped by an auto-reload.
* Auto-reload works during sleeps on ESP32-S2. Ticks wake up the
main task each time.
* Made internal naming consistent. CamelCase Python names are NOT
separated by an underscore.
As a general pattern, required positional arguments that are not named do
not need to be parsed using mp_arg_parse_all().
Signed-off-by: Damien George <damien@micropython.org>
This changes lots of files to unify `board.h` across ports. It adds
`board_deinit` when CIRCUITPY_ALARM is set. `main.c` uses it to
deinit the board before deep sleeping (even when pretending.)
Deep sleep is now a two step process for the port. First, the
port should prepare to deep sleep based on the given alarms. It
should set alarms for both deep and pretend sleep. In particular,
the pretend versions should be set immediately so that we don't
miss an alarm as we shut down. These alarms should also wake from
`port_idle_until_interrupt` which is used when pretending to deep
sleep.
Second, when real deep sleeping, `alarm_enter_deep_sleep` is called.
The port should set any alarms it didn't during prepare based on
data it saved internally during prepare.
ESP32-S2 sleep is a bit reorganized to locate more logic with
TimeAlarm. This will help it scale to more alarm types.
Fixes #3786
Two issues are tackled:
1. The calculation of the correct length to print is fixed to treat the
precision as a maximum length instead of as the exact length.
This is done for both qstr (%q) and for regular str (%s).
2. Replace the incorrect use of mp_printf("%.*s") with mp_print_strn().
Because of the fix for the issue above, some test cases that print
an embedded null byte (^@ in the test output) would now fail.
The bug here is that "%s" was used to print null bytes. Instead,
mp_print_strn is used to make sure all bytes are output and the
exact length is respected.
Test-cases are added for both %s and %q with a combination of precision
and padding specifiers.
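The intended semantics match host Python's %-formatting, which the sketch
below imitates. format_str() is a hypothetical helper, not the C
implementation: the precision caps the length and the width pads the result.

```python
def format_str(value, width=0, precision=None):
    """Mimic %<width>.<precision>s: precision is a maximum, width pads."""
    if precision is not None:
        value = value[:precision]  # never longer than the precision
    return value.rjust(width)      # a shorter string is not padded to precision


short = format_str("ab", precision=5)           # string shorter than precision
clipped = format_str("abcdef", precision=3)     # string longer than precision
padded = format_str("abcdef", width=5, precision=3)
```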
This allows calls to `allocate_memory()` while the VM is running. Such a call allocates from the GC heap (unless there is a suitable hole among the supervisor allocations). When the VM exits and the GC heap is freed, the allocation is moved to the bottom of the former GC heap and transformed into a proper supervisor allocation. Existing movable allocations are also moved to defragment the supervisor heap and ensure that the next VM run gets as much memory as possible for the GC heap.
By itself this breaks terminalio because it violates the assumption that supervisor_display_move_memory() still has access to an undisturbed heap to copy the tilegrid from. It will work in many cases, but if you're unlucky you will get garbled terminal contents after exiting from the VM run that created the display. This will be fixed in the following commit, which is separate to simplify review.
`pow(a, b, c)` can compute `(a ** b) % c` efficiently (in time and memory).
This can be useful for extremely specific applications, like implementing
the RSA cryptosystem. For typical uses of CircuitPython, this is not an
important feature. A survey of the bundle and learn system didn't find
any uses.
Disable it on M0 builds so that we can fit in needed upgrades to the USB
stack.
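For code that still needs modular exponentiation on a build without the
3-argument form, a pure-Python square-and-multiply routine is one possible
fallback. This is a sketch, not part of the commit; modpow() is a
hypothetical name.

```python
def modpow(base, exponent, modulus):
    """Compute (base ** exponent) % modulus for a non-negative exponent.

    Intermediate values stay below modulus ** 2, so base ** exponent is
    never materialised in full.
    """
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            result = (result * base) % modulus
        base = (base * base) % modulus
        exponent >>= 1
    return result


value = modpow(4, 13, 497)
```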
Disable certain classes of diagnostic when building ulab. We should
submit patches upstream to (A) fix these errors and (B) upgrade their
CI so that the problems are caught before we want to integrate with
CircuitPython, but not right now.
Also known as L2CAP "connection oriented channels". This provides a
socket-like data transfer mechanism for BLE.
Currently only implemented for NimBLE on STM32 / Unix.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
I like to use local makefile overrides, in the file GNUmakefile
(or, on case-sensitive systems, makefile) to set compilation choices.
However, writing
TRANSLATION := de_DE
include Makefile
did not work, because py.mk would override the TRANSLATION := specified
in an earlier part of the makefiles (but not from the commandline).
By using ?= instead of := the local makefile override works, but when
TRANSLATION is not specified it continues to work as before.
This ensures that only the translate("") alternative that will be used
is seen after preprocessing. Improves the quality of the Huffman encoding
and reduces binary size slightly.
Also makes one "enhanced" error message only occur when ERROR_REPORTING_DETAILED:
Instead of the word-for-word python3 error message
"Type object has no attribute '%q'", the message will be
"'type' object has no attribute '%q'". Also reduces binary size.
(that's rolled into this commit as it was right next to a change to
use the preprocessor for MICROPY_ERROR_REPORTING)
Note that the odd semicolon after "value_error:" in parsenum.c is necessary
due to a detail of the C grammar, in which a declaration cannot follow
a label directly.
This reclaims over 1kB of flash space by simplifying certain exception
messages. e.g., it will no longer display the requested/actual length
when a fixed list/tuple of N items is needed:
    if (MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_TERSE) {
        mp_raise_ValueError(translate("tuple/list has wrong length"));
    } else {
        mp_raise_ValueError_varg(translate("requested length %d but object has length %d"),
            (int)len, (int)seq_len);
    }
Other chip families including samd51 keep their current error reporting
capabilities.
* No weak link for modules. It only impacts _os and _time and is
already disabled for non-full builds.
* Turn off PA00 and PA01 because they are the crystal on the Metro
M0 Express.
* Change ejected default to false to move it to BSS. It is set on
USB connection anyway.
* Set sinc_filter to const. Doesn't help flash but keeps it out of
RAM.
This gives a substantial speedup of the preprocessing step, i.e. the
generation of qstr.i.last. For example on a clean build, making
qstr.i.last:
21s -> 4s on STM32 (WB55)
8.9s -> 1.8s on Unix (dev).
Done in collaboration with @stinos.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
Support C++ code in .cpp files by providing CXX counterparts of the
_USERMOD_ flags we have for C already. This merely enables the Makefile of
user C modules to use variables specific to C++ compilation, it is still up
to each port's main Makefile to also include these in the build.
When SRC_QSTR contains C++ files they should be preprocessed with the same
compiler flags (CXXFLAGS) as they will be compiled with, to make sure code
scanned for QSTR occurrences is effectively the code used in the rest of
the build. The 'split SRC_QSTR in .c and .cpp files and process each with
different flags' logic isn't trivial to express in a Makefile and the
existing principle for deciding which files to preprocess was already
rather complicated, so the actual preprocessing is moved into
makeqstrdefs.py completely.
When process_file() is passed a preprocessed C++ file for instance it won't
find any lines containing .c files and the last_fname variable remains
None, so handle that gracefully.
Newer GCC versions are able to warn about switch cases that fall
through. This is usually a sign of a forgotten break statement, but in
the few cases where a fall through is intended we annotate it with this
macro to avoid the warning.
Like Clang, GCC warns about this file, but only with -Woverride-init
which is enabled by -Wextra. Disable the warnings for this file just
like we do for Clang to make -Wextra happy.
When compiling with -Wextra which includes -Wmissing-field-initializers
GCC will warn that the defval field of mp_arg_val_t is not initialized.
This is just a warning as it is defined to be zero initialized, but since
it is a union it makes sense to be explicit about which member we're
going to use, so add the explicit initializers and get rid of the
warning.
On x86 chars are signed, but we're comparing a char to '0' + unsigned int,
which is promoted to an unsigned int. Let's promote the char to unsigned
before doing the comparison to avoid weird corner cases.
The function scope_find_or_add_id used to take a scope_kind_t enum and
save it in an uint8_t. Saving an enum in a uint8_t is fine, but
everywhere this function is called it is not actually given a
scope_kind_t but an anonymous enum instead. Let's give this enum a name
and use that as the argument type.
This doesn't change the generated code, but is a C type mismatch that
unfortunately doesn't show up unless you enable -Wenum-conversion.
This gets a further speedup of about 2s (12s -> 9.5s elapsed build time)
for stm32f405_feather
For what are probably historical reasons, the qstr process involves
preprocessing a large number of source files into a single "qstr.i.last"
file, then reading this and splitting it into one "qstr" file for each
original source ("*.c") file.
By eliminating the step of writing qstr.i.last as well as making the
regular-expression-matching part be parallelized, build speed is further
improved.
Because the step to build QSTR_DEFS_COLLECTED does not access
qstr.i.last, the path is replaced with "-" in the Makefile.
Rather than simply invoking gcc in preprocessor mode with a list of files, use
a Python script with the (python3) ThreadPoolExecutor to invoke the
preprocessor in parallel.
The amount of concurrency is the number of system CPUs, not the makefile "-j"
parallelism setting, because there is no simple and correct way for a Python
program to correctly work together with make's idea of parallelism.
This reduces the build time of stm32f405 feather (a non-LTO build) from 16s to
12s on my 16-thread Ryzen machine.
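The approach can be sketched as follows. preprocess_all() and
fake_preprocess() are hypothetical names; the stand-in function only
appends a suffix so that the example runs anywhere, whereas the real script
would invoke the C preprocessor on each file.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor


def preprocess_all(sources, preprocess_one):
    """Run preprocess_one(path) for every path, one worker per system CPU."""
    workers = multiprocessing.cpu_count()  # deliberately not make's -j value
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if workers finish out of order.
        return list(pool.map(preprocess_one, sources))


def fake_preprocess(path):
    # Stand-in for running the preprocessor in a subprocess (assumption).
    return path + ".i"


outputs = preprocess_all(["a.c", "b.c", "c.c"], fake_preprocess)
```

Threads are sufficient here because each worker would spend its time
waiting on an external compiler process rather than running Python code.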
Some examples of improved compliance with CPython that currently
have divergent behavior in CircuitPython are listed below:
* yield from is not allowed in async methods
```
>>> async def f():
... yield from 'abc'
...
Traceback (most recent call last):
File "<stdin>", line 2, in f
SyntaxError: 'yield from' inside async function
```
* await only works on awaitable expressions
```
>>> async def f():
... await 'not awaitable'
...
>>> f().send(None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in f
AttributeError: 'str' object has no attribute '__await__'
```
* only __await__()able expressions are awaitable
Okay, this one actually does not work in CircuitPython at all today.
This is how CPython works though and pretending __await__ does not
exist will only bite users who write both.
```
>>> class c:
... pass
...
>>> def f(self):
... yield
... yield
... return 'f to pay respects'
...
>>> c.__await__ = f # could just as easily have put it on the class but this shows how it's wired
>>> async def g():
... awaitable_thing = c()
... partial = await awaitable_thing
... return 'press ' + partial
...
>>> q = g()
>>> q.send(None)
>>> q.send(None)
>>> q.send(None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration: press f to pay respects
```
This adds the `async def` and `await` verbs to valid CircuitPython syntax using the MicroPython implementation.
Consider:
```
>>> class Awaitable:
... def __iter__(self):
... for i in range(3):
... print('awaiting', i)
... yield
... return 42
...
>>> async def wait_for_it():
... a = Awaitable()
... result = await a
... return result
...
>>> task = wait_for_it()
>>> next(task)
awaiting 0
>>> next(task)
awaiting 1
>>> next(task)
awaiting 2
>>> next(task)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration: 42
>>>
```
and more excitingly:
```
>>> async def it_awaits_a_subtask():
... value = await wait_for_it()
... print('twice as good', value * 2)
...
>>> task = it_awaits_a_subtask()
>>> next(task)
awaiting 0
>>> next(task)
awaiting 1
>>> next(task)
awaiting 2
>>> next(task)
twice as good 84
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration:
```
Note that this is just syntax plumbing, not an all-encompassing implementation of an asynchronous task scheduler or asynchronous hardware APIs.
uasyncio might be a good module to bring in, or something else - but the standard Python syntax does not _strictly require_ deeper hardware
support.
MicroPython implements the await verb via the __iter__ function rather than __await__. It's okay.
The syntax being present will enable users to write clean and expressive multi-step state machines that are written serially and interleaved
according to the rules provided by those users.
Given that this does not include an all-encompassing C scheduler, this is expected to be an advanced functionality until the community settles
on the future of deep hardware support for async/await in CircuitPython. Users will implement yield-based schedulers and tasks wrapping
synchronous hardware APIs with polling to avoid blocking, while their application business logic gets simple `await` statements.
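As an illustration of the kind of user-written, yield-based scheduler
described above, here is a minimal round-robin sketch. The names Yield,
blink and run_all are hypothetical and nothing here is provided by this
change. It is written for host Python, which looks up __await__; the
__iter__ alias is an assumption intended to cover the lookup described
above.

```python
class Yield:
    """Awaitable that hands control back to the scheduler exactly once."""

    def __await__(self):
        yield

    __iter__ = __await__  # alias for implementations that look up __iter__


async def blink(name, count, log):
    for i in range(count):
        log.append((name, i))  # stand-in for polling or toggling hardware
        await Yield()          # let the other tasks run


def run_all(tasks):
    """Round-robin scheduler: step each task until all have finished."""
    queue = list(tasks)
    while queue:
        task = queue.pop(0)
        try:
            task.send(None)
        except StopIteration:
            continue           # task finished, do not requeue it
        queue.append(task)


log = []
run_all([blink("a", 2, log), blink("b", 2, log)])
```

The two tasks interleave one step at a time even though each is written as
straight-line code.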
Some downstream projects may use tags in their repositories for more than
just designating MicroPython releases. In those cases, the
makeversionhdr.py script would end up using a different tag than intended.
So tell `git describe` to only match tags that look like a MicroPython
version tag, such as `v1.12` or `v2.0`.
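The effect of such a filter can be approximated on the host with fnmatch.
Both the glob below and the helper version_tags() are assumptions for
illustration, not the exact pattern passed to `git describe`, and git's own
glob matching may differ from fnmatch in corner cases.

```python
import fnmatch

VERSION_TAG_GLOB = "v[1-9].*"  # assumed pattern, for illustration only


def version_tags(tags):
    """Keep only tags that look like version tags."""
    return [t for t in tags if fnmatch.fnmatchcase(t, VERSION_TAG_GLOB)]


tags = ["v1.12", "v2.0", "board-rev-b", "release-2020-09"]
matched = version_tags(tags)
```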
This already begins obscuring things, because now there are two sets of
shared-module functions for manipulating the same structure, e.g.,
common_hal_canio_remote_transmission_request_get_id and
common_hal_canio_message_get_id
Calling the bytes constructor on a bytes object returns the original bytes
object. This saves allocating a new instance, and matches CPython.
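The CPython behaviour being matched, shown as a small sketch (variable
names are arbitrary):

```python
data = b"abc"

same = bytes(data)             # bytes in, the same object comes back
copy = bytes(bytearray(data))  # other buffer types still produce a new bytes

is_same_object = same is data
```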
Signed-off-by: Iyassou Shimels <s.iyassou@gmail.com>
New contributor @mdroberts1243 encountered an interesting problem in
which the argument they had named "column_underscore_and_page_addressing"
simply couldn't be used; I discovered that internally this had been
transformed into "column_underscore∧page_addressing", because QSTR
makes _ENTITY_ stand for the same thing as &ENTITY; does in HTML.
This might be nice for some things, but we don't want it here!
I was unable to find a sensible way to "escape" and prevent this entity
coding, so instead I ripped out support for the _and_ and _or_ escapes.
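The mangling can be reproduced with a short sketch. This is not the actual
QSTR generation code: expand_escapes() is a hypothetical helper that only
handles the two escapes named above, using the standard HTML entity table.

```python
from html.entities import name2codepoint


def expand_escapes(name, escapes=("and", "or")):
    """Replace _ENTITY_ with the character that &ENTITY; denotes in HTML."""
    for entity in escapes:
        name = name.replace("_%s_" % entity, chr(name2codepoint[entity]))
    return name


mangled = expand_escapes("column_underscore_and_page_addressing")
untouched = expand_escapes("column_underscore_and_page_addressing", escapes=())
```

With the escapes removed (the empty tuple here), the argument name passes
through unchanged.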
Tested & working:
* Send standard packets
* Receive standard packets (1 FIFO, no filter)
Interoperation between SAM E54 Xplained running this tree and
MicroPython running on STM32F405 Feather with an external
transceiver was also tested.
Many other aspects of a full implementation are not yet present,
such as error detection and recovery.
Discord user Folknology encountered a problem building with Python 3.6.9,
`TypeError: ord() expected a character, but string of length 0 found`.
I was able to reproduce the problem using Python3.5*, and discovered that
the meaning of the regular expression `"|."` had changed in 3.7. Before,
```
>>> [m.group(0) for m in re.finditer("|.", "hello")]
['', '', '', '', '', '']
```
After:
```
>>> [m.group(0) for m in re.finditer("|.", "hello")]
['', 'h', '', 'e', '', 'l', '', 'l', '', 'o', '']
```
Check if `words` is empty and if so use `"."` as the regular expression
instead. This gives the same result on both versions:
```
['h', 'e', 'l', 'l', 'o']
```
and fixes the generation of the huffman dictionary.
Folknology verified that this fix worked for them.
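A sketch of the guard described above. Only the name `words` comes from the
script; split_pattern() and tokens() are hypothetical helpers written for
this example.

```python
import re


def split_pattern(words):
    """Build the alternation, falling back to "." when there are no words."""
    if not words:
        return "."  # avoids the empty alternative in "|."
    return "|".join(re.escape(w) for w in words) + "|."


def tokens(text, words):
    return [m.group(0) for m in re.finditer(split_pattern(words), text)]


no_words = tokens("hello", [])
with_words = tokens("hello", ["ll"])
```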
* I could easily install 3.5 but not 3.6. 3.5 reproduced the same problem
This construct (which I added without sufficient testing,
apparently) is only supported in Python 3.7 and newer. Make it
optional so that this script works on other Python versions. This
means that if you have a system with non-UTF-8 encoding you will
need to use Python 3.7.
In particular, this affects a problem building circuitpython in
github's ubuntu-18.04 virtual environment when Python 3.7 is not
explicitly installed. cookie-cuttered libraries call for Python
3.6:
```
- name: Set up Python 3.6
uses: actions/setup-python@v1
with:
python-version: 3.6
```
Since CircuitPython's own build calls for 3.8, this problem was not
detected.
This problem was also encountered by discord user mdroberts1243.
The failure I encountered was here:
https://github.com/jepler/Jepler_CircuitPython_udecimal/runs/1138045020?check_suite_focus=true
.. while my step of "clone and build circuitpython unix port" is
unusual, I think the same problem would have affected "build assets"
if that step had been reached.
For time-based functions that work with absolute time there is the need for
an Epoch, to set the zero-point at which the absolute time starts counting.
Such functions include time.time() and filesystem stat return values. And
different ports may use a different Epoch.
To make it clearer what functions use the Epoch (whatever it may be), and
make the ports more consistent with their use of the Epoch, this commit
renames all Epoch related functions to include the word "epoch" in their
name (and remove references to "2000").
Along with this rename, the following things have changed:
- mp_hal_time_ns() is now specified to return the number of nanoseconds
since the Epoch, rather than since 1970 (but since this is an internal
function it doesn't change anything for the user).
- littlefs timestamps on the esp8266 have been fixed (they were previously
off by 30 years in nanoseconds).
Otherwise, there is no functional change made by this commit.
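For reference, the distance between a 1970-based and a 2000-based Epoch can
be computed on the host. The constant and function names below are
hypothetical; the sketch only shows the size of the offset involved.

```python
import calendar

# Seconds from 1970-01-01 00:00:00 UTC to 2000-01-01 00:00:00 UTC.
EPOCH_2000_OFFSET_S = calendar.timegm((2000, 1, 1, 0, 0, 0, 0, 0, 0))


def ns_2000_to_ns_1970(ns_since_2000):
    """Convert nanoseconds since 2000-01-01 to nanoseconds since 1970-01-01."""
    return ns_since_2000 + EPOCH_2000_OFFSET_S * 10 ** 9


one_second_after_2000 = ns_2000_to_ns_1970(10 ** 9)
```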
Signed-off-by: Damien George <damien@micropython.org>
Most users and the CI system are running in configurations where Python
configures stdout and stderr in UTF-8 mode. However, Windows is different,
setting values like CP1252. This led to a build failure on Windows, because
makeqstrdata printed Unicode strings to its stdout, expecting them to be
encoded as UTF-8.
This script is writing (stdout) to a compiler input file and potentially
printing messages (stderr) to a log or console. Explicitly configure stdout to
use utf-8 to get consistent behavior on all platforms, and configure stderr so
that if any log/diagnostic messages are printed that cannot be displayed
correctly, they are still displayed instead of creating an error while trying
to print the diagnostic information.
I considered setting the encodings both to ascii, but this would just be
occasionally inconvenient to developers like me who want to show diagnostic
info on stderr and in comments while working with the compression code.
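A minimal sketch of this configuration, assuming the streams are text
wrappers that may or may not provide reconfigure() (it exists on Python 3.7
and newer). configure_streams() is a hypothetical helper, and in-memory
streams stand in for a console using CP1252.

```python
import io


def configure_streams(out, err):
    """Force UTF-8 on out; make err escape what it cannot encode."""
    # reconfigure() is only available on Python 3.7+, so treat it as optional.
    if hasattr(out, "reconfigure"):
        out.reconfigure(encoding="utf-8")
    if hasattr(err, "reconfigure"):
        err.reconfigure(errors="backslashreplace")


# In-memory stand-ins for a non-UTF-8 stdout and stderr.
out = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
err = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
configure_streams(out, err)

out.write("\u30a1")  # would raise UnicodeEncodeError under cp1252
err.write("\u30a1")  # escaped instead of raising
out.flush()
err.flush()

out_bytes = out.buffer.getvalue()
err_bytes = err.buffer.getvalue()
```

In a real script the call would be configure_streams(sys.stdout, sys.stderr).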
Closes: #3408
While checking whether we can enable -Wimplicit-fallthrough, I encountered
a diagnostic in mp_binary_set_val_array_from_int which led to discovering
the following bug:
```
>>> struct.pack("xb", 3)
b'\x03\x03'
```
That is, the next value (3) was used as the value of a padding byte, while
standard Python always fills "x" bytes with zeros. I initially thought
this had to do with the unintentional fallthrough, but it doesn't.
Instead, this code would relate to an array.array with a typecode of
padding ('x'), which is ALSO not desktop Python compliant:
```
>>> array.array('x', (1, 2, 3))
array('x', [1, 0, 0])
```
Possibly this is dead code that used to be shared between struct-setting
and array-setting, but it no longer is.
I also discovered that the argument list length for struct.pack
and struct.pack_into were not checked, and that the length of binary data
passed to array.array was not checked to be a multiple of the element
size.
I have corrected all of these to conform more closely to standard Python
and revised some tests where necessary. Some tests for MicroPython-specific
behavior that does not conform to standard Python and is not present
in CircuitPython were deleted outright.
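The standard Python behaviour that these changes aim for can be checked on
the host. This is a sketch; raises() is a hypothetical helper used only to
record whether an exception occurred.

```python
import array
import struct


def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


# "x" is a pad byte: it is zero-filled and consumes no argument.
padded = struct.pack("xb", 3)

# The argument count must match the format.
too_many_args = raises(struct.error, struct.pack, "b", 1, 2)

# Binary initialisers must be a whole number of elements.
bad_length = raises(ValueError, array.array, "h", b"\x00\x01\x02")

# 'x' is not a valid array typecode.
bad_typecode = raises(ValueError, array.array, "x", (1, 2, 3))
```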
Massive savings. Thanks so much @ciscorn for providing the initial
code for choosing the dictionary.
This adds a bit of time to the build, both to find the dictionary
but also because (for reasons I don't fully understand), the binary
search in the compress() function no longer worked and had to be
replaced with a linear search.
I think this is because the intended invariant is that for codebook
entries that encode to the same number of bits, the entries are ordered
in ascending value. However, I mis-placed the transition from "words"
to "byte/char values" so the codebook entries for words are in word-order
rather than their code order.
Because this price is only paid at build time, I didn't care to determine
exactly where the correct fix was.
I also commented out a line to produce the "estimated total memory size"
-- at least on the unix build with TRANSLATION=ja, this led to a build
time KeyError trying to compute the codebook size for all the strings.
I think this occurs because some single unicode code point ('ァ') is
no longer present as itself in the compressed strings, due to always
being replaced by a word.
As promised, this seems to save hundreds of bytes in the German translation
on the trinket m0.
Testing performed:
- built trinket_m0 in several languages
- built and ran unix port in several languages (en, de_DE, ja) and ran
simple error-producing codes like ./micropython -c '1/0'
Prior to this commit, pow(-2, float('nan')) would return (nan+nanj), or
raise an exception on targets that don't support complex numbers. This is
fixed to return simply nan, as CPython does.
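The CPython result referred to above can be confirmed with a short check
(variable names are arbitrary):

```python
import math

result = pow(-2, float("nan"))

# A plain float NaN, not a complex number and not an exception.
is_plain_nan = isinstance(result, float) and math.isnan(result)
```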
Signed-off-by: Damien George <damien@micropython.org>
This is consistent with the other 'micro' modules and allows implementing
additional features in Python via e.g. micropython-lib's sys.
Note this is a breaking change (not backwards compatible) for ports which
do not enable weak links, as "import sys" must now be replaced with
"import usys".
Compress common unicode bigrams by making code points in the range
0x80 - 0xbf (inclusive) represent them. Then, they can be greedily
encoded and the substituted code points handled by the existing Huffman
compression. Normally code points in the range 0x80-0xbf are not used
in Unicode, so we stake our own claim. Using the more arguably correct
"Private Use Area" (PUA) would mean that for scripts that only use
code points under 256 we would use more memory for the "values" table.
bigram means "two letters", and is also sometimes called a "digram".
It's nothing to do with "big RAM". For our purposes, a bigram represents
two successive unicode code points, so for instance in our build on
trinket m0 for english the most frequent are:
['t ', 'e ', 'in', 'd ', ...].
The bigrams are selected based on frequency in the corpus, but the
selection is not necessarily optimal, for these reasons I can think of:
* Suppose the corpus was just "tea" repeated 100 times. The
top bigrams would be "te" and "ea". However, because they
overlap, "te" could never be used. Thus, some bigrams might actually
waste space.
* I _assume_ this has to be why e.g., bigram 0x86 "s " is more
frequent than bigram 0x85 " a" in English for Trinket M0, because
sequences like "can't add" would get the "t " digram and then
be unable to use the " a" digram.
* And generally, if a bigram is frequent then so are its constituents.
Say that "i" and "n" both encode to just 5 or 6 bits, then the huffman
code for "in" had better compress to 10 or fewer bits or it's a net
loss!
* I checked though! "i" is 5 bits, "n" is 6 bits (lucky guess)
but the bigram 0x83 is also just 6 bits, so this one is a win of
5 bits for every "it" minus overhead. Yay, this round goes to team
compression.
* On the other hand, the least frequent bigram 0x9d " n" is 10 bits
long and its constituent code points are 4+6 bits so there's no
savings, but there is the cost of the table entry.
* and somehow 0x9f 'an' is never used at all!
With or without accounting for overlaps, there is some optimum number
of bigrams. Adding one more bigram uses at least 2 bytes (for the
entry in the bigram table; 4 bytes if code points >255 are in the
source text) and also needs a slot in the Huffman dictionary, so
adding bigrams beyond the optimum number makes compression worse again.
If it's an improvement, the fact that it's not guaranteed optimal
doesn't seem to matter too much. It just leaves a little more fruit
for the next sweep to pick up. Perhaps try adding the most frequent
bigram not yet present, until it doesn't improve compression overall.
Right now, de_DE is again the "fullest" build on trinket_m0. (It's
reclaimed that spot from the ja translation somehow) This change saves
104 bytes there, increasing free space about 6.8%. In the larger
(but not critically full) pyportal build it saves 324 bytes.
The specific number of bigrams used (32) was chosen as it is the max
number that fits within the 0x80..0xbf range. Larger tables would
require the use of 16 bit code points in the de_DE build, losing savings
overall.
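The selection and greedy substitution can be sketched in a few lines. This
is a simplified illustration, not the build script: the helper names are
hypothetical, the corpus is treated as one string, and the input is assumed
to contain no code points in the substituted range. It uses the "tea"
corpus from the list above to show the overlap effect.

```python
from collections import Counter

BIGRAM_BASE = 0x80  # first substituted code point, as described above


def pick_bigrams(corpus, count):
    """Return the `count` most frequent pairs of adjacent code points."""
    pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    return [pair for pair, _ in pairs.most_common(count)]


def encode(text, bigrams):
    """Greedily replace table bigrams, scanning left to right."""
    codes = {pair: chr(BIGRAM_BASE + i) for i, pair in enumerate(bigrams)}
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in codes:
            out.append(codes[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def decode(encoded, bigrams):
    """Expand substituted code points back into their bigrams."""
    limit = BIGRAM_BASE + len(bigrams)
    return "".join(
        bigrams[ord(c) - BIGRAM_BASE] if BIGRAM_BASE <= ord(c) < limit else c
        for c in encoded
    )


corpus = "tea" * 100
table = pick_bigrams(corpus, 2)
packed = encode(corpus, table)
used = {decode(c, table) for c in packed if ord(c) >= BIGRAM_BASE}
```

In this sketch both overlapping bigrams make it into the table, but the
left-to-right greedy pass only ever emits one of them, so the other table
entry is pure overhead.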
(Side note: The most frequent letters in English have been said
to be: ETA OIN SHRDLU; but we have UAC EIL MOPRST in our corpus)