This commit removes all parts of code associated with the existing
MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE optimisation option, including the
-mcache-lookup-bc option to mpy-cross.
This feature originally provided a significant performance boost for Unix,
but wasn't able to be enabled for MCU targets (due to frozen bytecode), and
added significant extra complexity to generating and distributing .mpy
files.
The equivalent performance gain is now provided by the combination of
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE (which has
been enabled on the unix port in the previous commit).
It's hard to provide precise performance numbers, but tests have been run
on a wide variety of architectures (x86-64, ARM Cortex, Aarch64, RISC-V,
xtensa) and they all generally agree on the qualitative improvements seen
by the combination of MICROPY_OPT_LOAD_ATTR_FAST_PATH and
MICROPY_OPT_MAP_LOOKUP_CACHE.
For example, on a "quiet" Linux x64 environment (i3-5010U @ 2.10GHz) the
change from CACHE_MAP_LOOKUP_IN_BYTECODE, to LOAD_ATTR_FAST_PATH combined
with MAP_LOOKUP_CACHE is:
diff of scores (higher is better)
N=2000 M=2000 bccache -> attrmapcache diff diff% (error%)
bm_chaos.py 13742.56 -> 13905.67 : +163.11 = +1.187% (+/-3.75%)
bm_fannkuch.py 60.13 -> 61.34 : +1.21 = +2.012% (+/-2.11%)
bm_fft.py 113083.20 -> 114793.68 : +1710.48 = +1.513% (+/-1.57%)
bm_float.py 256552.80 -> 243908.29 : -12644.51 = -4.929% (+/-1.90%)
bm_hexiom.py 521.93 -> 625.41 : +103.48 = +19.826% (+/-0.40%)
bm_nqueens.py 197544.25 -> 217713.12 : +20168.87 = +10.210% (+/-3.01%)
bm_pidigits.py 8072.98 -> 8198.75 : +125.77 = +1.558% (+/-3.22%)
misc_aes.py 17283.45 -> 16480.52 : -802.93 = -4.646% (+/-0.82%)
misc_mandel.py 99083.99 -> 128939.84 : +29855.85 = +30.132% (+/-5.88%)
misc_pystone.py 83860.10 -> 82592.56 : -1267.54 = -1.511% (+/-2.27%)
misc_raytrace.py 21490.40 -> 22227.23 : +736.83 = +3.429% (+/-1.88%)
This shows that the new optimisations are at least as good as the existing
inline-bytecode-caching, and are sometimes much better (because the new
ones apply caching to a wider variety of map lookups).
The new optimisations can also benefit code generated by the native
emitter, because they apply to the runtime rather than the generated code.
The improvement for the native emitter when LOAD_ATTR_FAST_PATH and
MAP_LOOKUP_CACHE are enabled is (same Linux environment as above):
diff of scores (higher is better)
N=2000 M=2000 native -> nat-attrmapcache diff diff% (error%)
bm_chaos.py 14130.62 -> 15464.68 : +1334.06 = +9.441% (+/-7.11%)
bm_fannkuch.py 74.96 -> 76.16 : +1.20 = +1.601% (+/-1.80%)
bm_fft.py 166682.99 -> 168221.86 : +1538.87 = +0.923% (+/-4.20%)
bm_float.py 233415.23 -> 265524.90 : +32109.67 = +13.756% (+/-2.57%)
bm_hexiom.py 628.59 -> 734.17 : +105.58 = +16.796% (+/-1.39%)
bm_nqueens.py 225418.44 -> 232926.45 : +7508.01 = +3.331% (+/-3.10%)
bm_pidigits.py 6322.00 -> 6379.52 : +57.52 = +0.910% (+/-5.62%)
misc_aes.py 20670.10 -> 27223.18 : +6553.08 = +31.703% (+/-1.56%)
misc_mandel.py 138221.11 -> 152014.01 : +13792.90 = +9.979% (+/-2.46%)
misc_pystone.py 85032.14 -> 105681.44 : +20649.30 = +24.284% (+/-2.25%)
misc_raytrace.py 19800.01 -> 23350.73 : +3550.72 = +17.933% (+/-2.79%)
In summary, compared to MICROPY_OPT_CACHE_MAP_LOOKUP_IN_BYTECODE, the new
MICROPY_OPT_LOAD_ATTR_FAST_PATH and MICROPY_OPT_MAP_LOOKUP_CACHE options:
- are simpler;
- take less code size;
- are faster (generally);
- work with code generated by the native emitter;
- can be used on embedded targets with a small and constant RAM overhead;
- allow the same .mpy bytecode to run on all targets.
See #7680 for further discussion. And see also #7653 for a discussion
about simplifying mpy-cross options.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
This adds the Python files in the tests/ directory to be formatted with
./tools/codeformat.py. The basics/ subdirectory is excluded for now so we
aren't changing too much at once.
In a few places `# fmt: off`/`# fmt: on` was used where the code had
special formatting for readability or where the test was actually testing
the specific formatting.
This check follows CPython's behaviour, because 'import *' always populates
the globals with the imported names, not locals.
Since it's safe to do this (doesn't lead to a crash or undefined behaviour)
the check is only enabled for MICROPY_CPYTHON_COMPAT.
Fixes issue #5121.
This patch compresses the second part of the bytecode prelude which
contains the source file name, function name, source-line-number mapping
and cell closure information. This part of the prelude now begins with a
single varible length unsigned integer which encodes 2 numbers, being the
byte-size of the following 2 sections in the header: the "source info
section" and the "closure section". After decoding this variable unsigned
integer it's possible to skip over one or both of these sections very
easily.
This scheme saves about 2 bytes for most functions compared to the original
format: one in the case that there are no closure cells, and one because
padding was eliminated.
From the beginning of this project the RAISE_VARARGS opcode was named and
implemented following CPython, where it has an argument (to the opcode)
counting how many args the raise takes:
raise # 0 args (re-raise previous exception)
raise exc # 1 arg
raise exc from exc2 # 2 args (chained raise)
In the bytecode this operation therefore takes 2 bytes, one for
RAISE_VARARGS and one for the number of args.
This patch splits this opcode into 3, where each is now a single byte.
This reduces bytecode size by 1 byte for each use of raise. Every byte
counts! It also has the benefit of reducing code size (on all ports except
nanbox).
To make progress towards MicroPython supporting Python 3.5, adding the
matmul operator is important because it's a really "low level" part of the
language, being a new token and modifications to the grammar.
It doesn't make sense to make it configurable because 1) it would make the
grammar and lexer complicated/messy; 2) no other operators are
configurable; 3) it's not a feature that can be "dynamically plugged in"
via an import.
And matmul can be useful as a general purpose user-defined operator, it
doesn't have to be just for numpy use.
Based on work done by Jim Mussared.
POP_BLOCK and POP_EXCEPT are now the same, and are always followed by a
JUMP. So this optimisation reduces code size, and RAM usage of bytecode by
two bytes for each try-except handler.
This patch fixes a bug in the VM when breaking within a try-finally. The
bug has to do with executing a break within the finally block of a
try-finally statement. For example:
def f():
for x in (1,):
print('a', x)
try:
raise Exception
finally:
print(1)
break
print('b', x)
f()
Currently in uPy the above code will print:
a 1
1
1
segmentation fault (core dumped) micropython
Not only is there a seg fault, but the "1" in the finally block is printed
twice. This is because when the VM executes a finally block it doesn't
really know if that block was executed due to a fall-through of the try (no
exception raised), or because an exception is active. In particular, for
nested finallys the VM has no idea which of the nested ones have active
exceptions and which are just fall-throughs. So when a break (or continue)
is executed it tries to unwind all of the finallys, when in fact only some
may be active.
It's questionable whether break (or return or continue) should be allowed
within a finally block, because they implicitly swallow any active
exception, but nevertheless it's allowed by CPython (although almost never
used in the standard library). And uPy should at least not crash in such a
case.
The solution here relies on the fact that exception and finally handlers
always appear in the bytecode after the try body.
Note: there was a similar bug with a return in a finally block, but that
was previously fixed in b735208403
This is to allow to place reverse ops immediately after normal ops, so
they can be tested as one range (which is optimization for reverse ops
introduction in the next patch).
It starts a dichotomy of mp_binary_op_t values which can't appear in the
bytecode. Another reason to move it is to VALUES of OP_* and OP_INPLACE_*
nicely adjacent. This also will be needed for OP_REVERSE_*, to be soon
introduced.
The output might contain more than one line ending in 5b so properly skip
everything until the next known point.
This fixes test failures in appveyor debug builds.
Previous to this patch each time a bytes object was referenced a new
instance (with the same data) was created. With this patch a single
bytes object is created in the compiler and is loaded directly at execute
time as a true constant (similar to loading bignum and float objects).
This saves on allocating RAM and means that bytes objects can now be
used when the memory manager is locked (eg in interrupts).
The MP_BC_LOAD_CONST_BYTES bytecode was removed as part of this.
Generated bytecode is slightly larger due to storing a pointer to the
bytes object instead of the qstr identifier.
Code size is reduced by about 60 bytes on Thumb2 architectures.
Hashing is now done using mp_unary_op function with MP_UNARY_OP_HASH as
the operator argument. Hashing for int, str and bytes still go via
fast-path in mp_unary_op since they are the most common objects which
need to be hashed.
This lead to quite a bit of code cleanup, and should be more efficient
if anything. It saves 176 bytes code space on Thumb2, and 360 bytes on
x86.
The only loss is that the error message "unhashable type" is now the
more generic "unsupported type for __hash__".
Before this patch a "with" block needed to create a bound method object
on the heap for the __exit__ call. Now it doesn't because we use
load_method instead of load_attr, and save the method+self on the stack.