I.e. in this mode, C stack will never be used to call a Python function,
but if there's no free heap for a call, it will be reported as
RuntimeError (as expected), not MemoryError.
Previous to this patch, a big-int, float or imag constant was interned
(made into a qstr) and then parsed at runtime to create an object each
time it was needed. This is wasteful in RAM and not efficient. Now,
these constants are parsed straight away in the parser and turned into
objects. This allows constants with large numbers of digits (so
addresses issue #1103) and takes us a step closer to #722.
This is a simple optimisation inspired by JITing technology: we cache in
the bytecode (using 1 byte) the offset of the last successful lookup in
a map. This allows us next time round to check in that location in the
hash table (mp_map_t) for the desired entry, and if it's there use that
entry straight away. Otherwise fallback to a normal map lookup.
Works for LOAD_NAME, LOAD_GLOBAL, LOAD_ATTR and STORE_ATTR opcodes.
On a few tests it gives >90% cache hit and greatly improves speed of
code.
Disabled by default. Enabled for unix and stmhal ports.
This patch consolidates all global variables in py/ core into one place,
in a global structure. Root pointers are all located together to make
GC tracing easier and more efficient.
This is for efficiency, so we don't need to subtract 1 from the ip
before storing it to code_state->ip. It saves a lot of ROM bytes on
unix and stmhal.
Mirroring ip to a volatile memory variable for each opcode is an expensive
operation. For quite a lot of often executed opcodes like stack manipulation
or jumps, exceptions cannot actually happen. So, record ip only for opcode
where that's possible.
This optimisation reduces the VM exception stack element (mp_exc_stack_t)
by 1 word, by using bit 1 of a pointer to store whether the opcode was a
FINALLY or WITH opcode. This optimisation was pending, waiting for
maturity of the exception handling code, which has now proven itself.
Saves 1 machine word RAM for each exception (4->3 words per exception).
Increases stmhal code by 4 bytes, and decreases unix x64 code by 32
bytes.
This allows to implement KeyboardInterrupt on unix, and a much safer
ctrl-C in stmhal port. First ctrl-C is a soft one, with hope that VM
will notice it; second ctrl-C is a hard one that kills anything (for
both unix and stmhal).
One needs to check for a pending exception in the VM only for jump
opcodes. Others can't produce an infinite loop (infinite recursion is
caught by stack check).
There is a lot potential in compress bytecodes and make more use of the
coding space. This patch introduces "multi" bytecodes which have their
argument included in the bytecode (by addition).
UNARY_OP and BINARY_OP now no longer take a 1 byte argument for the
opcode. Rather, the opcode is included in the first byte itself.
LOAD_FAST_[0,1,2] and STORE_FAST_[0,1,2] are removed in favour of their
multi versions, which can take an argument between 0 and 15 inclusive.
The majority of LOAD_FAST/STORE_FAST codes fit in this range and so this
saves a byte for each of these.
LOAD_CONST_SMALL_INT_MULTI is used to load small ints between -16 and 47
inclusive. Such ints are quite common and now only need 1 byte to
store, and now have much faster decoding.
In all this patch saves about 2% RAM for typically bytecode (1.8% on
64-bit test, 2.5% on pyboard test). It also reduces the binary size
(because bytecodes are simplified) and doesn't harm performance.
Code-info size, block name, source name, n_state and n_exc_stack now use
variable length encoded uints. This saves 7-9 bytes per bytecode
function for most functions.
With a file with 1 line (and an error on that line), used to show the
line as number 0. Now shows it correctly as line number 1.
But, when line numbers are disabled, it now prints line number 1 for any
line that has an error (instead of 0 as previously). This might end up
being confusing, but requires extra RAM and/or hack logic to make it
print something special in the case of no line numbers.
Reduces by about a factor of 10 on average the amount of RAM needed to
store the line-number to bytecode map in the bytecode prelude.
Using CPython3.4's stdlib for statistics: previously, an average of
13 bytes were used per (bytecode offset, line-number offset) pair, and
now with this improvement, that's down to 1.3 bytes on average.
Large RAM usage before was due to some very large steps in line numbers,
both from the start of the first line in a function way down in the
file, and also functions that have big comments and/or big strings in
them (both cases were significant).
Although the savings are large on average for the CPython stdlib, it
won't have such a big effect for small scripts used in embedded
programming.
Addresses issue #648.
This may seem a bit of a risky change, in that it may introduce crazy
bugs with respect to volatile variables in the VM loop. But, I think it
should be fine: code_state points to some external memory, so the
compiler should always read/write to that memory when accessing the
ip/sp variables (ie not put them in registers).
Anyway, it passes all tests and improves on all efficiency fronts: about
2-4% faster (64-bit unix), 16 bytes less stack space per call (64-bit
unix) and slightly less executable size (unix and stmhal).
The reason it's more efficient is save_ip and save_sp were volatile
variables, so were anyway stored on the stack (in memory, not regs).
Thus converting them to code_state->{ip, sp} doesn't cost an extra
memory dereference (except maybe to get code_state, but that can be put
in a register and then made more efficient for other uses of it).
Conflicts:
py/vm.c
Fixed stack underflow check. Use UINT_FMT/INT_FMT where necessary.
Specify maximum VM-stack byte size by multiple of machine word size, so
that on 64 bit machines it has same functionality as 32 bit.
This improves stack usage in callers to mp_execute_bytecode2, and is step
forward towards unifying execution interface for function and generators
(which is important because generators don't even support full forms
of arguments passing (keywords, etc.)).
Needed to pop the iterator object when breaking out of a for loop. Need
also to be careful to unwind exception handler before popping iterator.
Addresses issue #635.
This helps the compiler do its optimisation, makes it clear which
variables are local per opcode and which global, and makes it consistent
when extra variables are needed in an opcode (in addition to old obj1,
obj2 pair, for example).
Could also make unum local, but that's for another time.