2016-03-11 10:50:46 +00:00
|
|
|
Maximising Python Speed
|
|
|
|
=======================
|
|
|
|
|
|
|
|
This tutorial describes ways of improving the performance of MicroPython code.
|
|
|
|
Optimisations involving other languages are covered elsewhere, namely the use
|
|
|
|
of modules written in C and the MicroPython inline ARM Thumb-2 assembler.
|
|
|
|
|
|
|
|
The process of developing high performance code comprises the following stages
|
|
|
|
which should be performed in the order listed.
|
|
|
|
|
|
|
|
* Design for speed.
|
|
|
|
* Code and debug.
|
|
|
|
|
|
|
|
Optimisation steps:
|
|
|
|
|
|
|
|
* Identify the slowest section of code.
|
|
|
|
* Improve the efficiency of the Python code.
|
|
|
|
* Use the native code emitter.
|
|
|
|
* Use the viper code emitter.
|
|
|
|
|
|
|
|
Designing for speed
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
Performance issues should be considered at the outset. This involves taking a view
|
|
|
|
on the sections of code which are most performance critical and devoting particular
|
|
|
|
attention to their design. The process of optimisation begins when the code has
|
|
|
|
been tested: if the design is correct at the outset optimisation will be
|
|
|
|
straightforward and may actually be unnecessary.
|
|
|
|
|
|
|
|
Algorithms
|
|
|
|
~~~~~~~~~~
|
|
|
|
|
|
|
|
The most important aspect of designing any routine for performance is ensuring that
|
|
|
|
the best algorithm is employed. This is a topic for textbooks rather than for a
|
|
|
|
MicroPython guide but spectacular performance gains can sometimes be achieved
|
|
|
|
by adopting algorithms known for their efficiency.
|
|
|
|
|
|
|
|
RAM Allocation
|
|
|
|
~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
To design efficient MicroPython code it is necessary to have an understanding of the
|
|
|
|
way the interpreter allocates RAM. When an object is created or grows in size
|
|
|
|
(for example where an item is appended to a list) the necessary RAM is allocated
|
|
|
|
from a block known as the heap. This takes a significant amount of time;
|
|
|
|
further it will on occasion trigger a process known as garbage collection which
|
|
|
|
can take several milliseconds.
|
|
|
|
|
|
|
|
Consequently the performance of a function or method can be improved if an object is created
|
|
|
|
once only and not permitted to grow in size. This implies that the object persists
|
|
|
|
for the duration of its use: typically it will be instantiated in a class constructor
|
|
|
|
and used in various methods.
|
|
|
|
|
|
|
|
This is covered in further detail :ref:`Controlling garbage collection <gc>` below.
|
|
|
|
|
|
|
|
Buffers
|
|
|
|
~~~~~~~
|
|
|
|
|
|
|
|
An example of the above is the common case where a buffer is required, such as one
|
|
|
|
used for communication with a device. A typical driver will create the buffer in the
|
|
|
|
constructor and use it in its I/O methods which will be called repeatedly.
|
|
|
|
|
2016-04-15 17:24:56 +03:00
|
|
|
The MicroPython libraries typically provide support for pre-allocated buffers. For
|
|
|
|
example, objects which support stream interface (e.g., file or UART) provide ``read()``
|
|
|
|
method which allocate new buffer for read data, but also a ``readinto()`` method
|
|
|
|
to read data into an existing buffer.
|
2016-03-11 10:50:46 +00:00
|
|
|
|
|
|
|
Floating Point
|
|
|
|
~~~~~~~~~~~~~~
|
|
|
|
|
2016-04-15 17:43:03 +03:00
|
|
|
Some MicroPython ports allocate floating point numbers on heap. Some other ports
|
|
|
|
may lack dedicated floating-point coprocessor, and perform arithmetic operations
|
|
|
|
on them in "software" at considerably lower speed than on integers. Where
|
|
|
|
performance is important, use integer operations and restrict the use of floating
|
|
|
|
point to sections of the code where performance is not paramount. For example,
|
|
|
|
capture ADC readings as integers values to an array in one quick go, and only then
|
|
|
|
convert them to floating-point numbers for signal processing.
|
2016-03-11 10:50:46 +00:00
|
|
|
|
|
|
|
Arrays
|
|
|
|
~~~~~~
|
|
|
|
|
|
|
|
Consider the use of the various types of array classes as an alternative to lists.
|
|
|
|
The ``array`` module supports various element types with 8-bit elements supported
|
|
|
|
by Python's built in ``bytes`` and ``bytearray`` classes. These data structures all store
|
|
|
|
elements in contiguous memory locations. Once again to avoid memory allocation in critical
|
|
|
|
code these should be pre-allocated and passed as arguments or as bound objects.
|
|
|
|
|
|
|
|
When passing slices of objects such as ``bytearray`` instances, Python creates
|
2016-04-15 18:18:18 +03:00
|
|
|
a copy which involves allocation of the size proportional to the size of slice.
|
|
|
|
This can be alleviated using a ``memoryview`` object. ``memoryview`` itself
|
|
|
|
is allocated on heap, but is a small, fixed-size object, regardless of the size
|
|
|
|
of slice it points too.
|
2016-03-11 10:50:46 +00:00
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
2016-04-15 18:18:18 +03:00
|
|
|
ba = bytearray(10000) # big array
|
|
|
|
func(ba[30:2000]) # a copy is passed, ~2K new allocation
|
|
|
|
mv = memoryview(ba) # small object is allocated
|
|
|
|
func(mv[30:2000]) # a pointer to memory is passed
|
2016-03-11 10:50:46 +00:00
|
|
|
|
|
|
|
A ``memoryview`` can only be applied to objects supporting the buffer protocol - this
|
2016-04-15 18:18:18 +03:00
|
|
|
includes arrays but not lists. Small caveat is that while memoryview object is live,
|
2016-04-15 20:08:51 +03:00
|
|
|
it also keeps alive the original buffer object. So, a memoryview isn't a universal
|
2016-04-15 18:18:18 +03:00
|
|
|
panacea. For instance, in the example above, if you are done with 10K buffer and
|
|
|
|
just need those bytes 30:2000 from it, it may be better to make a slice, and let
|
|
|
|
the 10K buffer go (be ready for garbage collection), instead of making a
|
|
|
|
long-living memoryview and keeping 10K blocked for GC.
|
|
|
|
|
|
|
|
Nonetheless, ``memoryview`` is indispensable for advanced preallocated buffer
|
|
|
|
management. ``.readinto()`` method discussed above puts data at the beginning
|
|
|
|
of buffer and fills in entire buffer. What if you need to put data in the
|
|
|
|
middle of existing buffer? Just create a memoryview into the needed section
|
|
|
|
of buffer and pass it to ``.readinto()``.
|
2016-03-11 10:50:46 +00:00
|
|
|
|
|
|
|
Identifying the slowest section of code
|
|
|
|
---------------------------------------
|
|
|
|
|
|
|
|
This is a process known as profiling and is covered in textbooks and
|
|
|
|
(for standard Python) supported by various software tools. For the type of
|
|
|
|
smaller embedded application likely to be running on MicroPython platforms
|
|
|
|
the slowest function or method can usually be established by judicious use
|
|
|
|
of the timing ``ticks`` group of functions documented
|
|
|
|
`here <http://docs.micropython.org/en/latest/pyboard/library/time.html>`_.
|
|
|
|
Code execution time can be measured in ms, us, or CPU cycles.
|
|
|
|
|
|
|
|
The following enables any function or method to be timed by adding an
|
|
|
|
``@timed_function`` decorator:
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
def timed_function(f, *args, **kwargs):
|
|
|
|
myname = str(f).split(' ')[1]
|
|
|
|
def new_func(*args, **kwargs):
|
|
|
|
t = time.ticks_us()
|
|
|
|
result = f(*args, **kwargs)
|
2016-11-07 14:47:16 +09:00
|
|
|
delta = time.ticks_diff(time.ticks_us(), t)
|
2016-03-11 10:50:46 +00:00
|
|
|
print('Function {} Time = {:6.3f}ms'.format(myname, delta/1000))
|
|
|
|
return result
|
|
|
|
return new_func
|
|
|
|
|
|
|
|
MicroPython code improvements
|
|
|
|
-----------------------------
|
|
|
|
|
|
|
|
The const() declaration
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
MicroPython provides a ``const()`` declaration. This works in a similar way
|
|
|
|
to ``#define`` in C in that when the code is compiled to bytecode the compiler
|
|
|
|
substitutes the numeric value for the identifier. This avoids a dictionary
|
|
|
|
lookup at runtime. The argument to ``const()`` may be anything which, at
|
|
|
|
compile time, evaluates to an integer e.g. ``0x100`` or ``1 << 8``.
|
|
|
|
|
|
|
|
.. _Caching:
|
|
|
|
|
|
|
|
Caching object references
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
Where a function or method repeatedly accesses objects performance is improved
|
|
|
|
by caching the object in a local variable:
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
class foo(object):
|
|
|
|
def __init__(self):
|
|
|
|
ba = bytearray(100)
|
|
|
|
def bar(self, obj_display):
|
|
|
|
ba_ref = self.ba
|
|
|
|
fb = obj_display.framebuffer
|
|
|
|
# iterative code using these two objects
|
|
|
|
|
|
|
|
This avoids the need repeatedly to look up ``self.ba`` and ``obj_display.framebuffer``
|
|
|
|
in the body of the method ``bar()``.
|
|
|
|
|
|
|
|
.. _gc:
|
|
|
|
|
|
|
|
Controlling garbage collection
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
When memory allocation is required, MicroPython attempts to locate an adequately
|
|
|
|
sized block on the heap. This may fail, usually because the heap is cluttered
|
|
|
|
with objects which are no longer referenced by code. If a failure occurs, the
|
|
|
|
process known as garbage collection reclaims the memory used by these redundant
|
|
|
|
objects and the allocation is then tried again - a process which can take several
|
|
|
|
milliseconds.
|
|
|
|
|
|
|
|
There are benefits in pre-empting this by periodically issuing ``gc.collect()``.
|
|
|
|
Firstly doing a collection before it is actually required is quicker - typically on the
|
|
|
|
order of 1ms if done frequently. Secondly you can determine the point in code
|
|
|
|
where this time is used rather than have a longer delay occur at random points,
|
|
|
|
possibly in a speed critical section. Finally performing collections regularly
|
|
|
|
can reduce fragmentation in the heap. Severe fragmentation can lead to
|
|
|
|
non-recoverable allocation failures.
|
|
|
|
|
|
|
|
Accessing hardware directly
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
|
|
|
|
This comes into the category of more advanced programming and involves some knowledge
|
|
|
|
of the target MCU. Consider the example of toggling an output pin on the Pyboard. The
|
|
|
|
standard approach would be to write
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
mypin.value(mypin.value() ^ 1) # mypin was instantiated as an output pin
|
|
|
|
|
|
|
|
This involves the overhead of two calls to the ``Pin`` instance's ``value()``
|
|
|
|
method. This overhead can be eliminated by performing a read/write to the relevant bit
|
|
|
|
of the chip's GPIO port output data register (odr). To facilitate this the ``stm``
|
|
|
|
module provides a set of constants providing the addresses of the relevant registers.
|
|
|
|
A fast toggle of pin ``P4`` (CPU pin ``A14``) - corresponding to the green LED -
|
|
|
|
can be performed as follows:
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
BIT14 = const(1 << 14)
|
|
|
|
stm.mem16[stm.GPIOA + stm.GPIO_ODR] ^= BIT14
|
|
|
|
|
|
|
|
The Native code emitter
|
|
|
|
-----------------------
|
|
|
|
|
|
|
|
This causes the MicroPython compiler to emit ARM native opcodes rather than
|
|
|
|
bytecode. It covers the bulk of the Python language so most functions will require
|
|
|
|
no adaptation (but see below). It is invoked by means of a function decorator:
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
@micropython.native
|
|
|
|
def foo(self, arg):
|
|
|
|
buf = self.linebuf # Cached object
|
|
|
|
# code
|
|
|
|
|
|
|
|
There are certain limitations in the current implementation of the native code emitter.
|
|
|
|
|
|
|
|
* Context managers are not supported (the ``with`` statement).
|
|
|
|
* Generators are not supported.
|
|
|
|
* If ``raise`` is used an argument must be supplied.
|
|
|
|
|
|
|
|
The trade-off for the improved performance (roughly twices as fast as bytecode) is an
|
|
|
|
increase in compiled code size.
|
|
|
|
|
|
|
|
The Viper code emitter
|
|
|
|
----------------------
|
|
|
|
|
|
|
|
The optimisations discussed above involve standards-compliant Python code. The
|
|
|
|
Viper code emitter is not fully compliant. It supports special Viper native data types
|
|
|
|
in pursuit of performance. Integer processing is non-compliant because it uses machine
|
|
|
|
words: arithmetic on 32 bit hardware is performed modulo 2**32.
|
|
|
|
|
|
|
|
Like the Native emitter Viper produces machine instructions but further optimisations
|
|
|
|
are performed, substantially increasing performance especially for integer arithmetic and
|
|
|
|
bit manipulations. It is invoked using a decorator:
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
@micropython.viper
|
|
|
|
def foo(self, arg: int) -> int:
|
|
|
|
# code
|
|
|
|
|
|
|
|
As the above fragment illustrates it is beneficial to use Python type hints to assist the Viper optimiser.
|
|
|
|
Type hints provide information on the data types of arguments and of the return value; these
|
|
|
|
are a standard Python language feature formally defined here `PEP0484 <https://www.python.org/dev/peps/pep-0484/>`_.
|
|
|
|
Viper supports its own set of types namely ``int``, ``uint`` (unsigned integer), ``ptr``, ``ptr8``,
|
|
|
|
``ptr16`` and ``ptr32``. The ``ptrX`` types are discussed below. Currently the ``uint`` type serves
|
|
|
|
a single purpose: as a type hint for a function return value. If such a function returns ``0xffffffff``
|
|
|
|
Python will interpret the result as 2**32 -1 rather than as -1.
|
|
|
|
|
|
|
|
In addition to the restrictions imposed by the native emitter the following constraints apply:
|
|
|
|
|
|
|
|
* Functions may have up to four arguments.
|
|
|
|
* Default argument values are not permitted.
|
|
|
|
* Floating point may be used but is not optimised.
|
|
|
|
|
|
|
|
Viper provides pointer types to assist the optimiser. These comprise
|
|
|
|
|
|
|
|
* ``ptr`` Pointer to an object.
|
|
|
|
* ``ptr8`` Points to a byte.
|
|
|
|
* ``ptr16`` Points to a 16 bit half-word.
|
|
|
|
* ``ptr32`` Points to a 32 bit machine word.
|
|
|
|
|
|
|
|
The concept of a pointer may be unfamiliar to Python programmers. It has similarities
|
|
|
|
to a Python ``memoryview`` object in that it provides direct access to data stored in memory.
|
|
|
|
Items are accessed using subscript notation, but slices are not supported: a pointer can return
|
|
|
|
a single item only. Its purpose is to provide fast random access to data stored in contiguous
|
|
|
|
memory locations - such as data stored in objects which support the buffer protocol, and
|
|
|
|
memory-mapped peripheral registers in a microcontroller. It should be noted that programming
|
|
|
|
using pointers is hazardous: bounds checking is not performed and the compiler does nothing to
|
|
|
|
prevent buffer overrun errors.
|
|
|
|
|
|
|
|
Typical usage is to cache variables:
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
@micropython.viper
|
|
|
|
def foo(self, arg: int) -> int:
|
|
|
|
buf = ptr8(self.linebuf) # self.linebuf is a bytearray or bytes object
|
|
|
|
for x in range(20, 30):
|
|
|
|
bar = buf[x] # Access a data item through the pointer
|
|
|
|
# code omitted
|
|
|
|
|
|
|
|
In this instance the compiler "knows" that ``buf`` is the address of an array of bytes;
|
|
|
|
it can emit code to rapidly compute the address of ``buf[x]`` at runtime. Where casts are
|
|
|
|
used to convert objects to Viper native types these should be performed at the start of
|
|
|
|
the function rather than in critical timing loops as the cast operation can take several
|
|
|
|
microseconds. The rules for casting are as follows:
|
|
|
|
|
|
|
|
* Casting operators are currently: ``int``, ``bool``, ``uint``, ``ptr``, ``ptr8``, ``ptr16`` and ``ptr32``.
|
|
|
|
* The result of a cast will be a native Viper variable.
|
|
|
|
* Arguments to a cast can be a Python object or a native Viper variable.
|
|
|
|
* If argument is a native Viper variable, then cast is a no-op (i.e. costs nothing at runtime)
|
|
|
|
that just changes the type (e.g. from ``uint`` to ``ptr8``) so that you can then store/load
|
|
|
|
using this pointer.
|
|
|
|
* If the argument is a Python object and the cast is ``int`` or ``uint``, then the Python object
|
|
|
|
must be of integral type and the value of that integral object is returned.
|
|
|
|
* The argument to a bool cast must be integral type (boolean or integer); when used as a return
|
|
|
|
type the viper function will return True or False objects.
|
|
|
|
* If the argument is a Python object and the cast is ``ptr``, ``ptr``, ``ptr16`` or ``ptr32``,
|
|
|
|
then the Python object must either have the buffer protocol with read-write capabilities
|
|
|
|
(in which case a pointer to the start of the buffer is returned) or it must be of integral
|
|
|
|
type (in which case the value of that integral object is returned).
|
|
|
|
|
|
|
|
The following example illustrates the use of a ``ptr16`` cast to toggle pin X1 ``n`` times:
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
BIT0 = const(1)
|
|
|
|
@micropython.viper
|
|
|
|
def toggle_n(n: int):
|
|
|
|
odr = ptr16(stm.GPIOA + stm.GPIO_ODR)
|
|
|
|
for _ in range(n):
|
|
|
|
odr[0] ^= BIT0
|
|
|
|
|
|
|
|
A detailed technical description of the three code emitters may be found
|
|
|
|
on Kickstarter here `Note 1 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/664832>`_
|
|
|
|
and here `Note 2 <https://www.kickstarter.com/projects/214379695/micro-python-python-for-microcontrollers/posts/665145>`_
|