MP_REGISTER_MODULE would use identifiers like
"MODULE_DEF_MP_QSTR___FUTURE__" which would in turn cause
a QSTR to be generated for it. This wasn't desirable, because the
qstr would never be used.
This clears out quite a bit of flash storage on the proxlight trinkey.
Due to a number of problems, an error calling the preprocessor wasn't
making the whole genlast process fail.
Now, after an execption during preprocess is printed, it is re-raised.
Then, by actually collating the results of executor.map, the exception
will be raised in the main thread context. Any CalledProcessError is simply
converted to a nonzero exit status.
This gets a further speedup of about 2s (12s -> 9.5s elapsed build time)
for stm32f405_feather
For what are probably historical reasons, the qstr process involves
preprocessing a large number of source files into a single "qstr.i.last"
file, then reading this and splitting it into one "qstr" file for each
original source ("*.c") file.
By eliminating the step of writing qstr.i.last as well as making the
regular-expression-matching part be parallelized, build speed is further
improved.
Because the step to build QSTR_DEFS_COLLECTED does not access
qstr.i.last, the path is replaced with "-" in the Makefile.
Rather than simply invoking gcc in preprocessor mode with a list of files, use
a Python script with the (python3) ThreadPoolExecutor to invoke the
preprocessor in parallel.
The amount of concurrency is the number of system CPUs, not the makefile "-j"
parallelism setting, because there is no simple and correct way for a Python
program to correctly work together with make's idea of parallelism.
This reduces the build time of stm32f405 feather (a non-LTO build) from 16s to
12s on my 16-thread Ryzen machine.