"b" on Thumb might not be long enough for the jump to nlr_push_tail so it must be done indirectly.
LTO can't "see" inside naked functions, so it does not notice that the assembler refers to `nlr_push_tail`; marking `nlr_push_tail` as used keeps it from being discarded.
Only the parts that really need to be written in assembler are now written in it; everything else is in C. As a result, the assembler code no longer needs to know about the global state structure, which makes it much easier to maintain.
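
A minimal, portable sketch of the split described here, not the actual port code. The struct layouts, the `state` variable and the `nlr_push_sketch` and `nlr_top_sketch` helpers are hypothetical stand-ins. In a real port the entry point would be a naked assembler function that saves registers and then jumps to `nlr_push_tail`; here a plain C call stands in for that jump, so the example builds on any GCC or Clang target.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down buffer: the real one also holds saved registers. */
typedef struct nlr_buf {
    struct nlr_buf *prev;   /* previously pushed buffer, or NULL */
    void *ret_val;          /* value handed over on a non-local return */
} nlr_buf_t;

/* Hypothetical stand-in for the global state structure. */
typedef struct {
    nlr_buf_t *nlr_top;     /* innermost active buffer */
} state_t;

static state_t state;

/* C half: the only code that touches the global state.
 * In a real port its sole caller is assembler, which LTO cannot see,
 * so the 'used' attribute stops the function from being discarded. */
__attribute__((used)) unsigned int nlr_push_tail(nlr_buf_t *nlr) {
    nlr->prev = state.nlr_top;  /* link the new buffer in front of the old top */
    state.nlr_top = nlr;
    return 0;                   /* 0 signals a normal return from the push */
}

/* Portable stand-in for the assembler half. A real port would save the
 * callee-saved registers into *nlr here and then branch to nlr_push_tail. */
unsigned int nlr_push_sketch(nlr_buf_t *nlr) {
    return nlr_push_tail(nlr);
}

/* Hypothetical accessor so callers can inspect the top of the chain. */
nlr_buf_t *nlr_top_sketch(void) {
    return state.nlr_top;
}
```

Because the linking into the global state happens entirely in `nlr_push_tail`, the assembler half only needs to know the layout of the buffer it fills in, not the layout of the state structure.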