circuitpython

Commit Graph

Author	SHA1	Message	Date
Artyom Skrobov	c3e40d50ab	[qstr] Separate hash and len from string data This allows the compiler to merge strings: e.g. "update", "difference_update" and "symmetric_difference_update" will all point to the same memory. Shaves ~1KB off the image size, and potentially allows bigger savings if qstr attrs are initialized in qstr_init(), and not stored in the image.	2021-04-06 12:58:42 -04:00
Jeff Epler	234fa2a226	decompress: Fix decompression when length takes 7 bits This manifested as incorrect error messages from mpy-cross, like ``` $ mpy-cross doesnotexist.py OSError: [Errno 2] cno such file/director ``` The remaining bits in `b` must be shifted to the correct position before entering the loop. For most (all?) actual builds, compress_max_length_bits was 8 and the problem went unnoticed.	2021-04-04 11:15:33 -05:00
microDev	a52eb88031	run code formatting script	2021-03-15 19:27:36 +05:30
Dan Halbert	255ffa979c	avoid inline compile errors	2021-01-08 23:07:21 -05:00
Taku Fukada	d18d79ac47	Small improvements to the dictionary compression	2020-09-14 01:50:01 +09:00
Jeff Epler	40ab5c6b21	compression: Implement ciscorn's dictionary approach Massive savings. Thanks so much @ciscorn for providing the initial code for choosing the dictionary. This adds a bit of time to the build, both to find the dictionary but also because (for reasons I don't fully understand), the binary search in the compress() function no longer worked and had to be replaced with a linear search. I think this is because the intended invariant is that for codebook entries that encode to the same number of bits, the entries are ordered in ascending value. However, I mis-placed the transition from "words" to "byte/char values" so the codebook entries for words are in word-order rather than their code order. Because this price is only paid at build time, I didn't care to determine exactly where the correct fix was. I also commented out a line to produce the "estimated total memory size" -- at least on the unix build with TRANSLATION=ja, this led to a build time KeyError trying to compute the codebook size for all the strings. I think this occurs because some single unicode code point ('ァ') is no longer present as itself in the compressed strings, due to always being replaced by a word. As promised, this seems to save hundreds of bytes in the German translation on the trinket m0. Testing performed: - built trinket_m0 in several languages - built and ran unix port in several languages (en, de_DE, ja) and ran simple error-producing codes like ./micropython -c '1/0'	2020-09-12 10:10:45 -05:00
Jeff Epler	0eee93729a	Fix decompression of unicode values above 2047 Two problems: The lead byte for 3-byte sequences was wrong, and one mid-byte was not even filled in due to a missing "++"! Apparently this was broken ever since the first "Compress as unicode, not bytes" commit, but I believed I'd "tested" it by running on the Pinyin translation. This rendered at least the Korean and Japanese translations completely illegible, affecting 5.0 and all later releases.	2020-09-08 20:54:47 -05:00
Jeff Epler	bdb07adfcc	translations: Make decompression clearer Now this gets filled in with values e.g., 128 (0x80) and 159 (0x9f).	2020-09-08 19:07:53 -05:00
Jeff Epler	07740d19f3	add bigram compression to makeqstrdata Compress common unicode bigrams by making code points in the range 0x80 - 0xbf (inclusive) represent them. Then, they can be greedily encoded and the substituted code points handled by the existing Huffman compression. Normally code points in the range 0x80-0xbf are not used in Unicode, so we stake our own claim. Using the more arguably correct "Private Use Area" (PUA) would mean that for scripts that only use code points under 256 we would use more memory for the "values" table. bigram means "two letters", and is also sometimes called a "digram". It's nothing to do with "big RAM". For our purposes, a bigram represents two successive unicode code points, so for instance in our build on trinket m0 for english the most frequent are: ['t ', 'e ', 'in', 'd ', ...]. The bigrams are selected based on frequency in the corpus, but the selection is not necessarily optimal, for these reasons I can think of: * Suppose the corpus was just "tea" repeated 100 times. The top bigrams would be "te", and "ea". However, due to overlap, "te" could never be used. Thus, some bigrams might actually waste space * I _assume_ this has to be why e.g., bigram 0x86 "s " is more frequent than bigram 0x85 " a" in English for Trinket M0, because sequences like "can't add" would get the "t " digram and then be unable to use the " a" digram. * And generally, if a bigram is frequent then so are its constituents. Say that "i" and "n" both encode to just 5 or 6 bits, then the huffman code for "in" had better compress to 10 or fewer bits or it's a net loss! * I checked though! "i" is 5 bits, "n" is 6 bits (lucky guess) but the bigram 0x83 also just 6 bits, so this one is a win of 5 bits for every "it" minus overhead. Yay, this round goes to team compression. * On the other hand, the least frequent bigram 0x9d " n" is 10 bits long and its constituent code points are 4+6 bits so there's no savings, but there is the cost of the table entry. * and somehow 0x9f 'an' is never used at all! With or without accounting for overlaps, there is some optimum number of bigrams. Adding one more bigram uses at least 2 bytes (for the entry in the bigram table; 4 bytes if code points >255 are in the source text) and also needs a slot in the Huffman dictionary, so adding bigrams beyond the optimum number makes compression worse again. If it's an improvement, the fact that it's not guaranteed optimal doesn't seem to matter too much. It just leaves a little more fruit for the next sweep to pick up. Perhaps try adding the most frequent bigram not yet present, until it doesn't improve compression overall. Right now, de_DE is again the "fullest" build on trinket_m0. (It's reclaimed that spot from the ja translation somehow) This change saves 104 bytes there, increasing free space about 6.8%. In the larger (but not critically full) pyportal build it saves 324 bytes. The specific number of bigrams used (32) was chosen as it is the max number that fit within the 0x80..0xbf range. Larger tables would require the use of 16 bit code points in the de_DE build, losing savings overall. (Side note: The most frequent letters in English have been said to be: ETA OIN SHRDLU; but we have UAC EIL MOPRST in our corpus)	2020-09-01 17:12:22 -05:00
Jeff Epler	fe3e8d1589	string compression: save a few bits per string Length was stored as a 16-bit number always. Most translations have a max length far less. For example, US English translation lengths always fit in just 8 bits. Probably all languages fit in 9 bits. This also has the side effect of reducing the alignment of compressed_string_t from 2 bytes to 1. Testing performed: ran in German and English on pyruler, printed messages looked right. Firmware size, en_US Before: 3044 bytes free in flash After: 3408 bytes free in flash Firmware size, de_DE (with #2967 merged to restore translations) Before: 1236 bytes free in flash After: 1600 bytes free in flash	2020-05-28 08:36:08 -05:00
Jeff Epler	e06a3bbceb	translation: Compress as unicode, not bytes By treating each unicode code-point as a single entity for huffman compression, the overall compression rate can be somewhat improved without changing the algorithm. On the decompression side, when compressed values above 127 are encountered, they need to be converted from a 16-bit Unicode code point into a UTF-8 byte sequence. Doing this returns approximately 1.5kB of flash storage with the zh_Latn_pinyin translation. (292 -> 1768 bytes remaining in my build of trinket_m0) Other "more ASCII" translations benefit less, and in fact zh_Latn_pinyin is no longer the most constrained translation! (de_DE 1156 -> 1384 bytes free in flash, I didn't check others before pushing for CI) English is slightly pessimized, 2840 -> 2788 bytes, probably mostly because the "values" array was changed from uint8_t to uint16_t, which is strictly not required for an all-ASCII translation. This could probably be avoided in this case, but as English is not the most constrained translation it doesn't really matter. Testing performed: built for feather nRF52840 express and trinket m0 in English and zh_Latn_pinyin; ran and verified the localized messages such as Àn xià rènhé jiàn jìnrù REPL. Shǐyòng CTRL-D chóngxīn jiāzài. and Press any key to enter the REPL. Use CTRL-D to reload. were properly displayed.	2019-12-02 09:46:46 -06:00
Scott Shawcroft	6ef8639971	Rework safe mode and have heap overwrite trigger it. This creates a common safe mode mechanic that ports can share. As a result, the nRF52 now has safe mode support as well. The common safe mode adds a 700ms delay at startup where a reset during that window will cause a reset into safe mode. This window is designated by a yellow status pixel and flashing the single led three times. A couple NeoPixel fixes are included for the nRF52 as well. Fixes #1034. Fixes #990. Fixes #615.	2018-12-06 14:24:20 -08:00
Scott Shawcroft	2cd166b573	Fix esp and samd	2018-08-16 17:41:35 -07:00
Scott Shawcroft	137a30ad75	fix mpy-cross	2018-08-16 17:40:57 -07:00
Scott Shawcroft	de5a9d72dc	Compress all translated strings with Huffman coding. This saves code space in builds which use link-time optimization. The optimization drops the untranslated strings and replaces them with a compressed_string_t struct. It can then be decompressed to a C string. Builds without LTO work as well but include both untranslated strings and compressed strings. This work could be expanded to include QSTRs and loaded strings if a compress method is added to C. It's tracked in #531.	2018-08-16 17:40:57 -07:00
Scott Shawcroft	933add6cd8	Support internationalisation.	2018-08-07 14:58:57 -07:00
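
Commits e06a3bbceb and 0eee93729a describe converting a 16-bit Unicode code point back into a UTF-8 byte sequence during decompression, and a bug in the 3-byte case (wrong lead byte, one continuation byte never written). The following is a minimal Python sketch of that conversion, not the project's C code; the function name `encode_utf8_16bit` is hypothetical and chosen only for illustration.

```python
def encode_utf8_16bit(cp: int) -> bytes:
    """Encode one code point in the range 0..0xFFFF as UTF-8.

    Hypothetical helper for illustration; it mirrors the 1-, 2- and
    3-byte cases that a decoder for 16-bit values has to handle.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only 16-bit code points are handled here")
    if cp < 0x80:
        # 1 byte: plain ASCII.
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: lead byte 110xxxxx, then one continuation byte.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 3 bytes (values above 2047): lead byte 1110xxxx, then TWO
    # continuation bytes. Both must be written.
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Comparing the result with Python's built-in encoder for a character above 2047, such as 'ァ' (U+30A1), is one way to check the 3-byte branch.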

16 Commits
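
The bigram scheme from commit 07740d19f3 (pick the most frequent pairs of code points, replace them greedily with substitute code points starting at 0x80, then hand the result to the existing Huffman stage) can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the code in makeqstrdata: the names `pick_bigrams`, `encode_bigrams`, `decode_bigrams` and `BIGRAM_BASE` are hypothetical, the Huffman stage is omitted, and the input text is assumed not to contain the substitute code points itself.

```python
from collections import Counter

# First substitute code point, following the commit message (assumption:
# the input text never uses code points in the substitute range).
BIGRAM_BASE = 0x80


def pick_bigrams(corpus, n=32):
    """Return the n most frequent pairs of successive code points."""
    counts = Counter()
    for s in corpus:
        for a, b in zip(s, s[1:]):
            counts[a + b] += 1
    return [bigram for bigram, _ in counts.most_common(n)]


def encode_bigrams(text, bigrams):
    """Greedy left-to-right substitution of known bigrams."""
    index = {bigram: BIGRAM_BASE + i for i, bigram in enumerate(bigrams)}
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in index:
            out.append(chr(index[pair]))
            i += 2  # both code points are consumed by the substitute
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def decode_bigrams(encoded, bigrams):
    """Expand substitute code points back into their two-character pairs."""
    out = []
    for ch in encoded:
        k = ord(ch) - BIGRAM_BASE
        out.append(bigrams[k] if 0 <= k < len(bigrams) else ch)
    return "".join(out)
```

Running `pick_bigrams(["tea" * 100], 2)` selects "te" and "ea", yet encoding "teatea" with this greedy pass only ever emits one of the two substitutes, because the pairs overlap. That is the wasted-table-entry effect the commit message describes.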