circuitpython

Author	SHA1	Message	Date
Jeff Epler	8836198ff1	TextSplitter: don't mutate 'words' I was puzzled by why the dictionary words were sorted by length. It was because TextSplitter sorted its parameter, instead of a copy. This doesn't affect encoding size, but does affect the encoding NUMBER of the found words. We'll deliberately restore sorting by length next, for other reasons, but not by spooky action.	2021-07-09 14:02:31 -05:00
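The "spooky action" in this commit is the difference between sorting a list in place and sorting a copy. A minimal sketch, assuming longest-first ordering; the two helper names are hypothetical and are not taken from TextSplitter:

```python
def longest_first_mutating(words):
    # list.sort() reorders the caller's list in place, so the caller's
    # word numbering changes as a side effect.
    words.sort(key=len, reverse=True)
    return words


def longest_first_copy(words):
    # sorted() builds a new list and leaves the caller's list untouched.
    return sorted(words, key=len, reverse=True)


original = ["ab", "abcd", "abc"]
copy_result = longest_first_copy(original)
order_after_copy = list(original)       # caller's order is preserved
longest_first_mutating(original)
order_after_mutation = list(original)   # caller's order has changed
```

Because the position of a word in the list determines its encoding number, the in-place variant changes the numbers without changing the encoded size.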
Jeff Epler	99abd03b7a	makeqstrdata: use an extremely accurate dictionary heuristic Try to accurately measure the costs of including a word in the dictionary vs the gains from using it in messages. This saves about 160 bytes on trinket_m0 ja, the fullest translation for that board. Other translations on the same board all have savings, ranging from 24 to 228 bytes.
```
Translation     Before  After  Savings
ja                1164   1324      160
de_DE             1260   1396      136
fr                1424   1652      228
zh_Latn_pinyin    1448   1520       72
pt_BR             1584   1736      152
pl                1592   1640       48
es                1724   1816       92
ko                1724   1816       92
fil               1764   1800       36
it_IT             1896   2040      144
nl                1956   2136      180
ID                2072   2180      108
cs                2124   2148       24
sv                2340   2448      108
en_x_pirate       2644   2740       96
en_GB             2652   2752      100
el                2656   2768      112
en_US             2656   2768      112
hi                2656   2768      112
```
	2021-07-09 12:45:49 -05:00
Jeff Epler	45dc0953a5	makeqstrdata.py: Remove a problematic print .. it contained non-ASCII characters, even when building the standard English translation. This may help resolve the build problems reported at #4750.	2021-05-11 21:48:21 -05:00
Scott Shawcroft	b35fa44c8a	Merge MicroPython 1.12 into CircuitPython	2021-05-03 14:01:18 -07:00
Jeff Epler	dfa7c3d32d	codeformat: Fix handling of `**` After discussing with danh, I noticed that `a/**/b` would not match `a/b`. After correcting this and re-running "pre-commit run --all", additional files were reindented, including the codeformat script itself.	2021-04-30 15:30:13 -05:00
Scott Shawcroft	76033d5115	Merge MicroPython v1.11 into CircuitPython	2021-04-26 15:47:41 -07:00
Scott Shawcroft	e54e5e3575	Merge pull request #4564 from tyomitch/patch-1 [build] simplify makeqstrdata heuristic	2021-04-19 14:50:42 -07:00
Artyom Skrobov	dcee89ade7	build: simplify compute_huffman_coding() No functional change.	2021-04-09 08:36:26 -04:00
Artyom Skrobov	68920682b6	[build] simplify makeqstrdata heuristic The simpler one saves, on average, 51 more bytes per translation; the biggest translation per board is reduced, on average, by 85 bytes.	2021-04-09 07:18:40 -04:00
Artyom Skrobov	c3e40d50ab	[qstr] Separate hash and len from string data This allows the compiler to merge strings: e.g. "update", "difference_update" and "symmetric_difference_update" will all point to the same memory. Shaves ~1KB off the image size, and potentially allows bigger savings if qstr attrs are initialized in qstr_init(), and not stored in the image.	2021-04-06 12:58:42 -04:00
microDev	a52eb88031	run code formatting script	2021-03-15 19:27:36 +05:30
Jeff Epler	0318eb359f	makeqstrdata: Work around python3.6 compatibility problem Discord user Folknology encountered a problem building with Python 3.6.9, `TypeError: ord() expected a character, but string of length 0 found`. I was able to reproduce the problem using Python 3.5, and discovered that the meaning of the regular expression `"|."` had changed in 3.7. Before:
```
>>> [m.group(0) for m in re.finditer("|.", "hello")]
['', '', '', '', '', '']
```
After:
```
>>> [m.group(0) for m in re.finditer("|.", "hello")]
['', 'h', '', 'e', '', 'l', '', 'l', '', 'o', '']
```
Check if `words` is empty and if so use `"."` as the regular expression instead. This gives the same result on both versions:
```
['h', 'e', 'l', 'l', 'o']
```
and fixes the generation of the huffman dictionary. Folknology verified that this fix worked for them. I could easily install 3.5 but not 3.6. 3.5 reproduced the same problem.	2020-09-21 10:03:07 -05:00
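The workaround described in this commit can be sketched as follows. This is an illustration only: `make_splitter_pattern` is a hypothetical helper, and building the pattern by joining escaped words with `|` is an assumption about how the real script constructs it.

```python
import re


def make_splitter_pattern(words):
    # Hypothetical helper: match any dictionary word, else any single character.
    if words:
        alternatives = "|".join(re.escape(w) for w in words)
        return re.compile(alternatives + "|.", flags=re.DOTALL)
    # With no words the pattern would degenerate to "|.", whose empty first
    # alternative matches at every position; use "." on its own instead.
    return re.compile(".", flags=re.DOTALL)


no_words = [m.group(0) for m in make_splitter_pattern([]).finditer("hello")]
with_words = [m.group(0) for m in make_splitter_pattern(["ll"]).finditer("hello")]
```

With an empty word list the text is split into single characters on every Python version, and with `["ll"]` the word is matched as one unit.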
Jeff Epler	bfbbbd6c5c	makeqstrdata: Work with older Python This construct (which I added without sufficient testing, apparently) is only supported in Python 3.7 and newer. Make it optional so that this script works on other Python versions. This means that if you have a system with non-UTF-8 encoding you will need to use Python 3.7. In particular, this affects a problem building circuitpython in github's ubuntu-18.04 virtual environment when Python 3.7 is not explicitly installed. cookie-cuttered libraries call for Python 3.6:
```
- name: Set up Python 3.6
  uses: actions/setup-python@v1
  with:
    python-version: 3.6
```
Since CircuitPython's own build calls for 3.8, this problem was not detected. This problem was also encountered by discord user mdroberts1243. The failure I encountered was here: https://github.com/jepler/Jepler_CircuitPython_udecimal/runs/1138045020?check_suite_focus=true .. while my step of "clone and build circuitpython unix port" is unusual, I think the same problem would have affected "build assets" if that step had been reached.	2020-09-19 10:16:13 -05:00
Jeff Epler	a8e98cda83	makeqstrdata: comment my understanding of @ciscorn's code	2020-09-16 08:28:15 -05:00
Taku Fukada	d18d79ac47	Small improvements to the dictionary compression	2020-09-14 01:50:01 +09:00
Jeff Epler	15964a4750	makeqstrdata: Avoid encoding problems Most users and the CI system are running in configurations where Python configures stdout and stderr in UTF-8 mode. However, Windows is different, setting values like CP1252. This led to a build failure on Windows, because makeqstrdata printed Unicode strings to its stdout, expecting them to be encoded as UTF-8. This script is writing (stdout) to a compiler input file and potentially printing messages (stderr) to a log or console. Explicitly configure stdout to use utf-8 to get consistent behavior on all platforms, and configure stderr so that if any log/diagnostic messages are printed that cannot be displayed correctly, they are still displayed instead of creating an error while trying to print the diagnostic information. I considered setting the encodings both to ascii, but this would just be occasionally inconvenient to developers like me who want to show diagnostic info on stderr and in comments while working with the compression code. Closes: #3408	2020-09-12 19:43:08 -05:00
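A minimal sketch of this kind of stream configuration, assuming the standard `io.TextIOWrapper.reconfigure` method (available from Python 3.7). The function name and the choice of `backslashreplace` for stderr are illustrative assumptions, not necessarily what the script uses:

```python
import io


def configure_streams(out, err):
    # reconfigure() only exists on Python 3.7+ text streams, so look it up
    # defensively and skip streams that do not provide it.
    for stream, kwargs in (
        (out, {"encoding": "utf-8"}),           # compiler input: always UTF-8
        (err, {"errors": "backslashreplace"}),  # diagnostics: never raise
    ):
        reconfigure = getattr(stream, "reconfigure", None)
        if reconfigure is not None:
            reconfigure(**kwargs)


# Stand-ins for a Windows-style stdout and a restrictive stderr.
demo_out = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
demo_err = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
configure_streams(demo_out, demo_err)
demo_err.write("ァ")
demo_err.flush()
```

In a real script the call would be `configure_streams(sys.stdout, sys.stderr)`; the in-memory streams here only make the effect observable.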
Jeff Epler	40ab5c6b21	compression: Implement ciscorn's dictionary approach Massive savings. Thanks so much @ciscorn for providing the initial code for choosing the dictionary. This adds a bit of time to the build, both to find the dictionary but also because (for reasons I don't fully understand), the binary search in the compress() function no longer worked and had to be replaced with a linear search. I think this is because the intended invariant is that for codebook entries that encode to the same number of bits, the entries are ordered in ascending value. However, I mis-placed the transition from "words" to "byte/char values" so the codebook entries for words are in word-order rather than their code order. Because this price is only paid at build time, I didn't care to determine exactly where the correct fix was. I also commented out a line to produce the "estimated total memory size" -- at least on the unix build with TRANSLATION=ja, this led to a build time KeyError trying to compute the codebook size for all the strings. I think this occurs because some single unicode code point ('ァ') is no longer present as itself in the compressed strings, due to always being replaced by a word. As promised, this seems to save hundreds of bytes in the German translation on the trinket m0. Testing performed: - built trinket_m0 in several languages - built and ran unix port in several languages (en, de_DE, ja) and ran simple error-producing codes like ./micropython -c '1/0'	2020-09-12 10:10:45 -05:00
Jeff Epler	bdb07adfcc	translations: Make decompression clearer Now this gets filled in with values e.g., 128 (0x80) and 159 (0x9f).	2020-09-08 19:07:53 -05:00
Jeff Epler	cbfd38d1ce	Rename functions to encode_ngrams / decode_ngrams	2020-09-02 19:09:23 -05:00
Jeff Epler	c34cb82ecb	makeqstrdata: correct range of low code points to 0x80..0x9f inclusive The previous range was unintentionally big and overlaps some characters we'd like to use (and also 0xa0, which we don't intentionally use)	2020-09-02 15:52:02 -05:00
Jeff Epler	07740d19f3	add bigram compression to makeqstrdata Compress common unicode bigrams by making code points in the range 0x80 - 0xbf (inclusive) represent them. Then, they can be greedily encoded and the substituted code points handled by the existing Huffman compression. Normally code points in the range 0x80-0xbf are not used in Unicode, so we stake our own claim. Using the more arguably correct "Private Use Area" (PUA) would mean that for scripts that only use code points under 256 we would use more memory for the "values" table. bigram means "two letters", and is also sometimes called a "digram". It's nothing to do with "big RAM". For our purposes, a bigram represents two successive unicode code points, so for instance in our build on trinket m0 for english the most frequent are: ['t ', 'e ', 'in', 'd ', ...]. The bigrams are selected based on frequency in the corpus, but the selection is not necessarily optimal, for these reasons I can think of:
* Suppose the corpus was just "tea" repeated 100 times. The top bigrams would be "te", and "ea". However, due to overlap, "te" could never be used. Thus, some bigrams might actually waste space
* I _assume_ this has to be why e.g., bigram 0x86 "s " is more frequent than bigram 0x85 " a" in English for Trinket M0, because sequences like "can't add" would get the "t " digram and then be unable to use the " a" digram.
* And generally, if a bigram is frequent then so are its constituents. Say that "i" and "n" both encode to just 5 or 6 bits, then the huffman code for "in" had better compress to 10 or fewer bits or it's a net loss!
* I checked though! "i" is 5 bits, "n" is 6 bits (lucky guess) but the bigram 0x83 is also just 6 bits, so this one is a win of 5 bits for every "in" minus overhead. Yay, this round goes to team compression.
* On the other hand, the least frequent bigram 0x9d " n" is 10 bits long and its constituent code points are 4+6 bits so there's no savings, but there is the cost of the table entry.
* and somehow 0x9f 'an' is never used at all!
With or without accounting for overlaps, there is some optimum number of bigrams. Adding one more bigram uses at least 2 bytes (for the entry in the bigram table; 4 bytes if code points >255 are in the source text) and also needs a slot in the Huffman dictionary, so adding bigrams beyond the optimum number makes compression worse again. If it's an improvement, the fact that it's not guaranteed optimal doesn't seem to matter too much. It just leaves a little more fruit for the next sweep to pick up. Perhaps try adding the most frequent bigram not yet present, until it doesn't improve compression overall. Right now, de_DE is again the "fullest" build on trinket_m0. (It's reclaimed that spot from the ja translation somehow) This change saves 104 bytes there, increasing free space about 6.8%. In the larger (but not critically full) pyportal build it saves 324 bytes. The specific number of bigrams used (32) was chosen as it is the max number that fit within the 0x80..0xbf range. Larger tables would require the use of 16 bit code points in the de_DE build, losing savings overall. (Side note: The most frequent letters in English have been said to be: ETA OIN SHRDLU; but we have UAC EIL MOPRST in our corpus)	2020-09-01 17:12:22 -05:00
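The bigram scheme and its overlap problem can be illustrated with a short sketch. The function names are hypothetical, and the left-to-right greedy scan is an assumption that may differ from the replacement order used by the actual script:

```python
from collections import Counter


def most_common_bigrams(texts, count):
    # Count every pair of adjacent code points in the corpus.
    pairs = Counter()
    for text in texts:
        pairs.update(text[i:i + 2] for i in range(len(text) - 1))
    return [pair for pair, _ in pairs.most_common(count)]


def encode_bigrams(text, bigrams, first_code=0x80):
    # Greedy left-to-right substitution of bigrams by reserved code points.
    code_for = {pair: chr(first_code + i) for i, pair in enumerate(bigrams)}
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in code_for:
            out.append(code_for[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def decode_bigrams(text, bigrams, first_code=0x80):
    # Expand reserved code points back into their two-character bigrams.
    out = []
    for ch in text:
        index = ord(ch) - first_code
        out.append(bigrams[index] if 0 <= index < len(bigrams) else ch)
    return "".join(out)


corpus = ["tea"] * 100
bigrams = most_common_bigrams(corpus, 2)
encoded = encode_bigrams("tea", bigrams)
decoded = decode_bigrams(encoded, bigrams)
```

For the "tea" corpus both "te" and "ea" are selected, but the two overlap, so only one of them can ever be emitted. With this scan order it is "ea" that goes unused, and its table entry is pure overhead.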
Taku Fukada	79a3796b1c	Calculate the Huffman codebook without MP_QSTRs	2020-08-18 23:21:14 +09:00
Jeff Epler	08ed09acc6	makeqstrdata: don't print "compression increased length" messages This check as implemented is misleading, because it compares the compressed size in bytes (including the length indication) with the source string length in Unicode code points. For English this is approximately fair, but for Japanese this is quite unfair and produces an excess of "increased length" messages. This message might have existed for one of two reasons: * to alert to an improperly functioning huffman compression * to call attention to a need for a "string is stored uncompressed" case We know by now that the huffman compression is functioning as designed and effective in general. Just to be on the safe side, I did some back-of-the-envelope estimates. I considered these three replacements for "the true source string size, in bytes": + decompressed_len_utf8 = len(decompressed.encode('utf-8')) + decompressed_len_utf16 = len(decompressed.encode('utf-16be')) + decompressed_len_bitsize = ((1+len(decompressed)) * math.ceil(math.log(1+len(values), 2)) + 7) // 8 The third counts how many bits each character requires (fewer than 128 characters in the source character set = 7, fewer than 256 = 8, fewer than 512 = 9, etc, adding a string-terminating value) and is in some way representative of the best way we would be able to store "uncompressed strings". The Japanese translation (largest as of writing) has just a few strings which increase by this metric. However, the amount of loss due to expansion in those cases is outweighed by the cost of adding 1 bit per string to indicate whether it's compressed or not. For instance, in the BOARD=trinket_m0 TRANSLATION=ja build the loss is 47 bytes over 300 strings. Adding 1 bit to each of 300 strings will cost about 37 bytes, leaving just 5 Thumb instructions to implement the code to check and decode "uncompressed" strings in order to break even.	2020-08-16 20:50:48 -05:00
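The three size estimates quoted in this commit can be collected into one function for experimentation. The formulas are copied from the message; the wrapper function, its name, and the sample inputs are assumptions made for illustration:

```python
import math


def uncompressed_size_estimates(decompressed, values):
    # Bits needed per symbol, leaving room for one string-terminating value.
    bits_per_symbol = math.ceil(math.log(1 + len(values), 2))
    return {
        "utf8": len(decompressed.encode("utf-8")),
        "utf16": len(decompressed.encode("utf-16be")),
        # One extra symbol for the terminator, rounded up to whole bytes.
        "bitsize": ((1 + len(decompressed)) * bits_per_symbol + 7) // 8,
    }


# Hypothetical inputs: a 4-code-point string and a 100-symbol character set.
estimates = uncompressed_size_estimates("ァァァァ", list(range(100)))
```

With 100 symbols each character needs 7 bits, so four characters plus a terminator take 35 bits, or 5 bytes, against 12 bytes in UTF-8 and 8 bytes in UTF-16.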
Jeff Epler	d0f9b5901e	translations: document the compressed format	2020-05-28 11:30:46 -05:00
Jeff Epler	fe3e8d1589	string compression: save a few bits per string Length was stored as a 16-bit number always. Most translations have a max length far less. For example, US English translation lengths always fit in just 8 bits. Probably all languages fit in 9 bits. This also has the side effect of reducing the alignment of compressed_string_t from 2 bytes to 1. Testing performed: ran in German and English on pyruler, printed messages looked right. Firmware size, en_US Before: 3044 bytes free in flash After: 3408 bytes free in flash Firmware size, de_DE (with #2967 merged to restore translations) Before: 1236 bytes free in flash After: 1600 bytes free in flash	2020-05-28 08:36:08 -05:00
Jeff Epler	1a0dcb5caa	makeqstrdata: reclaim some more bytes on some translations If a translation only has unicode code points 255 and below, the "values" array can be 8 bits instead of 16 bits. This reclaims some code size, e.g., in a local build, trinket_m0 / en_US reclaimed 112 bytes and de_DE reclaimed 104 bytes. However, languages like zh_Latn_pinyin, which use code points above 255, did not benefit.	2019-12-02 14:49:23 -06:00
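The width decision described here reduces to a check on the largest code point. A minimal sketch, assuming a hypothetical helper that returns the C element type as a string:

```python
def values_element_type(values):
    # 8-bit entries suffice when every code point is 255 or below.
    largest = max((ord(ch) for ch in values), default=0)
    return "uint8_t" if largest <= 255 else "uint16_t"


english_type = values_element_type("Press any key")
pinyin_type = values_element_type("Shǐyòng")  # 'ǐ' is U+01D0, above 255
```

This matches the commit's observation that en_US and de_DE benefit while zh_Latn_pinyin does not.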
Jeff Epler	879e1041c9	makeqstrdata: fix printing of 'increased length' message	2019-12-02 10:18:48 -06:00
Jeff Epler	e06a3bbceb	translation: Compress as unicode, not bytes By treating each unicode code-point as a single entity for huffman compression, the overall compression rate can be somewhat improved without changing the algorithm. On the decompression side, when compressed values above 127 are encountered, they need to be converted from a 16-bit Unicode code point into a UTF-8 byte sequence. Doing this returns approximately 1.5kB of flash storage with the zh_Latn_pinyin translation. (292 -> 1768 bytes remaining in my build of trinket_m0) Other "more ASCII" translations benefit less, and in fact zh_Latn_pinyin is no longer the most constrained translation! (de_DE 1156 -> 1384 bytes free in flash, I didn't check others before pushing for CI) English is slightly pessimized, 2840 -> 2788 bytes, probably mostly because the "values" array was changed from uint8_t to uint16_t, which is strictly not required for an all-ASCII translation. This could probably be avoided in this case, but as English is not the most constrained translation it doesn't really matter. Testing performed: built for feather nRF52840 express and trinket m0 in English and zh_Latn_pinyin; ran and verified the localized messages such as Àn xià rènhé jiàn jìnrù REPL. Shǐyòng CTRL-D chóngxīn jiāzài. and Press any key to enter the REPL. Use CTRL-D to reload. were properly displayed.	2019-12-02 09:46:46 -06:00
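The decompression step mentioned here, turning a 16-bit code point into a UTF-8 byte sequence, follows the standard UTF-8 layout. The firmware does this in C; the Python sketch below only illustrates the bit manipulation, and the function name is hypothetical:

```python
def utf8_bytes(code_point):
    # Standard UTF-8 encoding for code points that fit in 16 bits.
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


samples = [0x41, 0xE0, 0x1D0, 0x30A1]  # 'A', 'à', 'ǐ', 'ァ'
encoded_samples = [utf8_bytes(cp) for cp in samples]
```

Values of 127 and below pass through as a single byte, which is why only compressed values above 127 need the conversion.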
Jeff Epler	c4f3a02b3b	makeqstrdata: permit longer "compressed" outputs It is possible for this routine to expand some inputs, and in fact it does for certain strings in the proposed Korean translation of CircuitPython (#1858). I did not determine what the maximum expansion is -- it's probably modest, like len()/7+2 bytes or something -- so I tried to just make enc[] an adequate over-allocation, and then ensured that all the strings in the proposed ko.po now worked. The worst actual expansion seems to be a string that goes from 65 UTF-8-encoded bytes to 68 compressed bytes (+4.6%). Only a few out of all strings are reported as non-compressed.	2019-08-06 07:39:09 -05:00
Scott Shawcroft	355abc835e	Fix output overflow and make help translatable	2018-11-09 16:41:08 -08:00
Scott Shawcroft	137a30ad75	fix mpy-cross	2018-08-16 17:40:57 -07:00
Scott Shawcroft	de5a9d72dc	Compress all translated strings with Huffman coding. This saves code space in builds which use link-time optimization. The optimization drops the untranslated strings and replaces them with a compressed_string_t struct. It can then be decompressed to a c string. Builds without LTO work as well but include both untranslated strings and compressed strings. This work could be expanded to include QSTRs and loaded strings if a compress method is added to C. It's tracked in #531.	2018-08-16 17:40:57 -07:00
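For readers unfamiliar with Huffman coding, the sketch below computes per-character code lengths from a corpus using a heap. It is a textbook construction written for illustration, not the project's implementation, and the function name is hypothetical:

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(texts):
    # Frequency of each character across all strings.
    freq = Counter(ch for text in texts for ch in text)
    if not freq:
        return {}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(n, next(tiebreak), {ch: 0}) for ch, n in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {ch: 1 for ch in heap[0][2]}
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one
        # bit deeper.
        n1, _, a = heapq.heappop(heap)
        n2, _, b = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**a, **b}.items()}
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


lengths = huffman_code_lengths(["aaaa", "bb", "c", "d"])
total_bits = sum(lengths[ch] for text in ["aaaa", "bb", "c", "d"] for ch in text)
```

Frequent characters receive short codes: here "a" takes 1 bit while "c" and "d" take 3, so the eight characters fit in 14 bits rather than 64.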
Scott Shawcroft	4513bd6ea3	Fix translation newlines Escape table was incorrect	2018-08-10 16:17:03 -07:00
Scott Shawcroft	24e53ad591	Rework escaping and fix ESP build.	2018-08-09 15:58:45 -07:00
Scott Shawcroft	96ebf5bc3f	Two fixes and translate more strings. * Fix finding translations with escaped characters. * Add back \r to translations since it's needed by screen.	2018-08-09 13:29:30 -07:00
Scott Shawcroft	933add6cd8	Support internationalisation.	2018-08-07 14:58:57 -07:00
Damien George	3678a6bdc6	py/modbuiltins: Make built-in dir support the __dir__ special method. If MICROPY_PY_ALL_SPECIAL_METHODS is enabled then dir() will now delegate to the special method __dir__ if the object it is listing has this method.	2018-05-10 23:14:23 +10:00
Paul Sokolovsky	9956fd0710	py/objtype: Fit qstrs for special methods in byte type. Update makeqstrdata.py to sort strings starting with "__" to the beginning of qstr list, so they get low qstr id's, guaranteed to fit in 8 bits. Then use this property to further compact op_id => qstr mapping arrays.	2017-10-21 11:06:32 +03:00
Damien George	f127bef3e4	py/makeqstrdata.py: Compute the qstr hash from bytes, not characters.	2016-09-02 14:32:47 +10:00
Damien George	202d5acd06	py/makeqstrdata.py: Allow to have double-quote characters in qstrs. When rendering the qstr for a C header file, the double-quote character must be escaped.	2016-05-23 15:18:55 +01:00
Damien George	a649d72606	py/makeqstrdata: Add special case to handle \n qstr.	2016-04-14 15:22:36 +01:00
Damien George	2243d68345	py/makeqstrdata: Reinstate Python2 compatibility.	2016-04-14 14:37:04 +01:00
Damien George	49bb04ee64	py/makeqstrdata: Fix rendering of qstrs that have non-printable ASCII. The qstr data needs to be turned into a proper C string so non-ASCII chars must be properly escaped according to C rules.	2016-04-14 14:20:25 +01:00
Damien George	f30b6f0af5	py/makeqstrdata: Add more names for escaped chars and esc non-printable. Non-printable characters are escaped as 0xXX, where XX are the hex digits of the character value.	2016-04-13 22:12:39 +01:00
Damien George	594fa73411	py/makeqstrdata: Factor out some code to functions that can be reused.	2016-04-13 16:05:43 +01:00
Paul Sokolovsky	53ca6ae1f3	py/makeqstrdata.py: Catch and report case of empty input file. The usual cause would be that a cross-compiler for a port is not in PATH.	2015-10-11 11:09:57 +03:00
Tony Abboud	8d8fdcb4be	stmhal: add option to query for the current usb mode Fetch the current usb mode and return a string representation when pyb.usb_mode() is called with no args. The possible string values are interned as qstr's. None will be returned if an incorrect mode is set.	2015-09-03 23:30:43 +01:00
Paul Sokolovsky	3a2fb201a5	makeqstrdata.py: Typo fix in comment.	2015-07-31 14:58:14 +03:00
Damien George	c3bd9415cc	py: Make qstr hash size configurable, defaults to 2 bytes. This patch makes configurable, via MICROPY_QSTR_BYTES_IN_HASH, the number of bytes used for a qstr hash. It was originally fixed at 2 bytes, and now defaults to 2 bytes. Setting it to 1 byte will save ROM and RAM at a small expense of hash collisions.	2015-07-20 11:03:13 +00:00
Damien George	26b512ea1b	py: Get makeqstrdata.py and makeversionhdr.py running under Python 2.6. These scripts should run under as wide a range of Python versions as possible.	2015-05-30 23:11:16 +01:00

77 Commits