circuitpython/supervisor/shared/translate/compressed_string.h

/*
 * This file is part of the MicroPython project, http://micropython.org/
 *
 * The MIT License (MIT)
 *
 * Copyright (c) 2018 Scott Shawcroft for Adafruit Industries
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
 */

#pragma once

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// The format of the compressed data is:
// - the size of the uncompressed string in UTF-8 bytes, encoded as a
//   (compress_max_length_bits)-bit number.  compress_max_length_bits is
//   computed during dictionary generation time, and happens to be 8
//   for all current platforms.  However, it'll probably end up being
//   9 in some translations sometime in the future.  This length excludes
//   the trailing NUL, though notably decompress_length includes it.
//
// - followed by the Huffman encoding of the individual UTF-16 code
//   points that make up the string.  The trailing "\0" is not
//   represented by a Huffman code, but is implied by the length.
//   (Building the Huffman encoding on UTF-16 code points gave better
//   compression than building it on UTF-8 bytes.)
//
// - code points starting at 128 (word_start) and potentially extending
//   to 255 (word_end) (but never interfering with the target
//   language's used code points) stand for dictionary entries in a
//   dictionary with size up to 256 code points.  The dictionary entries
//   are computed with a heuristic based on frequent substrings of 2 to
//   9 code points.  These are called "words" but are not, grammatically
//   speaking, words.  They're just spans of code points that frequently
//   occur together.  They are ordered shortest to longest.
//
// - dictionary entries are non-overlapping, and the _ending_ index of each
//   entry is stored in an array.  A count of words of each length, from
//   minlen to maxlen, is given in the array called wlencount.  From
//   this small array, the start and end of the Nth word can be
//   calculated by an efficient, small loop.  (A little lookup time is
//   traded for a smaller table of lengths.)
//
// The "data" / "tail" construct is so that the struct's last member is a
// "flexible array".  However, the _only_ member is not permitted to be
// a flexible member, so we have to declare the first byte as a separate
// member of the structure.
//
// For translations where the length needs 8 bits, this saves about 1.5
// bytes per string on average compared to a structure of {uint16_t,
// flexible array}.  It is also future-proofed against strings with
// UTF-8 length above 255 (a 9-bit length), where the savings are about
// 1.375 bytes per string.
typedef struct compressed_string {
    uint8_t data;
    const uint8_t tail[];
} compressed_string_t;
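
// A minimal sketch of the wlencount lookup described above, not the actual
// implementation.  Every name here (EXAMPLE_MINLEN, example_words,
// example_wlencount, example_word_bounds) and the table contents are
// hypothetical; it only assumes, as the comment above states, that words are
// stored back to back, shortest first, with one count per word length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical tables, NOT the generated ones.  example_wlencount[i] is the
// number of words of length (EXAMPLE_MINLEN + i).
#define EXAMPLE_MINLEN 2
static const char example_words[] = "ch%qingtionCirc";
static const uint8_t example_wlencount[] = { 2, 1, 2 };

// Find word n (0-based).  On success, store the half-open range
// [*start, *end) of offsets into example_words and return 1.
// Return 0 if n is past the last word.
static int example_word_bounds(unsigned n, size_t *start, size_t *end) {
    assert(start != NULL && end != NULL);
    size_t pos = 0;
    size_t len = EXAMPLE_MINLEN;
    for (size_t i = 0; i < sizeof(example_wlencount); i++, len++) {
        unsigned count = example_wlencount[i];
        if (n < count) {
            // Word n is the n-th remaining word of the current length.
            *start = pos + len * n;
            *end = *start + len;
            return 1;
        }
        // Skip every word of this length.
        n -= count;
        pos += len * count;
    }
    return 0;
}
```

// With these example tables, word 2 is "ing", found at offsets [4, 7) of
// example_words.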

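// Return the compressed, translated version of a source string.
// Usually, due to LTO, this is optimized into a load of a constant
// pointer.
// const compressed_string_t *translate(const char *c);

// Decompress the string and write it to the serial connection.
void serial_write_compressed(const compressed_string_t *compressed);
// Decompress into `decompressed`, which must hold decompress_length() bytes.
char *decompress(const compressed_string_t *compressed, char *decompressed);
// Uncompressed size in UTF-8 bytes, including the trailing NUL.
uint16_t decompress_length(const compressed_string_t *compressed);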
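
// A minimal sketch of how the length prefix described at the top of this file
// could be read, under stated assumptions rather than as the real decoder:
// bits are taken most-significant-bit first (the format comment does not say),
// and example_read_length_prefix / example_buffer_size are hypothetical names.
// The bit count is passed in by the caller where the real code would use
// compress_max_length_bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Read the first `nbits` bits of `bytes` as an unsigned number.
// ASSUMPTION: bits are consumed most-significant-bit first.
static uint16_t example_read_length_prefix(const uint8_t *bytes, unsigned nbits) {
    assert(bytes != NULL && nbits >= 1 && nbits <= 16);
    uint16_t value = 0;
    for (unsigned i = 0; i < nbits; i++) {
        unsigned bit = (bytes[i / 8] >> (7 - i % 8)) & 1u;
        value = (uint16_t)((value << 1) | bit);
    }
    return value;
}

// Buffer size needed for the decompressed text: the stored length plus one
// byte for the trailing NUL, which the stored length excludes.
static uint16_t example_buffer_size(const uint8_t *bytes, unsigned nbits) {
    return (uint16_t)(example_read_length_prefix(bytes, nbits) + 1);
}
```

// With an 8-bit prefix, a first byte of 0x05 means 5 bytes of UTF-8 text and
// a 6-byte buffer.  With a 9-bit prefix, the length spills one bit into the
// second byte, which is why the struct exposes raw bytes rather than a
// fixed-width length field.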