simon-git: putty (main): Simon Tatham
Commits to Tartarus hosted VCS
tartarus-commits at lists.tartarus.org
Fri Nov 11 08:53:23 GMT 2022
TL;DR:
991e22c9 Implement a BinarySink writing to a fixed-size buffer.
834b58e3 Make encode_utf8() output to a BinarySink.
d89f2bfc Fix typo in decode_utf8 tests.
69e217d2 Make decode_utf8() read from a BinarySource.
b72c9aba New script to generate Unicode data tables.
430af47a Polish the output of read_ucd.py.
4bb37233 Commit read_ucd.py's output and switch over to it.
4cb429e3 Update to Unicode 15.
b35d23f6 Implement Unicode normalisation.
d3e186e8 Function to check a UTF-8 string for unknown characters.
Repository: https://git.tartarus.org/simon/putty.git
On the web: https://git.tartarus.org/?p=simon/putty.git
Branch updated: main
Committer: Simon Tatham <anakin at pobox.com>
Date: 2022-11-11 08:53:23
commit 991e22c9bb83b67dcb74a2338ebd99da0d87dc49
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=991e22c9bb83b67dcb74a2338ebd99da0d87dc49;hp=c8ba48be43957498d2510f249f9d3e5f70b9833b
Author: Simon Tatham <anakin at pobox.com>
Date: Wed Nov 9 18:53:34 2022 +0000
Implement a BinarySink writing to a fixed-size buffer.
This is one of marshal.c's small collection of handy BinarySink
adapters to existing kinds of thing, alongside stdio_sink and
bufchain_sink. It writes into a fixed-size buffer, discarding all
writes after the buffer fills up, and sets a flag to let you know if
it overflowed.
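A minimal sketch of the idea, not the code from marshal.c: the names
fixed_sink, fixed_sink_init and fixed_sink_write are hypothetical, and
the real adapter plugs into the BinarySink machinery, which is left
out here. Keeping the prefix of a write that only partly fits is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-size-buffer sink. */
typedef struct fixed_sink {
    char *out;        /* next byte to write */
    size_t space;     /* bytes still free in the buffer */
    bool overflowed;  /* latched once any write did not fit */
} fixed_sink;

static void fixed_sink_init(fixed_sink *fs, char *buf, size_t len)
{
    assert(fs && buf);
    fs->out = buf;
    fs->space = len;
    fs->overflowed = false;
}

static void fixed_sink_write(fixed_sink *fs, const void *data, size_t len)
{
    if (len > fs->space) {
        /* Assumption: keep whatever still fits, drop the rest,
         * and remember that data was lost. */
        len = fs->space;
        fs->overflowed = true;
    }
    memcpy(fs->out, data, len);
    fs->out += len;
    fs->space -= len;
}
```

The caller writes as much as it likes and checks the overflowed flag
once at the end, instead of testing for space before every write.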
There was one of these in Windows Pageant a while back, under the name
'struct PageantReply' (introduced in commit b6cbad89fc56c1f, removed
again in 98538caa39d20f3 when the named-pipe revamp made it
unnecessary). This is the same idea but centralised for reusability.
defs.h | 1 +
marshal.h | 7 +++++++
utils/marshal.c | 20 ++++++++++++++++++++
3 files changed, 28 insertions(+)
commit 834b58e39b2c6eddc717532411d90e05746b9df2
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=834b58e39b2c6eddc717532411d90e05746b9df2;hp=991e22c9bb83b67dcb74a2338ebd99da0d87dc49
Author: Simon Tatham <anakin at pobox.com>
Date: Wed Nov 9 18:56:51 2022 +0000
Make encode_utf8() output to a BinarySink.
Previously it output to an ordinary char buffer, and returned the
number of bytes it had written. But three out of the four call sites
immediately chucked the resulting bytes into a BinarySink anyway. The
fourth, in windows/unicode.c, really is writing into successive
locations of a fixed-size buffer - but we can make that into a
BinarySink too, using the buffer_sink added in the previous commit.
So now encode_utf8() is renamed put_utf8_char, and the call sites all
look simpler than they started out.
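A rough illustration of the shape of the change, under stated
assumptions: byte_sink, array_sink and sketch_put_utf8_char are
hypothetical names standing in for BinarySink and put_utf8_char, whose
real definitions are in marshal.h and utils/encode_utf8.c. The point
is that the encoder hands its bytes to a write callback instead of
returning a length for the caller to deal with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a BinarySink: just a write callback. */
typedef struct byte_sink {
    void (*write)(struct byte_sink *bs, const void *data, size_t len);
} byte_sink;

/* A trivial sink that appends to a small array, for demonstration. */
typedef struct array_sink {
    byte_sink bs;              /* first member, so the cast below is valid */
    unsigned char data[16];
    size_t len;
} array_sink;

static void array_sink_write(byte_sink *bs, const void *data, size_t len)
{
    array_sink *as = (array_sink *)bs;
    assert(len <= sizeof(as->data) - as->len);
    memcpy(as->data + as->len, data, len);
    as->len += len;
}

/* Encode one code point as UTF-8 and push the bytes into the sink. */
static void sketch_put_utf8_char(byte_sink *bs, unsigned long ch)
{
    unsigned char buf[4];
    size_t n = 0;

    if (ch < 0x80) {
        buf[n++] = (unsigned char)ch;
    } else if (ch < 0x800) {
        buf[n++] = (unsigned char)(0xC0 | (ch >> 6));
        buf[n++] = (unsigned char)(0x80 | (ch & 0x3F));
    } else if (ch < 0x10000) {
        buf[n++] = (unsigned char)(0xE0 | (ch >> 12));
        buf[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        buf[n++] = (unsigned char)(0x80 | (ch & 0x3F));
    } else {
        assert(ch <= 0x10FFFF);
        buf[n++] = (unsigned char)(0xF0 | (ch >> 18));
        buf[n++] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
        buf[n++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        buf[n++] = (unsigned char)(0x80 | (ch & 0x3F));
    }
    bs->write(bs, buf, n);
}
```

A call site then needs no length bookkeeping of its own: it passes
the sink and the character, and the sink tracks where the bytes go.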
marshal.h | 5 +++++
misc.h | 5 -----
terminal/terminal.c | 3 +--
utils/encode_utf8.c | 25 +++++++++++--------------
utils/encode_wide_string_as_utf8.c | 4 +---
utils/stripctrl.c | 11 ++---------
windows/unicode.c | 19 +++++++------------
7 files changed, 27 insertions(+), 45 deletions(-)
commit d89f2bfc55920e8de52d7bf1824cfdc2365339f4
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=d89f2bfc55920e8de52d7bf1824cfdc2365339f4;hp=834b58e39b2c6eddc717532411d90e05746b9df2
Author: Simon Tatham <anakin at pobox.com>
Date: Wed Nov 9 19:18:45 2022 +0000
Fix typo in decode_utf8 tests.
The test in question was supposed to contain the spurious UTF-8
encoding that 0xD800 would have if it were not a surrogate. But the
final continuation character 0x80 was instead 0x00.
The test passed anyway, because ED A0 was regarded as a truncated
sequence, instead of ED A0 80 being regarded as an illegal encoding of
a surrogate, and both return the same output!
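The following sketch shows why the typo went unnoticed. It is not the
decoder from utils/decode_utf8.c: classify3 and decode3 are
hypothetical helpers, and the use of U+FFFD as the common error output
is an assumption (the message above says only that both cases return
the same output). The two byte sequences fail for different reasons,
but a test that looks only at the decoded value cannot tell them
apart.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    SEQ_OK, SEQ_TRUNCATED, SEQ_SURROGATE, SEQ_OVERLONG
} seq_status;

/* Classify a candidate 3-byte UTF-8 sequence whose lead byte is 0xE?. */
static seq_status classify3(const unsigned char *p, size_t len,
                            unsigned long *out)
{
    assert(p && out && len >= 1 && (p[0] & 0xF0) == 0xE0);
    unsigned long ch = p[0] & 0x0F;
    for (size_t i = 1; i < 3; i++) {
        /* A missing or non-continuation byte ends the sequence early. */
        if (i >= len || (p[i] & 0xC0) != 0x80)
            return SEQ_TRUNCATED;
        ch = (ch << 6) | (p[i] & 0x3F);
    }
    if (ch < 0x800)
        return SEQ_OVERLONG;
    if (ch >= 0xD800 && ch <= 0xDFFF)
        return SEQ_SURROGATE;       /* ED A0 80 would decode to 0xD800 */
    *out = ch;
    return SEQ_OK;
}

/* Assumption: every kind of error collapses to U+FFFD. */
static unsigned long decode3(const unsigned char *p, size_t len)
{
    unsigned long ch;
    return classify3(p, len, &ch) == SEQ_OK ? ch : 0xFFFD;
}
```

With the typo (ED A0 00) the sequence is classified as truncated; with
the intended bytes (ED A0 80) it is classified as a surrogate. In this
sketch decode3 gives the same value for both.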
utils/decode_utf8.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
commit 69e217d23a00b54f84e9d96469f0798aa05b88f7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=69e217d23a00b54f84e9d96469f0798aa05b88f7;hp=d89f2bfc55920e8de52d7bf1824cfdc2365339f4
Author: Simon Tatham <anakin at pobox.com>
Date: Wed Nov 9 19:01:04 2022 +0000
Make decode_utf8() read from a BinarySource.
This enables it to handle data that isn't presented as a
NUL-terminated string.
In particular, the NUL byte can appear _within_ the string and be
correctly translated to the NUL wide character. So I've been able to
remove the awkwardness in the test rig of having to include the
terminating NUL in every test to ensure NUL has been tested, and
instead, insert a single explicit test for it.
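As a sketch of what length-delimited input buys, assuming a simple
cursor structure in place of the real BinarySource: byte_source and
sketch_decode_utf8 are hypothetical names, and returning U+FFFD on
malformed input is an assumption of this sketch. Because the end of
input is known from the length rather than from a terminator, a NUL
byte is just another character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a BinarySource: data, length, read position. */
typedef struct byte_source {
    const unsigned char *data;
    size_t len, pos;
} byte_source;

/* Decode one character; the caller loops while pos < len. */
static unsigned long sketch_decode_utf8(byte_source *src)
{
    assert(src && src->pos < src->len);
    unsigned char c = src->data[src->pos++];
    unsigned long ch, min;
    int more;

    if (c < 0x80)
        return c;                      /* includes an embedded NUL */
    else if ((c & 0xE0) == 0xC0) { ch = c & 0x1F; more = 1; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { ch = c & 0x0F; more = 2; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { ch = c & 0x07; more = 3; min = 0x10000; }
    else
        return 0xFFFD;                 /* stray continuation or bad lead */

    while (more-- > 0) {
        /* Stop at end of data or at a byte that is not a continuation,
         * leaving that byte unread for the next call. */
        if (src->pos >= src->len || (src->data[src->pos] & 0xC0) != 0x80)
            return 0xFFFD;
        ch = (ch << 6) | (src->data[src->pos++] & 0x3F);
    }
    if (ch < min || ch > 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF))
        return 0xFFFD;                 /* overlong, out of range, surrogate */
    return ch;
}
```

Feeding it the four bytes 'a', 00, C3, A9 yields three characters,
with the NUL in the middle decoded like any other.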
Similarly to the previous commit, the simplification at the (one) call
site gives me a strong feeling of 'this is what the API should have
been all along'!
misc.h | 7 ++++---
utils/decode_utf8.c | 49 +++++++++++++++++++++++++++-----------------
utils/decode_utf8_to_wchar.c | 4 ++--
windows/unicode.c | 16 +++++----------
4 files changed, 41 insertions(+), 35 deletions(-)
commit b72c9aba287e4933eca5254f82c2b3d0e1e437c7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=b72c9aba287e4933eca5254f82c2b3d0e1e437c7;hp=69e217d23a00b54f84e9d96469f0798aa05b88f7
Author: Simon Tatham <anakin at pobox.com>
Date: Tue Nov 8 18:11:44 2022 +0000
New script to generate Unicode data tables.
This will replace the various pieces of Perl scattered throughout the
code base in comments above long boring data tables. The idea is that
those long boring tables will move into header files in the new
'unicode' directory, and will be #included from the source files that
use the tables.
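To illustrate the layout described above, here is a hedged sketch of
an array whose body would come from a generated header. The struct,
the array name and the lookup function are hypothetical, and the three
rows are a hand-picked subset of wide-character ranges written inline
so that the example compiles on its own; the format of the real
generated files may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cp_range { unsigned long first, last; };

static const struct cp_range sketch_wide_ranges[] = {
    /* In the arrangement described above, this body would be one line:
     *     #include "unicode/wide_chars.h"
     * The rows below are an illustrative subset only. */
    {0x1100, 0x115F},   /* Hangul Jamo initial consonants */
    {0x3041, 0x3096},   /* Hiragana letters */
    {0xAC00, 0xD7A3},   /* Hangul syllables */
};

/* Binary search over a sorted table of non-overlapping ranges. */
static bool in_ranges(const struct cp_range *tab, size_t n,
                      unsigned long ch)
{
    assert(tab || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ch < tab[mid].first)
            hi = mid;
        else if (ch > tab[mid].last)
            lo = mid + 1;
        else
            return true;
    }
    return false;
}
```

The source file keeps only the declaration and the lookup code, so
regenerating the data never touches hand-written logic.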
One benefit is that I won't have to page tediously past the tables to
get to the actual code I want to edit. But more importantly, it should
now become easy to update to a new version of Unicode, by re-running
just one script and committing the changed versions of all the headers
in the 'unicode' subdir.
This version of the script regenerates six Unicode-derived tables in
the existing source code in a byte-for-byte identical form. In the
next commits I'll clean it up, commit the output, and delete the
tables from their previous locations.
(One table I _haven't_ incorporated into this system is the Arabic
shaping table in bidi.c, because my attempt to regenerate it came out
not matching the original at all. That _might_ be because the table is
based on an old Unicode standard and desperately needs updating, but
it might also be because I misunderstood how it works. So I'll leave
sorting that out for another time.)
unicode/read_ucd.py | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 295 insertions(+)
commit 430af47a38f6c3ac710bd7c8c9f86e7497c6dc5a
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=430af47a38f6c3ac710bd7c8c9f86e7497c6dc5a;hp=b72c9aba287e4933eca5254f82c2b3d0e1e437c7
Author: Simon Tatham <anakin at pobox.com>
Date: Tue Nov 8 18:04:46 2022 +0000
Polish the output of read_ucd.py.
The initial outputs were all deliberately inconsistent with each
other, so that each one exactly matched the existing table I was
trying to replace.
Now I've done that check, I can clean them up. Normalised spacing and
case to be consistent; removed pointless indentation (these are now
include files, so they don't have to be indented to the same level as
the array declaration surrounding each one's #include); added a header
comment in each autogenerated file, saying that it's autogenerated,
what it's for, and who it's used by.
The currently supported version number of Unicode is also exposed in a
header file, so that I can put it in diagnostics.
unicode/read_ucd.py | 108 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 103 insertions(+), 5 deletions(-)
commit 4bb37233a5f1b3ebc320231ac7f7e4a93ad0f9b2
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=4bb37233a5f1b3ebc320231ac7f7e4a93ad0f9b2;hp=430af47a38f6c3ac710bd7c8c9f86e7497c6dc5a
Author: Simon Tatham <anakin at pobox.com>
Date: Fri Nov 11 08:40:59 2022 +0000
Commit read_ucd.py's output and switch over to it.
This removes the superseded tables in source files, and also all the
code snippets in comments that generated them.
terminal/bidi.c | 1930 +---------------------------------------
unicode/ambiguous_wide_chars.h | 189 ++++
unicode/bidi_brackets.h | 139 +++
unicode/bidi_mirror.h | 437 +++++++++
unicode/bidi_type.h | 1306 +++++++++++++++++++++++++++
unicode/nonspacing_chars.h | 357 ++++++++
unicode/version.h | 9 +
unicode/wide_chars.h | 130 +++
utils/wcwidth.c | 688 +-------------
9 files changed, 2575 insertions(+), 2610 deletions(-)
commit 4cb429e3f4009c290f3b992a1b2847959c47b2b7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=4cb429e3f4009c290f3b992a1b2847959c47b2b7;hp=4bb37233a5f1b3ebc320231ac7f7e4a93ad0f9b2
Author: Simon Tatham <anakin at pobox.com>
Date: Wed Nov 9 08:48:54 2022 +0000
Update to Unicode 15.
Now I have a script I can easily re-run, there's no reason not to do
just that! This updates all of the new generated header files for the
UCD.zip that comes with Unicode 15.0.0.
I've re-run my bidi test suite against 15.0.0's file of test cases,
and confirmed they all pass.
terminal/bidi.c | 6 +++---
unicode/ambiguous_wide_chars.h | 2 +-
unicode/bidi_brackets.h | 2 +-
unicode/bidi_mirror.h | 2 +-
unicode/bidi_type.h | 37 +++++++++++++++++++++++++++++++------
unicode/nonspacing_chars.h | 15 ++++++++++++---
unicode/wide_chars.h | 22 +++++++++++-----------
7 files changed, 60 insertions(+), 26 deletions(-)
commit b35d23f6999b8511ea719deb5c8739de93cb95d7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=b35d23f6999b8511ea719deb5c8739de93cb95d7;hp=4cb429e3f4009c290f3b992a1b2847959c47b2b7
Author: Simon Tatham <anakin at pobox.com>
Date: Wed Nov 9 19:28:51 2022 +0000
Implement Unicode normalisation.
A new module in 'utils' computes NFC and NFD, via a new set of data
tables generated by read_ucd.py.
The new module comes with a new test program, which can read the
NormalizationTest.txt that appears in the Unicode Character Database.
All the tests pass, as of Unicode 15.
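For readers unfamiliar with normalisation, a toy sketch of the NFD
half follows. It is not the code in utils/unicode-norm.c: every name
is hypothetical, the two tables are tiny hand-written stand-ins for
the generated headers, and decomposition here is one level deep,
whereas a full implementation applies it recursively and also handles
Hangul and NFC composition.

```c
#include <assert.h>
#include <stddef.h>

/* Toy canonical combining class table (0 means "starter"). */
static int sketch_ccc(unsigned long ch)
{
    switch (ch) {
      case 0x0323: return 220;   /* COMBINING DOT BELOW */
      case 0x0301: return 230;   /* COMBINING ACUTE ACCENT */
      case 0x0307: return 230;   /* COMBINING DOT ABOVE */
      default:     return 0;
    }
}

/* Toy canonical decomposition table; returns the output length. */
static size_t sketch_decomp(unsigned long ch, unsigned long out[2])
{
    switch (ch) {
      case 0x00E9: out[0] = 0x0065; out[1] = 0x0301; return 2; /* e-acute */
      case 0x1E61: out[0] = 0x0073; out[1] = 0x0307; return 2; /* s dot above */
      default:     out[0] = ch; return 1;
    }
}

/* Decompose each input character, then put combining marks into
 * canonical order by a stable insertion step. */
static size_t sketch_nfd(const unsigned long *in, size_t n,
                         unsigned long *out, size_t outcap)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long d[2];
        size_t dn = sketch_decomp(in[i], d);
        for (size_t j = 0; j < dn; j++) {
            assert(len < outcap);
            size_t k = len++;
            int cc = sketch_ccc(d[j]);
            /* Move a mark left past marks of strictly greater class.
             * Starters never move, and nothing moves past a starter. */
            while (cc != 0 && k > 0 && sketch_ccc(out[k - 1]) > cc) {
                out[k] = out[k - 1];
                k--;
            }
            out[k] = d[j];
        }
    }
    return len;
}
```

Given U+1E61 followed by U+0323, the sketch first expands to U+0073
U+0307 U+0323 and then reorders the two marks, because the dot below
has the lower combining class.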
CMakeLists.txt | 5 +
misc.h | 3 +
unicode/canonical_comp.h | 950 ++++++++++++++++++++
unicode/canonical_decomp.h | 2071 +++++++++++++++++++++++++++++++++++++++++++
unicode/combining_classes.h | 398 +++++++++
unicode/read_ucd.py | 121 ++-
utils/CMakeLists.txt | 1 +
utils/unicode-norm.c | 446 ++++++++++
8 files changed, 3985 insertions(+), 10 deletions(-)
commit d3e186e81b1d30c8b0c42ac98ef0a2e15a4838ec
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=d3e186e81b1d30c8b0c42ac98ef0a2e15a4838ec;hp=b35d23f6999b8511ea719deb5c8739de93cb95d7
Author: Simon Tatham <anakin at pobox.com>
Date: Wed Nov 9 08:56:11 2022 +0000
Function to check a UTF-8 string for unknown characters.
This lets us reject strings containing characters we don't yet know
how to normalise to NFC.
misc.h | 4 +
unicode/known_chars.h | 716 ++++++++++++++++++++++++++++++++++++++++++++++++++
unicode/read_ucd.py | 16 ++
utils/CMakeLists.txt | 1 +
utils/unicode-known.c | 53 ++++
5 files changed, 790 insertions(+)