simon-git: putty (main): Simon Tatham

Fri Nov 11 08:53:23 GMT 2022

TL;DR:
  991e22c9 Implement a BinarySink writing to a fixed-size buffer.
  834b58e3 Make encode_utf8() output to a BinarySink.
  d89f2bfc Fix typo in decode_utf8 tests.
  69e217d2 Make decode_utf8() read from a BinarySource.
  b72c9aba New script to generate Unicode data tables.
  430af47a Polish the output of read_ucd.py.
  4bb37233 Commit read_ucd.py's output and switch over to it.
  4cb429e3 Update to Unicode 15.
  b35d23f6 Implement Unicode normalisation.
  d3e186e8 Function to check a UTF-8 string for unknown characters.

Repository:     https://git.tartarus.org/simon/putty.git
On the web:     https://git.tartarus.org/?p=simon/putty.git
Branch updated: main
Committer:      Simon Tatham <anakin at pobox.com>
Date:           2022-11-11 08:53:23

commit 991e22c9bb83b67dcb74a2338ebd99da0d87dc49
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=991e22c9bb83b67dcb74a2338ebd99da0d87dc49;hp=c8ba48be43957498d2510f249f9d3e5f70b9833b
Author: Simon Tatham <anakin at pobox.com>
Date:   Wed Nov 9 18:53:34 2022 +0000

    Implement a BinarySink writing to a fixed-size buffer.

    This is one of marshal.c's small collection of handy BinarySink
    adapters to existing kinds of thing, alongside stdio_sink and
    bufchain_sink. It writes into a fixed-size buffer, discarding all
    writes after the buffer fills up, and sets a flag to let you know if
    it overflowed.

    There was one of these in Windows Pageant a while back, under the name
    'struct PageantReply' (introduced in commit b6cbad89fc56c1f, removed
    again in 98538caa39d20f3 when the named-pipe revamp made it
    unnecessary). This is the same idea but centralised for reusability.

 defs.h          |  1 +
 marshal.h       |  7 +++++++
 utils/marshal.c | 20 ++++++++++++++++++++
 3 files changed, 28 insertions(+)
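
Illustrative sketch (not part of the commit): the behaviour described
above, a sink that writes into fixed storage, drops whatever does not
fit, and records that it overflowed, can be modelled in a few lines of
Python. The class name BufferSink and its methods are hypothetical
stand-ins rather than the C API in marshal.h, and truncating the write
that crosses the boundary (instead of dropping it whole) is an
assumption.

```python
class BufferSink:
    """Fixed-capacity byte sink (hypothetical stand-in, not PuTTY's C code)."""

    def __init__(self, capacity: int) -> None:
        self.buf = bytearray(capacity)  # storage never grows
        self.used = 0                   # bytes written so far
        self.overflowed = False         # set once any data has been dropped

    def write(self, data: bytes) -> None:
        space = len(self.buf) - self.used
        if len(data) > space:
            # Assumption: keep the part that fits and drop the rest.
            data = data[:space]
            self.overflowed = True
        self.buf[self.used:self.used + len(data)] = data
        self.used += len(data)

    def contents(self) -> bytes:
        return bytes(self.buf[:self.used])


sink = BufferSink(4)
sink.write(b"abc")
sink.write(b"def")  # only b"d" fits
sink.write(b"g")    # buffer already full, so this is discarded
```

The caller checks the overflowed flag once at the end, instead of
tracking the remaining space at every write.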

commit 834b58e39b2c6eddc717532411d90e05746b9df2
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=834b58e39b2c6eddc717532411d90e05746b9df2;hp=991e22c9bb83b67dcb74a2338ebd99da0d87dc49
Author: Simon Tatham <anakin at pobox.com>
Date:   Wed Nov 9 18:56:51 2022 +0000

    Make encode_utf8() output to a BinarySink.

    Previously it output to an ordinary char buffer, and returned the
    number of bytes it had written. But three out of the four call sites
    immediately chucked the resulting bytes into a BinarySink anyway. The
    fourth, in windows/unicode.c, really is writing into successive
    locations of a fixed-size buffer - but we can make that into a
    BinarySink too, using the buffer_sink added in the previous commit.

    So now encode_utf8() is renamed put_utf8_char, and the call sites all
    look simpler than they started out.

 marshal.h                          |  5 +++++
 misc.h                             |  5 -----
 terminal/terminal.c                |  3 +--
 utils/encode_utf8.c                | 25 +++++++++++--------------
 utils/encode_wide_string_as_utf8.c |  4 +---
 utils/stripctrl.c                  | 11 ++---------
 windows/unicode.c                  | 19 +++++++------------
 7 files changed, 27 insertions(+), 45 deletions(-)
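
Illustrative sketch (not part of the commit): a UTF-8 encoder that
appends to a sink rather than filling a char buffer and returning a
length. The name put_utf8_char comes from the commit message, but this
Python signature, with a bytearray standing in for the BinarySink, is
an assumption made for the example. The sketch does not reject
surrogate code points.

```python
def put_utf8_char(sink: bytearray, ch: int) -> None:
    """Append the UTF-8 encoding of code point ch to sink."""
    if ch < 0 or ch > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if ch < 0x80:
        sink.append(ch)  # 1 byte: 0xxxxxxx
    elif ch < 0x800:
        sink.extend([0xC0 | (ch >> 6),
                     0x80 | (ch & 0x3F)])  # 2 bytes
    elif ch < 0x10000:
        sink.extend([0xE0 | (ch >> 12),
                     0x80 | ((ch >> 6) & 0x3F),
                     0x80 | (ch & 0x3F)])  # 3 bytes
    else:
        sink.extend([0xF0 | (ch >> 18),
                     0x80 | ((ch >> 12) & 0x3F),
                     0x80 | ((ch >> 6) & 0x3F),
                     0x80 | (ch & 0x3F)])  # 4 bytes


out = bytearray()
for code_point in (0x41, 0xE9, 0x20AC, 0x1F600):
    put_utf8_char(out, code_point)
```

Because the function appends to the sink itself, the call site needs
neither a temporary buffer nor a byte count.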

commit d89f2bfc55920e8de52d7bf1824cfdc2365339f4
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=d89f2bfc55920e8de52d7bf1824cfdc2365339f4;hp=834b58e39b2c6eddc717532411d90e05746b9df2
Author: Simon Tatham <anakin at pobox.com>
Date:   Wed Nov 9 19:18:45 2022 +0000

    Fix typo in decode_utf8 tests.

    The test in question was supposed to contain the spurious UTF-8
    encoding that 0xD800 would have if it were not a surrogate. But the
    final continuation character 0x80 was instead 0x00.

    The test passed anyway, because ED A0 was regarded as a truncated
    sequence, instead of ED A0 80 being regarded as an illegal encoding of
    a surrogate, and both return the same output!

 utils/decode_utf8.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
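
Illustrative sketch (not part of the commit): why the mistyped test
could not fail. The decoder below is a minimal Python model, not the
code in utils/decode_utf8.c. It assumes that every kind of error is
reported as U+FFFD, and decode_one is a hypothetical name.

```python
REPLACEMENT = 0xFFFD


def decode_one(data: bytes, pos: int) -> tuple:
    """Decode one character at data[pos]; return (code point, next pos)."""
    lead = data[pos]
    pos += 1
    if lead < 0x80:
        return lead, pos
    if 0xC0 <= lead < 0xE0:
        need, value, minimum = 1, lead & 0x1F, 0x80
    elif 0xE0 <= lead < 0xF0:
        need, value, minimum = 2, lead & 0x0F, 0x800
    elif 0xF0 <= lead < 0xF8:
        need, value, minimum = 3, lead & 0x07, 0x10000
    else:
        return REPLACEMENT, pos  # stray continuation or invalid lead byte
    for _ in range(need):
        if pos >= len(data) or (data[pos] & 0xC0) != 0x80:
            return REPLACEMENT, pos  # truncated: offending byte left unread
        value = (value << 6) | (data[pos] & 0x3F)
        pos += 1
    if value < minimum or value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        return REPLACEMENT, pos  # overlong, out of range, or surrogate
    return value, pos


typo = decode_one(bytes([0xED, 0xA0, 0x00]), 0)   # truncated after ED A0
fixed = decode_one(bytes([0xED, 0xA0, 0x80]), 0)  # encodes surrogate 0xD800
```

Both calls report U+FFFD, so a test that compares only the decoded
character cannot tell the two inputs apart. Only the number of bytes
consumed differs (2 versus 3).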

commit 69e217d23a00b54f84e9d96469f0798aa05b88f7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=69e217d23a00b54f84e9d96469f0798aa05b88f7;hp=d89f2bfc55920e8de52d7bf1824cfdc2365339f4
Author: Simon Tatham <anakin at pobox.com>
Date:   Wed Nov 9 19:01:04 2022 +0000

    Make decode_utf8() read from a BinarySource.

    This enables it to handle data that isn't presented as a
    NUL-terminated string.

    In particular, the NUL byte can appear _within_ the string and be
    correctly translated to the NUL wide character. So I've been able to
    remove the awkwardness in the test rig of having to include the
    terminating NUL in every test to ensure NUL has been tested, and
    instead, insert a single explicit test for it.

    Similarly to the previous commit, the simplification at the (one) call
    site gives me a strong feeling of 'this is what the API should have
    been all along'!

 misc.h                       |  7 ++++---
 utils/decode_utf8.c          | 49 +++++++++++++++++++++++++++-----------------
 utils/decode_utf8_to_wchar.c |  4 ++--
 windows/unicode.c            | 16 +++++----------
 4 files changed, 41 insertions(+), 35 deletions(-)
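
Illustrative sketch (not part of the commit): the difference between
a NUL-terminated reader and a length-delimited one. ByteSource is a
hypothetical Python stand-in for the idea of a BinarySource, and the
two reader functions are invented for this example.

```python
class ByteSource:
    """Length-delimited reader (hypothetical stand-in for a BinarySource)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0

    def avail(self) -> int:
        return len(self.data) - self.pos

    def get_byte(self) -> int:
        value = self.data[self.pos]
        self.pos += 1
        return value


def read_c_string(data: bytes) -> list:
    """Old style: stop at the first NUL, which can never be part of the data."""
    out = []
    for value in data:
        if value == 0:
            break
        out.append(value)
    return out


def read_source(src: ByteSource) -> list:
    """New style: stop when the source is exhausted; NUL is ordinary data."""
    out = []
    while src.avail() > 0:
        out.append(src.get_byte())
    return out


payload = b"a\x00b"
terminated = read_c_string(payload)
delimited = read_source(ByteSource(payload))
```

With an explicit length, the NUL in the middle of the payload is
returned like any other byte. The NUL-terminated reader silently
stops in front of it.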

commit b72c9aba287e4933eca5254f82c2b3d0e1e437c7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=b72c9aba287e4933eca5254f82c2b3d0e1e437c7;hp=69e217d23a00b54f84e9d96469f0798aa05b88f7
Author: Simon Tatham <anakin at pobox.com>
Date:   Tue Nov 8 18:11:44 2022 +0000

    New script to generate Unicode data tables.

    This will replace the various pieces of Perl scattered throughout the
    code base in comments above long boring data tables. The idea is that
    those long boring tables will move into header files in the new
    'unicode' directory, and will be #included from the source files that
    use the tables.

    One benefit is that I won't have to page tediously past the tables to
    get to the actual code I want to edit. But more importantly, it should
    now become easy to update to a new version of Unicode, by re-running
    just one script and committing the changed versions of all the headers
    in the 'unicode' subdir.

    This version of the script regenerates six Unicode-derived tables in
    the existing source code in a byte-for-byte identical form. In the
    next commits I'll clean it up, commit the output, and delete the
    tables from their previous locations.

    (One table I _haven't_ incorporated into this system is the Arabic
    shaping table in bidi.c, because my attempt to regenerate it came out
    not matching the original at all. That _might_ be because the table is
    based on an old Unicode standard and desperately needs updating, but
    it might also be because I misunderstood how it works. So I'll leave
    sorting that out for another time.)

 unicode/read_ucd.py | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 295 insertions(+)
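
Illustrative sketch (not part of the commit): the general shape of
turning UCD data into an includable table. This is not the code in
read_ucd.py. The function names, the choice of general category Mn as
the selection rule, and the output format are all assumptions. The
sample lines follow the semicolon-separated layout of UnicodeData.txt,
of which only field 0 (code point) and field 2 (general category) are
used. The First/Last range entries in the real file are not handled.

```python
SAMPLE_UCD = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0300;COMBINING GRAVE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING GRAVE;;;;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;NON-SPACING CIRCUMFLEX;;;;
0483;COMBINING CYRILLIC TITLO;Mn;230;NSM;;;;;N;CYRILLIC NON-SPACING TITLO;;;;
"""


def ranges_for_category(lines, category):
    """Collect code points with the given category and merge adjacent ones."""
    code_points = sorted(
        int(fields[0], 16)
        for fields in (line.split(";") for line in lines if line.strip())
        if fields[2] == category
    )
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp       # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [(lo, hi) for lo, hi in ranges]


def emit_table(ranges):
    """Format the ranges as C initialiser lines suitable for #include."""
    return "".join("{0x%04X, 0x%04X},\n" % r for r in ranges)


nonspacing = ranges_for_category(SAMPLE_UCD.splitlines(), "Mn")
header_text = emit_table(nonspacing)
```

A C source file would then wrap an #include of the generated file in
its own array declaration, so regenerating for a new Unicode version
touches only the header.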

commit 430af47a38f6c3ac710bd7c8c9f86e7497c6dc5a
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=430af47a38f6c3ac710bd7c8c9f86e7497c6dc5a;hp=b72c9aba287e4933eca5254f82c2b3d0e1e437c7
Author: Simon Tatham <anakin at pobox.com>
Date:   Tue Nov 8 18:04:46 2022 +0000

    Polish the output of read_ucd.py.

    The initial outputs were all deliberately inconsistent with each
    other, so that each one exactly matched the existing table I was
    trying to replace.

    Now I've done that check, I can clean them up. Normalised spacing and
    case to be consistent; removed pointless indentation (these are now
    include files, so they don't have to be indented to the same level as
    the array declaration surrounding each one's #include); added a header
    comment in each autogenerated file, saying that it's autogenerated,
    what it's for, and who it's used by.

    The currently supported version number of Unicode is also exposed in a
    header file, so that I can put it in diagnostics.

 unicode/read_ucd.py | 108 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 103 insertions(+), 5 deletions(-)

commit 4bb37233a5f1b3ebc320231ac7f7e4a93ad0f9b2
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=4bb37233a5f1b3ebc320231ac7f7e4a93ad0f9b2;hp=430af47a38f6c3ac710bd7c8c9f86e7497c6dc5a
Author: Simon Tatham <anakin at pobox.com>
Date:   Fri Nov 11 08:40:59 2022 +0000

    Commit read_ucd.py's output and switch over to it.

    This removes the superseded tables in source files, and also all the
    code snippets in comments that generated them.

 terminal/bidi.c                | 1930 +---------------------------------------
 unicode/ambiguous_wide_chars.h |  189 ++++
 unicode/bidi_brackets.h        |  139 +++
 unicode/bidi_mirror.h          |  437 +++++++++
 unicode/bidi_type.h            | 1306 +++++++++++++++++++++++++++
 unicode/nonspacing_chars.h     |  357 ++++++++
 unicode/version.h              |    9 +
 unicode/wide_chars.h           |  130 +++
 utils/wcwidth.c                |  688 +-------------
 9 files changed, 2575 insertions(+), 2610 deletions(-)

commit 4cb429e3f4009c290f3b992a1b2847959c47b2b7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=4cb429e3f4009c290f3b992a1b2847959c47b2b7;hp=4bb37233a5f1b3ebc320231ac7f7e4a93ad0f9b2
Author: Simon Tatham <anakin at pobox.com>
Date:   Wed Nov 9 08:48:54 2022 +0000

    Update to Unicode 15.

    Now I have a script I can easily re-run, there's no reason not to do
    just that! This updates all of the new generated header files for the
    UCD.zip that comes with Unicode 15.0.0.

    I've re-run my bidi test suite against 15.0.0's file of test cases,
    and confirmed they all pass.

 terminal/bidi.c                |  6 +++---
 unicode/ambiguous_wide_chars.h |  2 +-
 unicode/bidi_brackets.h        |  2 +-
 unicode/bidi_mirror.h          |  2 +-
 unicode/bidi_type.h            | 37 +++++++++++++++++++++++++++++++------
 unicode/nonspacing_chars.h     | 15 ++++++++++++---
 unicode/wide_chars.h           | 22 +++++++++++-----------
 7 files changed, 60 insertions(+), 26 deletions(-)

commit b35d23f6999b8511ea719deb5c8739de93cb95d7
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=b35d23f6999b8511ea719deb5c8739de93cb95d7;hp=4cb429e3f4009c290f3b992a1b2847959c47b2b7
Author: Simon Tatham <anakin at pobox.com>
Date:   Wed Nov 9 19:28:51 2022 +0000

    Implement Unicode normalisation.

    A new module in 'utils' computes NFC and NFD, via a new set of data
    tables generated by read_ucd.py.

    The new module comes with a new test program, which can read the
    NormalizationTest.txt that appears in the Unicode Character Database.
    All the tests pass, as of Unicode 15.

 CMakeLists.txt              |    5 +
 misc.h                      |    3 +
 unicode/canonical_comp.h    |  950 ++++++++++++++++++++
 unicode/canonical_decomp.h  | 2071 +++++++++++++++++++++++++++++++++++++++++++
 unicode/combining_classes.h |  398 +++++++++
 unicode/read_ucd.py         |  121 ++-
 utils/CMakeLists.txt        |    1 +
 utils/unicode-norm.c        |  446 ++++++++++
 8 files changed, 3985 insertions(+), 10 deletions(-)
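
Illustrative sketch (not part of the commit): what NFC and NFD do,
shown with Python's standard unicodedata module rather than the new
code in utils/unicode-norm.c. The three properties used here
correspond to the three kinds of generated table (decomposition,
composition, combining classes). Python's own Unicode version depends
on the interpreter, but these particular characters are long
established.

```python
import unicodedata

# NFD: a precomposed character splits into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")  # e-acute

# NFC: the same pair is recombined into the precomposed form.
composed = unicodedata.normalize("NFC", "e\u0301")

# Canonical ordering: combining marks are sorted by combining class,
# so dot-below (class 220) moves in front of acute (class 230).
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")

acute_class = unicodedata.combining("\u0301")
dot_below_class = unicodedata.combining("\u0323")
```

Two strings that differ only in these respects are canonically
equivalent, and normalising both to the same form makes them compare
equal byte for byte.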

commit d3e186e81b1d30c8b0c42ac98ef0a2e15a4838ec
web diff https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=d3e186e81b1d30c8b0c42ac98ef0a2e15a4838ec;hp=b35d23f6999b8511ea719deb5c8739de93cb95d7
Author: Simon Tatham <anakin at pobox.com>
Date:   Wed Nov 9 08:56:11 2022 +0000

    Function to check a UTF-8 string for unknown characters.

    So we can reject strings containing characters we don't yet know how
    to normalise to NFC.

 misc.h                |   4 +
 unicode/known_chars.h | 716 ++++++++++++++++++++++++++++++++++++++++++++++++++
 unicode/read_ucd.py   |  16 ++
 utils/CMakeLists.txt  |   1 +
 utils/unicode-known.c |  53 ++++
 5 files changed, 790 insertions(+)
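
Illustrative sketch (not part of the commit): one way to check a
UTF-8 string for characters the available Unicode data does not know
about. This uses Python's unicodedata module, not the table in
unicode/known_chars.h. Treating general category Cn (unassigned) as
"unknown" is an assumption about what the check means, and
contains_unknown is a hypothetical name.

```python
import unicodedata


def contains_unknown(utf8: bytes) -> bool:
    """Return True if any character is unassigned in this Unicode version."""
    text = utf8.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return any(unicodedata.category(ch) == "Cn" for ch in text)


known = contains_unknown("h\u00e9llo".encode("utf-8"))
unknown = contains_unknown("x\u0378y".encode("utf-8"))  # U+0378 is unassigned
```

A character assigned in a later Unicode version might have a
decomposition or combining class that older tables cannot supply, so
rejecting it is safer than normalising it incorrectly.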