simon-git: charset (master): charset.git
Commits to Tartarus hosted VCS
tartarus-commits at lists.tartarus.org
Thu Jan 5 19:06:34 GMT 2017
TL;DR:
386b6a4 Implement an HTML charset-inference function.
35279e2 convcs: refactor main() into several subfunctions.
634adce convcs: add an '--html' option.
a7353cf convcs: add a TODO.
Repository: https://git.tartarus.org/simon/charset.git
On the web: https://git.tartarus.org/?p=simon/charset.git
Branch updated: master
Committer: charset.git
Date: 2017-01-05 19:06:34
commit 386b6a4934ec96fa67d0529789fbbdd7a51be2ff
web diff https://git.tartarus.org/?p=simon/charset.git;a=commitdiff;h=386b6a4934ec96fa67d0529789fbbdd7a51be2ff;hp=5dc58dbc3743acbac96f1a6ad4e182eed8d0cdf8
Author: Simon Tatham <anakin at pobox.com>
Date: Thu Jan 5 18:38:57 2017 +0000
Implement an HTML charset-inference function.
This scans a stream of bytes, interpreted as the start of an HTML
document, looking for a <meta http-equiv='content-type'> tag
specifying a character encoding. If it finds one, it returns the
libcharset id of the character set in question, and also reports the
substring of the input that holds the actual character-set name
inside the <meta> tag (permitting that part to be rewritten, if
desired).
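
As a rough illustration of the idea (not the code in htmlcs.c, which is
far more thorough), the scan can be sketched as below. The names
find_meta_charset and match_ci are hypothetical, and the sketch only
locates the charset name; it does not map it to a libcharset id, and it
ignores comments, scripts and other HTML parsing subtleties.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive match of the lowercase,
 * NUL-terminated keyword kw at data[pos], never reading past len. */
static int match_ci(const char *data, size_t len, size_t pos, const char *kw)
{
    size_t n = strlen(kw);
    if (pos > len || n > len - pos)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)data[pos + i]) != kw[i])
            return 0;
    return 1;
}

/*
 * Hypothetical function, NOT libcharset's API. Looks for the first
 * "charset=NAME" inside a <meta ...> tag in data[0..len). On success
 * returns 1 and stores the offset and length of NAME within data;
 * otherwise returns 0.
 */
int find_meta_charset(const char *data, size_t len,
                      size_t *name_start, size_t *name_len)
{
    assert(data != NULL && name_start != NULL && name_len != NULL);

    for (size_t i = 0; i < len; i++) {
        if (!match_ci(data, len, i, "<meta"))
            continue;
        /* Scan this tag's attributes, stopping at the first '>'. */
        for (size_t j = i + 5; j < len && data[j] != '>'; j++) {
            if (!match_ci(data, len, j, "charset"))
                continue;
            size_t k = j + 7;
            while (k < len && isspace((unsigned char)data[k]))
                k++;
            if (k >= len || data[k] != '=')
                continue;
            k++;
            while (k < len && isspace((unsigned char)data[k]))
                k++;
            /* Skip an optional opening quote. */
            if (k < len && (data[k] == '"' || data[k] == '\''))
                k++;
            size_t start = k;
            /* The name runs until a quote, space, ';' or '>'. */
            while (k < len && !isspace((unsigned char)data[k]) &&
                   data[k] != '"' && data[k] != '\'' &&
                   data[k] != ';' && data[k] != '>')
                k++;
            if (k > start) {
                *name_start = start;
                *name_len = k - start;
                return 1;
            }
        }
    }
    return 0;
}
```

Returning an offset and length, rather than a copy of the name, is what
lets a caller overwrite that exact span later.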
Makefile | 5 +
charset.h | 16 ++
htmlcs.c | 546 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 567 insertions(+)
commit 35279e2e2265f21716613b9d017b1488ff87c07a
web diff https://git.tartarus.org/?p=simon/charset.git;a=commitdiff;h=35279e2e2265f21716613b9d017b1488ff87c07a;hp=386b6a4934ec96fa67d0529789fbbdd7a51be2ff
Author: Simon Tatham <anakin at pobox.com>
Date: Thu Jan 5 18:53:13 2017 +0000
convcs: refactor main() into several subfunctions.
There's now a top-level function for processing a single FILE *, which
does the reading from the file and then calls out to two more
functions, convert_got_data (processing the data received) and
convert_done (the final state-resetting output).
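
The resulting shape might look roughly like the sketch below. Only the
function names convert, convert_got_data and convert_done come from
this commit; the signatures, the convert_state struct and the stand-in
bodies are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical state; the real convcs carries charset conversion
 * state instead of these placeholder fields. */
struct convert_state {
    size_t bytes_seen;
    int finished;
};

/* Stand-in for converting and writing out one chunk of input. */
static void convert_got_data(struct convert_state *st,
                             const char *buf, size_t len)
{
    (void)buf;
    st->bytes_seen += len;
}

/* Stand-in for emitting the final state-resetting output. */
static void convert_done(struct convert_state *st)
{
    st->finished = 1;
}

/* Top-level processing of a single FILE *: read chunks, hand each
 * one to convert_got_data, then finish with convert_done. */
void convert(FILE *fp, struct convert_state *st)
{
    char buf[4096];
    size_t n;

    assert(fp != NULL && st != NULL);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        convert_got_data(st, buf, n);
    convert_done(st);
}
```

With the read loop isolated like this, start-of-file handling can be
added in convert() before the first call to convert_got_data.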
This opens the way to add features that treat the start of a file
specially; one is about to turn up using the new HTML charset
inferrer, and another possibility for the future is processing Unicode
byte-order marks and using them as a charset indicator.
(Also, having convert() separate from main() introduces the
possibility of processing multiple files in one run, _even_ if they
each need special start-of-file processing. But I'd have to decide how
multiple files should be output, if I did that.)
This patch moves most of the actual conversion code left by an
indentation level, so it's best viewed with whitespace ignored.
convcs.c | 144 ++++++++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 91 insertions(+), 53 deletions(-)
commit 634adceca85ff7237f6abd3507bbacf98cc86513
web diff https://git.tartarus.org/?p=simon/charset.git;a=commitdiff;h=634adceca85ff7237f6abd3507bbacf98cc86513;hp=35279e2e2265f21716613b9d017b1488ff87c07a
Author: Simon Tatham <anakin at pobox.com>
Date: Thu Jan 5 19:02:24 2017 +0000
convcs: add an '--html' option.
This causes the input file to be assumed to be in HTML. The new
charset_from_html_prefix() function is run over the first kilobyte of
the file (that being the WHATWG-recommended limit) looking for a
<meta> tag specifying a charset. If one is found, it overrides the
source charset specified on the command line. Hence, if you've
received an HTML file together with a MIME header saying what charset
the _transport_ thought the file was encoded in, then you can use
'convcs --html' to convert from that charset to your intended
destination charset, and if the HTML file thinks it knows better than
the transport, then convcs will honour that.
Also, if a <meta> tag is found, it is rewritten in the output version
of the file, so that it describes the charset we've just converted the
file into. This ensures that if you pass the resulting translated HTML
file on to something else that _does_ honour <meta> tags, it won't be
fooled into expecting the previous encoding.
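
The rewriting step amounts to splicing a new name into the span that
the inferrer identified. A minimal sketch, assuming the span is given
as an offset and a length; splice_charset_name is a hypothetical helper
and not part of convcs, which also has to convert the surrounding text:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper. Returns a malloc'd, NUL-terminated copy of
 * data[0..len) in which the bytes [start, start + old_len) are
 * replaced by new_name. Returns NULL if allocation fails; the caller
 * frees the result.
 */
char *splice_charset_name(const char *data, size_t len,
                          size_t start, size_t old_len,
                          const char *new_name)
{
    assert(data != NULL && new_name != NULL);
    assert(start <= len && old_len <= len - start);

    size_t new_len = strlen(new_name);
    size_t tail = len - start - old_len;   /* bytes after the old name */
    char *out = malloc(start + new_len + tail + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, data, start);                       /* text before */
    memcpy(out + start, new_name, new_len);         /* new name */
    memcpy(out + start + new_len,
           data + start + old_len, tail);           /* text after */
    out[start + new_len + tail] = '\0';
    return out;
}
```

Because the old and new names can differ in length, the output is built
in a fresh buffer rather than patched in place.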
convcs.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
commit a7353cf60d896b33f103a26bc73cce535990d5c7
web diff https://git.tartarus.org/?p=simon/charset.git;a=commitdiff;h=a7353cf60d896b33f103a26bc73cce535990d5c7;hp=634adceca85ff7237f6abd3507bbacf98cc86513
Author: Simon Tatham <anakin at pobox.com>
Date: Thu Jan 5 19:02:35 2017 +0000
convcs: add a TODO.
convcs.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)