Narrow the definition of character to Unicode encoded character #795

tats-u · 2025-03-18T13:23:25Z

Fixes #791

tats-u · 2025-03-18T13:25:12Z

There is no entry for Surrogate Code Unit in https://www.unicode.org/glossary/.

tats-u · 2025-03-18T13:36:38Z

cf. https://github.com/tats-u/markdown-cjk-friendly/blob/main/specification.md

tats-u · 2025-03-23T12:59:20Z

Another plan: exclude U+FFFD from Unicode punctuation characters in any case (e.g. by excluding the So category).

spec.txt

notriddle · 2025-04-25T18:10:24Z

Full disclosure: I want the commonmark spec to disclaim any requirements for lone surrogates anywhere in the document because pulldown-cmark uses Rust strings for input (which cannot contain lone surrogates because they're always valid utf-8).

If an application that uses it accepts files from sources that might not contain valid unicode (such as files on the filesystem or the win32 text entry API), the application needs to convert its input to utf-8. The standard library offers an API to replace lone surrogates with U+FFFD, and it offers an API to return an error.

I always assumed, because the spec says that it doesn't specify an encoding, that both choices were okay. If that were changed, then pulldown-cmark's API would become a lot more complicated to use in a conformant manner.
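A minimal sketch of the two standard-library options described above, assuming the input arrives as UTF-16 code units (as it can from win32 APIs). The helper names `decode_lossy` and `decode_strict` are hypothetical and are not part of pulldown-cmark; only `String::from_utf16_lossy` and `String::from_utf16` are real standard-library calls.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper: decode UTF-16 code units, replacing each lone
/// surrogate with U+FFFD so the result is always a valid Rust string.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Hypothetical helper: decode UTF-16 code units, returning an error
/// if the input contains a lone surrogate.
fn decode_strict(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}
```

Either result can then be handed to a parser that takes `&str`; which of the two an application picks is exactly the choice the comment above argues the spec should leave open.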

tats-u · 2025-04-27T12:55:40Z

> it offers an API to return an error

I changed the wording so that the behavior is undefined. This allows pulldown-cmark to do anything without notice.

> that both choices were okay. If that were changed, then pulldown-cmark's API would become a lot more complicated to use in a conformant manner.

I had not considered the case where parsers throw errors. I had assumed only the U+FFFD replacement.

dbuenzli · 2025-04-28T20:40:36Z

> Full disclosure: I want the commonmark spec to disclaim any requirements for lone surrogates anywhere in the document because pulldown-cmark uses Rust strings for input (which cannot contain lone surrogates because they're always valid utf-8).

I also don't see any advantage of importing encoding matters into the specification. A long time ago I suggested to simply define the standard over sequences of Unicode scalar values (which is the data you get as the result of any valid UTF decode), see #369.

notriddle · 2025-04-28T23:31:34Z

I would be fine with that.

tats-u · 2025-04-29T12:50:57Z

> Unicode scalar values

That is sufficient for the time being, but I've advocated the CJK-friendly amendments as the fix for #650. There, Unicode noncharacters and reserved code points obstruct the optimization of the CJK ranges. E.g. U+3097–U+3098 are reserved and U+2FFFE–U+2FFFF are noncharacters, but both intervals should be treated as CJK to reduce the number of product terms (0xXXXX <= codePoints && codePoints <= 0xYYYY).

Neither noncharacters nor reserved code points are entered by ordinary users; only testers input them.
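To illustrate the "product terms" point, here is a rough sketch using the Hiragana block, where U+3097–U+3098 are unassigned. The function names and the exact ranges are assumptions chosen for illustration and are not taken from the CJK-friendly specification.

```rust
/// Precise check (illustrative): skips the reserved gap U+3097..=U+3098,
/// so it needs two range terms.
fn is_hiragana_precise(c: char) -> bool {
    matches!(c, '\u{3041}'..='\u{3096}' | '\u{3099}'..='\u{309F}')
}

/// Relaxed check (illustrative): treats the reserved gap as if it were
/// hiragana, so a single range term is enough.
fn is_hiragana_relaxed(c: char) -> bool {
    ('\u{3041}'..='\u{309F}').contains(&c)
}
```

The two functions differ only on U+3097 and U+3098, which ordinary documents do not contain, so the relaxed form trades no practical accuracy for one fewer comparison pair.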

tats-u · 2025-04-29T13:11:33Z

There is no such thing as "surrogate characters". Only "surrogate code points" and "surrogate code units" exist. "Unicode scalar value", "Encoded character", and "Assigned character" all exclude surrogate code points from their ranges.
Can I add "Fixes #369" to the top description?

tats-u · 2025-04-29T13:23:40Z

> I also don't see any advantage of importing encoding matters into the specification.

UTF-8 has invalid overlong encodings. They must be treated in the same way as isolated surrogate code units, which occur especially in UTF-16 (a UTF-8 byte stream can also contain an encoding of a surrogate code point, as in CESU-8). We should ignore all of them and leave them to implementations.
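As a small sketch of both cases, a strict UTF-8 decoder such as Rust's `std::str::from_utf8` rejects an overlong encoding and a CESU-8-style encoded surrogate alike. The wrapper name `is_well_formed_utf8` is hypothetical.

```rust
/// Hypothetical wrapper: true if `bytes` is a well-formed UTF-8 sequence.
fn is_well_formed_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Overlong (non-shortest-form) two-byte encoding of U+0000.
const OVERLONG_NUL: [u8; 2] = [0xC0, 0x80];

/// Three-byte encoding of the surrogate code point U+D800, as CESU-8 would emit.
const ENCODED_SURROGATE: [u8; 3] = [0xED, 0xA0, 0x80];
```

Both constants fail the check, which is why a specification defined over already-decoded text does not need to distinguish between them.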

dbuenzli · 2025-04-29T14:15:50Z

I'm not sure I fully understand what you are trying to achieve with the definition. But for me a fix to #791 is to simply evacuate the notion of encoding from the specification.

The idea of #369 is to define the CommonMark grammar over a stream of Unicode scalar values, more precisely a stream of integers in the ranges 0x0000..0xD7FF and 0xE000..0x10FFFF. Such a definition just says: a CommonMark document is defined over any valid Unicode text (and thus what happens on invalid encodings is unspecified by the specification). That way you don't even need to talk about surrogates or ill-formed sequences. The only other thing you need to say is what happens if you input a surrogate code point using an escape, here the text can simply indicate that it must be replaced by the unicode replacement character U+FFFD.

P.S. Excluding reserved code points without mentioning an explicit Unicode version doesn't make much sense. This set of code points is shrinking every year as new characters are added to the standard.
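A minimal sketch of the scalar-value definition and the escape rule described above. Rust's `char` is itself a Unicode scalar value, so `char::from_u32` returns `None` for surrogates and for anything above 0x10FFFF. The function names are hypothetical, and the sketch covers only the surrogate and out-of-range cases discussed in this thread.

```rust
/// True if `n` lies in 0x0000..=0xD7FF or 0xE000..=0x10FFFF.
fn is_scalar_value(n: u32) -> bool {
    matches!(n, 0x0000..=0xD7FF | 0xE000..=0x10FFFF)
}

/// Hypothetical helper: turn the number written in a numeric character
/// reference into a character, substituting U+FFFD when the number is a
/// surrogate code point or lies outside the Unicode code space.
fn resolve_numeric_reference(n: u32) -> char {
    char::from_u32(n).unwrap_or('\u{FFFD}')
}
```

With this shape, the grammar itself never sees a surrogate: the input is already a sequence of scalar values, and the only place a surrogate could be introduced (an escape) maps it to U+FFFD.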

tats-u · 2025-04-29T14:39:37Z

> a stream of Unicode scalar values, more precisely a stream of integers in the ranges 0x0000..0xD7FF and 0xE000..0x10FFFF

There is a convenient term Well-formed Code Unit Sequence.

> here the text can simply indicate that it must be replaced by the unicode replacement character U+FFFD.

Implementations should be allowed to emit errors for ill-formed code unit subsequences, too. Of course replacing with U+FFFD is fine.

tats-u · 2025-04-29T14:43:00Z

> what happens if you input a surrogate code point using an escape

This should behave the same as in the HTML Living Standard, which replaces them with U+FFFD.
The spec's text about it should be revised too.

dbuenzli · 2025-04-29T16:18:39Z

> There is a convenient term Well-formed Code Unit Sequence.

No, that applies to sequences of code units (8-bit, 16-bit or 32-bit depending on your UTF); you are still talking at the encoding level here. There's no need to. You want to define CommonMark on the output of an UTF decoding process which is: a sequence of scalar values.

> Implementations should be allowed to emit errors for ill-formed code unit subsequences, too.

Again, if you simply switch to a sequence of scalar values, you don't have to talk about that. Leave it to the implementer: some UTF decoders fail hard on decode errors, some silently replace them with U+FFFD (according to different strategies), some give the choice. There's no need to talk about that in the CommonMark specification.

tats-u · 2025-05-01T10:25:24Z

Code units themselves are not specific to UTFs. Only well-formed / ill-formed code unit (sub)sequences are. (I overlooked them.) For example:

> The same code unit sequence could, of course, be well-formed in the context of some other character encoding standard using 8-bit code units, such as ISO/IEC 8859-1, or vendor code pages.

From: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860

> You want to define CommonMark on the output of an UTF decoding process which is: a sequence of scalar values.

I've taken non-Unicode encodings into account, i.e. file content <=> code units (UTF or legacy 8-bit) <=> encoded characters or scalar values.
Also, we need to clarify the behavior when decoding fails. In any case, implementers should be able to feel secure.

> This set of code points is shrinking every year as new characters are added to the standard.

This doesn't introduce breaking changes (which should mostly be avoided) to existing Markdown documents written against older Unicode versions. Implementations only have to update their Unicode versions and prepare for the updates.

> if you simply switch to a sequence of scalar values, you don't have to talk about that.

Not all strings, character arrays, and files contain only code units that are part of scalar values. We have to prepare for exceptions by explicitly clarifying that "you can do anything".

Add a note on flankingness around ill-formed code unit subsequences

5472ecd

notriddle reviewed Apr 24, 2025

tats-u added 2 commits April 27, 2025 21:50

Move to the definition of character

faa49a0

Make the behavior undefined

4eaa0e4

tats-u changed the title ~~Add a note on flankingness around ill-formed code unit subsequences~~ Narrow the definition of character to Unicode encoded character Apr 27, 2025

notriddle mentioned this pull request May 1, 2025

Code points, scalar values, and validity #778

Open
