
Narrow the definition of character to Unicode encoded character #795


Open
wants to merge 3 commits into
base: master
from

Conversation

tats-u
Contributor

@tats-u tats-u commented Mar 18, 2025

Fixes #791

@tats-u
Contributor Author

tats-u commented Mar 18, 2025

@tats-u
Contributor Author

tats-u commented Mar 18, 2025

@tats-u
Contributor Author

tats-u commented Mar 23, 2025

Another option: exclude U+FFFD from the Unicode punctuation characters in any case (e.g. by excluding the So category).

@notriddle
Contributor

notriddle commented Apr 25, 2025

Full disclosure: I want the commonmark spec to disclaim any requirements for lone surrogates anywhere in the document because pulldown-cmark uses Rust strings for input (which cannot contain lone surrogates because they're always valid utf-8).

If an application that uses it accepts files from sources that might not contain valid Unicode (such as files on the filesystem or the Win32 text entry API), the application needs to convert its input to UTF-8. The standard library offers an API to replace lone surrogates with U+FFFD, and it offers an API to return an error.

I always assumed, because the spec says that it doesn't specify an encoding, that both choices were okay. If that were changed, then pulldown-cmark's API would become a lot more complicated to use in a conformant manner.
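
For readers less familiar with Rust, the two standard-library options mentioned above can be sketched roughly as follows. `String::from_utf16_lossy` and `String::from_utf16` are real standard-library functions; the wrapper names `decode_lossy` and `decode_strict` are hypothetical, and this is only an illustration of what an application might do before calling a parser, not code taken from pulldown-cmark.

```rust
/// Replace every lone surrogate in UTF-16 input with U+FFFD.
/// (`decode_lossy` is a hypothetical helper name.)
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Reject input that contains a lone surrogate instead of repairing it.
/// (`decode_strict` is a hypothetical helper name.)
fn decode_strict(units: &[u16]) -> Result<String, std::string::FromUtf16Error> {
    String::from_utf16(units)
}
```

With the input `[0x0061, 0xD800, 0x0062]`, the lossy variant yields `a`, U+FFFD, `b`, while the strict variant returns an error. Either way the parser only ever sees a valid `String`.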

@tats-u
Contributor Author

tats-u commented Apr 27, 2025

it offers an API to return an error

I changed the wording so that the behavior is undefined. With this, pulldown-cmark is allowed to do anything, without notice.

that both choices were okay. If that were changed, then pulldown-cmark's API would become a lot more complicated to use in a conformant manner.

I had not considered the case where parsers throw errors; I had only assumed replacement with U+FFFD.

@tats-u tats-u changed the title from "Add a note on flankingness around ill-formed code unit subsequences" to "Narrow the definition of character to Unicode encoded character" on Apr 27, 2025
@dbuenzli

dbuenzli commented Apr 28, 2025

Full disclosure: I want the commonmark spec to disclaim any requirements for lone surrogates anywhere in the document because pulldown-cmark uses Rust strings for input (which cannot contain lone surrogates because they're always valid utf-8).

I also don't see any advantage of importing encoding matters into the specification. A long time ago I suggested simply defining the standard over sequences of Unicode scalar values (which is the data you get as the result of any valid UTF decode), see #369.

@notriddle
Contributor

I would be fine with that.

@tats-u
Contributor Author

tats-u commented Apr 29, 2025

Unicode scalar values

That's sufficient for the time being, but I've been advocating the CJK-friendly amendments as the fix for #650. There, Unicode noncharacters and reserved code points get in the way of optimizing the CJK ranges. E.g. U+3097–U+3098 are reserved and U+2FFFE–U+2FFFF are noncharacters, but both intervals should be treated as CJK to reduce the number of product terms (0xXXXX <= codePoints && codePoints <= 0xYYYY).

Neither noncharacters nor reserved code points are entered by ordinary users, only by testers.
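
To make the "product terms" point concrete, here is a minimal sketch. The actual ranges used by the CJK-friendly amendments are not given in this thread, so the Hiragana block serves as an illustrative stand-in, and both function names are hypothetical. The assigned ranges in the second function reflect Unicode 16.0 as I understand it.

```rust
/// Coarse test: the whole Hiragana block as a single interval, including
/// the reserved code points U+3040 and U+3097..U+3098. One range term.
/// (Hypothetical name, illustrative range.)
fn in_hiragana_block(cp: u32) -> bool {
    (0x3040..=0x309F).contains(&cp)
}

/// Exact test: only the code points assigned in that block (assumed to
/// match Unicode 16.0). The reserved gap forces a second range term.
fn is_assigned_hiragana(cp: u32) -> bool {
    (0x3041..=0x3096).contains(&cp) || (0x3099..=0x309F).contains(&cp)
}
```

The two functions differ only on reserved code points such as U+3097, which ordinary documents are not expected to contain.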

@tats-u
Contributor Author

tats-u commented Apr 29, 2025

"Surrogate characters" don't exist; only "surrogate code points" and "surrogate code units" do. "Unicode scalar value", "encoded character", and "assigned character" all exclude surrogate code points from their ranges.
Can I add "Fixes #369" to the top description?

@tats-u
Contributor Author

tats-u commented Apr 29, 2025

I also don't see any advantage of importing encoding matters into the specification.

UTF-8 has invalid overlong (non-shortest-form) encodings. They must be treated the same way as isolated surrogate code units, which occur especially in UTF-16 (a UTF-8 byte stream can also contain encoded surrogate code units, as in CESU-8). The spec should ignore all of them and leave them to implementations.
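
As a rough illustration of the two cases named above, the sketch below checks byte sequences with Rust's standard-library UTF-8 validator (`std::str::from_utf8`, a real function). The helper name `is_well_formed_utf8` is hypothetical, and other decoders may report or repair these sequences differently.

```rust
/// Returns true if `bytes` is accepted by Rust's standard UTF-8 validator.
/// (`is_well_formed_utf8` is a hypothetical helper name.)
fn is_well_formed_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

Under this check, `[0xC0, 0x80]` (an overlong encoding of U+0000) and `[0xED, 0xA0, 0x80]` (a CESU-8-style encoding of the surrogate U+D800) are both rejected, while ordinary text such as `"あ".as_bytes()` is accepted.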

@dbuenzli

I'm not sure I fully understand what you are trying to achieve with the definition. But for me a fix to #791 is to simply remove the notion of encoding from the specification.

The idea of #369 is to define the CommonMark grammar over a stream of Unicode scalar values, more precisely a stream of integers in the ranges 0x0000..0xD7FF and 0xE000..0x10FFFF. Such a definition just says: a CommonMark document is defined over any valid Unicode text (and thus what happens on invalid encodings is unspecified by the specification). That way you don't even need to talk about surrogates or ill-formed sequences. The only other thing you need to say is what happens if you input a surrogate code point using an escape, here the text can simply indicate that it must be replaced by the unicode replacement character U+FFFD.
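
The scalar-value definition above is small enough to state as code. This is a minimal sketch; `is_scalar_value` is a hypothetical name, while `char::from_u32` is the real standard-library function that Rust uses to enforce the same constraint on its `char` type.

```rust
/// True if `cp` is a Unicode scalar value: any code point up to U+10FFFF
/// except the surrogate range U+D800..U+DFFF.
/// (`is_scalar_value` is a hypothetical helper name.)
fn is_scalar_value(cp: u32) -> bool {
    (0x0000..=0xD7FF).contains(&cp) || (0xE000..=0x10FFFF).contains(&cp)
}
```

Note that noncharacters (e.g. U+FFFE) and reserved code points are still scalar values under this definition; only surrogates and values above U+10FFFF are excluded. In Rust, `char::from_u32(cp).is_some()` gives the same answer.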


P.S. Excluding reserved code points without mentioning an explicit Unicode version doesn't make much sense. This set of code points is shrinking every year as new characters are added to the standard.

@tats-u
Contributor Author

tats-u commented Apr 29, 2025

a stream of Unicode scalar values, more precisely a stream of integers in the ranges 0x0000..0xD7FF and 0xE000..0x10FFFF

There is a convenient term Well-formed Code Unit Sequence.

here the text can simply indicate that it must be replaced by the unicode replacement character U+FFFD.

Implementations should be allowed to emit errors for ill-formed code unit subsequences, too. Of course replacing with U+FFFD is fine.

@tats-u
Contributor Author

tats-u commented Apr 29, 2025

what happens if you input a surrogate code point using an escape

It should behave the same as the HTML Living Standard, which replaces them with U+FFFD.
The spec text about it should be revised too.
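
A minimal sketch of that replacement rule for numeric character references, assuming the reference's digits have already been parsed into an integer. The function name `resolve_numeric_reference` is hypothetical. The sketch covers only surrogates, out-of-range values, and U+0000; it does not attempt the remaining HTML-specific handling (such as the remapping of some control-range values).

```rust
/// Map the numeric value of a character reference such as `&#xD800;` to a
/// character, substituting U+FFFD where the value is not usable.
/// (`resolve_numeric_reference` is a hypothetical helper name.)
fn resolve_numeric_reference(value: u32) -> char {
    match char::from_u32(value) {
        // `None` covers surrogates and values above U+10FFFF.
        // U+0000 is replaced as well, as HTML and the current CommonMark text do.
        Some('\0') | None => '\u{FFFD}',
        Some(c) => c,
    }
}
```

For example, `0xD800` and `0x110000` both resolve to U+FFFD, while `0x41` resolves to `A`.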

@dbuenzli

There is a convenient term Well-formed Code Unit Sequence.

No, that applies to sequences of code units (8-bit, 16-bit or 32-bit depending on your UTF); you are still talking at the encoding level here. There's no need to. You want to define CommonMark on the output of an UTF decoding process which is: a sequence of scalar values.

Implementations should be allowed to emit errors for ill-formed code unit subsequences, too.

Again, if you simply switch to a sequence of scalar values, you don't have to talk about that. Leave it to implementers: some UTF decoders fail hard on decode errors, some silently replace them with U+FFFD (according to different strategies), some give the choice. There's no need to talk about that in the CommonMark specification.

@tats-u
Contributor Author

tats-u commented May 1, 2025

Code units themselves are not specific to UTF; only well-formed / ill-formed code unit (sub)sequences are. (I had overlooked them.) For example:

The same code unit sequence could, of course, be well-formed in the context of some other character encoding standard using 8-bit code units, such as ISO/IEC 8859-1, or vendor code pages.

From: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860

You want to define CommonMark on the output of an UTF decoding process which is: a sequence of scalar values.

I've taken non-Unicode encodings into account, i.e. file content <=> code units (UTF or legacy 8-bit) <=> encoded characters or scalar values.
Also, we need to clarify the behavior when decoding fails. In any case, we should give implementers certainty.

This set of code points is shrinking every year as new characters are added to the standard.

This doesn't introduce breaking changes (which should mostly be avoided) for existing Markdown documents written against older Unicode versions. Implementations only have to update their Unicode versions and prepare for the updates.

if you simply switch to a sequence of scalar values, you don't have to talk about that.

Not all strings, character arrays, and files consist solely of code units that encode scalar values. We have to prepare for exceptions by explicitly stating that "you can do anything".

3 participants