Narrow the definition of character to Unicode encoded character #795
Conversation
There is no entry for "Surrogate Code Unit" in https://www.unicode.org/glossary/.
Another option: in any case, exclude U+FFFD from the Unicode punctuation characters (e.g. by excluding the So category).
Full disclosure: I want the CommonMark spec to disclaim any requirements for lone surrogates anywhere in the document, because pulldown-cmark uses Rust strings for input (which cannot contain lone surrogates, since they are always valid UTF-8). If an application that uses it accepts files from sources that might not contain valid Unicode (such as files on the filesystem or the Win32 text entry API), the application needs to convert its input to UTF-8. The standard library offers an API to replace lone surrogates with U+FFFD, and it offers an API to return an error. I always assumed, because the spec says that it doesn't specify an encoding, that both choices were okay. If that were changed, then pulldown-cmark's API would become a lot more complicated to use in a conformant manner.
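A sketch of the two standard-library options mentioned above, using Rust's UTF-16 decoding APIs (the sample input is mine, chosen to contain a lone surrogate):

```rust
fn main() {
    // A UTF-16 code unit sequence containing a lone high surrogate (0xD800),
    // as might come from the Win32 text entry API.
    let input: &[u16] = &[0x0061, 0xD800, 0x0062];

    // Option 1: strict decoding, which reports an error.
    assert!(String::from_utf16(input).is_err());

    // Option 2: lossy decoding, which substitutes U+FFFD for the
    // unpaired surrogate.
    assert_eq!(String::from_utf16_lossy(input), "a\u{FFFD}b");
}
```

Either way, the resulting `String` (if any) can never contain a lone surrogate, which is the invariant pulldown-cmark relies on.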
I changed the behavior to be undefined, so pulldown-cmark will be allowed to do anything without notice. I had not considered the case where parsers throw errors; I had only assumed U+FFFD replacement.
I also don't see any advantage in importing encoding matters into the specification. A long time ago I suggested simply defining the standard over sequences of Unicode scalar values (which is the data you get as the result of any valid UTF decode); see #369.
I would be fine with that. |
It's sufficient for the time being, but I've advocated the CJK-friendly amendments as the fix for #650. There, Unicode noncharacters and reserved code points obstruct the optimization of the CJK ranges: e.g. U+3097–U+3098 are reserved and U+2FFFE–U+2FFFF are noncharacters, but both intervals should be treated as CJK to reduce the number of product terms. Neither noncharacters nor reserved code points are input by normal users, only by testers.
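A hypothetical illustration of the range-merging argument (the function names and the exact interval split are mine, not from #650): if reserved code points may be classified together with their neighbors, a whole block collapses into one interval test; if they must be excluded, the test splits into several.

```rust
// U+3097–U+3098 are reserved code points inside the Hiragana block
// (U+3040–U+309F). Treating them as CJK keeps the check to one interval...
fn is_hiragana_like(c: char) -> bool {
    ('\u{3040}'..='\u{309F}').contains(&c)
}

// ...while excluding them requires two intervals (one extra product term).
fn is_assigned_hiragana(c: char) -> bool {
    ('\u{3040}'..='\u{3096}').contains(&c)
        || ('\u{3099}'..='\u{309F}').contains(&c)
}

fn main() {
    assert!(is_hiragana_like('\u{3042}')); // あ
    assert!(is_hiragana_like('\u{3097}')); // reserved, but inside the block
    assert!(!is_assigned_hiragana('\u{3097}'));
}
```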
"Surrogate characters" don't exist; only "surrogate code points" and "surrogate code units" do. "Unicode scalar value", "encoded character", and "assigned character" all exclude surrogate code points from their ranges.
UTF-8 also has invalid overlong encodings. They must be treated the same way as isolated surrogate code units, which arise especially in UTF-16 (and UTF-8 data can likewise contain byte sequences that encode surrogate code points, as in CESU-8). We should ignore all of them in the spec and leave them to implementations.
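For illustration, both kinds of ill-formed byte sequence mentioned above are rejected by a strict UTF-8 decoder (the byte sequences are mine, chosen as minimal examples):

```rust
fn main() {
    // Overlong encoding of U+0000 (0xC0 0x80): ill-formed UTF-8.
    assert!(std::str::from_utf8(&[0xC0, 0x80]).is_err());

    // CESU-8-style encoding of the surrogate code point U+D800
    // (0xED 0xA0 0x80): also ill-formed UTF-8.
    assert!(std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_err());

    // A lossy decoder instead substitutes U+FFFD for each
    // ill-formed subsequence.
    assert!(String::from_utf8_lossy(&[0xC0, 0x80]).contains('\u{FFFD}'));
}
```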
I'm not sure I fully understand what you are trying to achieve with the definition. But for me the fix to #791 is to simply remove the notion of encoding from the specification. The idea of #369 is to define the CommonMark grammar over a stream of Unicode scalar values, more precisely a stream of integers in the ranges [0x0000, 0xD7FF] and [0xE000, 0x10FFFF].

P.S. Excluding reserved code points without mentioning an explicit Unicode version doesn't make much sense: this set of code points shrinks every year as new characters are added to the standard.
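A sketch of what "stream of scalar values" means in practice: Rust's `char` type is exactly a Unicode scalar value, so `char::from_u32` rejects surrogate code points but accepts everything else up to U+10FFFF:

```rust
fn main() {
    // Surrogate code points (U+D800–U+DFFF) are not scalar values.
    assert!(char::from_u32(0xD800).is_none());
    assert!(char::from_u32(0xDFFF).is_none());

    // Everything else in [0x0000, 0xD7FF] and [0xE000, 0x10FFFF] is.
    assert!(char::from_u32(0xD7FF).is_some());
    assert!(char::from_u32(0xE000).is_some());
    assert!(char::from_u32(0x10FFFF).is_some());

    // Values above 0x10FFFF are out of range entirely.
    assert!(char::from_u32(0x110000).is_none());
}
```

A grammar defined over such a stream never has to mention UTF-8, UTF-16, or decode errors at all.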
There is a convenient term for this: Well-Formed Code Unit Sequence.
Implementations should be allowed to emit errors for ill-formed code unit subsequences, too. Of course, replacing them with U+FFFD is fine.
This should be the same as in the HTML Living Standard.
No, that term applies to sequences of code units (8-bit, 16-bit or 32-bit depending on your UTF); you are still talking at the encoding level here, and there's no need to. You want to define CommonMark on the output of a UTF decoding process, which is a sequence of scalar values.
Again, if you simply switch to a sequence of scalar values, you don't have to talk about that. Leave it to the implementer: some UTF decoders fail hard on decode errors, some silently replace them with U+FFFD.
Code units themselves are not specific to UTF; only well-formed / ill-formed code unit (sub)sequences are (I had overlooked them). I've taken non-Unicode encodings into account, i.e.: file content <=> code units (UTF or legacy 8-bit) <=> encoded characters or scalar values.
This doesn't introduce breaking changes to existing Markdown documents that use older Unicode versions, which should mostly be avoided. Implementations only have to update their Unicode version and prepare for the updates.
Not all strings, character arrays, and files contain only code units that form scalar values. We have to prepare for the exceptions by explicitly clarifying that "you can do anything".
Fixes #791