> Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings have (guaranteed-valid) UTF-32 semantics, so the string occupies 5 code units. In UTF-32, each Unicode scalar value occupies one code unit. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. It is intentional that the phrasing for the Rust case differs from the phrasing for the Python and JavaScript cases. We’ll come back to that later.
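The quoted counts can be reproduced for illustration. The article's actual string isn't shown in this thread, but an emoji ZWJ sequence of five scalar values (face palm + skin-tone modifier + ZWJ + male sign + variation selector) happens to have exactly the counts described, and Python can measure all three views:

```python
# Illustrative string (an assumption): FACE PALM, skin-tone modifier,
# ZERO WIDTH JOINER, MALE SIGN, VARIATION SELECTOR-16.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

assert len(s) == 5                           # Unicode scalar values (UTF-32 code units)
assert len(s.encode("utf-16-le")) // 2 == 7  # UTF-16 code units, as JS/Java would count
assert len(s.encode("utf-8")) == 17          # UTF-8 code units (bytes), as Rust would count
```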
...And the OP is wrong.
1) Python doesn't count byte sizes or UTF-xxx stuff; Python counts the number of code points. Do you want the byte length? Encode to a byte array and count those.
2) JavaScript doesn't know about bytes or characters; it only knows that "a char is a 16-bit chunk", with a UTF-16 encoding. There is no such thing as a "code unit with UTF-16 semantics". Similar for Java.
Oh, and by the way, there are byte sequences that are invalid when decoded as UTF-8, so I'm not sure about the "guaranteed-valid" UTF-8 Rust strings...
(If you want an encoding that can map every byte sequence to characters, those exist, like Latin-1 and so on, but that's a different matter.)
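The Latin-1 point can be checked directly: every one of the 256 byte values maps to a code point, so decoding never fails and round-trips losslessly, unlike UTF-8. A minimal Python sketch:

```python
data = bytes(range(256))            # every possible byte value
s = data.decode("latin-1")          # never fails: byte n maps to code point U+00nn
assert s.encode("latin-1") == data  # and the mapping round-trips losslessly

# By contrast, the same bytes are not valid UTF-8.
try:
    data.decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass
```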
> Python doesn't count byte sizes or UTF-xxx stuff; Python counts the number of code points.
The mapping between UTF-32 code units and code points is an identity transformation.
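That identity is easy to verify: in UTF-32 every code point occupies exactly one 32-bit code unit, so "counting code points" and "counting UTF-32 code units" always give the same number. For illustration, in Python:

```python
s = "a\u00E9\u4E2D\U0001F926"  # ASCII, Latin, CJK, and an emoji

# Each code point occupies one 32-bit (4-byte) unit in UTF-32,
# so the code-unit count equals the code-point count.
assert len(s.encode("utf-32-le")) // 4 == len(s) == 4
```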
> It only knows that "a char is a 16-bit chunk", with a UTF-16 encoding. There is no such thing as a "code unit with UTF-16 semantics". Similar for Java.
A UTF-16 code unit is 16 bits. The difference between "UTF-16 encoding" and "UTF-16 code units" is that the latter makes no guarantee that the sequence of code units is actually validly encoded. That is very much an issue in both Java and JavaScript (and in most languages that started from UCS-2 and retroactively defined it as UTF-16): both expose and allow manipulation of raw code units, including unpaired surrogates. So their strings are not actually guaranteed-valid UTF-16, yet they are generally assumed to be, and interpreted as, UTF-16.
Which I expect is what TFA means by "UTF-16 semantics".
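The code-unit-vs-encoding distinction can be made concrete. A surrogate pair is valid UTF-16 and decodes to a single code point, while a lone surrogate is a perfectly good 16-bit code unit that is nonetheless not valid UTF-16. A sketch in Python, whose `surrogatepass` error handler approximates the raw-code-unit model that JS and Java strings expose:

```python
# A surrogate pair is valid UTF-16: it decodes to one code point.
pair = b"\x3d\xd8\x00\xde"  # UTF-16-LE bytes for U+1F600
assert pair.decode("utf-16-le") == "\U0001F600"

# A lone high surrogate is a legal 16-bit code unit but invalid UTF-16,
# so strict decoding rejects it...
lone = b"\x3d\xd8"
try:
    lone.decode("utf-16-le")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass

# ...while "surrogatepass" tolerates it, mimicking raw code units.
assert lone.decode("utf-16-le", errors="surrogatepass") == "\ud83d"
```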
> Oh, and by the way, there are byte sequences that are invalid when decoded as UTF-8, so I'm not sure about the "guaranteed-valid" UTF-8 Rust strings...
Your comment makes no sense. There are byte sequences which are not valid UTF-8, and they are likewise not valid as the contents of a Rust string: the safe constructors validate and reject them, and creating a non-UTF-8 Rust string (which requires unsafe code) is undefined behavior.
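To make the point concrete: some byte values, such as 0xFF, can never appear anywhere in valid UTF-8, so strict decoding fails, and lossy handling substitutes U+FFFD. A Python sketch of both behaviors (the comparison to Rust's `String::from_utf8` / `String::from_utf8_lossy` is an analogy, not a one-to-one API mapping):

```python
bogus = b"\xff\xfe\xfd"  # 0xFD-0xFF never occur in valid UTF-8

# Strict decoding rejects the sequence, as Rust's String::from_utf8 would.
try:
    bogus.decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass

# Lossy decoding substitutes U+FFFD per invalid byte, in the spirit
# of Rust's String::from_utf8_lossy.
assert bogus.decode("utf-8", errors="replace") == "\ufffd" * 3
```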
> Which I expect is what TFA means by "UTF-16 semantics".
The article says "(potentially-invalid) UTF-16 semantics". The "potentially-invalid" part means that the JavaScript programmer can materialize unpaired surrogates. The "semantics" part means that the JavaScript programmer sees the strings acting as if they were potentially-invalid UTF-16 even when the storage format in RAM is actually Latin1 in SpiderMonkey and V8.
Yep. Rust has a PathBuf[1] type for dealing with paths in a platform-native manner. You can convert it to a Rust string type, but it's a conversion that can fail[2] or may be lossy[3].