> Each of the languages above reports the string length as the number of code units that the string occupies. Python 3 strings have (guaranteed-valid) UTF-32 semantics, so the string occupies 5 code units. In UTF-32, each Unicode scalar value occupies one code unit. JavaScript (and Java) strings have (potentially-invalid) UTF-16 semantics, so the string occupies 7 code units. Rust strings are (guaranteed-valid) UTF-8, so the string occupies 17 code units. It is intentional that the phrasing for the Rust case differs from the phrasing for the Python and JavaScript cases. We’ll come back to that later.
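The quoted counts can be reproduced for illustration. The article's actual string isn't shown in this thread, but an emoji ZWJ sequence of five scalar values (face palm + skin-tone modifier + ZWJ + male sign + variation selector) happens to have exactly the counts described, and Python can measure all three views:

```python
# Illustrative string (an assumption): FACE PALM, skin-tone modifier,
# ZERO WIDTH JOINER, MALE SIGN, VARIATION SELECTOR-16.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

assert len(s) == 5                           # Unicode scalar values (UTF-32 code units)
assert len(s.encode("utf-16-le")) // 2 == 7  # UTF-16 code units, as JS/Java would count
assert len(s.encode("utf-8")) == 17          # UTF-8 code units (bytes), as Rust would count
```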
...And the OP is wrong.
1) Python doesn't count byte sizes or UTF-xxx stuff; Python counts the number of code points. Do you want the byte length? Encode to a byte array and count those.
2) JavaScript doesn't know about bytes or characters; it only knows that "a char is a 16-bit chunk", with a UTF-16 encoding. There is no such thing as a "code unit with UTF-16 semantics". Similar for Java.
Oh, and by the way, there are byte sequences that are invalid when decoded as UTF-8, so I'm not sure about the "guaranteed-valid" UTF-8 Rust strings...
(If you want an encoding that can map every byte sequence to characters, those exist, like Latin-1 and so on, but that's a different matter.)
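The Latin-1 point can be checked directly: every one of the 256 byte values maps to a code point, so decoding never fails and round-trips losslessly, unlike UTF-8. A minimal Python sketch:

```python
data = bytes(range(256))            # every possible byte value
s = data.decode("latin-1")          # never fails: byte n maps to code point U+00nn
assert s.encode("latin-1") == data  # and the mapping round-trips losslessly

# By contrast, the same bytes are not valid UTF-8.
try:
    data.decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass
```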
> Python doesn't count byte sizes or UTF-xxx stuff; Python counts the number of code points.
The mapping between UTF-32 code units and code points is an identity transformation.
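That identity is easy to verify: in UTF-32 every code point occupies exactly one 32-bit code unit, so "counting code points" and "counting UTF-32 code units" always give the same number. For illustration, in Python:

```python
s = "a\u00E9\u4E2D\U0001F926"  # ASCII, Latin, CJK, and an emoji

# Each code point occupies one 32-bit (4-byte) unit in UTF-32,
# so the code-unit count equals the code-point count.
assert len(s.encode("utf-32-le")) // 4 == len(s) == 4
```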
> It only knows that "a char is a 16-bit chunk", with a UTF-16 encoding. There is no such thing as a "code unit with UTF-16 semantics". Similar for Java.
A UTF-16 code unit is 16 bits. The difference between "UTF-16 encoding" and "UTF-16 code units" is that the latter makes no guarantee that the sequence of code units is actually validly encoded. That is very much an issue in both Java and JavaScript (and in most languages that started from UCS-2 and retroactively defined it as UTF-16): both expose and allow manipulation of raw code units, including unpaired surrogates. So their strings are not actually guaranteed-valid UTF-16, yet they are generally assumed to be, and interpreted as, UTF-16.
Which I expect is what TFA means by "UTF-16 semantics".
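The code-unit-vs-encoding distinction can be made concrete. A surrogate pair is valid UTF-16 and decodes to a single code point, while a lone surrogate is a perfectly good 16-bit code unit that is nonetheless not valid UTF-16. A sketch in Python, whose `surrogatepass` error handler approximates the raw-code-unit model that JS and Java strings expose:

```python
# A surrogate pair is valid UTF-16: it decodes to one code point.
pair = b"\x3d\xd8\x00\xde"  # UTF-16-LE bytes for U+1F600
assert pair.decode("utf-16-le") == "\U0001F600"

# A lone high surrogate is a legal 16-bit code unit but invalid UTF-16,
# so strict decoding rejects it...
lone = b"\x3d\xd8"
try:
    lone.decode("utf-16-le")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass

# ...while "surrogatepass" tolerates it, mimicking raw code units.
assert lone.decode("utf-16-le", errors="surrogatepass") == "\ud83d"
```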
> Oh, and by the way, there are byte sequences that are invalid when decoded as UTF-8, so I'm not sure about the "guaranteed-valid" UTF-8 Rust strings...
Your comment makes no sense. There are byte sequences which are not valid UTF-8, and they are likewise not valid as the contents of a Rust string: the safe constructors validate and reject them, and creating a non-UTF-8 Rust string (which requires unsafe code) is undefined behavior.
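To make the point concrete: some byte values, such as 0xFF, can never appear anywhere in valid UTF-8, so strict decoding fails, and lossy handling substitutes U+FFFD. A Python sketch of both behaviors (the comparison to Rust's `String::from_utf8` / `String::from_utf8_lossy` is an analogy, not a one-to-one API mapping):

```python
bogus = b"\xff\xfe\xfd"  # 0xFD-0xFF never occur in valid UTF-8

# Strict decoding rejects the sequence, as Rust's String::from_utf8 would.
try:
    bogus.decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError:
    pass

# Lossy decoding substitutes U+FFFD per invalid byte, in the spirit
# of Rust's String::from_utf8_lossy.
assert bogus.decode("utf-8", errors="replace") == "\ufffd" * 3
```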
> Which I expect is what TFA means by "UTF-16 semantics".
The article says "(potentially-invalid) UTF-16 semantics". The "potentially-invalid" part means that the JavaScript programmer can materialize unpaired surrogates. The "semantics" part means that the JavaScript programmer sees the strings acting as if they were potentially-invalid UTF-16 even when the storage format in RAM is actually Latin1 in SpiderMonkey and V8.
Yep. Rust has a PathBuf[1] type for dealing with paths in a platform-native manner. You can convert it to a Rust string type, but it's a conversion that can fail[2] or may be lossy[3].