As strange as it is to say, I think avoiding problems like this might be one of ...

bbarnett · 2024-08-17T22:07:37 1723932457

I'm skeptical that rust magically deals with, for example, character sets in 30 year old subtitle files, in a way that makes C seem inadequate.

Legacy compatibility has value.

duskwuff · 2024-08-17T22:42:13 1723934533

> I'm skeptical that rust magically deals with, for example, character sets in 30 year old subtitle files, in a way that makes C seem inadequate.

It's not just that C is "inadequate" - C and its standard library provide no assistance in that task. As the mpv author explains in profane detail in the linked commit message, POSIX locales are an active hindrance, not a useful form of "legacy compatibility".

commodoreboxer · 2024-08-18T00:09:27 1723939767

Not "magically", but more reasonably, and without forcing your entire program into a different state, which breaks the ability of libraries to work consistently across a huge range of functionality. C locale handling is basically impossible to work with robustly, even before you get into how it can't be used effectively at all in a thread-safe way.

Dylan16807 · 2024-08-17T22:36:32 1723934192

The correct place to handle character sets is when you're reading the file, not to sprinkle it all throughout your program.

josephg · 2024-08-18T00:54:30 1723942470

Right. And the Rust standard library provides (in my mind) the right API for this. Strings are always UTF-8 internally, but there are constructor methods to create strings from other encodings, such as UTF-16.
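To make that concrete, here is a minimal sketch of decoding at the I/O boundary. `String::from_utf8` and `String::from_utf16` are real standard-library constructors; the three helper names are made up for illustration, and the Latin-1 decoder is hand-written because the standard library does not ship legacy code pages (a real subtitle parser would likely pull in an encoding crate for those).

```rust
// Hypothetical helpers: decode once when the bytes come in,
// then use `String` (always UTF-8) everywhere else in the program.

/// Decode bytes that are expected to be UTF-8; invalid input is an error,
/// not a silently corrupted string.
fn decode_utf8(bytes: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

/// Decode UTF-16 code units (already assembled into `u16` values).
fn decode_utf16(units: &[u16]) -> Result<String, std::string::FromUtf16Error> {
    String::from_utf16(units)
}

/// Decode ISO-8859-1 (Latin-1): every byte maps to the Unicode code point
/// with the same numeric value, so this conversion cannot fail.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

The encoding is chosen by which function you call, not by any process-wide setting, so a library decoding Latin-1 subtitles cannot change how another library reads its own files.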

Rust isn’t unique. Swift, Go and Python 3 all expose more or less the same API. C’s standard library, with the benefit of hindsight, is uniquely terrible here.

nine_k · 2024-08-18T04:23:48 1723955028

Locales are so much more than character sets. E.g. an Arabic locale changes the direction of writing, it also changes the characters used for numbers, and completely changes the way numbers and dates are formatted. This is where the C locale functions are problematic.

Character encoding is the easy and safe part.

Dylan16807 · 2024-08-18T05:13:04 1723957984

Locales are much more than character sets, but the question was about character sets.

Also for most of those things, you want to be explicit about when to use the locale and when to not.

josephg · 2024-08-19T02:56:24 1724036184

> Also for most of those things, you want to be explicit about when to use the locale and when to not.

Right. And that's where the POSIX C API falls down. The locale isn't named explicitly. It's not a function parameter. It's specified via a global variable that gets shared between all your threads.

You might think you can use scanf to parse a number in a JSON file. It might appear to work fine on your local computer. But scanf behaves differently depending on the system locale. You can wrap scanf with a helper function which sets the locale to something sensible, calls scanf, and restores the locale. But the locale is shared with other threads, which might depend on it in other ways, so this can introduce race conditions.
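For contrast, here is a rough sketch of what "the locale is a function parameter" could look like. Everything here is hypothetical: `NumberFormat` is a toy stand-in for a locale that carries only a decimal separator, and `parse_number` is an invented helper, not an API from any real library. The only standard-library calls are `str::contains`, `str::replace` and `str::parse::<f64>`, which always expects `.` as the decimal point.

```rust
/// Toy stand-in for a locale (an assumption for illustration):
/// only the decimal separator is modelled.
#[derive(Clone, Copy)]
struct NumberFormat {
    decimal_separator: char,
}

/// What a JSON parser wants, regardless of the user's settings.
const JSON_FORMAT: NumberFormat = NumberFormat { decimal_separator: '.' };
/// What a German-language UI might want for user input.
const GERMAN_FORMAT: NumberFormat = NumberFormat { decimal_separator: ',' };

/// Parse a decimal number using the format passed in by the caller.
/// No global state is read or written, so concurrent callers with
/// different formats cannot interfere with each other.
fn parse_number(input: &str, format: NumberFormat) -> Option<f64> {
    // A '.' in the input is not valid when the format uses another separator.
    if format.decimal_separator != '.' && input.contains('.') {
        return None;
    }
    // Normalise to the '.' form that `str::parse::<f64>` understands.
    let normalized = input.replace(format.decimal_separator, ".");
    normalized.parse::<f64>().ok()
}
```

With this shape, the JSON code path calls `parse_number(text, JSON_FORMAT)` and the UI code path calls `parse_number(text, GERMAN_FORMAT)`, and neither needs to set, restore or lock anything.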

The whole thing is horribly designed - and it leads to buggy, unreliable code that is hard to reason about. Even in the best case, introducing thread synchronization into a function like sscanf will lead to a dramatic decrease in performance.

It's horrible. Just horrible.

account42 · 2024-08-19T12:27:34 1724070454

You can create/use a different string processing library without jumping to a completely different language.