*not using C doesn't help much if you need to talk to OS APIs* This means cdecl,...

joosters · on July 18, 2023

The problem is, a null-terminated string is a very simple concept for an ABI. A string with a length count seems simple, but there is a big step up in complexity, and you can't just wistfully imagine effortlessly passing String objects around to your ABI.

For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

So you won't be passing objects around. At the ABI, you'll have to pass a pointer and a length. Calling an ABI will involve unwrapping and wrapping objects to pretend you are dealing with 'your' strings. Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy). If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format, and manage the memory. Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

None of these are insurmountable, but they are a complexity that is rarely thought of when people declare 'C style ABIs are terrible!'

zerodensity · on July 18, 2023

> For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

I don't really think anyone expects a C ABI to have multiple implementation-defined string types. They want there to be a pointer + length string interface, removing the use of null pointer style strings altogether.

> If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format

A C function with proper error handling (something you want to have for all your interface functions) normally looks something like this:

int name(T1 param_1, T2 param_2, ..., TN param_n, R1* return_1, R2* return_2, ..., RN* return_n);

Here the returned int is the error code, param_1 to param_n are the input parameters, and return_1 to return_n are the results of the function.

When writing these kinds of functions having an extra parameter for the size of the strings either for input or output is not a huge complexity increase.
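A rough sketch of what that shape could look like with pointer + length strings on both the input and output side. The function name and error codes are made up for illustration, not taken from any real library:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical error codes for this sketch. */
enum { STR_OK = 0, STR_ERR_ARG = 1, STR_ERR_NOT_FOUND = 2 };

/*
 * Find the first occurrence of byte `c` in the string (data, len).
 * Input:  pointer + length, no terminator required.
 * Output: *out_data / *out_len describe the suffix starting at the match.
 *         The result points into the caller's buffer, so no ownership
 *         changes hands.
 * Return: the error code.
 */
int str_find_byte(const char *data, size_t len, char c,
                  const char **out_data, size_t *out_len)
{
    if ((data == NULL && len != 0) || out_data == NULL || out_len == NULL)
        return STR_ERR_ARG;

    const char *hit = len ? memchr(data, c, len) : NULL;
    if (hit == NULL)
        return STR_ERR_NOT_FOUND;

    *out_data = hit;
    *out_len = len - (size_t)(hit - data);
    return STR_OK;
}
```

Compared with a NUL-terminated version, the only extra cost is one size_t per string parameter.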

> Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

Which memory management system you use does not impact if you use null terminated strings or a pointer + length pair. Both support stack, manual, managed or gc memory. It's just about the string representation.

For example:

I use a gc language.

I call a c library which returns a string that I get ownership of.

Now I want to leverage the gc to automatically free the string at some point. What I do is tell the gc how to free it, I have to do this no matter how the string is represented.

Or take the inverse.

I send in a string to the c library, which takes ownership of it.

Now the library must know how to free the memory. Typically this is done by allocating it with a library allocator (which can be malloc) before sending it to the function. Importantly the allocator is not the same as the one we use for everything else.

What I am getting at is that if you are not using the same memory system in the caller and the callee, you always have to marshal between them, no matter whether you are using null terminated strings or a pointer + length pair.
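A minimal sketch of the "library hands me a string I now own" direction in C, assuming a hypothetical library that exposes its own lib_alloc/lib_free pair. None of these names come from a real API; the release callback stands in for whatever you would register with a GC finalizer:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library allocator; here it simply forwards to malloc/free. */
void *lib_alloc(size_t n) { return malloc(n ? n : 1); }
void lib_free(void *p) { free(p); }

/* Returns a string the caller now owns; it must be released with lib_free. */
int lib_get_greeting(char **out_data, size_t *out_len)
{
    static const char msg[] = "hello";
    size_t len = sizeof msg - 1;
    char *buf = lib_alloc(len);
    if (buf == NULL)
        return -1;
    memcpy(buf, msg, len);
    *out_data = buf;
    *out_len = len;
    return 0;
}

/* Caller-side wrapper: remembers how the foreign memory must be released. */
struct owned_str {
    char *data;
    size_t len;
    void (*release)(void *);
};

/* What a finalizer (or manual cleanup) would call. */
void owned_str_drop(struct owned_str *s)
{
    if (s->data != NULL && s->release != NULL)
        s->release(s->data);
    s->data = NULL;
    s->len = 0;
}
```

Nothing here would change if the string were NUL-terminated instead: the wrapper would lose the len field but still need the release function.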

kazinator · on July 18, 2023

> pointer + length string interface

If it's a 32 bit length, that will be limiting for some 64 bit programs.

If it's a 64 bit length, it means tiny strings take up more space.

Hey, do both! Have the length be a "size_t" and then have a "compat_32" shim around every single system call that takes at least one string argument.

Wee!

Imagine a parallel world in which mainstream OS kernel developers had seen the light 30 years ago and used len + data for system calls. You'd now have to support ancient binary programs that are passing strings where the length is uint16. Oh right, I forgot! We can just screw programs that are more than five years old. All the cool users are on the latest version of everything.

> if you are not using the same memory system in the caller and the callee you have to marshal between them always. No matter if you are using null terminated strings or a pointer + length pair.

Null-terminated byte strings are always marshaled and ready to be sent literally anywhere. They have no byte order issues. No multi-byte length field whose size and endianness we have to know. If they are UTF-8, their character encoding is already marshaled also (that's the point of using UTF-8 everywhere).
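To make the byte-order point concrete, here is a sketch of what a length-prefixed encoding has to pin down before two parties can exchange it. The 4-byte little-endian prefix is an arbitrary choice for this example, and the function names are made up:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Writes a 4-byte little-endian length followed by the string bytes.
 * Returns the number of bytes written, or 0 if the output buffer is too
 * small or the string does not fit in a 32-bit length.
 */
size_t put_lp_string(uint8_t *out, size_t cap, const char *data, size_t len)
{
    if (len > UINT32_MAX || cap < 4 || cap - 4 < len)
        return 0;
    uint32_t n = (uint32_t)len;
    out[0] = (uint8_t)(n & 0xff);
    out[1] = (uint8_t)((n >> 8) & 0xff);
    out[2] = (uint8_t)((n >> 16) & 0xff);
    out[3] = (uint8_t)((n >> 24) & 0xff);
    if (len != 0)
        memcpy(out + 4, data, len);
    return 4 + len;
}

/*
 * Reads a string written by put_lp_string. On success, *data points into
 * `in` and the number of bytes consumed is returned; 0 means truncated input.
 */
size_t get_lp_string(const uint8_t *in, size_t avail,
                     const char **data, size_t *len)
{
    if (avail < 4)
        return 0;
    uint32_t n = (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
                 ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
    if (avail - 4 < n)
        return 0;
    *data = (const char *)(in + 4);
    *len = n;
    return 4 + (size_t)n;
}
```

Both sides have to agree on the prefix width and byte order; a NUL-terminated byte string needs neither decision.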

GoblinSlayer · on July 18, 2023

>Null-terminated byte strings are always marshaled and ready to be sent literally anywhere. They have no byte order issues.

They have https://en.cppreference.com/w/c/string/wide

kazinator · on July 18, 2023

Why are you citing documentation about wide strings, in response to a comment about byte strings (that even mentions UTF-8)?

lelanthran · on July 18, 2023

> I don't really think anyone expects a C ABI to have multiple implementation-defined string types. They want there to be a pointer + length string interface, removing the use of null pointer style strings altogether

Not so simple.

32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

Zero length strings are easy, what about null strings? Are you going to design the pointer + length struct to be opaque so that callers can only ever use pointers to the struct? If you don't, you cannot represent a null string (i.e. a missing value) differently to an empty string.

How do callers free this string? You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

Composite data types are a lot more work and are more error prone in C.

jacquesm · on July 19, 2023

We're very much in agreement.

The whole 'null pointer style strings' makes no sense, I think they want to say 'nul terminated'. But fine.

Your examples are excellent, let me add a few more:

Big endian? Little endian? Do we count characters or bytes? Who owns the bloody thing? Can they be modified in place? Are they in ROM or RAM? Automatic? Static? Can they be transmitted over a network 'as is' or do they need to be sent via some serialization mechanism? What about storing them on disk? And can they then be retrieved on different architectures?

The problem really is that C more or less requires you to really know what you're doing with your data and that's impossible in a networked world because your toy library ends up integrated into something else and then that something else gets connected to the internet and suddenly all those negative test cases that you never thought of are potential security issues. So any simplistic view of string handling will end up with a broken implementation regardless of how well it worked in its initial target environment.

C's solution is simple: take the simplest possible representation and use that, pass responsibility back to the programmer for dealing with all of the edge cases. The problem is that nobody does and even those that try tend to get it subtly wrong several times across a codebase of any magnitude.

It's a nasty little problem and it will result in security issues for decades to come. There are plenty of managed languages, and I had some hope (as a seasoned C programmer) that instead of this Cambrian explosion of programming languages we'd have some kind of convergence, so that it becomes easier, not harder, to pick a winner and establish some best practices. But it seems as though cooperation is rare; much more common is the mode where a defect in one language or ecosystem results in a completely new language that solves that one problem in some way (sometimes quite convoluted) at the expense of introducing a whole raft of new problems, on top of the fragmenting of mindshare.

GoblinSlayer · on July 19, 2023

It's not a hypothesis, the thing was already implemented many times in C, C++ and other languages and used for ages especially for networked code, because C "there's no length" approach is a guaranteed vulnerability.

lelanthran · on July 19, 2023

It's not a guaranteed vulnerability, it's a potential vulnerability.

Guaranteed doesn't mean "this will probably happen", it means "this will definitely happen".

The "no length approach" can probably result in a vulnerability. It won't definitely result in a vulnerability.

I mean, come on, if it was a guaranteed vulnerability, almost nothing on the internet would work because they all have, somewhere down the line, a dependency on a nul-terminated string.

I mean, do you think that nginx (https://github.com/nginx/nginx/blob/master/src/core/ngx_stri...) is getting exploited millions of times per hour because they have a few uses for nul-terminated strings?

GoblinSlayer · on July 19, 2023

nginx whacks one mole at a time https://cve.circl.lu/cve/CVE-2013-2028

jacquesm · on July 19, 2023

That CVE has absolutely nothing to do with length up front vs nul terminated strings. It's also ten years old. The only thing it does is reference nginx but that's disingenuous, unless the point you're trying to make is that nginx has the occasional security issue, which I think we're all very much aware of. But it doesn't answer the GP's point in any relevant way.

GoblinSlayer · on July 19, 2023

The problem there is in opportunistic bound checking due to loose association of an array with length, string being an example of an array. This vulnerability is a direct consequence of C's "there's no length" approach and shows why this approach is unsuitable for networked code.

jacquesm · on July 19, 2023

In C a string is not an example of an array. If we can't agree on terminology for a discussion that requires extreme precision it becomes difficult to keep going.

Networked code does not as a rule use C style nul terminated strings though, in the case of fixed length buffers they will usually be accompanied either by a length field or by zeroing out the end of the string or even the whole buffer (the latter is much better and ensures you don't accidentally leak data from one session to another).
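A small sketch of the "zero the whole buffer" variant for a fixed-length field. The helper name is hypothetical and this is only one way to do it:

```c
#include <stddef.h>
#include <string.h>

/*
 * Fills a fixed-size field of `cap` bytes from (src, src_len).
 * The whole field is zeroed first, so no bytes from a previous use of the
 * buffer survive, and at most cap - 1 bytes are copied so the field always
 * ends in at least one zero byte.
 * Returns the number of bytes copied (less than src_len if truncated).
 */
size_t fill_field(char *field, size_t cap, const char *src, size_t src_len)
{
    if (cap == 0)
        return 0;
    size_t n = src_len < cap - 1 ? src_len : cap - 1;
    memset(field, 0, cap);
    if (n != 0)
        memcpy(field, src, n);
    return n;
}
```

A caller that cares about truncation can compare the return value against src_len.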

Networked code doesn't have to be written in C to begin with. Regardless of implementation there usually is a protocol spec and you adhere to that spec and if you don't then you'll find out the hard way why it matters.

This particular vulnerability has nothing at all to do with C strings but in fact has everything to do with a broken implementation of length based strings, which could result in the length being negative, which is at least one problem which C style strings do not have... (small comfort there, they have plenty of other problems, but that one they don't).

This is the fix for that particular CVE:

https://github.com/nginx/nginx/commit/4997de8005630664ab35f2...

Which stems from integer overflow after doing arithmetic on the lengths.
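To illustrate the general class of bug (this is not the nginx code, just a sketch with made-up names): a signed length parsed from untrusted input goes negative, a signed "min" happily picks it, and the cast to size_t turns it into a huge read size.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Unchecked version, for contrast. If `remaining` is negative, the signed
 * comparison selects it, and the conversion to size_t wraps it around to a
 * value far larger than buf_cap.
 */
size_t unchecked_read_size(int64_t remaining, size_t buf_cap)
{
    int64_t n = remaining < (int64_t)buf_cap ? remaining : (int64_t)buf_cap;
    return (size_t)n;
}

/*
 * Checked version: reject negative lengths before any conversion, then
 * clamp in unsigned arithmetic. Returns 0 on success, -1 on invalid input.
 */
int bounded_read_size(int64_t remaining, size_t buf_cap, size_t *out)
{
    if (remaining < 0)
        return -1;
    uint64_t r = (uint64_t)remaining;
    *out = r < buf_cap ? (size_t)r : buf_cap;
    return 0;
}
```

Note that the length is carried explicitly in both versions; what differs is whether its sign is validated.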

It looks to me as though you just pulled the first nginx CVE that you found and posted it without looking at what the CVE was all about, without realizing that the ancestor comment was referring to the string implementation inside nginx which lives in the referenced file, whereas you are pointing to a CVE related to the parsing of HTTP chunked data requests, which resides in an entirely different file and has nothing to do with string handling to begin with.

GoblinSlayer · on July 19, 2023

And what do you propose? To let only 1.5 good C programmers in the world write code like in 70s?

jacquesm · on July 19, 2023

> And what do you propose?

That you get your terminology right, back up your claims with links that actually make sense and try to understand that the software world is complex and that incremental approaches make more sense than demanding unrealistic / uneconomical changes because they are not going to happen.

> To let only 1.5 good C programmers in the world write code like in 70s?

No, I did not propose that, you just did and clearly that's nonsense aka a strawman even if you didn't bother throwing it down.

C is here. It will be here decades from now. Rewriting everything is not going to happen, at least, not in the short term. C will likely still be here (and new C code will likely still be written) in 2100, and possibly long after that. This isn't ideal and it's not going to help that we cannot make a clean break with the past even though we are trying.

The solution will come in many small pieces rather than as one silver bullet to cure it all. TFA announces two such small pieces and as such is a small step in a very, very long game. The adoption of Rust and other safer languages (not inherently safe but safer; there are still plenty of footguns left) may well in the longer run give us a chance to do away with the last of the heritage from the C era. But there is a fair chance that it won't happen and that Rust's rate of adoption will be too low to solve this problem in a timely manner.

The same goes for every other managed language, they are partial solutions at best. This isn't good news and it isn't optimal, but it is the reality as far as I can determine. If you're going to do a new greenfield development I hope that you will find yourself on a platform where you won't have to use C and that you have skills and resources at your disposal that will allow you to side-step those problems entirely. But that won't do anything for the untold LOC already out there in production and that utterly dwarfs any concern I have about future development, it's the mess we made in the past that we have to deal with and we have to try hard to avoid making new messes.

Think of it as fixing a large toxic waste spill.

GoblinSlayer · on July 19, 2023

It's not a hypothesis, the change happened several times and is used in networking code: in putty and s2n in C and in grpc in C++ and I guess in all C++ code that uses string_view and span, it's easier to happen in C++ due to more language features.

>Rewriting everything is not going to happen, at least, not in the short term.

If you can't do a big task in one go, split it into smaller tasks and do them in sequence.

jacquesm · on July 19, 2023

I'm sorry, I apparently lack the vocabulary or clarity of expression to get my points across to you so I'm bowing out here.

jacquesm · on July 19, 2023

Which C compilers are those then?

Also, you keep writing 'null pointer' and 'null', there is a pretty big difference between 'null' and 'nul' and in the context of talking about language implementation details such little things matter a lot. You say a lot of stuff with great authority that simply doesn't match my experience (as a C programmer of many decades) and while I'm all open to being convinced otherwise you will have to show some references and examples.

GoblinSlayer · on July 19, 2023

What doesn't match your experience?

jacquesm · on July 19, 2023

My experience as a programmer of some 40 years in C has yet to expose me to a C compiler that has length based rather than nul terminated strings as the base string type. Please point me to one in somewhat widespread use rather than an experimental implementation that uses this concept and make sure not to confuse libraries with the implementation of the language.

GoblinSlayer · on July 19, 2023

Since no C/C++ compiler supports it, for them the implementation is in a library.

jacquesm · on July 19, 2023

So that means they are not part of C/C++. Which was the point. You can write software in C/C++ but that's hardly news and you can use that to create new data types that are not in the language, which also is hardly news.

GoblinSlayer · on July 19, 2023

People suggesting it are concerned about security, they don't intend it to be a novel invention. Bounds checking predates C.

jacquesm · on July 19, 2023

Yes it does. But that doesn't mean that you get to state a lot of stuff with certainty that upon inspection turns out to simply not be true. C programmers are - in spite of what you appear to think - also concerned about security. And whether bounds checking predates C or not has nothing to do with how this is implemented, in a library or in the compiler itself (or even in the hardware).

If you reference C you are talking about the compiler, that, and only that is the language implementation. In C that specification is so tiny that a lot of the functionality that you might expect to be present in the language is actually library stuff. K&R does a poor job for novices to split out what is the language proper and what is the library, but a good hint is that anything that requires an include file isn't part of the language itself.

The original comment to which you responded talked about the ABI, the layer between the applications and the operating system, presumably the UNIX/POSIX ABI, which is more or less cast in concrete by now and unlikely to be replaced because if you do so you introduce a breaking change: all compiled applications using that ABI will no longer work. Some versions of UNIX will occasionally do this and this is widely regarded as a great way to limit your adoption. So the problem, in a nutshell is: how do we repair the security situation that has emerged as the result of many years of bad practices in such a way that our systems continue to work without having to re-invest the untold trillions of $ that have been spent on software that we use every day. This is a hard problem. TFA is a small, and incremental step in trying to solve that problem.

Others are more pessimistic, believe that we should just take our lumps and get on with that rewrite, usually in whatever is their favorite managed (or unmanaged, in some cases) language. Yet others pursue compiler based or hardware based solutions which all introduce different degrees of incompatibility.

I'm somewhat bearish on seeing this problem resolved in my lifetime. At the same time I applaud every little step in the right direction. And I personally do not believe that replacing C's 'string type' (which it really doesn't have other than nul terminated string literals) is the way to go due to the reasons outlined above. But an incremental approach allows for fixing some known issues and allows us to back away from historical mistakes in a way that we can afford the cost and to do so without incurring the penalty of a complete rewrite (which usually comes with a whole raft of new bugs as well). So small improvements that do not address each and every grievance should be welcomed. Even if they no doubt introduce new problems at least the scope is such that you can - hopefully - deal with those without introducing new security issues.

GoblinSlayer · on July 20, 2023

Putty and s2n are examples of how this problem is solved. They work on POSIX, e.g. Linux: just compile them with gcc and they work.

GoblinSlayer · on July 19, 2023

>32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

32 bit should be enough for everyone, it's easier to type as int, and you have fewer problems with variable sized integers on different targets. Signed length makes sense because length is a number, and numbers are signed; also, with arrays a -1 sentinel value is often used.

>If you don't, you cannot represent a null string (IE a missing value) differently to an empty string.

C++ can't do it either with std::string and the sky doesn't fall, because such a distinction is rarely needed and for business logic an empty string means absence of value. In fact, in languages with nullable strings, null string and empty string are routinely synonymous and you often use a method like IsNullOrEmpty to check for absence of value. Anyway, you need the concept of absence for other types too, like int, so string isn't special here.

>You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

pointer+length struct is a value type, see https://en.cppreference.com/w/cpp/container/span
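A rough C analogue of that idea, as a sketch: a two-word, non-owning view that is passed and returned by value and never freed. The names are made up. Treating a NULL data pointer as "missing" is just one possible convention for when the distinction is needed, not something any standard prescribes:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Non-owning view: copied by value; whoever owns the bytes frees them. */
struct str_view {
    const char *data;
    size_t len;
};

/* Wraps a NUL-terminated string; a NULL pointer yields a {NULL, 0} view. */
struct str_view sv_from_cstr(const char *s)
{
    struct str_view v = { s, s ? strlen(s) : 0 };
    return v;
}

/* Sub-view without copying; pos and n are clamped to the source view. */
struct str_view sv_sub(struct str_view v, size_t pos, size_t n)
{
    if (pos > v.len)
        pos = v.len;
    if (n > v.len - pos)
        n = v.len - pos;
    struct str_view r = { v.data ? v.data + pos : NULL, n };
    return r;
}

/* Content equality; two zero-length views compare equal. */
bool sv_eq(struct str_view a, struct str_view b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.data, b.data, a.len) == 0);
}

/* Optional convention (an assumption of this sketch): NULL data = missing. */
bool sv_is_missing(struct str_view v)
{
    return v.data == NULL;
}
```

Since the view owns nothing, there is no stringFree to mandate for it; freeing remains the business of whoever allocated the underlying bytes.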

lelanthran · on July 19, 2023

> C++ can't do it either with std::string and the sky doesn't fall, because such a distinction is rarely needed and for business logic an empty string means absence of value,

Incorrect. I'm literally, today, working on a project where the business logic is different depending on whether an empty string is stored in the database, or no string.

"User didn't get to fill in a preference" is very different from "user didn't indicate a preference".

In more practical terms, a missing value could mean that we use the default while an empty value could mean that we don't use it at all.

GoblinSlayer · on July 19, 2023

For the user, an empty text field means absence of value. Indeed, a need for optional values does occasionally arise, but it's not only for strings; other types like int may need it too.

jacquesm · on July 19, 2023

The end user representation of a programming construct versus the implementation details surrounding such constructs give rise to what is called a 'leaky abstraction', in this case that 'absence of value' is something entirely different than 'empty string'.

We have a way of representing absence of value for some data types but not for others, again because of implementation details. This sort of leaky abstraction often gives options for creativity but it can also lead to trouble and bugs. Some languages offer such 'optional' behavior to more datatypes and make it a part of function calling conventions, either by supplying a default or by leaving the optional parameters set to the equivalent of 'empty' or even 'undefined' if that is possible.

account42 · on July 18, 2023

Pretty much all string implementations have the ability to give you a pointer and a length which you can then pass on to the foreign interface. Essentially, the API always takes a non-owning string view. C strings on the other hand require you to store that terminating NUL next to the string. This is only bearable because most string implementations are designed to deal with it, because C APIs are so popular.

For returning strings, ownership is a bigger problem than the exact representation. OS APIs typically make you provide a buffer and then fail if it was not big enough.
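A sketch of that caller-provides-the-buffer pattern, with a hypothetical helper standing in for the OS call. Reporting the required size through an out-parameter is a common convention but is an assumption here, not a description of any specific API:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies (src, src_len) into the caller's buffer and NUL-terminates it.
 * *needed is always set to the buffer size required, terminator included.
 * Returns 0 on success, or ERANGE if `cap` is too small; in that case
 * nothing is written to buf and the caller can retry with a bigger buffer.
 */
int copy_out(const char *src, size_t src_len,
             char *buf, size_t cap, size_t *needed)
{
    *needed = src_len + 1;
    if (cap < src_len + 1)
        return ERANGE;
    if (src_len != 0)
        memcpy(buf, src, src_len);
    buf[src_len] = '\0';
    return 0;
}
```

The caller keeps ownership of the buffer throughout, so the question of who frees the result never comes up.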

GoblinSlayer · on July 18, 2023

>Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy).

The idea is to use C-style memory management: you provide a buffer, into which the string is copied. For an example of returning a string this way, see the getenv_r function: https://man.netbsd.org/getenv.3

In C++ it's more similar to std::span.

wruza · on July 18, 2023

> you can't just wistfully imagine effortlessly passing String objects around

To clarify, I didn’t mean it. No new style API/ABI. Only unboxing a string into (str, len) in/out-params and boxing it back from returns.

kazinator · on July 18, 2023

Lots of C programs define a more substantial string type for themselves (e.g. dynamic, reference-counted strings or what have you), used only internally. Time-honored tradition.

pjmlp · on July 19, 2023

You do as Windows does and define safe strings for the ABI, as is done for the COM API, nowadays the main kind of Windows API.