Inside the C Standard Library (begriffs.com)
206 points by signa11 on Jan 20, 2019 | 57 comments



Well, I was not expecting to read a 10,000-word article about the C standard library this morning. I haven't been programming in C as long as many here, but over the past dozen years or so I've always felt that the standard library is a half-baked hodgepodge of inconsistencies and a minefield of gotchas that have caused innumerable security vulnerabilities.

Why is open() in fcntl.h and close() in unistd.h? Why does bcopy go src,dst but memcpy go dst,src? Why did they make strlcpy return a different value than strncpy, so you can't improve a whole codebase with one recursive sed?

There are many questions I still have, but I learned a bunch of new things from this article. Thanks to the author for putting the effort into digesting the source material for us.

On a tangential topic, I don't know of a good reference for exploring what is included in the standard library. Yes, of course there are manpages for individual families of functions. But I don't know how anyone would find out about, say, strtok or strsep without encountering them in some existing code. I had a university assignment which would have been trivial if anyone in the class knew that fts.h existed.


> I've always felt that the standard library is a half-baked hodgepodge of inconsistencies

It is!

> Why is open() in fcntl.h and close() in unistd.h?

Because they don't belong together. open() is used to create an FD using a filesystem path. There are other ways to create FDs, like socket(). close() is used to close any FD. You could say open() has an unfortunate name. Neither open() nor close() are part of the C standard library, though. The standard library has the fopen() and fclose() functions, both defined in stdio.h. (And it's the same situation there: there are other ways to create a FILE than fopen() (the standard library itself also provides freopen() and tmpfile()), while fclose() can close any FILE.)
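A tiny POSIX (not ISO C) sketch of that asymmetry, with error handling kept minimal:

    #include <fcntl.h>       /* open()   */
    #include <sys/socket.h>  /* socket() */
    #include <unistd.h>      /* close()  */

    int main(void)
    {
        int fd1 = open("/etc/hosts", O_RDONLY);     /* FD from a filesystem path */
        int fd2 = socket(AF_INET, SOCK_STREAM, 0);  /* FD from the network stack */
        if (fd1 != -1) close(fd1);                  /* close() takes any FD */
        if (fd2 != -1) close(fd2);
        return 0;
    }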

> Why did they make strlcpy return a different value than strncpy

My advice is to just write your own string data structures and functions. Make the interface that you need. It's often bad to use C strings, anyway (they don't have an explicit length).
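For illustration, a minimal sketch of what that might look like (the type and helper names here are made up, not from any particular library):

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *data;   /* not necessarily NUL-terminated */
        size_t len;    /* explicit length, so no strlen() scans */
    } Str;

    /* convert at the boundary, then carry the length around */
    Str str_from_cstr(const char *s)
    {
        Str r;
        r.len = strlen(s);
        r.data = malloc(r.len);
        if (r.data == NULL)
            r.len = 0;          /* caller should still check r.data */
        else
            memcpy(r.data, s, r.len);
        return r;
    }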

> But I don't know how anyone would find out about, say, strtok or strsep without encountering them in some existing code.

Man pages cross-reference similar functions. And there are index-manpages. For example, strcpy(3) references string(3) which lists all the functionality from string.h, including strtok(3).

> I had a university assignment which would have been trivial if anyone in the class knew that fts.h existed.

Never knew that one either. Not part of the C standard library I guess. Also, just write your own code. In my opinion the standard library is mainly useful for portable access to basic OS services.


> My advice is to just write your own string data structures and functions.

I would rather not have to do this, and I would prefer code that I'm reviewing not do it either. C strings are error-prone, as you point out, and when I'm evaluating code for security issues, it's useful to know exactly how a function will behave in edge cases--is the buffer guaranteed to be terminated? is the return value the number of bytes written, like sprintf, or the number of bytes that could have been written, like snprintf?

A new function with new behavior is twice the work: you have to verify it is consistent with the rules it claims to follow, and then make sure it is being used correctly. One memorable example was a convoluted routine that I suspected was the same as strdup() but I had to spend a while validating it.

It's just a shame that so many of the standard library string functions fall short. This is the reason why there are so many security vulnerabilities, because there are too many nuances and inconsistencies. They messed up strcpy() so they introduced strncpy(), but they messed that up too so we had to create strlcpy(), which is not a drop-in replacement (and not portable apparently). So we've got 3 poor design choices implementing one trivial function, no clear replacement to reach for, and we expect every programmer to know this.


> They messed up strcpy() so they introduced strncpy()...

That's not actually the case. strncpy() wasn't introduced as some kind of "safer strcpy()", it was built specifically to deal with strings stored in the manner of original UNIX directory entries - a fixed-width array where any unused space is padded out with NULs. That's why it worked the way it did - such a string that filled the whole array with no NUL at all was still valid.

The major problem with strncpy() was always its name - the "str" prefix was (obviously, in hindsight) taken by people to mean that it dealt with ordinary C NUL-terminated strings, whereas in reality only its source argument was supposed to be such a string. It should have had a different name (this is illustrated perfectly by the similarly-named strncat(), which does deal only with ordinary C strings), but by the time of standardisation that ship had sailed. Probably it was niche enough that it should just have been left out of C89 altogether.


`strcpy()` and `strncpy()` were defined at the same time, but to understand `strncpy()` you have to understand the original Unix file system. Directories are special files that map names to inodes [1]. Back then, each entry in a directory was 16 bytes---14 for the name, and 2 bytes for the inode number. `strncpy()` was designed to slap filenames into the directory; adding a NUL byte would either restrict filenames to 13 bytes, or overwrite the inode number. Also remember, this was done in the early 70s, when space was at a premium and computers weren't networked with every other computer in the world. It was a different time then.
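To make that concrete, here is a reconstruction of roughly what the V7-era layout and usage looked like (not the literal historical header):

    #include <string.h>

    #define DIRSIZ 14

    struct direct {             /* reconstruction of an old Unix directory entry */
        unsigned short d_ino;   /* 2-byte inode number */
        char d_name[DIRSIZ];    /* NUL-padded, but NOT necessarily NUL-terminated */
    };

    /* exactly the job strncpy was built for: a 14-char name fills the whole
       field with no terminator; a shorter one gets padded out with NULs */
    void set_name(struct direct *d, const char *name)
    {
        strncpy(d->d_name, name, DIRSIZ);
    }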

The ANSI committee, when standardizing C, were trying to keep existing code valid when possible. Up until that time, you had a lot of different C libraries. Some defined `bzero()` to clear memory, others defined `memset()`. The C committee had to distill all these conflicting libraries so as to minimize the work each C vendor had to do to get a conforming C library.

[1] Inodes contain all information about a file except its name. Still true today.


> is the return value the number of bytes written, like sprintf, or the number of bytes that could have been written, like snprintf?

Both report the same number: the length the fully formatted output would have. snprintf just doesn't write more than n bytes.
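A quick illustration (a minimal sketch; the values in the comments assume no encoding errors):

    #include <stdio.h>

    int main(void)
    {
        char buf[8];
        int n = snprintf(buf, sizeof buf, "%s", "hello, world");
        /* n == 12, the length of the full output, same as sprintf would
           return; but snprintf stopped after 7 chars plus the NUL. */
        printf("%d %s\n", n, buf);   /* prints: 12 hello,  */
        return 0;
    }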

> when I'm evaluating code for security issues, it's useful to know exactly how a function will behave in edge cases

Simple data: no edge cases. Pointer + length, that's simple. The most complexity you might face is whether to allow a NULL pointer when the length is 0 (which is common). Hardly any function should need to care about that.

Aside, C strings are still a valid alternate representation in some cases. For example, for many small strings (where an explicit length field would double the cost) or purely as a convenience in conjunction with string literals. Or for some serialized representations (it's nice not having to deal with compatibility problems concerning the physical representation of the length value).

> This is the reason why there are so many security vulnerabilities, because there are too many nuances and inconsistencies.

I think the reason is that the standard library is overused. Functions like strcat etc. are not only inefficient; they also have a needlessly complex API. If people created the functionality they actually needed instead of working around the ill-fitting API on every line of code, there would be fewer vulnerabilities.

> strcpy() [..] strncpy() [..] strlcpy [..] no clear replacement to reach for, and we expect every programmer to know this.

No. Just don't use this stuff. It's too complex. It's the wrong interface. If you deal with strings and don't do automatic reallocation you need to get their lengths anyway. So what you should do is something like

    void do_silly_stuff(const char *a, const char *b, const char *c,
                        int alen, int blen, int clen)
    {
        char buf[FIXEDSIZE];
        ASSERT(alen + blen + clen <= sizeof buf);
        int len = 0;
        memcpy(buf + len, a, alen); len += alen;
        memcpy(buf + len, b, blen); len += blen;
        memcpy(buf + len, c, clen); len += clen;
        do_more_silly_stuff(buf, len);
    }
Nice and explicit. You cannot get less dangerous in C.

Or if you do automatic reallocation ("dynamic string"):

    String sillyconcat(String a, String b, String c) {
        String result = new_string();
        //optionally:
        // string_reserve(result, string_length(a) + string_length(b) + string_length(c));
        string_append(result, a);
        string_append(result, b);
        string_append(result, c);
        return result;
    }
... but I'd only recommend this approach if you want to get super-comfortable and are willing to pay the cost of being maybe a bit opaque and of buying into a specific String type.


Ah, but even this can be dangerous. Suppose FIXEDSIZE = 4096. If you pass 1,431,655,766 as all three length parameters, the signed sum overflows (undefined behavior, which in practice typically wraps around to 2), so your assertion (assuming it's not compiled out) passes, yet the memcpy calls massively overflow the buffer.

Basically, using the C string routines with untrusted input is terrifying.


Yep. You could argue to death almost any line of C code that contains an arithmetic operation in this way. But the ASSERT is not supposed to catch everything. It's a basic protection against programming errors. (Realistically I rarely have strings that are 700 MB large).
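For what it's worth, a hardened variant of that check (reusing the names from the snippet above) compares each length against the space that remains, so no sum can wrap:

    ASSERT(alen >= 0 && blen >= 0 && clen >= 0);
    ASSERT((size_t)alen <= sizeof buf);
    ASSERT((size_t)blen <= sizeof buf - (size_t)alen);
    ASSERT((size_t)clen <= sizeof buf - (size_t)alen - (size_t)blen);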

Unfortunately, arithmetic overflow doesn't result in an exception. I rarely want wraparound.


> Realistically I rarely have strings that are 700 MB large

maybe not that long, but it's not that hard to lose a '\0' when serializing complicated data structures to disk. when you read the file back in, suddenly one of your structs contains an arbitrarily long string. i've seen several 32+MB strings get created this way.


> > I had a university assignment which would have been trivial if anyone in the class knew that fts.h existed.

> Never knew that one either. Not part of the C standard library I guess.

fts(3) is a BSD interface. It got introduced in 4.3BSD-Reno.


fts(3) never made it into POSIX/SUS, and last time I checked it was not available on Solaris (but that was like 10 years ago). A more portable but more limited equivalent is ftw(3).


The improved nftw() is in POSIX.


> Why is open() in fcntl.h and close() in unistd.h? Why does bcopy go src,dst

I don't know about the rest, but open(), close() and bcopy() are not in the C standard library. They are either POSIX or Linux specific. The C standard library functions for files start with f, e.g. fopen(), fclose().

> On a tangential topic, I don't know of a good reference for exploring what is included in the standard library.

The classic book The C Programming Language covers most, if not all, of the standard library. The whole thing can be found in an appendix. Admittedly reading an appendix isn't normally the best way to learn about something, but the library is so small and the amount not already covered in the text is even smaller so it wouldn't take long to skim. Of course it doesn't cover Unix-specific functions, like the unistd.h header; for that, I recommend Advanced Programming in the Unix Environment.


I'd add that The C Programming Language should be seen as a tutorial (albeit an older one that, in my experience, is no longer fully up to date on best practices) that can serve as a reference in a pinch (it skimps a bit on the corner cases, if memory serves). Advanced Programming in the Unix Environment, conversely, should be seen as a reference book with fantastic examples and discussion of error handling -- yes, you can use it as a tutorial, but it's a tad dense and you'll likely never need a lot of it.


None of these are Linux specific :-). All were in POSIX; bcopy was deprecated in the 2004 iteration and removed in 2008.


A somewhat relevant fact is that almost any hosted environment (including MSVC) actually implements a large subset of the POSIX functions that are not part of the C standard. And this consists not only of OS interface functions but also of various general-purpose userspace-only algorithms and data structures.


The post was based upon the book The Standard C Library by P. J. Plauger. It's an excellent reference (if a bit dated, as it only covers C89) that not only prints the relevant bits from the C Standard, but goes into the history behind the functions and provides an implementation.


Wonderful exposition. One nitpick: please do NOT suggest seeding random functions with &main. Although ASLR is widespread, it is by no means guaranteed, and the default compilation options on many platforms will still produce non-PIE executables. This would render your seed completely static.

A separate problem is that ASLR on some platforms has limited entropy. For example, on 32-bit Linux platforms the executable randomness has between 8 and 16 bits of entropy - which means your program would only ever have a small, finite range of possible random streams (16 bits correlating to less than 1 day of time() output taken once per second).

Don’t rely on ASLR for randomness. At best it can be used as an additional source of random data. The C standard provides no mechanism for obtaining high quality entropy, as some environments (especially freestanding ones) may be unable to offer such randomness.


On recent unix-likes, use getrandom() or getentropy().

On Windows, use CryptGenRandom, RtlGenRandom, or BCryptGenRandom (I'm not very familiar with Windows, sorry).
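For example, on Linux with glibc 2.25 or later, that might look like this (a sketch; how to handle failure is up to you):

    #include <stdlib.h>
    #include <sys/random.h>     /* getrandom(), glibc 2.25+ */

    void get_seed(unsigned char seed[32])
    {
        if (getrandom(seed, 32, 0) != 32)
            abort();            /* no entropy available; handle as appropriate */
    }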


And on embedded systems with no RNG in sight, use a deterministic random number generator, with a seed generated from another system. Use fast key erasure to ensure forward secrecy. https://blog.cr.yp.to/20170723-random.html

If the device ever powers down, you will need 32 bytes worth of persistent storage, so you don't repeat the random stream every time you boot up.


I don't think this advice is really detailed enough to cover the breadth of the embedded space and/or comes with enough caveats about lack of writeable storage and reused keys. Also, sometimes people use 'embedded' to refer to things that are really full-sized computers (e.g., ARM with ordinary read/write storage), and on those you should just use ordinary CSPRNG implementations like Fortuna (which is not deterministic but does implement key erasure).


> I don't think this advice is really detailed enough to cover the breadth of the embedded space and/or comes with enough caveats about lack of writeable storage and reused keys.

I believe the link I gave is. I highly recommend you read it. It's from a trustworthy author (Daniel Bernstein): https://blog.cr.yp.to/20170723-random.html


I believe that there is no such thing as a useful system without any viable source of entropy.

Any useful system has to communicate with something and has to have at least some vague idea of the passing of time, and thus you can hash the timing and contents of outside communication into your (preferably persistent) entropy pool.


Ah, thank you. This is a good summary of another problem I had with the comment we're both replying to, that I had trouble voicing. The assumption that there is no entropy available almost directly contradicts the idea of using cryptographic random data for almost any purpose. I'm having trouble imagining a use case but I don't think it's entirely impossible one exists.


> The assumption that there is no entropy available almost directly contradicts the idea of using cryptographic random data for almost any purpose

Strictly speaking, that's correct. For practical purposes however, all you need is 256 bits of entropy. From there, you can generate an unbounded number of independent random streams, which you can use for your crypto stuff. You don't need to re-seed. Ever. How do you think stream ciphers work? You start from 256 random bits, and you generate a virtually unbounded random stream that encrypts your message.

The use case of this scheme is mainly reducing the attack surface. Not re-seeding prevents some attacks, detailed in the link I gave above.


> For practical purposes however, all you need is 256 bits of entropy.

You need 256 bits of entropy and permanent, reliable power; or permanently writable and reliable media.

You haven't addressed the bit where you're presenting a solution in search of a problem.

> You don't need to re-seed. Ever. How do you think stream cipher work?

I'm quite familiar with random devices (i.e., Fortuna) and stream ciphers, thank you.


> You need 256 bits of entropy and permanent, reliable power; or permanently writable and reliable media.

I did mention that already: https://news.ycombinator.com/item?id=18955720 "If the device ever powers down, you will need 32 bytes worth of persistent storage, so you don't repeat the random stream every time you boot up."

> You haven't addressed the bit where you're presenting a solution in search of a problem.

Here's the problem: if the attacker controls one or several of the random sources, they may partially control the output of the RNG. In some cases, a simple bias can lead to catastrophic key recovery attacks (ECDSA comes to mind). In other cases, the attacker could simply use the RNG as a communication channel. It only takes 4 tries to control 2 bits of a hash. This is also described in the link you should really really follow.

> I'm quite familiar with random devices (i.e., Fortuna) and stream ciphers, thank you.

Good. Then you know that a stream cipher such as Chacha20 is a suitable high security RNG for long term keys (otherwise it wouldn't be suitable even as a stream cipher). The only problem is forward secrecy (the key used by the stream cipher is necessarily stored somewhere, and if it gets stolen, all past keys, including ephemerals!, are toast). That's what fast key erasure is for: have a 512 byte buffer or so, whose first 32 bytes are filled with your current seed. Stream 512 bytes from that seed and overwrite the entire buffer with it. The first 32 bytes will seed the next batch of random bytes. The rest can be distributed to the API user. (Also don't let users access the RNG state, they're obviously going to duplicate it.)
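A rough sketch of that scheme (chacha20_expand() is a stand-in for whatever stream primitive you use, and the final memset would need one of the non-optimizable variants discussed elsewhere in this thread):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define POOL 512
    #define KEY  32

    static uint8_t pool[KEY];   /* the current seed */

    /* stand-in: fill out[0..n) with the keystream derived from key */
    extern void chacha20_expand(uint8_t *out, size_t n, const uint8_t key[KEY]);

    /* hand out up to POOL - KEY bytes, then forget the key that made them */
    void random_bytes(uint8_t *out, size_t n)
    {
        uint8_t batch[POOL];
        if (n > POOL - KEY) n = POOL - KEY;         /* sketch: clamp the request */
        chacha20_expand(batch, sizeof batch, pool); /* expand the current seed */
        memcpy(pool, batch, KEY);                   /* first 32 bytes: next seed */
        memcpy(out, batch + KEY, n);                /* the rest goes to the caller */
        memset(batch, 0, sizeof batch);             /* fast key erasure; see the
                                                       memset caveats below */
    }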

---

By the way, DJB points out that our go-to persistent storage, hard drives and SSDs, aren't exactly easy to erase. Which is quite crucial for the forward secrecy that fast key erasure achieves. My opinion is that every computer should come with a dedicated persistent writeable (and over-writeable) memory. The tiniest chip should be able to afford 32 bytes of such memory.


And risk having attackers control the randomness that comes into your entropy pool? Forget it, Chacha20 with fast key erasure is safer.

And if you want to re-seed anyway, pick your randomness from a trusted server.


> One interesting tidbit is that errno is sometimes not a global variable at all, but a macro for (*_Error()). Having to set a real data object immediately after performing hardware floating point ops would break the FPU pipeline. Allowing the check to be deferred until requested with this _Error() function doesn’t break the pipeline.

The best reason I can think of to make errno a function rather than a global integer is due to threads. The function can handle thread local storage. With a simple declaration of "extern int errno;" all threads get the same errno and clobber each other.

Nowadays there are things like __thread or C11/C++11 thread_local, but those mechanisms didn't exist when I first saw this pattern.


Indeed that is the reason. On my system, I have:

    #define errno (*__errno_location ())


This snippet is not correct:

  #include <limits.h>
  #if ULONG_MAX != -1UL
  #error "This code requires 2s' complement arithmetic"
  #endif
The #error case will never fire, because the behaviour of the unsigned types is completely defined in C - it doesn't matter whether the system uses 2s complement or not: the result of applying the negation operator to 1UL must always be ULONG_MAX.


He really should have noted the misdefinition of the truncating versions in string.h. E.g. strncpy doesn't leave enough room for the final \0, so the string may be left unterminated. A serious POSIX error.

This is worse than the wchar_t iswalpha trouble with older UCS-2 implementations (Windows, AIX).

Or the implementation trouble with memset, which can either be optimized away without a compiler barrier, or, on modern hyperthreaded microarchitectures that violate load-store ordering, needs a full memory barrier (mfence). No single libc does it properly, not even crypto libs such as libsodium.


That isn't an error in strncpy; it's for manipulating fixed-length fields, which is why it zeroes the entire remaining field.

strncpy is commonly misused as a strlcpy, though, in part because glibc continues to refuse to add strlcpy. strlcpy provides the functionality most people want from strncpy but do not get.


Thanks for pointing this out.

Can you give an example of the strncpy problem?

Also is there a precaution programs can take to use memset more safely?


Replace strncpy and strcpy altogether with calls to snprintf. It takes a fixed buffer size, terminates correctly with the null character, and safely does everything strncpy does and more. It's standard C since C99, so it should be portable to most systems too.

And yes, maybe it'll impact performance. Worry about that _after_ you profile your code and have the numbers to show it -- I'd bet good money that 95% of developers will never need to worry about it.
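For instance, a truncating copy plus truncation check in one call (a sketch; the helper name is mine):

    #include <stdio.h>

    /* returns 0 on success, -1 on error or truncation; dst is always
       NUL-terminated as long as dstsize > 0 */
    int copy_str(char *dst, size_t dstsize, const char *src)
    {
        int n = snprintf(dst, dstsize, "%s", src);
        return (n < 0 || (size_t)n >= dstsize) ? -1 : 0;
    }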


snprintf is the same trash, just slower. See e.g. http://blog.infosectcbr.com.au/2018/11/memory-bugs-in-multip... discussing the need for an improved scnprintf in the Linux kernel.


strncpy does not add a '\0' if the string does not fit into the target buffer; it just cuts it down to the max size.

So to be on the safe side you need to wrap strncpy calls into a function that always puts a '\0' on the very last index of the destination string buffer.
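Something like this, for example (a sketch; the name is made up):

    #include <string.h>

    char *strncpy_z(char *dst, const char *src, size_t n)
    {
        strncpy(dst, src, n);
        if (n > 0)
            dst[n - 1] = '\0';  /* force termination even when src didn't fit */
        return dst;
    }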

As for memset, the best workaround is to use OS specific variants that aren't subject to such issues, for example SecureZeroMemory() on Windows.


SecureZeroMemory() is insecure on hyperthreaded systems, and probably on normal SMT systems also. It only uses a compiler-barrier, ensuring that it is not optimized away. This is not secure, it is just a basic precaution against wrong compiler optimizations.

In reality the compiler should know about the libc memset and memzero and refuse to optimize it away, and memset_s/SecureZeroMemory needs a memory barrier.


explicit_bzero() is another OS-specific variant.

memset_s() is now part of standard C, but it is a pain to use because it isn't just a don't-optimize-away memset().
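Where it is implemented, usage looks roughly like this (note the opt-in macro, the extra bound parameter, and the errno_t return):

    #define __STDC_WANT_LIB_EXT1__ 1   /* opt in to Annex K, where supported */
    #include <string.h>

    void wipe_key(unsigned char key[32])
    {
        /* unlike memset, takes a bound and returns an errno_t */
        memset_s(key, 32, 0, 32);
    }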


memset_s() is defined in C11 Annex K, which is optional and is not provided by most implementations.


AFAIK my own memset_s is the only secure one: https://github.com/rurban/safeclib/tree/master/src/mem but nobody cares. It was not mentioned at this year's CCC talk about memset. Everybody thinks a simple compiler barrier is enough; it is not.

The strncpy truncating problem is widely known: https://www.google.com/search?q=strncpy+truncating+problem

But conceptually the biggest problem is the inability to deal with strings properly at all. You mentioned the iswalpha problem and the need for external libs, but the standard cannot even search for Unicode strings properly (no normalization wcsnorm, no wcsfc, no UTF-8 u8 API), and sorting likewise needs an API for the Unicode version being used. It's relative and changes every year. And every libc is hopelessly behind. The most basic coreutils still cannot search for foreign strings (grep, sort, wc, cut, expand, ...): http://perl11.org/blog/foldcase.html https://crashcourse.housegordon.org/coreutils-multibyte-supp...



Nitpick: the text colour is a light gray (#3A4145). Even though I have good eyesight, I was struggling. Please consider accessibility when designing your web pages :)

Thankfully the web is hackable so I could just inspect element > change color to #000000.


That's interesting because the contrast is around 12:1 - which is well above the AAA accessibility guidelines for small text.

Perhaps the font and weight are contributing factors?


"C99 provides a macro SIZE_MAX with the maximum value possible in size_t. C89 doesn’t have it, although you can obtain the value by casting (size_t)-1. This assumes a twos’ complement architecture,..."

The architecture does not have to be two's complement. The value of the expression (size_t)(-1) will always be SIZE_MAX, the largest number that can be represented in the size_t type. The bit pattern may change depending on the architecture, but the value will always equal SIZE_MAX.
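A one-liner to convince yourself, valid under C89 on any conforming implementation:

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        /* converting -1 to an unsigned type is defined as reduction modulo
           2^N, which yields the type's maximum value on ANY representation */
        printf("%lu\n", (unsigned long)(size_t)-1);
        return 0;
    }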


If you use assertions you can combine the expression with && "this error shouldn't happen because..." to get more context.


Is this book as worth having as the K&R one, or is it too outdated considering we are upon C2X?


I read through it and found it interesting for the historical info and as a way to review the standard library.

But if you're trying to get better at C, there are probably a lot of resources that are better uses of your time. It's pretty dated, the structure is a little hard to follow (he goes in alphabetical order of the header names, which, given that a lot of the headers reference each other, makes for somewhat confusing reading), and since he isn't targeting any particular architecture, the OS-dependent stuff is pretty handwavy.


No, but I don't think the K&R one is worthwhile either. It's far too outdated and teaches some bad practices. C99 in particular brought major improvements to the C language that you should be using. And C11 and C17/18 continued to refine the language in incremental ways.


> I don't think the K&R one is worthwhile either. It's far too outdated and teaches some bad practices.

It’s old, but I didn’t see anything horrible in the book. What bad practices does it teach?



Not a fan of Learn C the Hard Way, but I will agree with you that K&R is a bit light on details of how/why to write safe C code. I stand by K&R being the best book for learning C that I’ve read, though it goes nicely with a good application security curriculum ;)


I think along similar lines to Simplicio, but I would like to point you to Hanson's "C Interfaces and Implementations", which will show you how to organize your C code well and how to implement ADTs robustly. It is pretty educational and enjoyable to read.


The author describes ctype.h as "kind of a letdown" and as "not sufficient for international text processing" and then proceeds to write an isalpha() that's not portable C because he assumes an ASCII derivative (the code won't work under e.g. an EBCDIC system).

A lot of things in the C Standard Library look odd or pointless until you do some digging and realize that they were often designed to be portable to systems that are either really obscure today, or don't exist anymore.


You mean where I'm comparing whether the character is between 'a' and 'z'? Yeah I'm aware that's an ASCII thing, and that the standard was designed to accommodate other encodings such as EBCDIC. My `isalpha` example was more about showing how such a macro ought to be careful to not evaluate its argument more than once.
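For example, a deliberately naive sketch of the pitfall (not the article's actual code; it assumes an ASCII-derived encoding, as discussed above):

    /* evaluates c more than once: BAD_ISALPHA(*p++) may advance p
       two to four times depending on short-circuiting */
    #define BAD_ISALPHA(c) \
        (((c) >= 'a' && (c) <= 'z') || ((c) >= 'A' && (c) <= 'Z'))

    /* one fix: a function evaluates its argument exactly once */
    static int my_isalpha(int c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }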

Did you read the rest of that section? I have arguments about why ctype appears to be insufficient.


I just meant to add a bit of info to an already excellent overview, not to detract from it. Yes, most of the rest of the section is about unrelated pitfalls.

The non-ASCII use-case is something that's often forgotten these days, and until you mentioned it here there was no indication in the document that you were aware of it. I.e. the section discusses "non-English" non-8-bit languages etc.

But the most basic use of ctype.h is e.g. checking if something is in the 0-9a-zA-Z range for the purposes of a basic parser, and you can't do that portably in C without ctype.h, or effectively writing your own ctype.h implementation.


> until you mentioned it here there was no indication in the document that you were aware of it

Good point, and thanks for pointing it out. People can't read my mind and the example might give people the wrong idea. I'll add a note to the example.



