I like it, but the array details are a little bit off. An actual array does have a known size, that's why when given a real array `sizeof` can give the size of the array itself rather than the size of a pointer. There's no particular reason why C doesn't allow you to assign one array to another of the same length, it's largely just an arbitrary restriction. As you noted, it already has to be able to do this when assigning `struct`s.
Additionally a declared array such as `int arr[5]` does actually have the type `int [5]`, that is the array type. In most situations that decays to a pointer to the first element, but not always, such as with `sizeof`. This becomes a bit more relevant if you take the address of an array as you get a pointer to an array, Ex. `int (*ptr)[5] = &arr;`. As you can see the size is still there in the type, and if you do `sizeof *ptr` you'll get the size of the array.
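For instance (a quick sketch, assuming 4-byte int and 8-byte pointers):

    #include <stdio.h>

    int main(void) {
        int arr[5];
        int (*ptr)[5] = &arr;           /* pointer to the whole array, type int (*)[5] */

        printf("%zu\n", sizeof arr);    /* 20: the array type int[5] keeps its size */
        printf("%zu\n", sizeof *ptr);   /* 20: the size survives in the pointed-to type */
        printf("%zu\n", sizeof ptr);    /* 8: just a pointer */
        return 0;
    }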
I really wish that int arr[5] adopted the semantics of struct { int arr[5]; } -- that is, you can copy it, and you can pass it through a function without it decaying to a pointer. Right now in C:
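(The original snippet wasn't quoted here; a sketch along these lines, assuming a 32-bit target for the first number, matches the output described next:)

    #include <stdio.h>

    struct wrap { int arr[5]; };

    void take_array(int arr[5])     { printf("%zu, ", sizeof arr); }   /* decays: sizeof(int *) */
    void take_struct(struct wrap w) { printf("%zu, ", sizeof w.arr); } /* no decay: sizeof(int[5]) */

    int main(void) {
        int arr[5] = {0};
        struct wrap w = {{0}};
        take_array(arr);
        take_struct(w);
        printf("%zu, %zu\n", sizeof arr, sizeof w.arr);
        return 0;
    }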
will print 4, 20, 20, 20. I understand that array types having their sizes in their types was one of Kernighan's gripes with Pascal [0], which likely explains why arrays decay to pointers, but for those cases I'd say you should still be able to decay to a pointer if you really wanted to, passing an explicit length parameter.
> I really wish that int arr[5] adopted the semantics of struct { int arr[5]; }
You and me both. In fact, D does this. `int arr[5]` can be passed as a value argument to a function, and returned as a value argument, just as if it was wrapped in a struct.
It's sad that C (and C++) take every opportunity to instantly decay the array to a pointer, which I've dubbed "C's Biggest Mistake":
I have long been convinced that WG14 has no real interest in improving C's security beyond what a Macro Assembler already offers out of the box.
Even the few "security" attempts that they have made, still require separate pointer and length arguments, thus voiding any kind of "security" that the functions might try to achieve.
However even a Macro Assembler is safer than modern C compilers, as they don't remove your code when one steps into a UB mine.
Earlier versions of gcc actually used to support this in a very restricted context in C90 (or maybe gnu89) mode:
struct foo { int a[10]; };
struct foo f(void);
int b[10];
b = f().a;
In C90, you can't actually do anything with `f().a` because the conversion from array to pointer only happened to lvalues (`f().a` is not an lvalue), and assignment is not defined for array variables (though gcc allowed it). The meaning was changed in C99 so that non-lvalue arrays are also converted to pointers. gcc used to take this distinction into account, so the above program would compile in C90 mode but not in C99 mode. New versions of gcc seem to forbid array assignment in all cases.
I think this quirk also means that it's technically possible to pass actual arrays to variadic functions in C90, since there was nothing to forbid the passing (it worked in gcc at least, though in strict C90, you wouldn't be able to use the non-lvalue array). In C99 and above, a pointer will be passed instead.
> There's no particular reason why C doesn't allow you to assign one array to another of the same length
Actually, there is a particular (though not necessarily good) reason, since that would require the compiler to either generate a loop (with a conditional branch) for an (unconditional) assignment or generate unboundedly many assembly instructions (essentially an unrolled loop) for a single source operation.
Of course, that stopped being relevant when they added proper (assign, return, etc) support for structs, which can embed arrays anyway, but that wasn't part of the language initially.
Another weird property about C arrays is that &arr == arr. The reference of an array is the pointer to the first element, which is what `arr` itself decays to. If arr was a pointer, &arr != arr.
&arr is a pointer to the array. It will happen to point to the same place as the first element, but in fact they have different types, and e.g. (&arr)[0] == arr != arr[0].
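A small illustration of that distinction (variable names are made up):

    #include <stdio.h>

    int main(void) {
        int arr[5] = {1, 2, 3, 4, 5};

        int *first      = arr;     /* decays: pointer to the first element, type int *  */
        int (*whole)[5] = &arr;    /* pointer to the whole array, type int (*)[5]       */

        printf("%p %p\n", (void *)first, (void *)whole);   /* same address...           */
        printf("%d\n", (&arr)[0] == arr);                  /* 1: (&arr)[0] is the array */
                                                           /*    itself, which decays   */
        printf("%d\n", arr[0]);                            /* ...while arr[0] is an int */
        return 0;
    }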
> There's no particular reason why C doesn't allow you to assign one array to another of the same length, it's largely just an arbitrary restriction.
IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars. If you need assignment that might do O(N) work, you need to call a stdlib function (memcpy/memmove) instead. If you need an allocation that might do O(N) work, you either need a function (malloc) or you need to do your allocation not-at-runtime, by structuring the data in the program's [writable] data segment, such that it gets "allocated" at exec(2) time.
This is really one of the biggest formal changes between C and C++ — C++ assignment, and keywords `new` and `delete`, can both do O(N) work.
(Before anyone asks: a declaration `int foo[5];` in your code doesn't do O(N) work — it just moves the stack pointer, which is O(1).)
This depends on what you consider to be O(1) - given that the size of the array is fixed, it's by definition O(1) to copy it, but I think I see your point. I don't think it holds in general, though: C often supports integer types that are too large to be copied in a single instruction on the target CPU, so the copy becomes a multi-instruction affair. If you consider that to still be O(1), then I think it's splitting hairs to say a fixed-size array copy would be O(N) when it's still just a fixed number of instructions or loop iterations to achieve the copy.
Beyond that, struct assignments can already generate loops of as large a size as you want, Ex: https://godbolt.org/z/8Td7PT4af
I think the meaning here is that assignment is never O(N) for any variable N computed at runtime. Of course, you can create arbitrarily large assignments at compile time, but this always has an upper bound for a given program.
Then you are wrong, since we're already talking about arrays of sizes known at compile time. Indeed, otherwise we would also need to remember the size at runtime.
I don't think we're actually in disagreement here. It looks like I misread the parent comment to be claiming that fixed-size array assignment ought to be considered O(N), when no such claim is made.
Yeah to clarify I'm definitely in agreement with you that it's O(1), the size is fixed so it's constant time. It's not like the 'n' has to be "sufficiently small" or something for it to be O(1), it just has to be constant :)
People are being very loose about what O(n) means so I attempted to clarify that a bit. Considering what assignments can already do in C it's somewhat irrelevant whether they think it's O(n) anyway, it doesn't actually make their point correct XD
How do you think it works? Does the compiler generate some kind of stack alloc?
Stupid question: Does that mean a huge value for 'n' can cause stack overflow at runtime? I recall that threads normally get a fixed size stack size, e.g., 1MB.
Yes, it causes stack overflow at runtime. Compilers warn for it, in particular clang has a warning that you can configure to pop up whenever the stack usage of a function goes beyond some limit you set - I think that setting it to 32k or 64k is a safe and sane default as e.g. macOS thread stack sizes are just 512kb
It just moves the stack pointer by n, which is O(1). It doesn't initialize it, of course. But my point is that the array size isn't known at compile time.
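A quick sketch of the VLA case being discussed here (the size only becomes known at runtime, so a huge n is exactly what can blow the stack):

    #include <stdio.h>

    void demo(size_t n) {            /* assumes n >= 1 */
        int vla[n];                  /* stack pointer moves by n * sizeof(int); no initialization */
        for (size_t i = 0; i < n; i++)
            vla[i] = (int)i;
        printf("%d\n", vla[n - 1]);
    }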
> IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars.
this is absolutely and entirely wrong. You can assign a struct in C and the compiler will call memcpy when you do.
It's O(1) relative to any size computed at runtime: that is, running the same program (with the same array size) on different inputs will always take the same amount of work for a given assignment.
We're in the context of the assignment operation in the language here. Yes, in C you can only assign statically-known types, but that does not mean you can just ignore that `a = f();` may take a very different amount of time depending on the types of a and f.
Well C does allow "copying" an array if it's wrapped inside a struct, which does not make it O(1). gcc generates calls to memcpy in assembly for copying the array.
Fixed-width divisions are O(1), just comparatively expensive (and potentially optimized to run in variable time). Consider that you can do long division on pairs of numerals of, say, up to 20 digits and be Pretty Confident of an upper bound on how long it's going to take you (you know it's not going to take more than 20 rounds), even though it's going to take you longer to do that than it would for you to add them.
Interesting, I didn't fully realise that. That it's arbitrary is annoying, I clearly had tried to rationalise it to myself! Thanks for the comments, will get around to amending
int, like all other integer types except char, is indeed signed by default.
Aside: signedness semantics of char is implementation-defined. However, the type char itself is always distinct from both signed char and unsigned char.
Really, are multidimensional arrays an important part of the language?
The above code looks like it's indexing into an array of pointers. If you want a flat array, make a few inlined helper functions that do the multiplying and adding. Your code will be much cleaner and easier to understand.
Of course multidimensional arrays are an important part of the language, just as the ability to have structs inside structs.
> The above code looks like it's indexing into an array of pointers. If you want a flat array, make a few inlined helper functions that do the multiplying and adding. Your code will be much cleaner and easier to understand.
It is a "flat" array already, not an array of pointers: [0]. No need to hand-write the code the compiler already generates for you.
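A quick illustration that a 2-D array is a single contiguous block (assuming 4-byte int for the sizes shown):

    #include <stdio.h>

    int main(void) {
        int grid[3][4] = {0};            /* one contiguous block of 12 ints, no pointer table */

        printf("%zu\n", sizeof grid);    /* 48: 3*4 ints and nothing else                 */
        printf("%zu\n", sizeof grid[0]); /* 16: each "row" is a real int[4], not a pointer */

        grid[2][1] = 42;                 /* compiled as base + (2*4 + 1) * sizeof(int)    */
        printf("%d\n", grid[2][1]);
        return 0;
    }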
Kind of. But the restriction is in keeping with the C philosophy of no hidden implementation magic. C has the same restriction on structs. It's the same question: it's a block of bytes whose size is known to the compiler, so it could easily abstract the copy away. But assignment is always a very cheap operation in C. If we allow assignment to stand in for memcpy(), that property is no longer true.
Same reason why Rust requires you to .clone() so much. It could do many of the explicit copies transparently, but you might accidentally pass around a 4 terabyte array by value and not notice.
> But assignment is always a very cheap operation in C.
That's just not true though, you can assign structs of an arbitrarily large size to each other and compilers will emit the equivalent of `memcpy()` to do the assignment. They might actually call `memcpy()` automatically depending on the particular compiler.
The fact that if you wrap the array in a struct then you're free to copy it via assignment makes it arbitrary IMO.
Perhaps I am missing something in the spec - but trying this in various compilers, it seems that you *can* assign structs holding arrays to one another, but you *cannot* assign arrays themselves.
This compiles:
    struct BigStruct {
        int my_array[4];
    };

    int main() {
        struct BigStruct a;
        struct BigStruct b;
        b = a;
    }
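(The second snippet wasn't included in the quote; presumably it was the direct array version, which compilers reject:)

    int main() {
        int a[4];
        int b[4];
        b = a;   /* error: arrays are not assignable */
    }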
In the first example a & b are variables, which can be assigned to each other. In the second a & b are pointers, but b is fixed, so you can not assign a value to it.
They're pointers, just weird ones. The compiler knows it's an array, so it gives the result of the actual amount of space it takes up. If you passed it into a function and used the sizeof operator in the function, it'd give `sizeof(int *)`. sizeof is a compile-time operation, so the compiler still knows that info in your example.
That just means it decays into a pointer when passed as a function argument. In the example given, however, it's not a pointer, just like it wouldn't be inside a struct.
Essentially ‘b = a’ in the second example is equivalent to ‘b = &a[0]’, i.e. trying to assign a pointer into an array.
This is because if you use an array in an expression, its value is (most of the time) a pointer to the array's first element. But the left-hand side of the assignment is not converted this way, so it still refers to b, the array.
Example one works because no arrays appear on the expression side, so this shorthand, so to speak, is avoided.
Arrays can be a painful edge case in C; variable length arrays, for example, are hair-pulling.
Specifically, arrays [T; N] are Copy precisely when T is Copy. So an array of 32-bit unsigned integers [u32; N] can be copied, and so can an array of immutable string references like ["Hacker", "News", "Web", "Site"], but an array of mutable Strings cannot.
The array of mutable Strings can be memcpy'd and there are situations where that's actually what Rust will do, but because Strings aren't Copy, Rust won't let you keep both - if it did this would introduce mutable aliasing and so ruin the language's safety promise.
By far my biggest regret is that the learning materials I was exposed to (web pages, textbooks, lectures, professors, etc.) did not mention or emphasize how insidious undefined behavior is.
Two of the worst C and C++ debugging experiences I had followed this template: Some coworker asked me why their function was crashing, I edit their function and it sometimes crashes or doesn't depending on how I rearrange lines of code, and later I figure out that some statement near the top of the function corrupted the stack and that the crashes had nothing to do with my edits.
Undefined behavior is deceptive because the point at which the program state is corrupted can be arbitrarily far away from the point at which you visibly notice a crash or wrong data. UB can also be non-deterministic depending on OS/compiler/code/moonphase. Moreover, "behaving correctly" is one legal behavior of UB, which can fool you into believing your program is correct when it has a hidden bug.
The take-home lesson about UB is to only rely on following the language rules strictly (e.g. don't dereference null pointer, don't overflow signed integer, don't go past end of array). Don't just assume that your program is correct because there were no compiler warnings and the runtime behavior passed your tests.
Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction. A UB-having program could time-travel back to the start of the universe, delete it, and replace the entire universe with a version that did not give rise to humans and thus did not give rise to computers or C, and thus never exist.
It's so insidiously defined because compilers optimize based on UB; they assume it never happens and will make transformations to the program whose effects could manifest before the UB-having code executes. That effectively makes UB impossible to debug. It's monumentally rude to us poor programmers who have bugs in our programs.
I'm not sure that's a productive way to think about UB.
The "weirdness" happens because the compiler is deducing things from false premises. For example,
1. Null pointers must never be dereferenced.
2. This pointer is dereferenced.
3. Therefore, it is not null.
4. If a pointer is provably non-null, the result of `if(p)` is true.
5. Therefore, the conditional can be removed.
There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior, but deep down, there is some kind of logic to it. It's not as if the compiler writers are doing
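(The snippet that followed wasn't quoted; judging from the replies, it was deliberately silly pseudocode along these lines, with both names made up:)

    if (statement_might_be_ub(stmt))   /* hypothetical analysis */
        emit_nasal_demons();           /* hypothetical sabotage */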
The C and C++ (and D) compilers I wrote do not attempt to take advantage of UB. What you got with UB is what you expected to get - a seg fault with a null dereference, and wraparound 2's complement arithmetic on overflow.
I suppose I think in terms of "what would a reasonable person expect to happen with this use of UB" and do that. This probably derives, again, from my experience designing flight critical aircraft parts. You don't want to interpret the specification like a lawyer looking for loopholes.
It's the same thing I learned when I took a course in high-performance race driving. The best way to avoid collisions with other cars is to be predictable. It's doing unpredictable things that causes other cars to crash into you. For example, I drive at the same speed as other traffic, and avoid overtaking on the right.
I think this is a core part of the problem; if the default for everything was to not take advantage of UB things would be better - and we're fast enough that we shouldn't NEED all these optimizations except in the most critical code; perhaps.
You should need something like
gcc --emit-nasal-daemons
to get the optimizations that can hide UB, or at least horrible warnings that "code that looks like it checks for null has been removed!!!!".
AFAIK GCC does have switches to control optimizations, the issues begin when you want to use something other than GCC, otherwise you're just locking yourself to a single compiler - and at that point might as well switch to a more comfortable language.
In the really old DOS days, when you wrote to a null pointer, you overwrote the DOS vector table. If you were lucky, fixing it was just a reboot. If you were unlucky, it scrambled your disk drive.
It was awful.
The 8086 should have been set up so the ROM was at address 0.
This is the right approach IMO, but sadly the issue is that not all C compilers work like that even when they could (e.g. they target the same CPU), so even if one compiler guarantees it won't introduce bugs from an overzealous interpretation of UB, unless you are planning to never use any other compiler you'll still be subject to said interpretations.
And if you do decide that sticking to a single compiler is best then might as well switch to a different and more comfortable language.
This is the problem; every compiler outcome is a series of small logic inferences that are each justifiable by language definition, the program's structure, and the target hardware. The nasal demons are emergent behavior.
It'd be one thing if programs hitting UB just vanished in a puff of smoke without a trace, but they don't. They can keep on spazzing out literally forever and do I/O, spewing garbage to the outside world. UB cannot be contained even to the process at that point. I personally find that offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems. One mistake and you invite the wrath of God!
> I personally find that offensive and rude that tools get away with being so garbage that they can't even promise to help you crash and diagnose your own problems.
This is literally why newer languages like Java, JavaScript, Python, Go, Rust, etc. exist. With the hindsight of C and C++, they were designed to drastically reduce the types of UB. They guarantee that a compile-time or run-time diagnostic is produced when something bad happens (e.g. NullPointerException). They don't include silly rules like "not ending a file with newline is UB". They overflow numbers in a consistent way (even if it's not a way you like, at least you can reliably reproduce a problem). They guarantee the consistent execution of statements like "i = i++ + i++". And for all the flak that JavaScript gets about its confusing weak type coercions, at least they are coded in the spec and must be implemented in one way. But all of these languages are not C/C++ and not compatible with them.
Yes, and my personal progression from C to C++ to Java and other languages led me to design Virgil so that it has no UB, has well-defined semantics, and yet crashes reliably on program logic bugs giving exact stack traces, but unlike Java and JavaScript, compiles natively and has some systems features.
Having well-defined semantics means that the chain of logic steps taken by the compiler in optimizing the program never introduces new behaviors; optimization is not observable.
It can get truly bizarre with multiple threads. Some other thread hits some UB and suddenly your code has garbage register states. I've had someone UB the fp register stack in another thread so that when I tried to use it, I got their values for a bit, and then NaN when it ran out. Static analysis had caught their mistake, and then a group of my peers looked at it and said it was a false warning leaving me to find it long afterwards... I don't work with them anymore, and my new project is using rust, but it doesn't really matter if people sign off on code reviews that have unsafe{doHorribleStuff()}
On the contrary, the latter is a far more effective way to think about UB. If you try to imagine that the compiler's behaviour has some logic to it, sooner or later you will think that something that's UB is OK, and you will be wrong. (E.g. you'll assume that a program has reasonable, consistent behaviour on x86 even though it does an unaligned memory access). If you look at the way the GCC team responds to bug reports for programs that have undefined behaviour, they consider the emit_nasal_demons() version to be what GCC is designed to do.
> There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior
The problem is that, due to other optimisations (mainly inlining), the emergent misbehaviour can occur in a seemingly unrelated part of the program. This can make the inference chain very difficult to follow, as you have to trace paths through the entire execution of the program.
The issue occurs for other types of data corruption too, which is why NPEs are so disliked, but UB's blast radius is both larger and less reliable.
I agree with the factual things that you said (e.g. "entire program execution was meaningless"). Some stuff was hyperbolic ("time-travel back to the start of the universe, delete it").
> [compilers] will make transformations to the program whose effects could manifest before the UB-having code executes [...] It's monumentally rude to us poor programmers who have bugs in our programs.
The first statement is factually true, but I can provide a justification for the second statement which is an opinion.
Consider this code:
    void foo(int x, int y) {
        printf("sum %d", x + y);
        printf("quotient %d", x / y);
    }
We know that foo(0, 0) will cause undefined behavior because it performs division by zero. Integer division is a slow operation, and under the rules of C, it has no side effects. An optimizing compiler may choose to move the division operation earlier so that the processor can do other useful work while the division is running in the background. For example, the compiler can move the expression x / y above the first printf(), which would totally be legal. But then, the behavior is that the program would appear to crash before the sum and first printf() were executed. UB time travel is real, and that's why it's important to follow the rules, not just make conclusions based on observed behavior.
No, and, in fact, the first one isn't valid - you can use C++ (or a subset of it) for the same performance profile with fewer footguns.
So really the only time to use C is when the codebase already has it and there is a policy to stick to it even for new code, or when targeting a platform that simply doesn't have a C++ toolchain for it, which is unfortunately not uncommon in embedded.
The basic deal is that in the presence of undefined behavior, there are no rules about what the program should do.
So if you as a compiler writer see: we can do this optimization and cause no problems _except_ if there's division by zero, which is UB, then you can just do it anyway without checking.
Only non-zero integer division is specified as having no side effects.
Division by zero is in the C standard as "undefined behavior" meaning the compiler can decide what to do with it, crashing would be nice but it doesn't have to. It could also give you a wrong answer if it wanted to.
Edit: And just to illustrate, I tried in clang++ and it gave me "5 / 0 = 0" so some compilers in some cases indeed make use of their freedom to give you a wrong answer.
To my downvoters, since I can no longer edit: I've been corrected that the rule is integer division has no side effects except for dividing by zero. This was not the rule my parent poster stated.
No you haven't. The incorrect statement was a verbatim quote from nayuki's post, which you were responding to. Please refrain from apologising for other people gaslighting you (edit: particularly, but not exclusively, since it sets a bad precedent for everyone else).
At the CPU level, division by zero can behave in a number of ways. It can trap and raise an exception. It can silently return 0 or leave a register unchanged. It might hang and crash the whole system. The C language standard acknowledges that different CPUs may behave differently, and chose to categorize division-by-zero under "undefined behavior", not "implementation-defined behavior" or "must trap".
I wrote:
> Integer division is a slow operation, and under the rules of C, it has no side effects.
This statement is correct because if the divisor is not zero, then division truly has no side effects and can be reordered anywhere. If the divisor is zero, the C standard says it's undefined behavior, so that case is irrelevant and can be disregarded; hence we can assume that division always has no side effects. It doesn't matter if the underlying CPU has a side effect for div-by-zero or not; the C standard permits the compiler to completely ignore this case.
> > Integer division is a slow operation, and under the rules of C, it has no side effects.
Yes, you did, and while that's a reasonable approximation in some contexts, it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour. (Arguably that means it has every possible side effect, but that's more of a philosophical issue. In practice it has various specific side effects like crashing, which are specific realizations of its theoretical side effect of invoking undefined behaviour.)
vikingerik's statement was correct:
> [If "Integer division [...] has no side effects",] Then C isn't following this rule - crashing is a pretty major side effect.
> it is false in the general case, since division by zero has a side effect in the form of invoking undefined behaviour.
They were careful to say “under the rules of C,” the rules define the behaviour of C. On the other hand, undefined behaviour is outside the rules, so I think they’re correct in what they’re saying.
The problem for me is that the compiler is not obliged to check that the code is following the rules. It puts so much extra weight on the shoulders of the programmer, though I appreciate that using only rules which can be checked by the compiler is hard too, especially back when C was standardised.
> They were careful to say "under the rules of C,"
Yes, and under the rules of C, division by zero has a side effect, namely invoking undefined behaviour.
> The problem for me is that the compiler is not obliged to check that the code is following the rules.
That part's actually fine (annoying, but ultimately a reasonable consequence of the "rules the compiler can check" issue); the real(ly bad and insidious) problem is that when the compiler does check that the code is following the rules, it's allowed to do it in a deliberately backward way that uses any case of not following the rules as an excuse to break unrelated code.
Undefined behavior is not a side effect to be "invoked" by the rules of C. If UB happens, it means your program isn't valid. UB is not a side effect or any effect at all, it is the void left behind when the system of rules disappears.
> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.
This is the greatest sin modern compiler folks have committed in abusing C. C as a language never says the compiler can change the code arbitrarily due to a UB statement. It is undefined. Most UB code in C, while not fully defined, has an obvious core of semantics that everyone understands. For example, an integer overflow, while the final value is not defined, is still understood to be an operation that updates a value. It is definitely not, e.g., an assertion on the operands on the grounds that UB can't happen.
Think about our natural language, which is full of undefined sentences. For example, "I'll lasso the moon for you". A compiler, which is a listener's brain, may not fully understand the sentence and it is perfectly fine to ignore the sentence. But if we interpret an undefined sentence as a license to misinterpret the entire conversation, then no one would dare to speak.
As computing goes beyond arithmetic and the program grows in complexity, I personally believe some amount of fuzziness is the key. This current narrow view from the compiler folks (and somehow gets accepted at large) is really, IMO, a setback in the computing evolution.
> It is definitely not, e.g., an assertion on the operand because UB can't happen.
The C specification says a program is ill-formed if any UB happens. So yes, the spec does say that compilers are allowed to assume UB doesn't happen. After all, a program with UB is ill-formed and therefore shouldn't exist!
I think you're conflating "unspecified behavior" and "undefined behavior" - the two have different meanings in the spec.
The C standard defines them very differently though:
undefined behavior
    behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

unspecified behavior
    use of an unspecified value, or other behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance
Implementations need not, but obviously may, assume that undefined behavior does not happen. Assume that however the program behaves when undefined behavior is invoked is simply how the compiler chose to implement that case.
"Nonportable" is a significant element of this definition. A programmer who intends to compile their C program for one particular processor family might reasonably expect to write code which makes use of the very-much-defined behavior found on that architecture: integer overflow, for example. A C compiler which does the naively obvious thing in this situation would be a useful tool, and many C compilers in the past used to behave this way. Modern C compilers which assume that the programmer will never intentionally write non-portable code are.... less helpful.
> I disagree on the logic from "ill-formed" to "assume it doesn't happen".
Do you feel like elaborating on your reasoning at all? And if you're going to present an argument, it'd be good if you stuck to the spec's definitions of things here. It'll be a lot easier to have a discussion when we're on the same terminology page here (which is why specs exist with definitions!)
> I admit I don't differentiate those two words. I think they are just word-play.
Unfortunately for you, the spec says otherwise. There's a reason there's 2 different phrases here, and both are clearly defined by the spec.
That's the whole point of UB though: the programmer helping the compiler do deduce things. It's too much to expect the compiler to understand your whole program to know a+b doesn't overflow. The programmer might understand it doesn't though. The compiler relies on that understanding.
If you don't want it to rely on it insert a check into the program and tell it what to do if the addition overflows. It's not hard.
Whining about UB is like reading Shakespeare to your dog and complaining it doesn't follow. It's not that smart. You are though. If you want it to check for an overflow or whatever there is a one liner to do it. Just insert it into your code.
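For example, a plain ISO C way to write the kind of explicit check being described (the helper name is made up; GCC and Clang also provide __builtin_add_overflow for the same job):

    #include <limits.h>
    #include <stdio.h>

    /* Returns 1 and stores the sum if it fits, 0 if it would overflow. */
    int checked_add(int a, int b, int *out) {
        if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
            return 0;
        *out = a + b;
        return 1;
    }

    int main(void) {
        int sum;
        if (checked_add(INT_MAX, 1, &sum))
            printf("sum = %d\n", sum);
        else
            puts("overflow");
        return 0;
    }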
No, the whole (entire, exclusive of that) point of undefined behaviour is to allow legitimate compilers to generate sensible and idiomatic code for whichever target architechture they're compiling for. Eg, a pointer dereference can just be `ld r1 [r0]` or `st [r0] r1`, without paying any attention to the possibility that the pointer (r0) might be null, or that there might be memory-mapped IO registers at address zero that a read or write could have catastrophic effects on.
It is not a licence to go actively searching for unrelated things that the compiler can go out of its way to break under the pretense that the standard technically doesn't explicitly prohibit a null pointer dereference from setting the pointer to a non-null (but magically still zero) value.
> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.
I don't think this is exactly accurate: a program can result in UB given some input, but not result in UB given some other input. The time travel couldn't extend before the first input that makes UB inevitable.
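(The snippet this next comment refers to wasn't quoted; reconstructing its shape from the description, it was presumably something like the following, with the surrounding function made up:)

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    bool set_some_flag = false;

    void clear(void *ptr, size_t size) {
        if (ptr == NULL)
            set_some_flag = true;   /* the optimizer deletes this branch...                  */
        memset(ptr, 0, size);       /* ...because passing NULL here is UB, even if size == 0 */
    }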
Will never see `set_some_flag == true`, as the memset call guarantees that ptr is not null, otherwise it's UB, and therefore the earlier `if` statement is always false and the optimizer will remove it.
Now the bug here is changing the definition of memset to match its documentation a solid, what, 20? 30? years after it was first defined, especially when that "null isn't allowed" isn't useful behavior. After all, every memset ever implemented already totally handles null w/ size = 0 without any issue. And it was indeed rather quickly reverted as a change. But that really broke people's minds around UB propagation with modern optimizing passes.
False. If a program triggers UB, then the behavior of the entire program run is invalid.
> However, if any such execution contains an undefined operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).
Executing the program with that input is the key term. The program can't "take back" observable effects that happen before the input is completely read, and it can't know before reading it whether the input will be one that results in an execution with UB. This is a consequence of basic causality. (If physical time travel were possible, then perhaps your point would be valid.)
The standard does permit time-travel, however. As unlikely as it might seem, I could imagine some rare scenarios in which something seemingly similar happens -- let's say the optimiser reaching into gets() and crashing the program prior to the gets() call that overflows the stack.
Time travel only applies to an execution that is already known to contain UB. How could it know that the gets() call will necessarily overflow the stack, before it actually starts reading the line (at which point all prior observable behavior must have already occurred)?
If you truly believe so, then can you give an example of input-conditional UB causing unexpected observable behavior, before the input is actually read? This should be impossible, since otherwise the program would have incorrect behavior if a non-UB-producing input is given.
If it's provably input-conditional then of course it's impossible. But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB, and it doesn't have to implement "possible" non-UB-containing invocations if you can't find them. E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.
> If it's provably input-conditional then of course it's impossible.
My entire point pertains to programs with input-conditional UB: that is, programs for which there exists an input that makes it result in UB, and there also exists an input that makes it not result in UB. Arguably, it would be more difficult for the implementation to prove that input-dependent UB is unconditional: that every possible input results in UB, or that no possible input results in UB.
> But the C implementation does not have to obey the sequence point rules or perform observable effects in the correct order for invocations that contain UB
Indeed, the standard places no requirements on the observable effects of an execution that eventually results in UB at some point in the future. But if the UB is input-conditional, then a "good" execution and a "bad" execution are indistinguishable until the point that the input is entered. Therefore, the implementation is required to correctly perform all observable effects sequenced prior to the input being entered, since otherwise it would produce incorrect behavior on the "good" input.
> E.g. if you write a program to search for a counterexample to something like the Collatz Conjecture, that loops trying successively higher numbers until it finds one and then exits, GCC may compile that into a program that exits immediately (since looping forever is, arguably, undefined behaviour) - there's a real example of a program that does this for Fermat's Last Theorem.
That only works because the loop has no observable effects, and the standard says it's UB if it doesn't halt, so the compiler can assume it does nothing but halts. As noted on https://blog.regehr.org/archives/140, if you try to print the resulting values, then the compiler is actually required to run the loop to determine the results, either at compile time or runtime. (If it correctly proves at compile time that the loop is infinite, only then can it replace the program with one that does whatever.)
It's also irrelevant, since my point is about programs with input-conditional UB, but the FLT program has unconditional UB.
How this might happen is that one branch of your program may have unconditional undefined behavior, which can be detected at the check itself. This would let a compiler elide the entire branch, even side effects that would typically run.
The compiler can elide the unconditional-UB branch and its side effects, and it can elide the check itself. But it cannot elide the input operation that produces the value which is checked, nor can it elide any side effects before that input operation, unless it can statically prove that no input values can possibly result in the non-UB branch.
That example doesn't contradict LegionMammal978's point though, if I understood correctly. He's saying that the 'time-travel' wouldn't extend to before checking the conditional.
Personally, I've found that some of the optimizations cause undefined behavior, which is so much worse. You can write perfectly good, strict C that does not cause undefined behavior, then one pass of optimization and another together can CAUSE undefined behavior.
When I learned this, if it was and is correct, I felt that one could be betrayed by the compiler.
Optimizations themselves (except for perhaps -ffast-math) can't cause undefined behavior: the undefined behavior was already there. They can just change the program from behaving expectedly to behaving unexpectedly. The problem is that so many snippets, which have historically been obvious or even idiomatic, contain UB that has almost never resulted in unexpected behavior. Modern optimizing compilers have only been catching up to these in recent years.
There have been more than a few compiler bugs that have introduced UB and then that was subsequently optimized, leading to very incorrect program behavior.
A compiler bug cannot introduce UB by definition. UB is a contract between the coder and the C language standard. UB is solely determined by looking at your code, the standard, and the input data; it is independent of the compiler. If the compiler converts UB-free code into misbehavior, then that's a compiler bug / miscompilation, not an introduction of UB.
A compiler bug is a compiler bug, UB or not. You might as well just say "There have been more than a few compiler bugs, leading to very incorrect program behavior."
The whole thread is about how UB is not like other kinds of bugs. Having a compiler optimization erroneously introduce a UB operation means that downstream the program can be radically altered in ways (as discussed in this thread) that don't happen in systems without the notion of UB.
While it's technically true that any compiler bug (in any system) introduces bizarre, incorrect behavior into a program, UB just supercharges the things that can go wrong due to downstream optimizations. And incidentally, makes things much, much harder to diagnose.
I just don't think it makes much sense to say that an optimization can "introduce a UB operation". UB is a property of C programs: if a C program executes an operation that the standard says is UB, then no requirement is imposed on the compiler for what should happen.
In contrast, optimizations operate solely on the compiler's internal representation of the program. If an optimization erroneously makes another decide that a branch is unreachable, or that a condition can be replaced with a constant true or false, then that's not "a UB operation", that's just a miscompilation.
The latter set of optimizations is just commonly associated with UB, since C programs with UB often trigger those optimizations unexpectedly.
LLVM IR has operations that have UB for some inputs. It also has poison values that act...weird. They have all the same implications of source-level UB, so I see no need to make a distinction. The compiler doesn't.
Miscompilations are rarer and less annoying in compilers that do not have the design behaviour of compiling certain source code inputs into bizarre nonsense that bears no particular relation to those inputs.
You realize these two statements are equivalent, right?
> compiling certain source code inputs into bizarre nonsense
> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities
If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible. As long as such compiler conforms to the C standard, you have every right to promote this alternative. Don't shame other people building or using optimizing compilers.
> compiling certain source code inputs into bizarre nonsense
> winning at compiled-binary-execution-speed benchmarks, giving fewer reasons for people to hand-write assembly code for the sake of speed (assembly code is much harder to read/write and not portable), reducing code size by eliminating unnecessary operations (especially -Os), reordering operations to fit CPU pipelines and instruction latencies and superscalar capabilities
Mainstream C compilers actually make special exceptions for the undefined behaviour that's seen in popular benchmarks so that they can continue to "win" at them. The whole exercise is a pox on the industry; maybe at some point in the past those benchmarks told us something useful, but they're doing more harm than good when people use them to pick a language for modern line-of-business software, which is written under approximately none of the same conditions or constraints.
> Don't shame other people building or using optimizing compilers.
The people who are contributing to security vulnerabilities that leak our personal information deserve shame.
It's true that I don't like security vulnerabilities either. I think the question boils down to, whose responsibility is it to avoid UB - the programmer, compiler, or the standard?
I view the language standard as a contract, an interface definition between two camps. If a programmer obeys the contract, he has access to all compliant compilers. If a compiler writer obeys the contract, she can compile all compliant programs. When a programmer deviates from the contract, the consequences are undefined. Some compilers might cater to these cases (e.g. -fwrapv, GNU language extensions) as a superset of all standard-compliant programs.
Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.
> Coming from programming in Java first, I honestly would like to see a lot of UB eliminated from C/C++, downgrading them to either unspecified behavior (weakest), implementation-defined behavior, or single behavior (best). But the correct place to petition is not compiler implementations; we have to change the language standard - the contract that both sides abide by. Otherwise we can only get as far as having a patchwork of vendor-specific language extensions.
That feels backwards in terms of how the C standard actually gets developed - my impression is that most things that eventually get standardised start life as vendor-specific language extensions, and it's very rare for the C standard to introduce something and the compiler vendors then follow.
And really in a lot of cases the concept of UB isn't the problem, it's the compiler culture that's grown up around it. For example, the original reason for null dereference being UB was to allow implementations to trap on null dereference, on architectures where that's cheap, without being obliged to maintain strict ordering in all code that dereferences pointers. It's hard to imagine how what the standard specifies about that case could be improved; the problem is compiler writers prioritising benchmark performance over useful diagnostic behaviour.
> If you don't like the complexity of modern, leading-edge optimizing compilers, you are free to build or support a basic compiler that translates C code as literally as possible.
Most optimizing compilers can do this already, it's just the -O0 flag.
I tried compiling "int x = 1 / 0;" in both the latest GCC and Clang with -O0 on x86-64 on Godbolt. GCC intuitively preserves the calculation and emits an idiv instruction. Clang goes ahead and does constant folding anyway, and there is no division to be seen. So the oft-repeated advice of using -O0 to try to compile the code as literally as possible in hopes of diagnosing UB or making it behave sanely, is not great advice.
I recently dealt with a bit of undefined behavior (in unsafe Rust code, although the behavior here could similarly happen in C/C++) where attempting to print a value caused it to change. It's hard to overstate how jarring it is to see code that says "assert that this value isn't an error, print it, and then try to use it", and have the assertion pass but then have it be printed out as an error and then panic when trying to use it. There's absolutely no reason why this can't happen, since "flipping bits of the value you tried to print" doesn't count as potential UB any less than a segfault, but it can be hard to turn off the part of your brain that is used to assuming that values can't just arbitrarily change at any point in time. "Ignore the rest of the program and do whatever you want after a single mistake" is not a good failure mode, and it's kind of astonishing to me that people are mostly just fine with it because they think they'll be careful enough not to make a mistake ever, or that enough of the time it happened they were lucky that it didn't completely screw them over.
The only reason we use unsafe code on my team's project is because we're interfacing with C code, so it was hard not to come away from that experience thinking that it would be incredibly valuable to shrink the amount of interfacing with C as small as possible, and ideally to the point where we don't need to at all.
It's not insidious at all.
The C compiler offers you a deal: "Hey, my dear programmer, we are trying to make an efficient program here. Sadly, I am not sophisticated enough to deduce a lot of things, but you can help me! Here are some of the rules: don't overflow integers, don't dereference null pointers, don't go outside of array bounds. You follow those and I will fulfill my part of making your code execute quickly".
The deal is known and fair. Just be a responsible adult about it: accept it, live with the consequences and enjoy the efficiency gains. You can reject it, but then don't use arrays without a bounds check (a lot of libraries out there offer that), check your integer bounds or use a sanitizer, check your pointers for null before dereferencing them (there are many tools out there to help you), or... just use another language that does all that for you.
UB was insidious to me because I was not taught the rules (this was back in years 2005 to 2012; maybe it got more attention now), it seemed my coworkers didn't know the rules and they handed me codebases with lots of existing hidden UB, and UB blew up in my face in very nasty ways that cost me a lot of debugging time and anguish.
Also, the UB instances that blew up were already tested to work correctly... on some other platform (e.g. Windows vs. Linux) or on some other compiler version. There are many things in life and computing where when you make a mistake, you find out quickly. If you touch a hot pan, you get a burn and quickly pull away. But if you miswire an electrical connection, it could slowly come loose over a decade and start a fire behind the wall. Likewise, a wrong piece of code that seems to behave correctly at first would lull the author into a false sense of security. By the time a problem appears, the author could be gone, or she couldn't recall what line out of thousands written years ago would cause the issue.
Three dictionary definitions for insidious, which I think are all appropriate: 1) intended to entrap or beguile 2) stealthily treacherous or deceitful 3) operating or proceeding in an inconspicuous or seemingly harmless way but actually with grave effect.
I'm neutral now with respect to UB and compilers; I understand the pros and cons of doing things this way. My current stance is to know the rules clearly and always stay within their bounds, to write code that never triggers UB to the best of my knowledge. I know that testing compiled binaries produces good evidence of correct behavior but cannot prove the nonexistence of UB.
I don't think this is the whole story. There are certain classes of undefined behavior that some compilers actually guarantee to treat as valid code. Type punning through unions in C++ comes to mind. GCC says go ahead, the standard says UB. In cases like these, it really just seems like the standard is lazy.
It often isn't. C is often falsely advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect. Some writers may be used to pre-standardization compilers that are much less hostile than modern GCC/Clang.
> C is often [correctly, but misleadingly] advertised as a cross-platform assembly language, that will compile to the assembly that the author would expect.
Because that's what it is. What they don't tell you is that the most heavily-developed two (or more) compilers for it (which you might otherwise assume meant the two best compilers), are malware[0] that actively seek out excuses to inject security vulnerabilities (and other bugs) into code that would work fine if compiled to the assembly that any reasonable author would expect.
> Figure 6 shows a simple modification to the compiler that will deliberately miscompile source whenever a particular pattern is matched. If this were not deliberate, it would be called a compiler "bug". Since it is deliberate, it should be called a "Trojan horse".
Nice way to put down the amazing work of compiler authors. It's not malware, you just don't understand how to use it. If you don't want the compilers to do crazy optimisations, turn down the optimisation level. If you want them to check for things like null pointers or integer overflow or array bounds at runtime, then just turn on the sanitizers those compiler writers kindly provided to you.
You just want all of it: fast optimizing compiler, one that checks for your mistakes but also one that knows when it's not a mistake and still generates fast code. It's not easy to write such a compiler. You can tell it how to behave though if you care.
Socialism is when the government does something I don't like, and Reflections on Trusting Trust is when my compiler does something I don't like. The paper has nothing to do with how optimizing compilers work. Compiling TCC with GCC is not going to suddenly make it into a super-optimizing UB-exploiting behemoth.
A main point in the article is function classification, i.e. 'Type 1 Functions' are outward-facing, and subject to bad or malicious input, so require lots of input checking and verification that preconditions are met:
> "These have no restrictions on their inputs: they behave well for all possible inputs (of course, “behaving well” may include returning an error code). Generally, API-level functions and functions that deal with unsanitized data should be Type 1."
Internal utility functions that only use data already filtered through Type 1 functions are called "Type 3 Functions", i.e. they can result in UB if given bad inputs:
> "Is it OK to write functions like this, that have non-trivial preconditions? In general, for internal utility functions this is perfectly OK as long as the precondition is clearly documented."
Incidentally I found that article from the top link in this Chris Lattner post on the LLVM Project Blog, "What Every C Programmer Should Know About Undefined Behavior":
In particular this bit on why internal functions (Type 3, above) shouldn't have to implement extensive preconditions (pointer dereferencing in this case):
> "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic. This would have an extremely high cost for many numerical and other applications, as well as breaking binary compatibility with every existing C library."
Basically, the conclusion appears to be that any data input to a C program by a user, socket, file, etc. needs to go through a filtering and verification process of some kind, before being handed to over to internal functions (not accessible to users etc.) that don't bother with precondition testing, and which are designed to maximize performance.
In C++ I suppose, this is formalized with public/private/protected class members.
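A sketch of that split in plain C (function names are made up): the boundary function validates everything, while the internal helper documents its precondition and skips the checks.

    #include <stddef.h>
    #include <stdint.h>

    /* "Type 3" internal helper: the precondition (buf points at >= 4 readable
       bytes) is documented, not checked. */
    static uint32_t read_be32_unchecked(const uint8_t *buf) {
        return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
             | ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    }

    /* "Type 1" boundary function: behaves well for all inputs. */
    int read_be32(const uint8_t *buf, size_t len, uint32_t *out) {
        if (buf == NULL || out == NULL || len < 4)
            return -1;                      /* reject bad input instead of invoking UB */
        *out = read_be32_unchecked(buf);
        return 0;
    }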
I haven’t used C or C++ for anything, but in writing a Game Boy emulator I ran into exactly that kind of memory corruption pain. An opcode I implemented wrong causes memory to corrupt, which goes unnoticed for millions of cycles or sometimes forever depending on the game. Good luck debugging that!
My lesson was: here’s a really really good case for careful unit testing.
I would go one step farther: the documentation will say it is undefined behavior, but the compiler doesn't have to do anything about it. Here's an example from the man page for sprintf
sprintf(buf, "%s some further text", buf);
If you miss that section of the manual, your code may work, leading you to think the behavior is defined.
Then you will have interesting arguments with other programmers about what exactly is undefined behavior, e.g. what happens for
I remember reading a blog post a couple of years back on undefined behavior from the perspective of someone building a compiler. The way the standard defines undefined behavior (pun not intended), a compiler writer can basically assume undefined behavior never occurs and stay compliant with the standard.
This offers the door to some optimizations, but also allows compiler writers to reduce the complexity in the compiler itself in some places.
I'm being very vague here, because I have no actual experience with compiler internals, nor that level of language-lawyer pedantry. The blog's name was "Embedded in academia", I think, you can probably still find the blog and the particular post if it sounds interesting.
Yeah, a decent chunk of UB is about reducing the burden on the compiler. Null derefs are an obvious such example: if null dereference were defined behavior, the compiler would be endlessly adding, and later attempting to optimize away, null checks. Which isn't something anyone actually wants when reaching for C/C++.
Similarly with C/C++ it's not actually possible for the compiler to ensure you don't access a pointer past the end of the array - the array size often isn't "known" in a way the compiler can understand.
> Which isn't something anyone actually wants when reaching for C/C++.
Disagree. I think a lot of people want some kind of "cross-platform assembler" (i.e. they want e.g. null deref to trap on architectures where it traps, and silently succeed on architectures where it succeeds), and get told C is this, which it very much isn't.
Except every other sane systems programming language does indeed do null checks, even those older than C, but they didn't come with UNIX, so here we are.
No, even type-punning properly allocated memory (e.g. using memory to reinterpret the bits of a floating point number as an integer) through pointers is UB because compilers want to use types for alias analysis[1]. In order to do that "properly" you are supposed to use a union. In C++ you are supposed to use the reinterpret_cast operator.
[1] Which IMO goes back to C's original confusion of mixing up machine-level concepts with language-level concepts from the get-go, leaving optimizers no choice but unsound reasoning and blaming programmers when they get it wrong. Something something numerical loops and supercomputers.
I believe using reinterpret_cast to reinterpret a float as an int is undefined behavior, because I don't believe that follows the type aliasing rules [1]. However, you could reinterpret a pointer to a float as a pointer to char, unsigned char, or std::byte and examine it that way.
As far as I'm aware, it's safe to use std::memcpy for this, and I believe compilers recognize the idiom (and will not actually emit code to perform a useless copy).
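For example, something like this sketch (the size check is asserted rather than assumed):

#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer.
 * memcpy avoids the aliasing problems of pointer casts; compilers
 * typically compile this down to a single register move. */
uint32_t float_bits(float f) {
    uint32_t u;
    _Static_assert(sizeof f == sizeof u, "float must be 32 bits here");
    memcpy(&u, &f, sizeof u);
    return u;
}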
That's like saying all bugs are undefined behavior. C lets you write to your own stack, so if you corrupt the stack due to an application error (e.g. bounds check), then that's just a bug because you were executing fully-defined behavior. Examples of undefined behavior would be things like dividing by 0 where the result of that operation can differ across platforms because the specific behavior wasn't defined in the language spec.
There are some complicated UBs that arise when casting to different types that are not obviously logic errors (can't remember the specifics but remember dealing with this in the past).
This looks decent, but I'm (highly) opposed to recommending `strncpy()` as a fix for `strcpy()` lacking bounds-checking. That's not what it's for; it's weird and should be considered as obsolete as `gets()`, in my opinion.
If available, it's much better to do it the `snprintf()` way as I mentioned in a comment last week, i.e. replace `strcpy(dst, src)` with `snprintf(dst, sizeof dst, "%s", src)` and always remember that "%s" part. Never put src there, of course.
There's also `strlcpy()` on some systems, but it's not standard.
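Roughly like this, for the record (a sketch; `dst` here is a real array, which is the only reason `sizeof` gives its capacity):

#include <stdio.h>

void copy_example(const char *src) {
    char dst[64];
    /* Copies at most sizeof dst - 1 characters and always NUL-terminates;
     * the "%s" keeps src from ever being interpreted as a format string. */
    snprintf(dst, sizeof dst, "%s", src);
    puts(dst);
}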
strncpy does have its odd and rare use-case, but 100% agree that it is not at all a “fix” for strcpy, it’s not designed for that purpose, and unsuited to it, being both unsafe (does not guarantee NUL-termination) and unnecessary costly (fills the destination with NULs).
The strn* category was generally designed for fixed-size NUL-padded content (though not all of them because why be coherent?), the entire item is incorrect, and really makes the entire thing suspicious.
Lol no. These are the Annex K stuff which Microsoft got into the standard, which got standardised with a different behaviour than Windows’ (so even on windows following the spec doesn’t work) and which no one else wants to implement at all.
And they don’t actually “do exactly what you want”, see for instance N1967 (“field experience with annex k”) which is less than glowing.
Would it be a sin to use memcpy() and leave things like input validation to a separate function? I'm nervous any time somebody takes a function with purpose X and uses it for purpose Y.
Uh, isn't using `memcpy()` to copy strings doing exactly that?
The problem is that `memcpy()` doesn't know about (of course) string terminators, so you have to do a separate call to `strlen()` to figure out the length, thus visiting every character twice which of course makes no sense at all (spoken like a C programmer I guess ... since I am one).
If you already know the length due to other means, then of course it's fine to use `memcpy()` as long as you remember to include the terminator. :)
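Something along these lines, assuming the length really is already known (a sketch):

#include <stdlib.h>
#include <string.h>

/* Duplicate a string whose length is already known (len == strlen(src)). */
char *copy_known_length(const char *src, size_t len) {
    char *dst = malloc(len + 1);          /* +1 for the NUL terminator */
    if (dst != NULL)
        memcpy(dst, src, len + 1);        /* copies the terminator too */
    return dst;
}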
The reason you would want to use memcpy would be if 1) you already know what the length is, 2) if you need a custom validator for your input, 3) you don't want to validate your input (however snprintf() is doing that), 4) if the string may include nulls or there is no null terminator.
But the fifth reason may be that depending on snprintf as your "custom-validator-and-encoder-plus-null-terminator" may introduce subtle bugs in your program if you don't know exactly what snprintf is doing under the hood and what its limitations are. By using memcpy and a custom validator, you can be more explicit about how data is handled in your program and avoid uncertainty.
(by "validate" I mean handle the data as your program expects. this could be differentiating between ASCII/UTF-8/UTF-16/UTF-32, adding/preserving/removing a byte-order mark, eliminating non-printable characters, length requirements, custom terminators, or some other requirement of whatever is going to be using the new copy of the data)
When I first learned C - which was also my first contact with programming at all - I did not understand how pointers work, and the book I was using was not helpful at all in this department. I only "got" pointers three or four years later; fortunately, programming was still a hobby at that point.
Funnily, when I felt confident enough to tell other people about this, several immediately started laughing and told me what a relief it was to hear they weren't the only ones with that experience.
Ah, fun times.
EDIT: One book I found invaluable when getting serious about C was "The New C Standard: A Cultural and Economic Commentary" by Derek Jones (http://knosof.co.uk/cbook). You can read it for free because the book ended up being too long for the publisher's printing presses or something like that. It's basically a sentence-by-sentence annotated version of the C standard (C99 only, though) that tries to explain what the respective sentence means to C programmers and compiler writers and how other languages (mostly C++) deal with the issue at hand, but also how this impacts the work of someone developing coding guidelines for large teams of programmers (which was how the author made a living at the time, possibly still is). It's more than 1500 pages and a very dense read, but it is incredibly fine-grained and in-depth. Definitely not suitable for people who are just learning C, but if you have read "Expert C Programming: Deep C Secrets" and found it too shallow and whimsical, this book was written for you.
Having basic experience in any assembly language makes pointers far more clear.
"Addressing modes," where a register and some constant are used to calculate the source or target of a memory operation, make the equivalence of a[b]==*(a+b) much more obvious.
I also wonder about the author's claims that a char is almost always 8 bits. The first SMP machine that ran Research UNIX was a 36-bit UNIVAC. I think it was ASCII, but the OS2200/EXEC8 SMP matured in 1964, so this was an old architecture at the time of the port.
"Any configuration supplied by Sperry, including multiprocessor ones, can run the UNIX system."
> Having basic experience in any assembly language makes pointers far more clear.
That's a key point. I came to C after several years of programming in assembly and a pointer was an obvious thing. But I can see that for someone coming to C from higher level languages it might be an odd thing.
There was an "official" C compiler for NOS running on the CDC Cyber. As I recall, 18-bit address, 60-bit words, more than one definition of a 'char' (12-bit or 5-bit, I think). It was interesting. There were a lot of strange architectures with a C compiler.
I would also point out architectures like the 8051 and 8086 made (make...they are still around) pointer arithmetic interesting.
The C standard, as I recall, defines a byte effectively as at least 8 bits. I've read that some DSP platforms use a byte (and thus a char) that is 24 bits wide, because that's what audio samples use, but supposedly those platforms rarely, if ever, handle any actual text. The standard header <limits.h> contains a macro CHAR_BIT that tells you how many bits a char has on your platform.
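For instance, a tiny sketch:

#include <limits.h>
#include <stdio.h>

int main(void) {
    /* CHAR_BIT is the number of bits in a char; the standard requires
     * it to be at least 8. */
    printf("char is %d bits wide\n", CHAR_BIT);
    return 0;
}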
I think I remember reading about a C compiler for the PDP-10 (or Lisp Machine?), also a 36-bit machine, that used a 9 bit byte. There even exists a semi-jocular RFC for UTF-9 and UTF-18.
Pointers are by far the most insidious thing about C. The problem is that nobody who groks pointers can understand why they had trouble understanding them in the first place.
Once you understand, it seems so obvious that you cannot imagine not understanding what a pointer is, but at the beginning, trying to figure out why the compiler won't let you assign a pointer to an array, like `char str[256]; str = "asdf"`, is maddening.
One thing I think would benefit many is if we considered "arrays" in C to be an advanced topic, and focused on pointers only; treating "malloc" as a magical function until the understanding of pointers and indexing is so firmly internalized that you can just add on arrays to that knowledge. Learning arrays first and pointers second is going to break your brain because they share so much syntax, but arrays are fundamentally a very limited type.
When I've had to explain it, I describe memory as a street with house numbers (which are memory addresses).
A house can store either people, or another house number (for some other address).
If you use a person as a house number, it will inflict grievous harm upon that person. If you use a house number as a person, it will blow up some random houses. Very little in the language stops you from doing this, so you have to be careful not to confuse them.
Then I describe what an MMU does with a TLB, at which point the eyes glaze over.
From my memory, the syntax of pointers really tripped me up. E.g., the difference between * and & in declaration vs dereferencing. I think this is especially confusing for beginners when you add array declarations to the mix.
That's the problem! I can't tell you what is difficult because it seems so incredibly obvious to me now.
When I was ~12, I had a lot of trouble with it, and the only thing I remember from those times is various attempts to figure out why the compiler wouldn't let me assign a string value to an array. What the hell is an "lvalue", Mr. Compiler?
Now I look at the assignment command above and I recoil in horror, but for some reason at the time it seemed very confusing to me, especially since `char *str; str = "abcd";` works so well. The difference between the two (as far as intention goes) is vast in retrospect, but for some reason I had trouble with it at the time.
The pointer/array confusion in C makes this way harder to understand than it has to be. The other thing is the syntax, which is too clever and too hard to parse in your head for complex expressions. Both of these things also tend to not be explained very well to beginners, probably partly due to the fact that explaining it in detail is complex and would perhaps go over the beginner's head. It's also stupid, so you'd probably have to explain how it turned out to be this complex.
Pretty sure that this is a big factor, I’m not aware of any recent languages that put type information before and after the variable name. Nowadays there’s always a clear distinction between the variable name and the type annotation.
both str and "asdf" are not pointer-type expressions; they're both arrays (which is exposed by sizeof). The reason why this doesn't work is because C refuses to treat arrays as first-class value types - which is not an obvious thing to do regardless of how well you understand pointers or not. Other languages with arrays and pointers generally haven't made this mistake.
One thing that helped me understand pointers was understanding that a pointer is just a memory address.
When I was still a noob programmer, my instructor merely stuck to words like "indirection" and "dereferencing" which are all fine and dandy, but learning that a pointer is just a memory address instantly made it click.
Pointer arithmetic is merely knowing that any addition/subtraction done to a pointer is multiplied by the size of the type being pointed to. So if you're pointing to a 64-byte struct, then "ptr++;" adds 64 to the pointer.
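For example (a sketch; the struct size is chosen just for illustration):

#include <stdio.h>

struct record { char payload[64]; };   /* sizeof(struct record) is 64 */

int main(void) {
    struct record table[4];
    struct record *ptr = table;
    ptr++;  /* advances by sizeof(struct record) bytes, i.e. 64, not by 1 */
    printf("%td\n", (char *)ptr - (char *)table);   /* prints 64 */
    return 0;
}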
When I’m teaching (a very high-level language), I make a point of saying that a variable is a named memory location. Where is that location? We don’t know. Now, I am absolutely aware that the address isn’t the “real” location, but I have this idea that talking about variables in this way might help them grok the lower-level concept later on.
My experience with pointers was the inverse of yours. My first programming language was Java, and I spent many hours puzzling out reference types (and how they differed from primitive types). I only managed to understand references after somebody explained them as memory addresses (e.g. the underlying pointer implementation). When I later learned C, I found pointers to be delightfully straightforward. Unlike references in Java, pointers are totally upfront about what they really are!
When I got to Java, I experienced the same problem. Much later, I learned C# and found that it apparently had observed and disarmed some of Java's traps, but they also got a little baroque in some places, e.g. with pointers, references, in/out parameters, value types, nullable types, ... A lot of the time one doesn't need it, but it is a bit of a red flag if a language has two similar concepts expressed in two similar ways but with "subtle" differences.
I did like the const vs readonly solution they came up with. I wish Go (my current goto (pun not necessarily unintentional) language) had something similar
"The C Puzzle Book" is the thing I recommend to anyone who knows they want to have a good, working understanding of how to use pointers programming in C.
Many years ago I did the exercises on the bus in my head, then checking the answers to see what I got wrong and why over the space of a week or so. It's a really, really good resource for anyone learning C. It seemed to work for several first year students who were struggling with C in my tutorials as well and they did great. Can't recommend it highly enough to students and the approach to anyone tempted to write an intro C programming text.
I would highly recommend the video game Human Resource Machine for getting a really good understanding of how pointers work.
It's more generally about introducing assembly language programming (sort of) in gradual steps, so you'll need to play through a fair chunk of the game before you get to pointers. But by the time you get to them, they will seem like the most obvious thing in the world. You might even have spent the preceding few levels wishing you had them.
This will print two different values if you have `int i = 1` and you call `foo(&i, &i)`. This is the classic C aliasing rule. The C standard guarantees that this works even under aggressive optimisation (in fact certain optimisations are prevented by this rule), whereas the analogous Fortran wouldn't be guaranteed to work.
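The parent's `foo` isn't shown here, but it presumably looked something like this hypothetical reconstruction, where the pointers being allowed to alias is exactly what forbids the compiler from caching the first store:

/* A hypothetical reconstruction of the foo() being discussed. */
int foo(int *a, int *b) {
    *a = 5;
    *b = 7;
    /* The compiler cannot assume *a is still 5 here, because a and b
     * may point to the same int. */
    return *a;   /* foo(&i, &i) returns 7; with two distinct ints, 5 */
}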
I was born in '74, so I'm part of the last generation to start with C and then move to other, higher-level languages like Python or JavaScript. Going in this direction was natural. I was amazed by all the magic the higher-level languages offered.
Going the other direction is a bit more difficult apparently. "What do you mean it does not do that?". Interesting perspective indeed!
What was nice about C then was that, based on my study of CPUs at the time, you could pretty much get your head around what the CPU was doing. So you could learn the instructions (C) and the machine following them (the CPU).
When I got to modern CPUs it's so complex my eyes glazed over reading the explanation and I gave up trying to understand them.
(Edit: But probably not in release builds, as @rmind points out.)
> The closest thing to a convention I know of is that some people name types like my_type_t since many standard C types are like that
Beware that names beginning with "int"/"uint" and ending with "_t" are reserved in <stdint.h>.
[Edited; I originally missed the part about "beginning with int/uint", and wrote the following incorrect comment: "That shouldn't be recommended, because names ending with "_t" are reserved. (As of C23 they are only "potentially reserved", which means they are only reserved if an implementation actually uses the name: https://en.cppreference.com/w/c/language/identifier. Previously, defining any typedef name ending with "_t" technically invoked undefined behaviour.)"]
The post never mentions undefined behaviour, which I think is a big omission (especially for programmers coming from languages with array index checking).
So I actually stand by my original comment that the convention of using "_t" suffix shouldn't be recommended. (It's just that the reasoning is for conformance with POSIX rather than with ISO C.)
Well, semantically, "size_t" makes sense to me ("the type of a size variable"), while "uint_t" does not ("the type of a uint variable"), because "uint" is already a type, obviously - just like "int".
In addition, I recommend -fsanitize=integer. This adds checks for unsigned integer overflow which is well-defined but almost never what you want. It also checks for truncation and sign changes in implicit conversions which can be helpful to identify bugs. This doesn't work if you pepper your code base with explicit integer casts, though, which many have considered good practice in the past.
Wow nice, I didn't know about this one. I can add some more which are less known. This is my current sanitize invocation (minus the addition of "integer" which I'll be adding, unless one of these other ones covers it):
-fsanitize=address,leak,undefined,cfi,function
CFI has checks for unrelated casts and mismatched vtables which is very useful. It requires that you pass -flto or -flto=thin and -fvisibility=hidden.
You can read a comparison with -fsanitize=function here:
I think "leak" is always enabled by "address". It's only useful if you want run LeakSanitizer in stand-alone mode. "integer" is only enabled on demand because it warns about well-defined (but still dangerous) code. You can also enable "unsigned-integer-overflow" and "implicit-conversion" separately. See https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#...
Why the hell "potentially reserved" was introduced? How is it different from simply "reserved" in practice except for the fact such things can be missing? How do you even use a "potentially reserved" entity reliably? Write your own implementation for platforms where such an entity is not provided, and then conditionally not link it on the platforms where it actually is provided? Is the latter even possible?
Also, apparently, "function names [...] beginning with 'is' or 'to' followed by a lowercase letter" are reserved if <ctype.h> and/or <wctype.h> are included. So apparently I can't have a function named "touch_page()" or "issue_command()" in my code. Just lovely.
> The goal of the future language and library reservations is to alert C programmers of the potential for future standards to use a given identifier as a keyword, macro, or entity with external linkage so that WG14 can add features with less fear of conflict with identifiers in user’s code. However, the mechanism by which this is accomplished is overly restrictive – it introduces unbounded runtime undefined behavior into programs using a future language/library reserved identifier despite there not being any actual conflict between the identifier chosen and the current release of the standard. ...

> Instead of making the future language/library identifiers be reserved identifiers, causing their use to be runtime unbounded undefined behavior per 7.1.3p1, we propose introducing the notion of a potentially reserved identifier to describe the future language and library identifiers (but not the other kind of reservations like __name or _Name). These potentially reserved identifiers would be an informative (rather than normative) mechanism for alerting users to the potential for the committee to use the identifiers in a future release of the standard. Once an identifier is standardized, the identifier stops being potentially reserved and becomes fully reserved (and its use would then be undefined behavior per the existing wording in C17 7.1.3p2). These potentially reserved identifiers could either be listed in Annex A/B (as appropriate), Annex J, or within a new informative annex. Additionally, it may be reasonable to add a recommended practice for implementations to provide a way for users to discover use of a potentially reserved identifier. By using an informative rather than normative restriction, the committee can continue to caution users as to future identifier usage by the standard without adding undue burden for developers targeting a specific version of the standard.
So... instead of mandating that implementations warn about (re)defining a reserved identifier, they introduce another class of "not yet reserved" identifiers and advise implementations to warn about defining such identifiers in user code (even though doing so is completely legal), right up until the moment the implementation itself actually uses/defines such an identifier, at which point warning about the redefinition in user code (now illegal and UB) is no longer necessary or advised.
Am I completely misreading this or is this actually insane? Besides, there is already a huge swath of reserved identifiers in C, why do they feel the need to make an even larger chunk of names unavailable to the programmers?
The problem is that the traditional wording of C meant that any variable named 'top' was technically UB, because it begins with `to'.
In practical terms, what compilers will do is, if C2y adds a 'togoodness' function, they will add a warning to C89-C2x modes saying "this is now a library function in C2y," or maybe even have an extension to use the new thing in earlier modes. This is what they already do in large part; it's semantic wording to make this behavior allowable without resorting to the full unlimited power of UB.
> Besides, there is already a huge swath of reserved identifiers in C, why do they feel the need to make an even larger chunk of names unavailable to the programmers?
The C23 change was mostly to downgrade some of the existing reserved identifiers from "reserved" to "potentially reserved". (It also added some new reserved and potentially reserved identifiers, but they seem reasonable to me.)
I still fail to see any practical difference between these two categories, except that the implementations are recommended to diagnose illegal-in-the-future uses of potentially reserved identifiers but are neither required nor recommended to diagnose actually illegal uses of reserved identifiers. There is also no way to distinguish p.r.i from r.i.
It also means that if an identifier becomes potentially reserved in C23 and reserved in C3X, then compiling a valid C11 program that uses it as C23 will give you a warning, which you can fix and then compile resulting valid C23 program as C3X without any problem; but compiling such a C11 program straight up as C3X will give you no warning and a program with UB.
Seriously, it boggles my mind. Just a) require diagnostics for invalid uses of reserved identifiers starting from C23, b) don't introduce new reserved identifiers, there is already a huge amount of them.
You can declare a type without (fully) defining it, like in
typedef struct foo foo_t;
and then have code that (for example) works with pointers to it (foo_t *). If you include a standard header containing such a forward declaration, and also declare foo_t yourself, no compilation error might be triggered, but other translation units might use differing definitions of struct foo, leading to unpredictable behavior in the linked program.
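A minimal sketch of how that pattern normally looks when it is done deliberately (file names and members are illustrative):

/* foo.h -- users only ever see the forward declaration */
typedef struct foo foo_t;
foo_t *foo_create(void);
void   foo_destroy(foo_t *p);

/* foo.c -- the single place where struct foo is fully defined */
#include <stdlib.h>
struct foo { int refcount; };

foo_t *foo_create(void)      { return calloc(1, sizeof(foo_t)); }
void   foo_destroy(foo_t *p) { free(p); }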
One potential issue would be that the compiler is free to assume any type with the name `foobar_t` is _the_ `foobar_t` from the standard (if one is added), it doesn't matter where that definition comes from. It may then make incorrect assumptions or optimizations based on specific logic about that type which end up breaking your code.
But wouldn't one be required to include a particular header in such case (i.e. the correct header for defining a particular type)?
I mean, no typedef names are defined in the global scope without including any headers right? Like I find it really weird that a type ending in _t would be UB if there is no such typedef name declared at all.
Or is this UB stuff merely a way for the ISO C committee to enforce this without having to define <something more complicated>?
The purpose of this particular naming rule is to allow adding new typedefs such as int128_t. The "undefined behaviour" part is for declaration of any reserved identifier (not specifically for this naming rule). I don't know why the standard uses "undefined behaviour" instead of the other classes (https://en.cppreference.com/w/cpp/language/ub); I suspect because it gives compilers the most flexibility.
The standard reserves several classes of identifiers, "_t" suffix [edit: with also "int"/"uint" prefix] is just one of several rules. Another rule is "All identifiers that begin with an underscore followed by a capital letter or by another underscore" (and also "All external identifiers that begin with an underscore").
> C has no environment which smooths out platform or OS differences
Not true - C has little environment, not no environment. For example, fopen("/path/file.txt", "r") is the same on Linux and Windows. For example, uint32_t is guaranteed to be 32 bits wide, unlike plain int.
> Each source file is compiled to a .o object file
Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?
> static
This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.
> Integers are very cursed in C. Writing correct code takes some care
> Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?
The standard only says that the implementation must preprocess, translate, and link the several "preprocessing translation units" to create the final program. It doesn't say anything about how the translation units are stored on the system.
> This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.
Not quite: in a global scope, it gives the variable internal linkage, so that other translation units can use the same name to refer to their own variables. In a block scope, it gives the variable static storage duration, but it doesn't give it any linkage. In particular, it doesn't let the program refer to the variable outside its block.
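A small sketch of the two meanings:

static int file_counter;       /* file scope: internal linkage; the name is
                                  private to this translation unit */

int next_id(void) {
    static int call_count;     /* block scope: static storage duration, so it
                                  keeps its value across calls, but it has no
                                  linkage and the name is only visible inside
                                  this block */
    file_counter++;
    return ++call_count;
}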
On Windows you can directly access UNC paths (without mounting) with fopen. You can't do this on POSIX platforms. Also, not all API boundaries are fixed width so you're going to be exposed to the ugliness of variable width types.
I think the article is correct that one must be aware of the platform and the OS when writing C code.
fopen will/should fail on windows with the unix path syntax.
The reason it's indeterminate is because some stdc lib vendors will do path translation on Windows, some won't. I believe cygwin does (because it's by definition a unix-on-windows), but I'm pretty sure the normal stdclib vendors on windows do not.
I'm almost positive that MacOS (before MacOS X) will fail with unix path separators, since path separators are ':' not '/'.
It will work on Windows, since it inherits the behavior from MS-DOS. It's the shell on Windows (or MS-DOS) where it fails since the shell uses '/' to designate options, so when MS-DOS gained subdirectories (2.0) it used '\' as the file separator on the shell. The "kernel" will accept both. There even used to be an undefined (or underdefined) function in MS-DOS to switch the option character.
Apparently it's true. I wonder when this was implemented?
Canonicalize separators
"All forward slashes (/) are converted into the standard Windows separator, the back slash (\). If they are present, a series of slashes that follow the first two slashes are collapsed into a single slash."
After learning C, one of the first projects I came into contact with was the ID Tech 3 game engine [1]
On the one hand, it taught me how professional C programmers structure their code (extra functions to abstract away platform differences, specific code shared between server and client to allow smooth prediction) and how incredibly fast computers can be (thousands of operations within milliseconds), but it also showed me how the same code can behave differently due to compiler differences (tests pass, production crashes) and how important good debugging tools are (e.g. backtraces).
To this day I am very grateful for the experience and that ID decided to release the code as open source.
Love the intro and overview — looking forward to more!
These weren't mentioned in the post but have been very helpful in my journey as a C beginner so far:
- Effective C by Robert C. Seacord. It covers a lot of the footguns and gotchas without assuming too much systems or comp-sci background knowledge. https://nostarch.com/Effective_C (Also, how can you not buy a book on C with Cthulhu on the cover written by a guy with _three_ “C”s in his name?)
- Computer Systems, A Programmer's Perspective by Randal E. Bryant and David R. O'Hallaron for a deeper dive into memory, caches, networking, concurrency and more using C, with plenty of practice problems: https://csapp.cs.cmu.edu/
This used to be my bible when doing full-time C programming around 2000 (together with the standard docs). I'm out of date with the latest standard updates (as is this), but it may still be of interest.
Thanks for submitting this. I'm teaching myself C so these high level overviews are super useful for improving my intuition.
In the following example, shouldn't there be an asterisk * before the data argument in the getData function call? The way I understand it the function is expecting a pointer so you would need to pass it a pointer of the data object.
> If you want to “return” memory from a function, you don’t have to use malloc/allocated storage; you can pass a pointer to a local data:
No, it's correct. The asterisk is a little inconsistent, in that it means two opposite things. In the declaration it means "this is a pointer." However, in an expression, it means "this is the underlying type" and serves to dereference the pointer.
int a = 5;
int *x; // this is a pointer
x = &a;
int c = *x; // both c and *x are ints
If it were *data, it would be equivalent to *(data + 0), which is equivalent to data[0], which is an int. You don't want to pass an int, you want to pass an int *.
Because now you've got an int pointer and an int. The star associates with the right, not left.
I prefer to use the variant you described though, because it feels more natural to associate the pointer with the type itself. As far as I know, the only pitfall is in the multiple declaration thing so I just don't use it.
IMO, it's also more readable in this case:
int *get_int(void);
int* get_int(void);
The second one more clearly shows that it returns a pointer-to-int.
Multiple declaration is generally frowned upon, because you declare the variables without immediately setting them to something.
If you always set new variables in the same statement you declare them, then you don't use multiple declarations, which means there is no ambiguity putting the * by the type name.
So convention wins out for convention's sake. And that's the entire point of convention in the first place: to sidestep the ugly warts of a decades-old language design.
Spaces are ignored (except to separate tokens where no other syntax like * or , is present), and * binds to the variable on the right, not the type on the left. I actually got this wrong in an online test, but I screenshotted every question so I could go over them later (a dirty trick, I admit, but I learned things like this from it, and I still did well enough on the test to get the interview).
int*x,y; // x is pointer to int, y is int.
int x,*y; // x is int, y is pointer to int
And the reason I got it wrong on the test is it had been MANY years since I defined more than one variable in a statement (one variable defined per line is wordier but much cleaner), so if I ever knew this rule before, I had forgotten it over time.
I keep wanting to use slash-star comments, but I recall // is comment-to-end-of-line in C99 and later, something picked up from its earlier use in C++.
Oh yeah, C99 has become the de-facto "official" C language, regardless of more recent changes/improvements, as not all newer changes have made it into newer compilers, and most code written since 1999 seems to follow the C99 standard. I recall gcc and many other compilers have some option to specify which standard to use for compiling.
I think the question is why it binds to the variable rather than the type. It's obviously a choice that the designers have made; e.g. C# has very similar syntax, but:
int* x, y;
declares two pointers.
I think the syntax and the underpinning "declaration follows use" rule are what they got when they tried to generalize the traditional array declaration syntax with square brackets after the array name which they inherited directly from B, and ultimately all the way from Algol:
int x, y[10], z[20];
In B, though, arrays were not a type; when you wrote this:
auto x, y[10], z[20];
x, y, and z all have the same type (word); the [] is basically just alloca(). This all works because the type of element in any array is also the same (word), so you don't need to distinguish different arrays for the purposes of correctly implementing [].
But in C, the compiler has to know the type of the array element, since it can vary. Which means that it has to be reflected in the type of the array, somehow. Which means that arrays are now a type, and thus [] is part of the type declaration.
And if you want to keep the old syntax for array declarations, then you get this situation where the type is separated by the array name in the middle. If you then try to formalize this somehow, the "declaration follows use" rule feels like the simplest way to explain it, and applying it to pointers as well makes sense from a consistency perspective.
I don't know for certain, but I suspect it simplified the language's grammar, since C's "declaration follows use" rule means you can basically repurpose the expression grammar for declarations instead of needing new rules for types. This is also why the function pointer syntax is so baroque (`int (*x)();` declares a variable `x` containing a pointer to a function taking no parameters and returning an int).
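One common way to tame that, for what it's worth, is a typedef for the function type; both declarations below mean the same thing:

/* "declaration follows use": x is a thing such that (*x)() yields an int */
int (*x)(void);

/* the same thing via a typedef, which many people find easier to read */
typedef int handler_fn(void);
handler_fn *y;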
"integer pointer" named "x" set to address of integer "a".
---
As a sibling comment pointed out, this is ambiguous when using multiple declaration:
int* foo, bar;
The above statement declares an "integer pointer" foo and an "integer" bar. It can be unambiguously rewritten as:
int bar, *foo;
But multiple declaration sucks anyway! It's widely accepted good practice to initialize your variables in the same statement that you declare them. Otherwise your program might start reading whatever data was lying around on the stack (the current value of bar) or worse: whatever random memory address it refers to (the current value of foo).
I think it would help if beginners learn a language other than C to learn about pointers. My first language was Pascal, and it didn't have a confusing declaration syntax, nor did it have a confusing array decay behavior so it was much much easier to learn. Nowadays of course I don't think about it but those details mattered to beginners.
The name of an array decays to a pointer to the first element in various contexts. You could do `&data[0]` but it means exactly the same thing and would read as over-complicated things to C programmers.
" ... is often called the stack."
" integers are cursed"
These statements and a few others made me uncomfortable. They imply, to me, that the author has too little knowledge of computer internals to be programming in C.
C does a wonderful job of looking like a high level language but it was designed to do low level stuff. There is an implicit assumption, to my mind, that the user has a deeper understanding of what is behind the "curtain" that is created by the compiler.
It almost seems like there should be a pre-requisite course like "Introduction to Computer Hardware" before one is allowed to write a line of C.
>C does a wonderful job of looking like a high level language but it was designed to do low level stuff. There is an implicit assumption, to my mind, that the user has a deeper understanding of what is behind the "curtain" that is created by the compiler.
Lots of people say this, indeed several comments here talk about learning assembly before C being beneficial.
I actually think this is not true, today. In its original incarnation C may have mapped closely to instructions, but the details of e.g. x86 are IMO neither necessary nor particularly useful knowledge. Knowing what memory is like and how stuff is laid out is enough.
C is not just a super-thin layer to the bare metal. By "integers are cursed," I mean exactly that: C integers are cursed. The arithmetic behaviours that standard C defines for integers are error-prone and in many cases undefined. C is a layer and it has its own behaviour, it doesn't just expose your CPU.
As someone who started writing C with very little understanding of how the underlying hardware worked (or, indeed, programming in general), I support and disagree with parts of this comment at the same time.
On one hand, I support the notion that programming well (in any language) requires knowledge of hardware architectures.
On the other hand, I disagree that people should not be "allowed to write a line of C" before they have that understanding.
I started writing C early on in my programming career (having already dropped out of "Introduction to Computer Hardware"), and I'll admit, it was tough. I probably would have had an easier time if I had taken a year to study x86, win32 internals and graphics pipelines. That said, I was interested in learning to program graphics, so that's what I did, and I learned a tremendous amount while doing it. It was probably the long way round, but if the goal was learning, "just doing it" was an effective strategy.
What I'm trying to say here is that for people that would drop out of "Introduction to Computer Hardware", learning C is actually an extremely good supplement for learning about underlying hardware architectures because, in the long run, you have no choice if you want to be good at it.
> There is an implicit assumption, to my mind, that the user has a deeper understanding of what is behind the "curtain" that is created by the compiler.
With the behaviour of today's C compilers, what benefit would there be to such an understanding? It would seem to mainly give the user a lot of ways to shoot themselves in the foot.
I think the "automatic storage" terminology is technically correct. C can be used in places where there is no actual "stack" and these still need a mechanism for local variables, so they specify a different kind of automatic storage.
Everyone uses a stack now, though, even very exotic processors.
You can easily have a stack without special machine instructions (like push and pop) or a special stack-pointer register. In fact, such a special register is only useful in the absence of (a sufficient number of) general-purpose registers, which is characteristic of simpler architectures that have few registers (most of which are themselves specialized). IBM mainframes, for instance, never had the notion of a stack built into the architecture: to allocate a stack frame, the program would simply subtract the entire frame's size from the current value in some register and then populate whatever pieces it needs there, using that register as the base pointer.
> There is an implicit assumption, to my mind, that the user has a deeper understanding of what is behind the "curtain" that is created by the compiler.
It's also useful to remember that C is from a time when approximately everyone doing programming understood these things because you had to be exposed to lower level details.
Hah, "unnecessarily abstract" is my middle name :)
I don't really like stack/heap terminology. "Heap" especially is a nightmare because (a) it also means some specific, irrelevant kind of data structure and (b) there's so many ways of implementing allocation it feels wrong to call it "the" anything.
Function variables are deleted after return, allocated stuff isn't - no need to know about stack pointers, etc. It's good enough for me!
But it's really interesting to hear from other programmers who learned things in the historical order. I suppose I come from a new generation where abstractions are first, and I wrote this article for them, really.
But doesn't "the stack" also refer to the calling functions stacked upon one another? That's why it's always called a stack, regardless of the data structure: it mirrors the running program, and the usual C program is built out of function calls.
Well, the term is overloaded. C has recursion so an implementation needs something like a call stack, but you don't have to store it in a stack datastructure.
In year 1 at my university, they started with the electron, and I was impressed: all the way from diodes to how logic gates are made. Then digital circuit design, and that's where you got your one's and two's complement.
I've been programming in C forever, one advantage is that the language has not evolved much (especially compared with C++), but it has evolved.
There was the big K&R C to ANSI C function declaration transition. For portable code, you used K&R C well into the 90s (because older machines only had the K&R compiler), or used ugly macros to automatically convert from ANSI to K&R.
Another was the addition of 'const' to the language. It used to be said that const was a virus: once you start using it, you need to use it universally in your entire code-base.
A more recent big one is stdint.h (types like uint32_t). To correctly use these types you must also use the macros in inttypes.h for scanf and printf conversions. IMHO, they both should have been in the same header file; they go along with each other.
So in the old days, you would say:
unsigned int x;
printf("%u\n", x);
But now, you should tediously say:
uint32_t x;
printf("%"PRIu32"\n", x);
(Because uint32_t might be defined as a long even if it's the same size as an int, so you will get compiler warnings. You could use %lu, but then you get compiler warnings the other way.)
Another: On UNIX, we don't have to worry about "wide" characters (use UTF-8 instead) and wide strings, but you certainly do on Windows. "Wide" characters are obsolete but Windows is stuck with them since they are built into the OS.
> It used to be said that const was a virus: once you start using it, you need to use it universally in your entire code-base.
In order for const to actually work for what it's supposed to do, it does have to be viral in the direction of data flow. You should start by adding const to function arguments that point to data the function only reads (and doesn't pass the pointer to any subroutines) and expand from there. Eg:
#include <ctype.h>

_Bool isurl(char /*const here*/ *s) {
    while (isalpha(*s)) s++;
    return *s == ':';
} /* s is never written through */
Then anything that passes pointers (only) to functions like isurl, and so on as is convenient.
Half of the items the poster lists as "wish I'd known" are things which were literally part of the curriculum when I taught C at my alma mater (Technion IIT, Haifa). And then, some things are better left unknown - like cppreference, which would just confuse you with a lot of C++, which is interesting but not relevant. Also, there are some nitpicks to make, like about array decay etc.
More generally:
We can't expect to know a lot of context before learning anything. You have to learn things in _some_ order. It's better to know C when you study PDP assembly, and at the same time it's better to already know PDP assembly when you study C - as you have insight into the motivation of what's possible via the language syntax itself. Same thing for Calculus and Topology: The latter course helped me understand and generalize a lot of what we had done in the former, but without the former, I wouldn't have had proper motivation for the latter, which might have seemed like useless sophistry.
I didn't learn computer science at college/university - only via the internet, and C is quite different to learn that way vs other languages. That was my main motivation for writing :)
The couple of times I tried to really learn C I ran into the same problems. I started by trying to answer two questions: what are the modern features of C that I need to learn to use it, and what style/rules should I follow.
What I found is a body of several C spec updates that are each defined in reference to previous C spec updates. So I'm supposed to... ? Learn C from the 70s and then study each version update and mentally diff everything to figure out what C is now?
Then in terms of rules and style, unlike when K&R C was published, I couldn't find any authority. Actually what I see is that even programmers who have been writing C for many years frequently debate what practices are correct vs not. You can see it in this very thread. Every language has this, but I've seen it much more with C than other languages.
Actually learning the material for me is hard when I can't even get a firm handle on what it is I'm supposed to learn.
Completely agree - that's exactly why I wrote this. Some of things I wrote even seem super obvious but with no real authority you wonder whether or not this is "the way" to do things. If you ever want to get some practical experience, we'll help you out if you want to contribute to Brogue :)
Replying here hoping that you'll have a better chance of seeing it: thank you for maintaining Brogue. I've sung praises [0][1][2] for years about the quality of the Brogue codebase (and the game itself).
The one thing missing from this list: Compiler optimizations can have undesirable effects.
A trivial example I ran into in an older job: I was compiling a program that relied on libm.so (the standard math library you get when including math.h). Now I wanted the code to use my own custom libm.so - not the one that was installed in /usr/lib or wherever, so I ensured it was dynamically compiled.
My code had some calls like:
int x = sin(3.2);
During compilation, it computed sin(3.2) using the system libm. Notably, it would not use the sine function in my custom libm.so (and indeed, I needed it to!)
And IIRC, even -O0 was not enough to prevent this from happening. I had to go through the man page to figure out which compiler flags would prevent it.
C doesn't have namespaces so the compiler is certainly within its right to deduce that the sin() function is the one from the standard library.
Actually even in C++ after the compiler performs the unqualified name lookup if the result of the lookup is the standard sin() function it will make use of its internal knowledge about the function to do optimization.
Remember that the C or C++ standard doesn't deal with compilers and linkers; the standard deals with the "implementation" as a whole.
The idea is not to write apps for a living in assembler, but to get an idea of what's going on under the hood. Just as C will help understanding what's going on under the hood of Javascript/Python etc.
Definitely pick a simpler assembly than x86 assembly, and it's not so bad. I learned 68HC11 assembly which has been a boon for understanding what's happening underneath the hood.
I find this article very strange, perhaps because I started using 'c' so long ago. To the bullet points:
(1) In general, 'c' is always 'c' at the command line, regardless of the platform.
(2) yes, there are options and build tools, but cc my_program.c -o my_program works fine. I have a very hard time figuring out how to compile/run java.
(3) hard to see how this has anything to do with 'C', vs any other compiled language.
(4) so?? I would think I would be more concerned about how to use 'c' for my problem, without worrying about how to use 'c' to solve some other problem. It is hard for me to understand why a language that can do many things is more problematic than a language that only does a few things.
My sense is that reading this article makes things harder, not easier. Most people do not care whether an int is 32 or 64 bits.
I won't argue that copying things (that are not ints or floats) needs to be understood, but many popular languages (specifically python) have the same problem. Understanding the difference between a reference and a value is important for most languages.
There are different schools of thought -- those that can imagine lots of issues after reading the documentation, vs those that simply try writing code and start exploring edge cases when something breaks. I learn faster by trying things, and rarely encounter the edge-case issues.
‘ You can’t extend structs or do anything really OO-like, but it’s a useful pattern to think with’
That’s not quite true. If you define 2 structs so that they start the same (eg: both with “int x; int y” in your example), pointers can be passed to functions with either struct type. You can use this to add fields (eg: int z) to structures, and extend a 2d vector into a 3d one…
With a bit of creative thought, and constructive use of pointer-to-functions, you can do quite a bit of OOP stuff in C.
The defined way to do something like this is to have the smaller struct as the first member of the larger one. The first member is guaranteed to have the same address as the outer object.
The common initial sequence trick is guaranteed to work with unions in limited circumstances.
In 2nd year of college, one of my friends and I figured this out on our own, which was a really rewarding experience. The code did look pretty hairy, though.
Details for anyone interested:
The CS course project was to write a game solver for a variety of games with perfect information (e.g: tic-tac-toe), and they highly suggested we use object-oriented design and recursion + backtracking. They also let us pick any language we wanted, and being computer engineers, between the two of us, we were most comfortable with C.
So we kind of started writing our project and implementing the first game, and when we got to the second, we were scratching our heads like, "Is it possible to just... take a pointer to a function?" "Yeah, it's just somewhere in memory, right?" And then everything fell into place, and we just had to define a struct with pointers to game state and "methods", and our TA was baffled that we did the project in C but we got a great grade.
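Roughly the shape of it (a reconstruction from memory of the kind of struct described above, not the actual project code):

/* A "class" as a struct of state plus function pointers. */
struct game {
    void *state;                                 /* per-game data */
    void (*reset)(void *state);
    int  (*best_move)(void *state, int player);  /* the "methods" */
    int  (*is_won)(const void *state);
};

/* The generic solver only ever talks to the interface: */
int solve(struct game *g, int player) {
    g->reset(g->state);
    return g->best_move(g->state, player);
}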
Strict aliasing allows you to convert between a pointer to a struct and a pointer to its first member, since two objects exist at that address, one with the type of the struct and one with the type of the first member.
> A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning.
Yes, that's the workaround described by @spacedcowboy and @leni53. I should have clarified that my comment on undefined behaviour was specifically for the case of two structs that start the same (e.g. "int x; int y;") as opposed to composition.
Modulo any padding rules, I would be surprised if C didn't store them at the same offsets; the compatible-type specification[1] and equality operator[2] (also see footnote 109) would make implementation much harder if you tried to do it any other way. The way to work around that, of course, and to make sure that no matter what, the parent/child objects are compatible to the parent-only level, is to do something like:
// this structure defines a node of a linked list and
// it only holds the pointers to the next and the previous
// nodes in the linked list.
struct list_head {
    struct list_head *next; // pointer to the node next to the current one
    struct list_head *prev; // pointer to the node previous to the current one
};

// list_int holds a list_head and an integer data member
struct list_int {
    struct list_head list; // common next and prev pointers
    int value;             // specific member as per implementation
};

// list_str holds a list_head and a char * data member
struct list_str {
    struct list_head list; // common next and prev pointers
    char *str;             // specific member as per implementation
};
Often the 'parent' structure would have an 'int objType', which can be switch'd on, to make sure that the receiving function knows what to do. I'm not really seeing any undefined behaviour here.
I'm pretty sure this technique is decades old, btw. I know at one point the linux kernel used it, not sure if it still does..
If you're happy with being C11-compliant, then remove any names for the 'parent' structure in the child structures, making them "anonymous structs" at which point they are [3] considered to be part of the same struct as the parent.
Even if the fields are stored in the same offsets, there are modern optimisations that rely on "strict aliasing" and that would cause issues (unless the workaround you described is used).
It may be strictly UB, but given that any operating system written in C relies on this behavior being well-defined (e.g. the container_of macro widely used in the Linux kernel) you're probably pretty safe.
Note that there is sort of an active war between the OS folks, who are probably the main users of pure C nowadays, and the UB nazis among the compiler folks who are mostly worried about efficiently optimizing complex template code in C++ and don't care whether their computers continue to run :-)
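For reference, the container_of mentioned above usually looks something like this (a simplified version of the kernel's macro, without its extra type checking):

#include <stddef.h>

/* Given a pointer to a member, recover a pointer to the enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))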
The linux kernel compiles with no strict aliasing, so kernel code is unaffected. In userspace this is still dodgy, strict aliasing does affect generated code, and correctness of code that relies on UB this way.
A pragmatic solution would be attributes that allow declaring that certain pairs of types are allowed to alias each other. It would be even better if the C and C++ standards provided facilities for this, although it can be challenging to rigorously fit into the object model.
> I get an int* value which has 5 ints allocated at it.
Not crazy about this array explanation. Better wording would be: "I get a memory address that points to the first byte of a chunk of memory that is large enough to hold 5 ints and tagged as int"
> Essential compiler flags
Nitpick, but this is only true if you are on a gcc/clang platform.
> If you want to “return” memory from a function, you don’t have to use malloc/allocated storage; you can pass a pointer to a local data
It should be specified that the memory must be passed in by the caller for this to work. You can't create a var in a called function and return a pointer to it (well, you can, but you get undefined behavior, as that memory is up for grabs after the function returns).
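That is, something like this sketch: the caller owns the storage, the callee just fills it in.

/* Fine: the caller provides the storage. */
void get_data(int *out) {
    *out = 42;
}

/* Broken: returns the address of a local that stops existing on return. */
int *get_data_badly(void) {
    int local = 42;
    return &local;   /* dangling pointer; using it is undefined behaviour */
}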
> Integers are very cursed in C. Writing correct code takes some care.
Cursed how? No explanation.
Overall this article is not very good. I would add these to the lists:
General resources: Read the K&R book. Read the C spec. Be familiar with the computer architectures you are working on.
Good projects to learn from: Plan 9. Seriously. It's an entire OS that was designed to be as simple as possible by the same people who made Unix and C.
One does not follow from the other. You can declare your stack variable anywhere in the function and still have the same stack layout. It is up to the compiler/language designer (I am not talking about C per se).
The most important thing I had to learn when I started on hardcore ANSI C projects (large embedded business critical application) was to really learn to build abstractions.
C has few tools to help you with this and so it is super important to get the most of what is available.
Another lesson is that idioms are very useful in C and save a ton of time. For most or all repeatable tasks, there should be conventions for how to implement them and how to use them.
These are useful in any programming language and environment but I think are especially useful in C where you are on your own when structuring your application.
The reason you use specific sizes for types (int8, int16, uint16, etc) is so you know how to read/write them when you move your data between platforms.
x86 is little endian. ARM apparently can be configured to be either.
In real code there should be readXXX and writeXXX functions that read/write data from disk/network and do the byte swapping in there.
You could also just convert everything to JSON, but you're trading space for complexity/time.
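A sketch of what those readXXX/writeXXX helpers usually look like, with the on-disk/wire format pinned to little-endian so the same code is correct on any host:

#include <stdint.h>

// Read/write a 32-bit value in little-endian byte order, independent of
// the host's native endianness.
static uint32_t read_u32_le(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

static void write_u32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}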
I would add to this:
1. Visual Studio debugger is really, really good and you should use it to step through your program (or equivalent IDE based workflow).
2. Learn to compile your code with memory safeguards and use the tooling around them. Specifics depend on the platform: on POSIX, AddressSanitizer is good, I hear; on Windows you can use the debug CRT and gflags to instrument your binary really well.
People should be using IDEs with modern debuggers anyways.
Being able to step through your code one line at time and instantly see the values of all your variables is going to be FAR more effective than adding a bunch of print statements, no matter what language you're using.
It always blows my mind to hear about how many engineers don't know how to use their IDE's debugger and don't know what a breakpoint is.
Perhaps there is not enough training available on these tools. Raw terminal GDB is also not very user-friendly, so if someone's experience comes from that, it might be discouraging.
It really should be a standard part of any school curriculum, book, website, or any other media that people consume to learn programming. Debugging is such an integral skill to software engineering that it's beyond inexcusable that it's not taught.
Graphical debuggers aren't even hard to use. You can learn PyCharm's in an hour. Learn how to set a break point, examine local variables, learn the different step buttons, and view the function call stack and the local variables at each level of the stack.
Heck, maybe people wouldn't struggle with recursion so much if they were taught how to examine the call stack using the debugger, showing what happens when function A calls B which calls C, examine the local variables at each level of the stack, including an example where A and B both have a local variable called "x", and then note that function A calling itself is not a special case and adds another level to the stack with a new "x".
Without knowing how to use a debugger, learning to program feels like programming a black box. Sure, a bunch of "print" statements help, but nothing beats stepping through code line-by-line.
After more than a decade without writing any C code, I'm currently reading "C Programming: A Modern Approach" by K. N. King (http://www.knking.com/books/c2/) and I find it very good. I think it's a better modern alternative to K&R.
I wouldn't recommend fixed-size integers in general. Most of the time you want size_t or the lesser-known, signed ptrdiff_t. More often than not, integers are array sizes or indices, so size_t is the best type to use. In other cases, int is practically the same as int32_t and long long the same as int64_t except for really exotic platforms.
I've been burnt enough by varying sizes that I don't care if there's a performance impact anymore. Consistency and reliability are the main things I care about.
Well, then you should always use 64-bit types. Using uint32_t instead of size_t on a 64-bit platform will bite you eventually. size_t is also used extensively and for good reason in the standard library. Code like the following is a disaster waiting to happen:
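(The example seems to have been lost here; the following is a sketch of the kind of thing presumably meant, with a made-up copy_prefix function:)

#include <stdint.h>
#include <string.h>

// On a 64-bit platform len can exceed UINT32_MAX, so n silently wraps
// around and the copy is done with the wrong size.
void copy_prefix(char *dst, const char *src, size_t len) {
    uint32_t n = len;     // silent truncation; no diagnostic without -Wconversion
    memcpy(dst, src, n);
}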
I haven't written actual C code in decades. Did C11 get magic statics with thread-safe initialization like C++11? Does C even have non-trivial initialization of static local variables?
I don't know why it would be magic to have thread-local storage. Thread-local variables are either static or extern. Thread-local static variables are initialized like normal static variables (initialization on declaration line occurs on first instantiation), but with a separate copy per thread.
N1570, sec. 6.2.4, para. 4: An object whose identifier is declared with the storage-class specifier _Thread_local has thread storage duration. Its lifetime is the entire execution of the thread for which it is created, and its stored value is initialized when the thread is started. There is a distinct object per thread, and use of the declared name in an expression refers to the object associated with the thread evaluating the expression.
N1570, sec. 6.7.9, para. 10: If an object that has static or thread storage duration is not initialized explicitly, then [it is initialized to NULL or 0 as appropriate, including all-bits-zero padding in structs]
Not thread-local storage. Plain function statics are initialized on first use in C++ (if the initializer is not trivial), and this initialization is thread-safe. There is a simple algorithm to implement it that has very little overhead in the already-initialized case, known as "magic statics" (it is basically an instance of the double-checked locking pattern).
C does have initialization of local variables, including static.
It does not have any thread safety in the standard for these, however, so no thread-safe singletons. (You can get them initialized safely by the linker or runtime, of course, e.g. ELF .bss and the UCRT on Windows.)
You can use atomics to implement such a singleton if they are available.
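Rather than hand-rolling the double-checked locking with atomics, C11's optional <threads.h> already provides call_once for exactly this; a sketch of a lazily initialized singleton:

#include <threads.h>

static int *the_singleton;
static once_flag the_singleton_once = ONCE_FLAG_INIT;

static void init_the_singleton(void) {
    static int value = 42;    // constant-initialized static, safe by itself
    the_singleton = &value;   // the non-trivial part runs exactly once, thread-safely
}

int *get_singleton(void) {
    call_once(&the_singleton_once, init_the_singleton);
    return the_singleton;
}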
This is one of the main reasons why C is more portable than C++.
My college taught us Pascal and x86 asm before teaching us C. I think that was perfect because "bookending" it with a high-level language and a low-level one helped put C in perspective nicely. Knowing asm definitely helped to demystify pointers in C, which is usually a stumbling block for novice programmers.
It's only UB if the original variable is declared with a const type. If you convert a modifiable T * into a const T *, then cast it back into a T *, then you can modify it without any issues.
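i.e. something like:

int x = 0;
const int *cp = &x;
*(int *)cp = 1;           // fine: the object x is not actually const

const int y = 0;
const int *cq = &y;
// *(int *)cq = 1;        // undefined behaviour: y itself is const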
I still remember when a co-worker told me that the biggest problem with C is that programmers are terrible at memory management. Given the number of memory corruption bugs I encountered in 27 years of working with C, I have to say that rings true.
> -Werror to turn warnings into errors. I recommend always turning on at least -Werror=implicit, which ensures calling undeclared functions results in an error(!)
"Implicit declarations" is such a frustrating "feature" of C. Thankfully, in more recent clang builds this warning is enabled by default.
This is a good blog, although I wonder: is there any good, up-to-date, free online resource to learn C for an experienced programmer (I learned C many years ago, but never used it seriously and have forgotten much of it)? I searched around; the results are either for beginners or seem out-dated.
> The sizes of arrays are not known in any useful way to C. When I declare a variable of type int[5] in a function, I don’t get a value of type int[5]; I get an int* value which has 5 ints allocated at it. Since this is just a pointer, the programmer, not the language, has to manage copying the data behind it and keeping it valid.
This is not quite correct. Assuming arrays == pointers is usually true enough, but it isn't actually true. This[1] SO thread has some useful information in it, but the TLDR is that actual arrays are treated differently than pointers. You do get an object of type "int array" rather than "int pointer" when you do int[5].
The compiler does know the size of an array allocated like T a[n]. It does not, however, do any bounds checking for you.
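A quick way to see the difference (a sketch assuming 4-byte int and 8-byte pointers):

#include <stdio.h>

void f(int a[5]) {                  // adjusted by the compiler to: int *a
    printf("%zu\n", sizeof a);      // 8: just a pointer inside the function
}

int main(void) {
    int a[5];
    printf("%zu\n", sizeof a);      // 20: the real array type int[5]
    f(a);                           // decays to int * at the call
    return 0;
}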
I started learning C around when ANSI C came out, and learned much of this in self-defense. I'm glad I decided to learn C++ in recent years; it has fixes for so many things (like pass by reference instead of passing pointers, const values that can be used to define array sizes, though it's better to use vectors anyway, etc.), but that's off topic.
A few things I didn't see mentioned:
Add multiple-inclusion guards to every header file you write; it saves you from multiply-defined errors and such:
file mygreatheaderfile.h:
#ifndef MYGREATHEADERFILE_H
#define MYGREATHEADERFILE_H
/* insert usual header file content here */
#endif /* #ifndef MYGREATHEADERFILE_H */
Most (all?) compilers have a "don't include this file more than once" preprocessor directive (#pragma once), but from what I've seen it's nonstandard, whereas the above method always works.
If I have a "complete program" with a main function and other functions in one source file, I put main() at the end and put all functions in the order they are called, that way there's no need for function prototypes (except for recursion) as there would be if main() is the first function in the file. None of the C books I've read said you could do this, but when I figured it out I thought yeah, it's just like Pascal and assembly, you have to define something before you use it, but you can make the first occurrence be the function definition and not have to have a separate prototype.
As for naming and capitalizing, as the document said, there's no standard/convention of camelCase vs. snake_case, but all macro names using #define are by convention in ALL_CAPS. That way it's easy to tell a MAX(x, y) macro from a max (x, y) function, and you can eventually learn why never to write such perverse things as MAX (x++, y++). Trace through the expansion to see why (and see why it's better to use a function instead, or in C++ a template):
#define MAX(x,y) x>y?x:y
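Tracing that expansion through: MAX(x++, y++) becomes x++ > y++ ? x++ : y++, so the larger of the two arguments gets incremented twice. The missing parentheses bite even in innocent uses: MAX(a, b) & 1 expands to a > b ? a : b & 1, which parses as a > b ? a : (b & 1).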
Equals comparison/assignment and if statements: One of the most common and insidious errors in C is accidentally doing an assignment (=) instead of a comparison (==). Modern C compilers (the ones with integrated C++ compilers, see below) will give a warning when they see this, but still, if one of the operands is a constant, put the constant on the left, so you get an ERROR if you accidentally try to assign to the constant, as in if (5 = n), instead of what may feel natural but be wrong (and compile fine with an old compiler!), if (n = 5). There are other gotchas like this, but I can't think of them all, and there are probably too many to post here anyway. I do see "undefined behavior" discussed. Be sure to make backups before running your code.
If you need to do maintenance using some original C compiler for an embedded controller from 30 years ago (or indeed modern C as is still popular in embedded systems), you really need to know all these ins and outs, and I might be convinced to help for an appropriately high hourly amount, but virtually every C compiler nowadays is part of a C++ compiler, and you can do much of this stuff in C++ using better code practices, resulting in fewer bugs.
typedef/enum/(_Generic)/etc. should go (and the function pointer declaration syntax should be fixed/cleaned up). Only sized primitive types (u32/s32, u64/s64, f32/f64, or udw/sdw, uqw/sqw, fdw/fqw...). We would have only one loop statement, "loop{}", and no switch. I am still thinking about "anonymous" code blocks for sub-scoping variables in linear code (those should probably just be small compile-unit-local functions). No integer promotion, no implicit casts (except for void*, and maybe for literals, as in Rust), with explicit compile-time/runtime casts (not with that horrible C++ syntax). Explicit compile-time vs. "scoped" runtime constants: currently C only has the "scoped" runtime kind, with some optimization passes detecting whether the constant happens to be compile-time. extern should be properly enforced for functions, please (i.e. write proper headers with proper switches), and maybe "local" instead of "static" for compile-unit-only functions, since all functions in a compile unit are "static" anyway, like global variables.
"Non-standard": ban attributes like binary format visibility (let the linker handle the fine details), packed (if your struct is packed and not compatible with the C struct, it should be byte defined, with proper pointer casts), etc.
Fix the preprocessor's variable-argument macros for good (right now it is a mess because of the gcc way vs. the ISO C++ way; I guess we would stick to the gcc way).
With the preprocessor and some rigorous coding, we should be able to approximate this with a "classic" C compiler, since we are mostly removing stuff. I was told that many "classic" C compilers can emit optional warnings about things like implicit casts and integer promotions.
In theory, this should make writing a naive "C-" compiler much easier than writing a "classic" C compiler and foster real-life alternatives.
I am sure some of what I said is plain broken, as I have not written a C compiler.
I wonder how far Rust is from this "C-", as it is expected to have a much less rich and complex, but more constrained, syntax than C.
I sympathize a bit with the dislike of enum. We could also probably get away with replacing typedefs with either #define (for simple types) or one-field structs (for arrays and function pointers; the latter usually need a void* to go along with them anyway).
There are reasonable low-level optimizations you can do that switch is needed for. You can have cases that start or end around blocks in non-hierarchical ways. This makes it similar to a computed goto.
This is a very slippery slope (one gcc/clang [LLVM] have already slid down): adding tons of attributes/keywords/"syntax features" which give the compiler semantic hints in order to perform "better" optimizations. There is no end to it; it is a toxic spiral of infinite planned obsolescence (and now ISO is doing the same).
There is also this other thing: extreme generalization and code factorization, to the point where we would have no clue what the code actually does without taking in the entirety of the code along with its "model". It has reached a pathological level with C++.
And the last, but not the least: OS functions are being hardcoded in the syntax of the language.
If you push all those points further, they kind of converge: compilers will have keywords specific to each performance-critical syscall, significant library function, and significant data-structure abstraction (some abstraction becomes too much very fast, as I said before). There is an end game, though: directly coding assembly.
There's a lot in "C", but I don't see anything in it that really isn't needed to serve its modern purpose.
In 2022 "C" is used as a portable assembly language. When you really need to control where and how memory is allocated, and represent data structures used directly by the hardware in a high-level language.