Coincidences... I thought "How come Warren Toomey (one of the guys of the Unix Heritage Society [1]), has never posted this?"
Turns out, this GitHub repo is just a mirror/copy of his work, but with attribution [2]. Still worth reading through there; TUHS also stores some extremely old UNIX versions.
Edit2: Now that I've thought a little bit about it, I'm not happy that the sources are on GitHub in this form. This is Warren's work - he did a lot of work in getting these tapes to work again, and "mortdeus" just copied the work and didn't even change the folder-names - "last1120c" is the first tape, "prestruct" the second. And you still need Warren's Apout emulator to get these files to work.
> Edit2: Now that I've thought a little bit about it, I'm not happy that the sources are on GitHub in this form.
If one releases code under an open license and people use/copy/fork it, why should you or the original author be unhappy? (As long as the license terms are not being broken.)
If we had to worry about this each time we forked/cloned someone's work, it would make code reuse very hard.
[Edit]
I am working on some Python charting and found Pychart interesting: http://home.gna.org/pychart/ . Because it is in bzr, which I don't work with, I put it on GitHub. Should I be worried that someone will feel offended by it?
If you're within the terms of the license, you're fine. The developer should say so in the license (or choose another license) if they don't want it reproduced anywhere. A link to the original is common courtesy, however.
In the case of the C compiler, I think the modifications are under the same license as the original, but I'm not totally certain. If they aren't it would be cause for concern.
This is the compiler's printf(), not the one that gets linked in with compiled code. As long as the compiler itself doesn't have more than 9 substitutions, it's OK. Your code will link with a different one.
I'm ashamed for having to google his name but for others like me here's a glimpse:
"Dennis MacAlistair Ritchie (born September 9, 1941; found dead October 12, 2011) was an American computer scientist who "helped shape the digital era." He created the C programming language and, with long-time colleague Ken Thompson, the Unix operating system"
To be honest, I don't remember much of a hullabaloo when either Dennis Ritchie or John McCarthy died. There was a post or two on each event, but nothing like when Steve Jobs died (when literally the entire front page was posts about his death).
(All three events were in October 2011 - Jobs, then Ritchie, then McCarthy)
Somehow over the years learning C on my own, perhaps because it seemed more consistent to me, I got used to writing
    if(foo) {
omitting the space after control statements' names, which is almost never done - except in very early C code. (It's not just this: V7 Unix often omits the space as well.) I should probably settle on a less idiosyncratic personal style, but I can't help but take a little heart from seeing "my" style in such a famous codebase. :)
This is all very well if you have a friendly, high level scripting language like Ruby, but I'm definitely glad I don't have to write a C compiler where functions can take arbitrary blocks of code.
> In an appropriately powerful language, it could be a function call.
> This is all very well if you have a friendly, high level scripting language like Ruby, but I'm definitely glad I don't have to write a C compiler where functions can take arbitrary blocks of code.
That's a surprisingly good setup, because in Smalltalk (one of Ruby's main ancestor languages) if/else is a method which takes a block closure:

    a ifTrue: [ l log: 'a is true' ] ifFalse: [ l log: 'a is false' ]

while in Ruby, if/else is a syntactic construct:
    if a
      l.log('a is true')
    else
      l.log('a is false')
    end
probably more for perceived clarity and comfort than for speed's sake.
I'm fairly certain that shadowing a keyword with a macro is forbidden by the standard (or invokes UB?) but this will still work for most compilers, provided LET is defined reasonably. Hm, something like
Shadowing keywords is perfectly valid, although I believe you can't shadow keywords in standard headers so that compiler writers don't need to worry about it. It gives rise to some "useful" C/C++ features that should never be used, like if you want to access private members in foo.cpp, do
    #define private public
    #include "foo.cpp"
    #undef private
Why should a developer have to have headspace for a special class of construct? And what difference does it make in practice whether it is a keyword or a function - why do you think it's so important that they are obviously different in the code - does their being part of a special category have any practical implications?
Recently some ignorant, nescient C++ programmer called me a Java programmer, because I use K&R style and not his preferred Allman style. Even Stroustrup uses hanging braces (with the exception of function body opening braces)!
tabs are an infinite source of pain and inconsistencies...
Everyone must support the space character; it cannot be banned. But a simple commit hook to ban tabs can keep indentation and alignment from getting messed up over time with many collaborators using default-configured editors (which mess up and use tabs for alignment).
The inventor of the tab character is on my list of people to assassinate when they're young if I ever get access to a time machine, the other people on the list being Hitler and Charles Douglass.
The tab character is a nice idea, but it does not seem to have worked out at all. I'd much rather have syntax-aware indenting in the editor, now that available compiler technology and CPU power make it practical.
I don't understand what's so bad about tabs. With tabs, all you have to do is adjust the settings in your editor, and each programmer gets the indentation (s)he wants: 2, 4, 8, whatever...
Also, do you really want to be pressing the space bar twelve times, instead of tabbing three times?
I've seen the results of spaces-only: inconsistent, sloppy indentation, 7 or 9 spaces instead of 8, as long as it looks "indented" enough.
How, specifically does the tab character "not seem to have worked out at all"?
I'm not saying you're wrong for using spaces, just wondering...
The tab key is great. I use it all the time. Of course I don't want to press the space bar twelve times instead of tabbing three times. Of course I don't want sloppy indentation with 7 or 9 spaces instead of 8. Is this an actual problem? I've literally never seen either one in ~25 years of using languages that need indentation. The editor takes care of it.
The problem with the tab character is that there's no standard for how wide they're supposed to be, and so everyone uses them differently. Sure, in theory you use one tab character to indicate one level of indentation, and everybody can be happy. In practice, they're often not used that way. People will use two-space tabs with four-space indentation, using two tab characters to indent. People will use eight-space tabs with four-space indentation, using four spaces for one level of indentation, a tab character for two levels, a tab character followed by four spaces for three levels, etc. Some people just blithely mix and match for no particular reason. Put either one into an editor with different tab settings and it explodes into a huge mess.
While you might do so, anyone you collaborate with probably will not. Why? Because almost every code editor in existence comes with that option turned off, and many of them don't even have that option at all. If you expect your co-developers to manually adjust the alignment to be spaces rather than tabs on each line they edit, when their editor doesn't even distinguish tabs from spaces, oh boy.

Tabs just don't work out in a collaborative environment. It is too complicated for a heterogeneous editing environment to get correct, so no such environment gets it correct in practice, and code ends up a huge mess.
I am just thinking: why not solve this at the SCM level? E.g., make a git plugin that checks out the sources indentation-independent and only shows changes to the contents of the file, not the indentation.
You could even make a plugin that works without collaboration by others: check out the file in your preferred style and transform the changes back into the original style.
I totally agree. I seem to recall an experimental clang-based git plugin for this floating around. I think it even handled certain differences in naming convention.
We'll probably get there in a few more years, as we get comfortable with adding more power to these systems, and taking advantage of compiler technology for things other than generating code.
Indentation, yes. Braces, no. He uses K&R style because he was the R (obviously). But nothing new is being written in K&R C these days; ANSI C (C99/C11, or C89 if you're stuck on cl) is favoured instead, usually with Allman/KNF style or a variation thereof.
Looks like a very early dialect. Early C assumed everything was an int unless specified otherwise, and parameter types could be declared after the function name. So:
    init(s, t)
    char s[]; {
would be equivalent to:
    int init(char s[], int t) {
This still works with modern compilers.
I'd be interested if anyone has any more info about this:
A second, less noticeable, but astonishing peculiarity is the space allocation: temporary storage is allocated that deliberately overwrites the beginning of the program, smashing its initialization code to save space. The two compilers differ in the details in how they cope with this. In the earlier one, the start is found by naming a function; in the later, the start is simply taken to be 0. This indicates that the first compiler was written before we had a machine with memory mapping, so the origin of the program was not at location 0, whereas by the time of the second, we had a PDP-11 that did provide mapping. (See the Unix History paper). In one of the files (prestruct-c/c10.c) the kludgery is especially evident.
Doh, I completely glossed over the readme and went straight to the code. That makes sense -- Thanks!
Cool to think that that waste function can still compile with today's compilers. A quick disassembly shows it takes up 751 bytes compiled on x64 using clang at -O0.
ANSI C didn't drop old-style (non-prototype) function declarations and definitions -- nor did C99 or C11. They've been officially obsolescent since 1989, but they're still fully supported by any conforming compiler.
Where, if you take the time, you will find a wonderful story; upon completing it, you will probably know more about the early, embryonic history of C than 95% of your peers.
(Spoiler - We start with BCPL, then Move to B - it's left as an exercise to determine how we originally compiled BCPL)
Uh, these commits seem interesting, but I don't quite understand the context. What's macho, and what is its relation to Go? Are the dates way off, or is it an import of older history?
Your leg is being pulled, a little. Mach-O is the Mach object file format, and Brian Kernighan's C 'hello world' program is being used as source for, I think, a test. It's checked into the Go tree with version history all the way back to its primordial source.
A compiler has to do a lot more than parse a source file and translate it to target code: error checking and reporting, optimization, etc.
To compile a C compiler, you don't need a full-blown C compiler. For instance, I bet floats and doubles are not used. Therefore, you can write a barebones proto-C compiler in whatever language you have available and use it to bootstrap your compiler. Rinse and repeat.
I understand bootstrapping, but at some point there has to exist some outside compiler in another language or a hand-compiled version of this otherwise the chicken and egg chain never ends.
The very first compilers were tediously written in assembly. Bill Gates wrote his first version of BASIC in assembly. I believe all early Fortran compilers were written in direct machine code through punch cards! Coding in assembly is not all that bad :). Once you have some compiler going, you can write a more complex compiler with it and keep adding more syntactic sugar :).
Coding in Assembly was a gas! I wrote applications in IBM BAL in the 70's. G/L, Payroll, Inventory. We did it in part because we had so little memory (typically 64K to 131K), we had to squeeze every drop we could out of available memory.
I worked on Univac 9400s. We received the O/S in source code form (Assembly) on tape. We ran it through a parametizer (PROC), compiled the resulting source, and that's what the customer ran with.
You haven't lived until you've stepped through your code one instruction at a time, displaying op codes and raw binary data on the maintenance panel lights.
You can use any language available in the machine to create the first, most basic compiler. You can also use a cross-compiler if a compiler for the language already exists for another machine. If all else fails, you can write the proto-compiler in assembly.
One might write a compiler from language X to language Y first in assembly, then when that works, write a new compiler from language X to language Y in language X itself and use the previous compiler to compile it. Presumably they already had a C compiler that they used to get this one compiled, then it can compile itself afterwards.
The latter is a C compiler in about 1 KB of source code! It's quite functional and can compile itself.

The first link is what came out of it: a compiler so fast that it can boot Linux from source code in a few seconds:
http://bellard.org/tcc/tccboot.html
c0 is the first pass (C to intermediate), c1 is the second pass (intermediate to assembly), c2 is the optimizer. c0 is built from files beginning with 'c0', and so on.
Does anyone know what hardware the assembly language files are for?
Maybe you could produce a modified version with the archaic features removed, compile it with a modern compiler, then use the binary produced to compile an unmodified version. Or maybe there are still binaries of really old compilers that can understand this code floating around out there.
I remember being absolutely thrilled when function prototypes were introduced with the 'proposed' ANSI C standard (X3J11). The first implementation I got my hands on was Borland's Turbo C circa 1988.
The standard now known as C89/C90 had been in committee for many years before being 'released'. This didn't stop the tool vendors (like Borland) from supporting the 'proposed' standard much earlier than 1989.
Unfortunately, commercial UNIX vendors (like HP, in my case) were very slow to adopt the standard and update the cc compiler in their distribution. This forced us to work in K&R for a good while after 1990, all the while grumbling that $150 MS-DOS compilers already 'had ANSI'.
It's a holdover from Fortran (and probably from before that). Whereas modern day we would define functions as
    int foo(int i, int j) {...}
in Fortran you would do (! is comment)
    function foo(i, j)
      integer :: foo   ! return type
      integer :: i
      integer :: j
      ! body goes here
Early C stuck to that style, so you would just put the names of the variables in the declaration, and then give them types before the body. The reason only argv is mentioned in that example is that C assumes a variable is an int if not declared otherwise, so there's no need to write "int argc;" the way "int argv[];" is written.
It's called a K&R style function definition, which was the way to do it back in the day. Basically, you give your parameter names first, then you declare the parameter types immediately after the function header but before the opening curly brace. It's definitely not recommended today, and it can result in undefined behavior if the arguments at a call site don't match the definition, since there's no prototype to check against. If you're working with legacy code, though, I'm pretty sure you can set some C compilers to allow for it.
To explain further:
    main(argc, argv)
    int argv[]; {
is equivalent to:
    int main(int argc, int argv[]) {
The old style definition works because C had a default type of int, so the type specifications for the function main and the parameter argc could be omitted.
As for int argv[]? What that actually represents is an array of memory addresses that hold the command line arguments given. Obviously this becomes a problem if you're on a 64-bit system, where int and void * are two different sizes. However, I checked this out on my 64-bit machine and it works just fine:
    #include <stdio.h>

    int main(int argc, unsigned long long argv[]) {
        char *firstarg = (void *)argv[1];
        printf("%s", firstarg);
    }
which, given "./a.out pickles" will print "pickles" (argv[0] gives the memory address of the cstring "./a.out"). I'm guessing that, in the case of a compiler, the memory addresses of arguments are more relevant to have than the arguments themselves.
The 1999 ISO C standard dropped the "implicit int" rule, so this:
    main(argc, argv)
    char *argv[];
    {
        /* ... */
    }
is illegal (strictly speaking, it's a "constraint violation"). Note that it's
    char *argv[]
not
    int argv[]
But this:
    int main(argc, argv)
    int argc;
    char *argv[];
    {
        /* ... */
    }
is still perfectly valid.
As for this:
    int main(int argc, unsigned long long argv[]) {
        char *firstarg = (void *)(argv[1]);
        printf("%s", firstarg);
    }
it's not a constraint violation, but its behavior is undefined (unless your compiler specifically supports and documents that particular form as an extension).
The compiler code states int argv[], not char *argv[] (I assumed this is why the OP asked for clarification in the first place, since char *argv[] is much more common).
You're right, in theory this is undefined behavior, but in practice on a 32-bit system, sizeof(int) will almost always be equal to sizeof(void *). I was just demonstrating how one could recreate the code in the compiler while on a 64-bit architecture.
Yes it is. Originally that's how you specified parameters: just the name in the parentheses, then (optionally) the types in a format similar to variable declarations before the function body. If you didn't specify a type it would default to int, so argc above would be an int.
All modern C compilers still accept this style for backwards compatibility. I'm not sure about C++ compilers.
Function declaration for main. The type declarations for the function parameters default to int, but can be specified outside of the parens before the curly brace.
The int argv[]; is where the declaration for argv is happening; it is being declared as an array of ints, which are used here to hold pointers.
You can also see elsewhere in the code where they are passing pointer addresses (as int params) into functions and then using the address to build pointers referencing that data.
I don't know if it's still valid by the spec, but it's still supported by some compilers. There was a post on Vim reaching 7.3.0.1000 the other day, and that codebase still uses that style of parameter typing.
[1] www.tuhs.org
[2] http://cm.bell-labs.com/cm/cs/who/dmr/primevalC.html
Edit: Warren has written a paper on restoring ancient UNIX versions and C compilers; you might like it [3]
[3] http://epublications.bond.edu.au/infotech_pubs/146/