Hacker News
Dennis Ritchie's first C compiler on Github (github.com/mortdeus)
288 points by jnord on May 22, 2013 | 85 comments


Coincidences... I thought "How come Warren Toomey (one of the guys of the Unix Heritage Society [1]), has never posted this?"

Turns out, this Github-repo is just a mirror/copy of his work, but with attribution [2]. Still worth reading through there, tuhs also stores some extremely old UNIX versions.

[1] www.tuhs.org

[2] http://cm.bell-labs.com/cm/cs/who/dmr/primevalC.html

Edit: Warren has written a paper on restoring ancient UNIX versions and C-compilers, you might like it [3]

[3] http://epublications.bond.edu.au/infotech_pubs/146/

Edit2: Now that I've thought a little bit about it, I'm not happy that the sources are on GitHub in this form. This is Warren's work - he did a lot of work in getting these tapes to work again, and "mortdeus" just copied the work and didn't even change the folder-names - "last1120c" is the first tape, "prestruct" the second. And you still need Warren's Apout emulator to get these files to work.


> Edit2: Now that I've thought a little bit about it, I'm not happy that the sources are on GitHub in this form.

If one releases code under an open license and people use/copy/fork it, why should you or the original author be unhappy? (As long as the license terms are not being broken.)

If we had to worry about this each time we forked/cloned someone's work, it would make code reuse very hard.

[Edit]

I am working on some Python charting. I found Pychart, which I found interesting: http://home.gna.org/pychart/ . Because it is in bzr which I don't work with, I put it on github. Should I be worried that someone will feel offended with it?


If you're within the terms of the license, you're fine. The developer should say so in the license (or choose another license) if they don't want it reproduced anywhere. A link to the original is common courtesy, however.

In the case of the C compiler, I think the modifications are under the same license as the original, but I'm not totally certain. If they aren't, it would be cause for concern.


Reading Dennis Ritchie's code is as close to reading a religious text as I'll ever come. The straightforward elegance of it is so inspiring!


The lack of type noise makes it easier to read too.


I can't disagree more! Everything is an int -- Basically untyped, in C terms! Maybe less noisy, but hard to figure out what's going on.


The declaration of printf is both scary and pretty cool.

What happens when you have more than 9 substitutions specified in the string? :D

edit: decided the code was a bit long to have pasted into my post. Can find it at the bottom of http://cm.bell-labs.com/cm/cs/who/dmr/last1120c/c03.c


This is the compiler's printf(), not the one that gets linked in with compiled code. As long as the compiler itself doesn't have more than 9 substitutions, it's OK. Your code will link with a different one.
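
For comparison with the fixed nine-slot approach discussed above, here is a minimal sketch of how a variable number of substitutions is usually handled in later C, using <stdarg.h>. This is not the compiler's code: mini_format is a hypothetical name, it writes into a caller-supplied buffer instead of producing output, and it supports only %d, %s, %c and %%.

```c
#include <stdarg.h>
#include <stddef.h>

/* mini_format: hypothetical name, not taken from the compiler sources.
 * Stores at most cap-1 characters plus a terminating NUL in out and
 * returns the number of characters stored. */
size_t mini_format(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    size_t n = 0;

    if (cap == 0)
        return 0;

    va_start(ap, fmt);
    while (*fmt != '\0') {
        char c = *fmt++;

        if (c != '%') {                 /* ordinary character */
            if (n + 1 < cap)
                out[n++] = c;
            continue;
        }
        c = *fmt;
        if (c == '\0')                  /* lone '%' at the end: stop */
            break;
        fmt++;
        if (c == 'd') {
            int v = va_arg(ap, int);
            /* Work on the magnitude as unsigned so INT_MIN is safe. */
            unsigned int u = v < 0 ? 0u - (unsigned int)v : (unsigned int)v;
            char digits[24];
            int i = 0;

            do {                        /* digits come out in reverse */
                digits[i++] = (char)('0' + u % 10);
                u /= 10;
            } while (u != 0);
            if (v < 0)
                digits[i++] = '-';
            while (i > 0 && n + 1 < cap)
                out[n++] = digits[--i];
        } else if (c == 's') {
            const char *s = va_arg(ap, const char *);
            while (*s != '\0' && n + 1 < cap)
                out[n++] = *s++;
        } else if (c == 'c') {
            char ch = (char)va_arg(ap, int);
            if (n + 1 < cap)
                out[n++] = ch;
        } else {                        /* "%%" or unknown: copy literally */
            if (n + 1 < cap)
                out[n++] = c;
        }
    }
    va_end(ap);
    out[n] = '\0';
    return n;
}
```

Calling mini_format(buf, sizeof buf, "%s has %d passes%c", "cc", 3, '!') leaves "cc has 3 passes!" in buf. The number of substitutions is limited only by what the caller passes, not by a fixed parameter list.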


I'm ashamed for having to google his name but for others like me here's a glimpse:

"Dennis MacAlistair Ritchie (born September 9, 1941; found dead October 12, 2011) was an American computer scientist who "helped shape the digital era." He created the C programming language and, with long-time colleague Ken Thompson, the Unix operating system"

http://en.wikipedia.org/wiki/Dennis_Ritchie


You should have seen HN when he died. It was that hell week when a bunch of early computer pioneers all passed away...


To be honest, I don't remember much of a hullabaloo when either Dennis Ritchie or John McCarthy died. There was a post or two on each event, but nothing like when Steve Jobs died (when literally the entire front page was posts about his death).

(All three events were in October 2011 - Jobs, then Ritchie, then McCarthy)


He already used the right brace style (hanging braces, cuddled else) and the right indentation (tabs, not spaces).


Somehow over the years learning C on my own, perhaps because it seemed more consistent to me, I got used to writing

    if(foo) {
omitting the space after control statements' names, which is almost never done - except in very early C code. (It's not just this: V7 Unix often omits the space as well.) I should probably settle on a less idiosyncratic personal style, but I can't help but take a little heart from seeing "my" style in such a famous codebase. :)


It was also defined in the Plan 9 code style: http://plan9.bell-labs.com/magic/man2html/6/style

"no white space after the keywords if, for, while, etc."


    if(foo)
I avoid doing this because it looks like a function call.


In an appropriately powerful language, it could be a function call.


This is all very well if you have a friendly, high level scripting language like Ruby, but I'm definitely glad I don't have to write a C compiler where functions can take arbitrary blocks of code.


> In an appropriately powerful language, it could be a function call.

> This is all very well if you have a friendly, high level scripting language like Ruby, but I'm definitely glad I don't have to write a C compiler where functions can take arbitrary blocks of code.

That's a surprisingly good setup, because in Smalltalk (one of Ruby's main ancestor languages)

if/else is a method which takes block closures.

    a ifTrue: [ l log: 'a is true' ] ifFalse: [ l log: 'a is false' ]
while

in Ruby if/else is a syntactic construct

    if a
        l.log('a is true')
    else
        l.log('a is false')
    end
probably more for perceived clarity and comfort than for speed's sake.



after

  #define if(X) LET(_it,X) if (_it)
it would be a macro call


I'm fairly certain that shadowing a keyword with a macro is forbidden by the standard (or invokes UB?) but this will still work for most compilers, provided LET is defined reasonably. Hm, something like

     #define LET(name, X) \
       for (int let_once_=1, name=(X); \
            let_once_; \
            let_once_=0)
may be reasonable, though tokenpasting in __LINE__ for good measure might be necessary for nesting.
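
A self-contained sketch of the LET idea, without shadowing the if keyword. IF_LET, nonzero_remainder and sum_pair are hypothetical names added for illustration, the bound name is always an int, and C99 or later is assumed for the declaration inside the for statement.

```c
/* LET: bind `name` to the value of X for exactly one execution of the
 * statement that follows. let_once_ is the flag that stops the loop. */
#define LET(name, X) \
    for (int let_once_ = 1, name = (X); let_once_; let_once_ = 0)

/* IF_LET: hypothetical name; binds the value, then tests it like `if`. */
#define IF_LET(name, X) LET(name, X) if (name)

/* Returns the remainder of a / b, or -1 when the remainder is zero. */
int nonzero_remainder(int a, int b)
{
    IF_LET(r, a % b)
        return r;       /* r is only in scope inside this statement */
    return -1;
}

/* Nested use: each LET introduces its own scope. */
int sum_pair(int x, int y)
{
    int total = 0;

    LET(a, x * 2)
        LET(b, y + 1)
            total = a + b;
    return total;
}
```

In this sketch the nested LET compiles because the inner let_once_ simply shadows the outer one, though a compiler may warn about the shadowing. One caveat of the for-loop trick: a break or continue inside the body applies to the hidden loop, not to any enclosing one.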


Shadowing keywords is perfectly valid, although I believe you can't shadow keywords in standard headers so that compiler writers don't need to worry about it. It gives rise to some "useful" C/C++ features that should never be used, like if you want to access private members in foo.cpp, do

     #define private public 
     #include "foo.cpp"
     #undef private


I can imagine some insidious menace adding this to the top of a file:

    #define public protected


I once saw someone jokingly do this (the second always works only in C, not C++)

    #define else
    #define struct union
Evil!
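
A small sketch of what the second define does in C, assuming the macro is defined after any standard headers are included and undefined again right away. pair_before, pair_after, size_lost and members_overlap are hypothetical names.

```c
#include <stddef.h>

struct pair_before { int a; int b; };  /* a real struct: two separate ints */

#define struct union
struct pair_after { int a; int b; };   /* silently becomes a union */
#undef struct

/* Size difference between the two declarations, in bytes. */
size_t size_lost(void)
{
    return sizeof(struct pair_before) - sizeof(union pair_after);
}

/* Returns 1 because a and b now share the same storage. */
int members_overlap(void)
{
    union pair_after p;

    p.a = 1;
    p.b = 2;            /* overwrites a as well */
    return p.a == 2;
}
```

The code still compiles, which is what makes the prank nasty: nothing fails until two members that were supposed to be independent start overwriting each other.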


Which is exactly why you don't do it in a language where it isn't.


Why should a developer have to have headspace for a special class of construct? And what difference does it make in practice whether it is a keyword or a function - why do you think it's so important that they are obviously different in the code - does their being part of a special category have any practical implications?


Never using a space before the paren is more consistent. I don't know why people put a space there...


I don't want my function calls to be consistent with control structures.


I don't recall ever mistaking function calls with control structures.


That's the way you do it in normal text.


Recently some ignorant, nescient C++ programmer called me a Java programmer, because I use K&R style and not his preferred Allman style. Even Stroustrup uses hanging braces (with the exception of function body opening braces)!


tabs are an infinite source of pain and inconsistencies...

Everyone must support the space character, it cannot be banned. But a simple commit hook to ban tabs can make indentation and alignment not get messed up over time with many collaborators with default-configured editors (that mess up and use tabs for alignment).


The inventor of the tab character is on my list of people to assassinate when they're young if I ever get access to a time machine, the other people on the list being Hitler and Charles Douglass.

The tab character is a nice idea, but it does not seem to have worked out at all. I'd much rather have syntax-aware indenting in the editor, now that available compiler technology and CPU power make it practical.


I don't understand what's so bad about tabs. With tabs, all you have to do is adjust the settings in your editor, and each programmer gets the indentation (s)he wants: 2, 4, 8, whatever...

Also, do you really want to be pressing the space bar twelve times, instead of tabbing three times?

I've seen the results of spaces-only: inconsistent, sloppy indentation, 7 or 9 spaces instead of 8, as long as it looks "indented" enough.

How, specifically, does the tab character "not seem to have worked out at all"?

I'm not saying you're wrong for using spaces, just wondering...


Don't confuse the tab character with the tab key.

The key is great. I use it all the time. Of course I don't want to press the space bar twelve times instead of tabbing three times. Of course I don't want sloppy indentation with 7 or 9 spaces instead of 8. Is this an actual problem? I've literally never seen either one in ~25 years of using languages that need indentation. The editor takes care of it.

The problem with the tab character is that there's no standard for how wide they're supposed to be, and so everyone uses them differently. Sure, in theory you use one tab character to indicate one level of indentation, and everybody can be happy. In practice, they're often not used that way. People will use two-space tabs with four-space indentation, using two tab characters to indent. People will use eight-space tabs with four-space indentation, using four spaces for one level of indentation, a tab character for two levels, a tab character followed by four spaces for three levels, etc. Some people just blithely mix and match for no particular reason. Put either one into an editor with different tab settings and it explodes into a huge mess.
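
To make the mismatch concrete, here is a minimal sketch that computes where a line's first non-blank character lands for a given tab width. indent_width is a hypothetical helper, and tab stops are assumed to fall at every multiple of tab_width (which must be nonzero).

```c
#include <stddef.h>

/* indent_width: hypothetical helper. Returns the visual column of the
 * first character of `line` that is neither a space nor a tab, with
 * tab stops every `tab_width` columns. */
size_t indent_width(const char *line, size_t tab_width)
{
    size_t col = 0;

    for (; *line == ' ' || *line == '\t'; line++) {
        if (*line == '\t')
            col += tab_width - col % tab_width;  /* jump to next tab stop */
        else
            col++;                               /* a space is one column */
    }
    return col;
}
```

With the "eight-space tabs, four-space indentation" scheme, level three is written as a tab followed by four spaces. indent_width("\t    x", 8) gives 12 as intended, but at tab width 4 the same line lands at column 8, while a level-two line written as a single tab lands at column 4, the same column as level one.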


Just use tabs for indentation and spaces for alignment.


While you might do so, anyone you collaborate with will probably not do that. Why? Because almost every code editor in existence comes with that option turned off. And many of them don't even have that option at all! If you expect your codevelopers to manually adjust the alignment to be spaces rather than tabs on each line they edit, when their editor doesn't even distinguish tabs from spaces, oh boy.

Tabs just don't work out in a collaborative environment. It is too complicated for a heterogeneous editing environment to get correct, so no such environment gets it correct in practice and code ends up a huge mess.


I am just thinking, why not solve this at the SCM level, e.g. make a git plugin that checks out the sources in an indentation-independent way and only shows changes to the contents of the file, not the indentation?

You could even make a plugin that works without collaboration by others: check out the file in your preferred style and transform the changes back into the original style.


I totally agree. I seem to recall an experimental clang-based git plugin for this floating around. I think it even handled certain differences in naming convention.

We'll probably get there in a few more years, as we get comfortable with adding more power to these systems, and taking advantage of compiler technology for things other than generating code.


Your comment is flamebait.


Indentation, yes. Braces, no. He uses K&R style because he was the R (obviously). But nothing new is being written in K&R C these days; people favour ANSI C (C99/C11, or C89 if you're stuck on cl), usually with Allman/KNF style or a variation thereof.


Is this written in C? If so, what compiled this? Sorry for the noob question, I'm just a little lost.


Looks like a very early dialect. C assumes everything is an int unless specified otherwise. You can declare parameter types after the function name. So:

  init(s, t)
  char s[]; {
would be equivalent to:

  int init(char s[], int t) {
This still works with modern compilers.

I'd be interested if anyone has any more info about this:

  waste()		/* waste space */
  {
  	waste(waste(waste),waste(waste),waste(waste));
  	waste(waste(waste),waste(waste),waste(waste));
  	waste(waste(waste),waste(waste),waste(waste));
  	waste(waste(waste),waste(waste),waste(waste));
  	waste(waste(waste),waste(waste),waste(waste));
  	waste(waste(waste),waste(waste),waste(waste));
  	waste(waste(waste),waste(waste),waste(waste));
  	waste(waste(waste),waste(waste),waste(waste));
  }
Found in last1120c/c10.c


From the linked description (http://www.cs.bell-labs.com/who/dmr/primevalC.html):

A second, less noticeable, but astonishing peculiarity is the space allocation: temporary storage is allocated that deliberately overwrites the beginning of the program, smashing its initialization code to save space. The two compilers differ in the details in how they cope with this. In the earlier one, the start is found by naming a function; in the later, the start is simply taken to be 0. This indicates that the first compiler was written before we had a machine with memory mapping, so the origin of the program was not at location 0, whereas by the time of the second, we had a PDP-11 that did provide mapping. (See the Unix History paper). In one of the files (prestruct-c/c10.c) the kludgery is especially evident.


Doh, I completely glossed over the readme and went straight to the code. That makes sense -- Thanks!

Cool to think that that waste function can still compile with today's compilers. From a quick disassembly, it seems to take up 751 bytes compiled on x64 using clang at -O0.


That was the standard way of declaring parameter types until ANSI C in 1989. C actually copied the current style back from C++.


ANSI C didn't drop old-style (non-prototype) function declarations and definitions -- nor did C99 or C11. They've been officially obsolescent since 1989, but they're still fully supported by any conforming compiler.


> Looks like a very early dialect. C assumes everything is an int unless specified otherwise.

It looks beautiful, almost like a scripting language. No monster type signatures like

    const std::foo_bar<boost::blah_ptr<const xyz::bar::Bar&, baz::Baz>>&


I think the modem carrier dropped on your last line. Can you resend?


From the link on GitHub:

http://cm.bell-labs.com/cm/cs/who/dmr/primevalC.html

Which led me to here:

http://cm.bell-labs.com/cm/cs/who/dmr/chist.html

There, if you take the time, you will find a wonderful story. Upon completing it, you will probably know more about the early embryonic history of C than 95% of your peers.

(Spoiler - We start with BCPL, then Move to B - it's left as an exercise to determine how we originally compiled BCPL)


> (Spoiler - We start with BCPL, then Move to B - it's left as an exercise to determine how we originally compiled BCPL)

The first version of Go started with B: http://code.google.com/p/go/source/detail?r=f6182e5abf5e

The second revision was converted to C: http://code.google.com/p/go/source/detail?name=f6182e5abf5e&...

The third to Draft-Proposed ANSI C: http://code.google.com/p/go/source/detail?name=f6182e5abf5e&...

And the fourth to ANSI C: http://code.google.com/p/go/source/detail?name=f6182e5abf5e&...


Uh these commits seem interesting, but I don't quite understand the context. What's macho and what is its relation to go? Are the dates way off or is it an import of older history?


Your leg is being pulled, a little. Mach-o is the mach object file format, Brian Kernighan's C 'hello world' program is being used as source for, I think, a test. It's checked into the go tree with version history all the way back to its primordial source.


They definitely messed up with a conversion from Subversion to Mercurial. ;)


coupled with user:rmrfrmrf's pseudo-religious post praising the code, I read the second url as "christ" :P

Oh how our minds play tricks on us!


Your question is answered at length in http://plan9.bell-labs.com/who/dmr/chist.html. But it is common to write compilers in some subset of the language that you want to compile, see http://en.wikipedia.org/wiki/Bootstrapping_(compilers).


A compiler has to do a lot more than parse a source file and translate it to target code: error checking and reporting, optimization, etc.

To compile a C compiler, you don't need a full-blown C compiler. For instance, I bet floats and doubles are not used. Therefore, you can write a barebones proto-C compiler in whatever language you have available and use it to bootstrap your compiler. Rinse and repeat.



I understand bootstrapping, but at some point there has to exist some outside compiler in another language or a hand-compiled version of this otherwise the chicken and egg chain never ends.


The very first compilers were tediously written in assembly. Bill Gates wrote his first version of Basic in assembly. I believe all early Fortran compilers were written in direct machine code through punch cards! Coding in assembly is not all that bad :). Once you've got some compiler going, you can write a more complex compiler with it and keep adding more syntactic sugar :).


Coding in Assembly was a gas! I wrote applications in IBM BAL in the 70's. G/L, Payroll, Inventory. We did it in part because we had so little memory (typically 64K to 131K), we had to squeeze every drop we could out of available memory.

I worked on Univac 9400s. We received the O/S in source code form (Assembly) on tape. We ran it through a parametizer (PROC), compiled the resulting source, and that's what the customer ran with.

You haven't lived until you've stepped through your code one instruction at a time, displaying op codes and raw binary data on the maintenance panel lights.


You can use any language available in the machine to create the first, most basic compiler. You can also use a cross-compiler if a compiler for the language already exists for another machine. If all else fails, you can write the proto-compiler in assembly.


The mysterious true origins of bootstrapping compilers.


Presumably the first step on the road to a self-compiling C compiler was written in B.


One might write a compiler from language X to language Y first in assembly, then when that works, write a new compiler from language X to language Y in language X itself and use the previous compiler to compile it. Presumably they already had a C compiler that they used to get this one compiled, then it can compile itself afterwards.


Also see my B compiler that was inspired by these early C compilers: https://github.com/aap/abc


Gorgeous! Have good fun, thx for the link


Someone should check that thing for back doors!

http://cm.bell-labs.com/who/ken/trust.html


Too late. Ken Thompson already put the back door in his B compiler.


If you like c compilers, also check out http://bellard.org/tcc/ and http://bellard.org/otcc/

The latter is a C compiler in about 1kb of source code! It's quite functional and can compile itself.

The first link is what came out of it: A compiler so fast, that it can boot Linux from source code in a few seconds: http://bellard.org/tcc/tccboot.html


It felt weird being able to read it and understand (on a high level) what was happening.


I didn't understand anything but I want to. Where do you start reading to follow the flow?


Can anyone explain the naming of files? c00.c, c01.c etc


c0 is the first pass (c to intermediate), c1 is the second pass (intermediate to assembly), c2 is the optimizer. c0 is built from files beginning with 'c0' and so on.


Thanks for the explanation


Can anyone get this to compile and run?

Does anyone know what hardware the assembly language files are for?

Maybe you could produce a modified version with the archaic features removed, compile it with a modern compiler, then use the binary produced to compile an unmodified version. Or maybe there are still binaries of really old compilers that can understand this code floating around out there.

Any ideas?


I don't understand this

    main(argc, argv)
    int argv[]; {
Is that still valid today?


I remember being absolutely thrilled when function prototypes were introduced with the 'proposed' ANSI C standard (X3J11). The first implementation I got my hands on was Borland's Turbo C circa 1988.

The standard now known as C89/C90 had been in committee for many years before being 'released'. This didn't stop the tool vendors (like Borland) from supporting the 'proposed' standard much earlier than 1989.

Unfortunately, commercial UNIX vendors (like HP, in my case), were very slow to adopt the standard and update the cc compiler in their distribution. This forced us to work in K&R for a good time after 1990, all the while grumbling that $150 MS-DOS compilers already 'had ANSI'.


It's a holdover from Fortran (and probably from before that). Whereas modern day we would define functions as

    int foo(int i, int j) {...}
in Fortran you would do (! is comment)

    function foo(i, j)
        integer :: foo !return type
        integer :: i
        integer :: j
        !body goes here
Early C stuck to that style, so you would just put the names of the variables in the declaration, and then before the body give them types. The reason only argv is mentioned in that example is that C assumes a variable is an int if not declared otherwise, so there's no reason to put "int argc" like "int argv[]".


It's called a K&R style function definition, which was the way to do it back in the day. Basically, you define your parameter names first, then you define the parameter types immediately after the function but before the opening curly brace. It's definitely not recommended today and can result in undefined behavior if your compiler doesn't recognize it. If you're working with legacy code, though, I'm pretty sure you can set some C compilers to allow for it.

To explain further:

    main(argc, argv)
    int argv[]; {
is equivalent to:

    int main(int argc, int argv[]) {
The old style definition works because C had a default type of int, so the type specifications for the function main and the parameter argc could be omitted.

As for int argv[]? What that actually represents is an array of memory addresses that hold the command line arguments given. Obviously this becomes a problem if you're on a 64-bit system, where int and (void *) are two different sizes. However, I checked this out on my 64-bit machine and it works just fine:

    int main(int argc, unsigned long long argv[]) {
      char *firstarg = (void *)(argv[1]);
      printf("%s", firstarg);
    }
which, given "./a.out pickles" will print "pickles" (argv[0] gives the memory address of the cstring "./a.out"). I'm guessing that, in the case of a compiler, the memory addresses of arguments are more relevant to have than the arguments themselves.


The 1999 ISO C standard dropped the "implicit int" rule, so this:

    main(argc, argv)
    char *argv[];
    {
        /* ... */
    }
is illegal (strictly speaking, it's a "constraint violation"). Note that it's

    char *argv[]
not

    int argv[]
But this:

    int main(argc, argv)
    int argc;
    char *argv[];
    {
        /* ... */
    }
is still perfectly valid.

As for this:

    int main(int argc, unsigned long long argv[]) {
      char *firstarg = (void *)(argv[1]);
      printf("%s", firstarg);
    }
it's not a constraint violation, but its behavior is undefined (unless your compiler specifically supports and documents that particular form as an extension).


We're talking about the code from the actual source files, not the standard.

Look here (lines 22 and 23): https://github.com/mortdeus/legacy-cc/blob/master/prestruct/...

The compiler code states int argv[], not char *argv[] (I assumed this is why the OP asked for clarification in the first place, since char *argv[] is much more common).

You're right, in theory this is undefined behavior, but in practice on a 32-bit system, sizeof(int) will almost always be equal to sizeof(void *). I was just demonstrating how one could recreate the code in the compiler while on a 64-bit architecture.
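
If the goal is to hold an address in an integer on a machine where int is narrower than a pointer, a more portable sketch uses uintptr_t from <stdint.h> rather than int or unsigned long long. first_char_via_integer and int_too_small_for_pointer are hypothetical names, and the code assumes the implementation provides uintptr_t (it is an optional type in the standard). This does not make the nonstandard main signature valid; it only shows the pointer-to-integer round trip.

```c
#include <stdint.h>

/* first_char_via_integer: hypothetical helper. Stores a string pointer in
 * an integer wide enough to hold it, converts it back, and reads through
 * the recovered pointer. */
char first_char_via_integer(const char *s)
{
    uintptr_t bits = (uintptr_t)(const void *)s;          /* pointer -> integer */
    const char *back = (const char *)(const void *)bits;  /* integer -> pointer */

    return back[0];
}

/* Nonzero when a plain int cannot hold a data pointer, as on typical
 * 64-bit systems where int stays at 32 bits. */
int int_too_small_for_pointer(void)
{
    return sizeof(int) < sizeof(void *);
}
```

first_char_via_integer("pickles") returns 'p' regardless of pointer width, whereas squeezing the same address through a 32-bit int would drop the upper half on a 64-bit machine.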


Yes it is. Originally that's how you specified parameters: just the name in the parentheses, then (optionally) the types in a format similar to variable declarations before the function body. If you didn't specify a type it would default to int, so argc above would be an int.

All modern C compilers still accept this style for backwards compatibility. I'm not sure about C++ compilers.


Function declaration for main. The type declarations for the function parameters default to int, but can be specified outside of the parens before the curly brace.

The int argv[]; is where argv is declared, as a pointer to ints.

You can also see elsewhere in the code where they are passing pointer addresses (as int params) into functions and then using the address to build pointers referencing that data.


I don't know if it's still valid by the spec, but it's still supported by some compilers. There was a post on vim reaching 7.3.0.1000 the other day, and that's still using that style of typing.



