Alternatively, just write software in a Lisp variant (with good macros, clojure is quite meh here) or Ocaml or Haskell.
These languages put you at the level of writing a language for each problem domain immediately and with a comprehensive and useful toolkit. These languages also do this while directly competing with all but the most carefully tuned of C++/C environments. Steel Bank Common Lisp and the Glasgow Haskell Compiler are good references here.
These languages are often considered strange because of this, and feel somewhat alien. But once you realize you're building a language to model your problems, suddenly tons of stuff makes more sense. Previously alien concepts like Macros and Monads are outside the typical language's experience precisely because the language authors created these contexts and put you inside them.
This article is sort of fundamentally wrong that it's difficult to write small, purpose-built languages. It's not, and even outside the functional and homoiconic meta-syntactic world we have seen code generators deployed regularly. Successful products and libraries are built using these techniques all the time, and with a modern toolchain they deliver excellent results. It's just that other, more restricted and guided approaches are often introduced earlier in people's learning curve and set their expectations subsequently.
> Alternatively, just write software in a Lisp variant (with good macros, clojure is quite meh here) or Ocaml or Haskell. These languages put you at the level of writing a language for each problem domain immediately and with a comprehensive and useful toolkit.
Great comment. And if a domain-specific language (DSL) with really custom syntax is needed, an alternative is to use Racket (a Lisp-family language), which allows for creating DSLs of all kinds. Racket is already used for designing your own programming language and trying it out, so I'm surprised that the author did not suggest Racket as a test-bed for the new language.
> This article is sort of fundamentally wrong that it's difficult to write small, purpose built languages. It's not, and even outside the functional and homoiconic meta-syntactic world we have seen code generators deployed regularly.
Correct. For example, most of the common systems used today use a 'code generator' internally: every piece of software that uses an ORM library (e.g. Hibernate) is already using a sort of domain-specific language (e.g. HQL) that, via a code generator, translates to another language (e.g. SQL). So most major platforms (Java, .NET, etc.) already have one or more popular libraries that implement some sort of (limited) DSL and code generator.
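The pattern is small enough to sketch end to end. Here's a toy embedded query DSL whose "code generator" emits SQL; all names are illustrative, not any real ORM's API:

```python
# Minimal sketch of an embedded query DSL that code-generates SQL.
# Names are illustrative, not any real ORM's API.

class Field:
    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # `field == value` builds a condition node instead of comparing
        return ("=", self.name, value)

def select(table, condition):
    """Translate the condition tuple into a SQL string plus parameters."""
    op, column, value = condition
    sql = f"SELECT * FROM {table} WHERE {column} {op} ?"
    return sql, [value]

# The "DSL" is just host-language expressions...
age = Field("age")
query, params = select("users", age == 30)
# ...and the generator emits another language (SQL):
# query  -> "SELECT * FROM users WHERE age = ?"
# params -> [30]
```

The whole trick is that the host language's operators build a little syntax tree, and a short walk over that tree emits the target language.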
I think the aim of the article was to create a new compiler from the ground-up.
I think the correct choice for syntax is something that can't be boiled down to three pieces of advice or one article. It is a topic that would require very deep discussion.
Does Racket support non-Lisp-like syntax for DSLs written in it? I've been hearing really great things about how it has the best macro system on earth, but Haskell has spoiled me to the extent that I'm a bit lazy (no pun intended at all) to look at things outside of that broad area (so Idris, PureScript, etc.) nowadays, but Racket seems to be interesting enough that (to continue the pun) I can afford a bit of IO for it.
Most of my interest stems from this excellent project in Racket that's shaping up to be quite something:
>Does Racket support non-Lisp-like syntax for DSLs written in it?
Yes! You can override the reader so that it parses anything, and there are modules for writing grammars[1][2]. Still, it is usually much easier to just use the built-in s-expression parser, and if you're writing a lot of Lisp, you probably don't mind s-expressions anyway :)
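Part of why the s-expression path is the easy one: a reader for it is tiny. A toy sketch (illustrative only, no error handling):

```python
# Toy s-expression reader: tokenizing and parsing together fit in a
# few lines, which is why s-expression syntax is the "easy path".

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(read(tokens))
        tokens.pop(0)  # discard the closing ")"
        return lst
    # numbers become ints, everything else stays a symbol string
    return int(tok) if tok.lstrip("-").isdigit() else tok

expr = read(tokenize("(define (square x) (* x x))"))
# expr -> ['define', ['square', 'x'], ['*', 'x', 'x']]
```

Everything past this point is ordinary list manipulation, which is exactly the property the custom-reader route gives up.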
Hackett has interested me too, I hope that progress can continue to be made, it would be a nice alternative to Typed Racket (which is great, but very complex).
I'm working with the author of the "Type Systems as Macros" paper right now, we're working on implementing a linear language (along the lines of Rust) using Racket's macro facilities and the turnstile[3] package. Being able to embed arbitrary type systems in Racket really shows the insane power of the macro system.
I'll second the use of Lisp as the implementation language. I've also used Prolog. For low level code, e.g. a virtual machine, you'll also need a low level language such as C or assembly language.
If you're designing a language as a learning exercise, or a domain-specific one, my advice is to go right ahead, using a Lisp variant. If you don't know one, now is the time to learn one.
If you're developing a general-purpose language, ask yourself how your language will be better than, or at least different to, every general purpose language in current use. Be ambitious. As languages belong to families (e.g. Algol variants, Lisp variants, functional languages, visual dataflow languages, etc.), you should aim for your language to be the best in its family.
Your language should also be one you'd use yourself.
I'm developing Full Metal Jacket (http://web.onetel.com/~hibou/fmj/FMJ.html). It's very different from almost everything else, so I'm not expecting popularity overnight.
The Dr Dobb's article is way too conventional for my requirements. My language doesn't require any parsing. Syntax, type, and memory errors can't happen. The ideal target machine would just run the code directly without any need for compilation, though I will at some point need to cross-compile it onto the von Neumann architecture, and for that dataflow analysis is unnecessary, making optimal object code simpler to generate.
This is the direction I'd like to see computing go.
>This article is sort of fundamentally wrong that it's difficult to write small, purpose built languages. It's not,
This "correction" is mischaracterizing Walter Bright's article. He's writing about industrial "professional languages" and that context is made clear multiple ways:
1) he's referencing his D language and presumably languages like it, 2) the desire for the compiler to display quality error messages, 3) a good runtime library. The comparables to D would be Java/C#/Golang/Rust. All of those languages took a team of people years to get to v1.0.
> we have seen code generators deployed regularly.
But no professional language's (like D's) canonical compiler I know of is the direct output of yacc or ANTLR.
It seems like you have made reasonable and correct statements about programming languages -- for someone else's article -- but not specifically Mr. Bright's. The context really matters here.
> He's writing about industrial "professional languages" and that context is made clear multiple ways
Embedded languages can have all the features you've named. Although I do agree that usually they piggyback on an existing bytes->bytecode system. But you mention YACC and ANTLR as the mechanisms for doing this whereas someone using Lisp or Haskell need not use an external parser, which lowers the cost.
I can implement a C interpreter in Haskell or Lisp. We routinely make language extensions to support all sorts of things, and we do it with excellent error messages and supportive runtime libraries.
Maybe if the author had titled the essay, "So you want to build a runtime and optimizer then attach a lexer and parser to the top" I'd have rephrased. But he didn't, and I took the opportunity to make my point.
> It seems like you have made reasonable and correct statements about programming languages -- for someone else's article -- but not specifically Mr. Bright's. The context really matters here.
Maybe if people want to make nuanced points they should stop using clickbaity titles and leaving major tenets unsaid, save for "contextual" cues applied ad hoc by others in third-party forums?
Actually, this is one of the most important, albeit usually implicit, features of a good programming language: how easy it makes it to create a DSL (without leaving it). This is because ideally you would first find or create a language to express, in the most adequate way, the problem you are trying to solve, and only then solve it. That is why pseudo-code is useful. OO support in programming languages made creating a DSL somewhat easier, but it is true: nothing can compare with Lisp in this regard. (So, I guess, this makes Lisp the best programming language of all time.)
> how easy it makes it to create a DSL (without leaving it).
Pretty much impossible. If you don't leave your original language, you don't use a domain-specific language; you just have a general-purpose language with a domain-specific API.
But what is an API? Is it just a set of functions? Even then, it is already a language, with its own vocabulary. Or is it a set of classes, interfaces, etc.? Then even more so. And so on. The limitations and particulars of the syntax of a host language do not matter as much as you may think. Also, if you look closely enough, you will realize that most languages already are, in fact, combinations of various built-in sublanguages, each having its own "subsyntax". What matters is the ease with which the host allows you to create a new sublanguage that suits a particular purpose.
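To make the "an API is already a language" point concrete, here is a small chaining API whose call sites read like their own sublanguage of the host; all names are hypothetical:

```python
# A "domain-specific API" that shades into a sublanguage: everything
# below is ordinary host-language code, yet the call site has its own
# vocabulary ("go", directions) and grammar (chained steps).

class Route:
    def __init__(self):
        self.steps = []

    def go(self, direction, km):
        self.steps.append((direction, km))
        return self          # returning self is what gives the "grammar"

    def total_km(self):
        return sum(km for _, km in self.steps)

trip = Route().go("north", 3).go("east", 2).go("north", 1)
# trip.total_km() -> 6
```

Whether this counts as a DSL or "just an API" is exactly the boundary being argued about here.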
You're missing the point. Some programming languages (Lisp family in particular) allow you to build DSLs directly, with the tools the language itself provides.
And then, inside this "DSL" in Lisp, can you have a completely different language, with significant spaces and unmatched parentheses? Or is it just Lisp data structures that are then interpreted, and you're pretending to use a language other than Lisp because of its flexible data structure notation?
At the age of 40, I strongly believe that I'd have been much more productive, had I invested at least as much energy into creating languages as I did into learning how to tame C++. Doing so in a structured way, more elegant than writing more parsers and compilers in C, sounds like a good idea.
I always liked C++, but having had the luck of experimenting with many alternative concepts at university (FP, LP, GC'd systems programming languages), I never bought into the idea of using the C/lex/yacc trio.
Always preferred more productive ways to prototype programming languages.
Many of lex/yacc flaws (flaky syntax, global mutable state…) have nothing to do with their fundamentals. Also, LALR is bad, you want a generalised parser that can work on any context free grammar (Bison has such a mode).
We need no-nonsense lexer and parser generators, that are easy to use, easy to deploy, not too hard to implement, and generate fast parsers. My own studies¹ suggest it can be done. Unfortunately, I'm busy earning a salary right now.
Back when I was at university, we weren't allowed to pick Lisp, Prolog or Caml Light for our compiler design projects, because they would make the whole project too easy. :)
Reminds me of my compiler class, and how I cheated by picking as a project symbolic differentiation and convincing the PhD to let me use Lisp. Needless to say, a single EBNF->lambda macro from a library, and the project became trivial :).
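For a sense of why that project becomes trivial once expressions are just nested lists: the differentiator is a short recursive case analysis. A Python sketch of the same idea, with expressions as tuples:

```python
# Symbolic differentiation over expressions-as-nested-tuples, the trick
# that makes the project trivial in any homoiconic setting.
# Expression forms: a number, a variable name, ("+", a, b), ("*", a, b).

def diff(expr, var):
    if isinstance(expr, (int, float)):
        return 0
    if isinstance(expr, str):
        return 1 if expr == var else 0
    op, a, b = expr
    if op == "+":
        # sum rule: d(a + b) = da + db
        return ("+", diff(a, var), diff(b, var))
    if op == "*":
        # product rule: d(a * b) = da*b + a*db
        return ("+", ("*", diff(a, var), b), ("*", a, diff(b, var)))
    raise ValueError(f"unknown operator {op!r}")

# d/dx (x * x) = 1*x + x*1
print(diff(("*", "x", "x"), "x"))
# -> ('+', ('*', 1, 'x'), ('*', 'x', 1))
```

In Lisp the input is already in this shape, so the parser disappears entirely; simplification of the result is the only remaining work.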
Monads aren’t as special as macros though, all you need is functions and higher-kinded types. The rest is carefully crafting a lawful monadic API, which is something you could probably do in most languages (but none of the standard ones have kinds other than *, if even that). Monads are a fairly simple (to implement and use) but strange (to learn to appreciate) abstraction, nothing more.
You can leave out higher-kinded types if you leave out types altogether, but I don't think anyone has had a practical benefit from a monadic effects system or transformer stack in e.g. Javascript. I'm not even sure the plain »monad« concept exists there as an abstraction rather than in docstrings.
> Monads aren’t as special as macros though, all you need is functions and higher-kinded types. The rest is carefully crafting a lawful monadic API, which is something you could probably do in most languages (but none of the standard ones have kinds other than *, if even that). Monads are a fairly simple (to implement and use) but strange (to learn to appreciate) abstraction, nothing more.
I don't really see what you're saying here. I didn't mean to equate the structures except in terms of their "alien-ness".
But Monads are "special", in that literally everyone uses them but almost never sees them from the "outside" unless they're using a language that offers techniques to model them. Writing Javascript? You're in a monad with helper tools of a specific type. Writing C? Same deal, different shapes.
Lisp Macros do something different (and I alluded to this with the "meta-syntactic" mention), but similarly change the thought process from "how do I write this logic" to "how do I describe this logic".
> I’m not even sure the plain »monad« concept even exists as an abstraction instead of in docstrings.
Since they are annihilated in the Haskell runtime as well that shouldn't be terribly surprising. It's a modeling technique to build programs algebraically. It's like asking if Generics exist at runtime. They do inasmuch as they are relevant to the interpretation of the program.
The key insight is that all languages have monads, but very few have Monad. It’s up to the programmer to see the pattern in e.g. Javascript. But other than having higher kinds, there’s not much that keeps most languages from having them the same way Haskell does.
The »plain monad« part was directed at Javascript, not Haskell. Haskell, like JS, can work perfectly fine without monads, but it still works perfectly fine with them ;-)
Even without knowing about monads, you can simply implement »bindIO« and »pureIO«. That’s what most other languages do – they have bindMaybe, bindList, bindAsync, …
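That pattern is easy to show concretely. Here's a bindMaybe written in a language with no Monad abstraction (Python, with None playing Nothing), never naming the concept:

```python
# "bindMaybe" in a language without Monad: None plays Nothing, any other
# value plays Just. Each such bind is a one-off; nothing in the language
# ties bind_maybe, bind_list, bind_async together under one interface.

def pure_maybe(x):
    return x

def bind_maybe(mx, f):
    # short-circuit on failure, otherwise feed the value onward
    return None if mx is None else f(mx)

def safe_div(a, b):
    return None if b == 0 else a / b

# Chain two fallible steps: (20 / 2) / 5
result = bind_maybe(safe_div(20, 2), lambda x: safe_div(x, 5))
# result -> 2.0
failed = bind_maybe(safe_div(20, 0), lambda x: safe_div(x, 5))
# failed -> None
```

This is exactly the monad pattern; what's missing is only the shared abstraction (and do-notation) that would let one `bind` name cover Maybe, lists, async, and the rest.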
I'm approaching an intermediate level in Haskell, but I don't really see how writing a program in Haskell is any more like writing a language than writing a program in, say, python. Could you elaborate or point to some DSLs written on top of Haskell? I'm very curious.
Here is a fantastic article to get you started down this path. Note that the author, in his introduction, submits that the approach to FizzBuzz in this paper is a "somewhat tongue-in-cheek solution." Nevertheless, by applying this kind of thinking to what is, in some ways, a deceptively simple-looking problem, the paper serves as a great starting point for using DSLs as an approach to problem-solving.
There is another article about FizzBuzz I wrote roughly 5 years ago (forgive the pronouns, I need to rebuild the site and I'm procrastinating on the css).
This uses a slightly different set of abstractions with the same problem.
But perhaps more specifically, when you select a monad stack (group of effects) to compose to solve a problem you're building the features of the language and its effects. And if you use the "final tagless" or "Free" approaches you're doing that even more directly.
You can think of a monad as defining a DSL. Take for example the Rand[0] monad:

    type Rand g = RandT g Identity
    newtype RandT g m a = ...

    -- with
    runRandT :: (Monad m, RandomGen g) => RandT g m a -> g -> m (a, g)

    type MyType = Rand StdGen

    -- now you can define
    intGreaterThan :: Integer -> MyType Integer
    intGreaterThan = ...

    randomName :: Integer -> MyType String
    randomName length = ...

    -- and finally combine the two above
    nRandomNames :: Integer -> Integer -> MyType [String]
    nRandomNames length n
      | n <= 0    = return []
      | otherwise = do
          tailNames <- nRandomNames length (n-1)
          headName  <- randomName length
          return (headName : tailNames)
These functions of type `...->MyType a` can be viewed as a DSL where the Random generator state is abstracted away.
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        double i, s;

        s = 0;
        for (i = 1; i < 100000000; i++)
            s += 1/i;
        printf("Almost infinity is %g\n", s);
    }
Lennarts-Computer% gcc -O3 inf.c -o inf
Lennarts-Computer% time ./inf
Almost infinity is 18.9979
1.585u 0.009s 0:01.62 97.5% 0+0k 0+0io 0pf+0w
And now the Haskell code:
    import BASIC

    main = runBASIC' $ do

      10 LET I =: 1
      20 LET S =: 0
      30 LET S =: S + 1/I
      40 LET I =: I + 1
      50 IF I <> 100000000 THEN 30
      60 PRINT "Almost infinity is"
      70 PRINT S
      80 END
And running it:
Lennarts-Computer% ghc --make Main.hs
[4 of 4] Compiling Main ( Main.hs, Main.o )
Linking Main ...
Lennarts-Computer% ./Main
Almost infinity is
18.9979
CPU time: 1.57s
As you can see it's about the same time. In fact the assembly code for the loops look pretty much the same. Here's the Haskell one:
Variable capture is a problem with lisp macros that lisp and clojure programmers have tools and strategies to manage.
A number of languages use hygienic macros to avoid the variable capture problem: Scheme, Racket, Dylan, Elixir, Rust, Julia & Perl 6 (which has hygienic AND unhygienic macros).
Racket has more advanced macro facilities specifically aimed at making new languages and DSLs. Syntax-parse is very interesting in how it addresses the problem of error reporting with macros.
> Scheme, Racket, Dylan, Elixir, Rust, Julia & Perl 6 (which has hygienic AND unhygienic macros)
Just want to add that Racket, and most Schemes also support both hygienic and unhygienic macros, through defmacro. [0][1]
Though, their use is heavily discouraged. (And it may be worth pointing out that Guile's defmacro is actually implemented using syntax-case, which is hygienic most of the time, but flexible enough to let you do madcap things. [2]).
Racket has hygienic template macros, which are quite nice. Common Lisp offers a whole bunch of machinery around macros and more types of macros (most famously symbol macros).
Clojure picks the worst of all worlds by making reduced-power backtick macros, not stack tracing the macro output on failures, and then having a culture that discourages the use of macros.
Do you know of any template based macro systems which are more powerful than syntax-rules, but without going to the other extreme of full turing completeness? As I understand it, syntax-rules doesn't even let you concatenate strings to create new identifiers - everything in the output must come verbatim from the input or the template.
I have a sprawling codebase written in an 80s/90s-era closed 4GL. Crossroads: a) keep kicking this can down the road, using what I have in spite of its obvious and increasingly imposing limitations; b) discard it and rewrite in something else, in spite of the loss of hard-won business rules, logic, behaviours and feel of the code accumulated over the years; or c) develop a compiler/runtime and evolve it from there.
Went with c). In hindsight it was incredibly ambitious, but I'm glad I did it. I've been evolving the compiler and runtime steadily. Now I'm at the point where I'm starting to think about how to bend the language to discourage certain anti-patterns the language by design encourages (excessive use of global variables, for one) and move the codebase away from them.
A lot of consequences come with this approach, though; one obvious one is that it is hard to find and hire programmers who want to work on the frankenstein's monster I've created here. Programming seems easy when it's all new and shiny and you are minting things for the first time. Old code is challenging; there are no obvious solutions forward, none I've found at least.
There are so many languages available these days, so the first thing to ask is "what is unique and beneficial about my language?" (valid answers might be that it is a unique combination of features seen elsewhere, that it addresses a specific domain's problems better than any general-purpose language, or that it is an experiment.)
Someone made an interesting point about Kotlin recently: as one might hope for a language coming from an IDE maker, it has excellent tools. I have heard Alan Kay make a similar comment about Smalltalk. There are a couple of semi-popular languages that I feel are significantly disadvantaged by their lack of tools support (primarily in debugging and documentation.)
I have long wanted to see programmers experiment with taking this IDE thing much further. Almost every other authoring tool has an opaque file format, manipulated via one or more views. E.g. when a digital artist wants to create and manage a complex 3D scene, they use a tool like Maya. They're not fussed about whether they can read the file format!
So why not try letting the program exist as an embellished AST, that you edit in multiple IDE views? You could still have text views that present the program in one or more programming languages. But also views like circuit diagrams, graphical call graphs, other standard programmer diagram types. A run-time view with a time axis and time controls, with graphical representations of the program, ability to rewind, fiddle with state, travel visually down all the code paths.
Can we really do no better than text + our imaginations and whiteboards? I'm almost certain we can do better.
Or, if you find those ideas a bit eccentric, then start with just having views of source code. For example, my team requires verbose symbols, but I like very short symbols. I wish I could just have a simple map from the long symbol names to short ones, and toggle my view of the source code without affecting the source code. But since no IDE I know of has the concept of a view, I don't know how to do this today with existing tools.
I like this idea, not least because I've also had it myself. I use Xcode's "Callers" and "Callees" assistant editor all the time but that's a primitive tool compared to your suggestion.
> Edit in multiple views
If you take a closed system, one that begins executing from a fixed point and doesn't have to handle events, concurrency, asynchronous calls, interaction with any external process, then you probably could create a tool like this that would work nicely most of the time. It's going to be a bit harder to do that once you introduce those common real-world situations.
I think you're really asking for a tool which externalizes a lot of the internal reasoning we do as programmers. You want a tool to make concrete some of the abstract models we build up in our heads about the structure of our programs: call graphs, event sequences etc. I wonder if it would be useful for education? If not that, then such a tool would be incredibly helpful for getting to grips with a new, large codebase.
I'm just not sure it's particularly do-able in any kind of non-trivial way.
Yes, this is definitely what I'm getting at. I do think it would be amazing for education, but also for real work. The Unity Editor, and other similar WYSIWYG game editors, are kind of half-way between tools like Maya and programming environments. Other than being buggy and disorganized, the idea of it is amazing. They let you view "the program" from the 3rd person, as a visual object, inspect and edit state, etc. The productivity gains are real, IMO.
I think the approach generalizes, in principle, to all systems, including the real-world concerns in your second paragraph. Certainly it would be non-trivial. Still, if I ever have the free time, it's about the only thing I'd want to work on.
One place to start could be the debugger. This at least has the advantage of a fixed call stack at any point of inspection rather than a whole cloud of possibilities, making drawing some of those call graphs and flows much easier.
I agree, I think it'd be really useful for real work, particularly maintenance work and ramping up on established code bases.
I've been playing around with a similar idea for the design of a language/tool suite I've been working on (essentially a DSL for narrative structures in games/interactive multimedia), and it's been very nice so far. I'm definitely surprised it isn't an approach more are trying.
Another interesting example is Antimony (http://www.mattkeeter.com/projects/antimony/3/), an experimental CAD program that came out of the MIT Media Lab. The main metaphor is dataflow programming (with nice built-in Python scripting within nodes), but that exists side by side with a pane exposing a more traditional 3D modeling interface. It's not quite production-ready, but as a UI experiment I find it way easier to use than traditional CAD software.
> I have long wanted to see programmers experiment with taking this IDE thing much further. Almost every other authoring tool has an opaque file format, manipulated via one or more views. E.g. when a digital artist wants to create and manage a complex 3D scene, they use a tool like Maya. They're not fussed about whether they can read the file format!
Interesting choice of example, because over the last few years open formats like Pixar's USD (https://graphics.pixar.com/usd/docs/index.html) have been emerging for managing complex 3D scenes - precisely because of problems with opaque file formats.
That said, we're now seeing more and more things like Unreal Engine's Blueprints, Apple's "Swift Playgrounds" and so on. I think your wish is slowly coming true.
> Almost every other authoring tool has an opaque file format, manipulated via one or more views.
There is a strong trend away from this. Blender is taking over the 3D world by having a programming-like CLI. AutoCAD brought layout as a Lisp into architecture and basically destroyed everyone there. Image manipulation software is starting to offer custom sequences of manipulations instead of just filters. Mechanical 3D CADs are getting CLIs everywhere.
I've had this on my mind in various forms for almost a decade. It's only recently been crystallized from something abstract into much more concrete thanks largely in part to Bret Victor's talks. If you haven't already seen them, I highly recommend "Stop Drawing Dead Fish"[0] and "The Future of Programming"[1] as starting videos. Then just watch all the rest of his talks. :)
Thinking of something like Halide, which splits the program into an algorithm part and an execution schedule: there may be productive ways to factorize "one-text-file" programming conventions into multiple aspects written in different IDE modes.
Haskell's "do" notation always seems ripe for this. Write the non-pure aspects of your code in a more appropriate language, or any one of a choice of front-ends. The aesthetic of typographic code in a flat file is strong in the FP community, but "do" notation is not its finest hour.
I wrote an article on a relatively simple approach to building language systems with these properties titled, "How to Make View-Independent Program Models"[0]
I haven't had a chance to try an implementation yet, and maybe there's a reason why it wouldn't work, but I've been looking and haven't found it yet (maybe someone here will!).
I think the challenge for systems like this is flexible editing -- the most natural way of making changes to source code often goes through "invalid" states which can't be represented as an AST (or as a path through a parser automaton).
I think that has been true for AST editors because they remain halfway in the traditional text-editing paradigm, which involves these invalid states because the granularity of the interface is single character operations, while the model you are specifying with those characters (the AST) doesn't recognize those units.
An alternative is to have the editor operate on language constructs rather than characters. Rather than parsing the program view in order to arrive at a model of your program, you arrive at a view by rendering the model in some way. But the key thing is the model always stays intact because the operations you perform on it through your editor are in terms of language constructs (e.g. a single action in your editor might be 'add property' or 'create class', which is ordinarily accomplished by typing out lots of characters that can hopefully be parsed into such things).
I believe in this approach, but making it usable is harder than it sounds. Once I tried watching myself program with this idea in mind. I realized I often start by typing methods which don't yet have a class, code fragments which don't yet belong to a function, even expressions which don't yet live in a statement and which refer to variables which not only don't yet exist but don't belong to any yet-existing class — maybe I'll pause half-way through writing the expression and start creating the class in another window.
So you pass through a lot of ill-defined states. For the IDE to keep up it would have to just represent explicitly "this is a code fragment which doesn't have a name", "this is a variable which isn't defined anywhere" and so forth.
Those states are only ill-defined on the assumption that you're entering text which will be parsed. You can still enter fragments when operations are in terms of language constructs rather than characters — you have just supplied a partial set of parameters to the construct, which you can go back and finish later.
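One way to make those partial states first-class is to give the tree an explicit hole node, so every intermediate edit still yields a well-formed structure. A rough sketch under that assumption (node shapes are hypothetical):

```python
# Sketch of structural editing with explicit holes: every editor
# operation yields a well-formed tree, and HOLE marks the slots the
# programmer hasn't supplied yet. Node shapes are hypothetical.

HOLE = "<hole>"

def make_function(name=HOLE, params=None, body=None):
    """'Create function' editor action: a node with unfilled slots."""
    return {"kind": "function",
            "name": name,
            "params": params if params is not None else [HOLE],
            "body": body if body is not None else [HOLE]}

def fill(node, field, value):
    """'Supply a parameter' editor action."""
    node[field] = value
    return node

def holes(node):
    """Count unfilled slots so the editor can show what's pending."""
    if node == HOLE:
        return 1
    if isinstance(node, dict):
        return sum(holes(v) for v in node.values())
    if isinstance(node, list):
        return sum(holes(v) for v in node)
    return 0

fn = make_function()        # holes(fn) -> 3: name, one param, one body slot
fill(fn, "name", "square")  # holes(fn) -> 2
```

The editor never has to parse an invalid buffer; "incomplete" is just a tree with a nonzero hole count, which the UI can highlight instead of rejecting.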
Maybe the other part of this issue is expecting to use these potentially better editors in the same way we use text editors. I would bet different patterns of interaction would surface for the new editors which aren't obvious from our present standpoint.
xoL is a graphics-based programming language that I have been working on. The goal is to make programs visually easier to read, and to be editable with almost no use of the keyboard. It can work well on tablets. A partially working prototype can be found here: https://github.com/lignixz/xoL
I have a newer design with substantial improvements. If only I had funding to focus on this.
The problem with that is that programming languages are not just about ASTs; they also have different semantics. A "class" in C++ is not like a "class" in Java. Nor is it very easy to translate between a garbage-collected language and one with manual memory management.
One could think that such an "AST in database" solution should have one agreed-upon semantics common to all its textual representations, but at this point you may as well serialize your AST into s-expressions and suddenly you're writing in Lisp.
Yeah, my recollections are from a talk I attended 17 years ago, so details may be a little off. It was the first time that I ever saw refactoring demoed. I have no idea what happened to it in the longer term, I know that Simonyi left Microsoft and started a company just to pursue intentional programming.
FWIW I do remember that the name comes from the ideas that the most important thing that a developer does is express intent. I've found that to be a powerful way of thinking over the years.
Not really. The AST in lisps is much more obvious, but you're still editing a text serialisation format, and any features that operate on the AST itself are provided by the IDE on top of that.
There is no "concrete" AST that's not a "serialization format". A binary AST is also a serialization of AST.
Lisp source is pretty much as close to AST as you can get while still staying in text-land. That said, experienced Lisp developers often use tools like Paredit mode that let them navigate and edit the code in terms of tree nodes, not characters.
> Minimizing keystrokes. Maybe this mattered when programmers used paper tape, and it matters for small languages like bash or awk. For larger applications, much more programming time is spent reading than writing, so reducing keystrokes shouldn't be a goal in itself. Of course, I'm not suggesting that large amounts of boilerplate is a good idea.
Couldn't agree more! No one needs another perl or bash. Languages should be as concise as possible, but not more.
Also good points about familiarity and helpful error messages.
> No one needs another perl or bash. Languages should be as concise as possible, but not more
I've seen you lump perl and bash together before, but around the idea of sigils like $ and @. This sounds more like a complaint about "default" operations. Like Perl's $_ perhaps?
Other than that sort of thing, I don't see where Perl reaches that far in being concise. Maybe regular expressions being a first class thing? Like "if ($foo =~ /bar/)" ? Though that seems straightforward to me.
Other than that, reducing the number of tokens is good.
If we can express a solution in fewer tokens in one language than in another, that's a good indicator that the first language has better abstractions for that problem. The longer solution is perhaps spending tokens on distracting sub-problems which the shorter one doesn't have to solve.
But if the token count stays the same, and we simply rename the tokens to one or two character sequences, we haven't gained much, and it is possibly even detrimental.
> Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement terminating ; are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: Any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages.
This makes no sense. A terminator is required; multiple terminators (i.e., redundancy) are not. Various modern languages eliminate such redundancy, and a random sequence of characters is still not a valid program in them.
Disappointed that Dr. Dobb's would publish something like this.
More subjectively:
> Tried and true. Absent a very strong reason, it's best to stick with tried and true grammatical forms for familiar constructs. It really cuts the learning curve for the language and will increase adoption rates.
More people will program in the future than do at present. If you aim to reach this audience, things like using '=' to set values, rather than to test equality, make no sense to the vast majority of those people.
The lack of a terminator in Python, and the optional terminator in JavaScript, has often produced weird errors for me when the compiler and I disagree on whether something should be interpreted as one line or two. With a semicolon, it's clearer.
Of course, missing-semicolon errors are still annoyingly bad; I wish GCC and Clang could produce better error messages in that case.
Yet semantic whitespace causes almost no problems in Haskell. Both Python and JavaScript suffer from a badly designed feature; the problem isn't inherent in the use of whitespace.
I've had similarly weird errors in Python, but in the same situations it's never occurred to me in Ruby. Maybe it's just the way the grammar is defined; for example, stuff like the following works perfectly fine in Ruby while Python just says "invalid syntax":
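The comment's code sample appears to have been lost in formatting. One plausible reconstruction (an assumption on my part) is a statement split after a trailing operator, which Ruby's grammar continues onto the next line but Python's rejects:

```python
# Ruby continues a statement when a line ends with an operator, so
#
#   x = 1 +
#       2
#
# is valid Ruby. Python treats the bare newline as a terminator, so the
# same layout is a syntax error, as compile() demonstrates:
source = "x = 1 +\n2\n"
try:
    compile(source, "<example>", "exec")
    result = "valid"
except SyntaxError:
    result = "SyntaxError"

print(result)   # SyntaxError
```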
Of course, the point I was making is that it's perfectly possible to write a parser that understands newline as a terminator without sacrificing these idiosyncrasies.
I think statistically humans are more likely to encounter errors when the machines use different terminators than humans do. But we could always test and find out for sure.
"A terminator is required, multiple terminators (ie, redundancy) are not."
Food for thought and discussion, rather than a "correction": Terminators are not required in a language. See the concatenative languages like Forth or Factor, in which
z = func1(x, y);
func2(1 + 2 + 3, z);
comes out looking like
1 2 + 3 + x y func1 func2
And, indeed, many people find this a fairly confusing programming style. (For instance, note how the second expression has no "z" variable in it.) Functions tend to become difficult-to-differentiate streams of tokens with few breaks in them. You'll also note that by token count the concatenative approach is smaller (20 non-whitespace tokens in the first, 9 in the second); that is normal, not a fluke. By using the stack to hold things, almost all variable assignments and their usages end up going away. However, despite being a paradigm nearly as old as imperative programming, it has not taken the world by storm.
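A rough sketch of how a concatenative evaluator consumes that token stream may help. The bindings for x, y, func1, and func2 below are stand-ins I've chosen for illustration; the point is that func1's result simply stays on the stack until func2 consumes it, which is why "z" disappears:

```python
# Toy postfix evaluator: values are pushed, operators and functions
# pop their arguments and push their result.
def evaluate(tokens, env, funcs):
    stack = []
    for tok in tokens.split():
        if tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok in funcs:
            b, a = stack.pop(), stack.pop()
            stack.append(funcs[tok](a, b))
        elif tok in env:
            stack.append(env[tok])
        else:
            stack.append(int(tok))
    return stack

env = {"x": 10, "y": 4}
funcs = {
    "func1": lambda a, b: a - b,   # stand-in for func1(x, y)
    "func2": lambda a, b: a * b,   # stand-in for func2(1 + 2 + 3, z)
}
# z = func1(x, y); func2(1 + 2 + 3, z)  ==  (1 + 2 + 3) * (10 - 4)
print(evaluate("1 2 + 3 + x y func1 func2", env, funcs))   # [36]
```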
Speaking formally, we might consider a language to contain "redundancy" if there are distinct source programs that produce the same compiled output. Since all syntactically erroneous programs produce the same output (presumably none; I'm obviously excluding diagnostic error messages here), a language with more than one syntactically invalid program is redundant.
This might not seem all that useful in a practical sense, but I think that's basically the author's point: Such a language would be essentially impossible to write, and thus you're always going to have redundancy. As a result, you should evaluate redundancy in practical terms, rather than viewing it as always needing to be removed.
Natural languages are full of redundancy, which makes them easier to understand because we can do more accurate error correction. For example, the phonotactics of a language define which sounds may appear in conjunction with which other sounds; if someone says something that appears to contradict these rules, whether because they mispronounced it or because you misheard it, you have more ability to infer what they meant than if the error were a valid-but-different utterance.
It works the same for programming languages—when designing a notation, you need to consider how you’re going to produce useful error messages when people make common errors, and the simplest way to do that is to add a bit of redundancy so you can infer the programmer’s intent.
For example, I’m working on a dataflow-type language where the syntax for introducing local variables is evocative of a labelled edge in a graph; the original syntax to introduce three variables was this:
-> x y z;
But the problem was that people would forget the semicolon, so this notation would “run away” and continue to gobble up any following identifiers:
// Whoops, accidentally declared 6 variables
-> x y z
foo bar baz
{ … }
The solution? Add commas between the identifiers:
-> x, y, z;
Now if someone forgets a semicolon:
-> x, y, z
foo bar baz
{ … }
The compiler can say “I expected a comma or a semicolon, not this identifier ‘foo’” and additionally use the newline as a hint to offer “I suggest putting a semicolon after ‘z’”.
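The declaration language above is the parent commenter's own hypothetical design; as my own sketch of how those commas give the parser a recovery point, a recursive-descent fragment for the "-> x, y, z;" form might look like this:

```python
# Parser sketch for the hypothetical "-> x, y, z;" declaration form.
# The comma requirement lets the parser stop at the first identifier
# that isn't preceded by one and suggest the missing semicolon.
def parse_decl(tokens):
    assert tokens[0] == "->"
    names, i = [tokens[1]], 2
    while True:
        if tokens[i] == ";":
            return names
        if tokens[i] == ",":
            names.append(tokens[i + 1])
            i += 2
        else:
            raise SyntaxError(
                f"expected ',' or ';' before {tokens[i]!r}; "
                f"did you forget a ';' after {names[-1]!r}?")

print(parse_decl(["->", "x", ",", "y", ",", "z", ";"]))   # ['x', 'y', 'z']

try:
    parse_decl(["->", "x", ",", "y", ",", "z", "foo"])
except SyntaxError as e:
    print(e)   # expected ',' or ';' before 'foo'; did you forget a ';' after 'z'?
```

Without the commas, the parser would have no way to tell "z foo bar baz" apart from a longer declaration list.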
Likewise, in C-style languages you have to “redundantly” specify the number of arguments to each function at each call site using commas:
foo(1, 2, 3)
In Haskell, for example, that would be written like so:
foo 1 2 3
I find this prettier—it’s less redundant and confers other advantages. But it also suffers from the drawback that now the compiler has to figure out when you have an argument-count mismatch using the types rather than the syntax, making it harder to produce good error messages. The commas are an extra hint to the compiler about how many arguments you intended to pass.
Redundancy increases the ability to detect errors, like a checksum in a code.
Take foo("one", "two") mis-typed as foo("one" "two"): instead of getting a "missing comma" error, you get "wrong number of arguments", unless the function can take one argument, in which case you may get some other random error.
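Python demonstrates this exact failure mode: adjacent string literals concatenate, so a dropped comma silently becomes a single argument, and the error surfaces as an arity mismatch rather than a missing comma:

```python
def foo(a, b):
    return a + " " + b

print(foo("one", "two"))    # one two

try:
    foo("one" "two")        # missing comma: literals concatenate to "onetwo"
except TypeError as e:
    print(e)                # complains about the argument count, not the comma
```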
> If you aim to approach this audience, things like using '=' to set values, rather than test equality, make no sense to the vast majority of those people.
A lot of good that will do if your language doesn't live long enough for a critical mass of those people to keep it alive. shrug
Python was designed for new programmers, rather than existing programmers. Ruby and JavaScript also broke many existing conventions. All have had enormous success.
I'm not sure what point you're trying to make. Python was plenty familiar to programmers of the day, and there's a good argument that its success is only attributable to its English-like syntax insofar as it caused adoption into CS curricula (rather, I think the lion's share of its success is due to its status as an early, cross-platform, approachable scripting language). Further, Ruby is successful solely because of Rails, and JavaScript solely because of its browser monopoly; not remotely because of any broken conventions.
My point isn't that novelty is antithetical to success, but that it's a terrible mistake to assume that 1) programmers of the future will be inherently familiar with math syntax and 2) these programmers at any point in time will significantly outnumber existing programmers such that the market dynamics favor laypersons over trained programmers.
> 1) it's a terrible mistake to assume that programmers of the future will be inherently familiar with math syntax
That's an excellent point, and more of an expansion of the argument than a retort! Perhaps math syntax is simply wrong and = has no place in either setting values or comparing them.
> 2) these programmers at any point in time will significantly outnumber existing programmers such that the market dynamics favor laypersons over trained programmers.
Java dominated CS introductions before Python did. Python would not have appealed to existing Javanauts. Yet Python now dominates CS introductions and many areas of programming. The market favoured a language that trained programmers did not.
> That's an excellent point, and more of an expansion of the argument than a retort! Perhaps math syntax is simply wrong and = has no place in either setting values or comparing them.
The issue isn't math syntax; it's comparing the population of all future novice programmers with the population of current programmers. The apt comparison is all future novice programmers who might be exposed to this hypothetical language versus all experienced programmers who might be exposed to it. Note that an "experienced programmer" might be someone born in 2018 who happened to have learned JavaScript2035 before seeing our new hypothetical language. The latter pool likely dwarfs the former. And this doesn't factor in that a language is a living thing: it needs a critical mass of users in order to survive, and the largest pool of potential users is existing programmers (very, very few early language adopters are first-time programmers, I would wager). I think this is the retort I meant to make; I don't think it's merely an expansion of the topic, with sincere respect. :)
> Java dominates CS introductions before Python did. Python would not have appealed to existing Javanauts. Yet Python now dominates CS introductions and many areas of programming. The market favoured a language that trained programmers did not.
Apologies, I don't see how Python displacing Java in CS courses supports your point (or harms mine). Please elaborate.
In the case of JS, JS was designed to require a semicolon to terminate statements, then ASI was hacked on when programmers realized they were already adding newlines. ASI is a bad idea because it's not consistent (that said, ugly code is more likely to trigger ASI issues than otherwise).
Python was designed so there's a single terminator used by both humans and machines.
Trying yourself at inventing a new programming language is both fun and an excellent intellectual exercise - regardless of whether you want to make your language "better" than the one(s) that you are familiar with.
It also opens a path to humility and a true appreciation of work of others.
It's fun taking an existing feature in one language, and implementing it in another language that doesn't have it. Often you'll come up against a simplicity/power tradeoff; making that decision (after thinking hard) and then comparing it to how the others did it is usually an interesting revelation. Lisps are the canonical language for doing this in, but sweet.js macros could enable some interesting results as well.
A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist.
This statement feels imprecise in a couple ways. It seems to imply that some IDEs actually use context-free grammars for syntax highlighting? Which ones?
As far as I can tell, Vim and Textmate bundles (i.e. what Github uses for syntax highlighting) don't use anything close to a context-free grammar for their syntax highlighting models. They are more like ad hoc lexers -- a collection of rules and regular expressions.
Certainly an editor doesn't want to parse the entire file to highlight text, because it has to potentially re-highlight at every keystroke. Also, you want to be able to highlight malformed programs (i.e. code with syntax errors). As far as I understand, that's generally why they don't use grammars.
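To make the "ad hoc lexer" point concrete, here is a toy sketch of my own of the ordered-regex-rule style these highlighters use; real Vim and TextMate grammars are far richer, but the flavor is the same: no grammar, no full parse, and it tolerates malformed code by design:

```python
import re

# Line-local highlighting rules, applied in priority order; earlier
# rules win for any overlapping region (so "42" inside a comment is
# highlighted as a comment, not a number).
RULES = [
    ("comment", re.compile(r"#.*")),
    ("string",  re.compile(r'"[^"]*"')),
    ("keyword", re.compile(r"\b(def|return|if|else)\b")),
    ("number",  re.compile(r"\b\d+\b")),
]

def highlight(line):
    spans = []
    for name, pat in RULES:
        for m in pat.finditer(line):
            if not any(s <= m.start() < e for _, s, e in spans):
                spans.append((name, m.start(), m.end()))
    return sorted(spans, key=lambda t: t[1])

print(highlight('if x: return "done"  # 42'))
# [('keyword', 0, 2), ('keyword', 6, 12), ('string', 13, 19), ('comment', 21, 25)]
```

Notice there is no recursion and no context beyond the current line, which is exactly why such highlighters stay fast and survive syntax errors, and also why they get confused by multiline constructs.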
I think you might mean that a language should be designed with a concise grammar so that somebody else can reimplement it by hand more easily? That is, you want multiple independent implementations.
If that's true, "context-free grammars" is the wrong term to express that notion. Context-free grammars don't handle many real languages, not just C and C++, but also Python/Perl/Ruby, and even JavaScript and Go (semi-colon insertion.)
Also, a lot of tools are enabled not by grammars but by producing specific data structures in your front end (e.g., here is the way I think about it: http://www.oilshell.org/blog/2017/02/11.html ).
The difference between Clang and GCC is that Clang is a library that enables a whole ecosystem of fantastic tools, including IDE support. But Clang doesn't do much at all with CFGs. The real difference is that the front end is a library that produces a very rich representation of the code.
In other words, I would say that integrating the compiler front end into the IDE became the more popular and successful approach. And of course the compiler very much has to be designed with this use case in mind.
> integrating the compiler front end into the IDE became the more popular and successful approach
C and C++ syntax highlighting suffered for decades because it was not possible to do a complete job of it without integrating a compiler front end into the IDE.
I think you're talking about Intellisense-like functionality, not syntax highlighting. Syntax highlighting works just fine in C++ in many editors.
The red squigglies in Visual Studio is a completely different technology than syntax highlighting, which is lexical.
Intellisense has only a rough relation to context-free grammars. Some of the red squigglies are errors in semantic analysis, not parsing. That is why the more fruitful approach was to integrate the actual compiler and the IDE -- not have two separate parsers/compilers, as was the case with Java IDEs.
Also, I think it was always "possible" to integrate a C++ front end and an IDE -- just nobody did it in an open source compiler until Clang.
No, I'm not talking about Intellisense. Just highlighting the code correctly. The C++ editors that don't integrate a compiler front end tend to get confused when you do tricks with macros, backslash line splicing, and trigraphs, for example. They'll also have trouble with the >> thing and templates, and preprocessor metaprogramming stuff like:
#define BEGIN {
#define END }
They do work fine with conventional code, but if one knows the darker corners of the Standard, they can be broken.
One could argue "don't write code like that", but as a tool developer there is always someone that does. When designing a language, though, one can design out all those problems.
OK, but highlighting the code correctly has essentially nothing to do with context free grammars.
This happens in Vim and Emacs with languages other than C and C++ -- here docs in shell, multiline strings in Python -- and Python does use a CFG, etc.
I agree it's annoying although I think most people view it as a minor thing. They stick with Vim and Emacs for other reasons.
I'm not sure anyone has based their language design around Vim/Emacs syntax highlighting, although ironically that is one of my criteria for language design. I was just confused by the advice to use a CFG, since it's not the relevant issue.
I would say the relevant issue is that your lexer shouldn't be too clever and have too many modes. And to avoid mixing languages in the same file, or have a very obvious lexical construct to mix languages.
The C preprocessor is an entirely separate language than C or C++, so that is the core of the issue in your example. Likewise, it is usually hard to highlight CSS and JavaScript embedded within HTML.
This is incorrect. Some languages have user-defined tokens. Some have contextual keywords. Both require a semantic understanding of the code to highlight them correctly.
And it isn't just the preprocessor with C++. There's the >> problem. It's not just me talking through my hat: tools for C++, such as pretty-printers and refactoring tools, have been very slow to appear, and fragile. But with a language like Java the tooling is quick and easy to write.
You don't have to believe me. Write a tool that reads C++ source code and inserts boilerplate at the beginning and end of each function, and works 100% of the time.
I know exactly what problem you are talking about. It's exactly the problem that Clang solves.
With the Clang front end, you can write a tool to read C++ source code and insert boilerplate at the beginning and end of each function, and it will work 100% of the time. There are dozens of such tools in active use at Google and I'm sure many other places.
But it has nothing to do with context free grammars -- really. Clang uses a recursive descent parser. GCC used to use a yacc-style grammar (which BTW is only context free-ish because of semantic actions), but it could NOT perform the task you are talking about. In fact that was largely the motivation for Clang.
It also doesn't really have to do with syntax highlighting as practiced by any editor or IDE I know of. Even though Clang has the power that you want ("semantic understanding"), I don't know any editor that uses it for syntax highlighting.
Instead they use approximate lexical hacks. This is probably because of the need to highlight partial files and malformed files, as I mentioned. You don't want your syntax highlighting to turn off in the middle of typing a code fragment.
But editors DO use Clang for semantic understanding, e.g. the YCM plugin for Vim.
But they use CFGs for NEITHER problem. You're conflating two different issues and suggesting the wrong solution for both of them.
There are a lot of links about this issue with regard to languages like C#, Scala, Go, JavaScript, etc. in the wiki page I linked.
I agree with your general point about language design, but the terminology you're using is wrong and confusing.
Yes, and clang appeared on the scene 20 years after C++ did. It's a long wait. If you create a new language, are you willing to wait 20 years for tooling?
I agree C++ is too hard to parse, and you should design something simpler. Simpler isn't the same thing as a context-free grammar. The issues you are pointing out are lexical (Python has a CFG but still has imprecise syntax highlighting in editors).
> A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting
I disagree with this because it's wrong. People don't use Clang or context-free grammars to syntax highlight code. Java has a CFG -- who uses it to syntax highlight code?
This conversation isn't very interesting because it's just me explaining the same thing to you over and over again. Your head is stuck in the mode of "expert" and not somebody who is curious and wants to learn something.
If you haven't done this before and want to explore a bit, I'd recommend using ANTLR 3. You can get an interpreter up and running pretty easily and it targets the major platforms, so you can typically use the produced grammars and generators in your language of choice. Recursive descent is usually the easiest to implement.
This is really only 5 or 10% of the work involved in creating your own language, but it's enough to experiment and give you an idea if you want to proceed with the long and hard work of developing it further.
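For a feel of the hand-written alternative, here is a minimal recursive-descent parser-interpreter for arithmetic expressions; every name in it is my own, and it's a sketch of the general technique rather than anything from ANTLR:

```python
import re

# Grammar:  expr := term ('+' term)*
#           term := atom ('*' atom)*
#           atom := NUMBER | '(' expr ')'
def tokenize(src):
    return re.findall(r"\d+|[()+*]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() == "+":
            self.next()
            value += self.term()
        return value

    def term(self):
        value = self.atom()
        while self.peek() == "*":
            self.next()
            value *= self.atom()
        return value

    def atom(self):
        if self.peek() == "(":
            self.next()
            value = self.expr()
            assert self.next() == ")", "expected ')'"
            return value
        return int(self.next())

print(Parser(tokenize("2 + 3 * (4 + 1)")).expr())   # 17
```

One function per grammar rule, precedence encoded by which rule calls which: this structure scales surprisingly far before you outgrow it.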
The author actually recommends against using a lexer/parser generator. If you're just playing around they can be fine, but he's right that when you're ready to get serious they often end up being a hindrance.
Additionally, it's not usually that hard to create your own tokenizer and parser by hand.
Honestly, if I were to make a language, I'd take most of the things that make C and Python popular, to make something as simple as possible and as similar to C as possible. All I could wish for is C with native features of Python (dicts, list comprehensions, libs like sqlite and xml...).
What I think is the most important thing for a language is that it must be easy to learn and read for beginners. There is no need to have advanced features for advanced programmers. Those guys will use other languages and their specific tools to solve their advanced problems.
A language is something that is "spoken" by many people, so the easier it is to learn, the more people will use it. It's all about a low barrier to entry.
Just stop focusing on particular features or what you like about that X or Y language. Just imagine CS students and beginners, and make a language that let them write code and do their assignments.
Cython is a fantastic and underrated language. It basically allows you to write C using Python syntax, and also to write Python in the same source file. If you use a dict or some other nice Python functionality, your code will run at Python speed, but if you add a few type annotations and whatnot it runs like C. Not to mention that you can import and use C libraries. Mix and match Python and C as required for the ultimate balance of speed and convenience. It's great.
Cython is nice to save some typing for generating wrappers. It is a very tough language to work with for actually implementing things that you'd want to do in a low-level language, because it's severely underdocumented and underspecified, but extremely complex at the same time. To actually see whether the code as written is correct, one must almost always cythonize it and dig through the generated, verbose and hard to decipher C code. Both the language and the compiler have a ton of things I'd call bugs, induced by the complexities and mismatches in the type system. E.g. it's awkwardly easy to have Cython call some PyObject function on something that isn't a Python object.
No thank you.
PS: Also avoid using setup.py Extension, including Cython's variant, if you can at all.
(Source: I've been maintaining software containing Cython for a while now, and have written and reviewed quite a bit of it, too.)
I suppose I almost solely use Cython for numerics, which seems to be reliable. Anything that's not numerics is not usually the bottleneck for my code and so is left as Python. So I can't comment on problems trying to use the C end of the Cython continuum as a general purpose replacement for C.
For my numerics at least I can't say I've had to dig through generated C. I generate the annotated html to ensure that what I thought would translate to pure C without python API calls indeed has, but I've never had to actually read the C it generates.
The annotated mode is indeed very nice to browse the generated C code, although I still needed to manually read the .c in some cases. The annotated code elides things like the (many thousand lines of) Cython support code; but since few things are documented I sometimes even needed to dig through those, deciphering nested #ifs and whatnot just to see whether the code would be correct.
E.g. what does `cdef char *something = somethingelse` give me. Even if you know the type of somethingelse it's at best a guess. (Bonus question: Say you know somethingelse is going to be a Python bytes object. Does something point to a copy?)
Cython can wrap libraries but there are better alternatives for doing that. It is of most benefit when doing computation on NumPy arrays. That way you pass in and out native C objects, do bulk computation in compiled code, and never pay the penalty of accessing a PyObject. If you can't guarantee that then there is not much point in using Cython although you still do get a speedup on PyObject code by bypassing the bytecode interp.
> All I could wish for is C with native features of python (dict, list comprehensions, libs like sqlite and xml...)
Many had these fantasies before realizing the intrinsic conflict. First, high-level data types in C are merely a library away, but that is not enough: the wish is not merely to have them, but to have them be native. What does that mean? It starts with syntax support, which can be managed by macros and, in C++, templates. But that is still not enough, so the wish really reaches the semantic level: high-level data types that don't require worrying about or understanding their low-level subtleties. But, you should now realize, without understanding and worrying about those low-level subtleties, there is simply no reason at all to use C over, e.g., Python. We use C because we want to program at a low level in ways that are impossible or difficult in other languages, e.g. for efficiency. Wanting to do low-level programming while refusing to distinguish and understand low-level subtleties is an intrinsic conflict.
So any such effort to add high-level data types natively to C ends up with syntax full of low-level detail: types, memory management, runtime, etc. The language is no longer simple, and the types still don't feel native (as they do in Python).
So honestly don't make another C++. Embrace C. Use Python.
> I'd take most thing that make C and python popular, to make something as simple as possible, and as similar to C as possible. (...) What I think is the most important thing for a language is that it must be easy to learn and read for beginners. There is no need to have advanced features for advanced programmers.
I understand that your aim would be to create a 'beginners' language. But why keep the syntax of C? C's syntax is not especially appropriate for beginners; Python's is.
C's advantage, its reason to be, was to offer a better alternative to assembler: a language almost as fast as assembler, but with higher-level constructs. Still, C is a pretty "low-level" procedural language; hence all the features for manipulating pointers, etc. C wasn't really conceived as a beginners' language.
From your description, perhaps you would better keep Python syntax as is, and perhaps simplify some features a little bit?
Python may have its failings, but its syntax is so clear that it is very well suited to beginners. And I dare say it is already used in many introductory programming courses.
I doubt you would agree, but you're somewhat describing Perl. Perl written by a C aficionado is very C-like: similar looping and control constructs, pretty direct C library mappings (sockets, for example, are almost exactly the same), etc. map() and grep() are fair approximations of list comprehensions. And popular libraries are abundant.
Is it just me being in my dotage, or does anyone else feel that websites that override my "open in new tab" action to force their tab to the top of my view, plus tout their app, should be taken out behind the shed and shot?
It's remarkably hard to find or configure an OS to not steal focus. Everything I've tried for my whole life has failed, with the exception of not using a GUI and just staying on the linux command line without letting startx run.
iOS? Steals focus all the time.
Android? Steals focus all the time.
Windows? Steals focus all the time.
Mac OS X? Steals focus all the time.
Ubuntu? Steals focus all the time.
It's outrageous and annoying. Does anyone have a simple OS configuration that will prevent all focus-stealing system dialogs from actually interrupting focus?
Edit: To continue the rant, Windows used to pop OS update dialogs that interrupted keyboard focus, took keystrokes like spacebar and enter to mean "lose all work and reboot". I remember a long time ago, typing and looking away from the screen, only to find that my computer had turned off. Infuriating.
When was the last time you used Windows? ME? Nowadays the worst that can happen is that the taskbar icon starts blinking or a non-focus-stealing balloon pops up (they're not called balloons anymore, but I don't know the new term).
*Win32 does expose functions to enumerate windows and to give arbitrary handles focus, so an app could theoretically use this to give itself focus. But that is far from the default behavior for dialogs requesting attention, and no serious application would do such a thing; such sleazy applications shouldn't be installed in the first place. Unless, that is, the application is designed to be a window manager.
I'm with you on this. I'd love to have such a configuration too.
RE your edit: I feel your pain, had that happen to me many times. Ubuntu does this too with "hey, there are updates ready to be installed!", but at least this doesn't cause a reboot.
I've spent a good part of my lifetime making that program; it deserves to be the star of the desktop. And calling the wizard of the OS to find out what happened, after the program took a gratuitous amount of time to load for its user-attention-whoring, is simply cowardice.