Which Parsing Approach? (tratt.net)
176 points by matt_d on Sept 15, 2020 | 84 comments



I like the way the various approaches are presented here, though the author only briefly mentions a factor that I feel short-circuits the entire discussion: error messages. Most people readily agree that recursive descent parsers can provide the best possible error messages, which, if you're writing a compiler in 2020, seems like it trumps all other considerations (especially since you can also get the best possible performance via recursive descent).

The static typing of LR is nice, but trading off user experience for developer experience seems like a bad deal.


Yeah, I think there are reasons why a lot of big compilers use some recursive descent variant (a few, like gcc, even used YACC-generated parsers for a long time before switching to hand-written recursive descent parsers), and error message generation is a big one.

IMO there's a kind of funny progression, one that circles back on itself, in which parsing approach turns out to be the most appropriate depending on the scope of your project:

- For pretty simple languages a hand-written recursive descent parser is obviously easiest

- Once your grammar is complicated enough that you start hitting precedence and ambiguity issues, or get sick of rewriting a big chunk of your parser as the grammar changes, you look into generating your parser from a BNF-like specification and end up with some variant of LL or LR

- At some point your language's grammar has mostly stabilized and you're struggling with providing good error messages or parsing performance or you've had to add one too many hacks to get around limitations of your parser generator and recursive descent starts looking good again

For my money, I tend to think that Pratt parsing/precedence climbing can extend recursive descent in a way that makes a lot of the common complaints about the complexity of dealing with operator precedence and associativity seem overstated. The trick is just that as you're building an AST, some symbols will cause you to reparent nodes that you thought you'd already placed, according to various rules. See: https://www.oilshell.org/blog/2017/03/31.html
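To make that concrete, here's a minimal precedence-climbing sketch in Python (the token set, binding-power table and tuple-shaped AST are all made up for illustration, not taken from the linked post):

  import re

  # Hypothetical binding powers; a higher number binds tighter.
  BINDING_POWER = {'+': 10, '-': 10, '*': 20, '/': 20}

  def tokenize(src):
      return re.findall(r'\d+|[A-Za-z_]\w*|[()+\-*/]', src)

  def parse_expr(tokens, min_bp=0):
      # Parse a primary expression first...
      tok = tokens.pop(0)
      if tok == '(':
          left = parse_expr(tokens, 0)
          assert tokens.pop(0) == ')', "expected )"
      else:
          left = tok
      # ...then, while the next operator binds at least as tightly as min_bp,
      # "reparent" everything built so far under that operator node.
      while tokens and tokens[0] in BINDING_POWER and BINDING_POWER[tokens[0]] >= min_bp:
          op = tokens.pop(0)
          right = parse_expr(tokens, BINDING_POWER[op] + 1)  # +1 makes it left-associative
          left = (op, left, right)
      return left

  print(parse_expr(tokenize("1 + 2 * 3 - 4")))
  # ('-', ('+', '1', ('*', '2', '3')), '4')

The `left = (op, left, right)` line is the reparenting trick described above: the node you thought was finished becomes a child of the new operator node.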

I wrote a compiler for a vaguely C-like language by hand in javascript a while back that's intended to show how simple a hand-written parser (and code generator) can end up: https://github.com/j-s-n/WebBS

It's not that hard to statically track type information along the way - the above example requires a second pass over the AST to nail things into place and make sure all our operators are operating on the right type of thing, but a lot of that information is captured during the first parser pass or even during lexing.


Yes - Pratt embedded in a recursive descent parser is hard to beat. I say this after writing parsers on and off for years. Like the author, I also went through a cycle of tools but in my case it was recursive descent, LR (Flex/Bison), parser combinators (Haskell), LL (Antlr) before returning to recursive descent.

In the end the recursive descent (+Pratt) beats them all, in my opinion:

- you can easily provide properly helpful error messages

- best general performance

- the parser can be debugged directly using normal debugging tools

- flexibility - all the tools and features of the host programming language are available

- zero dependencies, no build tool integration required.

The only issues the author seems to have with recursive descent are excessive boilerplate and the complexity of handling precedence and associativity, but:

- there should be no more boilerplate than there is for any other programming task - you write functions, methods, classes, etc., just like in normal development, to reduce boilerplate.

- using Pratt provides the structure to handle all the operator rule complexity.


runevault made a similar point below, and it's pretty valid -- if you're writing a new compiler for a new language, the target is frequently changing and you care most about iteration speed. I've mostly done compiler development for existing languages, which provides a nice fixed target and the "only" challenge is in providing a good experience for end users within that fixed target.


It seems like compilers get harder to maintain based mostly on the size of the grammar, along with how much it changes. For a small grammar, just about anything works, but for large grammars, any code that maps one-to-one with grammar nodes gets unwieldy and removing small constant factors in the amount of code per node starts looking worthwhile.


I think it depends slightly. For the final version of your compiler you are entirely correct. But starting with something easier to write, so you can rapidly iterate and experiment with your language until you're ready to worry about letting other people see it, has value as well; then you can swap your parser to RD.


Recursive descent is debuggable. A grammar rule is a function, which means you can put a breakpoint on a grammar rule, and the backtrace tells you how you got there when you hit it.

Recursive descent carries no tooling dependencies.

Recursive descent is re-entrant, unless you don't know what you're doing.
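As a toy illustration of the rule-is-a-function point (a hypothetical two-rule grammar, not from any real parser): put a breakpoint on either function and the call stack is exactly the chain of grammar rules that got you there.

  # Toy grammar:  stmt ::= 'if' expr 'then' stmt | IDENT '=' expr
  def parse_stmt(toks):
      if toks and toks[0] == 'if':
          toks.pop(0)
          cond = parse_expr(toks)
          assert toks.pop(0) == 'then', "expected 'then'"
          return ('if', cond, parse_stmt(toks))
      name = toks.pop(0)
      assert toks.pop(0) == '=', "expected '='"
      return ('assign', name, parse_expr(toks))

  def parse_expr(toks):
      return ('expr', toks.pop(0))  # deliberately stubbed out for brevity

  print(parse_stmt(['if', 'x', 'then', 'y', '=', '1']))
  # ('if', ('expr', 'x'), ('assign', 'y', ('expr', '1')))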


Menhir for OCaml is an LR parser generator with outstanding error messages. Static detection of possible error states is a big part of that.


>Most people readily agree that recursive descent parsers can provide the best possible error messages,

Well, they can, but not all of them do.

My hand-written parsers are pretty bad at it. At the lowest level I run through a string char by char and test whether it parses correctly, i.e. like: p++; if (*p != '(') throw "expected (";

At first this does not even give a line number with the error.
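For what it's worth, threading a line/column pair through the scanner is usually enough to fix that; a minimal sketch (hypothetical code, not anyone's actual parser):

  class ParseError(Exception):
      pass

  class Scanner:
      def __init__(self, text):
          self.text, self.pos, self.line, self.col = text, 0, 1, 1

      def advance(self):
          if self.text[self.pos] == '\n':
              self.line, self.col = self.line + 1, 1
          else:
              self.col += 1
          self.pos += 1

      def expect(self, ch):
          if self.pos >= len(self.text) or self.text[self.pos] != ch:
              found = self.text[self.pos] if self.pos < len(self.text) else 'end of input'
              raise ParseError(f"{self.line}:{self.col}: expected '{ch}', found '{found}'")
          self.advance()

  s = Scanner("foo\n]")
  for _ in "foo\n":
      s.advance()
  s.expect('(')  # ParseError: 2:1: expected '(', found ']'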


I recently converted my initial antlr grammar to a hand-coded recursive descent version. Beyond better error handling, there's preserving comments for auto-formatting and indexing types on tokens for auto-complete.


This is a good essay. However, the author didn't mention Pratt Parsing [1], which cleanly addresses [2] the shortcomings of recursive descent parsing. Parser generators look promising until you actually start building anything of moderate complexity. Then it's a constant cycle of kludges and hair pulling.

[1] https://en.wikipedia.org/wiki/Pratt_parser

[2] Mostly. The Pratt model does introduce potential context around operators (depending on the grammar). It's easy enough to extend the basic Pratt model to support this, but it isn't spelled out in many examples or literature.


> Parser generators look promising until you actually start building anything of moderate complexity

I've done both, by hand and with parser generators (flex/bison and antlr), and getting the machine to do the boring work is a total fuckload[0] faster and more productive.

Edit: and unless you know what you're doing, you will screw up when hand-writing a parser. I know of a commercial reporting tool that couldn't reliably parse good input (their embedded language).

[0] 3.14159 shedloads in metric


What do you think is special about recursive descent parsing that makes you more likely to screw up unless you know what you're doing?

My experience has been the exact opposite - particularly as the language gets complicated and/or weird, in which case the generated parser becomes horribly brittle. Adding an innocent looking new rule to your working Antlr or Yacc grammar feels like fiddling with a ticking bomb - it might work straight away or it could explode in your face - leaving you with hours or days of whack-a-moling grammar ambiguities.


I didn't say recursive descent parsing wrt screwing up, I just said "hand-writing a parser". Nothing about the flavour.

I guess our experiences differ but I don't know why. I have written a seriously major SQL parser in antlr and had no problem. And it was huge and complex, well that's TSQL for you.

It may be you have been parsing non-LR(1) grammars in bison which could prove a headache but... well IDK. Maybe I've been lucky.


That's an interesting coincidence - the biggest parser I wrote was also an SQL parser using Antlr. In fact, SQL was only part of it - it was a programming language that supported multiple flavours of embedded SQL (DB2 and Oracle). It worked but I always dreaded when it would have to be changed to support a new feature in a new release of Oracle (or DB/2).

I don't think it's an LR vs LL thing either. I feel that there is no sense of locality with parser generators; it's a bit like quantum mechanics - every rule has a "connection" to every other one. Change one seemingly small part of the Antlr grammar and some "far away" seemingly unrelated parts of the grammar can suddenly blow up as being ambiguous.


Coincidence indeed - I'm currently modifying my giant SQL grammar right now and building an AST off it. And struggling a bit with that, but that's mainly down to me not antlr.

It is strange that we're having such different experiences of it. I don't recognise your quantum view of it either, as antlr rules (and bison's) are very context-specific: they can only be triggered in the context of larger rules, and only by a given piece of the target language (SQL here). They get triggered only in those very specific cases. I've never had your struggles with it. I don't understand.


I completely agree, yacc/bison are my go-to tools - the big difference is that you are building trees from the bottom up rather than top down - if you're building an AST it probably doesn't matter, however a bunch of things (constant folding, type propagation) tend to go in the same direction so sometimes you can combine stuff.


Exercise for the reader: Write a simple arithmetic expression language parser, with numbers, variables, parentheses and working precedence levels. Extend it with "EXP0 EXP1 ... EXPn" syntax for function calls. Now extend it with "VAR1 ... VARn '.' EXP" syntax for lambdas. Yes, with no "fun", or "\", or "λ" to introduce it—you realise you have a function definition only when you arrive at the dot. The function definition spans as long as it makes sense.

Pretty fun, although it requires some small hackery (I solved it by testing, on seeing the dot, that the "apply" list I've built to this point contains only variable names, and then re-using it as the "arglist").


I don't know why you'd want to write lambda like that.

Long ago, I made a prototype mixfix syntax for Scheme intended for a shell-like REPL with ISWIM overtones. The paren-less function call syntax (terminated by newline or semicolon, like shell) was done by modifying the Pratt engine to treat juxtaposition as application. Sexp syntax was also valid as parenthesized expressions.


Well, when I played with a minimal lambda-calculus evaluator, I quickly got tired of having to write "\" or whatever to denote the start of a lambda. But look, for example, at this:

    (f. (x. f (v. x x v)) (x. f (v. x x v)))
        fct. n.
            (church0? n)
            (_. church1)
             _. church* n (fct (church- n 1))
Are "\" really needed there? You have to put lots parens to denote where lambdas end anyway, and they also happen to show where they start too, and a dot is a clear enough signal for a human (IMHO) to see that it's a lambda. So that's how I've came up with the idea of writing lambdas like that. Then I extended my lambda-calculus evaluator with some arithmetic primitives, and yes, the syntax got somewhat peculiar.


I haven't made the Show HN post yet, but using the parser combinator library that I've been building[1], here's an answer to your exercise:

  module Joker_vDChallengeParser
    include Parsby::Combinators
    extend self

    def parse(s)
      (expr < eof).parse(s)
    end

    define_combinator(:exp_op) {|left, right| group(left, spaced(lit("^")), right) }

    define_combinator(:pos_op) {|subexpr| group(lit("+") < ws, subexpr) }
    define_combinator(:neg_op) {|subexpr| group(lit("-") < ws, subexpr) }

    define_combinator(:div_op) {|left, right| group(left, spaced(lit("/")), right) }
    define_combinator(:mul_op) {|left, right| group(left, spaced(lit("*")), right) }

    define_combinator(:add_op) {|left, right| group(left, spaced(lit("+")), right) }
    define_combinator(:sub_op) {|left, right| group(left, spaced(lit("-")), right) }

    define_combinator(:call_op) {|left, right| group(left, ws_1, right) }

    define_combinator :identifier do
      first_char = char_in([*('a'..'z'), *('A'..'Z'), '_'].join)
      rest_char = first_char | char_in([*('0'..'9')].join)
      first_char + join(many(rest_char))
    end

    define_combinator :lambda_expr do
      group(
        sep_by_1(ws_1, identifier) < spaced(lit(".")),
        expr,
      )
    end

    def scope(x, &b)
      b.call x
    end

    define_combinator :expr do
      lazy do
        e = choice(
          decimal,
          lambda_expr,
          identifier,
          between(lit("("), lit(")"), expr),
        )

        # Each e = scope ... block is a precedence level. You can switch
        # them around to play with the precedence of the operators.
        #
        # hpe -- higher precedence level expression
        # spe -- same precedence level expression

        # Basing myself on Haskell and making function calls the highest
        # precedence operators.
        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              call_op(pure(left_result), hpe),
            )
          end
        end

        # Our only right-associative precedence level.
        e = scope e do |hpe|
          recursive do |spe|
            choice(
              exp_op(hpe, spe),
              hpe,
            )
          end
        end

        e = scope e do |hpe|
          recursive do |spe|
            choice(
              neg_op(spe),
              pos_op(spe),
              hpe,
            )
          end
        end

        # Left-associativity done by parsing left operand bottom-up and
        # right operands top-down.
        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              mul_op(pure(left_result), hpe),
              div_op(pure(left_result), hpe),
            )
          end
        end

        e = scope e do |hpe|
          reduce hpe do |left_result|
            choice(
              add_op(pure(left_result), hpe),
              sub_op(pure(left_result), hpe),
            )
          end
        end
      end
    end
  end

That was a nice exercise. Here's some example parses:

  pry(main)> Joker_vDChallengeParser.parse "- 3 - foo bar . foo - bar - 5"                                     
  => [["-", 3], "-", [["foo", "bar"], [["foo", "-", "bar"], "-", 5]]]
  pry(main)> Joker_vDChallengeParser.parse "- 3 - (foo bar . foo - bar - 5) - 4 * -2 + 5 / 2"                 
  => [[[["-", 3], "-", [["foo", "bar"], [["foo", "-", "bar"], "-", 5]]], "-", [4, "*", ["-", 2]]], "+", [5, "/", 2]]
  pry(main)> Joker_vDChallengeParser.parse "- 3 - (foo bar . foo 5 6 - bar - 5) - 4 * -2 + 5 / 2"             
  => [[[["-", 3], "-", [["foo", "bar"], [[[["foo", " ", 5], " ", 6], "-", "bar"], "-", 5]]], "-", [4, "*", ["-", 2]]], "+", [5, "/", 2]]

  pry(main)> Joker_vDChallengeParser.parse "foo bar . foo bar baz qux . baz qux"                              
  => [["foo", "bar"], [["foo", "bar", "baz", "qux"], ["baz", " ", "qux"]]]
  pry(main)> Joker_vDChallengeParser.parse "foo bar . foo bar + baz qux . baz qux"                            
  => [["foo", "bar"], [["foo", " ", "bar"], "+", [["baz", "qux"], ["baz", " ", "qux"]]]]

You can try it by cloning my repo, running `./bin/console`, and pasting the module, or putting it in a file in `lib/`. If you put it in a file, you can reload changes with `reload!` in the console.

[1] https://github.com/jolmg/parsby

EDIT: Fixed the syntax for newer versions of Ruby.


Looking back at it, I forgot to allow spacing between the parentheses and the inner expression:

  between(lit("("), lit(")"), spaced(expr)),
and using space as an operator prevents `(foo)(bar)` from being parsed as a call.

This is better, just adjacent expressions with optional in-between whitespace:

  define_combinator(:call_op) {|left, right| group(left < ws, right) }
Too late to edit, though. Oh well.


Also no mention of combinator parsing.

https://en.wikipedia.org/wiki/Parser_combinator


They're mentioned in footnote 1 as a flavor of recursive descent parsing.


I've had good luck with a mix of recursive descent -- for high fidelity error messages -- and Pratt at the expression level. I suspect that if expressions got seriously complicated in my applications (e.g., lots of different types with complicated promotion and mixing rules) that I'd write the expression parser in R-D as well.

I'm happy giving up some performance for a better chance at finding problems. Who cares if your compiler clocks at a bazillion lines a second if all it can say is "line 101: syntax error" or mumble something indistinct about a reduce-reduce conflict.

I'm not writing commercial compilers, just little language processors. And my users need all the help they can get.


I too have had great experiences with recursive descent. I don't understand why so many people don't like it. The techniques for handling left- and right-associativity of infix operators have been well known for many decades, so I am surprised that the original article thinks it is still an issue. But I have met some people who felt uncomfortable with recursive descent parsers that were implemented via recursive procedure calls, just because they didn't quite understand recursive procedure calls in the first place. So maybe that's part of the reason.


My experience has been that those who don't like RD have a mostly theoretical/academic background and are likewise similarly unimpressed with "simple" or "unsophisticated" code in general.


The author's case against RD is not about the implementation. At the end of the article he calls for LR reference grammars (I imagine both for fuzz testing and documentation) to ensure that your perfectly usable RD parser with nice error messages does not miss some edge cases.


> I don't understand why so many people don't like it.

2 reasons:

- it doesn’t warn you about ambiguous grammar, so you don’t know when you screw up

- it requires a lot of boilerplate for operator precedence (one function for each level); in an LR parser generator that's trivial, so I can spend time on more important things


I'd love for a post on reproducible, incremental parse trees. Roslyn, Rust Analyzer, et al use them to provide good language server facilities. Basically their parse trees are immutable and can recover the entire source file, comments and all. These allow for fast parsing that can be used to do code modification and good semantic analysis.

Another area that should be mentioned is purposefully messy grammars. A common addition to languages these days is arrow functions:

      (a, b, c) => { ...
These are really nice for defining functions, but they usually cause unbounded lookahead. For instance in JavaScript, the arguments list can also be parsed as a comma expression inside parentheses. In my language they can be parsed as a tuple expression. There might be some tricky way of doing this with an LR parser, but I can't figure it out. Therefore I needed to parse arrow functions/tuples by basically assuming they're an arrow function, then if I'm mistaken, I convert my parameters list into a tuple expression and continue parsing. For instance:

     (a, b, 10)
When we hit the 10, my parser goes "ah shoot, this is a tuple" and converts the (a, b to be expressions in a tuple.

Some may argue that grammars should just stay within LR/LALR/LL, but I do think adding a little bit of parsing complexity for ease of use is a worthy tradeoff.


To your first point, I've got a couple of bookmarks:

- Query-based compiler architecture: https://ollef.github.io/blog/posts/query-based-compilers.htm...

- Lezer [a parsing system used by CodeMirror]: https://marijnhaverbeke.nl/blog/lezer.html

- Anders Hejlsberg on Modern Compiler Construction: https://www.youtube.com/watch?v=wSdV1M7n4gQ


I'm familiar with the first and last link. I'd love for there to be a centralized resource or even working group on these ideas. More and more people are learning and implementing them but they're still limited to the cutting edge compiler authors and the CS equivalent of mathematical folklore.


I believe the way it's usually done is to always parse a tuple, then parse the arrow as a binary operator applied to the two surrounding expressions. This potentially produces a syntactically incorrect AST (e.g. if there is an arrow following a tuple containing numbers), so you need to detect those in another pass and produce the corresponding error.


Can be done in one go: just check the type of the left and right operand at the end of X -> expr "=>" expr, or assign "=>" the highest priority when following a tuple, X -> tuple ("=>" "{" body "}")?.
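A minimal recursive-descent sketch of that check-at-the-end idea (token and AST shapes are made up for illustration): parse the parenthesised list as generic expressions, and only decide between tuple and arrow parameters once you see what follows the ')'.

  def parse_parens(toks):
      assert toks.pop(0) == '('
      items = []
      while toks[0] != ')':
          items.append(parse_expr(toks))
          if toks[0] == ',':
              toks.pop(0)
      toks.pop(0)  # consume ')'
      # Only now do we know what we parsed: arrow parameters or a tuple.
      if toks and toks[0] == '=>' and all(kind == 'ident' for kind, _ in items):
          toks.pop(0)
          return ('arrow', items, parse_expr(toks))
      return ('tuple', items)

  def parse_expr(toks):
      tok = toks.pop(0)
      if tok == '(':
          toks.insert(0, tok)
          return parse_parens(toks)
      return ('number' if tok.isdigit() else 'ident', tok)

  print(parse_parens(['(', 'a', ',', 'b', ',', '10', ')']))
  # ('tuple', [('ident', 'a'), ('ident', 'b'), ('number', '10')])
  print(parse_parens(['(', 'a', ',', 'b', ')', '=>', 'a']))
  # ('arrow', [('ident', 'a'), ('ident', 'b')], ('ident', 'a'))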


What's subtly tricky about my language is that not all function arguments are valid expressions. For instance I allow:

   (a: int, b: float, c) => { ....
which is not a valid tuple expression. Just had to make my life harder haha


Something like:

  expr ::= <all the other expressions>
         | '(' [ expr { ',' expr } ] ')'
         | '{' { expr } '}'
         | expr '=>' expr
         | identifier '(' expr { ',' expr } ')'
         | literal
Tried adding it to the javascript grammar I play with and the parser generator (Coco/R) didn't complain about conflicts -- though it doesn't have tuples so YMMV.

--edit--

Played with it a bit more and added tuples (as a replacement for the '(' expr ')' grouping expression, exactly like the now-edited example above) and it still seems happy. If I were to add this to the grammar for real I'd have to monkey around with it some more to ensure there was an actual '(' ... ')' expression before the => but it seems easily doable by transforming the grammar a bit.


What do you mean by incremental?

I recently hand-wrote my own parser so I could achieve some of these results. Having used parser generators in the past, I realize now that hand writing a parser is SO much more flexible.


Lots of these parsers figure out what changed and only reparse the difference. This is pretty important if the parser is powering a language server, as reparsing needs to happen while the user is typing, so that autocomplete and other features work.


Lisp Programmers: "You guys are still trying to figure out parsing?"

On a more serious note, the grammar ambiguity mentioned in the article is one of the things that has drawn me to languages that use s-expressions. Lisps are just as varied in their semantics as any other set of languages. What sets them apart is their shared syntactic core of prefix notation in combination with parentheses to create s-expressions. This eliminates the vast majority (though not all) of ambiguities concerning precedence and associativity.


While I appreciate the benefits of Lisp, there is still a tradeoff. PG said in one of his essays that in Lisp you write your program directly in the parse trees that in other languages get generated when the code is parsed. But that means that in Lisp you have to express everything in terms of parse trees, and parse trees can be cumbersome. That's why other languages have syntax; yes, in a way it's just syntactic sugar for parse trees, but syntactic sugar has uses.


It seems more like you hardly ever write parse trees directly in Lisp. Instead you write macros that get expanded into them.

It’s not really parsed until you have a data structure that lets you search for all occurrences of a special form. Using lists for everything ignores important distinctions between levels.


Macros are parse trees, which define code that operates on other parse trees to expand them into still other parse trees. The ability to operate directly on parse trees and transform them into other parse trees is a key advantage of writing parse trees directly instead of having a layer of syntax in between.


Yeah, they're the ultimate 'syntactic sugar'. They take a bit of time to wrap your brain around, but it's much easier to figure out what a macro is doing under the hood than it is to figure out the exact semantics of syntactic sugar in other languages. I try not to use them more than I have to, but for a certain subset of problems there's nothing better for eliminating repetitive code and for taking the language semantics out of the foreground so that you're just left with your application semantics.


PEG/parser combinators + Pratt has many advantages over LR:

1. It's faster

2. It's quite high level, arguably higher level than LR

3. Can more easily support good error messages

4. Can easily escape hatch into the surrounding Turing complete programming language, so you can parse anything

5. Does character level parsing, so no lexer required, and easily supports parsing sublanguages, e.g. regex literals inside another programming language

6. Can make use of the abstraction facilities of the surrounding programming language

7. No shift/reduce errors

8. The whole parsing stack is WAY simpler than LR

The main advantages of LR are:

1. Its theory is nicer because CFGs have nicer denotational semantics

2. It perhaps supports better automatic error recovery

Anything I forgot?

P.S. whether CFGs are actually a good formalism for programming languages is debatable. At the character level, most programming languages' grammars are not context free. So the context-freeness depends on the non-context-freeness of the lexer. It's fine to use a lexer of course (as long as you do not need to combine multiple languages), but this observation makes CFGs lose some of their theoretical appeal, imo. Character level CFG parsers have some non-context-free extensions to actually parse the things we want to parse.


> We all know that parsing is an important part of designing and implementing programming languages, but it’s the equivalent of Brussels sprouts: good for the diet, but a taste that only a select few enjoy.

Ever since reading Crafting Interpreters I've found myself obsessed with writing parsers. I keep making new ones over and over, looking for little things to improve on in projects that nobody will ever use. I'm actually a little disappointed that parsing is considered "the easy part" of compilers; a "solved problem", compared to things like type analysis and bytecode optimization.


Strong disagree on his comments about performance, unless you know that your inputs will never be "big". Modern computers are fast, yes, but modern data sets are also often massive. Thousands of lines a second is not good; it's like 1% of what the hardware is capable of. For a personal project or as part of a product with constrained input sizes that may be fine, but don't release that to the world for unknown input sizes.


Assuming you are using a deterministic algorithm, like a Deterministic Finite Automaton for the regular expressions of your Type-3 (regular) languages (e.g., the lexer), and something like LL/LR for the Type-2 (context-free) languages, then processing time is always O(n), where n is the number of tokens in the input. In the end they are all just table-driven state machines, with one lookup per input token. I don't see how they could be made much faster.

When he says LR parsers were hard for old machines to handle, he was referring to generating the parsing table, which happens once, when the parser itself is created.
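As a toy illustration of the "just a table-driven state machine" point, here's a DFA for identifiers where matching costs a single table lookup per input character (the table and state names are made up):

  ALPHA, DIGIT, OTHER = 'alpha', 'digit', 'other'
  TABLE = {
      ('start', ALPHA): 'ident',
      ('ident', ALPHA): 'ident',
      ('ident', DIGIT): 'ident',
  }
  ACCEPTING = {'ident'}

  def classify(ch):
      if ch.isalpha(): return ALPHA
      if ch.isdigit(): return DIGIT
      return OTHER

  def matches(s):
      state = 'start'
      for ch in s:
          state = TABLE.get((state, classify(ch)))  # one lookup per character
          if state is None:
              return False
      return state in ACCEPTING

  print(matches("foo42"), matches("4foo"))  # True False

An LR parser works the same way at the token level, just with a push-down stack alongside the state.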


"Don't release this to the world" strikes me as a bit hyperbolic, because parsing is rarely the bottleneck. Considering editors, individual source files are typically (relatively) small, even in a large project. Considering compilers, parsing is already the smallest portion of the run-time anyway.


Maybe it's my bias, but in a lot of applications I touched parsing or unparsing (and/or the associated inefficient I/O) would often be the single largest consumer of CPU time.


One thing I don't often see in the discussion of parser technology is what stage in the development cycle you employ a certain strategy.

A serious compiler will inevitably use a hand-written parser. Once your language needs mature IDE support, you will need a parser that not only produces good error messages, but is also error tolerant, e.g. a local parsing failure does not result in a global cascade of errors in your editing environment.

But that's not the stage where grammars and parser-generators shine. LR(k) grammars accept the complete class of deterministic context-free languages, allowing you to produce an efficiently parsable language with no ambiguity. The grammar itself can be easily modified to introduce new syntax, just by updating some production rules, and it also serves as documentation to incorporate into your language specification.

Sure, sometimes you have to deal with shift-reduce and reduce-reduce conflicts (akin to static type errors), but once you get the hang of it, parser generators become instrumental tools in language prototyping. Moreover, current advancements in LR parsers address a lot of these drawbacks.


How do parser combinators like megaparsec fit into this?

https://hackage.haskell.org/package/megaparsec


The first footnote of the article says:

> While that could mean just about anything, nearly everyone who writes a half-decent hand-written parser, whether they know it or not, is writing a recursive descent parser [1].

> [1] Parsing Expression Grammars (PEG)s and “parser combinators” in some functional languages are just recursive descent parsers in disguise.

Which, while I haven't thought about it deeply, intuitively makes some sense: parser combinators encourage an approach mostly equivalent to little functions that do something like "parse a $thing, or parse a $different_thing" (etc.), which translates down to pretty much a recursive descent parser.
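A tiny sketch of that equivalence (hypothetical combinators, nothing to do with megaparsec's actual API): each parser is just a function from input to (result, rest), and alternation/sequencing are plain function calls, i.e. recursive descent.

  # Each "parser" is a function: input string -> (result, remaining) or None.
  def lit(s):
      def p(inp):
          return (s, inp[len(s):]) if inp.startswith(s) else None
      return p

  def alt(*parsers):  # "parse a $thing, or parse a $different_thing"
      def p(inp):
          for parser in parsers:
              r = parser(inp)
              if r is not None:
                  return r
          return None
      return p

  def seq(*parsers):
      def p(inp):
          results = []
          for parser in parsers:
              r = parser(inp)
              if r is None:
                  return None
              res, inp = r
              results.append(res)
          return (results, inp)
      return p

  greeting = seq(alt(lit("hello"), lit("hi")), lit(" world"))
  print(greeting("hi world"))  # (['hi', ' world'], '')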


There is also this interesting stackexchange answer which compares different types of parsers https://softwareengineering.stackexchange.com/a/338888/37266...


I feel like the error handling is the trickiest part. I remember during my compilers course I fought YACC so hard to get it to work well and provide meaningful error messages. Bison offered some relief at the time through custom extensions. Maybe the state of the art here is much better now. It was definitely fun to do as a project, and it was depressing that my course didn't cover error reporting in any way, likely because there's no formal theory behind how to make it readable to humans (beyond a "wow - yours is the first to do good error messages" in the project feedback, it wasn't mentioned in class).


This should probably be titled "why I like LR parsers".

There was basically no discussion of PEGs or parser combinators, just a footnote that he doesn't care for them.

Which is fine! LR parsing is a great approach, he's committed significant amounts of his life energy to working with them.

I think PEGs are great, though, I'd recommend them to any developer as the default tool for munging structured data. Nearly any mess of regular expressions can be improved with PEGs.

There are unsolved problems in the PEG world, particularly with error recovery. But not insoluble problems by any means.

I'm not quite ready to show my work on PEGs but I can say I've gotten pretty far with the approach. It's noteworthy that he starts with recursive descent, notes that PEGs and combinators (same same basically) generalize recursive descent, and then just kinda walks away from that whole line of inquiry!

But hey, I'm a parsing nerd. I'm glad Laurence Tratt has pushed the state of the art in LR, just like I'm glad Terrence Parr has done the same for LL. I'm working in a similar space with PEG, and have a long way to go.


I'm experimenting with a toy language and I wanted to introduce generic syntax with angle brackets. I recall the golang team saying that it could lead to unbounded lookahead. Is there a parsing approach that mitigates that?


Yeah, not using angle brackets. ^_^

If you allow a, b = f < t1, t2 > ( c ), it can only be parsed if you know whether f is generic (if it's not, then a and b are bools), which requires whole program parsing because f could be defined above, below, or in another file.

Basically, if you want to have a sane grammar, don't use angle brackets for generics or pipes for lambdas. It looks good to humans but really screws up the computers.


Or you could use a simple trick: in cases where '<' can either open a generic or be a comparison, you require it to have whitespace before it to be a comparison. So your example is an assignment. To make it a function call, you'd need "a, b = f< t1, generic_type < t2, generic_type < t3 >> > ( c )". Notice that when you're already inside a generic's arguments, you can't use comparison, so whitespace there is allowed before "<".

IIRC, Elm uses this approach to disambiguate '.' and unary minus in function applications.

The major nuisance is that you need to somehow handle "<<" and ">>" properly: either by a "lexer hack"; or by lexing only single "<" and ">", and recombining them in the parser when appropriate; or by splitting "<<" and ">>" in the parser when appropriate; or by not having "<<" and ">>" in the first place.
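A toy sketch of just the whitespace half of that rule (token names are made up, and it ignores the inside-a-generic exception described above):

  # '<' preceded by whitespace is a comparison; otherwise it opens generic args.
  def classify_angles(src):
      kinds = []
      for i, ch in enumerate(src):
          if ch == '<':
              preceded_by_space = i > 0 and src[i - 1].isspace()
              kinds.append('LESS_THAN' if preceded_by_space else 'GENERIC_OPEN')
      return kinds

  print(classify_angles("a < b"))   # ['LESS_THAN']
  print(classify_angles("f<t1>"))   # ['GENERIC_OPEN']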


Doesn't golang not support `a < b < c` and chaining of comparisons? I don't see why they make that argument in their specification if it's not already legal Golang syntax

specification: https://go.googlesource.com/proposal/+/refs/heads/master/des...


It does not, so the argument is very much valid.

https://play.golang.org/p/NQL1rVp-wVC -- it's trying to interpret that as (1<2)<3, and the types make no sense for that.


Which argument is valid?


That using angle brackets for generics would unduly complicate Go parsing.


Then although I agree, I'm still not sure I follow.

> It does not [support chained comparisons], so the argument is very much valid.

Was your comment to the effect of "although Go does not support chained comparisons, it is syntactically valid for a comparison operator to be passed the result of a comparison so the issue can still arise"?

That's valid, but I would have expected "but" in place of "so".

(sorry if it seems I'm picking nits; I'm just trying to understand and explaining my attempts)


It's not chaining of comparisons, but rather a multi-valued expression - note the comma.


> pipes for lambdas

Whats wrong with pipes? The only conflict I can think of is LOGICAL-OR || but afaik no language lets you shove shit between the pipes and still be an OR


`|` is bitwise-or in many C-style languages.


t1 and t2 should be (user defined?) types so I would imagine LR(1) should be able to disambiguate between a generic function and the less than operator.

Not saying the grammar would be context-free though.


You could help the grammar by having some lexical distinction between types and other identifiers.

Relying on that is going to prevent you from including numbers as things you can be generic over (without some additional syntax for that case). That may or may not be something you care about.


I know Rust uses the turbofish (foo::<Type>()) to solve it in expressions. As long as you keep type definitions strictly separated from expressions, it should be fine. But also there's various common syntax patterns that cause unbounded lookahead (arrow functions for instance)


Unbounded lookahead is a feature of the problem, not the parsing approach. Just ask yourself: if you wouldn't see what comes after it, could you decide unambiguously what kind of construct is being parsed?

There are several possible solutions in language syntax (and it's possible you won't like any of them).

One is to use whitespace to disambiguate. a < b is parsed as a comparison, a< b or a<b as an opening angle bracket.

Another is to keep track of the type while parsing, and parse a < b as an angle bracket only if a is known to be callable.

Yet another would be to require some fancy quotes instead, like f«a, b» for generics, sidestepping the problem completely.


The way I solve this in my parsers is that I have a generic "node" syntax that isn't differentiated between types and expressions. So e.g.

    x[1] : array[int]
would be parsed as

    cast(item(ident("x"), int("1")), item(ident("array"), ident("int")))
Then the semantic disambiguation (`x[1]` -> expression, `array[int]` -> type) would be done in the next stage.


I'd love to read more about the golang team's comments about generic syntax - the only thing I could find was this: https://go.googlesource.com/proposal/+/refs/heads/master/des...

Was that what you were referring to?


Yep


I'd contend that the obvious, dead simple and user-friendly way to deal with this is to not treat `a<b>(c)` the same as `a < b > (c)`. Mandate whitespace on both sides of relational operators and forbid it where it would look wrong anyway in generic expressions. Ambiguity resolved at the lexing level.


How about PEG parsers - or are they just a special case of recursive descent parsers?


He shows us Flex/Yacc and shift/reduce errors then moves on. I'd greatly appreciate tips on tools that either

- help better diagnose where/why shift/reduce conflicts arise in the context of a given grammar

- argue whether Pratt or another approach entirely avoids shift/reduce errors.


Ctrl+F "pratt", no matches. How is this technique still not widely known? (Though one of their sources does mention it in passing)


Pratt parsing (aka top-down operator precedence) is a special case of LR parsing. Since Laurie Tratt (the author of the article) is advocating LR parsing, he has no need to talk about Pratt parsing since it is a special case of the approach he advocates.

The nicest explanation of this fact is in Olivier Danvy and Kevin Millikin's paper Refunctionalisation at Work -- as an example of their program transformation technique, they demonstrate how operator precedence turns into shift-reduce. It's really cute.


Nice overview, but it misses the important subfield of incremental parsing. Modern IDEs couldn't exist without it.


I'd like to learn more about formal grammars and parsing in depth. Can anyone recommend a good textbook?


Engineering a Compiler by Cooper & Torczon covers the theory behind Regular expressions, LL and LR parsing.


Everyone in this thread seems to be saying Pratt and PEG matter most, which I don't see in that book. Any recs on that front? (Or do you think they're not important?)


I recommend becoming familiar with the theory and making the decision for yourself.

My personal preference is pretty much the same as Tratt's, except I never had an aversion to LR grammars to begin with. I always found them intuitive. I had used them before understanding the theory.


Thanks!



