I wrote this paper in January 2015, so some things changed in the meantime:
- the benchmarks have been optimized a bit, and someone contributed a cereal parser that beats nom ( https://github.com/Geal/nom_benchmarks ). It's alright, I just wanted to estimate where it stands
Langsec is one of the most interesting approaches to follow in security, and the ideas presented this year were amazing (I'm especially fond of the heap exploitation algebra). I highly recommend watching the videos once they will be available.
Great work! I've been working on a Rust parser combinator library myself ( https://github.com/DanSimon/peruse ), though I've been focusing more on parsing syntax trees than binary formats, and it's really cool to see the different approaches you've taken. I had been planning on working on stream parsers next, but I may just defer to nom since it basically looks like what I had wanted to do.
I really like your approach with trait objects. This is something I would have liked in nom when I started it, but macros proved more useful for my experimentations (less types to rewrite when I change something).
For the streaming part, it is still a work in progress. I need a way to better represent input enabled state machines, otherwise we'll end up with streaming switch based state machines, and they are a security nigtmare.
> Langsec is one of the most interesting approaches to follow in security, and the ideas presented this year were amazing (I'm especially fond of the heap exploitation algebra). I highly recommend watching the videos once they will be available.
Getting parsing done correctly and efficiently is a requirement that pops up over and over. More work in the field is always good. Interesting paper, although I haven't heard of the parsers. I usually hear about yacc, GOLD parsing system, and so on. Like to see comparisons with most common ones for probable use cases in terms of productivity, safety, and performance.
Far as next project, Leroy's people at INRIA did excellent work in verified, LR parsers [1]. I'm not sure that anyone is building on it at the moment. Implementing that in or integrating with Rust might make for one heck of a parsing system. So long as correspondence was proven, it would be the safest one in a systems language.
The parser combinators are a common approach in functional languages. Basically, instead of generating the whole parser from a grammar, you assemble a lot of small functions, in other functions. The resulting code often ressembles the grammar very closely, and all of the intermediate parsers are very easy to test.
Proving with Coq the soundness of a parser compared to its grammar is a cool approach! The biggest problem one has when writing parsers in "safe" systems like parser generators or parser combinators, is the gaps between the input language intended by the designer, the input language described by the grammar and the input language described by the code. Anything that can reduce those gaps is welcome.
I shoul try at some point to use Coq or some SMT based system to hunt ambiguity in formats. That would make an interesting research ;)
That makes sense. The combinator approach might be verified using a form of design by contract or VCG's if it's simple functions. I'll look into them when I take the deep dive into functional programming.
As OP says "benchmarks, they have to be considered with skepticism", but with Nom implementing the "most naive recursive parsing", even the UN-optimised benchmarks look impressive
Ok, if Rust adds higher-kinded types, I'm giving it a try. This is very interesting, being able to write almost Haskell-style code without laziness or garbage-collection.
I am wondering how it compares to rust-peg[1].
I am currently using it and the syntax is very nice. At first look it seems to be more verbose to use Nom.. Is it much faster then rust-peg, though?
I do not know if it is faster, I should add it to the benchmarks :)
The biggest difference here is the use of a syntax extension to parse the grammar. This is something I would like to do at some point, because it can make the code nicer. Right now, macros are good, because you can use nom directly without Rust 1.0, no need for feature gates.
That'd be nice.. i am currently contemplating to move or not.
Nom does run on Rust stable, right? Although i find rust-peg super easy to use, i don't like that it's only working on nightly. A difference in performance would be another reason to switch over.. or not.. :)
Parser combinators are recursive descent parsers, so they do not manage ambiguities at compile time. At runtime, the first parsing path that works will win.
Nom can handle easily regular, context-free and context sensitive grammars (a lot of binary formats are a bit context sensitive, so that was a requirement). I suspect it could do recursively enumerable too.
How hard is it to translate something written for "pyparsing" to Nom? The approach seems reasonably similar, but are there limitations in Nom that aren't in "pyparsing"?
I have never used pyparsing, but the approach seems similar. Right now, the only real limitation is that I still have a few combinators to implement. Otherwise, anything complex you might want can be done by writing a simple function.
I wrote this paper in January 2015, so some things changed in the meantime:
- the benchmarks have been optimized a bit, and someone contributed a cereal parser that beats nom ( https://github.com/Geal/nom_benchmarks ). It's alright, I just wanted to estimate where it stands
- there is better error management: https://github.com/Geal/nom/wiki/Error-management
- the error management types give powerful debugging features: http://dev.unhandledexpression.com/slides/langsec-2015/img/c... (test code: https://github.com/Geal/nom_colored_hexdump )
- It is now easy to embed in C ( https://github.com/Geal/nom_in_c ). I plan to make API compatible versions of some C libs
- there are more example parsers now: https://github.com/Geal/nom/issues/14 feel free to pick one up!
Here are the slides for the conference at the IEEE Langsec workshop: http://dev.unhandledexpression.com/slides/langsec-2015/
Langsec is one of the most interesting approaches to follow in security, and the ideas presented this year were amazing (I'm especially fond of the heap exploitation algebra). I highly recommend watching the videos once they will be available.
Writing parsers with nom is easy and fun, please give it a try ;) https://github.com/Geal/nom