Awk is a great and oft-forgotten tool. Not only is it useful in itself, but the awk way of thinking about stream processing generalizes nicely to a bunch of other areas. You have a block that runs before anything else happens (BEGIN), a block run for every piece of input, and a block run just before the program exits (END). In awk, the input is a line of text, but nothing stops you from generalizing this to, say, a frame from a video (split into channels in various colorspaces, fed through a processing pipeline, returning another, processed image), a sound frame, a sensor measurement...
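That structure is literal awk syntax; a minimal sketch (the input file name is a placeholder):

    # BEGIN runs before any input, the bare block runs once per line,
    # and END runs after the last line has been read.
    awk '
    BEGIN { print "starting up" }                    # before anything else
          { lines++; words += NF }                   # once per input line
    END   { print lines, "lines,", words, "words" }  # just before exit
    ' input.txt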
"nothing stops you from generalizing this to say a frame from a video"
This is 100% true. A coworker of mine implemented an elevation-bitmap-to-3d-model conversion tool in 160 lines of Awk. It ran faster than our "good" Matlab tool by a factor of 10.
Awk (or Perl) doubles the usefulness of Unix. Most of the common commands in Unix are query commands. When you need to start manipulating the queried data, Awk is where the rubber meets the road. Piping data through the shell stops being read-only and becomes interactive.
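A hedged example of that shift from querying to manipulating (treating field 5 as the size column assumes typical "ls -l" output):

    # ls queries; awk computes. Sum the sizes of regular files
    # in the current directory.
    ls -l | awk '/^-/ { total += $5 } END { print total, "bytes" }'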
> It ran faster than our "good" Matlab tool by a factor of 10.
Could you give a bit more detail there? I don't have any experience with Matlab, but I tend to think of awk as fast to write code in (and start up), though not particularly fast in execution. (Roughly on par with Python, i.e., usually good enough.)
That's mawk. I'm talking about the implementation the post calls "nawk", and either way, I mean orders of magnitude: I care about a 10-100+x difference in speed, not a 1.1-5x one. Awk and Python fall in roughly the same performance tier for that kind of code.
Also: "I have since found large datasets where mawk is buggy and gives the wrong result. nawk seems safe." makes me uneasy, as does the fact that it was unmaintained for a while.
Afaict, mawk's maintenance is a bit up in the air: the original maintainer basically disappeared years ago without blessing a successor, so the Debian-patched version became the de-facto current version, since it at least staved off bitrot. Recently someone (Thomas Dickey) unilaterally picked up maintenance of a new upstream version, starting from the Debian-patched one, but he hasn't managed to convince the Debian mawk maintainer to accept his version as the new upstream (somewhat testy thread here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=554167). I'm personally a little more comfortable with something actively maintained like gawk, despite the speed difference.
Right. I usually use (n)awk because it's the default on OpenBSD, but have to admit gawk's artificial-filesystem-based networking support is pretty cool.
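For the curious, a sketch of what that looks like (the /inet special filenames and the |& coprocess operator are gawk extensions; the host is a placeholder):

    # gawk maps TCP sockets onto special filenames under /inet.
    # This sends a bare HTTP request and prints the response.
    gawk 'BEGIN {
        sock = "/inet/tcp/0/example.com/80"
        print "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n" |& sock
        while ((sock |& getline line) > 0)
            print line
        close(sock)
    }'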
Ultimately, what are you implying? Am I wrong? Awk (any implementation) isn't as fast as C, LuaJIT, or OCaml. It's likely to be good enough for many things, though (certainly prototyping), and it's definitely convenient for quick scripts.
General term for this: "Hylomorphism", defined as the composition of an anamorphism (a generator function) and a catamorphism (a fold/map-reduce function). The initial base case of the generator runs BEGIN{}, and the terminal base case of the fold runs END{}.
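A toy illustration of the shape, splitting the generator into its own process:

    # seq plays the anamorphism (generator); the awk program is the
    # catamorphism (fold), with BEGIN/END as its two base cases.
    # Prints 5050.
    seq 1 100 | awk '
    BEGIN { sum = 0 }     # initial base case
          { sum += $1 }   # fold step, once per generated element
    END   { print sum }   # terminal base case
    '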
Actually, I think the general term for this is a "pipe". (I've seen it called "generate and test [programming]" in Prolog books, but that's specific to a filtering pipe.)
perl -nle is a nice substitute if you need a bit more code or Perl's version of regexps. This proved quite useful when working across multiple Unices that all had different awks.
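For example (-n wraps the code in a per-line loop, -l handles newlines, -e takes the program inline; the pattern and file name are placeholders):

    # Print the first whitespace-delimited field of each line,
    # awk-style but with Perl regexps.
    perl -nle 'print $1 if /^(\S+)/' access.log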
Still, the One True Awk has my favorite opening line in its "b.c" source file:
Awk isn't so bad if you're clever about RS, but sed sucks. A tragic gap in the Plan 9 legacy has been structural regular expressions, which deal with these situations adroitly.
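A sketch of the RS trick (paragraph mode is standard awk; the pattern and file name are placeholders):

    # With RS set to the empty string, records are blank-line-separated
    # paragraphs, and FS="\n" makes each line a field, so multi-line
    # stanzas are handled as single units.
    awk 'BEGIN { RS = ""; FS = "\n" } /ERROR/ { print $1 }' logfile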