Awk is a great and oft-forgotten tool. Not only is it useful in itself, but the awk way of thinking about stream processing generalizes nicely to a bunch of other areas. You have a block that runs before anything else happens (BEGIN), a block run for every piece of input, and a block run just before the program exits (END). In awk, the input is a line of text, but nothing stops you from generalizing this to, say, a frame from a video (split into channels in various colorspaces, fed through a processing pipeline, returning another, processed image), a sound frame, a sensor measurement...
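That structure is literal awk syntax; a minimal sketch (the input file name is a placeholder):

    # BEGIN runs before any input, the bare block runs once per line,
    # and END runs after the last line has been read.
    awk '
    BEGIN { print "starting up" }                    # before anything else
          { lines++; words += NF }                   # once per input line
    END   { print lines, "lines,", words, "words" }  # just before exit
    ' input.txt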
"nothing stops you from generalizing this to say a frame from a video"
This is 100% true. A coworker of mine implemented an elevation-bitmap-to-3d-model conversion tool in 160 lines of Awk. It ran faster than our "good" Matlab tool by a factor of 10.
Awk (or Perl) doubles the usefulness of Unix. Most of the common commands in Unix are query commands. When you need to start manipulating the queried data, Awk is where the rubber meets the road. Piping data through the shell stops being read-only and becomes interactive.
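A hedged example of that shift from querying to manipulating (treating field 5 as the size column assumes typical "ls -l" output):

    # ls queries; awk computes. Sum the sizes of regular files
    # in the current directory.
    ls -l | awk '/^-/ { total += $5 } END { print total, "bytes" }'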
> It ran faster than our "good" Matlab tool by a factor of 10.
Could you give a bit more detail there? I don't have any experience with Matlab, but I tend to think of awk as fast to write code in (and start up), though not particularly fast in execution. (Roughly on par with Python, i.e., usually good enough.)
That's mawk. I'm talking about the implementation the post calls "nawk", and either way, I mean orders of magnitude: I care about a 10-100+x difference in speed, not a 1.1-5x one. Awk and Python fall in roughly the same performance tier for that kind of code.
Also: "I have since found large datasets where mawk is buggy and gives the wrong result. nawk seems safe." makes me uneasy, as does the fact that it was unmaintained for a while.
Afaict, mawk's maintenance is a bit up in the air: the original maintainer basically disappeared years ago without blessing a successor, so the Debian-patched version became the de-facto current version, since it at least staved off bitrot. Recently someone (Thomas Dickey) unilaterally picked up maintenance of a new upstream version, starting from the Debian-patched one, but he hasn't managed to convince the Debian mawk maintainer to accept his version as the new upstream (somewhat testy thread here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=554167). I'm personally a little more comfortable with something actively maintained like gawk, despite the speed difference.
Right. I usually use (n)awk because it's the default on OpenBSD, but have to admit gawk's artificial-filesystem-based networking support is pretty cool.
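For the curious, a sketch of what that looks like (the /inet special filenames and the |& coprocess operator are gawk extensions; the host is a placeholder):

    # gawk maps TCP sockets onto special filenames under /inet.
    # This sends a bare HTTP request and prints the response.
    gawk 'BEGIN {
        sock = "/inet/tcp/0/example.com/80"
        print "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n" |& sock
        while ((sock |& getline line) > 0)
            print line
        close(sock)
    }'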
Ultimately, what are you implying? Am I wrong? Awk (any implementation) isn't as fast as C, LuaJIT, or OCaml. It's likely to be good enough for many things, though (certainly prototyping), and it's definitely convenient for quick scripts.
General term for this: "Hylomorphism", defined as the composition of an anamorphism (a generator function) and a catamorphism (a fold/map-reduce function). The initial base case of the generator runs BEGIN{}, and the terminal base case of the fold runs END{}.
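A toy illustration of the shape, splitting the generator into its own process:

    # seq plays the anamorphism (generator); the awk program is the
    # catamorphism (fold), with BEGIN/END as its two base cases.
    # Prints 5050.
    seq 1 100 | awk '
    BEGIN { sum = 0 }     # initial base case
          { sum += $1 }   # fold step, once per generated element
    END   { print sum }   # terminal base case
    '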
Actually, I think the general term for this is a "pipe". (I've seen it called "generate and test [programming]" in Prolog books, but that's specific to a filtering pipe.)
perl -nle is a nice substitute if you need a bit more code or Perl's version of regexps. This proved quite useful when working across multiple Unices that all had different awks.
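For example (-n wraps the code in a per-line loop, -l handles newlines, -e takes the program inline; the pattern and file name are placeholders):

    # Print the first whitespace-delimited field of each line,
    # awk-style but with Perl regexps.
    perl -nle 'print $1 if /^(\S+)/' access.log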
Still, the One True Awk has my favorite opening line in its "b.c" source file:
Awk isn't so bad if you're clever about RS, but sed sucks. A tragic gap in the Plan 9 legacy has been structural regular expressions, which deal with these situations adroitly.
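A sketch of the RS trick (paragraph mode is standard awk; the pattern and file name are placeholders):

    # With RS set to the empty string, records are blank-line-separated
    # paragraphs, and FS="\n" makes each line a field, so multi-line
    # stanzas are handled as single units.
    awk 'BEGIN { RS = ""; FS = "\n" } /ERROR/ { print $1 }' logfile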