Fd and ripgrep/rg are the two "new" alternatives I use on a regular basis, and they are just huge improvements to life. Both of these find/search programs respect your .gitignore files, which helps enormously and makes searching my department's entire codebase really fast.
https://github.com/chmln/sd: "sd uses regex syntax that you already know from JavaScript and Python. Forget about dealing with quirks of sed or awk - get productive immediately."
It would be interesting to test the ~1.5GB of JSON the author uses for the benchmark against sed, but there are no details on how many files there are or what they contain.
When trying something relatively small and simple, sd appears to be slower than sed. It also appears to require more memory. Maybe others will have different results.
# using dash, not bash
echo j > 1
time sed s/j/k/ 1      # sed writes the result to stdout
time -p sed s/j/k/ 1
time sd j k 1          # sd, given a file argument, modifies the file in place
time -p sd j k 1
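To get closer to the author's ~1.5GB scale, something like the following would stream a larger input through both tools (the file name, size, and JSON shape here are made up for illustration, and reading from stdin keeps both tools doing the same streaming work):

yes '{"key": "j"}' | head -c 1500000000 > big.json
time sed 's/j/k/g' < big.json > /dev/null
time sd 'j' 'k' < big.json > /dev/null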
Opposite problem from the sd author for me. For system tasks, I'm more familiar with the faster sed and awk than with the slower Python and JavaScript, so I wish that Python and JavaScript regexes looked more like sed and awk, i.e., BRE and occasionally ERE. Someone in the NetBSD core group once wrote a find(1) alternative that had C-like syntax, similar to how awk uses a C-like syntax. That makes sense, because C is the systems language for UNIX; among other things, most of the system utilities are written in it. If the user knows C then she can read the system source and modify/repair the system where necessary, so it is beneficial to become familiar with it. Is anyone writing system utility alternatives in Rust that use a Rust-like syntax?
Agree, I've started replacing my `perl -pe s/.../.../g`s with `sd`. It seems it's actually slightly faster than the equivalent Perl for the same substitutions (which it should be since it does less).
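For the common case, the swap looks like this (the pattern and input file are made up, just to show the shape; both tools read stdin and write stdout):

$ perl -pe 's/foo/bar/g' < input.txt
$ sd foo bar < input.txt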
It is somewhat notable that rg and fd differ significantly here: rg is an almost perfect superset of grep in terms of features (some might be behind different flags, etc.), but fd explicitly has a narrower feature set than find.
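For example, the everyday recursive filename search looks roughly like this in each (the query string is arbitrary; note fd also skips hidden and gitignored files by default):

$ find . -iname '*sometest*'
$ fd sometest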
Yeah, this was very intentional. Because this is HN, I'll say some things that greps usually support that ripgrep doesn't:
1) greps support POSIX-compatible regexes, which come in two flavors: BREs and EREs. BREs permit back-references and have different escaping rules that tend to be convenient in some cases. For example, in BREs, '+' is just a literal plus sign but '\+' is a regex meta character that means "match one or more times." In EREs, the meanings are flipped. POSIX-compatible regexes also use "leftmost longest" matching, whereas ripgrep uses "leftmost first." For example, 'sam|samwise' will match 'sam' in 'samwise' under "leftmost first," but will match 'samwise' under "leftmost longest." (See the example just after this list.)
2) greps have POSIX locale support. ripgrep intentionally just has broad Unicode support and ignores POSIX locales completely.
3) ripgrep doesn't have "equivalence classes." For example, `echo 'pokémon' | grep 'pok[[=e=]]mon'` matches.
4) grep conforms to a standard (POSIX), whereas ripgrep doesn't. That means you can (in theory) have multiple distinct implementations that all behave the same. (Although, in practice, this is somewhat rare because some implementations add a lot of extra features and it's not always obvious when you use something that is beyond what POSIX itself strictly supports.)
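Here's the example from 1) as actual commands (assuming GNU grep, which implements the POSIX "leftmost longest" semantics):

$ echo samwise | grep -oE 'sam|samwise'
samwise
$ echo samwise | rg -o 'sam|samwise'
sam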
I think that probably covers it, although this is all off the cuff. I might be forgetting something. I suppose the main other things are some flag incompatibilities. For example, grep has '-h' as short for '--no-filename'. Also, since ripgrep does recursive search by default, there are no -r/-R flags for recursion. Instead, -r does replacements and -R is unused. -L is used for following symlinks (like 'find').
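So, e.g., a replacement with ripgrep's -r looks like this:

$ echo foo | rg foo -r bar
bar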
The specific reason is hard to articulate precisely, but it basically boils down to "difficult to implement." The UTS#18 spec is a tortured document. I think it's better that it exists than not, but if you look at its history, it's undergone quite a bit of evolution. For example, there used to be a "level 3" of UTS#18, but it was retracted: https://unicode.org/reports/tr18/#Tailored_Support
And to be clear, in order to implement the Turkish dotless 'i' stuff correctly, your implementation needs to have that "level 3" support for custom tailoring based on locale. So you could actually elevate your question to the Unicode consortium itself.
I'm not plugged into the Unicode consortium and its decision making process, but based on what I've read and my experience implementing regex engines, the answer to your question is reasonably simple: it is difficult to implement.
ripgrep doesn't even have "level 2" support in its regex engine, nevermind a retracted "level 3" support for custom tailoring. And indeed, most regex engines don't bother with level 2 either. Hell, many don't bother with level 1. The specific reasoning boils down to difficulty in the implementation.
OK OK, so what is this "difficulty"? The issue comes from how regex engines are implemented. And even that is hard to explain because regex engines are themselves split into two major ideas: unbounded backtracking regex engines that typically support oodles of features (think Perl and PCRE) and regex engines based on finite automata. (Hybrids exist too!) I personally don't know so much about the former, but know a lot about the latter. So that's what I'll speak to.
Before the era of Unicode, most things just assumed ASCII and everything was byte oriented and things were glorious. If you wanted to implement a DFA, its alphabet just consisted of the obvious: the 256 possible byte values. That means your transition table had states as rows and each possible byte value as columns. Depending on how big your state pointers are, even this is quite massive! (Assuming state pointers are the size of an actual pointer, then on x86_64 targets, just 10 states would use 10x256x8 = ~20KB of memory. Yikes.)
But once Unicode came along, your regex engine really wants to know about codepoints. For example, what does '[^a]' match? Does it match any byte except for 'a'? Well, that would be just horrendous on UTF-8 encoded text, because it might give you a match in the middle of a codepoint. No, '[^a]' wants to match "every codepoint except for 'a'."
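You can actually observe the byte-versus-codepoint difference with GNU grep by flipping locales (assuming a UTF-8 locale is available; in the C locale, matching is byte-at-a-time, so the two bytes of 'é' produce two matches):

$ echo 'é' | LC_ALL=C grep -o '[^a]' | wc -l
2
$ echo 'é' | LC_ALL=en_US.UTF-8 grep -o '[^a]' | wc -l
1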
So then you think: well, now your alphabet is just the set of all Unicode codepoints. Well, that's huge. What happens to your transition table size? It's intractable, so then you switch to a sparse representation, e.g., using a hashmap to map the current state and the current codepoint to the next state. Well... Owch. A hashmap lookup for every transition when previously it was just some simple arithmetic and a pointer dereference? You're looking at a huge slowdown. Too huge to be practical. So what do you do? Well, you build UTF-8 into your automaton itself. It makes the automaton bigger, but you retain your small alphabet size. Here, I'll show you. The first example is byte oriented while the second is Unicode aware:
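(I'm sketching the transitions by hand here as byte ranges, reconstructed from the standard UTF-8 encoding ranges, rather than pasting a real engine's state dump:)

  byte oriented '[^a]':
    0 -> MATCH on [\x00-\x60\x62-\xFF]

  UTF-8 aware '[^a]':
    0 -> MATCH on [\x00-\x60\x62-\x7F]
    0 -> 1 on [\xC2-\xDF]            1 -> MATCH on [\x80-\xBF]
    0 -> 2 on \xE0                   2 -> 1 on [\xA0-\xBF]
    0 -> 3 on [\xE1-\xEC\xEE-\xEF]   3 -> 1 on [\x80-\xBF]
    0 -> 4 on \xED                   4 -> 1 on [\x80-\x9F]
    0 -> 5 on \xF0                   5 -> 3 on [\x90-\xBF]
    0 -> 6 on [\xF1-\xF3]            6 -> 3 on [\x80-\xBF]
    0 -> 7 on \xF4                   7 -> 3 on [\x80-\x8F]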
This doesn't look like a huge increase in complexity, but that's only because '[^a]' is simple. Try using something like '\w' and you need hundreds of states.
But that's just codepoints. UTS#18 level 2 support requires "full" case folding, which includes the possibility of some codepoints mapping to multiple codepoints when doing caseless matching. For example, 'ß' should match 'SS', but the latter is two codepoints, not one. So that is considered part of "full" case folding. "simple" case folding, which is all that is required by UTS#18 level 1, limits itself to caseless matching for codepoints that are 1-to-1. That is, codepoints whose case folding maps to exactly one other codepoint. UTS#18 even talks about this[1], and that specifically, it is difficult for regex engines to support. Hell, it looks like even "full" case folding has been retracted from "level 2" support.[2]
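To see the practical effect: since most tools only do simple folding at best, a caseless search that needs full folding finds nothing (GNU grep shown; as far as I know, ripgrep's (?i) behaves the same way here):

$ echo 'STRASSE' | grep -i 'straße'
$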
The reason why "full" case folding is difficult is because regex engine designs are oriented around "codepoint" as the logical units on which to match. If "full" case folding were permitted, that would mean, for example, that '(?i)[^a]' would actually be able to match more than one codepoint. This turns out to be exceptionally difficult to implement, at least in finite automata based regex engines.
Now, I don't believe the Turkish dotless-i problem involves multiple codepoints, but it does require custom tailoring. And that means the regex engine would need to be parameterized over a locale. AFAIK, the only regex engines that even attempt this are POSIX and maybe ICU's regex engine. Otherwise, any custom tailoring that's needed is left up to the application.
The bottom line is that custom tailoring and "full" case matching don't tend to matter enough to be worth implementing correctly in most regex engines. Usually the application can work around it if they care enough. For example, the application could replace dotless-i/dotted-I with dotted-i/dotless-I before running a regex query.
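A crude sketch of that kind of application-level workaround in shell (mapping the Turkish variants to their ASCII counterparts before matching; the input word is just an example):

$ echo 'DİYARBAKIR' | sed 's/İ/I/g; s/ı/i/g' | grep -i diyarbakir
DIYARBAKIR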
The same thing applies for normalization.[3] Regex engines never (I'm not aware of any that do) take Unicode normal forms into account. Instead, the application needs to handle that sort of stuff. So nevermind Turkish special cases, you might not find a 'é' when you search for an 'é':
$ echo 'é' | rg 'é'
$ echo 'é' | grep 'é'
$
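Since it's hard to see which 'é' is which above, here's the same thing with the normal forms spelled out (bash's printf understands \u escapes; the first input is NFD, the second is NFC, and the pattern is NFC):

$ printf 'e\u0301\n' | rg "$(printf '\u00e9')"
$ printf '\u00e9\n' | rg "$(printf '\u00e9')"
é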
Unicode is hard. Tooling is littered with footguns. Sometimes you just have to work to find them. The Turkish dotless-i just happens to be a fan favorite example.
I use Frawk (https://github.com/ezrosent/frawk) a decent amount too! I downloaded it to do some parallel CSV processing and I've just kind of kept it ever since.
I had someone ask me (a self-described grep monkey) how I navigate grepping very long lines (minified JS, for example), to which I replied 'lol I just ignore them'. I'd love 'only show 200 chars if longer than 200 chars', but to my knowledge there's no easy way to do this with grep. I'd love to hear suggestions on how people navigate this.
It works well outside of git repos automatically. And can search across multiple git repos while respecting each repo's respective gitignores automatically. ripgrep also tends to be faster, although the absolute difference tends to be lower with 'git grep' than a simple 'grep -r', since 'git grep' does at least use parallelism.
There are other reasons to prefer one over the other, but they are somewhat more minor.
Here's one benchmark that shows a fairly substantial difference between ripgrep, git grep, and ugrep (with ag and plain GNU grep included for comparison):
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ git rev-parse HEAD
3b5e1590a26713a8c76896f0f1b99f52ec24e72f
$ git remote -v
origin git@github.com:torvalds/linux (fetch)
origin git@github.com:torvalds/linux (push)
$ time rg '\w{42}' | wc -l
1957843
real 0.706
user 7.110
sys 0.462
maxmem 300 MB
faults 0
$ time git grep -E '\w{42}' | wc -l
1957843
real 7.678
user 1:49.03
sys 0.729
maxmem 411 MB
faults 0
$ time ugrep -r --binary-files=without-match --ignore-files '\w{42}' | wc -l
1957841
real 10.570
user 46.980
sys 0.502
maxmem 344 MB
faults 0
$ time ag '\w{42}' | wc -l
1957806
real 3.423
user 8.288
sys 0.695
maxmem 79 MB
faults 0
$ time grep -E -r '\w{42}' ./ | wc -l
grep: ./.git/objects/pack/pack-c708bab866afaadf8b5da7b741e6759169a641b4.pack: binary file matches
grep: ./.git/index: binary file matches
1957843
real 47.441
user 47.137
sys 0.290
maxmem 4 MB
faults 0
The GNU grep comparison is somewhat unfair because it's searching a whole lot more than the other tools. (Although notice that there are no additional matches outside of binary files.) But it's a good baseline and also demonstrates the experience that a lot of folks have: most just tend to compare a "smarter" grep with the "obvious" grep invocation and see that it's an order of magnitude faster.
It's also interesting that all tools agree on match counts except for ugrep and ag. ag at least doesn't have any kind of Unicode support, so that probably explains that. (Don't have time to track down the discrepancy with ugrep to see who is to blame.)
And if you do want to search literally everything, ripgrep can do that too. Just add '-uuu':
$ time rg -uuu '\w{42}' | wc -l
1957845
real 1.288
user 8.048
sys 0.487
maxmem 277 MB
faults 0
And it still does it better than GNU grep. And yes, this is with Unicode support enabled. If you disable it, you get fewer matches and the search time improves. (GNU grep gets faster too.)
$ time rg -uuu '(?-u)\w{42}' | wc -l
1957810
real 0.235
user 1.662
sys 0.374
maxmem 173 MB
faults 0
$ time LC_ALL=C grep -E -r '\w{42}' ./ | wc -l
grep: ./.git/objects/pack/pack-c708bab866afaadf8b5da7b741e6759169a641b4.pack: binary file matches
grep: ./.git/index: binary file matches
1957808
real 2.636
user 2.362
sys 0.269
maxmem 4 MB
faults 0
Now, to be fair, '\w{42}' is a tricky regex. Searching something like a literal brings all tools down into a range where they are quite comparable:
$ time rg ZQZQZQZQZQ | wc -l
0
real 0.073
user 0.358
sys 0.364
maxmem 11 MB
faults 0
$ time git grep ZQZQZQZQZQ | wc -l
0
real 0.206
user 0.291
sys 1.014
maxmem 134 MB
faults 1
$ time ugrep -r --binary-files=without-match --ignore-files ZQZQZQZQZQ | wc -l
0
real 0.199
user 0.847
sys 0.743
maxmem 7 MB
faults 16
I realize this is beyond the scope of what you asked, but eh, I had fun.
How fast is magic wormhole? In my experience most of the new(er) file transfer apps based on WebRTC are just barely faster than Bluetooth and are unable to saturate the bandwidth. I am not sure if the bottleneck is in the WebRTC stack or whether there is something fundamentally wrong about the protocol itself.
Fd is featured on Julia Evans' recent "New(ish) command line tools"[1]
[1] https://jvns.ca/blog/2022/04/12/a-list-of-new-ish--command-l... https://news.ycombinator.com/item?id=31009313 (760 points, 37d ago, 244 comments)