
Fd and ripgrep/rg are the two "new" alternatives I use on a regular basis, and which are just huge improvements to life. Both of these find/search programs respect your .gitignore files, which helps enormously & makes searching my department's entire codebase really fast.

Fd is featured on Julia Evans' recent "New(ish) command line tools"[1]

[1] https://jvns.ca/blog/2022/04/12/a-list-of-new-ish--command-l... https://news.ycombinator.com/item?id=31009313 (760 points, 37d ago, 244 comments)



It's fd, ncdu and sd (sed alternative) for me.

https://github.com/chmln/sd

https://dev.yorhel.nl/ncdu


A while ago I came across this post: https://towardsdatascience.com/awesome-rust-powered-command-...

I’ve also been using bat and exa which are pretty good replacements for cat and ls, respectively.

https://github.com/sharkdp/bat

https://github.com/ogham/exa


scc is an insanely fast alternative to cloc: https://github.com/boyter/scc

nnn is also my go-to file tree navigation / file moving tool these days: https://github.com/jarun/nnn


For counting coding lines I use tokei and I like it: https://github.com/XAMPPRocky/tokei



No file previews yet? I'd stick with ranger or lf.


Reading man pages becomes much less painful with bat as the colorized pager.

https://github.com/sharkdp/bat#man


This snippet in my ~/.profile has been colorizing man pages for something like ten years now:

  #   Colorify man (changes default $MANPAGER (less) settings)
  export LESS_TERMCAP_mb='^[[01;31m'
  export LESS_TERMCAP_md='^[[01;31m'
  export LESS_TERMCAP_me='^[[0m'
  export LESS_TERMCAP_se='^[[0m'
  export LESS_TERMCAP_so='^[[01;44;33m'
  export LESS_TERMCAP_ue='^[[0m'
  export LESS_TERMCAP_us='^[[01;32m'


https://github.com/chmln/sd: "sd uses regex syntax that you already know from JavaScript and Python. Forget about dealing with quirks of sed or awk - get productive immediately."

It would be interesting to test the ~1.5GB of JSON the author uses for the benchmark against sed, but there are no details on how many files nor what those files contain.

When trying something relatively small and simple, sd appears to be slower than sed. It also appears to require more memory. Maybe others will have different results.

   sh # using dash not bash
   echo j > 1
   time sed s/j/k/ 1
   time -p sed s/j/k/ 1
   time sd j k 1
   time -p sd j k 1
Opposite problem from the sd author for me. For system tasks, I'm more familiar with faster sed and awk than with slower Python and JavaScript, so I wish that Python and JavaScript regex looked more like sed and awk, i.e., BRE and occasionally ERE. Someone in the NetBSD core group once wrote a find(1) alternative that had C-like syntax, similar to how awk uses a C-like syntax. Makes sense because C is the systems language for UNIX. Among other things, most of the system utilities are written in it. If the user knows C then she can read the system source and modify/repair the system where necessary, so it is beneficial to become familiar with it. Is anyone writing system utility alternatives in Rust that use a Rust-like syntax?




ncdu is amazing. I foolishly spent way too much time trying to massage du's output into something human-friendly.


sd is my favorite of the newish command line tools. It's super fast and I like the syntax a lot.


Agree, I've started replacing my `perl -pe s/.../.../g`s with `sd`. It seems it's actually slightly faster than the equivalent Perl for the same substitutions (which it should be since it does less).


It is somewhat notable that rg and fd differ significantly in that rg is an almost perfect superset of grep in terms of features (some might be behind different flags, etc.), but fd explicitly has a narrower feature set than find.


Yeah, this was very intentional. Because this is HN, I'll say some things that greps usually support that ripgrep doesn't:

1) greps support POSIX-compatible regexes, which come in two flavors: BREs and EREs. BREs permit back-references and have different escaping rules that tend to be convenient in some cases. For example, in BREs, '+' is just a literal plus-sign but '\+' is a regex meta character that means "match one or more times." In EREs, the meanings are flipped. POSIX compatible regexes also use "leftmost longest" matching, whereas ripgrep uses "leftmost first." For example, 'sam|samwise' will match 'sam' in 'samwise' under "leftmost first," but will match 'samwise' under "leftmost longest."
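A quick way to see the "leftmost first" behavior outside of ripgrep is Python's re module, which shares those alternation semantics (this is just an illustration, not ripgrep itself):

```python
import re

# "Leftmost first": the first alternative that matches wins,
# even when a later alternative would match a longer string.
print(re.search('sam|samwise', 'samwise').group())  # -> sam

# Reordering the alternation recovers the longer match:
print(re.search('samwise|sam', 'samwise').group())  # -> samwise

# A POSIX "leftmost longest" engine (e.g. grep -E) would match
# 'samwise' for the first pattern as well.
```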

2) greps have POSIX locale support. ripgrep intentionally just has broad Unicode support and ignores POSIX locales completely.

3) ripgrep doesn't have "equivalence classes." For example, `echo 'pokémon' | grep 'pok[[=e=]]mon'` matches.

4) grep conforms to a standard---POSIX---whereas ripgrep doesn't. That means you can (in theory) have multiple distinct implementations that all behave the same. (Although, in practice, this is somewhat rare because some implementations add a lot of extra features and it's not always obvious when you use something that is beyond what POSIX itself strictly supports.)

I think that probably covers it, although this is all off the cuff. I might be forgetting something. I suppose the main other things are some flag incompatibilities. For example, grep has '-h' as short for '--no-filename'. Also, since ripgrep does recursive search by default, there are no -r/-R flags. Instead, -r does replacements and -R is unused. -L is used for following symlinks (like 'find').


> 2) greps have POSIX locale support. ripgrep intentionally just has broad Unicode support and ignores POSIX locales completely.

Does this mean that there's no support for language specific case mappings (e.g. iİ and ıI in Turkic)?


Correct. ripgrep only has Level 1 UTS#18 support: https://unicode.org/reports/tr18/#Simple_Loose_Matches

This document outlines Unicode support more precisely for ripgrep's underlying regex engine: https://github.com/rust-lang/regex/blob/master/UNICODE.md


Thx! Is there a specific reason for the lack of that feature or was this just not implemented yet?


I've added this to the ripgrep Q&A discussion board: https://github.com/BurntSushi/ripgrep/discussions/2221 --- Thanks for the good question!

The specific reason is hard to articulate precisely, but it basically boils down to "difficult to implement." The UTS#18 spec is a tortured document. I think it's better that it exists than not, but if you look at its history, it's undergone quite a bit of evolution. For example, there used to be a "level 3" of UTS#18, but it was retracted: https://unicode.org/reports/tr18/#Tailored_Support

And to be clear, in order to implement the Turkish dotless 'i' stuff correctly, your implementation needs to have that "level 3" support for custom tailoring based on locale. So you could actually elevate your question to the Unicode consortium itself.
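Python makes it easy to see what the default, non-tailored Unicode case mappings do with these characters; this is just an illustration of untailored behavior, not ripgrep's code:

```python
# Default Unicode case mapping, with no Turkish locale tailoring:
assert 'i'.upper() == 'I'   # a Turkish locale would expect 'İ' (U+0130)
assert 'I'.lower() == 'i'   # a Turkish locale would expect 'ı' (U+0131)

# The Turkish letters themselves map back to ASCII asymmetrically:
assert 'ı'.upper() == 'I'
# U+0130 lowercases to *two* codepoints: 'i' + U+0307 COMBINING DOT ABOVE
assert len('İ'.lower()) == 2
```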

I'm not plugged into the Unicode consortium and its decision making process, but based on what I've read and my experience implementing regex engines, the answer to your question is reasonably simple: it is difficult to implement.

ripgrep doesn't even have "level 2" support in its regex engine, nevermind a retracted "level 3" support for custom tailoring. And indeed, most regex engines don't bother with level 2 either. Hell, many don't bother with level 1. The specific reasoning boils down to difficulty in the implementation.

OK OK, so what is this "difficulty"? The issue comes from how regex engines are implemented. And even that is hard to explain because regex engines are themselves split into two major ideas: unbounded backtracking regex engines that typically support oodles of features (think Perl and PCRE) and regex engines based on finite automata. (Hybrids exist too!) I personally don't know so much about the former, but know a lot about the latter. So that's what I'll speak to.

Before the era of Unicode, most things just assumed ASCII and everything was byte oriented and things were glorious. If you wanted to implement a DFA, its alphabet just consisted of the obvious: 256 byte values. That means your transition table had states as rows and each possible byte value as columns. Depending on how big your state pointers are, even this is quite massive! (Assuming state pointers are the size of an actual pointer, then on x86_64 targets, just 10 states would use 10x256x8=~20KB of memory. Yikes.)
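To make the arithmetic concrete (a back-of-the-envelope sketch, not actual regex-engine internals):

```python
# Dense DFA transition table: one row per state, one column per input
# byte, each cell holding a pointer-sized "next state" entry.
states = 10
alphabet = 256        # all possible byte values
ptr_size = 8          # bytes per pointer on x86_64
table_bytes = states * alphabet * ptr_size
print(table_bytes)    # 20480 bytes, i.e. ~20 KB for just 10 states
```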

But once Unicode came along, your regex engine really wants to know about codepoints. For example, what does '[^a]' match? Does it match any byte except for 'a'? Well, that would be just horrendous on UTF-8 encoded text, because it might give you a match in the middle of a codepoint. No, '[^a]' wants to match "every codepoint except for 'a'."

So then you think: well, now your alphabet is just the set of all Unicode codepoints. Well, that's huge. What happens to your transition table size? It's intractable, so then you switch to a sparse representation, e.g., using a hashmap to map the current state and the current codepoint to the next state. Well... Ouch. A hashmap lookup for every transition when previously it was just some simple arithmetic and a pointer dereference? You're looking at a huge slowdown. Too huge to be practical. So what do you do? Well, you build UTF-8 into your automaton itself. It makes the automaton bigger, but you retain your small alphabet size. Here, I'll show you. The first example is byte oriented while the second is Unicode aware:

    $ regex-cli debug nfa thompson -b '(?-u)[^a]'
    >000000: binary-union(2, 1)
     000001: \x00-\xFF => 0
    ^000002: capture(0) => 3
     000003: sparse(\x00-` => 4, b-\xFF => 4)
     000004: capture(1) => 5
     000005: MATCH(0)
    
    $ regex-cli debug nfa thompson -b '[^a]'
    >000000: binary-union(2, 1)
     000001: \x00-\xFF => 0
    ^000002: capture(0) => 10
     000003: \x80-\xBF => 11
     000004: \xA0-\xBF => 3
     000005: \x80-\xBF => 3
     000006: \x80-\x9F => 3
     000007: \x90-\xBF => 5
     000008: \x80-\xBF => 5
     000009: \x80-\x8F => 5
     000010: sparse(\x00-` => 11, b-\x7F => 11, \xC2-\xDF => 3, \xE0 => 4, \xE1-\xEC => 5, \xED => 6, \xEE-\xEF => 5, \xF0 => 7, \xF1-\xF3 => 8, \xF4 => 9)
     000011: capture(1) => 12
     000012: MATCH(0)
This doesn't look like a huge increase in complexity, but that's only because '[^a]' is simple. Try using something like '\w' and you need hundreds of states.

But that's just codepoints. UTS#18 level 2 support requires "full" case folding, which includes the possibility of some codepoints mapping to multiple codepoints when doing caseless matching. For example, 'ß' should match 'SS', but the latter is two codepoints, not one. So that is considered part of "full" case folding. "simple" case folding, which is all that is required by UTS#18 level 1, limits itself to caseless matching for codepoints that are 1-to-1. That is, codepoints whose case folding maps to exactly one other codepoint. UTS#18 even talks about this[1], and that specifically, it is difficult for regex engines to support. Hell, it looks like even "full" case folding has been retracted from "level 2" support.[2]

The reason why "full" case folding is difficult is because regex engine designs are oriented around "codepoint" as the logical units on which to match. If "full" case folding were permitted, that would mean, for example, that '(?i)[^a]' would actually be able to match more than one codepoint. This turns out to be exceptionally difficult to implement, at least in finite automata based regex engines.
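Python's str.casefold implements full case folding, which makes the one-to-many mapping easy to observe (a small illustration, separate from any regex engine):

```python
# "Full" case folding can map one codepoint to several:
assert 'ß'.casefold() == 'ss'   # U+00DF folds to two codepoints
assert 'ß'.upper() == 'SS'      # uppercasing does the same

# "Simple" (1-to-1) folds are the easy case for regex engines:
assert 'K'.casefold() == 'k'
assert 'Σ'.casefold() == 'σ'
```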

Now, I don't believe the Turkish dotless-i problem involves multiple codepoints, but it does require custom tailoring. And that means the regex engine would need to be parameterized over a locale. AFAIK, the only regex engines that even attempt this are POSIX and maybe ICU's regex engine. Otherwise, any custom tailoring that's needed is left up to the application.

The bottom line is that custom tailoring and "full" case matching don't tend to matter enough to be worth implementing correctly in most regex engines. Usually the application can work around it if they care enough. For example, the application could replace dotless-i/dotted-I with dotted-i/dotless-I before running a regex query.

The same thing applies for normalization.[3] Regex engines never (I'm not aware of any that do) take Unicode normal forms into account. Instead, the application needs to handle that sort of stuff. So nevermind Turkish special cases, you might not find a 'é' when you search for an 'é':

    $ echo 'é' | rg 'é'
    $ echo 'é' | grep 'é'
    $
Unicode is hard. Tooling is littered with footguns. Sometimes you just have to work to find them. The Turkish dotless-i just happens to be a fan favorite example.
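The same normalization pitfall can be demonstrated in Python with unicodedata (just an illustration of the issue, outside any regex engine):

```python
import unicodedata

composed = '\u00e9'      # 'é' as one codepoint (NFC form)
decomposed = 'e\u0301'   # 'e' + combining acute accent (NFD form)

# They render identically but differ codepoint-by-codepoint, so a
# codepoint-oriented search for one form will miss the other:
assert composed != decomposed

# Normalizing both sides to the same form fixes the comparison:
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
```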

[1]: https://unicode.org/reports/tr18/#Simple_Loose_Matches

[2]: https://www.unicode.org/reports/tr18/tr18-19.html#Default_Lo...

[3]: https://unicode.org/reports/tr18/#Canonical_Equivalents


Is there a benefit to respecting locale and not just using Unicode?


Probably only if you are on an old legacy system that is using an unusual encoding.


I use Frawk (https://github.com/ezrosent/frawk) a decent amount too! I downloaded it to do some parallel CSV processing and i've just kind of kept it ever since.


I had someone ask me (a self described grep monkey) how I navigate grepping very long lines (minified js for example) to which I replied ‘lol I just ignore them’. I’d love ‘only select 200 chars if longer than 200 chars’, but to my knowledge there’s no easy way to do this with grep. I’d love to hear suggestions on how people navigate this.


My go-to is using -o and pre/appending .{100} to the pattern to capture however much context I need


Pipe to cut -c 1-200?


ripgrep has the -M option that will help here.


I tend to use `git grep` for that. Is ripgrep better in some way?


It works well outside of git repos automatically. And can search across multiple git repos while respecting each repo's respective gitignores automatically. ripgrep also tends to be faster, although the absolute difference tends to be lower with 'git grep' than a simple 'grep -r', since 'git grep' does at least use parallelism.

There are other reasons to prefer one over the other, but they are somewhat more minor.

Here's one benchmark that shows a fairly substantial difference between ripgrep, git grep and ugrep (with ag and plain GNU grep included for reference):

    $ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
    $ git rev-parse HEAD
    3b5e1590a26713a8c76896f0f1b99f52ec24e72f
    $ git remote -v
    origin  git@github.com:torvalds/linux (fetch)
    origin  git@github.com:torvalds/linux (push)

    $ time rg '\w{42}' | wc -l
    1957843

    real    0.706
    user    7.110
    sys     0.462
    maxmem  300 MB
    faults  0

    $ time git grep -E '\w{42}' | wc -l
    1957843

    real    7.678
    user    1:49.03
    sys     0.729
    maxmem  411 MB
    faults  0

    $ time ugrep -r --binary-files=without-match --ignore-files '\w{42}' | wc -l
    1957841

    real    10.570
    user    46.980
    sys     0.502
    maxmem  344 MB
    faults  0

    $ time ag '\w{42}' | wc -l
    1957806

    real    3.423
    user    8.288
    sys     0.695
    maxmem  79 MB
    faults  0

    $ time grep -E -r '\w{42}' ./ | wc -l
    grep: ./.git/objects/pack/pack-c708bab866afaadf8b5da7b741e6759169a641b4.pack: binary file matches
    grep: ./.git/index: binary file matches
    1957843

    real    47.441
    user    47.137
    sys     0.290
    maxmem  4 MB
    faults  0
The GNU grep comparison is somewhat unfair because it's searching a whole lot more than the other tools. (Although notice that there are no additional matches outside of binary files.) But it's a good baseline and also demonstrates the experience that a lot of folks have: most just tend to compare a "smarter" grep with the "obvious" grep invocation and see that it's an order of magnitude faster.

It's also interesting that all tools agree on match counts except for ugrep and ag. ag at least doesn't have any kind of Unicode support, so that probably explains that. (Don't have time to track down the discrepancy with ugrep to see who is to blame.)

And if you do want to search literally everything, ripgrep can do that too. Just add '-uuu':

    $ time rg -uuu '\w{42}' | wc -l
    1957845

    real    1.288
    user    8.048
    sys     0.487
    maxmem  277 MB
    faults  0
And it still does it better than GNU grep. And yes, this is with Unicode support enabled. If you disable it, you get fewer matches and the search time improves. (GNU grep gets faster too.)

    $ time rg -uuu '(?-u)\w{42}' | wc -l
    1957810

    real    0.235
    user    1.662
    sys     0.374
    maxmem  173 MB
    faults  0

    $ time LC_ALL=C grep -E -r '\w{42}' ./ | wc -l
    grep: ./.git/objects/pack/pack-c708bab866afaadf8b5da7b741e6759169a641b4.pack: binary file matches
    grep: ./.git/index: binary file matches
    1957808

    real    2.636
    user    2.362
    sys     0.269
    maxmem  4 MB
    faults  0
Now, to be fair, '\w{42}' is a tricky regex. Searching something like a literal brings all tools down into a range where they are quite comparable:

    $ time rg ZQZQZQZQZQ | wc -l
    0

    real    0.073
    user    0.358
    sys     0.364
    maxmem  11 MB
    faults  0
    $ time git grep ZQZQZQZQZQ | wc -l
    0

    real    0.206
    user    0.291
    sys     1.014
    maxmem  134 MB
    faults  1
    $ time ugrep -r --binary-files=without-match --ignore-files ZQZQZQZQZQ | wc -l
    0

    real    0.199
    user    0.847
    sys     0.743
    maxmem  7 MB
    faults  16
I realize this is beyond the scope of what you asked, but eh, I had fun.


What version of time are you using? I don't recognize the output



How fast is magic wormhole? In my experience most of the new(er) file transfer apps based on WebRTC are just barely faster than Bluetooth and are unable to saturate the bandwidth. I am not sure if the bottleneck is in the WebRTC stack or whether there is something fundamentally wrong about the protocol itself.


All magic wormhole is doing is agreeing on a key, and then moving the encrypted data over TCP between sender and recipient.

So for a non-trivial file this is in principle subject to the same performance considerations as any other file transfer over TCP.

For a very tiny file, you'll be dominated by the overhead of the setup.


Why use ripgrep over silver searcher?


This could have changed in the last few years, but I think rg does tend to (sometimes significantly) outperform ag, see the author's benchmarks [0].

0: https://blog.burntsushi.net/ripgrep/#code-search-benchmarks


>much better single file performance, better large-repo performance and real Unicode support that doesn't slow way down

By ripgrep's dev (https://news.ycombinator.com/item?id=12567484).


The Silver Searcher appears to be if not dead then certainly resting.


fzf too


shout out to 'ack' as well


If you're still using ripgrep, check out ugrep.

Very fast, TUI, fuzzy matching, and actively maintained.


ripgrep is not maintained anymore? that was fast...


I'm the maintainer of ripgrep and it is actively maintained.


Well that was a quick rollercoaster of emotions. Thanks for all that you do.


ripgrep isn't maintained now? That was fast :)

Or is it just done :)


`rg` is maintained. Last commit was 9 days ago by the creator himself.



