Nifty! A step up over just using time.
Unfortunately, timing things in general isn't going to be a very effective benchmark.
Without understanding what a program is doing, you don't understand what is impacting your results, and you have no real knowledge of how things are going to differ when you use them in the "real world". Is one process faster when single-threaded or at a low core count, while another is massively parallel and loses out until it's scaled higher? Are your commands testing the thing you think they're testing? What is your limiting factor? If you don't know why the results are what they are, and why they aren't better, you don't have a good benchmark.
Yes, having a "cold" or a "warm" disk cache makes a massive difference for I/O-heavy programs. For one of my other programs, I differentiate between "cold-cache" and "warm-cache" benchmarks: https://github.com/sharkdp/fd-benchmarks
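If you want to reproduce the cold-cache case yourself, hyperfine's --prepare option can flush the page cache before each timing run. A minimal sketch, assuming Linux (the fd invocation is just a placeholder for whatever you're measuring):

  # warm-cache benchmark: just run the command repeatedly
  hyperfine 'fd -e jpg'

  # cold-cache benchmark: flush the page cache before every timing run
  # (Linux-only; the tee into drop_caches needs root)
  hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' 'fd -e jpg'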
The page cache exists to speed up access and it's transparent to processes.
If another user or process tries to access your SSH files directly, it'll go through the traditional file-permission checks to determine whether it has access. If the disk block is in the page cache AND access to that inode is allowed, then the kernel will retrieve the page from the cache and give it to the process.
To read the whole page cache, you'd need code sitting in kernel space. If something manages to load itself in the kernel space (e.g. kernel module), you have bigger problems to worry about.
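An easy way to see this in action (illustrative only; the path is just an example of a root-owned file):

  # as root: read the file once so its blocks land in the page cache
  sudo cat /root/.ssh/id_rsa > /dev/null

  # as an unprivileged user: the inode permission check still applies,
  # even though the data is already sitting in the shared page cache
  cat /root/.ssh/id_rsa    # -> "Permission denied"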
When given multiple commands, can it interleave executions instead of benchmarking them one after the other?
This would be useful when comparing two similar commands, as interleaving them makes it less likely that e.g. a load spike will unfavorably affect only one of them, or that thermal throttling will negatively affect only the last command.
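For comparison, the kind of interleaving I mean can be done by hand with a shell loop (rough sketch; cmd_a and cmd_b are placeholders, and GNU time is assumed for the -f flag):

  # run A and B back to back in each iteration, so system-wide slowdowns
  # (load spikes, thermal throttling) hit both commands roughly equally
  for i in $(seq 1 10); do
    /usr/bin/time -f 'A: %e s' cmd_a
    /usr/bin/time -f 'B: %e s' cmd_b
  done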
Whenever you run 'time <command>', you could consider running 'hyperfine <command>' to get an answer that has been averaged over multiple runs.
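For example (du and the path are only stand-ins for whatever you are actually timing):

  # one noisy measurement
  time du -sh ~/Downloads

  # many runs, reported as mean +/- standard deviation
  hyperfine 'du -sh ~/Downloads'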
I personally use command-line benchmarking to compare different tools. You might want to compare grep, ack, ag and ripgrep. I currently use it to profile my find-alternative fd and to compare it with find itself (https://github.com/sharkdp/fd-benchmarks).
You could also use it to find an optimal parameter setting for a command-line tool (make -j2 vs. make -j8).
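Concrete sketches of both use cases (the patterns, paths and build targets are made up):

  # comparing similar tools on the same task
  hyperfine "grep -r 'PATTERN' ." "ag 'PATTERN'" "rg 'PATTERN'"
  hyperfine 'find . -iname "*.jpg"' 'fd -e jpg'

  # finding a good parameter value; --prepare resets state before each timing run
  hyperfine --prepare 'make clean' 'make -j2' 'make -j8'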
The regex equivalent for "anything" is usually ".*", where the dot stands for "any character" and the asterisk for "any number of times (including zero)". fd does not necessarily pattern-match at the beginning of the file name, so there is no need for a ".*" at the beginning or at the end of the pattern. Your example would be:
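As a general illustration with made-up patterns (not necessarily your exact case):

  # shell glob      equivalent regex     fd pattern (matched anywhere in the name)
  #   *report*        .*report.*           report
  fd report        # any file name containing "report"
  fd '\.jpg$'      # file names ending in ".jpg"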
http://www.brendangregg.com/activebenchmarking.html / http://www.brendangregg.com/ActiveBenchmarking/bonnie++.html