I would also add at least "sane default options", "continues downloads" and "retries on error" to the Wget column. I recently had to write a script that downloads a very large file over a somewhat unreliable connection. The common wisdom among the engineers is that you need to use Wget for this job. I tried using curl but out of the box it could not resume or retry the download. I would have to study the manual and specify multiple options with arguments for this behaviour that really sounds something that should just work out of the box.
Wget needed one option to enable resuming in all conditions, even after a crash: --continue
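In practice that is the whole invocation (the URL here is just a placeholder):
wget --continue https://example.com/big-file.iso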
Wget's introduction in the manual page also states: "Wget has been designed for robustness over slow or unstable network connections; if a download fails due to a network problem, it will keep retrying until the whole file has been retrieved."
I was sold. Even if I somehow managed to get all the curl options for reliable behaviour over a poor connection right, Wget seems to have them on by default, and the sane defaults make me believe it will also do the expected correct thing in those error scenarios that I did not think to test myself. Or - if the HTTP protocol ever receives updates, newer versions of Wget will support those by default too, but curl will require new switches to enable the enhanced behaviour - something I cannot add after the product has shipped.
To me it often seems like curl is a good and extremely versatile low-level tool, and the CLI reflects this. But for my everyday work, I prefer to use Wget as it seems to work much better out of the box. And the manual page is much faster to navigate - probably in part because it just doesn't support all the obscure protocols called out on this page.
Well, the point of the article is that they are not competitors and are used differently. For me, 99% of the time I'm curl-ing some API and I definitely don't want to save the result to disk (but often want to pipe it to grep/jq).
> I tried using curl but out of the box it could not resume or retry the download.
Maybe I'm misunderstanding, but curl has exactly that feature, it's the `-C` flag. If you want retries, there's `--retry`. I find curl's defaults pretty sane, personally; I wouldn't want either of those by default for a tool like curl.
You recall incorrectly. curl's -C flag does not work as-is. You must specify the offset from where it should continue. Why doesn't it take the resumed file's existing length as the guess by default? What else could the user want outside of some very exotic cases?
Yes, I want retries. They should be the default for a user-facing tool. Try searching curl's manual page for "retry". There are no fewer than 5 different interdependent flags for specifying retry behaviour: --retry-all-errors, --retry-connrefused, --retry-delay <seconds>, --retry-max-time <seconds> and --retry <num>.
If I "just" want it to retry, surely there's a boolean flag like "--retry" that enables sane defaults? Nope! --retry takes a mandatory integer argument of maximum number of retries. Surely I can set it to zero for a sane default? Nope again: "Setting the number to 0 makes curl do no retries."
curl is a good tool if you know it through & through and want exact control over the transfer behaviour. I don't think it's a good tool if you just want to fetch a file and expect your tool of choice to apply some sane behaviours to that end - behaviours that would probably make sense for a human rather than for an application using a library.
> You recall incorrectly. curl's -C flag does not work as-is. You must specify the offset from where it should continue. Why doesn't it take the resumed file's existing length as the guess by default? What else could the user want outside of some very exotic cases?
> Use "-C -" to tell curl to automatically find out where/how to resume the transfer. It then uses the given output/input files to figure that out.
I just tried it, works perfectly. I don't really see a difference between writing `--continue` in wget and `-C -` in curl. And the use case for specifying it is not so exotic, you might want to do range requests for all sorts of reasons.
Look, it's fine if you prefer wget's command line: I don't think retrying a request is a reasonable default for a tool like curl, but reasonable people can disagree on that for sure. But curl is perfectly capable of resuming downloads automatically, you're just (very arrogantly) wrong on that one.
No, `-C -` is not a flag. It is specifying the `-C` argument with obscure special value of `-`, which causes curl to determine the offset to continue from the output file length. This might be obvious to you if you are well-versed in curl command line, but it's by no means expected or obvious like a simple flag.
> But curl is perfectly capable of resuming downloads automatically, you're just (very arrogantly) wrong on that one.
I've never claimed it doesn't. I've only demonstrated that the default options don't do it and enabling the behaviour is more difficult than it maybe should be for a simple tool. I fail to see the arrogance.
> You must specify the offset from where it should continue
No, you mustn't, you can specify - and it does exactly what you want. The docs are very clear and even provide examples. At some point you should stop blaming curl for your inability to read a man page and admit that you were simply mistaken.
You still fail to understand that curl's -C does not behave as a simple flag but as a switch with a mandatory argument. And there's a magic special value for that argument that finally enables the expected behaviour. It's unintuitive, hard to remember and bad for usability. While I agree that curl is powerful, I will not concede that its CLI is user friendly.
It's not hard to remember if you're familiar with unix tools and syntax. But no one is demanding you concede anything. The point of the conversation is to explain the difference between how you expect a command line tool to act and how most people expect a command line tool to act. If I try to `cp src dest` and it fails, I don't want the tool to guess how to fix the issue; that's not its job. Ditto for `dd`: it shouldn't try to guess offsets. curl exists as a knife; you're expecting it to do the job of a food processor and blender combination. You're not wrong to expect a tool to behave that way, you're wrong to expect curl to behave that way. And no one is asking you to concede that a knife is easier than a blender when you want a blender. Everyone is pointing out that it's not a blender, it's a knife, but you can still do everything you want to do.
> It is specifying the `-C` argument with obscure special value of `-`
It is not an “obscure special value”. Not only is `-C -` (or `--continue-at -` for the long form) well documented in the correct place in the manual, `-` is a common value in command-line tools (e.g. when specifying that a tool’s input will be STDIN instead of a file).
What is a sane default for retries? Is it to loop indefinitely? Should it retry against the same TCP connection or establish a new one? To the same IP it picked the first time, or a different one, or re-resolve the DNS entirely? Against the same resolver?
IMO there's too much complexity for 'sane defaults' to not just be 'surprising behavior' for someone else's use case.
Not only are you wrong, I had the opposite experience - wget was all well and good for downloading pages, but it fell over for making e.g. authenticated POST requests. I think that might be possible now/have gotten better, but wget has always been too restrained for me.
I am in agreement that wget has "sane" defaults i.e. it acts like a bot or web crawler, or basic browser. Curl has always been easier to get things done with, though. At least in the land of http requests.
Strong agree. The only misbehaviour I believe curl displays out of the box is globbing, which has burned me enough times that I’ve come to believe it would’ve been better disabled by default and enabled with -g instead of vice versa.
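A sketch of how it bites (the URL is made up): without -g, curl tries to interpret the brackets as a glob range and bails out before making any request, while -g sends them literally.
curl 'https://example.com/report[2024].pdf'      # refused with a URL globbing error
curl -g 'https://example.com/report[2024].pdf'   # brackets passed through untouched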
Also add -i which lets wget read URLs from a file. In particular wget -i - which makes it read from standard input, and is very useful in pipelines.
curl cannot, AFAIK, do this. People usually suggest using xargs, which is a mediocre substitute because it waits for all the URLs to arrive before invoking curl, giving up any chance at parallelism between the command generating the URLs and the one downloading them.
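For instance, a pipeline along these lines streams URLs straight into wget as they are produced (the URLs are placeholders):
printf '%s\n' https://example.com/a.txt https://example.com/b.txt | wget -i -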
xargs doesn't have to wait, you can specify the number of items to include in a single sub-command and it'll batch things as they come in. For instance:
ds@swann3:~# (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs -L5 echo
1
2
3
4
5
1 2 3 4 5
6
7
8
9
10
6 7 8 9 10
11
12
[... and so on ...]
If the xargs call uses -I then --max-lines=1 is implied anyway.
If you replace echo with something that sleeps you'll see that the pipe doesn't stall waiting on xargs so the process producing the list can keep pushing new items to it as they are found:
ds@swann3:~# (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs --max-lines=5 ./echosleepecho
1
2
3
4
5
starting 1 2 3 4 5
6
7
8
9
10
11
12
13
14
done 1 2 3 4 5
starting 6 7 8 9 10
15
16
17
18
19
[... and so on until ...]
98
99
100
done 46 47 48 49 50
sleeping for 51 52 53 54 55
done 51 52 53 54 55
sleeping for 56 57 58 59 60
[... and so on until xarg's stdin is exhausted]
And you can stop the calls made by xargs being sequential too for more parallelism with the --max-procs option (or use parallel instead of xargs):
ds@swann3:~# (for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs --max-lines=3 --max-procs=10 ./echosleepecho
1
2
3
sleeping for 1 2 3
4
5
6
sleeping for 4 5 6
7
8
9
sleeping for 7 8 9
10
11
12
sleeping for 10 11 12
done 1 2 3
13
14
15
sleeping for 13 14 15
done 4 5 6
16
[... and so on ...]
(I adjusted max-lines in that last example because my current timings made things line up in a manner that made the effect less obvious; adjusting the timings would have been equally valid. In a less artificial example, like calling curl to get many resources, timings will of course be less regular - perhaps these examples could be improved by randomising the sleeps.)
I'm not sure what you would do about error handling in all this though, more experimentation necessary there before I'd ever do this in production!
Reply to self to add a note of something that coincidentally came up elsewhere¹ and is relevant to the above: of course, xargs being able to push existing things forward while the list of actions is still being produced relies on it getting a stream of the list instead of the whole thing in one block. If your earlier stages cause a pipeline stall it can't help you.
For an artificial example, change
(for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | xargs -L5 echo
to
(for x in {1..100}; do sleep 0.1s; echo $x >&2; echo $x; done) | sort | xargs -L5 echo
The sort command will, by necessity, absorb the list as it is produced and spit it all out at once at the end. xargs can still use multiple processes (if max-procs is used) to make use of concurrency to speed the work, but can't get started until the full list is produced and sorted.
----
[1] An unnecessary sort in an ETL process causing the overall wall-clock time to increase significantly
Valid criticism, ish, but that wasn't in what was previously asked for so well done on being like my day-job clients and failing to specify the problem completely :)
You can specify multiple URLs on the same command in curl so using xargs in this way would do what you ask to an extent (the connection would be dropped and renegotiated at least between each batch) as long as you don't use any options that imply --max-lines=1.
With the --max-procs option you could be requesting multiple files at once which may improve performance over wget -i – though obviously take care doing this against a single site (with wget -i too for that matter) as this can be rather unfriendly (if requesting from multiple hosts this is moot, as is the multiple files-from-one-connection point).
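Roughly something like this, where generate-urls is just a stand-in for whatever produces the list:
generate-urls | xargs --max-lines=20 --max-procs=4 curl -sS --remote-name-all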
Both tools have their use-cases. I think with the advent of LLMs like ChatGPT it has also become a lot easier to get the proper command line incantation for whatever tool you're using. Even if you've read the manual previously, it's easy to forget the exact flags you want to use, and validating the generated command line is usually less effort than having to build it up yourself from scratch by reading the manual.
With the modern web, sometimes it's easier to use a tool like Puppeteer from a custom script. Especially if the sites you're interacting with are using a lot of JS.
You don't need much of anything, it's just another tool to help people figure stuff out. If you've already invested the time in reading through the manuals for these tools then maybe you don't get much out of something like ChatGPT, but consider that there's thousands of new people entering the industry every year and the number of tools which they're expected to learn to use has increased over time.
It's fine when you've read the man page at least once. If you just cowboy your way through every task, you'll never even be aware of what your tools can do. Which results in comments we've all seen: "I can't believe curl/Firefox/vim/readline/whatever can do something like that!" about something so trivial that every good poweruser/sysadmin has known it for decades (because it's in the man page).
Sure, but that's knowledge you pick up over years. You can't reasonably expect people to frontload all of it on the first pass with every single tool they interact with, especially when starting off.
I've read through most of the man pages of every tool I use at least once, but it has taken me years, and I've done it incrementally.
I just memorise the man page. I can't believe etc etc etc
(Seriously, there's a million nix flags, and only so many brain cells. ChatGPT's better than Google for simple "what's the magic incantation?" searches, and laziness is a virtue. If you don't want to be lazy that's fine, but I think you're missing out).
I agree with this sentiment. But I also have personal anecdotes that make me cautious about it. When I was a child I memorized dozens of phone numbers of people I called a couple of times a month. Now my phone memorizes any phone number for me. We used to memorize poems and such. I can't give you any scientific reason why it is good to have memorized the Gettysburg Address. But I feel intuitively that there is some benefit to exercising your brain, similar to the intuition I have about exercising muscles.
If a person thinks they benefit from 100 situps a day, I'm not going to disagree. And if they think there is some benefit in reading man pages, well, thanks to those who take the time to write all that documentation.
I like tldr for quick examples. If it doesn't have what I need, then I fall back to phind or some other llm. Nice thing about tldr is there's an offline version.
Another pet peeve of mine is that curl's URL parser is a lot more strict compared to wget.
For example:
$ curl -sSLOJ 'example.com/file name.txt'
curl: (3) URL using bad/illegal format or missing URL
$ curl -sSLOJ 'example.com/file%20name.txt'
$ ls
file%20name.txt
On the other hand, wget (without any additional flags) will produce a file called "file name.txt" for both URLs. Well, technically you also need to add a --content-on-error flag to wget because this example URL 404s.
Retry with `wget` was one of the most incredible Linux distro included features when I started running it at home. Pretty crucial thing on 56K dialup, and it worked better than the Windows tools I was aware of at the time.
I think our setup was very particular to the UK. We didn't (and still don't?) have free local calls like the US, so we paid per minute for ISPs.
Almost all ISPs went through a scheme setup by British Telecom - you could either have free internet but you paid for your calls, or you could pay for your internet, and have access via a freefone number - so effectively flat-rate.
But the flat-rate option disconnected after two hours, on the dot. Which was hugely frustrating because we had a voicemail variant that was hosted by the telco, and let you know you had messages waiting by pulsing the dialtone. And my modem did not recognise the pulsed dialtone as a valid dialtone, and refused to connect until we called the number and marked them read.
Which led to one of my most UK-centric retro stories. I tried to connect to the internet, and it refused to dial. I blew away my wvdial config, and it refused to dial. I blew away my ppp config, and it refused to dial. I grepped / for the error message and it didn't exist. I ended up blowing away my OS (and accidentally installing onto the wrong drive, and blowing away everything non-OS too), and it still wouldn't dial.
So I dragged the modem & extension cord to my mother's PC, and shot off a mail to my preferred mailing list (one hosted by John @ linuxemporium, my preferred source of mail-order distros), and swiftly received the response that in order to be certified by BT to operate on their network, one of the rules equipment had to obey was to refuse to redial the same number x many times. And that all I needed to do was power-cycle the modem. Which I'd done by dragging it upstairs to my mother's PC. And my own machine had been wiped twice over needlessly.
Aside, there was a lady named Helen on that mailing list who knew everything about everything, and is everything I aspire to be today. She had opinions on which harddrives best survived salt/sea air, and why they weren't deathstars. Just an incredible amount of lived experience. I miss mailing lists.
Interesting information w.r.t. UK telephone practices! In the USA, it was usual to get free local calls, so ISPs would set up modem banks to try and get maximum coverage for a given NPA-NXX range. There was some arrangement with CLECs and ILECs where it was extremely profitable for IIRC CLECs to pass data-only calls through to ILECs, so one or the other was subsidizing a lot of the early dialin ISPs, to the point of buying them modem banks and whatnot!
That was probably one of the biggest death-bringers for the BBS era, no more long distance calls to get to what you wanted.
> Which was hugely frustrating because we had a voicemail variant that was hosted by the telco, and let you know you had messages waiting by pulsing the dialtone. And my modem did not recognise the pulsed dialtone as a valid dialtone, and refused to connect until we called the number and marked them read.
Yeah, some VM providers in the USA did that too, and it similarly confused modems. It's called "stutter dialtone" here, and the usual fix was to put some delay elements in the dial string, which were commas for Hayes command set modems.
> and why they weren't deathstars
They sure did earn that name! I was so hesitant to switch to HGST for ZFS pools due to my 90s/2000s deathstar experiences. Wouldn't run them in production for a while, of course now that I'm over it and trust them as well as any other enterprise brand, they'll screw it up again!
I only used aria2c once a long time ago, but it was awesome for huge files. As I recollect one thing it does is download different sections of a file in parallel over multiple connections, which speeds up downloads from servers that throttle per connection.
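If I remember the flags right, it's roughly this (URL and counts are illustrative):
aria2c --split=8 --max-connection-per-server=8 https://example.com/big-file.iso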
One point where I would argue wget doesn't have a sensible default is on filenames - it really should make --content-disposition the default, at least for single file downloads. Otherwise it will often use the wrong name if there is a redirect or similar in the chain, which seems increasingly common.
I'll argue not trusting the server to dictate the saved file name is the only correct default behavior, and taking the filename from the user input (URL) is reasonable.
A server configured with a docroot to serve a static site will map requested URLs to hierarchical filesystem paths, but that isn't the only possibility; it's a common but quite loose coupling of ideas.
But the filename directive of the content-disposition response header is entirely coupled to the idea of a filename. Therefore, it ought to take precedence.
Fair enough, but do you have that fear when using a browser? I guess a Downloads folder is lower stakes than whatever other working directory you're wgetting from, though.
curl is an excellently powerful library and utility but I agree that wget has better defaults. I am almost certain to get the behavior that I want by just throwing a URL at wget, including retrying from the point where it had an issue. I actually ran into a case where our corporate firewall was a little too eager to block a download being performed by the Visual Studio installer because of a signature match partway through a specific download. All I had to do to grab the file was have wget download it. No magic incantations, it was just smart enough to not start the download from the beginning after being cut off, and since it started midway it no longer tripped the signature match rule.
Indeed - and curl requires `-L` to follow redirects whereas wget just does that by default too. So for ad-hoc CLI use, I turn to wget rather than remember all the curl options required.
Also -OJ : with -O you get the name derived from the URL (the initial one, I think, even if redirects are being followed), with -OJ you get the one from the Content-Disposition header or the final URL, the way browsers do it. Of course, plain -O is safer. (For parity with Wget, you might also want to add -R to set the downloaded file’s mtime according to the Last-Modified header.)
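Put together, something like this (URL purely illustrative) should save under the server-suggested name and keep the server's mtime:
curl -fLOJ -R https://example.com/download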
Wget will give you an equivalent with the --content-disposition flag. I would like for it to be the default, but it would likely break backwards compatibility with some scripts that expect a different output filename.
True. AFAIU the reason is that Curl wants to make a single request (modulo redirects), whereas making -OJC- work would require two: issue a HEAD to receive the Content-Disposition header and learn the file name, then look at that file and see how long it is, then issue a GET with a Range header to request the suffix you still need to download. With other methods I don't think you could make this work at all. I don't know if Stenberg is opposed to a GET-specific solution; perhaps that could be a fun project. (Although I've encountered noncompliant servers that couldn't handle HEADs.)
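A rough sketch of that two-request dance with today's curl, assuming the server answers HEAD and you extract the name by eye or with a bit of scripting (URL and filename are placeholders):
curl -sI https://example.com/download                       # HEAD: read Content-Disposition to learn the name
curl -C - -o proper-name.bin https://example.com/download   # resume based on the local file's current length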
I'm well aware of curl's -O (long form: --remote-name), but it has an unfortunate clash with an option of the same name from wget (long form: --output-document). These options have closely related yet very different meanings. Using these options without looking at the manpage fills me with a sense of dread that I prefer to avoid.
--remote-name-all
This option changes the default action for all given URLs to be dealt with as if -O, --remote-name were used for each one. So if you want to disable that for a specific URL after --remote-name-all has been used, you must use "-o -" or --no-remote-name.
I much prefer the fact that curl doesn't do this by default (but has the option), it much closer matches the behavior of most unix-y tools. Makes it so much easier to put it into pipelines.
Well, depends on the use case. Sometimes you want the whole URL, like when I want to mirror a site and it has stuff like foo.html?page=1 foo.html?page=2 ...
wget does have options to use the name proposed by the server, and so another option to remove the query arguments would be useful, and in line with those.
A new option to strip query parameters from the output filename would be interesting. But it's not so simple. When combined with recursion, one will often see a lot of pages with the same name but different query parameters. How should they be stored on disk? There's a couple of different issues I can think of.
However, if the potential issues can be resolved with sane defaults, I think this would be a great new switch to add.
Yes, exactly. I think that the option would have to be ignored when doing recursion. Or alternatively use the .1 .2 ... method, like in all the other cases where a file of that name already exists.
It's a simple solution to give the file the right extension, and preserving query parameters can be the right thing to do if you hit the same path repeatedly e.g. for pagination.
I’ve always seen this as a misfeature of wget, on the general principle that command-line utilities should write their principal result to stdout unless otherwise instructed.
that "killer feature" for cat would be turn `cat file.html` into `cat file.html > file.html` which means if you actually wanted to cat instead of cp you'd also need `cat file.html -o -` kinda glad curl doesn't have that killer feature.
Daniel Stenberg is among that rare breed of developers who put their heart and soul into their creation - a fading trait in the modern world of big tech, where developers toil in the shadows as replaceable cogs of a money-making machine.
It's as if he treats curl as his mark on the world of IT.
Free software is full of people like this. That's why I use free software even if it's technically inferior. Of course, a lot of it is actually technically superior these days which makes it an even easier choice.
Maybe if you work for a company you don't put your heart into your creation, but if you have a popular personal project that brings you a lot of cash I'm sure you'll be as dedicated as him.
> I work for wolfSSL doing commercial curl support. If you need help to fix curl problems, fix your app's use of libcurl, add features to curl, fix curl bugs, optimize your curl use or libcurl education for your developers... Then I'm your man. Contact us!
From Wikipedia’s wolfSSL page²:
> In February 2019, Daniel Stenberg, the creator of cURL, joined the wolfSSL project.
Given that, saying cURL is “a popular personal project that brings [Daniel] a lot of cash” seems like a bit of a stretch.
Curl consistently has more options and flexibility, but there's several things on the right side of the venn diagram where wget does have some capability.
Ok, wow, I didn't know that curl supported so many protocols - but the fact remains that that small intersection area is probably what > 90% of curl/Wget users are using the tools for. So, from a developer's perspective, the overlap is not that big, but from a user's perspective it might appear much bigger...
Personally I'm a fan of httrack for mirroring, although wget has some href/src translation capabilities that are occasionally a better match for particular goals.
I guess the most common usage is the overlap between the two. That's why I'd love to see a Venn diagram of which OSes and Docker images have each installed by default!
I read “downloads recursively GPLv3 licensed” and wondered whether even Stallman would really claim that a file downloaded by wget becomes retroactively GPLv3.
Used to use the parallel downloading of axel a lot 20 years ago when, with long fat pipes, window scaling wasn't always enabled, and it wasn't on the corporate proxy I had to use.
A trick I've found useful when searching large man pages for a flag --foo is to search for `␣␣--foo` (note the two leading spaces). In my experience this always hits the line where the flag is defined instead of irrelevant mentions of that flag, and it's faster than paging through the manual by hand.
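For example, to jump to the definition of curl's --retry:
man curl
# then, inside the pager, type: /  --retry   (slash, two spaces, then the flag name)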
I find wget is more likely to be on a given system than curl by default so I usually reach for that first. But I am squarely in the middle of the venn.
Curl is very widely used and has a ton of features which means that it gets a lot of CVEs, but their severity is often significantly overstated for users outside of specific niche configurations - for marketing purposes, it’s nice to be able to say that you found a HIGH in libcurl without mentioning that it only affected Windows domain authentication on ARM. The lead developer has written about this providing a lot of noise without much tangible security benefit:
Previously I worked on an open source project that pulled in many third party libraries. Users would run their corpo vulnerability scanners on the project and find dependencies with open CVEs and demand fixes, not understanding that in our usage of the libraries, the vulnerability is not exposed.
I think in 4 years, we had users open roughly 50 issues like this, which corresponded to exactly 0 real world exploitable issues.
A central vuln DB makes sense for sysadmins, but too many make it the end-all-be-all.
I think this ends up devolving to Goodhart’s law: once CVEs became marketing, a ton of people had a huge incentive to game their stats at the expense of everyone else’s time.
Can anyone explain "happy eyeballs"? Did find one page about it, but wasn't 100% clear what the use case for it being an option was, or where on earth the name came from...
Happy Eyeballs makes a simultaneous connection over IPv4 and IPv6 to an HTTP server, and uses the first connection that gets a server acknowledgement. This is useful because many networks have noticeably different response times for IPv4 and IPv6, and many have one of them configured but not working properly (usually IPv6).
Without Happy Eyeballs web browsers can be slow fetching some web pages, for some users, waiting for a request timeout on IP addresses that don't work before trying one that works, or working but with the slower IP.
It's called Happy Eyeballs because it improves the visible page load time in web browsers for many users.
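If I remember right, curl exposes this as a tunable rather than an on/off switch - something along these lines adjusts how long it waits before racing the other address family (the 100 ms value is arbitrary):
curl --happy-eyeballs-timeout-ms 100 https://example.com/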
Within the python ecosystem, I find httpx to be more similar to curl, and requests to be more like wget. For example, when following redirects or handling connection issues.