Hacker News

Is there a reason why you'd want to save archival material in a proprietary format? Wouldn't it be better/easier to use wget with the `--warc-file` flag?



1/ Why not wget?

For this project I wanted a consistent file format for my entire collection.

I have a bunch of stuff I want to save which is behind paywalls/logins/clickthroughs that are tricky for wget to reach. I know I can hand wget a cookies file, but that’s mildly fiddly. I save those pages as Safari webarchive files, and then they can drop in alongside the files I’ve collected programmatically. Then I can deal with all my saved pages as a homogeneous set, rather than being split into two formats.

Plus I couldn't find anybody who'd done this, and it was fun :D

This is only for personal stuff where I know I'll be using Safari/macOS for the foreseeable future. I don't envisage using this for anything professional, or a shared archive -- you're right that a less proprietary format would be better in those contexts. I think I'm in a bit of a niche here.

(I'm honestly surprised this is on the front page; I didn't think anybody else would be that interested.)

2/ Proprietary format: it is, but before I started I did some experiments to see what's actually inside. It's a binary plist and I can recover all the underlying HTML/CSS/JS files with Python, so I'm not totally hosed if Safari goes away.

Notes on that here: https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
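The gist of those notes can be sketched in a few lines of stdlib Python. The key names here (WebMainResource, WebSubresources, WebResourceURL, WebResourceData) come from poking at real archives rather than any documented API, so treat them as assumptions:

```python
# Sketch: pulling the raw files back out of a Safari .webarchive.
# The format is a binary property list; the key names below are what
# I'd expect from inspecting real archives, not a published spec.
import plistlib


def extract_webarchive(path):
    """Return {url: raw bytes} for the main resource and all subresources."""
    with open(path, "rb") as f:
        archive = plistlib.load(f)

    resources = {}

    # The top-level page (usually the HTML document).
    main = archive["WebMainResource"]
    resources[main["WebResourceURL"]] = main["WebResourceData"]

    # Any CSS/JS/images the page pulled in.
    for sub in archive.get("WebSubresources", []):
        resources[sub["WebResourceURL"]] = sub["WebResourceData"]

    return resources
```

So even if Safari vanishes, the underlying HTML/CSS/JS is a dictionary lookup away.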


> I didn't think anybody else would be that interested.

'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem, especially programmatically, so the niche is probably a little roomier than you might initially suspect.


> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem

You may be interested in SingleFile[1]

[1] https://github.com/gildas-lormeau/SingleFile

I use it all the time to archive webpages, and I imagine it wouldn't be hard to throw together a script to use FireFox's headless mode in combination with SingleFile to selfhost a clone of the wayback machine.


This is what I was going to say as well. Somebody on HN told me about SingleFile and I use it all the time now! Really amazing extension.


Thanks, I've seen it. The last time I tried it, it missed background images. But my point is that this is something browsers should support better, and kind of sort of do now, but even with that it's a hassle.


I tested this just now on the blog post that this HN page points to and SingleFile handled the background image fine.


> FireFox's

It's just "Firefox".


I've enjoyed using this

https://github.com/webrecorder

It has a standardized format and acts like a recorder for what you see.


Thanks to all the JS/SPA developers who insist on putting JS all over the place. Wouldn't it be better to have everything in one .html file, using <script> and <style> just inline? Then it is also just one file over the internet. There must be a bundler that does that, no?

It seems JS developers just want their code to be as obfuscated and unarchivable as possible unless it is served via their web server.


> using <script> <style> just inline

These SPA bundles are on the order of megabytes, not kilobytes. You want your users, for their own sake and yours, to be able to cache as much as possible instead of delivering a unique megablob payload for every page they hit.


Good point on the cache. However, things such as putting a background image in CSS so the user can't right-click to download the image are just stupid. Why is CSS all of a sudden in control of displaying the image? It just makes archiving pages harder.


> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem

Is it really? I remember hacking around with JavaScript's XMLSerializer (I think) like 5 years ago, and it solved that for ~90% of the websites I tried to archive. It'd save the DOM as-is when executed.

Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.


90% feels like an overestimate to me, but even that would be quite poor; you wouldn't accept that rate for saving most other things. Another problem is highlighted in the piece: it's a hassle to ensure external tools handle session state and credentials. Dynamic content is poorly handled, and the default behaviours are miserable (a browser will run random JavaScript from the network but not JavaScript you've saved, etc.).

There's a lot of interest in 'digital preservation', and perhaps one sign that it's still early days for the field is that it's tricky to 'just save' the result of one of the most basic current computer interactions: looking at a web page.


But if you serialize the DOM as-is, you literally get what you see on the page when you archive it. Nothing about it is dynamic, and there are no sessions or credentials to handle. Granted, it's a static copy of a specific single page.

If you need more than that, then WARC is probably the best. For my measly needs of just preserving exactly what I see, serializing the DOM and saving the result seems to do just fine.


Yes, you save something that's mildly better than print-page-to-PDF. But it still misses things, and the interactive stuff is very much part of 'exactly what I see'. Take a random article with an interactive graph, for instance, like this recent HN hit: https://ciechanow.ski/airfoil/

It's not that there aren't workarounds, it's that they are clunky and 'you can't actually save the most common computery entity you deal with' is just a strange state of affairs we've somehow Stockholmed ourselves to.


> Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.

One category that the archivers do poorly with is news articles where a pop-up renders on page load which then requires client-side JS execution to dismiss the pop-up.

Sometimes it is easily circumvented by manual DOM manipulation, but that's hardly a bulletproof solution. And it feels like it should be automatable.


Print to PDF seems to be the only way to ensure you record what you saw.


How can you actually open and view a warc file? I've never found a good application for this, have I missed something obvious?


Lots of tools available, best index I've found of the ecosystem is this: https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem

Ultimately, this is the best viewer I've found so far: https://replayweb.page/
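For what it's worth, the format itself is plain enough that you can iterate over records with only the standard library. This is a minimal sketch, assuming an uncompressed, well-formed WARC file (real-world files are usually gzipped; tools like warcio handle that for you):

```python
# Minimal sketch of walking WARC records using only the stdlib.
# Assumes uncompressed, well-formed input: each record is a header
# block, a blank line, a body of Content-Length bytes, then two CRLFs.
def iter_warc_records(data: bytes):
    """Yield (headers, body) for each record in an uncompressed WARC file."""
    pos = 0
    while pos < len(data):
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end == -1:
            break

        header_block = data[pos:header_end].decode("utf-8")
        lines = header_block.split("\r\n")

        # First line is the version marker, e.g. "WARC/1.0";
        # the rest are "Name: value" headers.
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()

        length = int(headers["Content-Length"])
        body_start = header_end + 4
        yield headers, data[body_start:body_start + length]

        # Skip the body and the two CRLFs separating records.
        pos = body_start + length + 4
```

Viewing is the hard part (replaying pages needs URL rewriting and a local server, which is what ReplayWeb.page does), but reading the raw records is not.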


> Lots of tools available, best index I've found of the ecosystem is this: https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem

It looks like there are a lot of tools for creating them, but not a lot for viewing.

What they really need is browser support, or at least an extension so a browser can open the files directly.


> What they really need is browser support, or at least an extension so a browser can open the files directly

That's probably the wrong thing. What browsers really need is a thin but standardized API that lets any third-party app the user has installed on their machine supply the content for various fetches/reads.

You'd open the WARC in Firefox or Safari or whatever, but Safari et al wouldn't have any special understanding of the format. It would know that your app does WARCs, though, and then knock on the door and say, "Please tell me the content I should be showing here; I'll defer to you for any further 'requests' associated with the file/page loaded in this tab, so just tell me the content I should use for those, too."


That's too complicated, though.

One of the main use cases for an archived web page would be to share archives, and in that case I think you'd want them to be double-clickable with little fuss.


So ship an implementation with the browser.


There is a browser extension. It can record WARC files, but also has a viewing interface identical to ReplayWeb.page. https://archiveweb.page/guide


Sadly it's Chrome only it seems.


I've never been able to easily read WARC files.


FTA:

> Although Safari is only maintained by Apple, the Safari webarchive format can be read by non-Apple tools – it’s a binary property list that stores the raw bytes of the original files. I’m comfortable that I’ll be able to open these archives for a while, even if Safari unexpectedly goes away.


It is proprietary, but it's not like it's difficult to decode or interpret.



