I've been working on polishing my YouTube archiver project for a while now and I've finally released a solid version of it. It has an offline web viewer/visualiser for archived channels and is managed using a CLI. Most importantly, it's easy to use :)
I like it, it's much better than what I've used previously.
I made a docker container to run it (https://github.com/na4ma4/docker-yark), when I get time I'll do a PR if you're interested so it isn't a separate project.
(I'll also fix it so the host is a command-line argument rather than just changing the binding from 127.0.0.1 to 0.0.0.0.)
I have a cron job that uses yt-dlp to download just the audio tracks of any videos that I have saved to a public playlist, can this be a replacement for that?
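For reference, the cron-job approach described above maps onto yt-dlp's Python API roughly like this (the playlist URL, output naming, and mp3 choice are placeholders, not anything Yark does):

    # Sketch: audio-only archive of a public playlist, roughly what a
    # yt-dlp cron job for this would do (playlist URL is a placeholder).
    from yt_dlp import YoutubeDL

    PLAYLIST_URL = "https://www.youtube.com/playlist?list=YOUR_PLAYLIST_ID"

    opts = {
        "format": "bestaudio/best",              # grab the best audio-only stream
        "download_archive": "audio-archive.txt", # skip videos already fetched on earlier runs
        "outtmpl": "%(title)s [%(id)s].%(ext)s",
        "postprocessors": [{                     # let ffmpeg extract/convert the audio track
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
        }],
    }

    with YoutubeDL(opts) as ydl:
        ydl.download([PLAYLIST_URL])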
Is there longer documentation anywhere? It's not clear from the README whether you add a whole channel when you create an archive, or whether you can add videos one by one to archive them.
Great project! I have a playlist that I use to keep track of videos that my young kid likes to watch, but wanted to get away from yt because of the ads, related videos, comments, autoplaying next vids, etc. So I just followed the simple instructions and bam, I have all the videos on my computer now with a UI that is even easier to use for him.
Occasionally my personal/literature/academic/tech notes link to a youtube source, but when clicking on it I find the video has vanished with no way of knowing what it was or even what it was called (it's sometimes impossible to track down an identical/replacement source). I lost many valuable references that way.
The Wayback Machine solves this for webpages, but nothing I'm aware of (short of youtube-dl-ing the video yourself and storing it somewhere openly, probably at risk of various infringements) solves it for video. Quite a lot of hassle for something rather simple.
It would be great to be able to immortalise them on a per-video basis, so if it's important enough, we can be sure that references made to the content will still be there in the future when needed.
I do not see how that becomes possible without the Internet archive effectively mirroring a large percentage of YouTube. I recall at one point, IA wanted to archive just the video metadata and realized even that would be technically challenging.
Only if a user chooses to submit the URL - I think parent comment is referring to an organized attempt by IA to archive a significant portion of YT's videos.
It would be nice to prioritize videos which are deemed at "high risk" of being deleted, with Bayesian statistics or machine learning or something like that.
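Purely to make the idea concrete (nothing below is an existing model; the features and weights are invented), a prioritiser could score each video from a few hand-picked signals and archive the riskiest first:

    # Hypothetical illustration only: score videos by "deletion risk" from
    # made-up features and archive the riskiest first.
    import math

    def deletion_risk(video):
        # video is a dict of invented features, e.g. channel_strikes,
        # days_since_upload, is_music (music uploads get claimed more often).
        z = (1.5 * video["channel_strikes"]
             + 0.8 * video["is_music"]
             - 0.001 * video["days_since_upload"])
        return 1 / (1 + math.exp(-z))   # logistic squash to a 0..1 "risk"

    videos = [
        {"id": "a", "channel_strikes": 2, "is_music": 1, "days_since_upload": 30},
        {"id": "b", "channel_strikes": 0, "is_music": 0, "days_since_upload": 3000},
    ]
    for v in sorted(videos, key=deletion_risk, reverse=True):
        print(v["id"], round(deletion_risk(v), 2))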
Any video you care about, you need to make it your own responsibility to back up the metadata and/or streams. If you're lucky, you can internet-search the video ID to get the metadata, even after deletion.
I'm a big fan of the historical information that Yark shows.
Arrimus 3D recently replacing a large chunk of their 3D modeling tutorials with religious content was a pretty big lightbulb moment for me that so much of the content I rely on - not just for the initial learning of a new skill, but as a continual reference when I forget something - is so fragile.
I immediately bought a NAS and began backing up everything that I glean even the tiniest bit of learning from, using a similar project, TubeArchivist[0]. Projects like this are really important for maintaining all of the great knowledge on the web.
Seconding tubearchivist. One of the killer features IMO is the browser extension [0], which adds a button on every video to send that video to the server to archive.
Does this have the ability to be set to "grab highest available resolution" instead of specifying one? A lot of the material I'd like to archive is from well before and after YouTube started supporting HD resolutions.
I am fairly sure that if this uses yt-dlp with the default options it will grab whatever is the highest available resolution video (and most modern codec) and merge it with the highest quality available audio track.
Working on some of the groundwork for it now in https://github.com/Owez/yark/pull/57. Higher quality might be a bigger issue to tackle because of the dependence on ffmpeg, but it'll be done in v1.3, slated for release in about a month's time.
I'm going to cap videos to 1080p by default and have a config setting to customize this.
Frankly I'd be ecstatic if I could limit it to 30fps versions as well. Aside from the occasional speedrun I have zero desire to increase the file size by 40 percent for nothing.
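For what it's worth, with yt-dlp underneath, both the resolution cap and the fps cap are just a format selector; a sketch of what that could look like (not Yark's actual configuration, and the URL is a placeholder):

    # Sketch: cap downloads at 1080p/30fps while still preferring the best
    # stream within those limits; falls back to a combined stream if needed.
    from yt_dlp import YoutubeDL

    opts = {
        "format": ("bestvideo[height<=1080][fps<=30]+bestaudio/"
                   "best[height<=1080]"),
    }
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])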
Normally in modern adaptive streaming, every video variant is muxed into a separate stream without audio, and different audio variants are muxed into their own individual streams.
Wow, I wonder if that's why I so frequently feel like the audio and video are subtly out of sync with one another. It's very minute but detectable. It feels like that experience has increased in the last 6 months or so.
YouTube has been that way (separate streams) for a long time, definitely not anything new in the last 6 months. And they reassemble to be indistinguishable from the original combined stream, so that's not going to be the cause.
There are plenty of causes of delayed audio, however. Bluetooth is a big one, if your device and software aren't properly compensating for the Bluetooth transmission delay.
That kind of issue on the encoder OR player side would've been easily caught in testing, including through automated tests, so as crazygringo says, it's most likely your OS/hardware.
That's pretty good: bip, bop, those line up spot on. I wonder if it's quality/resolution dependent; that test video is 360p max. Whatever I'm perceiving, if I'm actually perceiving anything, is in the ms range. It's not obvious; it's right on the border of what I trust my senses with.
Only some formats, which I guess are meant for older browsers, contain both video and audio. In general, these days video and audio are delivered through separate streams on YouTube.
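You can see this directly by listing a video's formats and checking which carry only video or only audio; a quick sketch with yt-dlp's Python API (URL is a placeholder):

    # Sketch: list a video's formats and note which are video-only,
    # audio-only, or combined (the combined ones are the legacy formats).
    from yt_dlp import YoutubeDL

    with YoutubeDL({"quiet": True}) as ydl:
        info = ydl.extract_info("https://www.youtube.com/watch?v=VIDEO_ID",
                                download=False)

    for f in info["formats"]:
        has_video = f.get("vcodec") not in (None, "none")
        has_audio = f.get("acodec") not in (None, "none")
        kind = ("combined" if has_video and has_audio
                else "video-only" if has_video
                else "audio-only")
        print(f["format_id"], kind, f.get("ext"))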
If you want to provide an option to upload artifacts to the Internet Archive, you could crib off of https://github.com/bibanon/tubeup It too relies on yt-dlp for extraction.
Importantly, pay close attention to which artifacts get uploaded to the item that's created, and what metadata is set as part of the upload process.
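Under the hood that upload step comes down to the internetarchive library; a minimal, hedged sketch of pushing one downloaded video plus its info.json to an item (the identifier, collection, and metadata values here are made up, not tubeup's exact behaviour):

    # Sketch: push one archived video and its metadata to archive.org with
    # the internetarchive library (identifier/metadata here are invented).
    from internetarchive import upload

    item_id = "youtube-VIDEO_ID"   # archive.org item identifier (must be unique)
    upload(
        item_id,
        files=["VIDEO_ID.mp4", "VIDEO_ID.info.json"],
        metadata={
            "mediatype": "movies",
            "collection": "opensource_movies",
            "title": "Original video title",
            "originalurl": "https://www.youtube.com/watch?v=VIDEO_ID",
        },
    )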
How cool would it be if everyone had IPFS running in their browser, and everyone dedicated some time to filling it with a backup of the internet, including YouTube.
That's impressive indeed, but if we boil it down to just the things that are worth saving, removing duplicates, long and pointless livestreams, long videos that are just endless loops, etc., it could be done with much less space. If we also ignore auto generated spam[1] and harmful content (Elsagate, etc.), it would require even less.
It's a nice thought experiment, but we really don't need to archive all of YT. That's why I appreciate projects like this and yt-dlp that allow me to not just archive what I'm interested in, but to watch it when and how I want, without Google tracking my every move, and interrupting every few minutes with ads. Paying for YT Premium only partially solves the second issue. I don't want to see sponsored content either.
I would argue that backing up selectively instead of just grabbing everything makes the job harder, not easier. You might need less storage, but who will decide (and how) what gets stored and what doesn't?
In a world where backing up to IPFS was as easy and widespread as hitting Ctrl+D to bookmark, and sharing backups wasn't legally dangerous, your question has an easy answer.
Each person gets to decide what they find important, backup and commit resources to.
We already have the sharing technology (BitTorrent, IPFS) and the backup tech (ArchiveBox, TubeArchivist). But they are not integrated, and they are not easy for a nontechnical person to configure and use. And they are unlikely to become mainstream, thanks to the copyright cartel.
Many years ago there was the dream of every home having a home computer giving people ownership over their digital life. A world where everyone has their own email server, their own diaspora pod for their family, their own blog, etc. etc. and all they had to do was buy or build a box and plug it in.
Projects like freedomplug and sandstorm.io and YunoHost and others. There are even newer projects like Umbrel. And of course NASs that now have apps on them.
Arguably, NASs and Umbrel are really easy to configure and use. OMV is somewhat harder but not unreasonable.
Alas, a self sovereign world is not what people desire. People are content with letting themselves be controlled by a few megacorps. Everything is in the cloud (someone else's computer).
Also, with the rise of botnets and DDOSs, it became unfeasible to self host something public without something like Cloudflare in front.
The current decentralized trend ("web3", etc.) becoming mainstream is a pipe dream only tech enthusiasts care about. The sad reality is that it has very slim chances of ever gaining mass adoption. The general public and non-technical users couldn't care less about owning and managing their data, running their own services, paying for services, and everything that entails. Even if they're aware that their privacy is being violated and that their personal data is sold on shady adtech markets, they see it as a cost worth paying for in exchange for the services they get for "free".
So even if all these technical solutions to problems only technical users care about become as easy to use for laypeople as modern web browsers are, the general public just won't care about it.
I've long believed that the blame for this lies mostly on early WWW architects. If the focus from the very start had been on sharing content as much as it was on consuming it, and user-friendly tools analogous to the web browser had been built, then the general public would be educated that the web works by being in control of your data, and sharing it selectively with specific people, companies, or the world. ISPs would be forced to deliver symmetrical connections to enable this, centralized services would be much less influential, and the web landscape would look much different today.
This was actually planned as a second phase in the original HyperText proposal[1], but was never completed for some reason. I'd be very interested to know what happened to this effort. If someone has insider knowledge, or can contact TBL, I'd be very grateful.
Alas, it's too late for this now. The centralized web is how most people experience the "internet", and that train has no chance of stopping.
I fully agree with what you wrote but I just want to mention that, as you also imply, the ideas behind "the current decentralized trend ("web3", etc.)", really aren't new. They just build on older ideas with regards to decentralization.
There are many ideas that fit under decentralization: torrents, the fediverse, crypto, even the old idea of the semantic web (because it was about standardized formats for metadata and carrying that metadata with the data instead of having it siloed in a central entity).
All of the hype around web3 really is only about crypto, because web3 is a marketable term for speculators.
Currently I am very cautiously hopeful about what the hype surrounding Mastodon (caused by Twitter's self-immolation) will lead to.
On the contrary. Instead of everyone grabbing everything, each person would only archive what's important to them. This not only distributes the workload naturally, but serves as an implicit filter of content people find enjoyable and would actually watch, rather than archiving content nobody cares about.
But this is all hypothetical. We don't need a global YouTube archive. We need to stop using it altogether, and replace it with decentralized services. In the meantime, the existing personal archiving solutions work well.
My understanding is that Wikipedia does not survive on, or even need, donations at all: they are entirely funded through their endowment.
The donations go to the Wikimedia Foundation which spends the money on a bunch of other social stuff that's not related to running Wikipedia at all. So unless you want to support all those other causes, you're effectively wasting your money by donating to Wikipedia's frequent donation requests. Wikipedia isn't the storage hog that YouTube is; it doesn't cost that much to keep it afloat, and the people moderating the content (the editors) are all volunteers anyway.
While that is true, and it is much more worthwhile to donate to the Internet Archive instead of Wikipedia (many Wikipedia references depend on being archived by the IA), I would not want to start an anti-Wikipedia-donations movement. Sure, they have plenty of money now, are wasting some of it, and are begging for more, but if they stopped receiving any, they could eventually burn through everything they have. And reserves and a steady inflow are important for planning future projects.
However, at this point, IA needs your donations much much much more than Wikipedia does.
From what I've read (I may be wrong), Wikipedia has enough in its endowment to survive indefinitely. Of course, if all donations suddenly stopped and they kept wasting money on non-Wikipedia stuff, this might no longer hold true, but that's an issue with bad management.
Bandwidth is expensive and videos take up way more bandwidth than text. I'd say it would be pretty much impossible to run something like YouTube, on that big of a scale, solely on donations.
But most people have more than 10 GB of free disk space, and most people also have more than one device. I have 2 phones, 3 laptops, a desktop and a NAS, with some 40 TB between them.
My workplace has a private cloud with some 70PB of storage, plus tape, and tons of desktops and laptops.
a) I seriously doubt your "most" claim about free disk space. Maybe "most privileged white folks in rich countries", but not "most people".
b) Just because I HAVE 10 GB of free disk space doesn't mean I'm going to offer it up for archiving of random internet crap.
c) If it was on my phone it'd cost money, and now we're very actively ignoring just how expensive it is to be online in third-world countries if you're not a rich expat.
> My workplace has a private cloud with some 70PB of storage, plus tape, and tons of desktops and laptops.
Sure. How much of that do you think they'd be willing to contribute, for free, to backing up random crap from people on the internet that may or may not be legal, and that could open them up to litigation because they're not an ISP/platform and thus not protected by the shield laws?
Why bring race into it at all? Do privileged black/asian/etc people not exist? I'm sure you didn't mean it, but it's a very shortsighted thing to say.
Also, I think you're taking the OP's comment too literally. Yes, not everyone has several devices and not everyone would want to store random videos on their devices; the point, though, is that 10 GB is a very conservatively small number, and it's a fun thought experiment to imagine things like this.
Your account is 5 days old. Either you need to learn the culture instead of telling other people, or you're a coward that won't voice their real opinions on their main account.
Or you've been banned already and need to take the hint.
Wow, the word racism really gets used inflationarily. I cannot find anywhere that the GP has posted a derogatory opinion about other "races". Maybe they are a bit narrow-minded or ignorant, but calling them a "racist maggot" is ridiculous.
yt-dlp and a batch file that runs via Task Scheduler have been doing this for me for a couple of years now. I also grab the captions and throw them into a database so that I can search transcripts for a clip that I remember but can't remember which video it's in. It was a fun weekend project.
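A stripped-down version of that pipeline, sketched with yt-dlp's Python API and SQLite full-text search (the channel URL, paths, and crude VTT parsing are simplifications, not the commenter's actual script):

    # Sketch: fetch English subtitles for a channel and index them in a
    # SQLite FTS table so transcripts can be searched later.
    import glob
    import sqlite3
    from yt_dlp import YoutubeDL

    opts = {
        "skip_download": True,        # captions only, no video files
        "writesubtitles": True,
        "writeautomaticsub": True,    # fall back to auto-generated captions
        "subtitleslangs": ["en"],
        "subtitlesformat": "vtt",
        "outtmpl": "subs/%(id)s.%(ext)s",
    }
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/@SomeChannel"])

    db = sqlite3.connect("captions.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS captions USING fts5(video_id, text)")
    for path in glob.glob("subs/*.vtt"):
        video_id = path.split("/")[-1].split(".")[0]
        # crude VTT parse: keep only the cue text lines
        lines = [l.strip() for l in open(path, encoding="utf-8")
                 if l.strip() and "-->" not in l and not l.startswith("WEBVTT")]
        db.execute("INSERT INTO captions VALUES (?, ?)", (video_id, " ".join(lines)))
    db.commit()

    # Example search:
    for row in db.execute("SELECT video_id FROM captions WHERE captions MATCH ?",
                          ("borrow checker",)):
        print(row[0])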
Long ago I set my podcast downloader to keep all the files it downloads, and recently I've been using OpenAI's Whisper to go through and create transcripts of the 8000 or so hours of data I have downloaded over the years.
It's very cool to be able to search through and remind myself of something I heard once. Not exactly life changing, but still, nice to be able to quickly drill down and find audio for something when a curiosity strikes me.
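A minimal version of that kind of pipeline with the open-source whisper package looks roughly like this (model size and paths are assumptions; the actual setup described is surely more elaborate):

    # Sketch: transcribe downloaded podcast audio with Whisper and write
    # one plain-text transcript per file for later grepping/searching.
    import glob
    import os
    import whisper

    model = whisper.load_model("medium")          # bigger models = better accuracy, more VRAM

    for path in glob.glob("podcasts/*.mp3"):
        out = path.rsplit(".", 1)[0] + ".txt"
        if os.path.exists(out):                   # resumable: skip already-transcribed files
            continue
        result = model.transcribe(path)
        with open(out, "w", encoding="utf-8") as fh:
            fh.write(result["text"])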
What kind of hardware do you have that makes it feasible to process thousands of hours of podcasts? I want to do the same but I’ve heard that Whisper requires some serious GPU might for decent accuracy (Linux Unplugged podcast specifically).
Yep, it takes a bit of GPU RAM. I'm using 3 machines with NVidia 3080 or better. I let them go for a few weeks over the winter break when I was mostly disconnected from the tech world. The workers prioritized podcasts I'm personally likely to want to search, and got through almost a third of my archive.
Now it's down to 1 or 2 machines depending on what's going on, so it'll take much longer to finish up, but I'm in no rush.
This includes data from 1995 on; the early data is backfill of radio shows that transitioned to podcasts and dumped old episodes into their feed at some point. Since my reader itself started in 2012, I've downloaded around 7000 hours of new podcasts, which works out to about 1.7 hours per day. So, around 2 hours per day; but I don't listen every day, and to be fair, I haven't listened to every podcast I've downloaded, since some don't interest me. But 1-2 hours of listening a day is the sweet spot for me.
I prefer the files prefixed with a number that indicates "air date", with 01 being the first uploaded video. The default is by index, where the top of the channel or playlist is number 01, i.e. the most recent.
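If you drive yt-dlp directly, prefixing filenames with the upload date gives that chronological ordering without depending on playlist position; a sketch (channel URL is a placeholder):

    # Sketch: name files by upload date so they sort oldest-first on disk,
    # e.g. "20091114 - Some title [abc123].mp4".
    from yt_dlp import YoutubeDL

    opts = {"outtmpl": "%(upload_date)s - %(title)s [%(id)s].%(ext)s"}
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/@SomeChannel"])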
A while ago I did something similar. I'm already downloading various playlists via yt-dlp and wrote a web interface [1] to view/play/search them.
My biggest annoyance at the time was importing my existing videos (and converting them to a streamable format, generating thumbnails and hover-previews etc).
Do you have any plans to allow importing an existing yt-dlp folder (in the standard layout, with a bunch of mkv files, the info.json, the subtitles, etc.)? Because my current archive contains a lot of already-deleted videos :(
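Not an answer for Yark itself, but the .info.json sidecars already contain enough to rebuild an index; a rough sketch of walking an existing yt-dlp folder (the field names are standard yt-dlp ones, the manifest format is made up):

    # Sketch: scan an existing yt-dlp folder and build a simple manifest
    # from the .info.json sidecar files (useful as an import starting point).
    import glob
    import json

    manifest = []
    for path in glob.glob("archive/*.info.json"):
        with open(path, encoding="utf-8") as fh:
            info = json.load(fh)
        manifest.append({
            "id": info.get("id"),
            "title": info.get("title"),
            "upload_date": info.get("upload_date"),
            "uploader": info.get("uploader"),
            "duration": info.get("duration"),
        })

    with open("import-manifest.json", "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
    print(f"found {len(manifest)} videos")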
BitTorrent magnet URIs or hashes are supposed to be made from torrents, but they really just point at a torrent.
One could make torrents for each video, take the v param from the YouTube URL, make a hash from that, and point it ("erroneously") at the torrent.
That way, provided it was downloaded before, anyone who has the url can obtain the video.
The idea needs one more trick to validate the download. I suppose one could compare a chunk from YouTube against the same piece downloaded over BitTorrent, but perhaps there are better ideas to be had.
Eventually, with tit for tat, one could swap one chunk of one video for a chunk from a different video on the same channel.
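Just to make the scheme concrete (and to be clear, this is not how BitTorrent infohashes really work: clients expect the SHA-1 of the torrent's info dict, so this derived key would only resolve if every participant agreed to use it, e.g. as a DHT rendezvous convention):

    # Hypothetical sketch of the idea above: derive a deterministic 40-hex
    # key from the YouTube video ID and stuff it into a magnet URI. Real
    # clients won't resolve this unless everyone agrees on the convention.
    import hashlib

    def magnet_for(video_id: str) -> str:
        key = hashlib.sha1(("yt:" + video_id).encode()).hexdigest()
        return f"magnet:?xt=urn:btih:{key}&dn=yt-{video_id}"

    print(magnet_for("dQw4w9WgXcQ"))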
My focus is on error handling and trying to differentiate between unrecoverable errors and recoverable ones (e.g. try a different proxy), but there's still a lot of work to be done.
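One hedged way to split those two classes when yt-dlp is the backend is to catch DownloadError and pattern-match the message; the marker strings and proxy list below are heuristics and placeholders, not a stable API:

    # Sketch: treat "gone forever" errors as final and everything else as
    # retryable (possibly through a different proxy). Message matching is
    # a heuristic, not a stable yt-dlp contract.
    from yt_dlp import YoutubeDL
    from yt_dlp.utils import DownloadError

    FATAL_MARKERS = ("Private video", "has been removed", "account associated")

    def fetch(url, proxies=(None, "socks5://127.0.0.1:9050")):
        for proxy in proxies:
            opts = {"proxy": proxy} if proxy else {}
            try:
                with YoutubeDL(opts) as ydl:
                    ydl.download([url])
                return True
            except DownloadError as err:
                if any(m in str(err) for m in FATAL_MARKERS):
                    return False          # unrecoverable: don't bother retrying
                continue                  # possibly recoverable: try the next proxy
        return False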
Wow, I love the "daily tabs" concept. I'll install this and give it a go. Thanks!
"The use-case of tabs are websites that you know are going to change: subreddits, games, or tools that you want to use for a few minutes daily, weekly, monthly, quarterly, or yearly."
I mean, it's not really important to a user whichever library it uses to scrape YouTube - I suppose it's important if you want to contribute to or develop it.
Bit of a noob question here. What's an archiver for?
Is it a library for things you've watched and want to store outside of youtube? Or is this for storing content you've created / managing your own portfolio of content?
Personally, for me it's archiving, in case I want to go back to it. Videos just keep disappearing from YouTube: channels get deleted by YouTube or by their owners, videos get copystriked, geoblocked, privated, and so on. As I'm lazy as f### I didn't create anything as sophisticated as OP, just a simple 10-line PHP script on my home server that pretends to be Kodi well enough to fool Yatse (an Android remote for Kodi). So every time I watch a video on YouTube (on my phone) that I want to keep, I tap "share" and then "play on Kodi"; my PHP script gets the video URL from the POST data and launches youtube-dl. It sucks because I never get feedback on whether it worked or when it's finished, but I log all the URLs, and at some point in the future I'll eventually add a cronjob that checks the list and sends reports and whatnot. Some day.
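The same trick sketched in Python rather than PHP, minus the Kodi emulation: just a bare endpoint that accepts a POSTed URL and hands it to yt-dlp (the port, paths, and log file are made up):

    # Sketch: tiny HTTP endpoint that accepts a POSTed "url" field, logs it,
    # and kicks off yt-dlp in the background. No feedback, just like the
    # PHP version it's imitating.
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            fields = parse_qs(self.rfile.read(length).decode())
            url = fields.get("url", [""])[0]
            if url:
                with open("requested-urls.log", "a") as log:
                    log.write(url + "\n")
                subprocess.Popen(["yt-dlp", "-P", "/srv/youtube", url])
            self.send_response(204)
            self.end_headers()

    HTTPServer(("0.0.0.0", 8099), Handler).serve_forever()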
I think it’s the latter. I’ve no issue with most things being one-and-done. But some channels have phenomenal content that I’d like to keep for the long term. Something might happen to their channel that makes it difficult to get, so I’ll regularly update my downloads with new videos, pictures, etc.
This applies to ripping, too. Funimation removed Drifters years ago, but I’ll always have a copy of it because I ripped it. Of course, I need to store it so it still costs money. But I can be content that I have the content.
This is pretty cool, I did a similar personal project that I describe here https://news.ycombinator.com/item?id=28480790. The only historical thing I log however is if a video was removed/reuploaded.
I have a youtube archiver script I'm using right now that I pulled from a thread on the data hoarder subreddit. My main issue is that I need the downloader to remove emojis from the filename because I sometimes sync them to Dropbox. Can this project do that?
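Not sure what Yark exposes, but yt-dlp itself has a restrict-filenames option that limits output names to ASCII, which drops emoji; a sketch of the Python-API equivalent (URL and template are placeholders):

    # Sketch: restrict output filenames to ASCII so emoji and other
    # Dropbox-unfriendly characters never make it into the name.
    from yt_dlp import YoutubeDL

    opts = {
        "restrictfilenames": True,   # ASCII-only names; spaces/emoji replaced or removed
        "outtmpl": "%(title)s [%(id)s].%(ext)s",
    }
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])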
I'd add some more flags/options for downloading specific videos or updating the library of downloaded content. I don't really want to download all the videos from one specific channel; instead I want to download the last 10 videos, for example.
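With yt-dlp underneath, that kind of partial update is usually just a playlist bound plus an archive file; a hedged sketch of "only the 10 newest uploads from a channel" (channel URL is a placeholder):

    # Sketch: only look at the 10 most recent uploads, and skip anything
    # already recorded in the archive file from previous runs.
    from yt_dlp import YoutubeDL

    opts = {
        "playlistend": 10,                    # channel/playlist entries 1..10 (newest first)
        "download_archive": "seen.txt",       # don't re-download across runs
    }
    with YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/@SomeChannel/videos"])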
youtube-dl has basically been unusable for a while now; it gets only tens of KB/s on my 1 Gbps connection. yt-dlp is a more maintained alternative, I think - it seemed to get great speeds the last time I checked.
> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
There's a difference between not reading an entire article or missing reading a paragraph versus not even clicking the link or skimming a couple screenshots to figure out what you're commenting on.
Thank you. I have a couple like that and didn't know how confident to be about replicating them.
It seems Apple is fine with ad blocking by default for web content, but is inconsistent when it comes to YouTube. It's hard to predict what's risky to invest dev time into.
Apparently they don't like when your app does something in a way that breaks ToS or EULA, even if they are not worth the pixels they are written on. Apple being the trillion dollar company it is, is naturally averse to the legal risk posed by having a ToS-breaking app on their store. This is why having a single company controlling your entire device is bad -- they do what's best for them, rather than whatever you might happen to want.
I'm not an expert on Apple's rules but my understanding is the download thing is circumvention of the terms under which Youtube licenses/re-licenses content so it's treated fairly strictly.
Then why do they allow other ad-blocking products, or browsers that block ads by default? I don't see any consistency in their position, other than that YouTube has a powerful lobby.
After reading the description, this project seems to be solely focused on downloading all of a specific channel's videos.
I've been taking my first steps at having a home server, and one of the things I'd love to do with it is having an archive of the videos that I have saved in my private playlists on YouTube. In my mind, the service would periodically check all my playlists, compare with what exists locally, and download any missing video. Maybe even with a nice web UI so it's easier to visually configure and use.
Does such a service already exist so I can self-host it?
Nice web ui aside, if I'm not mistaken youtube-dl already supports this kind of usage. You can `youtube-dl --download-archive archive.txt https://youtu.be/your-playlist` and it'll keep track in the archive.txt of everything it's already downloaded. Supplement with authentication options as necessary, set up a cronjob, done.
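The same thing through the Python API, for anyone who'd rather schedule a small script than a raw command (the playlist URLs are placeholders; private playlists would additionally need cookie/auth options):

    # Sketch: sync several playlists into local folders; the archive file
    # makes repeat runs only fetch videos that are new or missing.
    from yt_dlp import YoutubeDL

    PLAYLISTS = [
        "https://www.youtube.com/playlist?list=PLAYLIST_ID_1",
        "https://www.youtube.com/playlist?list=PLAYLIST_ID_2",
    ]

    opts = {
        "download_archive": "downloaded.txt",
        "outtmpl": "%(playlist_title)s/%(title)s [%(id)s].%(ext)s",
        "ignoreerrors": True,   # keep going past deleted/private entries
    }
    with YoutubeDL(opts) as ydl:
        ydl.download(PLAYLISTS)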
It's even simpler than that, just give youtube-dl the channel name and it will download all videos skipping any that already exists in your current directory.
Yark will be able to do this in v1.3 releasing in ~1 month provided it has access to the playlists, I'm not sure how to do creds currently but I'll look into it.
For auth, it seems the preferred way to log in with Google is OAuth2; that's what third-party apps use, I believe, e.g. Thunderbird uses it when setting up a new GMail account.
However, for apps that don't support OAuth2, there is also the possibility of using "App Passwords" [1], I've used one in the past and it worked well. (Update: I'm just reading it only works if 2FA is enabled, which I use)
> I'm not sure how to do creds currently but I'll look into it
Welcome to the nightmarish world of authentication to Google products, which all have 4 different versions of documentation and not a single one up to date.
You can probably set flags to download in lower bitrate formats. Some formats use a LOT less data than others, and usually the extra quality isn't really needed anyway if it's only being kept for reference. The difference between 1080p (or, say, 4K60) and 240p or 360p is massive.
Also the codec matters. AV1 is quite a bit smaller than VP9, which is in turn smaller than x264. Of course the downside is AV1 is harder to decode. For the purpose of well-encoded content, you can estimate that the bitrate doubles whenever you double both dimensions of the video, e.g. 720p would be twice the bitrate of 360p and half of 1440p.
With the rise of cheap 4k cameras, it goes faster than you think.
I download guitar training videos where a 25-minute tutorial can be 2-3 GB. I believe those are "only" 2K resolution. If you don't download them in a space-conscious format or recompress them (mildly aggressive settings can drop them to ~100 MB), a handful of videos can quickly fill a hard drive.
My downloader defaults to the highest quality, so unless I remember to specify a setting, I get gigabytes of video for something that could easily be 480p quality. In the event I forget, I re-encode with ffmpeg.
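The re-encode step can be as simple as shelling out to ffmpeg; a sketch with the kind of "mildly aggressive" settings mentioned above (the exact scale/CRF values are just illustrative, and the filenames are made up):

    # Sketch: shrink an accidentally-huge download to 480p H.264 with a
    # fairly aggressive CRF; audio is re-encoded to a modest bitrate.
    import subprocess

    def shrink(src: str, dst: str) -> None:
        subprocess.run([
            "ffmpeg", "-i", src,
            "-vf", "scale=-2:480",        # 480p, width rounded to an even number
            "-c:v", "libx264", "-crf", "28", "-preset", "slow",
            "-c:a", "aac", "-b:a", "96k",
            dst,
        ], check=True)

    shrink("tutorial-original.mp4", "tutorial-480p.mp4")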
I think youtube-dl as well as yt-dlp can both download playlists. You can create a script to download all your playlists and make it a cron job. Videos that already exist in the target folder will be skipped automatically.
This is really slick. I was figuring it was just a simple wrapper for yt-dlp which scraped some additional things (comments, views, etc) but you went above and beyond with the web interface. Nice job!
Does anyone have recs on how to run this on a continuous basis in the cloud? This will obviously take a lot more storage than, say, a normal Heroku setup (not that I would use Heroku). Should I use Railway or Render, or is that overkill compared to something else?