Yark: Advanced and easy YouTube archiver now stable (github.com/owez)
444 points by Owez on Jan 5, 2023 | 143 comments



I've been working on polishing my YouTube archiver project for a while now, and I've finally released a solid version of it. It has an offline web viewer/visualiser for archived channels and it's managed using a CLI. Most importantly, it's easy to use :)


Can you edit your submission to add a "Show HN:" before the title? Like these: https://news.ycombinator.com/show


Will do


I like it, it's much better than what I've used previously.

I made a docker container to run it (https://github.com/na4ma4/docker-yark), when I get time I'll do a PR if you're interested so it isn't a separate project.

(I'll also fix it so the host is a command line argument not just changing the binding from 127.0.0.1 to 0.0.0.0)


I have a cron job that uses yt-dlp to download just the audio tracks of any videos that I have saved to a public playlist, can this be a replacement for that?


Is there longer documentation anywhere? It's not clear from the README whether you add a whole channel when you create an archive, or whether you can add videos one by one to archive them.


From the readme, it is not clear to me what is meant by metadata.

Will this archive subtitles?

Will it archive comments? If it can, can the comments be updated?

Also, from the readme it looks like all metadata is kept centralized instead of each video having its own metadata file.

Are containers other than mp4 supported?


Great project! I have a playlist that I use to keep track of videos that my young kid likes to watch, but wanted to get away from yt because of the ads, related videos, comments, autoplaying next vids, etc. So I just followed the simple instructions and bam, I have all the videos on my computer now with a UI that is even easier to use for him.

Very nice, thank you!


Occasionally my personal/literature/academic/tech notes link to a youtube source, but when clicking on it I find the video has vanished with no way of knowing what it was or even what it was called (it's sometimes impossible to track down an identical/replacement source). I lost many valuable references that way.

The Wayback Machine solves this for webpages, but nothing I'm aware of (short of youtube-dl-ing the video yourself and storing it somewhere openly, probably at risk of various infringements) solves it for video. Quite a lot of hassle for something rather simple.

It would be great to be able to immortalise them on a per-video basis, so if it's important enough, we can be sure that references made to the content will still be there in the future when needed.


I do not see how that becomes possible without the Internet archive effectively mirroring a large percentage of YouTube. I recall at one point, IA wanted to archive just the video metadata and realized even that would be technically challenging.


IA does seem to archive YT video content, at least it did the last time I tried to watch a deleted but popular video.


Only if a user chooses to submit the URL - I think parent comment is referring to an organized attempt by IA to archive a significant portion of YT's videos.


It would be nice to prioritize videos which are deemed at "high risk" of being deleted, with Bayesian statistics or machine learning or something like that.


Any video you care about, you need to make it your own responsibility to back up the metadata and/or streams. If you're lucky you can internet-search the video ID to get the metadata, even after deletion.



I'm a big fan of the historical information that Yark shows.

Arrimus 3D recently replacing a large chunk of their 3D modeling tutorials with religious content was a pretty big lightbulb moment for me that so much of the content I rely on - not just for the initial learning of a new skill, but as a continual reference when I forget something - is so fragile.

I immediately bought a NAS and began backing up everything that I glean even the tiniest bit of learning from, using a similar project, TubeArchivist[0]. Projects like this are really important for maintaining all of the great knowledge on the web.

[0] https://github.com/tubearchivist/tubearchivist


Seconding tubearchivist. One of the killer features IMO is the browser extension [0], which adds a button on every video to send that video to the server to archive.

[0] https://github.com/tubearchivist/browser-extension


That's quite a pivot. Can you link to an example of a typical before and after video from that content creator? I'm curious what the connection is.


You might consider making torrents of those reference videos. It will help other people find and use them, and provide robustness to your collection.


Does this have the ability to be set to "grab highest available resolution" instead of specifying one? A lot of the material I'd like to archive is from well before and after YouTube started supporting HD resolutions.


It's just a wrapper for yt-dlp


I am fairly sure that if this uses yt-dlp with the default options it will grab whatever is the highest available resolution video (and most modern codec) and merge it with the highest quality available audio track.

same as just "yt-dlp https://url-of-video" from a CLI


Looks like the "best" quality setting is hardcoded currently:

https://github.com/Owez/yark/blob/676074ee3d9e379d15e52ffe2e...

It would be nice to expose this setting via config file and/or the cli.


Working on some of the groundwork for it now in https://github.com/Owez/yark/pull/57. Higher quality might be a bigger issue to tackle because of the dependence on ffmpeg, but it'll be done in v1.3, slated for release in about a month's time.

I'm going to cap videos to 1080p by default and have a config setting to customize this.


I would actually like the opposite: cap the resolution. I don't really need 4k videos.


Frankly I'd be ecstatic if I could limit it to 30fps versions as well. Aside from the occasional speedrun I have zero desire to increase the file size by 40 percent for nothing.
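For what it's worth, yt-dlp's format sorting can express both of those caps at once; something along these lines (syntax from memory, see --format-sort in the docs):

  yt-dlp -S "res:1080,fps:30" "https://url-of-video"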


What if the highest available resolution does not have an audio stream?


Normally in modern adaptive streaming, every video variant is muxed into a separate stream without audio, and different audio variants are muxed into their own individual streams.


Wow, I wonder if that's why I so frequently get the feeling that the audio and video are subtly out of sync with one another. It's very minute but detectable. It feels like that experience has increased in the last 6 months or so.


YouTube has been that way (separate streams) for a long time, definitely not anything new in the last 6 months. And they reassemble to be indistinguishable from the original combined stream, so that's not going to be the cause.

There are plenty of causes of delayed audio, however. Bluetooth is a big one, if your device and software aren't properly compensating for the Bluetooth transmission delay.


That kind of issue on the encoder OR player side would've been easily caught in testing, including through automated tests, so as crazygringo says, it's most likely your OS/hardware.

If you need to troubleshoot sync issues, this is a helpful tool: https://www.youtube.com/watch?v=HD4emXqHCsE


That's pretty good, bip bop, those line up spot on. I wonder if it's quality/resolution dependent; that test video is 360p max. Whatever I'm perceiving, if I'm actually seeing something, is in the ms range. It's not obvious, it's right on the border, if I trust my senses.


Grab the audio stream from something else and stitch them together with ffmpeg (like youtube-dl and others do).
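The stitching is just a stream copy, no re-encode needed, something like:

  ffmpeg -i video.mp4 -i audio.m4a -c copy merged.mp4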


Only some formats which I guess are to be used with older browsers contain both video and audio. In general, these days video and audio are delivered through separate streams on YouTube.


If you want to provide an option to upload artifacts to the Internet Archive, you could crib off of https://github.com/bibanon/tubeup (it too relies on yt-dlp for extraction).

Importantly, pay close attention to which artifacts are uploaded to the item that gets created, and what metadata is set as part of the upload process.


How cool would it be if everyone had IPFS running in their browser, and everyone dedicated some time to filling it with a backup of the internet, including YouTube.


I did some napkin math. If 1 billion people each backed up 10 GB, we'd almost have enough to store a copy of YouTube with zero data redundancy.
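(Rough working: 10^9 people x 10 GB = 10^10 GB = 10 EB.)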

Google is massive.


That's impressive indeed, but if we boil it down to just the things that are worth saving, removing duplicates, long and pointless livestreams, long videos that are just endless loops, etc., it could be done with much less space. If we also ignore auto generated spam[1] and harmful content (Elsagate, etc.), it would require even less.

It's a nice thought experiment, but we really don't need to archive all of YT. That's why I appreciate projects like this and yt-dlp that allow me to not just archive what I'm interested in, but to watch it when and how I want, without Google tracking my every move, and interrupting every few minutes with ads. Paying for YT Premium only partially solves the second issue. I don't want to see sponsored content either.

[1]: https://youtube.fandom.com/wiki/Roel_Van_de_Paar


I would argue that backing up selectively instead of just grabbing everything makes the job harder, not easier. You might need less storage, but who will decide (and how) what gets stored and what doesn't?


In a world where backing up to IPFS was as easy and widespread as hitting Ctrl+D to bookmark, and sharing backups was not legally dangerous, your question would get an easy answer.

Each person gets to decide what they find important, backup and commit resources to.

We already have the sharing technology (BitTorrent, IPFS) and the backup tech (ArchiveBox, TubeArchivist). But they are not integrated, they are not easy for a nontechnical person to configure and use, and they are unlikely to become mainstream thanks to the copyright cartel.

Many years ago there was the dream of every home having a home computer giving people ownership over their digital life. A world where everyone has their own email server, their own diaspora pod for their family, their own blog, etc. etc. and all they had to do was buy or build a box and plug it in.

Projects like freedomplug and sandstorm.io and YunoHost and others. There are even newer projects like Umbrel. And of course NASs that now have apps on them.

Arguably, NASs and Umbrel are really easy to configure and use. OMV is somewhat harder but not unreasonable.

Alas, a self sovereign world is not what people desire. People are content with letting themselves be controlled by a few megacorps. Everything is in the cloud (someone else's computer).

Also, with the rise of botnets and DDoS attacks, it became unfeasible to self-host something public without something like Cloudflare in front.

I miss the old internet.


The current decentralized trend ("web3", etc.) becoming mainstream is a pipe dream only tech enthusiasts care about. The sad reality is that it has very slim chances of ever gaining mass adoption. The general public and non-technical users couldn't care less about owning and managing their data, running their own services, paying for services, and everything that entails. Even if they're aware that their privacy is being violated and that their personal data is sold on shady adtech markets, they see it as a cost worth paying for in exchange for the services they get for "free".

So even if all these technical solutions to problems only technical users care about become as easy to use for laypeople as modern web browsers are, the general public just won't care about it.

I've long believed that the blame for this lies mostly on early WWW architects. If the focus from the very start had been on sharing content as much as it was on consuming it, and user-friendly tools analogous to the web browser had been built, then the general public would be educated that the web works by being in control of your data, and sharing it selectively with specific people, companies, or the world. ISPs would be forced to deliver symmetrical connections to enable this, centralized services would be much less influential, and the web landscape would look much different today.

This was actually planned as a second phase in the original HyperText proposal[1], but was never completed for some reason. I'd be very interested to know what happened to this effort. If someone has insider knowledge, or can contact TBL, I'd be very grateful.

Alas, it's too late for this now. The centralized web is how most people experience the "internet", and that train has no chance of stopping.

[1]: https://www.w3.org/Proposal.html


I fully agree with what you wrote but I just want to mention that, as you also imply, the ideas behind "the current decentralized trend ("web3", etc.)", really aren't new. They just build on older ideas with regards to decentralization.

There are many ideas that fit under decentralization: torrents, the fediverse, crypto, even the old idea of the semantic web (because it was about standardized formats for metadata and carrying that metadata with the data instead of having it siloed in a central entity).

All of the hype around web3 really is only about crypto, because web3 is a marketable term for speculators.

Currently I am very cautiously hopeful about what the hype surrounding Mastodon (caused by Twitter's self-immolation) will lead to.


On the contrary. Instead of everyone grabbing everything, each person would only archive what's important to them. This not only distributes the workload naturally, but serves as an implicit filter of content people find enjoyable and would actually watch, rather than archiving content nobody cares about.

But this is all hypothetical. We don't need a global YouTube archive. We need to stop using it altogether, and replace it with decentralized services. In the meantime, the existing personal archiving solutions work well.


Perhaps the backup algorithm needs to be coupled to the browser history of users. Stuff that isn't visited (or only for a few seconds) can be skipped.


It's unbelievable that YouTube was as free as it was for as long as it was.

We got too great a deal for so long that many people can't see things any other way.


Was probably easier when the videos had a time limit and didn't support 4K (or 1080p even).



It's unbelievable that Wikipedia is free and survives on donations.

Youtube sells ads and is subsidized by one of the biggest ad companies in the world that happens to have a lot of cheap cloud storage available.


My understanding is that Wikipedia does not survive on, or even need, donations at all: they are entirely funded through their endowment.

The donations go to the Wikimedia Foundation which spends the money on a bunch of other social stuff that's not related to running Wikipedia at all. So unless you want to support all those other causes, you're effectively wasting your money by donating to Wikipedia's frequent donation requests. Wikipedia isn't the storage hog that YouTube is; it doesn't cost that much to keep it afloat, and the people moderating the content (the editors) are all volunteers anyway.


While that is true, and it is much more worthwhile to donate to the Internet Archive than to Wikipedia (many Wikipedia references depend on being archived by IA), I would not want to start an anti-Wikipedia-donations movement. Sure, they have plenty of money now, are wasting some of it, and are begging for more, but if they stopped receiving any, they could eventually burn through everything they have. And having reserves and a steady inflow are important for planning future projects.

However, at this point, IA needs your donations much much much more than Wikipedia does.


From what I've read (I may be wrong), Wikipedia has enough in its endowment to survive indefinitely. Of course, if all donations suddenly stopped and they kept wasting money on non-Wikipedia stuff, this might no longer hold true, but that's an issue with bad management.


> It's unbelievable that Wikipedia is free and survives on donations.

Wikipedia is mostly text. You can download the whole thing and fit it on an SD card.


Bandwidth is expensive and videos take up way more bandwidth than text. I'd say it would be pretty much impossible to run something like YouTube, on that big of a scale, solely on donations.


YT is not subsidized by Google; it makes a profit.

The storage and serving requirements are not remotely comparable in scale.


Not for the first 15 years it didn't.


I wonder how many orders of magnitude separate the amount of data stored in YouTube vs Wikipedia? 5? 6? How about data served by them?


About 8 for storage: Wikipedia is gigabytes, YouTube is around exabytes.


It is not "free" at all if you are paying with your data. My data privacy is worth way more than the cost of streaming some video with crap discovery.


We're currently at around $0.02 per GB, so that would be $0.20 per person. A bargain.

(From a random source on the internet, [1])

[1] https://www.petercai.com/storage-prices-have-sort-of-stopped...


But most people have more than 10 GB of free disk space, and most people also have more than one device. I have 2 phones, 3 laptops, a desktop and a NAS, with some 40 TB between them.

My workplace has a private cloud with some 70 PB of storage, plus tape, and tons of desktops and laptops.


a) I seriously doubt your "most" claim about free disk space. Maybe "most privileged white folks in rich countries", but not "most people".

b) Just because I HAVE 10 GB of free disk space doesn't mean I'm going to offer it up for archiving random internet crap.

c) If it was on my phone it'd cost money, and now we're very actively ignoring just how expensive it is to be online in 3rd-world countries if you're not a rich expat.

> My workplace has a private cloud with some 70PB of storage, plus tape, and tons of desktops and laptops.

Sure. How much of that do you think they'd be willing to contribute, for free, to backing up random crap from people on the internet that may or may not be legal, and could open them up to litigation because they're not an ISP/platform and thus not protected by the shield laws?


>privileged white folks in rich countries

Why bring race into it at all? Do privileged black/asian/etc people not exist? I'm sure you didn't mean it, but it's a very shortsighted thing to say.

Also, I think you're taking OP's comment too literally. Yes, not everyone has several devices and not everyone would want to store random videos on their devices; the point, though, is that 10 GB is a very conservatively small number, and that it's a fun thought experiment to imagine things like this.


> i seriously doubt your "most" claim about free disk space.

Really? Other than low-end Chromebooks, I'd bet way over 75% of PCs under 8 years old have 10GB free.


[flagged]


> stopped reading here

Your account is 5 days old. Either you need to learn the culture instead of telling other people, or you're a coward that won't voice their real opinions on their main account.

Or you've been banned already and need to take the hint.


Wow, the word racism gets thrown around really loosely. I cannot find anywhere that the GP has posted a derogatory opinion about other "races". Maybe they are a bit narrow-minded or ignorant, but calling them a "racist maggot" is ridiculous.


Wait, is the total number of videos on YouTube a known number? That's fascinating, I'd love to see your assumptions for the napkin math.


I'm not aware that Google makes public the total size of YouTube. Your estimate is 10 EB. That seems quite low.


yt-dlp and a batch file that runs via Task Scheduler has been doing this for me for a couple of years now. I also grab the captions and throw that into a database so that I can search transcripts for a clip that I can remember but can't remember which video it's in. It was a fun weekend project.
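If anyone wants a low-tech version of the transcript search without the database, grepping the subtitle files yt-dlp writes out gets you surprisingly far (flag names from memory, double-check against --help):

  yt-dlp --write-auto-subs --sub-langs en --skip-download "https://url-of-channel"
  grep -ril "half-remembered phrase" *.en.vtt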


Long ago I had my podcast downloader keep all files it downloads and recently I've been using OpenAI's Whisper to go through and create transcripts of the 8000 or so hours of data I have downloaded over the years.

It's very cool to be able to search through and remind myself of something I heard once. Not exactly life changing, but still, nice to be able to quickly drill down and find audio for something when a curiosity strikes me.
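For anyone wanting to do the same, the openai-whisper CLI makes the batch job pretty simple; the model choice is the main accuracy/VRAM trade-off (flags from memory, so check whisper --help):

  for f in podcasts/*.mp3; do whisper "$f" --model medium --output_dir transcripts --output_format txt; done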


What kind of hardware do you have that makes it feasible to process thousands of hours of podcasts? I want to do the same but I’ve heard that Whisper requires some serious GPU might for decent accuracy (Linux Unplugged podcast specifically).


Yep, it takes a bit of GPU RAM. I'm using 3 machines with NVidia 3080 or better. I let them go for a few weeks over the winter break when I was mostly disconnected from the tech world. The workers prioritized podcasts I'm personally likely to want to search, and got through almost a third of my archive.

Now it's down to 1 or 2 machines depending on what's going on, so it'll take much longer to finish up, but I'm in no rush.


8000 hours? Napkin math time, that's 20 years of 10+ hours daily.

I call BS.


It's about an hour or two a day.

This includes data from 1995 on. The early data is a backfill of radio shows that transitioned to podcasts and dumped old episodes into their feed at some point. My reader itself started in 2012; since then I've downloaded around 7000 hours of new podcasts, which works out to 1.7 hours per day. So, around 2 hours per day, since I don't listen every day, and to be fair, I haven't listened to every podcast I've downloaded; some don't interest me. But 1-2 hours of listening a day is the sweet spot for me.


My math says 365 days x 10 hours/day = 3650 hours. 8000 hours is just over 2 years, not 20.


You need to resize your napkins.


You might be interested in these YouTube archive scripts: https://github.com/TheFrenchGhosty/TheFrenchGhostys-Ultimate...


Neat. Thanks for sharing


I run mine with cron and it puts files in a special folder for plex: https://github.com/nburns/utilities/blob/master/youtube.fish

Pulls from my watch later playlist which is quite handy


It looks like this depends on a "./add-video.py" script that isn't in the repository.


How do you deal with file numbering?

I prefer the files prefixed with a number that indicates "air date", with 01 being the first uploaded video. The default is by index, where the top of the channel or playlist is number 01, i.e. the most recent.


I just use the publish date in the format of YYYY-MM-DD at the beginning of the filename so that they sort properly.
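With yt-dlp that's just an output template, something like this (template syntax from memory):

  yt-dlp -o "%(upload_date>%Y-%m-%d)s - %(title)s.%(ext)s" "https://url-of-channel"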


A while ago I did something similar. I'm already downloading various playlists via yt-dlp and wrote a web interface [1] to view/play/search them.

My biggest annoyance at the time was importing my existing videos (and converting them to a streamable format, generating thumbnails and hover-previews, etc). Do you have any plans to allow importing an existing yt-dlp folder (in the standard layout with a bunch of mkv files, the info.json, the subtitles, etc)? Because my current archive contains a lot of already-deleted videos :(

[1] https://github.com/Mikescher/youtube-dl-viewer


I do eventually; it might be in v1.3 next month or sadly v1.4, depending on how much I've got on my plate :)


An idea I had some time ago:

BitTorrent magnet URIs or hashes are supposed to be made from torrents, but they really just point at a torrent.

One could make torrents for each video, take the v param from the YouTube URL, make a hash from that, and point it ("erroneously") at the torrent.

That way, provided it was downloaded before, anyone who has the URL can obtain the video.

The idea needs one more trick to validate the download. I suppose one could compare a chunk of the YT download to the same piece downloaded over BitTorrent, but perhaps there are better ideas to be had.

Eventually, with tit for tat, one could swap one chunk of one video for a chunk from a different video on the same channel.
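As a toy illustration of the video-ID-to-magnet mapping (this doesn't produce a real infohash that any swarm would recognise, it just shows the deterministic lookup idea):

  printf '%s' "dQw4w9WgXcQ" | sha1sum | awk '{print "magnet:?xt=urn:btih:" $1}'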


If you're interested in this kind of thing, you may also want to check out:

- the Distributed YouTube Archive Discord: https://discord.com/invite/PQqks7eSKc

- ArchiveTeam also do a significant amount of YT archiving: https://wiki.archiveteam.org/index.php/YouTube

- a similar, but private effort: https://reddit.com/r/Archivists/comments/5uvfpw/youtube_arch...


I also archive many playlists with some code I wrote but I don't use a GUI.

https://github.com/chapmanjacobd/library/blob/f778e22bf80c58...

My focus is on error handling and trying to differentiate between unrecoverable errors and recoverable ones (try different proxy) but there's still a lot of work to be done.

Also look into https://github.com/swolegoal/squid-dl


Wow, I love the "daily tabs" concept. I'll install this and give it a go. Thanks!

"The use-case of tabs are websites that you know are going to change: subreddits, games, or tools that you want to use for a few minutes daily, weekly, monthly, quarterly, or yearly."


Did you write your own YouTube scraper, which would be quite a task, or is this based on something like ytdl? Might be worth mentioning in the readme.



> Might be worth mentioning in the readme.

I mean, it's not really important to a user which library it uses to scrape YouTube; I suppose it's important if you want to contribute to or develop it.


Bit of a noob question here. What's an archiver for?

Is it a library for things you've watched and want to store outside of youtube? Or is this for storing content you've created / managing your own portfolio of content?


Personally, for me it's archiving, in case I want to go back to it. Videos just keep disappearing from YouTube: channels get deleted by YouTube or by their owners, videos get copystriked, geoblocked, privated, and so on.

As I'm lazy as f### I didn't create anything as sophisticated as OP, just a simple 10-line PHP script on my home server that pretends to be Kodi well enough to fool Yatse (an Android remote for Kodi). So every time I watch a video on YouTube (on my phone) that I want to keep, I tap "share" and then "play on Kodi"; my PHP script gets the video URL from the POST data and launches youtube-dl. It sucks because I never get feedback on whether it worked or when it's finished, but I log all the URLs, and at some point in the future I'll eventually add a cron job that checks the list and sends reports and whatnot. Some day.


Great, thanks for the insight!


I think it’s the latter. I’ve no issue with most things being one-and-done. But some channels have phenomenal content that I’d like to keep for the long term. Something might happen to their channel that makes it difficult to get, so I’ll regularly update my downloads with new videos, pictures, etc.

This applies to ripping, too. Funimation removed Drifters years ago, but I’ll always have a copy of it because I ripped it. Of course, I need to store it so it still costs money. But I can be content that I have the content.


There are coded hints in the link, like:

> Yark lets you continuously archive all videos and metadata for YouTube channels. You can also view your archive as a seamless offline website


Snarky and not answering the question, well done.


Here's a similar project we've been working on; it's running 24/7:

https://github.com/VeemsHQ/yt-channels-archive

This just provides the latest HQ version of the videos + thumbs & metadata, with no historical information such as changes in video titles.


This is pretty cool. I did a similar personal project that I describe here: https://news.ycombinator.com/item?id=28480790. The only historical thing I log, however, is whether a video was removed/reuploaded.


By the way, does anybody know a way to include the YouTube time stamps as chapters in a downloaded video file?


In yt-dlp it's --embed-metadata (and --embed-chapters if you only want chapters).
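For example (the embedding step needs ffmpeg installed, if I remember right):

  yt-dlp --embed-metadata --embed-chapters "https://url-of-video"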


I have a youtube archiver script I'm using right now that I pulled from a thread on the data hoarder subreddit. My main issue is that I need the downloader to remove emojis from the filename because I sometimes sync them to Dropbox. Can this project do that?


yt-dlp has a `--restrict-filenames` flag


I'd add some more flags/options for downloading specific videos or updating the library of downloaded content. I don't really want to download all the videos from one specific channel; instead I want to download the last 10 videos, for example.


You can do yark refresh [name] --videos=10



youtube-dl has basically been unusable for a while now. It gets only tens of KB/s on my 1 Gbps connection. yt-dlp is a more maintained alternative, I think; it got great speeds the last time I checked.


ytdl is one of its dependencies

You clearly didn't even skim the readme to see what this does


To be fair, the readme does not mention ytdl.


From the HN guidelines:

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".


There's a difference between not reading an entire article or missing reading a paragraph versus not even clicking the link or skimming a couple screenshots to figure out what you're commenting on.


Any tool that tries to use YouTube in a non-standard way and is "stable" soon becomes obsolete.


Anyone know if Apple bans this kind of lib from use in the app store?


Apple is in cahoots with big tech companies and is happy to help protect their monopoly and disallows any kind of adversarial interoperability.


There have been iOS video players with such functionality built-in whose authors have had to remove the functionality at Apple's request.


Thank you. I have a couple like that and didn't know how confident to be about replicating that functionality.

It seems Apple is fine with ad blocking by default for web content, but is inconsistent when it comes to YouTube. It's hard to predict what's risky to invest dev time into.


Apparently they don't like it when your app does something in a way that breaks a ToS or EULA, even if those aren't worth the pixels they are written on. Apple, being the trillion-dollar company it is, is naturally averse to the legal risk posed by having a ToS-breaking app on their store. This is why having a single company controlling your entire device is bad: they do what's best for them, rather than whatever you might happen to want.


Agreed that it's bad; it's just where the money is for B2C indie dev. I hope their App Store monopoly gets broken up.

I get confused about their stance on this because they allow products like AdGuard, or browsers that block ads by default.


I'm not an expert on Apple's rules, but my understanding is that the download thing is circumvention of the terms under which YouTube licenses/re-licenses content, so it's treated fairly strictly.


Then why do they allow other ad-blocking products, or browsers that block ads by default? I don't see any consistency in their position, except that YouTube has a powerful lobby.


Does this use youtube-dl / yt-dlp underneath for the retrieval of each video's URL and highest-quality video/audio format, and merge with ffmpeg?


After reading the description, this project seems to be solely focused on downloading all of a specific channel's videos.

I've been taking my first steps at having a home server, and one of the things I'd love to do with it is having an archive of the videos that I have saved in my private playlists on YouTube. In my mind, the service would periodically check all my playlists, compare with what exists locally, and download any missing video. Maybe even with a nice web UI so it's easier to visually configure and use.

Does such a service already exist so I can self-host it?


Nice web ui aside, if I'm not mistaken youtube-dl already supports this kind of usage. You can `youtube-dl --download-archive archive.txt https://youtu.be/your-playlist` and it'll keep track in the archive.txt of everything it's already downloaded. Supplement with authentication options as necessary, set up a cronjob, done.
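A concrete crontab entry might look something like this (paths are placeholders; add cookies/auth options for private playlists as needed):

  0 3 * * * youtube-dl --download-archive /srv/yt/archive.txt -o "/srv/yt/%(playlist_title)s/%(title)s.%(ext)s" "https://www.youtube.com/playlist?list=YOUR_PLAYLIST_ID"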


It's even simpler than that: just give youtube-dl the channel name and it will download all the videos, skipping any that already exist in your current directory.


Yark will be able to do this in v1.3, releasing in ~1 month, provided it has access to the playlists. I'm not sure how to do creds currently, but I'll look into it.

Issue for downloading playlists: https://github.com/Owez/yark/issues/49


That's great to know!

For auth it seems the preferred way to log in with Google is OAuth2; that's, I believe, what third-party apps use, e.g. Thunderbird uses it when setting up a new Gmail account.

However, for apps that don't support OAuth2, there is also the possibility of using "App Passwords" [1], I've used one in the past and it worked well. (Update: I'm just reading it only works if 2FA is enabled, which I use)

[1]: https://support.google.com/accounts/answer/185833?hl=en


> I'm not sure how to do creds currently but I'll look into it

Welcome to the nightmarish world of authentication to Google products, which all have 4 different versions of documentation and not a single one up to date.


I haven't used it personally, but Tube Archivist might be what you're looking for.

https://www.tubearchivist.com/


Ignore youtube-dl, it has speed issues. Use the fork yt-dlp.

If you want a GUI, check out TubeSync: a web UI for yt-dlp and ffmpeg, nicely packaged in Docker.


I was doing this for a while but it became expensive: tens of TBs on an expensive NAS just to hoard data.


You can probably set flags to download in lower bitrate formats. Some formats use a LOT less data than others, and usually the extra quality isn't really needed anyway if it's only being kept for reference. The difference between 1080p (or, say, 4K60) and 240p or 360p is massive.


Also, the codec matters. AV1 is quite a bit smaller than VP9, which is in turn smaller than H.264. Of course the downside is that AV1 is harder to decode. For well-encoded content, you can estimate that the bitrate doubles whenever you double both dimensions of the video, e.g. 720p would be twice the bitrate of 360p and half of 1440p.


FSM almighty, how much video are you watching to have that many favorite ones?


With the rise of cheap 4K cameras, it goes faster than you think. I download guitar training videos where a 25-minute tutorial can be 2-3 GB. I believe those are "only" 2K resolution. If you do not download them in a space-conscious format or recompress (using mildly aggressive settings can drop them to 100 MB), a handful of videos can quickly fill a hard drive.


Why wouldn't one download something like that in 720p video / 196 kbps audio?

I mean, there's a time and a place for 4K, but watching the zits on the face of a guy who tells you how to play C Am F G on a guitar isn't it.

Not to mention, most of those cheap 4K cameras won't have optics to utilize those pixels; no quality will be lost in 720p.


My downloader defaults to the highest quality, so unless I remember to specify a setting, I get gigabytes of video for something that could easily be 480p. In the event I forget, I re-encode with ffmpeg.
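The re-encode is a one-liner, something like this (scale and CRF values to taste):

  ffmpeg -i input.mp4 -vf scale=-2:480 -c:v libx264 -crf 28 -c:a copy output.mp4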


It's a reasonable default, but if one has gone to the trouble of setting up a NAS and scripts, might as well tick that box.


That depends, if the source is bright enough, those little pixels will have enough signal to work effectively.


Why not archive seldom-accessed files onto BD-Rs? They're cheap to store.


I think youtube-dl as well as yt-dlp can both download playlists. You can create a script to download all your playlists and make it a cron job. Videos that already exist in the target folder will be skipped automatically.


This is how I accomplish basically the same thing - https://gist.github.com/Gestas/30ac0c3a07404174d0d7b66068221....

It requires a local file with a list of channels/playlists. I use Jellyfin (https://jellyfin.org/) as the frontend/video player.


Look at tubearchivist. It can do what you want.

You can subscribe to playlists, as well as automatically update and download videos.


This is really slick. I was figuring it was just a simple wrapper for yt-dlp which scraped some additional things (comments, views, etc) but you went above and beyond with the web interface. Nice job!


Does anyone have recs on how to run this on a continuous basis in the cloud? This obviously will take a lot more storage than a normal Heroku setup (not that I would use Heroku). Should I use Railway or Render, or is that overkill compared to something else?

Gasp, can I run it as a GitHub Action???





