Old school Linux administration – my next homelab generation (scholz.ruhr)
178 points by _fnqu on Sept 11, 2022 | 110 comments



HA is overrated; I'd much rather go for a low mean time to repair. Backups, reinstall, Ansible playbook is my way to go if hardware fails, which is quite rare to be honest. HA goes beyond hardware in terms of electricity, internet connection, storage, location, etc. IMO people quite often underestimate what it means to have real HA: a second location with an identical setup to shift workload to, or even active-active instead of active-passive. I have an Intel NUC as a plain Debian server with containers on it, two Raspberry Pis (acting as load balancers with Traefik and Authelia on top) and a Hetzner VM, all connected through WireGuard. Everything is configured via Ansible, and I rely on LE certificates to connect via HTTPS, or go directly over the WireGuard VPN to HTTP if I want something exposed only via VPN. Encrypted backups are copied via WireGuard to an offsite storage box.

Full down to back online, if the hardware is not damaged, is less than 15 minutes.

It is very easy to do and rock solid; unattended-upgrades does the trick. I tried almost every combination, and even though I have been a k8s user since version 1.2, the complexity of k8s at home (or even vSphere) is too much for me compared to this super simple and stable configuration I now have.
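
Not the commenter's actual playbook, but a minimal Ansible sketch of the unattended-upgrades piece, assuming Debian/Ubuntu hosts, might look like this:

```yaml
# Hypothetical tasks/unattended-upgrades.yml for the Debian hosts described above
- name: Install unattended-upgrades
  ansible.builtin.apt:
    name: unattended-upgrades
    state: present
    update_cache: true

- name: Enable periodic unattended upgrades
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/20auto-upgrades
    content: |
      APT::Periodic::Update-Package-Lists "1";
      APT::Periodic::Unattended-Upgrade "1";
    owner: root
    group: root
    mode: "0644"
```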


> HA is overrated; I'd much rather go for a low mean time to repair.

Those are almost entirely independent domains.

The only point where they connect is that

> Full down to back online, if the hardware is not damaged, is less than 15 minutes.

automation makes both easier to deal with.

"Mean time to repair" is all fine if you don't have to drive 30 minutes to the datacenter to swap out stuff.

Even if it's cloudy cloud and you have backups, restores can take a lot of time.

Also, you will want HA when it's your internet gateway that dies.

There are also levels of implementation. Making a fully redundant set of load balancers is relatively easy, but any stateful app is much, much harder. For applications, active-passive setups are also much easier than full active-active redundancy, especially if the app wasn't written for it.


Quick to repair is also a lot more versatile. Unfortunately, I had to work in a lot of environments with proprietary, vendor-locked software and hardware. Usually all you can do is make sure you design the system so that you can chuck entire parts of it (possibly for rework, but sometimes not) if it breaks or gets compromised.

Definitely relevant for, say, SCADA controls with terrible security.


I completely agree, but there's an important balance to consider. HA is a good investment for things that are 'stateless', e.g. VIPs/LBs.

Rebuilding existing systems (or adding new ones) is excellent, but sometimes life really gets simpler when you have one well-known/HA endpoint to use in a given context


I'm not from this universe.

What does HA stand for?

Thank you.


High Availability, I believe.


HA is built into platforms like Proxmox; what is the stress?


tl;dr: if you need it, go for it :) My 2 cents: it's not needed at home, and even for some of my clients the complexity vs. a simple setup and a few minutes of downtime still speaks for MTTR instead of HA that no one can debug.

It's not about stress, but to have an HA Proxmox cluster you need at least 3 machines, or you fake one with a quorum machine that runs no VMs. Sure, your VMs will keep running if one machine goes down, but do you have HA storage underneath? Ganesha would work, but that's more complexity, or another network storage with more machines. Don't get me wrong, it is fun to play with, but I doubt any homelab needs HA and cannot tolerate a few minutes of downtime. And what do you do when your internet provider goes down or you have a power outage? I don't want to provoke; I have fiber switches and 10G at home, and two locations to switch between in case one goes down, but I can live with multiple days of downtime if I have to. If not, I take my backups, fire up some VMs at some cloud provider and am back online in a few minutes - and pay for it - because backups are in both locations.

I would much rather build a simple LB + autoscaling group with a deployment pipeline (Lambda or some other control loop) and containers on it than a k8s cluster the client is not prepared for. If they outgrow this, the follow-up solution will most likely be better than going 100% "cloud" the first time. Most clients go from a Java 8 JBoss monolith to Spring containers in k8s and then wonder why it is such a shitshow. But yeah, it pays the bills, so I am not complaining that often ^^


You're right that HA is a higher bar, and the more ornate your tech stack, the more work HA takes.

More reasonably, mirroring a few Intel NUCs isn't that far-fetched, with backups spilling off to a pair of mirrored NASes. If storage failover is needed, it's not hard to find a used Fibre Channel storage array on eBay.

For the network, there are lots of capable little devices like the EdgeRouters. Two internet connections (cable/DSL + LTE) isn't that crazy anymore. Starlink will be interesting too.

Power? A UPS or two can keep a low-wattage system up for a few hours. That in turn can be run off a bigger deep-cycle battery, maybe fed by some solar or a small generator.


It's absolutely fine, and I'd argue essential, to know Linux system administration even for people who mostly work in the cloud. From my recent experience, sysadmin skills seem to be a lost art - people no longer have to care about the underpinnings of a Linux system, and when problems hit, it shows.

It's interesting from a job market standpoint how little incentive there is to learn the basics when more and more we are seeing developers being made responsible for ops and admin tasks.


I think nowadays requiring that all application developers should also do ops (DevOps?) is a bad idea. Sure they should have basic shell skills, but when you’re on Kubernetes or similar, understanding what’s underneath is not vital. Instead, rely on specialized teams that actually want to know this stuff, and become the experts you escalate to only when things really go sideways and the abstractions fail (which is rare if you do it well). If your budget is too small for this, there are always support contracts.

As someone who’s been hiring for both sides, I see this reflected in candidates more and more. The good devs rarely know ops, and the good ops rarely code well. For our “platform” teams, we end up just hiring good devs and teaching them ops. I think the people that are really good at both often end up working at the actual cloud providers or DevOps startups.


But then there's the problem that now you have two teams - the ops team doesn't understand the app and the app team doesn't understand the ops/infra side. I of course agree with your point that there should be two teams, but you need a few people who understand both app dev and ops/infra/OSes/k8s internals etc. And finding these people has been nearly impossible.


You can just add an ops guy with minimal coding to a dev team and it works wonders. The ops guy does the ops stuff, writes the shell scripts and consults with the devs on how to build what they need to build.

The ops guy has an opportunity to level up his coding since the devs are doing code reviews for him, and the devs have an opportunity to level up their ops knowledge because they are working with someone who understands what, and how, the platform should be built.


I think it's great to "bring" knowledge to devs in this model, but not to commit to individual wishes for features and implement them willy-nilly. The "ops guy" needs a team too, preferably to build a shared company platform that suits the large majority. Of course, this only applies once you either hit scale or are a large company; when smaller, it's perfectly fine to offload unwanted things to one or two people. I do very strongly believe in "you build it, you run it". Keeping servers up is different from keeping your application up. I think a dev should know what health checks/probes are and what a good check period is for their application. It's the "ops guy"'s job to make the dev's job as easy as possible and bring expertise and knowledge when needed.


I am one of those rare people :) Coming from the Linux sysadmin and networking side, I'm currently working on going deeper into programming and expanding my knowledge there. In Europe the pay for someone like me appears to be capped at slightly above six figures though, which I will reach soon, so I'm currently a bit unsure how to progress career-wise.


I see this in my firm. Non-IT management loves to resort to the lowest common denominator of 'systems', that is, systems built by people who are neither trained nor experienced as administrators, but mainly follow online guides /and deliver/ (at p50). When I mention the resilience of the systems we maintain to senior directors, the most common reply is that 1. the activities are not production and 2. the cost of letting our proper IT departments handle things goes way beyond the willingness to pay. When I reply that our non-production activities still cannot be missed for more than a few days (which defines them as production for me, but not in our IT-risk landscape), I usually get greeted with something along the lines of "in all things cloud, everything is different". For non-techies I think it's just hard to fathom that most of the effort in maintaining systems is in resilience, not in just scripting it to work.


They most likely resorted to skipping IT involvement because the department is difficult to work with.

"The cloud" is both a technical solution for flexible workload requirements, and a political solution to allow others to continue delivering when IT is quoting them 3 months of "prep work" for setting up a dev environment and 2 to 4 weeks for a single SSL certificate.

As a consultant, I am confronted with hostile IT requirements daily. Oftentimes, the departments hiring us end up setting up and training their own Ops teams outside of IT. Despite leading to incredible redundancy, that's often credited as a huge success for that department, because of the gained flexibility.


So many people somehow believe a single VM with no disk snapshot/backup, a floating IP, and ad hoc scripts to bring up the production workload is "production quality", only because it is running in the cloud.


I agree. I'm lucky that in my career I've been exposed to so many different disciplines of network and Linux administration that I know (mostly) what "the cloud" is made of.

So when problems do occur, I can make some pretty good guesses on what's wrong before I actually know for sure.


As a person who manages a big fleet of servers containing both pets and cattle, the upkeep of the pets is nowhere near what the cloud-lovers drum up.

A server installed with half-decent care can run uninterrupted for a long long time, given minimal maintenance and usual care (update, and reboot if you change the kernel).

Also, not installing an n+3 Kubernetes cluster with an external storage backend drastically reduces overhead and the number of running cogs.

VMs, containers, K8S and other things are nice, but pulling the trigger so blindly assuming every new technology is a silver bullet to all problems is just not right on many levels.

As for home hardware, I'm running a single OrangePi Zero with DNS and Syncthing. That fits the bill, for now. Fitting into the smallest hardware possible is also pretty fun.


> A server installed with half-decent care can run uninterrupted for a long long time, given minimal maintenance and usual care (update, and reboot if you change the kernel).

For my earlier home setups, this was actually part of the problem! My servers and apps were so zero-touch, that by the time I needed to do anything, I'd forgotten everything about them!

Now, I could have meticulously documented everything, but... I find that pretty boring. The thing with Docker is that, to some extent, Dockerfiles are a kind of documentation. They also mean I can run my workloads on any server - I don't need a special snowflake server that I'm scared to touch.
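
To illustrate the "Dockerfiles as documentation" point: even a small compose file records most of what you would otherwise have to remember about a deployment. This is a made-up example, not the commenter's setup - the service name, image and paths are placeholders:

```yaml
# Hypothetical docker-compose.yml acting as deployment documentation
services:
  rss:
    image: ghcr.io/example/rss-reader:2.0   # which software and version runs
    restart: unless-stopped                  # how it behaves across reboots
    ports:
      - "8080:3000"                          # where it is exposed
    environment:
      DB_HOST: db.home.example.org           # what it depends on
    volumes:
      - /srv/rss/data:/var/lib/rss           # where the state actually lives
```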


We have found that, while some applications are much easier to install with Docker, operating them becomes much harder in the long run.

NextCloud is a prime example. Adding some extensions (apps) to NextCloud becomes almost impossible when it is installed via Docker.

We have two teams with their own NextCloud installations. One is installed on bare metal, and the other is a Docker setup. The bare-metal one is much easier to update, add apps to, diagnose and operate in general. The Docker installation needed three days of tinkering and head-banging to get what the other team had enabled in 25 seconds flat.

To prevent such problems, JS Wiki, for example, runs a special container just for update duties.

I'd rather live document an installation and have an easier time in the long run, rather than bang my head during some routine update or config change, to be honest.

I document my home installations the same way, too. It creates a great knowledge base in the long run.

Not all applications fit the scenario I described above, but Docker is not a panacea, nor a valid reason not to document something, in my experience and perspective.


At that point I believe that if software wasn't written Docker-first, it will just be plain shit to manage as a Docker container.

More often than not, Docker containers are a crutch for apps that have an absolute shitshow of an install process: just bake that process into a Dockerfile instead of making it more sensible in the first place.


For a home lab/home prod, time to install is not necessarily the worst thing to optimise for. For example, I’m not going to spend several hours each on 8-10 different alternative applications, just to install them so I can get a feel for which one I want to keep.

Docker is an amazing timesaver, and probably the only reason why I run anything more ambitious than classic LAMP + NFS on my home lab.


I agree some software doesn't fit into Docker-style application containers well, but I'm still a fan of Infrastructure-as-Code. I use Ansible on LXD containers and I'm mostly content. For example, my Flarum forum role installs PHP, Apache and changes system configs, but creating the MySQL database and going through the interactive setup process is documented as a manual task.

I could automate this too, but it's not worth the effort and complexity, and just documenting the first part is about as much effort as actually scripting it. I think it's a reasonable compromise.
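
Not the actual role, but a sketch of what the task list for such a role might look like - the package names and the Apache handler are assumptions:

```yaml
# Hypothetical roles/flarum/tasks/main.yml, in the spirit described above
- name: Install PHP and Apache
  ansible.builtin.apt:
    name:
      - apache2
      - php
      - php-mysql
      - composer
    state: present

- name: Deploy the Apache vhost for the forum
  ansible.builtin.template:
    src: flarum.conf.j2
    dest: /etc/apache2/sites-available/flarum.conf
  notify: reload apache   # handler assumed to exist in the role

# Creating the MySQL database and running Flarum's interactive setup
# stays a documented manual step, as described above.
```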


If you make a habit of performing all changes via CI (Ansible/Chef/Salt, maybe Terraform if applicable), you get this for free too. See your playbooks as "Dockerfiles".
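
For instance, a minimal GitLab CI job that treats the playbook repo as the single source of changes could look like the sketch below; the file layout, branch name and runner image are assumptions, and SSH/inventory credentials are omitted:

```yaml
# Hypothetical .gitlab-ci.yml applying the homelab playbooks on every merge
stages:
  - deploy

apply_playbooks:
  stage: deploy
  image: python:3.12-slim        # any image where Ansible can be installed
  before_script:
    - pip install ansible
  script:
    # a --check run first would give a dry run; here we apply directly
    - ansible-playbook -i inventory/home.ini site.yml
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
```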


Where I work, maintaining servers is a big pain in the ass because of the ever-growing security and regulatory compliance requirements. The rules make patching things an exercise in red-tape frustration.

Last year, when we were asked to redeploy our servers because the OS version was being added to a nope list, we decided to migrate to Kubernetes so we don't have to manage servers anymore (the nodes are a control-plane thing, so not our problem). Now we just build our stuff on whatever curated image is available and ship it without worrying about OS patches.


>Now we just build our stuff on whatever curated image is available and ship it without worrying about OS patches.

So you basically replaced "we regularly have to update the OS" with "we regularly have to pull the newest image". Maybe it's because I have a Linux admin background, but I don't see that big of a difference here. Oh, and just using whatever curated image is there doesn't necessarily give you a secure environment. [0]

[0] https://snyk.io/blog/top-ten-most-popular-docker-images-each...


Hi. Sorry if I didn't make this clear. By curated I meant images that were security-scanned and approved before being published to the company's internal registry. Those images shouldn't have the issues listed in the link you posted.


The thing that helped me was realizing that for a homelab setup, running without the extra redundancy is fine. Now for me that meant running k8s on a single box, because I was specifically trying to get experience with it, and putting the control plane and the actual workloads on a single machine simplified the whole thing to the point that it was easy to get up and running; I had gotten bogged down in setting up a full production-grade cluster, but that wasn't even remotely needed for what I was doing.


Amen (:

Checking uptime on my FreeBSD home server ... 388 days. Probably 15-year-old hardware. That thing's a rock.

I can't say I'm very proud of my security posture, though (:

One day I'll do boot-to-ZFS and transubstantiate into the realm of never worrying about failed upgrades again.



Thanks — I had never heard of that (:


Our biggest project (with a bunch of different infrastructure stuff) is also the one with the least time spent per instance. We have months where no ops person even logged on to their VMs.

Our highest-maintenance stuff is entirely "the dev fucked up"/"the dev is clueless": stuff like a server deciding to dump gigabytes of logs per minute or running out of disk space because of some code error. The only difference compared to k8s is that the dead app would signal the dev first, not the actual monitoring we had.

And an honorable mention for WordPress "developers", who take first place in "most issues per server" every single year, even though we have very few WP installs compared to everything else. Stuff like "dev uploaded some plugins, got instantly hacked (partially - we force an outgoing proxy and that stopped full compromise), we reverted it, he uploaded the same stuff and got hacked again". So, nothing to do with the actual servers.

Stuff like setting up automation to deploy a Ceph cluster took some time... once. Now it takes nothing. We even managed to plug it into a k8s cluster.

> A server installed with half-decent care can run uninterrupted for a long long time, given minimal maintenance and usual care (update, and reboot if you change the kernel).

It could even go untouched if you set up automated updates and reboots. The red tape is usually the problem, like dumbly written support deals requiring notification for every restart even if the service is HA. Or some dumbo selling a service with an SLA but only on a single machine...


Even changing the kernel can often be done without rebooting the system depending on which distro you're using. Quite a few distros now include some sort of kernel livepatching service including Ubuntu, Amazon Linux, and RHEL.


Yeah, we're aware of that, but none of our pets is mission-critical enough to rule out 5 minutes of downtime, at most, in most cases.


That's understandable. Not every service needs a perfect uptime record.


My home lab sounds pretty similar to the author's - three compute nodes running Debian, and a single storage node (single point of failure, yes!) running TrueNAS Core.

I was initially pretty apprehensive about running Kubernetes on the compute nodes, my workloads all being special snowflakes and all. I looked at off-the-shelf apps like mailu for hosting my mail system, for instance, but I have some really bizarre postfix rules that it wouldn't support. So I was worried that I'd have to maintain Dockerfiles, and a registry, and lots of config files in Git, and all that.

And guess what? I do maintain Dockerfiles, and a registry, and lots of config files in Git, but the world didn't end. Once I got over the "this is different" hump, I actually found that the ability to pull an entire node out of service (or have fan failures do it for me) more than makes up for the difference. I no longer have awkward downtime when I need to reboot (or have to worry that the machines will reboot), or little bits of storage spread across lots of machines.


I fully agree. If you come from "old-school" administration, Docker and Kubernetes seem like massive black boxes that replace all your known configuration screws with fancy cloud terms. But once you get to know them, it just makes sense. Backups get a lot simpler, restoring state is easy and keeping things separated just becomes a lot easier.

That being said, I can only encourage the author with this plan. All those abstractions are great, but at least for me it was massively valuable to know what you are replacing and what an old-school setup is actually capable of.


Yes, there's really something about old-school system administration. I really appreciate how the Fedora and Debian SA teams do it; you can see it here: https://pagure.io/fedora-infrastructure/ - https://docs.fedoraproject.org/en-US/infra/ - https://dsa.debian.org/


Once you work in an enterprise IDC, setting up home labs is nothing but a liability: unwanted stuff like UPSes, cooling and more bills. I have one Intel NUC running FreeBSD 13 (dual SSD, 32 GB RAM and an 8th-gen Intel i7 CPU) with one jail and one VM. It acts as a backup server for my Ubuntu laptop and MacBook Pro. Nightly, I dump data to cloud providers for offsite reasons. Finally, I set up LXD, Docker and KVM on my Ubuntu dev laptop for testing. That is all. No more FreeNAS or 3 more servers, including 2 RPis. I made this change during the first lockdown. The fun and excitement went away once I handled all that expensive, employer-funded IT equipment: servers/firewalls/routers/WAFs ;)


You’re still doing a lot of home serving/IT there.

I take your point and generally agree. Leave the major hardware to the actual data centers. Stuff at home should be either resume driven or very efficient.


Agreed, I used to run enterprise hardware at home, and it was certainly fun to tinker with (Not to mention all of the blinking lights!).

Last year I ran some numbers and realized that it just wasn't worth it, as electricity costs alone made it fairly close to what Hetzner/OVH would charge for similar hardware. I also had power go down in my area, probably for the first time in the 5 or so years I've lived there, which I took as a sign to just migrate away.

I peer it into my internal network using WireGuard, so I barely notice a difference in my use, and now that electricity costs are skyrocketing in Europe, I'm certainly very happy I went this route.


I love LXC/LXD for my home server. Far easier to use and maintain than VMs, fast, and they use far fewer resources than VMs. And understanding containers is great foundational knowledge for working with Docker and K8s. They also work great with NextCloud, Plex, PostgreSQL, Zabbix and Samba. And each is separate: no risk of a library or an OS upgrade taking out an app (and my weekend along with it). Snapshots are the ultimate Ctrl-Z, and backups are a breeze once you get past the learning curve.

Ansible with ‘pet’ containers is the way to go. Use it to automate the backups and patching. Ansible playbooks are surprisingly easy to work with. Again, a learning curve, but it pays for itself within months.

Running a single machine with all the apps in a single environment is a recipe for tears, as you are always one OS upgrade, library requirement change, or hard drive failure away from disaster and hours of rebuilding.
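
As a rough sketch of what "automate the backups and patching" can look like with Ansible - the inventory group names, container list and snapshot naming are assumptions, not the parent's actual playbook:

```yaml
# Hypothetical patch-containers.yml: snapshot the pets first, then patch them
- name: Snapshot each LXD container before patching
  hosts: lxd_host
  tasks:
    - name: Take a pre-patch snapshot
      ansible.builtin.command: lxc snapshot {{ item }} pre-patch-{{ ansible_date_time.date }}
      loop: "{{ pet_containers }}"   # e.g. [nextcloud, plex, zabbix]

- name: Apply package updates inside the containers
  hosts: pet_containers
  tasks:
    - name: Upgrade packages
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true
```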


This sounds like my setup exactly, except instead of LXC/ansible, I’m using FreeBSD jails/saltstack.

I completely agree about the value of isolation. You can update, or rebuild for a new OS version, one service at a time - which I find helpful when pulling in updated packages/libraries. You can also cheaply experiment with a variation on your environment.


Would you use LXD in production?


If you build it yourself from source or use distro packages, yes. Running it from Canonical's snap package is a bit of a nightmare because of issues around the forced auto-updates.


Interesting, I run LXD from snaps but haven't noticed issues. Can you provide some examples? I may check that myself.


I've done that in the past. It was fun and nostalgic. Then I had to upgrade the OS to a newer version and things were not fun anymore. At all. And I remembered why VMs and containers are so extremely useful.


You are right, but he is using RHEL 9, so you have >10 years without upgrades ;)


Sounds like the author realized they have a pet that they've been trying to treat like some weird franken-cattle for some time.

The note about dns is fairly orthogonal to this though. Sensible naming should be the default regardless of the physical setup. My photos are at photos.home.mydomain.com, my nas is nas.home.mydomain.com. They're the same hardware - perhaps different containers.

There are a million ways to make your life harder in this world, and a bunch of ways to make it easier. I'd put docker in the simplify category, and a bunch of the other tooling the author mentions in the complexify side. YMMV.


Are there any good resources for what to mind as a sysadmin? It seems that most resources I find when searching, or that come across my radar (like this article) are focused on mechanism and how-to.

I often feel a bit out of my depth when dealing with sysadmin tasks. I can achieve pretty much any task I set out to, but it is difficult to ascertain if I have done so well, or with obvious gaps, or if I have reinvented some wheels.

Does anyone have a recommendation for in-depth "So you want to be a sysadmin..." content? Whether books, articles, presentations, or any other format.


Depends on what system you want to administrate. The topic is very broad.

But it never hurts to understand the fundamentals:

- networking (IPv4/IPv6/TCP/UDP/ICMP/various link layer technologies (ethernet, wifi, ppp, USB, ...)... and concepts like VPNs/VLAN/bridging/switching/routing/...)

- protocols (DHCP, DNS, NTP, SMTP, HTTP, TLS, ...)

- chosen OS fundamentals (so on Linux things like processes, memory, storage, process isolation, file access control, sockets, VFS, ...)

- OS tooling (process managers, logging tools, network configuration tools, firewall, NAT, ...)

- how to setup typical services (mail servers, file servers, http servers/proxies, DNS servers, DNS resolvers, ...)

- how to use tools to debug issues (tracing, packet capture, logging, probing, traffic generators, ...)

- how to use tools to monitor performance of the network and hosts

These are just some basics to run a home network or maybe a small business network.


This is a helpful review of the fundamentals to cover. I will definitely be doing my own searches for content on the topics.

Do you have any good references for resources in the vein of teaching the fundamentals? Not like "DNS: here is how to configure a BIND server" but like "DNS: here are the core concepts, what you will typically need to set up, common issues/confusions, and further reading."

I have tried going through RFCs for a lot of these, but they tend to be focused on implementation rather than "why". Similarly, software-specific content is "how to achieve this outcome with this implementation", but lacks context on why one would want that outcome.


I'm reading this book and it contains explanations about "why" too, in addition to some light historical context in places:

Douglas Comer: "Internetworking with TCP/IP Volume One 6th Edition"


He should make 1 big VM and make snapshots before trying new experiments.

That's how I'll go about the next server I rent. And that's what I did with my main project.

You can have nested VMs if you need them.

But do whatever you want ofc.


I've been there. Many of the technologies OP listed... it all becomes a nightmare without proper HA (but who wants to have _multiple_ servers at home - servers as in enterprise HW, not just any computer). Then you hit issues with licensing, old HW not being supported by recent releases (looking at you, ESXi), loud and heat-producing boxes in your closet... I am grateful for having had this awful experience, but I moved on as well.

For me, the middle ground seems to be a very plain Debian server, configured with Ansible and all services running in Docker. I consume a few official images and have a (free) pipeline set up to create and maintain some of my own. I appreciate the flexibility of moving a container over to different HW if needed, and also the isolation (both security and dependencies).

Going back to pure, OS-level configuration is always fun - and often necessary to understand how things work. It's very true that the new cattle methodology takes away any deeper level of troubleshooting and pushes “SRE” towards not only focusing on their app code but also away from understanding the OS mechanics. Instance has died? Replace it. How did it happen? Disk filled up because of wrong configuration? Bad code caused excessive logging? Server was compromised? _They_ will never know.
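
As a sketch of that middle ground - not the commenter's actual config; the container name, image and ports are placeholders - a single task from the community.docker collection can own a container's lifecycle on such a plain Debian host:

```yaml
# Hypothetical task from a "services" role on the Debian Docker host
- name: Run the RSS reader container
  community.docker.docker_container:
    name: rss
    image: ghcr.io/example/rss-reader:2.0
    state: started
    restart_policy: unless-stopped
    ports:
      - "127.0.0.1:8080:3000"
    volumes:
      - /srv/rss:/var/lib/rss
```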


  > cattle methodology takes away any deeper level of troubleshooting and
  > pushes “SRE” towards not only focusing on their app code but also away
  > from understanding the OS mechanics
This is a minor point, but as a former SRE: We are the people writing the layer between the OS and the product developers (owners of the "app code"). Understanding the OS mechanics is a core part of the job.

I know some places are trying to redefine "SRE" to mean "sysadmin", but that seems silly given that we'd need to invent a new term to mean the people who write and operate low-level infrastructure software that the rest of a modern distributed stack runs on.


SRE seems to mean something else at every company. As a current SRE, I agree that this is what _I_ do most of the time. But while saying that, I also don’t see myself as “true SRE”, not the way Google defined it. SRE should become the code owner once it becomes stable. This alone adds heavy emphasis on having a background in software development.

Most of the time, I see two kinds of SRE around me. 1. The “rebranded sysadmin” SRE. Typically heavy on Ops skills and deep OS level knowledge but no formal education or experience in Software development.

2. The “SDE in Cloud” SRE. Person coming from software development side, having very little Ops experience, relying on benefits of quick and disposable computational power from your favorite Cloud provider.

I don't mean to "shame" either of those cases; I am just pointing out that a person with years (or decades) of experience on one side cannot suddenly become a well-balanced SRE the way Google defined it.


I really like this article, just for the straightforwardness of the setup. Pets, not cattle, should be the home server mantra.

My setup is not quite as simple. I have one homeprod server running Proxmox with a number of single task VMs and LXCs. A task, for my purposes, is a set of one or more related services. So I have an internal proxy VM that also runs my dashboard. I have a media VM that runs the *arrs. I have an LXC that runs Jellyfin (GPU pass through is easier with LXC). A VM running Home Assistant OS. Etcetera.

Most of these VMs are running Docker on top of Alpine and a silly container management scheme I've cooked up[1]. I've found this setup really easy to wrap my head around, vs docker swarm or k8s or what have you. I'm even in the process of stripping dokku out of my stack in favor of this setup.

Edit: the homeprod environment also consists of a SFF desktop running VyOS and a number of Dell Wyse 3040 thin clients running the same basic stack as the VMs, most running zwave and zigbee gateways and one more running DHCP and DNS.

[1]: https://github.com/peterkeen/docker-compose-stack


Honestly, it sounds more as if the author is just not experienced enough to reap the rewards of containers/IaC.

Having some servers as pets is fine for the time being, but once you're used to cattle, the pesky little issues and upkeep start to annoy you more and more, and you either give up and leave them neglected, or slowly build up most of the infra you ran away from in the first place.


It depends on your software selection. I run machines that I do updates on twice a year, and that's all the maintenance they need. Less than a person-day per year!


You're right, I might've accidentally extrapolated my usage/experience onto the author; maybe his use case is indeed different.

I still don't know why you wouldn't use containers, though: even if you dislike Docker, something like LXC makes everything oh-so-much neater.


I tried both Docker and Bubblewrap. Containers help with managing dependency creep, but if you avoid that in the first place, you don't need containers.


If you only run your own software and are conscious about your dependencies, then dependency creep is gonna be less of an issue, but don't forget that your dependencies often bring theirs too. And third-party software is very often much less mindful of it.

But even besides dependency creep, containers simplify your software environment, backups, migrations, permissions (although containers are not a replacement for secure isolation), file system access, network routing, etc., etc.


Containers also make performance tuning and optimization a lot harder, and increase resource consumption by a far greater degree than people are willing to admit. On a by-container basis it's not a lot, a few hundred megabytes here and there, but it builds up very rapidly.

This is compounded by containerizing making the ecosystem more complex, which creates a demand for additional containers to manage and monitor all the containers.

I freed up like 20 GB of RAM by getting rid of everything container-related on my search engine server and switching to running everything on bare-metal Debian.


Containers of course add some overhead; however, it is negligible in the modern world.

I honestly don't know what you have to do to get 20 gigs of overhead with containers, dozens of full ubuntu containers with dev packages and stuff?


> Containers of course add some overhead; however, it is negligible in the modern world.

I feel the people most enthusiastically trying to convince you of this are the infrastructure providers who also coincidentally bill you for every megabyte you use.

> I honestly don't know what you have to do to get 20 gigs of overhead with containers, dozens of full ubuntu containers with dev packages and stuff?

Maybe 5 GB from the containers alone - they were pretty slim containers, but a few dozen of them in total. But I ran microk8s, which was a handful of gigabytes, and I could also get rid of Kibana and Grafana, which were a bunch more.


>I feel the people most enthusiastically trying to convince you of this are the infrastructure providers who also coincidentally bill you for every megabyte you use.

They don't, though? Cloud providers sell VM tiers, not individual megabytes, and even then, on Linux there is barely any overhead for anything but memory - and the memory overhead, again, if you optimize far enough, is negligible.

>but I ran microk8s which was a handful of gigabytes

My k3s with a dozen pods fits in a couple of gigs.


> They don't, though? Cloud providers sell VM tiers, not individual megabytes, and even then, on Linux there is barely any overhead for anything but memory - and the memory overhead, again, if you optimize far enough, is negligible.

They don't necessarily bill by the RAM, but definitely network and disk I/O, which are also magnified quite significantly. Especially considering you require redundancy/HA when dealing with the distributed computing

(because of that ol' fly in the ointment: p^r (1-p)^(n-r) * n!/(r!(n-r)!))

> My k3s with a dozen pods fits in a couple of gigs.

My Kibana instance alone could be characterized as "a couple of gigs". Though I produce quite a lot of logs. Although that is not a problem with logrotate and grep.


>They don't necessarily bill by the RAM, but definitely network and disk I/O, which are also magnified quite significantly. Especially considering you require redundancy/HA when dealing with the distributed computing

Network and Disk I/O come with basically 0 overhead with containers.

> Especially considering you require redundancy/HA when dealing with the distributed computing

Why? This is not an apples-to-apples comparison then. If you don't need HA, just run a single container; there are plenty of other benefits besides easier clustering.

>My Kibana instance alone could be characterized as "a couple of gigs". Though I produce quite a lot of logs. Although that is not a problem with logrotate and grep.

How is Kibana(!) relevant for container resource usage? Just don't use it and go with logrotate and grep? Even if you decide to go with a cluster, you can continue to use syslog and aggregate it :shrug:


It makes it easier in the sense of "more accessible", not simpler. The Docker network setup and namespace management are actually quite complex.


I don't find docker networking particularly complex, it's just a simple bridge and some iptables rules.


When you only need Docker to isolate your dependencies, it's unnecessary overhead.


You made a point about complexity, and I'm talking about that; why are you talking about overhead?

As for overhead, what is the quantified measure of it? Like sure there's overhead, but exactly how much?


You don't need to run a network bridge because you want to run two daemons of the same kind.


I never said you did. I'm asking you to quantify the costs because all engineering and ops decisions involve tradeoffs. Just because something incurs overhead, or is one of several solutions, doesn't mean that it's bad. It's just an additional factor for consideration. So what are the actual overhead costs?


Docker is not the only option when it comes to containers


As much as people on HN frown upon anything Reddit-related, /r/SelfHosted is an amazing source for what the article topic is about. In fact, I came across large organizations running the likes of Proxmox (VE), Ceph, MinIO, k3s, Grafana, etc., with a lot of info available in this subreddit. Highly recommended.


In my experience, Hacker News has been one of the places that's unusually sympathetic to Reddit, probably because it originated with YCombinator. If you get the impression that people here generally dislike Reddit, you should take that as a strong indicator that Reddit has actually become a cesspool.


> For my own personal tasks I plan on avoiding sudo at all. I've often found myself using it as shortcut to achieve things.

That's all I use (for administration stuff). I found out not too long ago that I don't even know the root password, or it isn't set up to allow users to su - dunno, never missed the ability until a couple of weeks ago when I got tired of typing sudo for multiple commands.

I also found out a long time ago that installing software to /opt turns into a big enough PITA that I'll go to all the trouble of making an RPM and installing stuff using the package manager. Usually I can find an old RPM script thingie to update, so it isn't too much hassle, and there are only a couple of libraries I use without (un)official packages. Why, one might ask, don't I try to upstream these RPMs so everyone will benefit? Well, I've tried a couple of times and they make it nearly impossible, so I just don't care anymore.


I think this is too old-school. You can still automate via Ansible, keep your playbooks and roles in Git, and run VMs.


Exactly this. I have a big nostalgia for old school sysadmin ways of working, but just use Ansible. It’s beyond easy and can be as simple or complex as you like. I use containers on my home server (with podman) alongside a bunch of other services and it’s all just so clean and easy to maintain as a collection of roles and playbooks.


I think homeserver and home lab are different things.

For the homeserver I use some Rockchip ARM computers. They run stuff that is (arguably) useful, like a file server and a local DNS with ad- and malware-filtering rules. These are two fanless ARM machines that use less than 15W idle, and they have a single OS and funny names. You could even say 2 machines is a bit too much already.

For the home lab I've set up a k3s environment on Oracle's free-tier cloud. No personal data or anything that would cause me problems if Oracle decides supporting my freeloading ass is not worth the hassle. I can test things there, for example recently dabbling in Prometheus, since my workplace will soon introduce it as part of a managed offering for Azure Kubernetes Service.


I can see the appeal. I've been working on containerizing apps at work. There's a lot of complication there. But I also really really want to be able to run updates whenever I want. Not just during a maintenance window when it's ok for services to be offline. And containers are the most likely route to get me there.

At home, well, I enjoy my work, so I containerize all my home stuff as well. And I have a plan to switch my main Linux box to Proxmox and then host both a Nomad cluster and a Rancher cluster.

If I weren't interested in the learning experience, I'd just stick with docker-compose based app deployment and Ansible for configuring the docker host vm.


Docker, especially with Portainer, is pretty serviceable for an at-home setup.

Once the line has been crossed with "I should be backing this up" or "I should have a dev copy vs. the one I'm using", the existing setup can port quite well into something like Proxmox, keeping everything in a homelab running relatively like an appliance with a minimum of manual system administration, maintenance and upgrading.


If Docker Swarm was a bit better, I'd have all my test instances at work running a Docker Swarm, Portainer, and Traefik stack. Unfortunately Swarm has some quirks that make running stateful apps a bit difficult.

For home use, I've been experimenting with Portainer. It seems to work well for apps I'm not developing and am just running.


Could you elaborate on the quirks regarding stateful apps?


So, for context, my experience is limited to trying to get a MariaDB Galera cluster running, specifically using the Bitnami image, so my issues might not apply to every single stateful app out there. I'm also running all of this on vSphere in our own data center, not in the cloud.

Swarm does not support dependencies between services. See [0]. It also does not support deploying replicas one at a time. See [1] where I'm asking for that support.

In the case of Galera, you need a master node to be up and running before you add any new nodes. I'm pretty sure that when you're initiating any kind of stateful clustered app, you'd want to start with one node at a time to be safe. You can't do that in Swarm using a replicated service. All replicas start at the same time.

Using a service per instance might work, but you need to be sure you have storage figured out so that when you update your stack to add a new service to the stack, the initial service will get the data it was initiated with. (Since when you restart a stack to add the new service, the old service will also get restarted. If I'm remembering what I found correctly.)

Then there's updating services/replicas. You cannot have Swarm kill a service/replica until after the replacement is actually up and running. Which means you'll need to create a new volume every time you need to upgrade, otherwise you'll end up with two instances of your app using the same data.

To complicate things, as far as I can tell, Swarm doesn't yet support CSI plugins. So you're pretty much stuck with local or NFS storage. If you're using local storage when deploying new replicas/services, you'd better hope the first replica/service starts up on the same node it was on before...

All that combined means I haven't figured out how I can run a Galera cluster on Swarm. Even if I use a service per instance, updates are going to fail unless I do some deep customization on the Galera image I'm using to make it use unique data directories per startup. Even if I succeed in that, I'll still have to figure out how to clean out old data... I mean, I could manually add a new volume and service, then manually remove the old volume and service for each instance of Galera I'm running. But at that point, why bother with containers?

Anyway, I'm pretty sure I've done my research and am correct on all of this, but I'd be happy to be proven wrong. Swarm/Portainer/Traefik is a really really nice stack...

[0] https://github.com/moby/moby/issues/31333 [1] https://github.com/moby/moby/issues/43937
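
For what it's worth, the "service per instance" workaround discussed above usually ends up looking something like the sketch below - the hostnames, image tag and volume paths are assumptions, not a tested Galera deployment, and it doesn't solve the bootstrap-ordering problem:

```yaml
# Hypothetical stack file: one Swarm service per Galera node, pinned to a host
services:
  galera-1:
    image: bitnami/mariadb-galera:10.6
    volumes:
      - /srv/galera-1:/bitnami/mariadb      # local storage on the pinned node
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-node-1
  galera-2:
    image: bitnami/mariadb-galera:10.6
    volumes:
      - /srv/galera-2:/bitnami/mariadb
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-node-2
```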


If you are interested in making this work despite the constraints, I am sure that there is a way to work around all these issues.

About [0]/[1]: I guess you are right that this doesn't work out of the box, but it could possibly be worked around with a custom entrypoint that behaves differently depending on which slot the task is running in.

> (Since when you restart a stack to add the new service, the old service will also get restarted. If I'm remembering what I found correctly.)

Are you sure the Docker Image digest did not change? Have you tried pinning an actual Docker Image digest?

> Then there's updating services/replicas. You cannot have Swarm kill a service/replica until after the replacement is actually up and running. Which means you'll need to create a new volume every time you need to upgrade, otherwise you'll end up with two instances of your app using the same data.

Is this true even with "oder: stop-first"?

> To complicate things, as far as I can tell, Swarm doesn't yet support CSI plugins. So you're pretty much stuck with local or nfs storage. If you're using local storage when deploying new replicas/services, you better hope the first replica/service starts up on same node it was on before...

True, but there are still some volume plugins that work around this, and local storage should work if you use labels to pin the replicas to nodes.


Finally have time to look into your suggestions. Hopefully you check your comments every once in a while...

> Are you sure the Docker Image digest did not change? Have you tried pinning an actual Docker Image digest?

Mostly sure. Many of my tests only changed the docker-compose file, not the actual image. So even though GitLab was rebuilding the image, the image digest would not have changed. I'll try to find time to pin the digest just to double check, though.

> Is this true even with "oder: stop-first"?

Er, did you mean "order"? I only see `--update-order` as a flag on the `docker service update` command. I do not see it in the docker-compose specification. So far all my tests have been through Portainer's stack deployment feature. So all changes are in my docker-compose file.

Maybe it would just work if I stuck it in the deploy.update section? I'll try.

> True, but there are still some volume plugins that work around and local storage should work if you use labels to pin the replicas to nodes.

I have tried pinning specific services to specific nodes to make local storage work. And I've used labels to force only one replica per node when using replicas.

What volume plugins are you thinking of? I haven't found any that seem to be maintained outside of local storage and nfs. And maybe some that would work if I were in some cloud host...

Anyway, thanks for giving me a couple things to try. :)


About the order: I mean the order in the deploy config:

```
deploy:
  mode: replicated
  replicas: 3
  update_config:
    order: stop-first
    parallelism: 1
  rollback_config:
    order: stop-first
    parallelism: 1
  restart_policy:
    condition: on-failure
```

For the volume plugins:

We are using Hetzner, and this one works great: https://github.com/costela/docker-volume-hetzner . Also, there exists one for glusterfs (https://github.com/chrisbecke/glusterfs-volume).


Thanks!

I also found the docs: https://docs.docker.com/compose/compose-file/deploy/#update_... Not sure how I missed that when I was looking at it before... :\

I'll look into those plugins.


I learned about a lot of things by watching videos by Bret Fisher. He has a lot of good resources on running Docker Swarm


If you have more than one pod running at the same time, that is - otherwise there will be downtime if the container spawns somewhere else. Or am I missing something?


I assume you mean at work?

Yes, we'd have more than one instance of the app running. Ideally 3 or 5 instances. And they'd be load balanced. So I could update the app one instance at a time with no downtime. At least that is the goal.


Can someone explain "pets" and "cattle"? I've not seen this usage before; I think I get it but want to be sure. Pets are servers for love and fun, cattle are servers exploited for loveless meat & milk?


"pets" - servers which require unique, individual care, love, naming and feeding and are irreplaceable in their function. You know it's a "pet" server when you think long and hard when you name it

"cattle" - servers which have no distinct personality traits and are fed, loved and cared for as a group or herd. You know it's a "cattle" server when the name isn't something you can easily remember

Neither pets nor cattle are "exploited"; they all have their jobs to do, however they have an ROI and also a consequence of being unique or not. Cattle tend to be born, graze and die, and it's part of the cycle. Pets die, and you are emotionally upset and do something stupid -- and you find yourself doing unholy things to resurrect them, like using backup tapes or something.


"Exploited" was my failed attempt at humour :)


This post has the slide that was seared into my memory when I first heard the analogy in 2012-2013 or so:

https://www.engineyard.com/blog/pets-vs-cattle/

Direct link to slide screencap here:

https://www.engineyard.com/wp-content/uploads/2022/02/vCloud...



I can't even expand half the acronyms in this article. When do people have time to work on anything besides keeping their "system" up to date? python-is-python2? Maybe use sed to write all your code, too.


And the cycle begins once again :)


Automation is useful even in homelab environments, since manual configuration introduces human errors, and automation can speed up recovery from disasters. This article is just doing it for the sake of it. The author calls it old-school administration, but he'd better call it outdated and inefficient Linux administration.


Automation is still subject to human errors, and can propagate them much farther.


So do what you would do in the old school: have a staging/testing environment.

The nice thing about automation is you can easily rebuild that to a clean state, rather than having to remember how staging differs from prod. And as you write your automation and find those human errors, you put the fixes in code so that you don't forget a step. There's nothing more frustrating than thinking you've solved the problem in staging, only to run a command in prod that breaks things more, and in a different way, because it turns out the fix was a combination of earlier experimentation plus the command you just ran.


Automation makes things repeatable. If errors happen, in the "cattle not pets" mindset, you nuke it from orbit and let Ansible make you a new one from a clean Debian/RHEL image. As long as your backup of user data is good (better double-check that script, if you have a script for that), it doesn't matter.

As long as you avoid technology that takes more than a few minutes to set up, rebuilding is trivial.


I agree with him: simpler is better, and I still have not jumped on the container bandwagon. I did use jails on FreeBSD, and from my limited experience, I still think jails are much better than anything Linux offers.


Nice, I did the same but with FreeBSD and jails (for organizational reasons). Having a pet server makes you learn the system better and solve problems instead of destroying and re-creating... and it's fun.


In your list of things to test, let me save you some time/effort:

"Hyper-V" -- Don't. Just... don't. Don't waste your time.

Source: ran Hyper-V in production for 3 years. Still have nightmares.



