> With a part time person you will see downtime when a machine fails
If a hardware failure causes downtime you're doing it wrong. Additionally, big cloud scaring people away from hardware with marketing and FUD has been very effective. Modern hardware is insanely reliable and performant - I don't think I've seen a datacenter/enterprise NVMe drive fail yet. It's not 2005 with spinning disks and power supplies blowing up left and right anymore.
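To make "doing it wrong" concrete: the fix is that no single machine is allowed to matter. A minimal sketch of the idea - client-side for illustration, with made-up host names; in practice this usually lives in a load balancer or a VRRP/keepalived pair rather than application code:

```python
# Failover sketch: two independent boxes serve the same data; if one dies,
# traffic just goes to the other. Host names are made up for illustration.
import urllib.request
import urllib.error

NODES = [
    "https://node-a.example.internal",  # primary (assumed)
    "https://node-b.example.internal",  # identical standby (assumed)
]

def fetch(path: str) -> bytes:
    last_error = None
    for node in NODES:
        try:
            # Short timeout so a dead node costs seconds, not minutes.
            with urllib.request.urlopen(node + path, timeout=2) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # this node is down/unreachable; try the next
    raise RuntimeError(f"all nodes failed, last error: {last_error}")
```

The point isn't this particular loop; it's that once redundancy is designed in, a dead box is a warranty ticket, not an outage.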
> With 1 person, that person will sometimes be on vacation when a zero day takes you down. With 2 people, 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have enough people that you have redundancy for human issues and the ability to train people in whatever is the latest needed.
Hardware vendors (Dell, etc) have highly-discounted warranty services. In the event of a hardware failure you open a ticket and they dispatch someone directly to the facility (often within hours by SLA) and it gets handled.
Same thing for shipping HW directly to the co-lo, where they rack/cable/bootstrap it for a nominal fee, with remote hands for weird edge cases, etc.
A lot of takes here and elsewhere seem to be either big-cloud or Meta-level datacenter. I have operated POPs in a dozen co-location ("datacenter") facilities (a cabinet or two each) that no one on staff ever set foot in, with hardware we owned (and/or financed) that no one ever saw or touched. We operated this with two people looking after it as part of their broader roles and responsibilities, and frankly they didn't have much to do.
There is an entire industry that provides any number of highly flexible and cost-effective approaches for everything in between.
To me, the downside of on-premise hardware isn't hardware swap-out, it's just dealing with hardware in general. All hardware needs updates, which means downtime for that hardware. Also, anyone who has been in this industry long enough has been around for an "Oh, we will just replace that broken piece of hardware" that ended up as "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, the hardware was rejected, or just plain "Actually, THAT failure mode isn't redundant."
That can happen to Public Cloud as well, but since they work with hardware at much, much larger scale and, most of the time, build their own hardware and software, they are much more aware of the sharp edges.
Finally, with the Broadcom acquisition of VMware, what virtualization software are you using, and is it really cheaper than the cloud?
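To be clear on the update point: planned maintenance can usually be rotated through one box at a time so the service itself stays up - the horror stories are when that rotation goes sideways. A rough sketch of the happy path; drain(), patch(), and health_ok() are hypothetical stand-ins for whatever your orchestration and monitoring actually provide:

```python
# Rolling-maintenance sketch (the happy path): patch one host at a time so
# per-box downtime never becomes service downtime. drain(), patch(), and
# health_ok() are hypothetical placeholders, not a real API.
import time

HOSTS = ["web-1", "web-2", "web-3"]  # assumed fleet with spare capacity

def drain(host: str) -> None:
    pass  # placeholder: stop routing new traffic to this host

def patch(host: str) -> None:
    pass  # placeholder: apply firmware/OS updates and reboot

def health_ok(host: str) -> bool:
    return True  # placeholder: real check would hit a health endpoint

def rolling_update(hosts: list[str]) -> None:
    for host in hosts:
        drain(host)
        patch(host)
        # Don't move on until this box is verifiably back in rotation.
        while not health_ok(host):
            time.sleep(10)
```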
> Also, anyone who has been in this industry long enough has been around for an "Oh, we will just replace that broken piece of hardware" that ended up as "WHY IS EVERYTHING ON FIRE?" because versions didn't match up, the hardware was rejected
I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry, but we can't fix this." With the warranty SLA, the worst-case scenario is they'll just replace the entire machine if they have to, although I don't remember ever seeing it come to that.
> just plain "Actually, THAT failure mode isn't redundant."
When it comes down to it, similar issues exist with clouds - regions, availability zones, etc. Big clouds have had multiple widespread outages just this year[0].
From that reference you can see that MS and Amazon themselves struggle to design, build, and run solutions for their own products in their own clouds.
It's always interesting to see marquee household name companies/products/solutions go down when US-East (or whatever) is having a bad day again.
Cloud can be a lot of things but a silver bullet for reliability and uptime isn't one of them.
>I've been doing this for 25 years and I'm not sure what this means. Dell isn't going to come back to you and say "sorry but we can't fix this".
Dell/EMC says, "Hey, here is a drive replacement." We do it; 2 hours later, the volume is knocked offline. Apparently there was a mismatch between the backplane version and the drive version, and through some weird edge case it knocked the volume offline. Yes, they fixed it; no, it wasn't pretty, since a bunch of applications had to be recovered.
No, public clouds are not 100% reliable either. It's just that their failures tend to be you twiddling your thumbs vs hair on fire on the phone with the vendor trying to get it resolved.
> Dell/EMC says, "Hey, here is a drive replacement." We do it; 2 hours later, the volume is knocked offline. Apparently there was a mismatch between the backplane version and the drive version, and through some weird edge case it knocked the volume offline. Yes, they fixed it; no, it wasn't pretty, since a bunch of applications had to be recovered.
Anecdotal (as is my position). I can theoretically understand this happening, but not only have I never seen it, such an issue would need to be escalated. That's a "this is unacceptable" high-level phone call - a call where you more than likely have a chance of someone in actual authority answering, because IME unless you have SERIOUS spend with big cloud you'll be lucky to make it a rung or two up the sales/support ladder.
Plus backups and redundancies that should prevent even the failure of a chassis/storage/etc from being a critical issue.
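And backups only count if they restore. A minimal sketch of the kind of scheduled restore check that keeps "we have backups" honest - the pg_restore invocation, scratch database, and sanity-check table are illustrative assumptions, not a prescription:

```python
# Restore-verification sketch: actually restore the latest dump into a
# scratch database and run a sanity query. The commands, database name,
# and "orders" table are illustrative assumptions, not a real setup.
import subprocess

def verify_backup(dump_path: str) -> bool:
    restore = subprocess.run(
        ["pg_restore", "--clean", "--if-exists", "--dbname=scratch", dump_path]
    )
    if restore.returncode != 0:
        return False  # the dump itself is broken or incomplete
    check = subprocess.run(
        ["psql", "--dbname=scratch", "-c", "SELECT count(*) FROM orders;"]
    )
    return check.returncode == 0  # data is present and queryable

if __name__ == "__main__":
    ok = verify_backup("/backups/latest.dump")  # hypothetical path
    print("backup verified" if ok else "BACKUP FAILED VERIFICATION")
```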
> their failures tend to be you twiddling your thumbs vs hair on fire on the phone with the vendor trying to get it resolved
As a Founder/CTO I have the opposite take - put me and my team in a position to /do something/ vs sitting around waiting for AWS to come back whenever it decides to, while they obscure comms, don't update the fake status dashboards, etc. Meanwhile you're telling your customer, "Umm, we don't know - Amazon has a problem. When it comes back, I guess it's back."
Coming from a background of telecom, healthcare, and nuclear energy I can't believe that even flies.