
9s don’t have to drop if you increase the time period! “We still guarantee the same 9s just over 3450 years now”.


At a company where I worked, the tool measuring downtime ran on the same server, so even if the server was down it still showed 100% up.

If the server didn't work, the tool to measure it didn't work either! Genius.


This happened to AWS too.

February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.

https://aws.amazon.com/message/41926/



Five times is no longer a couple. You can use stronger words there.


It happened a murder of times.


Ha! Shall I bookmark this for the eventual wiki page?


https://www.youtube.com/watch?v=HxP4wi4DhA0

Maybe they should start using real software instead of mathematicians' toy langs


Have we ever figured out what “red” means? I understand they’ve only ever gone to yellow.


If it goes red, we aren't alive to see it


I'm sure we need to go to Blackwatch Plaid first.



Published in the same week of October ...9 years ago ...Spooky...


I used to work at a company where the SLA was measured as the percentage of successful requests on the server. If the load balancer (or DNS or anything else network) was dropping everything on the floor, you'd have no 500s and 100% SLA compliance.
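A rough sketch of why that metric hides the outage, assuming a hypothetical incident where clients attempt 1000 requests and the load balancer drops 900 of them. The function names and numbers are made up for illustration, not taken from any real system:

```python
def server_side_availability(status_codes):
    """Share of requests the server actually saw that did not return 5xx."""
    if not status_codes:
        return 1.0  # nothing reached the server, so nothing counts as failed
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def client_side_availability(attempted, status_codes):
    """Share of requests the clients attempted that got a non-5xx answer."""
    if attempted == 0:
        return 1.0
    ok = sum(1 for code in status_codes if code < 500)
    return ok / attempted


# Hypothetical outage: 1000 attempts, only 100 make it past the load balancer.
attempted = 1000
reached_server = [200] * 100

print(server_side_availability(reached_server))             # 1.0
print(client_side_availability(attempted, reached_server))  # 0.1
```

The server-side number stays at 100% because dropped requests never show up in its denominator; measuring from the client side (or from an external probe) is what exposes the gap.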


Similar to hosting your support ticketing system on the same infra. "What problem? Nobody's complaining"


I’ve been a customer of at least four separate products where this was true.

I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>


About 15 years ago I spent a fair amount of time finding an external monitoring service that did not run on AWS and looked like a sustainable business instead of a VC-fueled acquisition target. It was for our belt-and-braces secondary monitoring, since it's not smart to trust CloudWatch to send notifications when it's AWS's shit that's down.

Sadly while I still use that tool a couple of jobs/companies later - I no longer recommend it because it migrated to AWS a few years back.

(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collection of various inexpensive VPSes and my and other devs' home machines.)
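For anyone wanting to try the same thing, a minimal sketch of what one of those cron-driven checks might look like, using only the Python standard library. The `probe` and `failed_urls` names, the timeout, and the "anything below 500 counts as up" rule are assumptions for illustration, not the actual scripts described above:

```python
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a status below 500 within the timeout.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; a 4xx still means the service answered.
        return err.code < 500
    except OSError:
        # Covers URLError, connection failures, DNS failures, and timeouts.
        return False


def failed_urls(urls, opener=urllib.request.urlopen):
    """Return the subset of URLs whose probe failed."""
    return [url for url in urls if not probe(url, opener=opener)]
```

A cron entry on each machine could call `failed_urls` with the endpoints to watch and send an alert through some channel that does not depend on the provider being monitored; how the alert is delivered is left out here on purpose.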


Nagios is still a thing and you can host it wherever you like.


Interestingly, the reason I originally looked for and started using it was as an unapproved "shadow IT" response to an in-house Nagios setup that was configured and managed so badly it had _way_ more downtime than any of the services I'd get shouted at about if customers noticed them down before we did...

(No disrespect to Nagios, I'm sure a competently managed installation is capable of being way better than what I had to put up with.)


If it's not on the dashboard, it didn't happen


Common SLA windows are hour, day, week, month, quarter, and year. They're out of SLA for all of those now.

When your SLA holds within a joke SLA window, you know you goofed.

"Five nines, but you didn't say which nines. 89.9999...", etc.


These are typically calculated system-wide, so if you include all regions, technically only a fraction of customers are impacted.


Customers in all regions were affected…


Indirectly yes but not directly.

Our only impact was some Atlassian tools.


I shoot for 9 fives of availability.


5555.55555% Really stupendous availableness!!!


I see what you did there, mister :P


I prefer shooting for eight eights.


You mean nine fives.



