Gmail having issues (google.com)
644 points by mangoman on Dec 15, 2020 | 432 comments



Just got this from the ProtonMail team:

> Dear ProtonMail user,

Starting at around 4:30PM New York (10:30PM Zurich), Gmail suffered a global outage.

A catastrophic failure at Gmail is causing emails sent to Gmail to permanently fail and bounce back. The error message from Gmail is the following:

550-5.1.1 The email account that you tried to reach does not exist.

This is a global issue, and it impacts all email providers trying to send email to Gmail, not just ProtonMail.

Because Gmail is sending a permanent failure, our mail servers will not automatically retry sending these messages (this is standard practice at all email services for handling permanent failures).

We are closely monitoring the situation. At this time, little can be done until Google fixes the problem. We recommend attempting to resend the messages to Gmail users when Google has fixed the problem. You can find the latest status from Google's status page:

https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

Best Regards, The ProtonMail Team


This is the Nightmare Scenario for mailing lists.

Many of them auto-unsubscribe after a bounce.


I said this in another comment, but this seems like a naive way to react to an "address does not exist" error for an address they've already delivered to before. The only legit scenario in which that happens is when the user deletes the address, which is a rare event (pretty much always <= 1 time in the lifetime of any address), and there shouldn't be anything wrong with treating that kind of situation the same as any soft error. If you're wrong, your mail will just get rejected a few more times anyway, and you'll know it's genuinely a dead end.

The underlying issue (wherever this occurs) seems to be lack of nuance regarding error codes when people try to implement robust systems. Different codes imply different things and shouldn't all just fall back into generic buckets.


> I said this in another comment but this seems like a naive way to react to an "address does not exist error" that they've already delivered to before.

Like HTTP, SMTP is designed to be stateless, so the remote server shouldn't be returning a permanent error in temporary-failure scenarios in the first place.

The default error should be 450: "Requested action not taken – The user’s mailbox is unavailable”, not "the user has deleted everything and left".

These standards worked well before big players came along and said, "My responses mean whatever I choose them to mean, and that meaning doesn't always overlap with the established standards." The only exception is spam, and we now have standards to help reduce it.


Your answer kind of misses the point GP was trying to make.

Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record. In this case the returned "user doesn't exist" error is intended behavior of the mail server and the post you replied to still stands. If you sent to that email successfully earlier, it's much more likely that the server is responding erroneously than that the email actually got deleted.


> Your answer kind of misses the point GP was trying to make.

Actually, I don't think so.

> Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record.

As a system administrator and/or provider you have to think about worst-case scenarios and provide sensible defaults. Your mail gateway should have some heartbeat checks for the subsystems it depends on (AuthZ, AuthN, Storage, etc.) and it should switch to a fail-safe mode if something happens. Auth is unreliable? Switch to soft-fail on everyone regardless of e-mail validity. You can hard-fail the invalid ones later, when Auth is sane again.

Storage is unreliable? Queue until buffer fills, then switch to error 421 (The service is unavailable due to a connection problem: it may refer to an exceeded limit of simultaneous connections, or a more general temporary problem) or return a similar error.

SMTP allows a lot of transient error communication. Postfix, etc. has a lot of hooks to handle this stuff. Just do it. Being Google doesn't allow you to manage your services irresponsibly. If we can think it, they should be able to do it too.
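A minimal sketch of that fail-safe behaviour (plain Python, with hypothetical stand-ins auth_backend_healthy and user_exists for whatever heartbeat checks and lookups the gateway really has; nothing Gmail-specific): while the lookup backend is degraded, every RCPT gets a 4xx temporary failure instead of a 5xx, so remote MTAs queue and retry instead of hard-bouncing.

    def auth_backend_healthy() -> bool:
        # Hypothetical heartbeat check against the auth/storage subsystems.
        return False  # pretend the backend is degraded right now

    def user_exists(address: str) -> bool:
        # Hypothetical mailbox lookup, only trustworthy when the backend is healthy.
        return address == "alice@example.com"

    def rcpt_response(address: str) -> str:
        if not auth_backend_healthy():
            # Degraded backend: soft-fail everyone, regardless of address validity.
            return "451 4.3.0 Temporary lookup failure, please retry later"
        if not user_exists(address):
            # Healthy backend: the negative answer can be trusted.
            return "550 5.1.1 The email account that you tried to reach does not exist"
        return "250 2.1.5 OK"

    print(rcpt_response("bob@example.com"))  # -> 451 ... while degraded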


Technically speaking it's possible to soft bounce upon 5xx errors, but in practice, retrying even when the destination tells you not to is the quickest way to get your reputation ruined.

Google SMTP servers should have returned a soft bounce here (not hard bounce), so then retry can work.


But then why would Google's mailserver not know that it once delivered email to that mailbox?

If the protocol is stateful, why should the state be kept by the "sender" and not by the "receiver"? Being stateless removes this ambiguity, in my opinion.

Also we should remember how bad sending emails to a non-existent address is for spam reputation, and thus I would not blame it on the mailing list for being "overly cautious".


The situation here is that the service was so borked that it didn't know what it didn't know.

Hard-failing good addresses is much worse than soft-failing bad addresses. In the latter case, the remote sender tries again later and eventually gets a hard bounce. In the former, good addresses are permanently dropped from numerous services, and sent mail is lost rather than retried.

Critical failures should soft bounce until positively determined otherwise.


Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place. This issue is Google sending the wrong error code because of a problem on their end.

Mailing lists believing what an email provider tells them and acting in an overly cautious way is a separate issue.


> Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place.

This can't work; you can say that gmail's system should have a component that recognizes the difference between various failures, but that new component can itself fail. You can't solve the problem of "what if something fails" by saying "just add a new component that won't fail".


Of course it can. Software is complex and that complexity can cause all kinds of problems, as can the fact that the networks linking computers are unreliable, but software is fundamentally deterministic. If you write a piece of code that returns a temporary failure when it can't look up whether a user exists, that code will not mysteriously change itself to start returning permanent user does not exist errors. (Now, if your overall stack is designed in such a way that you can't reliably tell the difference between lookup failures and users that don't exist, you have a problem - but the problem is with the design of the system, not some inherent problem with software.)

Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.


> but software is fundamentally deterministic.

That's true, but human behavior is also fundamentally deterministic, and those two observations are about equally useful.

> Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.

No it isn't. Those are deterministic too.


> that code will not mysteriously change itself to start returning permanent user does not exist errors

That is true in a perfect world. In the current world, there are all sorts of ways that code implemented one day does not run the same the next day. Say the code is in an interpreted language and an unrelated sysop updates the language runtime in a way that changes the behavior. Again, in a perfect world that doesn't happen, but that is not always the world we live in. I have great sympathy with people who treat software systems AS IF they were "physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways".


> doesn't fail completely but cannot access part of the data

If a mail server can't tell whether a user/email is valid, it should either return a temporary failure or accept and queue.

Unless of course you're too big to fail, then you just do whatever you want.


I think we’re just teasing at the notion that “permanent failure” isn’t a hard and fast distinction. I think some polite retry policy is not unreasonable even for the most explicit “permanent failure” response from a remote server. Imagine the most extreme example: hackers take over the remote server and make it respond with “permanent failure.” After a day, the legit owners regain control of the system. You can’t really argue that “the remote server never should have delivered that response unless the failure truly was permanent,” because clearly there was a mismatch between the apparent intent behind the response and the actual intent.


The issue is that hard bounces can cause big issues with your email sending reputation, and too many can make you lose access to mailing services such as Amazon SES, so you're encouraged at all points during the implementation of anything that sends email to blacklist any bounced emails. This of course works fine, right up until Gmail starts bouncing all emails.


I think it’s spot on. Gmail’s failure mode in this scenario isn’t correct. The rest of the internet is functioning as designed.


This is exactly it. The RFC has error codes for temporary failures (just like HTTP 503, for example). If you fail to implement the RFC, the joke's on you.


If Google and other major mail providers weren't opaque about this, then fine, but for me a single bounce is an immediate removal. I can't take the risk. I can't imagine the hell that would ensue trying to get through to Google to ask them to take me off their deliverability shitlist.


Has anybody ever received a reply from gmail's postmaster address?

I have good experience with them fixing issues related to their spam-related flagging for messages that are coming from our self-hosted email server, but never got any specific reply.


I 100% assure you that everyone handling gmail errors and getting burned isn’t just tossing failures into a single bucket. There’s a zillion reasons mail can bounce and all of them are taken into account. This is a particular bounce code that signifies that an ESP shouldn’t send email again to this address.

Email service providers are HIGHLY incentivized to act 100% in accordance with the wishes of the system where the mailbox exists because it’s highly likely that acting in any way that’s considered abusive could get your emails landing in a spam folder.

Mailboxes cease to exist thousands of times a day at places I've worked previously. Employees leave all the time and people shut down mailboxes. This is Google's fuckup, nobody else's.


There is actually a very good reason to drop these email addresses, and the reason is that a high rate of non-deliverable emails hurts your sender score. It's a total pain to get emails delivered to the major email providers in the first place, and you immediately land in spam (or with emails not delivered at all) if they don't trust the sending email server or your score is anything but stellar!


I have 2 responses to the sender reputation concern:

1. If the user's mail service penalizes you equally regardless of whether the recipient's address existed a day ago vs. never existed, that itself is absolutely inexcusable, nonsensical behavior that needs to be fixed. You shouldn't do that, just as you shouldn't shoot the mailman (or even arm yourself...) merely because he knocked a second time.

2. Notwithstanding the previous point, I don't buy this as valid justification anyway. The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two. The bounce rate increase due to such an event is very negligible here—people don't suddenly delete their accounts en masse. When that happens, it's clearly due to an outage, not because half the users at that domain suddenly decided to delete their accounts. (Which is something you can also easily detect across the domain as another useful signal to drastically lower the bounce rate across the entire domain, btw, if you're absolutely paranoid about your immaculate delivery rate dropping by an epsilon. But it shouldn't be necessary given how negligible the impact should be.)

So I don't buy this excuse one bit.
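A rough sketch of that retry policy (names and thresholds invented for illustration, not anyone's production logic): a "no such user" bounce for an address we delivered to recently is requeued and retried a couple of times, spaced out, before the address is actually suppressed.

    from datetime import datetime, timedelta

    RETRY_SPACING = timedelta(days=1)
    MAX_RETRIES = 2
    RECENTLY_DELIVERED = timedelta(days=30)

    # Per-address state a list manager might keep: last successful delivery
    # and how many retries we've already spent after a hard bounce.
    state = {"bob@gmail.example": {"last_delivery": datetime(2020, 12, 14), "retries": 0}}

    def on_no_such_user_bounce(address: str, now: datetime) -> str:
        rec = state.get(address)
        recently_ok = rec is not None and now - rec["last_delivery"] < RECENTLY_DELIVERED
        if recently_ok and rec["retries"] < MAX_RETRIES:
            rec["retries"] += 1
            return f"requeue, retry after {now + RETRY_SPACING}"
        return "suppress"  # never delivered before, or retries exhausted

    print(on_no_such_user_bounce("bob@gmail.example", datetime(2020, 12, 15)))  # requeue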


> The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two.

What you're proposing is to explicitly ignore the specification (which says that you should _not_ retry when you receive a 550) and try to implement a custom smart retry logic that handles temporary error cases, but also does not get you blocked.

> So I don't buy this excuse one bit.

I'm all for building resilient services, but "try to detect when a server incorrectly returns 550" is not something I would prioritize at all. I'd happily clean up manually after this occurrence rather than have this complicated retry logic. It's not an "excuse", it's a very sensible trade-off.


No, I am quite explicitly not ignoring the spec. It quite deliberately says should not, not must not. If anyone is ignoring the spec here, it's you, not me. Should not is sound advice; it's telling you what you're supposed to do when you don't have a reason to behave differently. You know, like how you "should not" leave the lights on when you leave your room. Or—more pertinently here—how you "should not" assume everyone is a liar. But when you actively see evidence that deviates from the norm, you are given the power—and arguably the responsibility—to exercise your discretion here to adapt to the situation. If the spec wanted blind obedience, it would say "must not" like it did in 60 other places, but it quite obviously and intentionally decided that would be unwise, and this scenario seems like a pretty clear illustration of that.


But the RFC isn't only for senders, it's also for receivers, isn't it?

That means there are two sides to the interpretation of what SHOULD NOT means. And in this case, senders have, due to experience, interpreted what Google does when someone SHOULD NOTs:

- The sender SHOULD NOT send us the same sequence again when we reply 550, if they do they MUST go on our shitlist.

Obviously it's not so binary and it takes retrying to several different recipients, but people have very good reason to interpret this SHOULD NOT as MUST NOT.


No, that's not a sane way to interpret this RFC for the receiver either. I already answered this, so you'll have to go back to my earlier comment (this might be my last comment as I won't keep repeating myself): any system (be it Google's or anyone else's) that penalizes you equally regardless of whether the recipient's address existed a day ago vs. never existed is just plain trash. A sender that attempts delivery to an address that accepted their email a day ago is obviously unlikely to be a spammer; there's no justification for treating them as one. It is absolutely unreasonable to interpret the sentence this way. Just as it's unreasonable to interpret "the mailman shouldn't knock a second time when he's told the recipient has moved" as "I should never open the door for the mailman ever again if he does so".


Good callout. The underlying issue of the lack of nuance is probably /state/. Being more nuanced about these errors probably requires managing state, which tends to increase the complexity and scaling challenges.


Nuance is not called for. The standard states that a 5xx SMTP error is a permanent error and "The SMTP client SHOULD NOT repeat the exact request"

Gmail screwed up here by returning a 550 error; it's not anyone else's job to try to second-guess that or retry in contradiction of the accepted standard.

https://tools.ietf.org/html/rfc5321


Gmail screwed up, but that's beside the point. We're talking about designing robust systems. You don't design a robust system by assuming nobody will screw up!

Re: the RFC, note it says "should not", not "must not". That seems to suggest they acknowledge repeating might actually make sense in some cases. And honestly the practicalities of this situation and the risk-reward tradeoff seriously tilts toward repeating the request later regardless of what the RFC says. The world isn't going to end.


Try delivering to invalid email addresses too many times (too many of course being up to each mail provider), and you will be the one shitlisted (and rightfully so, as you are likely bruteforce enumerating valid email addresses).

For any small provider, getting on the shitlist is catastrophic as unlike the big providers, getting off of it will be hard / impossible.


Rules for thee, not for me


> And honestly the practicalities of this situation and the risk-reward tradeoff seriously tilts toward repeating the request later regardless of what the RFC says. The world isn't going to end.

That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

If you believe the standard is not robust enough to handle problems like this, first work towards a fix to the standard and then implement the solution. Not the other way round.


> That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

I didn't suggest people should apply this thought process in arbitrary cases. I said it should be applied in this case. You can take any thought process that gives a good outcome in one situation and obtain a bad outcome by applying it to the wrong situation. That's not an indictment of the thought process. It's just an indictment of the person failing to correctly judge its applicability.

That said, by all means, do try and go fix the standard; I wasn't trying to imply you shouldn't do that.


Ah I think I did not describe the repercussions of making exceptions (even if they are in highly specialized cases like this). If you allow yourself to make such exceptions, you diminish the motivation for you (or someone else) to fix the problem at the right place. Most workarounds tend to live forever.


There's no clear-cut rule here. Some workarounds stay workarounds and never get standardized. Some become so well-accepted and adopted that people then put them into standards. It's great to put things into standards, so by all means, do try to improve standards. But that shouldn't block you from everything. At the end of the day, standardization is just a means to an end, and the end is what matters here. Nobody cares if their mailman's knocks follows an RFC or not. They just want their mailman to deliver packages with reasonably minimal disruption.


> There's no clear-cut rule here

Exactly, that is why it is important to follow standards. Most engineering decisions are not clear-cut and are born out of tradeoffs. That is why we agree on standards that define those tradeoffs instead of every one of us having our own take on situations.

> Nobody cares if their mailman's knocks follows an RFC or not

If there is a Mailman RFC which says: "If someone opens the door and says `Mike does not live here' then DO NOT attempt delivering the same package"

THEN I expect the mailman to not bother me again, EVEN IF it was actually my mistake that I forgot my roommate Mike actually does live at this address.


I'm tired of arguing about this. Engineers agree on standards for a good reason, yes, but they also agree on "should not" rather than "must not" for a good reason too. I'll leave this as my last comment, but you might want to read the post-mortem. Turns out their implementation of the RFC wasn't even buggy. They just messed up the domain name in the configuration. Which you can only be resilient to by retrying the request sometime later.


But here’s the thing: the standard (like all standards) is obviously not robust enough to physically prevent responses which incorrectly indicate permanent failure.

These incorrect responses could be caused by mistakes which the remote server admins could reasonably avoid, like software bugs. I understand not having much sympathy for that case, especially from an organization with no shortage of resources. But they could also be caused by, for example, hackers or governments exerting control over the remote server temporarily.

A standard which explicitly refuses to acknowledge these possibilities is not what I would describe as “robust.” An obvious better alternative would be to set some standards around what constitutes a polite retry policy.


My understanding is that should not means that you should not try to retry. If I do retry then the other party can rightfully claim that I am DDoSing their service, trying to send emails to deleted accounts, or put me on a spam list. I do not think that ignoring the RFC and trying to cover up for Google is the best course of action here. Maybe, just maybe, this is the right time for people to realise what it really means to have an entity like Google. Because as it stands, we are going to have the DNS infrastructure moved over to them with DoH, and a similar outage is going to be even more devastating. The internet was designed to be resilient to failure because of its distributed nature, and right now this just shows why concentrating resources in one place is bad.


You "should not" repeat delivery in basically the same way the mailman "should not" knock a second time if he's told the recipient doesn't reside at the designated address. What "should not" means in these cases is: "knock only once, and assume you're being told the truth in the absence of further evidence to the contrary". But when you clearly saw the recipient reside there yesterday, it makes sense to try to knock and catch him again tomorrow. Because, you know, maybe something went wrong, e.g. maybe the person who opened the door didn't recognize the name (or whatever). At the end of the day, the mailman's job is to deliver the mail with minimal disruption, not to play hot potato with envelopes.


The terminology is well defined [0], so in this case, retrying is not ignoring the RFC.

It's a difficult one though, because as you rightfully state, covering up for Google is not the best course of action for the system as a whole, yet it's likely a good course of action for those users who didn't get their emails.

[0]: 4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

[1]: https://tools.ietf.org/html/rfc2119


In most Internet Engineering Task Force RFCs, the standard verbiage for "must not" usually is in fact "should not".


The phrase "must not" appears some 60 times in this RFC.


Thanks for pointing that out. I suppose an RFC writing style guide would be helpful to have consistency in language and interpretation.


The standard says “don’t resend,” it doesn’t say “assume the worst and begin removing user from all systems.” That was the mailing list software’s decision.


You generally avoid sending to known bad addresses or your reputation will be destroyed very quickly. The 550 response is (read: was) a clear "you fucked up, this user doesn't exist" prior to this.

I saw someone on Reddit say his SES was suspended for sending tons of bounced emails in a short period of time - it's taken very seriously by ESPs.

E: also user rtx a few comments below


We're not talking about repeating the exact request; a subsequent request for the same recipient would be to deliver a completely different message: whatever subsequent message is sent to the mailing list.


Right. In this case it's already pretty typical for mailing lists to track bounces and retry under some errors, so I imagined that part is mostly done, and the missing piece would be taking more care in checking the error conditions.


Aside - I'm not an expert but systems like MailChimp will get very worked up if your list has lots of undeliverable addresses on it. This can trigger an audit of your list which prevents sending, etc. These audits seem to take quite a while, in my very limited experience.


So what you're saying is, if you're annoyed by "subscribe to our mailing list" modal popups, "doesnotexist@garbage.blah" is better than "jeff@amazon.com"?


In practice, no, it's more nuanced than that. Any mailing list operated through any remotely legitimate ESP will require subscriptions to be confirmed/acknowledged up front before any delivery is attempted to a recipient. If the confirmation step fails, i.e. the "check your email and click a link to verify you really signed up" email bounces, or nobody ever clicks the link, the list owner isn't generally going to be penalized for that.

If you want revenge for modal popups, your best bet is to create a bunch of throwaway email accounts, subscribe to the mailing list from them, and start reporting the individual messages as spam when they arrive. Flag them as junk at the mailbox provider (Gmail, Outlook, etc.) and use the links in the List-Unsubscribe headers to flag them at the ESP's end, too.


If you're trying to get the web site's mail server blacklisted, definitely.


Aka throw the RFC out of the window and implement a broken system because Google did that?


> I said this in another comment but this seems like a naive way

That's the standards-compliant way. Also I'd argue that spec'ing your code to handle cases where Google fails that badly is (was?) a poor allocation of LoCs.


You're entirely missing the point by blaming this on Google. This is meant to detect and handle some failure modes, and they could happen to anyone (including Google), for reasons that can be both inside and outside their control.


I had this issue with GitLab. My email provider returned a permanent error one day (due to an issue on their end), so GitLab silently stopped sending any emails to my address. I checked my email in the preferences many times and had no idea it was blocked on GitLab's end. Eventually, after not getting any notifications, I contacted their customer service and was told of this hidden setting.

So if you are not getting any notifications from GitLab, even though your email is correct, I suggest contacting them and asking if you have been blocked due to an error.


I posted this as a problem in my problem validation platform[1] and a user has built a quick solution by displaying a token if the email service received an email from the sender.

[1]'Check email service status before sending emails' - https://needgap.com/problems/178-check-email-service-status-...


Great point. And email delivery services that have auto-suppression lists to protect reputation could potentially help too; at least they might be able to remove entries on behalf of their customers.


Good. I was hoping this was the case. Unfortunately I already moved to fastmail so there will be little benefit to me.


Oh no.


My account with Amazon went into review because of this. I hope their team is aware of it.


Interesting response. And spot on from the technical integrity side. It’s also more fair to email providers as a whole to treat them all the same and respect their error messages. I mean, maybe there’s even requirements in some jurisdictions to deal with the address not found error in a specific way. As an email sender I think I’d prefer the message get auto re-sent after Gmail comes back online though.


> Because Gmail is sending a permanent failure, our mail servers will not automatically retry sending these messages (this is standard practice at all email services for handling permanent failures).

I fear that this will lead to many lost mails. In my experience, users are often confused by the technical "Mail delivery failed" mails and tend to ignore them or write them off as spam.


> P.S. You might also consider asking your contacts who are still using Gmail to switch to ProtonMail for more private communications


Confirmed likewise.


This feels like a cheap shot at Google. Shit happens, and they're not immune to it even if the servers are located in Zurich. Running a datacenter is no easy task.


I see it as Protonmail explaining to their users that the failure is not on their end and why they can't do much about a remedy. Seems purely factual. A cheap shot would be generalizing from the event, but I don't see them doing that.


I think I got this completely wrong. What you and other responses are saying makes sense.


Being down is okay. Returning an error message that results in the data being thrown away instead of being requeued is not. Block incoming smtp connections until your app layer is fixed.


> Block incoming smtp connections until your app layer is fixed.

Or returning one of the 4xx status codes which indicate less-permanent failure state like:

- 451 Requested action aborted: local error in processing

Which is kinda like an HTTP internal server error as it can mean anything.


For my comment’s purposes, I assume if this was possible with a flag or config setting (and the code path existed), it would’ve already been done. Doesn’t seem like they can, so they should’ve pulled the handbrake and gone “full stop” without throwing everyone’s mail away (hence blocking incoming connections and let the mail sit in all of the external MTA queues).

Another option would’ve been to accept everything with a very lightweight smtp ingest service, journal it all, and play it back to the full frontend after their code fix was pushed out.

Not an SRE so ¯\_(ツ)_/¯ just some thoughts from my time in a similar role and similar pain points (but thankfully not at this scale)


Yeah, this is a particularly pernicious failure given how email works. Many mailing providers will just mark these as blacklisted, now, and lots of unsophisticated users likely won't notice.


I consider myself sophisticated enough, but my Bitwarden has 700 accounts, of which ~30% (the older ones) are registered with a gmail address, and the rest are handled behind G Suite. Granted that last bit might be partly my fault, even though I paid for it. But even for a "sophisticated" user, I have no easy way of knowing if any of these accounts have silently failed to function now, other than by the passage of time and eventually finding out.


Oh, absolutely, even for sophisticated users mitigating may be difficult or impossible depending on exactly what bounced and how. But you at least are aware that this happened, and that you have a problem. Think how many people are out there with no clue what this error meant, or that it signaled an ecosystem problem, or that just had hundreds or thousands of emails silently bounce and unsubscribe.


A lot of people and companies use Gmail. Email providers are definitely getting support requests from users that don't know what's going on.

This is not a cheap shot, but a message to inform users that it's an issue with Google that Protonmail can do nothing about.


More like "if mail to gmail fails it's not us, so please don't flood the support with complaints".

> Running a datacenter is no easy task.

Sure, but then there are very few companies which have more experience with running data centers and (normally) providing reliable email service.

So any outage for more than just a short time is very unusual. I'm really interested in what went wrong.


CEO of an email marketing platform here (EmailOctopus). If anyone's curious, here's a chart showing our bounce rate to Gmail addresses over the course of the week:

https://pbs.twimg.com/media/EpUE20UXYAEa_Uv?format=jpg&name=...

That's a peak of 90% of Gmail inboxes bouncing – and this has been going on for almost 24 hours.


I know this is your livelihood, but as someone who basically never wants marketing emails, all I can think is "nice", hopefully I get auto-unsubscribed from a ton of lists.


If they normally successfully deliver to gmail, it's safe to assume a large number of people who do receive their emails want to receive them.


This is very charitable. How many people live with the nuisance of mailings (they un- or knowingly subscribed to) vs. those who actually go through the trouble of unsubscribing/marking as spam in the hope of ridding their inbox of them?


I normally just delete mailing list mails. I don't even read them.

This year i decided to do "something" about it, so every mailing list mail received in my inbox that i don't want/care for gets an unsubscribe. It has already reduced my daily mails by a somewhat large amount. It's hard to say exactly how much, but i estimate around 10 emails less every day.

Most of the unsubscribed lists are from companies where i've purchased something and the seller took the liberty of subscribing me to their mailing list. Those are mostly pre-GDPR that i've just never gotten around to dealing with.

The exception is of course obvious spam mails, to which unsubscribing will probably do more harm than good.


That conclusion makes zero sense to me unless counting on the nebulous nature of the descriptor, “a large number”. They deliver successfully to my Gmail account on a regular basis so I must want to receive it? Feels like you’re telling me to stop dressing like a slut. ;)


Totally agree, especially as I signed up for exactly zero of them.

Rant: As a side note, I usually try and buy direct when shopping online rather than through Amazon (for all but the most trivial purchases) and this is the 2nd largest drawback (behind filling in CC and shipping info) - because I bought one item from you, once in my life, does not mean send me a daily email, and then when unsubscribing pretend like I signed up for them! For me it's one of the easiest ways to destroy brand loyalty/reputation.


This would affect all email types including emails like receipts, shipment confirmations, password resets, account verification.

Plenty of critical communications get caught in this storm...


How do the public gmail addresses compare to the enterprise (used to be G Suite, now Google Workspace) ones?


I would be very interested to know this as well. I am trying to switch my company over to Google Workspace right now and support has been telling me my signup issues will be "resolved in 48 hours or less."

What a joke. And this after we're leaving AWS Workmail because of bounced emails.

No luck with signing up so far.


Heavily recommend you don't switch your company over to Google. Microsoft seems to understand that in the enterprise world you actually have to have support personnel, not just an opaque AI without chance for appeal


Google has decent support for paying customers.


You can actually appeal things when you start paying.


Consider yourself lucky. I have some ad words in "approval" process for 6 months now. I kid you not - every Friday I receive an email stating that the update will be sent to me on Monday (insert date here). Then nothing happens on Monday until Friday comes and I get exactly the same copy, only the date is different. At this point I literally laugh.

About your query

I gather that you are concerned about your Ads Disapproval for your Google Ads Account.

Observation

I understand that this is taking a bit longer as we are working with a limited staff due to Global pandemic and there is another team who reviews the account so there can be a slight delay in the decision I apologize for the inconvenience caused as I understand this is not the answer which you are looking for but be rest assured I will get back to you on coming Friday 12/18/2020 end of business day.

For any further assistance, I am just an email away.

Sincerely,


SLA of less than 99.5%... Or if there are multiple issues, even sub 99%... That really is a joke...


Anecdotally, my enterprise account seems unaffected.


Also anecdotally, during the outage, test messages from my non-gmail account to my standalone/non-enterprise gmail accounts consistently bounced; test messages from my non-gmail account to my G Suite Business-associated account went through.


Serious question: how would you know that you are receiving ALL emails from ALL senders?


Totally valid, and I wouldn't. The status page indicates that "Google Workspaces" is affected, but I don't know if that is synonymous with what I have (which was Google Apps a decade ago, unsure now). All I can say is I was receiving emails during the affected window.


As an ESP, how much of a headache will this be for you in weeks/months to come? I'm guessing this throws a huge wrench in deliverability techniques--how're you handling it?


It's a real headache but should be fully reversible. @shmoogy hit the nail on the head: we'll run through our events in that timeframe, inspect the raw bounce reason to check it relates to the Gmail outage, then undo the actions that the bounce caused.

The reason why this is so nasty is not because Gmail went down, but because they returned a 5XX permanent failure and not a 4XX temporary failure for these bounces. Literally every email provider will respond to a permanent bounce by suppressing all further emails to that email address (it's permanent, after all!), so the fallout from this will be huge.


I would imagine since it's a known timeframe, domain, and error response, they can cleanly remove the suppression lists.

I logged into our sendgrid and mailgun accounts and manually purged all the failed gmail records.
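For anyone scripting the same cleanup against SendGrid, something along these lines should work with the v3 suppression API (endpoint paths and parameters quoted from memory, so double-check the current docs; the key and time window are placeholders):

    import requests

    API_KEY = "SG.xxxxx"  # placeholder
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}
    START, END = 1607983200, 1607997600  # Unix timestamps bracketing the outage

    # List bounces recorded during the window, keep only the Gmail 5.1.1 ones.
    resp = requests.get("https://api.sendgrid.com/v3/suppression/bounces",
                        headers=HEADERS,
                        params={"start_time": START, "end_time": END})
    affected = [b["email"] for b in resp.json()
                if b["email"].endswith("@gmail.com") and "5.1.1" in b.get("reason", "")]

    # Remove each affected address from the bounce suppression list.
    for email in affected:
        requests.delete(f"https://api.sendgrid.com/v3/suppression/bounces/{email}",
                        headers=HEADERS)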


Might also be affecting GSuite/Workspace emails.


The hard bounce status might be stored outside of your lists. I am not sure customers can easily change a hard bounce status themselves. Do you mean you just deleted those records with intent to re-add to reset the status? On our BigMailer platform this wouldn't work as hard bounce status would get preserved.


We use SendGrid and Mailgun right now, and both of these expose the suppression list, email address, time, and reason code + description. In SendGrid you can filter and mass-select to remove suppressions easily (which was great). In Mailgun I had to export a CSV and just removed them manually as there were not too many across my accounts.

Customers generally cannot change this on their end as far as I can imagine -- this is on the ESP end and is a protection built in because you are sending from their IP / Server and they don't take kindly to that.


+1 what Jonathan said. Typically, when email service providers are down the response code indicates a temporary issue with a soft bounce code, so you can still try to send to that address in the future.

The action for rectifying isn't too difficult, but the implications are still pretty big...


Mailgun added a few new suppressions due to bounced Gmail addresses. Hope ESPs just flush those out.


Thanks for sharing Jonathan, unprecedented situation. And that's just gmail.com addresses we can see data on, while there are all those business domains that use Google Apps for their email that probably experienced a similar issue...


What's this do to your mail-queue size - let's see that chart


Permanent failures, as these are being flagged, don't stay in the queue.


"Type: Permanent; SubType: General; Code: smtp; 550-5.1.1 The email account that you tried to reach does not exist. Please try 550-5.1.1 double-checking the recipient's email address for typos or 550-5.1.1 unnecessary spaces. Learn more at 550 5.1.1 https://support.google.com/mail/?p=NoSuchUser y128si147264pfg.177 - gsmtp"


This is pretty much the worst response possible. Hard bounces mean that email delivery services are going to start automatically removing, or at least stopping delivery to, entire slews of email addresses.

A lot of clean up is going to be needed as a result of this.

To add some more details, when using a 3rd party email delivery service, those services will either black-list or just outright remove email addresses when they get a hard bounce "email address no longer exists" message back.

Some providers make re-adding an address after a hard bounce a non-trivial task, since after all, the authority on that email address just said it doesn't exist.

This is going to be really ugly.


I really cannot believe they did not immediately hack in a new rule to their SMTP server: never return a 5xx (permanent failure), instead return a 421 (temporary failure try again later).

That simple fix buys them 24-72 hours to solve this properly.

Yeah, it burdens servers sending mail to them because now they have to hold on to all mail (including mail that really is permanently undeliverable) for another day or so, but that's still better than what's happening right now.


Why would that be better than just shutting off the delivery stack altogether?


5xx error results in suppression list addition of an email, so future emails won't be delivered (by most ESPs), and not returning MX response would probably be just as bad, or worse (or result in millions/billions of emails being re-queued due to timeouts?)

His solution would mean the exponential retry backoff baked into most services kicks in, which would buy them a few hours, and result in no lost emails and no suppression list additions.


Failure of response from the server is nearly always treated as temp failure, because it could be down to network connectivity, name resolution, etc.

That is a better scenario, than 5xx.


Inability to contact the destination would be treated as a temp-failure by the origin, and taking the service off the air could be effected instantly.


In case less than 100% of gmail is experiencing this bug.


This outage seems to have lasted for about 2.5 hours. Probably this was fixed by rolling back whatever caused it. (I don't think the rollout was finished before they resolved it; my mail server sends a lot of emails to Gmail addresses, and even at peak I was only seeing maybe about 1/3 of mails being rejected.)

There is no way that putting in a hardcoded hack like that would have been faster. Making the change is, of course, fast.

But then you need to review it (and this is a super risky change, so the review can't be rubber stamped). Build a production build and run all your qualification tests. (Hope you found all the tests that depend on permanent errors being signalled properly). And then roll it out globally, which again is a risky operation, but with the additional problem that rolling restarts simply can't be done faster than a certain speed since you can only restart so many processes at once while still continuing to serve traffic.

The kind of thing you describe simply can't be done by changing the SMTP server, in 2.5 hours. The best you could get is if there was some kind of abuse or security related articulation point in the system, with fast pushes as required by the problem domain but still with the sufficient power to either prevent the requests from reaching the SMTP server at all, or intercept and change the response.

As a trivial example, something like blocking the SMTP port with a firewall rule could have been viable. Though it has the cost of degrading performance for everyone rather than just the affected requests.


This has been going on for 2 days, not 2 hours.


The linked status page shows a 2.5 hour duration.

My mail server logs show about 20 failures in all of the last week until yesterday 20:43 CET, then 350 failures between 20:43-00:21, then nothing after that. So fair enough, from the client side rather than the status page it looks like 3.5 hours rather than 2.5.

But still, given that resolution time, the suggested solution of changing the SMTP server is absolutely ludicrous.


Yes. I email hundreds of thousands of Gmail users each week (yes, double opt in, they all want the mails!) and we immediately delete any user for whom any Gmail error comes up at all in order to keep a solid delivery record with them. Sounds like we might have deleted 80% of our list if we'd sent today..!


My sanity tests started acting flaky ~3 hours ago, I never thought it was Gmail...

Kind of happy I had to do something else and I didn't burn hours investigating.


So a new thing to do: quarantine addresses instead of deleting them, and if most addresses for one provider fail, don't drop them; give them another (maybe manually triggered) try later on.

(And if no such provider-wide failure is detected, delete the quarantined mail addresses.)


My guess is that's how most email service providers handle this - they don't actually delete the email and just have a flag on it - bounced, complaint, unsub. This way the list owner can run an export and see all the status codes.


Hope you have a backup just in case.


Yes, we're unusual in not relying on third parties for list management. We can rollback. Or I might just comment out the 'unsub on hard bounce' code for the rest of the week..! :)


Unsub on two consecutive bounces seems more reasonable to catch flukes (or Gmail going down)?


Yes, most likely! That is a common approach for 'soft bounces' in most list management systems (e.g. MailChimp).

The problem here is Gmail has been throwing out "NoSuchUser" errors which are an instant unsub in most systems because Gmail takes repeated delivery to non-existing addresses into account for deliverability purposes.

I'm extremely paranoid about email hygiene, tiny bounce rates and high delivery rates, so we aggressively unsubscribe troublesome addresses (often to the point of getting reader complaints about it) for many reasons beyond that, however.


> Gmail takes repeated delivery to non-existing addresses into account for deliverability purposes.

I think you mean "reputation purposes"?

If so, wow, that sucks. Their opaque rules have conditioned their counterparties to punish Google as hard as possible for a screwup.


> Their opaque rules have conditioned their counterparties to punish Google as hard as possible for a screwup.

Good for karma, bad for everyone though.


> I think you mean "reputation purposes"?

That better describes what I was trying to say, yes. Reputation then affecting deliverability.

Over 80% of our subscribers use Gmail so to say I'm paranoid about maintaining a good record with them is an understatement ;-) Gmail is a huge weak link for us.


Ah, thanks for the explanation.


Logically you'd expect unsubscribe to only act after lots of bounces of this format when the address has been receiving mail fine before. It also seems reasonable not to trust such bounces for the entire domain for a while when this happens to lots of other addresses that have worked fine before. Not that I expect software currently works this way, but it does seem like a common sense thing to code in.


I mean, it's possible, but you'd need to queue up a day's worth of bounces, do the analysis, and then handle the bounces asynchronously later on to do that.

Most systems operate more immediately in isolation on individual addresses than that right now, because such analysis is generally not needed (until today, of course ;-)).


Mail agents already queue emails that bounce though; it's a matter of changing the conditions for when you retry and/or unsubscribe. I imagine you can do the analysis in real time too... just look at the bounce and see if it pertains to an email you sent to in the past, and if so, increment some rolling counter for that domain.
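A toy version of that rolling counter (thresholds made up for illustration): track recent "no such user" bounces per domain, noting whether each address was deliverable before, and once the share of previously-good addresses gets implausibly high, treat further bounces from that domain as a provider outage rather than real removals.

    from collections import defaultdict, deque

    WINDOW = 500          # recent bounce events kept per domain
    MIN_EVENTS = 50
    OUTAGE_THRESHOLD = 0.5

    recent = defaultdict(lambda: deque(maxlen=WINDOW))

    def record_bounce(domain: str, previously_deliverable: bool) -> str:
        events = recent[domain]
        events.append(previously_deliverable)
        share = sum(events) / len(events)
        if len(events) >= MIN_EVENTS and share > OUTAGE_THRESHOLD:
            return "likely provider outage: quarantine and retry later"
        return "normal bounce handling"

    for _ in range(100):
        status = record_bounce("gmail.com", previously_deliverable=True)
    print(status)  # -> likely provider outage: quarantine and retry later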


Their SMTP server being unreachable is a 4xx temporary error. The sender MUST keep trying for at least 24 hours, and 72 hours is recommended.

"Gmail going down" would not have caused this problem. Even if all their SMTP servers went offline.


Yeah, they would have been better off pulling the (metaphorical) plug—maybe block incoming traffic to port 25 or something—until they had this fixed.


Mailgun sent a warning mail about increased bounces from our account. Sure, they know what's going on... but we send a 4-5 digit number of mails per hour - it's a lot of bounces.

That means I can't just resend the emails blindly, because I'm too scared to trigger some sort of automatic suspension...

(I don't do this regularly, so I'm not familiar with all the features... additional mail verification could probably help...)


They should be returning 421 for backend outages so that sending servers queue and retry the emails. 550 can be interpreted by some as deleted [1] or even banned accounts in some cases. Maybe someone here could convince them to change the logic that occurs during an outage.

[1] - https://en.wikipedia.org/wiki/List_of_SMTP_server_return_cod...


Yah. Maybe there's an unexpected way that things can fail resulting in 550's. But maybe at Google's scale you should have some kind of kill switch to stop answering SMTP or to not send permanent errors at all, so that you could flip a switch and prevent the worst consequences of this rather than let it go on for a couple of hours.


Absolutely this.

I am astonished that either (a) this switch has not been flipped yet or (b) this switch does not exist.

Somebody is incompetent here.


Perhaps Gmail is just being discontinued ;)


don't get my hopes up!


A lot of people will lose transactional email messages, because of this.

I'd absolutely hate to be hit by this at this time. Thankfully I made the time investment to run my own mail server years ago. The handful of times it broke down, it either went offline or started returning 4xx codes due to a misconfigured or broken milter after an update. Neither meant lost messages from normal senders that use queuing MTAs.


Same for me, mainly for privacy concerns. And I back it up daily to my local NAS. It's so easy to configure and run your own mail server, that I'm surprised we are the minority in the tech community.


> It's so easy to configure and run your own mail server

Is it? Is dealing with IP reputation, getting your emails accepted by major providers, and being on the hook for fixing everything yourself very easy? I haven't tried, so I don't have personal experience, but I've heard enough horror stories to think that it's not a good use of my time.


Sending side of the MTA can be set up manually in about an hour on a Debian server, with dmarc, dkim, spf, etc. Make that a day if you want to read up on and understand each of the things in more detail, if you haven't configured them before. There's really not much to play with in this direction for a typical personal mail server.

Receiving side is where there is a great range of options, and many things to try and have fun with. You can have anything from a single catchall mailbox with no filtering, no GUI, and a simple IMAP or POP3 access for MUA, to a multi-account, multi-domain setup with server side filtering, database driven mailbox and alias management, proper TLS, web MUA access, etc. It can also be built up gradually, starting from very simple setup to something more complicated so that you never lose account of how things work.


Mine are accepted by Gmail so I am good. Considering how dominant Gmail is, that's all that really matters.

Regarding getting a bad IP rating, normally that's due to having an insecure config, like acting as an open relay, or not having DKIM enabled. There are lots of tutorials online about this, if you know Linux it really is easy.


I had an IP reputation issue and managed to resolve it after some time.

TLDR: Before you spin up a mail server, check if your IP address is on any of the blacklists [0]-[1] as well as Proof Point's list [2]. If it is, then try and get a different IP address.

I spun up a hosted server on Digital Ocean and received an IP address. I checked several black lists from a few email testing/troubleshooting sites [0] and [1] and all was groovy; my IP address wasn't on any list.

I got a bunch of 521 bounces when I tried emailing a neighbor who had an att.net address.

So, I checked the troubleshooting websites, and my IP address was listed as clean.

My logs said I should forward the error to abuse_rbl@abuse-att.net, so I did.

Those emails were never delivered, because abuse-att.net had its own blacklist. I was getting 553 errors. In the logs, the message from their server told me to check https://ipcheck.proofpoint.com.

Proof point runs their own blacklist that some enterprises use (e.g. att and apple [3]). I checked their list, and lo and behold, my IP address from Digital Ocean was blocked [2]. Digital Ocean wasn't able to remove the IP address from their blocklist and suggested I spin up a new droplet with a different IP address.

I didn't want to do that, so I sent Proof Point an email that went unanswered; the email asked them to remove my IP address. I forgot about the issue for five or six months (this is a personal server), and ran into the issue again a few months ago. So I sent Proof Point an email again, this time with different wording emphasizing that "my clients" were having delivery issues. Within a day, they removed my IP address from their block list.

So, my main suggestion is to check if your IP address is on any of the blacklists as well as Proof Point's list before you start on your server. If it is, then try and get a different IP address.

Does anyone have more "enterprise" lists, like Proof Point, to check?

[0]: https://www.mail-tester.com/

[1]: https://mxtoolbox.com/blacklists.aspx

[2]: https://ipcheck.proofpoint.com

[3]: https://www.reddit.com/r/email/comments/6toxzr/ip_blocked_by...
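For the conventional DNS-based lists, the check itself is easy to script: a DNSBL is queried by reversing the IP's octets, appending the list's zone, and doing an A lookup (any answer means the IP is listed). Proofpoint's list isn't a public DNSBL, so that one still has to go through their web checker. A small sketch:

    import socket

    def dnsbl_listed(ip: str, zone: str) -> bool:
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)  # any A record means the IP is listed
            return True
        except socket.gaierror:
            return False

    for zone in ("zen.spamhaus.org", "bl.spamcop.net"):
        print(zone, dnsbl_listed("203.0.113.7", zone))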



It may be helpful to note that Google has acknowledged they are working on similar issues (the description is vague!) with an ETTR of 1900 EST:

https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

On the other hand, their status dashboard reported similar issues yesterday and here we are again: https://www.google.com/appsstatus#hl=en&v=status


Yes, hard bounces even between Gmail addresses.


Just curious, how did you check bounce stats for Gmail?


I also had the same hard bounce (when emailing from a non-gmail address -- fastmail -- to a gmail address). Sent it again minutes later and then it worked.


Incoming Gmail is bouncing, but I'm still able to access all prior received messages.


TL;DR: Don't send your newsletters today if you can avoid it.


Over the past 24 hours, I've had GitHub request that I re-verify my gmail three times (roughly 22 hours ago, 2 hours ago, and now), each time resetting my primary email's status to "Undeliverable" and "Unverified"

The triggering event may be an email bounce. I get a lot of github notifications sent to my email, and the failure of just one/a few may trigger the reverification.


This is another good reason to have email @yourowndomain.tld

When this happens, you can spin up a temporary server and have a mechanism in place to redirect email so you don't go down when your provider does.


I've had way more downtime trying to run my own domain's mailserver for a year than I have with gmail for more than a decade.


That's not what I said. With some emphasis added:

> When this happens, you can spin up a temporary server and have a mechanism in place to redirect email so you don't go down when your provider does.

Use a commercial provider, but fall back to your own server when it goes down without changing your email address.


I see two problems here: The likelihood your service is restored before you spin up your own mail server, and the fact that, not expecting this failure, their DNS may have a fairly lengthy TTL.


https://mailinabox.email/ Can be set up relatively quickly


What about permanent problem, like suspended account?


In that case, owning your own domain is golden. I just don't see "spin up your own mail server" as a short term solution.


Having run my own mail server for over a decade, I have yet to see a single instance where the server responded with a permanent "account does not exist" error and bounced mail.

Losing incoming email is pretty much the worst-case scenario when it comes to configuration errors. It's about as bad as not having backups, in that both cases result in unrecoverable loss of data.


Use a paid email host, just anything but Google. Life's too short to put up with managing your own email server.


It can just as well be Google, just the paid Apps version. Zero time to get used to a different UI. I suspect there must be a solution to easily migrate all your tags and filtering rules. (Tags are the killer feature to me. Outlook sort of has them, but they are less flexible.)


Does the paid Apps version have better uptime? Is it not affected by the current issues?


My company has paid apps, and we have been facing issues same as everyone else.


I switched to a custom domain only when gmail torpedoed one of my secondary gmail accounts.


You can redirect to a commercial service as well.


Not me, and I'm not even paying for the services I've been switching between.


Keep in mind other stuff like DNS will go down randomly. At least they won't result in a permanent address-doesn't-exist error, but you'll be putting out potentially more fires that way.


I just switched to Fastmail before all this.


Except as an academic exercise, trying to roll and maintain your own email is fraught with difficulties.


You can forward handling to a provider, like gmail. The idea is that you own your email address and can switch providers more easily if you are not satisfied with them or they turn out to be evil.


still use gmail to manage email lmao


Yep, there was a very similar event yesterday, approx. 22 hours ago: https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=10...


I figured one major incident for Google was enough for the day! We had a bunch of email bounce to @gmail domains yesterday in that timeframe.


When that happened I panicked a little, realizing how much Google Sheets data I had that wasn't really backed up anywhere since Sheets files in Google Drive are basically just links. I started a Takeout, but it looks like I wasn't the only one - it took well over a day to complete.


Be sure to verify that it worked. Some settings of Takeout don’t download docs/sheets/slides files. I don’t remember what the default is, unfortunately.


Same from LinkedIn


As quite a few Googlers appear to read and write on HN, I'd really welcome some insider info on what has been going on the last few days.

Sure, there will be some internal turmoil going on right now, but isn't there some non-confidential info to share? I can't imagine this would hurt Google's image in either the short or the long run; quite the opposite.


I don’t work at Google, I’m at a different big tech that’s in the news frequently. Sharing inside info on an ongoing incident is a great way to get fired. Big tech companies are way different than startups where everyone can do a bit of anything. There are people whose job it is to handle that communication. You make their job a lot harder if you disclose information. The company is so big that as an engineer you may not know all the factors involved in what would hurt the company long term - undisclosed relevant litigation, compliance commitments, partner obligations, etc.

How much do you hate it as an engineer when sales people make tech promises to customers without asking you? For comms people, engineers leaking info publicly feels the same way.


I am very pleased to see this response, genuinely. Our technical curiosity aside, there are literally people and teams in such big firms dedicated to this.


What you're saying makes sense but I don't think it really applies to anything the OP said. The "non-confidential" qualifier indicates to me that they only want people to share what they can responsibly.


And the parent post’s point is that there are people whose job it is to specifically share that information, and so we should let them do their job. They are the domain expert in this particular task.


For any incident like this there are tons of details that are both

1) Harmless to share
2) Will never be shared by PR teams

I don't see anything wrong with asking people to share what they can.


There's nothing wrong with asking. I'm just explaining that, for a Google employee, sharing such details is poor form.


[flagged]


> These companies wouldn’t hesitate to kick you out on the street if they had to

> Sharing inside info on an ongoing incident is a great way to get fired

You're not disagreeing.


He literally just said they wouldn't hesitate to kick you out on the street if they had to


In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world class Devops people that roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that can affect a huge swath of their customers across countries? Insight appreciated.


Googler but nowhere near Gmail, so just educated speculation:

* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)

* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).


> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.

As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."


Rollback proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting)


Is "exciting" a synonym for "harrowing" where you're from? :P


The Chrome Web Store has no rollback strategy; there is only roll-forward :(


You can build rollbacks out of rollforwards, although it certainly isn't particularly fun. You patch version N so that its version code is higher than N+1's, and roll out that "N+2" which is really N.


> what if you're in a situation where rolling back could make the problem worse?

Here come the poison pills!


You don't really have to speculate; they disclosed yesterday that yesterday's issue had to do with the automated quota system deciding the auth system had zero quota:

https://status.cloud.google.com/incident/zall/20013#20013003


Thanks for providing this. It's funny to read the speculations when you have read the actual root cause :D

Well, I guess the thing left unanswered for now is why the quota management system reduced the capacity of Google's IMS in the first place.

Maybe we will know someday :)


Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.


> It blows my mind that Google can even have problems like this.

When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.


Ex-Googler here.

Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.

But some things are necessarily global. Things like your Google account are global (what went down the other day). Of course you can (and Google does) design such a system such that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... It might just happen to hit the service globally.

When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.


I speculate that for many companies, work from home has been, at most, less impactful than they thought.

However, I'd speculate that in this instance, when you get that .0001% problem, having fewer hands on deck makes the work-from-home aspects harder. Akin to fixing somebody's PC remotely rather than standing behind them.

With that premise, I'd speculate that in this instance remote work, whilst not the root cause, may have been a small ripple that led to that root cause and/or led to a slower resolution than they would normally get.

Those speculations aside, it will only highlight that some tooling, design and set-ups need to adjust for remote workers. Water-cooler talk is not just for gossip, and one counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt to and embrace the work medium, and so the kinks and areas that need polishing can be polished and made better for all.

Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match the anecdotes of some out there and resonate with others.


You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.


All the access to the services is remote, but I'd say having the entire team in the same room does help coordinate incident response.


Agreed. And I'd hope that their plan B of "get the whole team on Hangouts" isn't met with connection / auth issues. Kinda feel bad for the googlers. Hope they get this right.


When I was there they had an IRC network for this reason. I hope they still do. Not quite the same as VoIP but fewer dependencies...


That's why the network folks at Google and AWS use IRC for just that purpose. Simple, no external dependencies, just works.


Software isn't as simple as splitting across different locations to prevent global failures.


I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing on the sender MTA side, etc.), and there's a natural hard boundary at the user-mailbox level you can use to partition your system.

It should not be a problem that Gmail is "down". Unless this kept happening for more than a few days, no one would lose e-mail. The problem is that it's returning a permanent error code instead of a temporary one.


It is pretty clear that accepting a TCP connection and reading the bytes of the email from the sender is not the problem. Google is bouncing messages with an error like "that user doesn't exist". This would lead one to believe that some instances are having trouble looking up users, and that doesn't scale super easily.

If the product guarantees that it will reject invalid email addresses (which is nice of them, not required by any spec), there has to be a globally consistent record of which email addresses are valid, and the accepting server has to look it up and react while the sender is still connected to the mail server. You can't queue those and send the bounce later (there is no reliable "from" field in email; the only way to correctly bounce is while the sender is still connected). This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

They made it hard on themselves by providing messages like "that user doesn't exist", but... it is nice when you email someone and you get the message "they got fired, sorry" instead of silence. So they made their system more complicated than it needed to be, for a better user experience, and now they are fighting a breakage.


I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.


Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.

I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.


If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaving email server should send temporary failure codes when it can't look up whether an address is valid, and let the sender retry later, once the address lookup is working again and it can give a definite acceptance or rejection of the email. This is not even remotely a new problem; it comes up in email systems all the time, because even at much smaller than Google scale they tend to be distributed systems. Someone screwed up.
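To make the distinction concrete, here is a minimal, hypothetical sketch (plain Python, not any particular MTA's API; the directory object and LookupUnavailable error are illustrative names) of how an inbound server might map a recipient lookup onto SMTP reply codes, returning 550 only when the directory answers authoritatively and 450 when the lookup itself fails:

    # Hypothetical sketch: choose an SMTP reply for RCPT TO based on a user lookup.
    class LookupUnavailable(Exception):
        # Raised when the user directory cannot be reached or times out.
        pass

    def rcpt_reply(directory, address: str) -> str:
        try:
            exists = directory.lookup(address)   # may raise LookupUnavailable
        except LookupUnavailable:
            # Temporary failure: the sender keeps the message queued and retries.
            return "450 4.2.1 Mailbox temporarily unavailable, try again later"
        if exists:
            return "250 2.1.5 OK"
        # Permanent failure only when the directory answered authoritatively.
        return "550 5.1.1 The email account that you tried to reach does not exist"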


> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up a another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

You don't have milliseconds. You can take quite some time to handle the client; tens of seconds for sure. For example, the default timeout for the Postfix SMTP client when waiting for HELO is 5 minutes.


If there is something I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how distributed your architecture is, you will always have a single point of failure (SPOF), and sometimes you discover SPOFs you didn't think of.

Sometimes it's a script responsible for deployment that propagates an issue to the whole system. Sometimes it's the routing that goes wrong (for example when AWS routed all production traffic to the test cluster instead of the production cluster).


[flagged]


Your contribution has greatly enhanced this conversation, thank you.


Because, maybe, like in every big company, the thing actually doing the work is some old Oracle database with some huge monolith around it...


Out of all the companies Google might be relying on in their back-end, I think Oracle is probably pretty far down the list.


I can’t imagine what part of Google’s history would lead someone to believe there was any third party system in their production stack anywhere.


Now their corporate/finance stack on the other hand... shudder.


Well, Google did use a bunch of off-the-shelf technologies in the early days, but now it is obvious that there is no vendor on earth that could supply the infrastructure to run Gmail.


Didn't they use GNU/Linux from day one on?


Closed-source like Oracle, I meant. They've been big boosters of all kinds of open-source stuff like Linux, LLVM, MySQL, ...


Hush, you'll scare the shiny-eyed FAANG wannabes away; they aren't supposed to know this until they've been employed for at least two decades.


I would advise anyone to not share any information that his company hasn't explicitly agreed to share.


Your username is rat9988. Been burned in the past?


Management at google are poking in to check up on their staff, to make sure nothing leaks.


[flagged]


There used to be times when people didn't care for technicalities like this because the focus was on the person's contribution to the discussion.

Now that everyone's replaceable, the popular culture desperately tries to shift focus into arguing about pronouns and terms.

Watch out, this is a road to nowhere. Forcing others to use the right pronoun won't build up your retirement fund, but will distract you from worrying about not having one. And the fact that you care about it more than about your opponent's T-shirt color could be an indication that you are being manipulated to not think about the long-term things.


This is a surprisingly profound and insightful comment so deep in the subthread of, more or less, a shitpost.

Thank you, sir, for elevating our collective level of discourse.


> you are being manipulated to not think

This is where it crosses from insightful into conspiracy theory territory for me. People seem perfectly capable of groupthink-deluding themselves. Why cheapen your argument by postulating some master manipulator when it's not necessary for the deeper point you're making?

It will only lead to people focusing the discussion on challenging this particular aspect, or disregarding all you've said, instead of engaging with the actual meat of the argument.


'Their' works fine and has been gender-neutral English for ages.


[flagged]


Okay, so use "their." It is gender neutral, so should work for everyone.


It's also wrong, because it's not singular. Makes for difficult reading.


From https://www.pemberley.com/janeinfo/austheir.html:

'Singular "their" etc., was an accepted part of the English language before the 18th-century grammarians started making arbitrary judgements as to what is "good English" and "bad English", based on a kind of pseudo-"logic" deduced from the Latin language, that has nothing whatever to do with English... And even after the old-line grammarians put it under their ban, this anathematized singular "their" construction never stopped being used by English-speakers, both orally and by serious literary writers.'


It's not "wrong." Language is fluid and singular they is widely accepted. A previous poster linked to an article showing centuries of such usage.


> "so what does it matter anymore"

The same reason it ever mattered how you refer to people, politeness and respect. If someone you consider "him" asks you to refer to them as "her" it's like someone asking you to call them by their full name "Rebecca" instead of "Becky" or "Jonathan" instead of "Jon". If you like and respect them, you do as they request because things which matter to them matter to you, and being polite to them is important to you. If you ignore what they ask, call them what you want, you communicate that you don't respect them and don't want to be polite, that you want to dominate and 'win' instead.

> "Pronouns can mean whatever you want them to mean"

Only one way. A specific person asking you to use a specific pronoun for themselves is wildly different from you unilaterally and universally saying that all women should feel included by the word "him" because "him" has no meaning anymore.


"Ages" is subjective; it came back into popularity only recently.


That varies based on location and regional dialect. Here in the northeast US, I remember using singular they/their since the 80's. It would be interesting to know when this become popular elsewhere.


80s in Australia too, been hearing/using it my whole life.

Though with respect to 'ages' apparently it's been around since at least the 14th century but certain purists tried to stamp it out at various times (just like the singular 'you' which no one currently has grammatical issues with I hope).

https://public.oed.com/blog/a-brief-history-of-singular-they...


I remember some people tried to get BLM into German discussions, which made absolutely zero sense, as we have a completely different history and culture. Now I see this popping up. I really hope Europe can get some cultural distance between itself and the USA in the near future. The time is ripe.


> s/his/her/

s/his/their


I believe you mean:

s/s\/his\/her/s\/his\/their/


s#s/his/her#s/his/their# also works and avoids awkward escaping. The first symbol after s is used as the separator. Works in vim, at least.

In other words:

s%s/s\\/his\\/her/s\\/his\\/their/%s#s/his/her#s/his/their#%


Did you just assume my regex engine is pcre gendered???


Wow, I've never heard this joke before. Original and well-applied to the situation.


    awkward escaping
Or as I've seen it called, "leaning toothpick syndrome".


The question is what exactly the new "feature" that got pushed skipping canary was.


NSA backdoor? <smirk>


Since so little time has passed since the last issue, I am wondering if it could be the same cause. Maybe they didn’t fix it properly the first time.


Or simply trying to roll something out again, same that failed before.


it's got a similar flavor - that was identity management going down, this is "that email account doesn't exist".


I wonder if Gmail is just not a very well maintained codebase. Here's an issue where old emails just become inaccessible. Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.

https://support.google.com/mail/thread/6187016

Maybe time to switch to a more reliable provider.


> Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.

Did you try pulling them down using the API tester?: https://developers.google.com/gmail/api/reference/rest/v1/us...

Some of the internal formatting that Gmail uses has changed over the years, so more likely than not the API that parses the stored message for display in the Gmail UI is just throwing some kind of error.
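If the API route helps, a rough sketch along these lines (assuming the google-api-python-client package and an already-authorized `creds` object with the gmail.readonly scope; the query string is just an illustration) pulls old messages in raw form, bypassing the Gmail UI's rendering path entirely:

    # Sketch: fetch old messages via the Gmail API instead of the web UI.
    from googleapiclient.discovery import build

    service = build("gmail", "v1", credentials=creds)  # creds: pre-authorized OAuth credentials

    # Hypothetical query: anything received before 2012.
    resp = service.users().messages().list(userId="me", q="before:2012/01/01").execute()

    for ref in resp.get("messages", []):
        msg = service.users().messages().get(userId="me", id=ref["id"], format="raw").execute()
        print(ref["id"], len(msg["raw"]), "base64url chars of raw RFC 822 data")

If the raw messages come back fine there, the data is still stored and the problem really is in the display path.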


I didn't but I did try Takeout and they weren't in it.

Either way my point is that this is a pretty serious bug and they haven't even acknowledged it! Not a good look.


I've never had issues over IMAP with old (decade-old) messages in Gmail.


Right but the version of an email message you download via IMAP is different than the version of an email message you see in the Gmail UI. That's my point, that the error is probably in the way Google is processing messages for Gmail, so you wouldn't see it in IMAP or via the API.


Yes, I’ve been hearing about this issue from non-technical friends too. An explanation of “X crashed” helps even if they don’t actually understand what X is. The fact that someone figured it out and knows is reassuring.


Uneducated speculation: some sort of security incident. Whenever there is a major security issue in the wild, one of the big providers tends to have a problem within a few days.


People will suggest running your own mail server, and if you have the time and energy then definitely do that.

But the next best thing you can do is simply just use your own domain. That way, you can at least decide to migrate your email elsewhere. Don't use the free domains you get from things like gmail or other providers, because then you have to _change_ your email address, and not just your MX records.


10 GB of space, one TLD domain of your choice, and 99.9% uptime for 1.85€/month with a setup fee of 10€. Hetzner will take care of everything else for you, as it's managed webhosting.

https://www.hetzner.com/de/webhosting

Just because you can (theoretically) run your own infrastructure does not mean you should. Trust the professionals. You don't do your own surgeries, do you?

(not affiliated to Hetzner, just was the first offer I thought of.)


I've used Fastmail for several years now with no issues apart from a slowdown in the phone app a few months back (website still loaded fine). It's a bit more expensive at $50/year/user (about $4.17/month), but you get 25GB of combined mail and file storage, contact/calendar/note syncing, simple static web hosting that's good in a pinch, a very nice web front end, and superb customer support. Not affiliated with them just a happy customer.

It's still putting all your eggs in one basket in a sense, but being a paid service there's a sense of privacy, security, and permanence that Gmail and the other free providers don't offer. I do own my own domain as well, and I have mail accounts tied to it that I use for certain services and communications, mostly medical and local businesses, but I'm still at the mercy of my hosting provider for that domain. With that said, my provider (Tiger Technologies) has been astoundingly awesome and has never let me down in 12+ years of service.


I agree with this, and setting it up with Fastmail was so easy, I set up two more domains just for fun. Same goes for adding import from Gmail/Softbank/Apple/anywhere. It's like a 1 minute procedure to import an account, literally. Excellent product, glad I migrated off of Gmail.


> People will suggest running your own mail server, and if you have the time and energy then definitely do that.

As a learning experience, sure, but most people are not prepared for what running a 24/7 mail service requires of them.

First of all, a static, non-residential IP is likely needed. The big players will flat out refuse reception if your IP is registered as residential, so that rules out hosting it from your home despite having gigabit internet.

You also need SPF, DMARC and DKIM working, or major players will also flat out refuse reception.

On top of that, you need to implement the infrastructure to actually host a server 24/7, including patching and backups, as well as monitoring it for unauthorized access.

Despite all of the above, you may still find yourself on a spam/block list, and removing yourself from these can also turn into a large task.

Part of the irony of Gmail having outages is that Google and other "large players" have fought long and hard for a decade to make it harder to host your own mail server. It has been done in the name of fighting spam, but I doubt any of them minded that it also makes it harder to run your own.

So yeah, build your own mail server as a learning experience. Then move the domain to someone dedicated to running it.

I purchased a lifetime subscription (limited promo offer) with mxroute.com. 10GB mail storage, unlimited domains and accounts (limited by space only), as well as a Nextcloud instance for all your users. Service and uptime have been nothing but exceptional. Customer support is actually reachable. The only downside is that the spam filter (SpamAssassin IIRC) is not as highly trained as the Gmail one, so more spam comes through.


I think the barriers are overstated a bit. I have email on my own server, part of the stuff I run on a dedicated server. Granted, it costs money, but I'm using the server for more than email anyway. That takes care of the IP address, and since the server's with a data hosting company, they take care of the network infrastructure, hardware maintenance and such.

Downtime hasn't been a major issue - senders will retry sending email, usually multiple times over several days. I've been able to have downtime of 24-48 hours without losing any messages.

An SPF record is just another easy-to-create DNS entry. If you know how to manage DNS, setting up SPF is a matter of minutes. DKIM is just slightly more complicated, with an extra key-generation step. Sites like mxtoolbox.com can help you validate records.
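For a quick scripted sanity check, something like this (a sketch assuming the third-party dnspython package; example.com is a placeholder for your own domain) confirms that SPF and DMARC TXT records are actually published:

    # Sketch: check that a domain publishes SPF and DMARC TXT records.
    # Requires dnspython (pip install dnspython).
    import dns.resolver

    def txt_records(name: str) -> list[str]:
        try:
            return [r.to_text().strip('"') for r in dns.resolver.resolve(name, "TXT")]
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return []

    domain = "example.com"  # placeholder: your own domain
    spf = [r for r in txt_records(domain) if r.startswith("v=spf1")]
    dmarc = [r for r in txt_records("_dmarc." + domain) if r.startswith("v=DMARC1")]

    print("SPF:  ", spf[0] if spf else "missing")
    print("DMARC:", dmarc[0] if dmarc else "missing")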

The biggest problem I think I have with my own server is security. I do patch the machine regularly, but of course I don't have the same kind of security that Google or another big player would. On the other hand, I suspect I might have a smaller attack surface and better security than plenty of small websites.


> First of all, a static, non-residential IP is likely needed.

If you want to send mail directly, that's true. But if you send mail through a smarthost, like your ISP's SMTP server, you can easily receive mail on a dynamic, residential IP.

> implement the infrastructure to actually host a server 24/7,

email is really tolerant of downtime. You can be down for hours without losing mail. The sending servers will retry for a while.


> email is really tolerant of downtime. You can be down for hours without losing mail. The sending servers will retry for a while.

I’m aware senders will retry for days if your server doesn’t reply, but it still requires monitoring and is not just a “fire and forget” solution.

Also, if your server starts bouncing emails, chances are you’ll be missing mails. Again, needs monitoring.


I think you can be down for 4 days by the standard.


> I purchased a lifetime subscription (limited promo offer) with mxroute.com.

At the risk of stating the obvious, note that 'lifetime' refers to the lifetime of the company, not the customer. Which underscores the risk of buying lifetime subscriptions.

And as much as I like the idea of avoiding recurring costs (I have a 'lifetime' Plex pass), it seems to me that these can't be sustainable for the company in the long term.


I’m aware it’s the company’s lifetime (unless my expiration date comes up first), and I act accordingly with nightly backups of all mail.

It's really no different from Google, where a single bad comment somewhere in their vast ecosystem can end up getting your account banned.

In my case I try to stay as far away from Google as I can with my everyday services. I’m also well aware that chances are extremely high that any email I send will make its way to Googles servers.

The “easy” solution would be to self host, and I do that to some extent, but as I work with system administration I really don’t want/need another day job. I’m actively looking for relatively secure, privacy aware and affordable cloud solutions for everyday use. I wrote affordable because nothing is free.


The main problem with self-hosting is indiscriminate blacklisting by Google and Microsoft. You only need to be on the same network as some spam artist to end up shunned. The tech giants are our new overlords.


No, it is not. It really grinds my gears when people spread misinformation about 'being on the same network as a spammer', 'indiscriminate blacklists' and corporate overlords being the reason why their email is not being delivered.

Spam filters have been content-driven for a long time now. IP addresses and domain names are ephemeral, and so are 'blacklists'. With the amount of spam being sent, we would have blacklisted the entire internet by now.

If a spam filter gives false positives, it hurts the receiver just as much as the sender.

The real problem with self-hosting is that the majority of self-hosted e-mail servers are terribly configured. Getting the SMTP server running is one thing, but getting DKIM, DMARC, SPF, TLS and MTA-STS running properly is often overlooked. When was the last time you checked the validity of the TLS certificate of your SMTP server?

Get your server and domain setup properly. Sign your email with DKIM, setup an SPF and DMARC policy and perform DMARC monitoring to spot problems. Setup TLS and an MTA-STS policy service for your incoming email. Throw in SMTP-TLS-reporting for good measure. E-mail servers are not set-and-forget if you want to do it right. And this is not the fault of large corps, it's the spammers who got us in this situation.

It's really easy to blame large services or blaming your email deliverability problems on being on the same IP block as a spammer, but really it's almost always a misconfiguration on your side.

Disclaimer: I'm the founder of Mailhardener (https://www.mailhardener.com), we do e-mail hardening and solve deliverability issues.


My corporate overlord (~10k ppl) has a policy of placing all incoming email from domains less than 30 days old into the Junk folder; it's a tier 1 rule which cannot be overridden or circumvented by user rules. No amount of properly configured mail services will matter in this scenario. :-/


That's probably more of a phishing defense, but not really effective either way. 'Good' spammers will be constantly registering domains and only use the ones that are a few months old since time-based spam policies are fairly common. This type of policy only works for low-barrel spam and shady operations that register domains with stolen credit cards and end up losing their domain within a few weeks once the chargebacks get to the registrar.


Yah, I do not defend it in any way - it's security theatre to me; they also wholesale block entire TLDs (more than one) under the same umbrella, and block access (HTTP) to any domain less than 30 days old as well. It is my experience that most companies of size implement compliance-checkbox solutions and do not really care about internal user experience; you (me, we) are expendable and replaceable. Comply or face sanction/termination of employment; compliance is what matters to the business.


You can self-host the receiving side of your mail server (with spam filtering etc.), but send all your mail through a mail provider with a good reputation. Configure your own SMTP server to use that other server as a relay; you can even do your own DKIM signing before sending off your mail. At least then you shouldn't have a problem with IP reputation.
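As a rough illustration of the relay idea (a sketch only: the hostname, port and credentials are placeholders, and a production setup would normally configure this in the MTA, e.g. Postfix's relayhost, rather than in application code), submitting outbound mail through an authenticated smarthost looks like this:

    # Sketch: submit outbound mail through a reputable relay (smarthost)
    # instead of delivering directly from your own IP.
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "me@example.com"      # placeholder: sender on your own domain
    msg["To"] = "friend@example.org"    # placeholder recipient
    msg["Subject"] = "Hello via the relay"
    msg.set_content("Delivered through the provider's IPs, not mine.")

    # Placeholder relay details: your provider's submission endpoint and credentials.
    with smtplib.SMTP("smtp.relay.example", 587) as relay:
        relay.starttls()
        relay.login("me@example.com", "app-password")
        relay.send_message(msg)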


I have been doing this. I bought a domain and am using zoho.com to send and receive emails for free. Storage is only 5GB but I can always pay if I want more.


Yep, I've found it to be extremely worthwhile having your own domain. It's more effort, but once it's set up it's way easier to avoid lock-in to a company.

I also run my own and its been easier than expected, especially with maddy: https://github.com/foxcpp/maddy


I'm trying (mailcow was easy to set up), but it seems all self-hosted solutions do not support snoozing mails. To me this is a must have feature.


In most cases, this also makes you more vulnerable to a DNS hijacking attempt.

I'd still suggest putting "secrets" in a free, globally trusted email provider with 2FA.


Good luck getting the same quality of service as gmail with your own mail server. The fact that gmail fails every so often (extremely rarely actually) is a good sign: zero failure would mean that they are over-investing in quality and losing flexibility. Gmail only needs to be as good as the best web service out there.


But you can pay for email from other providers like Fastmail and Proton and just have your own domain with MX pointing at them. Services are better in my opinion when you pay for them and you’re not the product.


> Services are better in my opinion when you pay for them and you’re not the product.

Services also tend to be better if lock-in effects on customers are low. ... and using your own domain for email does reduce lock-in significantly.

unrelated: the link in your profile does not seem to work.


Does gmail have such a big lock-in effect on users since it allows them to use the gmail interface with their own email? (i.e. you can split the service and the UI)


Yes, Gmail is complete lock-in (with current legislation). You cannot move your `@gmail.com` address to another provider, e.g. Proton.

But there is a case for legislation forcing email providers to allow moving email addresses to other providers (how it should be done technically is another question). This is already in effect for telephone numbers in many places.


I was lucky enough to get in on the ground floor of the Apps for Business (or whatever name it has now) service when it was free, and I was able to use my own domain for that.

As such, migrating these email addresses was easy enough.

My older @gmail and @googlemail ones though, not so easy. I've been moving each account I have used with these addresses one-by-one, but you never catch them all and even when you do, some services simply will not let you change your email address.

I recall being so excited when Gmail first launched and was one of the first people to get a Beta invite. I regret ever signing up for them now, given the headache it has been to get off it.


thanks! i stopped using keybase when they were acquired by zoom and didn't update my bio.


Uptimes in Germany are usually 99.9% for most hosters. I think Gmail lost to all of those yesterday.


Parent suggests to just run mail on your own domain.

The main thing here is to avoid a single point of failure, in both domains (politically induced problems) and infrastructure (technically induced problems). If people used more than just a handful of domains/providers for mail, then single failures would not have that big of an impact.

I have been hosting my mail on my own domain for the past 3 years and have not been impacted by this incident. Currently I am happy with ProtonMail, which hosts my mail. But I know that I can easily move on, if service drops, and even self-host.


> the same quality of service

and the weight of google as well!

An individual mail server getting onto a blacklist is more often than not a death sentence for the address, the domain and sometimes even the IP.

But if Google is at fault and email ends up in a permanently-rejected bucket, like in this event, it's in everyone else's interest to play nice and accommodate the fault.

I think people severely underestimate how hard it actually is to consistently deliver email in 2020, between DKIM, SPF and DomainKeys, while tiptoeing around everyone else's IP/email antispam services.


You need to at least hide your domain whois contact information if you are going to do this...


Many domain name registrars offer that as a service.


Yes, but it often is not the default and, as you know, most people don't change defaults (and there is also an extra monthly fee, which is a scam).

I know people here don't think it's a scam, but it is.


Remember, free and open source software is more reliable: (https://www.gnu.org/software/reliability.en.html)

"One reason is that free software gets the whole community involved in working together to fix problems. Users not only report bugs, they even fix bugs and send in fixes. Users work together, conversing by email, to get to the bottom of a problem and make the software work trouble-free."

And Service as a Software Substitute (SaaSS) takes away your freedom: (https://www.gnu.org/philosophy/who-does-that-server-really-s...)

"The basic point is, you can have control over a program someone else wrote (if it's free), but you can never have control over a service someone else runs, so never use a service where in principle a program would do.

With free software, we, the users, take back control of our computing. Proprietary software still exists, but we can exclude it from our lives and many of us have done so. However, we are now offered another tempting way to cede control over our computing: Service as a Software Substitute (SaaSS). For our freedom's sake, we have to reject that too.

With SaaSS, the server operator can change the software in use on the server. He ought to be able to do this, since it's his computer; but the result is the same as using a proprietary application program with a universal back door: someone has the power to silently impose changes in how the user's computing gets done.

Thus, SaaSS is equivalent to running proprietary software with spyware and a universal back door. It gives the server operator unjust power over the user, and that power is something we must resist."


I ran my own mail server for 15 years. It was not more reliable than Google or Microsoft (my current provider).

> never use a service where in principle a program would do

Email is definitely a service.


Ah the free software religion, or as I have taken to call it, GNU minus rationale.

-----------------------------

> One reason is that free software gets the whole community involved in working together to fix problems. Users not only report bugs

This works only for popular open source software, and still doesn't apply to infamous Unix mail servers or the likes of GNOME. The 'community' is often more interested in adding features than fixing bugs.

> so never use a service where in principle a program would do.

Comfort vs. freedom tradeoff. Sometimes data privacy/freedom just isn't critical enough to justify the costs and difficulties of self-hosting.

> Service as a Software Substitute (SaaSS). For our freedom's sake, we have to reject that too.

For what it is worth, SaaS has only benefited freedom of people working in big corporates. It has loosened the grip of enterprise software directly sold to C-suites by wine-and-dine sales methods.

And you know what, software writers have to make a good amount of money too. Just giving away free desktop/server software doesn't work out for most developers. I will happily accept it if open-core software has a value-added SaaS. And "live cheap and write freedom-respecting software for some semi-arbitrary definition of freedom" is just disrespectful to talented software developers.

> someone has the power to silently impose changes in how the user's computing gets done.

Again, a convenience and security etc. thing. In most use cases, you don't care.

Stallman sees everything as black or white. (Linus Torvalds has also written on this)

I'd suggest Stallman once watch 2017 Tamil movie "Vikram Vedha" :-)


> However, we are now offered another tempting way to cede control over our computing: Service as a Software Substitute (SaaSS).

Don't agree with this bit. You are ceding control of your data for sure, but this isn't quite the same as running a proprietary piece of software on your machine when you have no idea what it's doing.

Ceding control of data is also worrisome I agree, but giving control of your data to a custodian you trust in exchange for said custodian promising its careful curation is a trade-off that most people would find acceptable. Some won't, and I respect that, but the advocacy above might be counterproductive to most people who take it and try hosting their own email servers.


Funny how you always see this comment in this type of thread


I generally agree with the GNU stance on most things, but here this is just irrelevant and out of topic.

I run my own mail server, and you can screw up badly with free software. And you probably will, more than Google or the big players, especially if it is not your job. Free software does not make your sysadmin screw-ups or hardware failures something you can "get the whole community involved in working together to fix".

(Fortunately, I haven't screwed up so badly that my server started responding "no such user". Just some regular downtime that, as far as I know, has not made me miss any mail.)

Using someone else's computer and services can be problematic for a lot of reasons, but uptime / reliability is generally not the issue.


Just mistakenly disputed a transaction because of this. I have been working with a mediocre freelancer who didn’t reply to my list of remediation items.

When I followed up today, I got the “this email account does not exist” error from Gmail and proceeded to dispute on PayPal.

I only found out through hacker news that this was a Gmail bug.

Google, I expect a follow up retraction of the incorrect error messages. It’s one thing to give a temporary error. It’s another to say “this email account does not exist”.


Email sender here (on behalf of about 2.5 million domains). We noticed the issue and mitigated it by transforming the 5xx error into a 4xx error so that messages to Google are re-queued instead of being permfailed. But even with this intervention, the ticket volume was insane...
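For anyone curious what that mitigation can look like, here is a hypothetical sketch (plain Python, not any particular MTA or ESP's API) of downgrading Gmail's suspicious permanent failures to temporary ones while a known outage is in effect:

    # Hypothetical sketch: during a known Gmail outage, treat 5xx replies for
    # Gmail recipients as temporary so the queue retries instead of permfailing.
    GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}

    def classify_delivery(recipient: str, smtp_code: int, outage_active: bool) -> str:
        domain = recipient.rsplit("@", 1)[-1].lower()
        if 200 <= smtp_code < 300:
            return "delivered"
        if smtp_code >= 500 and domain in GMAIL_DOMAINS and outage_active:
            return "retry_later"   # downgrade the permanent failure
        if smtp_code >= 500:
            return "bounce"
        return "retry_later"       # ordinary 4xx: normal retry behaviour

    # During the outage, a 550 to a Gmail address is requeued rather than bounced.
    print(classify_delivery("someone@gmail.com", 550, outage_active=True))    # retry_later
    print(classify_delivery("someone@example.net", 550, outage_active=True))  # bounce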


Thanks for being proactive on behalf of your users.


Oh my.. Few will have even thought of this in time and been this proactive :/


Side note: Love the 2010-era web design. Wtf happened? Everything got bloated, buttons got huge and whitespace took over the screen real-estate.


Touch on smaller screens became the primary interface. Fingers fat, buttons big.


Yeah, I get that. But why punish desktop users with a mobile interface? We used to punish mobile users with a desktop interface back then; we just reversed the problem, didn't actually solve it.

They should be 2 separate things: 2 separate CSS files selected with @media queries, 2 separate button sizes and styles.

I know what happened - designers got lazy.


Doing all the work twice is expensive. It's not just the actual work of doing the design, but then you're doubling testing costs too (unless you YOLO one of the modalities). It's more cost-effective, and yields a better-quality result, to choose to make the platforms more uniform rather than trying to optimize for both. This obviously only holds when there's a massive disparity in revenue. If your PC/laptop users make up a non-trivial part of revenue it can start to make sense (but again, generally only if doing so will help you retain that market or get better revenues). The other thing unifying things does is it helps engineering and design teams by removing complexity.

All of the above is general trends of how these decisions are made. There will always be counter examples or situations those aren’t good ideas (or that someone has made a mistake applying a lesson to the wrong situation).

Saying an entire group of people is lazy or dumb is not a particularly insightful way of looking at anything that helps your understanding of the situation or learning what kind of results different incentives yield.


> Doing all the work twice is expensive.

Sure, but they already do all the work twice.

I get a completely different site on my phone than on a desktop/laptop.

In fact, they maintain far more than two designs - in addition to two native apps, there's a mobile website, the desktop website, and the basic HTML version. On top of that, they have multiple display density options for desktop (which admittedly is mostly just adjusting padding), redesigned the desktop site a few months ago, and had Inbox for a few years. Additionally, you can (some of these without a reload) change whether there are separate inboxes for various labels, add/remove a reading pane, and split threads into individual emails.

I don't think Google is lacking in potential to maintain a website.


They don't. The browser just adapts.



Is the plain HTML version under even the most basic maintenance? I was under the impression that it was the old interface and they just kept it around because people liked it / for countries with slower connections.

The mobile vs desktop versions you posted are likely the same codebase with minimal (if any) differences. My understanding is that generally such things are accomplished transparently with flex layouts that automatically adjust to screen size.


> Is the plain HTML version under even the most basic maintenance?

I doubt much work is being done on it, but presumably they at least make sure it works; I mentioned it because a few posts up (edit: you) mention testing (rather than initial design) as the reason why having multiple designs is so difficult.

> The mobile vs desktop versions you posted are likely the same codebase with minimal (if any) differences....

It's entirely possible that they were derived from a similar codebase at some point, but what reaches the browser is significantly different - it's barely responsive, it's served based on user-agent, and it appears to consist of significantly different obfuscated blobs of HTML, CSS, and JS.


I can’t speak to it but just because something is tested occasionally doesn’t mean the testing budgets are the same or serve the same purpose.

For example, the feature set required to support the HTML page could be frozen and the APIs backing them stable with no need to change. So testing isn’t really necessary. Alternatively, there’s just API changes being made to remove dependencies on deprecated code and so the testing coverage comes from the testing that happens of that API surface through other means. Finally, it could be that the HTML page is even fully staffed to support emerging markets. That’s a different budget potentially than the budget for the “rich” UI.

Again, my point isn’t to argue over the specific business pressures and practices Google has for their email UI. This requires a level of knowledge I don’t think either of us possess. All I’m trying to do is illustrate that there could be all kinds of pressures why the system is the way it is, but dismissing it as “laziness” or “stupidness” on the part of the designer is itself a lazy and stupid conclusion to make without concrete evidence. I generally assume that’s not the case and look for the incentives/pressures those people are under until there’s overwhelming evidence those people are actually stupid/incompetent (and even then, the question becomes what structures, incentives, pressures were in place to put those people in positions they shouldn’t occupy).


> Saying an entire group of people is lazy or dumb is not a particularly insightful way of looking at anything that helps your understanding of the situation or learning what kind of results different incentives yield.

Maybe, but when you're talking about the armies of designers at the multi-billion-dollar tech company Google not going through the effort to maintain two stylesheets, I think "laziness" is an accurate description.


No, designers did not get lazy. Devices got weirder.

It used to be that you could check the width and height of the viewport and say something like “320px wide? Must be a touch interface, deploy the big buttons”. Then tablets got big and it was like “1024px? Could be a laptop, but it’s probably an iPad, which has a touch interface, deploy the big buttons”. Then laptops got touch screens, then the Surface Studio came in and was like “HAHAHAHA”.

Now the game is “1920x1080? Could be a big tablet with touch, or a 1080p monitor without touch, or a non-maximized window on a Surface Studio with touch, or maybe it’s a monitor without touch hooked up to a laptop with a touchscreen and our window could get moved between them at any time...”

Nowadays, there’s no single reliable way to tell if a page is going to have to support touch until it gets a touch event, by which point you’ve already rendered the UI and it’s too late.


A simple solution would be to ask the user what they want. I genuinely don't understand why this is not common, instead of trying to guess.


You don't have to ask the user; there is a media rule for querying whether the device is currently using coarse or fine pointer input [0] (though, of course, it relies on the OS not lying, which is not a given).

[0] https://developer.mozilla.org/en-US/docs/Web/CSS/@media/poin...


Some websites probably do have settings around this, but that gets to a point someone else mentioned: you would have to basically design, build, test and maintain two UIs. Except now with the kicker that one of those layouts is only used by the 5% of your userbase that both knows that the option is available and chooses to take you up on it.


Yeah, I like rich, data-heavy, information-dense desktop displays. Mobile UIs are such garbage, not to mention shit like Reddit that is only able to process a request 30% of the time.



Agreed. It drives me absolutely berserk that Reddit forces you to "Click to view more comments" just to see like 3 more comments.


That particular anti-user behavior is a strategy to get you to pay for Reddit gold, which removes that limitation.


That's the "entropy only goes up" or "all available space will be filled with complexity" rule of software orgs. Employees are paid to add new features, regardless of how useless they are. The only situation in which they stop adding features is when hardware or something else doesn't allow it; otherwise we'd see 1GB webapps. But when they reach this boundary, they start redoing existing stuff, because otherwise they'd get fired for inactivity.

This kind of BS is difficult to stop even when you're paying their salaries and monitoring the results. For example, a marketing person will keep adding useless bloat to justify his salary (he really needs his paycheck!), or a programmer will keep refactoring some BS to satisfy his purism (funded by your money, of course). Just think about it: if a competent programmer approached you and explained that the product is mostly finished and the remaining micro-improvements won't add any value to your business, would you continue paying him for doing nothing?


Designers need to justify their continued employment.


I can't see the buttons (mobile)


That's the point: this page is readable and information-dense. There is no bloat.


> this page is readable

Not easily so on mobile - the need to zoom multiple times and scroll in 2 directions to read simple information is a PITA.


Because it's not for mobile. Engineers who look at this tend to work from a computer.


Says who? Information about emergency issues is often spread on mobile phones.


> Wtf happened

Money


Go on...


I'd rather not; I think I completely misread the post. Classic post-before-coffee error.


If this had happened on a self-hosted email server, people would be claiming "this is why you don't self-host email". My company self-hosts email and we have never had an outage of this sort. Ever. All support emails to Gmail accounts are now bouncing for us. It's also been more than 2 hours since this problem started.


The analogous claim here would be: “This is why you don’t entrust critical services to third parties.”


The name for this is "victim blaming." Someone suffers from a catastrophe (or from abuse, but that's not what happened here), and you imagine a way that it could've been avoided, never mind that things could've gone just as badly wrong the other way around.


If everyone who uses Gmail hosted their own email servers, those people would suffer far more frequent and far worse outages than Google has had.


but not all at once


But far more consistently. Most users have no idea how the Internet works, let alone hosting something as complex as email. They'd just stop using email and switch entirely to Facebook.


Also, never ever a 550 5.1.1 error.


On the same server... I was missing an email, so I went to look in the other mailbox that should have forwarded it. Got a very suspicious-looking email...

Good thing Gmail wasn't completely stupid and didn't try to forward that one, getting into an infinite loop...


AWS SES just blocked our account due to a high number of bounces :(


Oh great. One more reason I’m not going to be able to sleep tonight.

Other people’s automation scares me. I’m sure mine scares other people as well.


Fortunately I had email bounce notifications set up, and to a non-Gmail address, so I had time to stop the email queue.


The funny thing is that part of this situation started exactly 24h ago, as I too saw increased bounces from AWS SES towards Gmail addresses. Not for everyone though, to be fair.

If anyone from AWS SES is reading - please do not deliver bounce receipts for Gmail for the time being - it makes everyone's situation much worse.
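On the notification-handling side, a hedged sketch like this (assuming the standard SES bounce notification JSON delivered via SNS; `suppress` is a stand-in for whatever suppression-list store you actually use) shows how one might temporarily skip suppressing Gmail hard bounces so subscribers aren't wrongly removed:

    # Sketch of an SNS-triggered handler for SES bounce notifications.
    # `suppress(address)` is a placeholder for your own suppression-list logic.
    import json

    OUTAGE_DOMAINS = {"gmail.com", "googlemail.com"}  # skip suppression for these, for now

    def handle_sns_event(event, suppress):
        for record in event["Records"]:
            notification = json.loads(record["Sns"]["Message"])
            if notification.get("notificationType") != "Bounce":
                continue
            bounce = notification["bounce"]
            for recipient in bounce["bouncedRecipients"]:
                address = recipient["emailAddress"]
                domain = address.rsplit("@", 1)[-1].lower()
                if bounce["bounceType"] == "Permanent" and domain in OUTAGE_DOMAINS:
                    print("Ignoring outage-era hard bounce for", address)
                    continue
                suppress(address)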


Any email sent to my gmail is getting permanently bounced. Says my email does not exist at gmail.


Now think about if some gmail accounts got Thanos snapped out of existence. What kind of digital death would that be as you would NEVER be able to recover your accounts, at least for most people who don't have 2FA.


> Now think about if some gmail accounts got Thanos snapped out of existence. What kind of digital death would that be as you would NEVER be able to recover your accounts, at least for most people who don't have 2FA.

An interesting problem - I have a gmail address, and also one on my personal domain (which uses g-suite for email). The personal domain's backup is a gmail address, and the gmail's backup is my personal domain. When I set that up ages ago, I genuinely never thought about what happens if gmail itself implodes.

I guess I'm setting up a... icloud??? backup email for a bunch of stuff shortly.


Yep, another one here too. Weird, really weird, when both emails have existed with this forwarding scheme for over 2 years.


Seeing the same. Even email from Google f̶o̶r̶ ̶̶w̶o̶r̶k̶g̶r̶o̶u̶p̶s̶ workspace to gmail is having issues.


It's inconsistent for me; I've had two failures and one success in quick succession.


This just happened to me as well. Not great D-:


seeing the same thing

Edit: oops thought it was a text post


You've linked to the same URL as this HN post


Side note - why don't Google status updates contain a timezone? You'd think that they'd have some awareness of, you know... global business... many users in many timezones.

> We will provide an update by 12/15/20, 10:30 PM


At the bottom: "All times are shown in your local timezone unless otherwise noted."


I bet it adjusts for time zone automatically, as I see:

> We will provide an update by 12/15/20, 5:30 PM


I hate that even more as I can never know if they adjusted it correctly or not, even more so if I've traveled recently or used a VPN.


This is Google. They know where you've been. They know where you're going before you do.

I agree that a uniform time zone would be clearer.


Anyone else having issues in general with "the internet" over the last few days?

The images on different websites I've visited don't load (e.g. Twitter, BBC, etc). When they do load, they load verrry slowly.


Yes, the Internet has felt, for lack of a better word, “weird” for the past couple of days. Nothing specific, but similar to your experience, with random things not loading that are fine after a refresh, etc. Or strangely slow responses from websites.


That's generally DNS tampering by backbone providers (Centurylink and Verizon in the USA).

Use dnscrypt.


Oh, the internet is fine and happy.

The "internet" deserves these outages to make people - and CEOs, CIOs, etc - realize that in-house ~~engineering~~ sysadmins used to exist for good reasons.


I reeeeally don't want to go down this rathole but a few days ago when my kids were having problems connecting to YouTube on both wifi and LTE I had begun to think that Trump pulled the 1. Stop Internet, 2. Attempt Coup thing.


While, unlike many other people, I do _not_ believe Trump is an idiot, or even dumb at all, I do not believe he has any knowledge of how to stop the internet even if he wanted to.

Plus, how would he send tweets if no internet? I think he likes being able to use Twitter more than he likes being President.


My cursory understanding of the whole "internet going dark" conspiracy is that this would not be used by Trump, but be used by the shadow government in co-operation with "Big Tech" in order to create panic and confusion amongst the masses so they can't organise, feel helpless and ultimately ask for "big government" and "big tech" to step in and save them.

For whatever reason, gp seems to be under the impression that it's Trump who would use this to his own advantage, which is not consistent with the prevailing conspiracy theories.


Unless it's a coincidence, I don't think this is just Gmail, as I am seeing slow or failed network connections to services in Google Cloud.


I'm glad to have found this thread, but it's kept me up cleaning out thousands of "bounced" blocks from the last 2 days' suppression lists. They are both @gmail.com and G Suite/Workspace domains, and incidentally a ton of random blocks from Comcast as well, from users with months of open history.

This blew away about 10% of our newsletter and marketing subscribers. I can't imagine the time you're having if you send to millions of Google accounts.
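
For anyone doing the same cleanup: assuming account-level suppression is enabled and the addresses were added with reason BOUNCE, the SES v2 API lets you list and delete suppressed destinations directly. A rough sketch with the AWS SDK for JavaScript v3; review the list before bulk-deleting, and add any Workspace domains of your own to the match:

  import { SESv2Client, ListSuppressedDestinationsCommand,
           DeleteSuppressedDestinationCommand } from "@aws-sdk/client-sesv2";

  const ses = new SESv2Client({});

  // Page through addresses suppressed for bounces during the outage window
  // (start/end are Date objects) and un-suppress the gmail.com ones.
  async function unsuppressGmail(start, end) {
    let NextToken;
    do {
      const page = await ses.send(new ListSuppressedDestinationsCommand({
        Reasons: ["BOUNCE"], StartDate: start, EndDate: end, NextToken,
      }));
      for (const s of page.SuppressedDestinationSummaries || []) {
        if (s.EmailAddress.toLowerCase().endsWith("@gmail.com")) {
          await ses.send(new DeleteSuppressedDestinationCommand({
            EmailAddress: s.EmailAddress,
          }));
        }
      }
      NextToken = page.NextToken;
    } while (NextToken);
  }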

Pardon me for being conspiratorial, but I have to say that the timing of these particular issues is of concern in light of the massive attacks the US government and others have faced this week. Adversaries who got a new toy that could mess with the user/resource permission mapping at Google would have plenty of fun using it to go after inboxes and intercept email confirmations. Even though this is 99% likely to be a DevOps chore that created an SRE nightmare, I'll allow myself to believe there's a 1% chance this was SecOps locking down huge swaths of accounts to mitigate a mass email-verification attack.


Have you considered the possibility that these "attacks" occurred and were detected months before but were only announced recently? Could it be that some people have a motive to spin some stories during these politically sensitive times?


Ya that’s quite possible, and probable given how well they are documented to the senate. Dizzying times we live in. The flow of information is dammed and diverted by so many different actors with different motives.

Still, I have good enough reasons to think that some sort of Pandora’s box of backdoors was opened this fall and its fallout is yet to be felt.


We are using a gmail inbox to process some business-critical emails in bulk. I guess those emails will be lost forever?

Thankfully, we have backups, but we will have to move them to the inbox or elsewhere to have them processed.

Edit: As an update, we usually have at least 100 emails come in every hour, and I am seeing none since 4:02 pm EST


Blimey, why use a free Gmail account for a business critical operation? That's just asking for trouble.


They may be using the paid, enterprise G Suite/Google Workspace Gmail: https://workspace.google.com/products/gmail/.


Can they call someone when there is an issue with Gmail? Is there a contract that gives them some leverage and guarantees? What is the process when a business-critical email is lost and needs to be recovered?

For business-critical operations, leaving such questions unanswered can be pretty risky.


Yes, G Suite has an SLA, telephone support, etc.


Now would be an excellent time to ask people who use Gmail for business-critical operations how well that SLA and telephone support work in practice.

I know, for example, companies that use email for handling customer orders at their stores. I can just imagine the loss if the system were down for hours a few days before Christmas or, worse, actually lost the data.


They have telephone support, but the agents aren't empowered to do very much of anything.


I am not going to try it. I just restored data from a backup email server that these emails get duplicated to.


Yeah, to be fair I've never had to test it.


Google doesn't do support.


yeah, this issue affects gsuite domains as well as @gmail.com


Yup. It also seems to be affecting some of our G Suite accounts - although not all(?)

Not a good week for Google


> I guess those emails will be lost forever

Yes.


github has un-verified my account and I'm unable to merge PRs, leave comments, etc... until it's verified again.


Scary! I hope I’m not going to get any email notification until they bring it back up.


Workspace (G Suite) customers: remember to claim under the SLA, 'cos they don't award it automatically.

Under 99.9% uptime is a 3-day credit, and they're currently at 99.4%.
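
To put those numbers in minutes (a back-of-the-envelope sketch, assuming the SLA is measured over a 31-day calendar month):

  // Monthly uptime percentage -> minutes of downtime, assuming a 31-day month.
  const minutesInMonth = 31 * 24 * 60;                  // 44,640
  const downtimeMinutes = (pct) => minutesInMonth * (1 - pct / 100);
  console.log(Math.round(downtimeMinutes(99.9)));       // ~45 min is the 99.9% budget
  console.log(Math.round(downtimeMinutes(99.4)));       // ~268 min (~4.5 h) at 99.4%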


I sent in a support ticket. Is that the right way to claim SLA?


not sure, but that's what I've done too


I use Helm for email. It’s a silent little server in my living room routed via Amazon (using a TLS cert that lives in my living room). I’d say about every three or four months it goes down for 5 minutes if I need to reboot my wifi router.

I use it for privacy (am a fan) but I feel pretty smug knowing I’m getting better reliability too. At least this month :)

No affiliation with the company.

https://www.thehelm.com/


I remember seeing this when it first popped up on HN. Very cool. Home internet is pretty reliable these days and my sense is that it’s getting better, not worse, as people value their internet connectivity more.

Is your data backed up anywhere in case your Helm box burns in a fire?


Helm comes with free backup, fully encrypted so only you can access it. Still, I'd love to be able to turn it off. I have all my email on my laptop, which is itself backed up. I only keep a few weeks' worth on the Helm to let devices download.


What good is a status page that shows the time without a timezone?


There is a note regarding time zones at the bottom of the page: "All times are shown in your local timezone unless otherwise noted."


It should still say the assumed TZ, as others note. Many people use VPNs, and geo-IP lookup is not infallible anyway.

Edit: Apparently it uses browser settings, which means you need scripting enabled for this page.


Which timezone is my local timezone? A bunch of geolocation systems think I'm in Paris. I'm in Melbourne. (My ISP bought an IP block that used to be registered to a French company).

That's a considerable difference.


It's not geolocation - just a javascript function to ask the browser what the UTC offset of your local time is. See https://stackoverflow.com/questions/1091372/getting-the-clie...


Which JavaScript function? The two main ones return two differing results for my browser.

  Intl.DateTimeFormat().resolvedOptions().timeZone

  > "Australia/Melbourne"
Which is UTC +11.

  new Date().getTimezoneOffset()

  > -710
Which is roughly UTC +11.8.

If you're going to shoebox me into a particular timezone - tell me what it is, and let me change it.
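
For what it's worth, the browser exposes both the IANA zone name and a formatted abbreviation, so a status page could label the timestamp explicitly instead of leaving readers guessing. A small sketch of what that could look like:

  // Ask the browser which zone it is assuming, and render a labelled time.
  const zone = Intl.DateTimeFormat().resolvedOptions().timeZone;      // e.g. "Australia/Melbourne"
  const stamp = new Date().toLocaleString(undefined, { timeZoneName: "short" });
  // stamp looks like "16/12/2020, 9:30:00 am AEDT" - an offset plus a name the reader can verify
  console.log(`${stamp} (${zone})`);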


I'm in Spain and my geolocation is also always wrong.


This irks me even on statuspage.io pages (I think?) where times are in PST or something quite often.

Very surprising it’s not localised.


I've noticed references to my.name@google.com instead of my real address my.name@gmail.com — That doesn't seem good.


LinkedIn is telling me I could lose my LinkedIn account because they can't reach my Gmail address.


One @gmail.com account shows no email received from 2:17pm ET Dec 15 to 6:05am ET Dec 16. I normally receive ~4 emails per hour, throughout the day and night. After I realized there might be a fault, I sent myself test messages around 6pm ET on Dec 15 and Gmail bounced them with the error:

> 550-5.1.1 The email account that you tried to reach does not exist. Please try double-checking the recipient's email address for typos or unnecessary spaces. Learn more at https://support.google.com/mail/?p=NoSuchUser x62si100799otb.139 - gsmtp


It has been fun explaining to customers that Gmail has been having issues.


Google Stadia was also down yesterday for a couple of hours[1]. It appeared to be related to loading user accounts into the game. Some games couldn't load at all, others worked for a short time and others were unaffected.

[1] https://www.reddit.com/r/Stadia/comments/kdr2ps/its_not_just...


I've been using Gmail since the invite only beta. In the past year I've had several occasions where important messages get somehow "lost" in the UI. The ones I've noticed are messages I've been waiting for, so I've gone and dug them out of "all mail" in most cases. They just wouldn't show in the normal inbox despite not being flagged as spam.

This outage has now convinced me I need a new email provider.


It seems that Gmail is no longer a priority for Google, which is actually reasonable. Email provision is hardly a growing business anymore, and the market is full of small players. How many Gmail users will bother to move to other services? Even paying ones? Very few, I think. When will Gmail be irrelevant? Probably when email is irrelevant.


Is there a product you could use, like a cache layer for Gmail? An investor I know would love that product. I understand things go down, but it's actually quite a pain to lose access to your existing emails and calendar events. Could he just use a mail app like Outlook? Yes. But that's too complicated to set up, I think.


SMTP was designed to have primary / secondary MX servers, etc. Google decided they know better and chose not to follow the SMTP standards.
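
For context, that fallback lives in DNS: MX records carry a preference value, and a sending server is supposed to try the lowest-preference host first and fall back to the others if it can't connect. An illustrative zone snippet for a hypothetical domain:

  ; hypothetical example - the lower preference value is tried first
  example.com.   IN  MX  10 mx1.example.com.   ; primary
  example.com.   IN  MX  20 mx2.example.com.   ; backup, used when mx1 is unreachable

Note that a backup MX only helps with connection-level failures; it does nothing when the primary answers and returns a 550, which is what happened here.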


SMTP seems to be down as well. "Login to server smtp.gmail.com failed." using an email client

IMAP seems to be working, though.


SMTP went down for 5 minutes. Our service was able to re-connect automatically and has been working fine, if that helps you.

Could be inconsistent though.


It seems it's even worse for Gmail-to-Gmail communication. Some of our company emails disappeared without any kind of bounce information, which is causing a lot of frustration among our customers. I hope they fix it soon; it's a mess of massive proportions.


Gmail service details: https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

All good now apparently


We are seeing a lot of email bounced back, but otherwise usable.


Still getting an error trying to fetch email from one Gmail account to another via POP3. Seems different from the error described in the linked post.


This wouldn’t have happened with Google Wave.


Started sometime earlier today - we got alerted by Salesforce deliverability team around 11am CST.


Imagine my surprise reading this after a whole day of productive freelance-related email exchanges that seemingly went without issues, and then wondering whether all the important stuff was actually received.

By the end of the thread I was wondering why I wasn't affected, until I remembered my small-business email is actually on Exchange 365 :-D


Is Microsoft's Live Mail (Outlook.com) more reliable than Gmail?


I use Outlook.com for my primary email and I've had a few more outages than with Gmail, but nothing I would consider major. The bigger issue is that Outlook.com's spam filtering is drastically worse than Gmail's: I get so much spam that would be obvious to any rudimentary spam filter, I mark it as spam in Outlook.com, and the next day I get the exact same email again.


I'm considering moving to Outlook.com.


Maybe their spam filtering is worse, but they also block way more legitimate smaller mail servers for no good reason. Every few months I have to send an appeal to them so that I can continue sending mail from my mail server. They claim that there were user complaints about spam, but that's pretty much impossible. I think they just silently blacklist smaller mail servers periodically and hope that they can fight spam that way. Obviously they can't.

So if you want to be able to receive emails from smaller mail servers - don't switch to Microsoft.


It always has been. You also get more spam, but you also receive mail that shouldn't have been marked as spam - which Gmail does mark.


COVID seems to have possessed Google - services are down, Android apps that used to take 24 hours to approve are now taking ages - and, as usual, no support is forthcoming. Is Google unravelling?


In Bangladesh we faced serious problems last night.


The fewer Google services we use, the better.


Google Assistant is also having trouble.


Will there be a public post mortem?


Now showing as fixed fwiw.


What should I use instead?


Resolved as of 3:51 PST.


Same issue here


They are getting ready to be acquired by Microsoft :D



