Gmail having issues (google.com)
644 points by mangoman on Dec 15, 2020 | 432 comments



Just got this from the ProtonMail team:

> Dear ProtonMail user,

Starting at around 4:30PM New York (10:30PM Zurich), Gmail suffered a global outage.

A catastrophic failure at Gmail is causing emails sent to Gmail to permanently fail and bounce back. The error message from Gmail is the following:

550-5.1.1 The email account that you tried to reach does not exist.

This is a global issue, and it impacts all email providers trying to send email to Gmail, not just ProtonMail.

Because Gmail is sending a permanent failure, our mail servers will not automatically retry sending these messages (this is standard practice at all email services for handling permanent failures).

We are closely monitoring the situation. At this time, little can be done until Google fixes the problem. We recommend attempting to resend the messages to Gmail users when Google has fixed the problem. You can find the latest status from Google's status page:

https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

Best Regards, The ProtonMail Team


This is the Nightmare Scenario for mailing lists.

Many of them auto-unsubscribe after a bounce.


I said this in another comment, but this seems like a naive way to react to an "address does not exist" error for an address they've already delivered to before. The only legit scenario in which that happens is when the user deletes the address, which is a rare event (pretty much always <= 1 time in the lifetime of any address), and there shouldn't be anything wrong with treating that kind of situation the same as any soft error. If you're wrong, your mail will just get rejected a few more times anyway, and you'll know it's genuinely a dead end.

The underlying issue (wherever this occurs) seems to be lack of nuance regarding error codes when people try to implement robust systems. Different codes imply different things and shouldn't all just fall back into generic buckets.


> I said this in another comment but this seems like a naive way to react to an "address does not exist error" that they've already delivered to before.

Like HTTP, SMTP is designed to be stateless, so the remote server shouldn't be returning a permanent error in temporary-failure scenarios in the first place.

The default error should be 450: "Requested action not taken – The user’s mailbox is unavailable”, not "the user has deleted everything and left".

These standards worked well before big players came along and said, "My responses mean whatever I choose them to mean, and that meaning doesn't always overlap with the established standards." The only exception is spam, and we now have standards to help reduce it.


Your answer kind of misses the point GP was trying to make.

Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record. In this case the returned "user doesn't exist" error is intended behavior of the mail server and the post you replied to still stands. If you sent to that email successfully earlier, it's much more likely that the server is responding erroneously than that the email actually got deleted.


> Your answer kind of misses the point GP was trying to make.

Actually, I don't think so.

> Google's mailserver could genuinely believe that the user doesn't exist, if the user service doesn't fail completely but cannot access part of the data and thus doesn't find a user record.

As a system administrator and/or provider you have to think about worst-case scenarios and provide sensible defaults. Your mail gateway should have some heartbeat checks for the subsystems it depends on (AuthZ, AuthN, Storage, etc.) and it should switch to a fail-safe mode if something happens. Auth is unreliable? Switch to soft-fail on everyone regardless of e-mail validity. You can hard-fail the invalid ones later, when Auth is sane again.

Storage is unreliable? Queue until buffer fills, then switch to error 421 (The service is unavailable due to a connection problem: it may refer to an exceeded limit of simultaneous connections, or a more general temporary problem) or return a similar error.

SMTP allows a lot of transient error communication. Postfix, etc. has a lot of hooks to handle this stuff. Just do it. Being Google doesn't allow you to manage your services irresponsibly. If we can think it, they should be able to do it too.
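A minimal sketch of that fail-safe behaviour (plain Python, with hypothetical stand-ins auth_backend_healthy and user_exists for whatever heartbeat checks and lookups the gateway really has; nothing Gmail-specific): while the lookup backend is degraded, every RCPT gets a 4xx temporary failure instead of a 5xx, so remote MTAs queue and retry instead of hard-bouncing.

    def auth_backend_healthy() -> bool:
        # Hypothetical heartbeat check against the auth/storage subsystems.
        return False  # pretend the backend is degraded right now

    def user_exists(address: str) -> bool:
        # Hypothetical mailbox lookup, only trustworthy when the backend is healthy.
        return address == "alice@example.com"

    def rcpt_response(address: str) -> str:
        if not auth_backend_healthy():
            # Degraded backend: soft-fail everyone, regardless of address validity.
            return "451 4.3.0 Temporary lookup failure, please retry later"
        if not user_exists(address):
            # Healthy backend: the negative answer can be trusted.
            return "550 5.1.1 The email account that you tried to reach does not exist"
        return "250 2.1.5 OK"

    print(rcpt_response("bob@example.com"))  # -> 451 ... while degraded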


Technically speaking it's possible to soft bounce upon 5xx errors, but in practice, retrying even when the destination tells you not to is the quickest way to get your reputation ruined.

Google SMTP servers should have returned a soft bounce here (not hard bounce), so then retry can work.


But then why would Google's mailserver not know that it once delivered email to that mailbox?

If the protocol is stateful, why should the state be kept by the "sender" and not by the "receiver"? Being stateless removes this ambiguity, in my opinion.

Also we should remember how bad sending emails to a non-existent address is for spam reputation, and thus I would not blame it on the mailing list for being "overly cautious".


The situation here is that the service was so borked that it didn't know what it didn't know.

Hard-failing good addresses is much worse than soft-failing bad addresses. In the latter case, the remote sender tries again later and eventually gets a hard bounce. In the former, good addresses are permanently dropped from numerous services, and sent mail is lost rather than retried.

Critical failures should soft bounce until positively determined otherwise.


Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place. This issue is Google sending the wrong error code because of a problem on their end.

Mailing lists believing what an email provider tells them and acting in an overly cautious way is a separate issue.


> Google's user service should be able to tell the difference between a user's data not being available and a user that has been deleted or never existed in the first place.

This can't work; you can say that gmail's system should have a component that recognizes the difference between various failures, but that new component can itself fail. You can't solve the problem of "what if something fails" by saying "just add a new component that won't fail".


Of course it can. Software is complex and that complexity can cause all kinds of problems, as can the fact that the networks linking computers are unreliable, but software is fundamentally deterministic. If you write a piece of code that returns a temporary failure when it can't look up whether a user exists, that code will not mysteriously change itself to start returning permanent user does not exist errors. (Now, if your overall stack is designed in such a way that you can't reliably tell the difference between lookup failures and users that don't exist, you have a problem - but the problem is with the design of the system, not some inherent problem with software.)

Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.


> but software is fundamentally deterministic.

That's true, but human behavior is also fundamentally deterministic, and those two observations are about equally useful.

> Note that this is rather different from physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways due to physical wear and tear, things getting jammed in places, component failure, etc.

No it isn't. Those are deterministic too.


> that code will not mysteriously change itself to start returning permanent user does not exist errors

That is true in a perfect world. In the current world, there are all sorts of ways that code implemented one day does not run the same the next day. Say the code is in an interpreted language and an unrelated sysop updates the language runtime in a way that changes the behavior. Again, in a perfect world that doesn't happen, but that is not always the world we live in. I have great sympathy with people who treat software systems AS IF they were "physical, mechanical systems which can fail in all kinds of exciting and unpredictable ways".


> doesn't fail completely but cannot access part of the data

If a mail server can't tell whether a user/email is valid, it should either return a temporary failure or accept and queue.

Unless of course you're too big to fail, then you just do whatever you want.


I think we’re just teasing at the notion that “permanent failure” isn’t a hard and fast distinction. I think some polite retry policy is not unreasonable even for the most explicit “permanent failure” response from a remote server. Imagine the most extreme example: hackers take over the remote server and make it respond with “permanent failure.” After a day, the legit owners regain control of the system. You can’t really argue that “the remote server never should have delivered that response unless the failure truly was permanent,” because clearly there was a mismatch between the apparent intent behind the response and the actual intent.


The issue is that hard bounces can cause big issues with your email sending reputation, and too many can make you lose access to mailing services such as Amazon SES, so you're encouraged at all points during the implementation of anything that sends email to blacklist any bounced emails. This of course works fine, right up until Gmail starts bouncing all emails.


I think it’s spot on. Gmail’s failure mode in this scenario isn’t correct. The rest of the internet is functioning as designed.


This is exactly it. The RFC has error codes for temporary failures (just like HTTP 503, for example). If you fail to implement the RFC, the joke's on you.


If Google and other major mail providers weren't opaque about this, then fine, but for me a single bounce is an immediate removal. I can't take the risk. I can't imagine the hell that would ensue trying to get through to Google to ask them to take me off their deliverability shitlist.


Has anybody ever received a reply from gmail's postmaster address?

I have good experience with them fixing issues related to their spam-related flagging for messages that are coming from our self-hosted email server, but never got any specific reply.


I 100% assure you that everyone handling gmail errors and getting burned isn’t just tossing failures into a single bucket. There’s a zillion reasons mail can bounce and all of them are taken into account. This is a particular bounce code that signifies that an ESP shouldn’t send email again to this address.

Email service providers are HIGHLY incentivized to act 100% in accordance with the wishes of the system where the mailbox exists because it’s highly likely that acting in any way that’s considered abusive could get your emails landing in a spam folder.

Mailboxes cease to exist thousands of times a day at places I've worked previously. Employees leave all the time and people shut down mailboxes. This is Google's fuckup, nobody else's.


There is actually a very good reason to drop these email addresses, and the reason is that a high rate of non-deliverable emails hurts your sender score. It's a total pain to get emails delivered to the major email providers in the first place, and you immediately land in spam (or with emails not delivered at all) if they don't trust the sending email server or your score is anything but stellar!


I have 2 responses to the sender reputation concern:

1. If the user's mail service penalizes you equally regardless of whether the recipient's address existed a day ago vs. never existed, that itself is absolutely inexcusable, nonsensical behavior that needs to be fixed. You shouldn't do that, just as you shouldn't shoot the mailman (or even arm yourself...) merely because he knocked a second time.

2. Notwithstanding the previous point, I don't buy this as valid justification anyway. The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two. The bounce rate increase due to such an event is very negligible here—people don't suddenly delete their accounts en masse. When that happens, it's clearly due to an outage, not because half the users at that domain suddenly decided to delete their accounts. (Which is something you can also easily detect across the domain as another useful signal to drastically lower the bounce rate across the entire domain, btw, if you're absolutely paranoid about your immaculate delivery rate dropping by an epsilon. But it shouldn't be necessary given how negligible the impact should be.)

So I don't buy this excuse one bit.
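A rough sketch of that retry policy (names and thresholds invented for illustration, not anyone's production logic): a "no such user" bounce for an address we delivered to recently is requeued and retried a couple of times, spaced out, before the address is actually suppressed.

    from datetime import datetime, timedelta

    RETRY_SPACING = timedelta(days=1)
    MAX_RETRIES = 2
    RECENTLY_DELIVERED = timedelta(days=30)

    # Per-address state a list manager might keep: last successful delivery
    # and how many retries we've already spent after a hard bounce.
    state = {"bob@gmail.example": {"last_delivery": datetime(2020, 12, 14), "retries": 0}}

    def on_no_such_user_bounce(address: str, now: datetime) -> str:
        rec = state.get(address)
        recently_ok = rec is not None and now - rec["last_delivery"] < RECENTLY_DELIVERED
        if recently_ok and rec["retries"] < MAX_RETRIES:
            rec["retries"] += 1
            return f"requeue, retry after {now + RETRY_SPACING}"
        return "suppress"  # never delivered before, or retries exhausted

    print(on_no_such_user_bounce("bob@gmail.example", datetime(2020, 12, 15)))  # requeue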


> The proposal isn't that you should blast 100 emails toward the mailbox every time you get a bounce due to an address not existing. The idea was to just exercise some intelligence in the matter. Like maybe just retry a couple times, spaced out by a day or two.

What you're proposing is to explicitly ignore the specification (which says that you should _not_ retry when you receive a 550) and try to implement a custom smart retry logic that handles temporary error cases, but also does not get you blocked.

> So I don't buy this excuse one bit.

I'm all for building resilient services, but "try to detect when a server incorrectly returns 550" is not something I would prioritize at all. I'd happily clean up manually after this occurrence rather than have this complicated retry logic. It's not an "excuse", it's a very sensible trade-off.


No, I am quite explicitly not ignoring the spec. It quite deliberately says should not, not must not. If anyone is ignoring the spec here, it's you, not me. Should not is sound advice; it's telling you what you're supposed to do when you don't have a reason to behave differently. You know, like how you "should not" leave the lights on when you leave your room. Or—more pertinently here—how you "should not" assume everyone is a liar. But when you actively see evidence that deviates from the norm, you are given the power—and arguably the responsibility—to exercise your discretion here to adapt to the situation. If the spec wanted blind obedience, it would say "must not" like it did in 60 other places, but it quite obviously and intentionally decided that would be unwise, and this scenario seems like a pretty clear illustration of that.


But the RFC isn't only for senders, it's also for receivers, isn't it?

That means there are two sides to the interpretation of what SHOULD NOT means. And in this case, senders have, due to experience, interpreted what Google does when someone SHOULD NOTs:

- The sender SHOULD NOT send us the same sequence again when we reply 550, if they do they MUST go on our shitlist.

Obviously it's not so binary and it takes retrying to several different recipients, but people have very good reason to interpret this SHOULD NOT as MUST NOT.


No, that's not a sane way to interpret this RFC for the receiver either. I already answered this, so you'll have to go back to my earlier comment (this might be my last comment as I won't keep repeating myself): any system (be it Google's or anyone else's) that penalizes you equally regardless of whether the recipient's address existed a day ago vs. never existed is just plain trash. A sender that attempts delivery to an address that accepted their email a day ago is obviously unlikely to be a spammer; there's no justification for treating them as one. It is absolutely unreasonable to interpret the sentence this way. Just as it's unreasonable to interpret "the mailman shouldn't knock a second time when he's told the recipient has moved" as "I should never open the door for the mailman ever again if he does so".


Good callout. The underlying issue of the lack of nuance is probably /state/. Being more nuanced about these errors probably requires managing state, which tends to increase the complexity and scaling challenges.


Nuance is not called for. The standard states that a 5xx SMTP error is a permanent error and "The SMTP client SHOULD NOT repeat the exact request"

Gmail screwed up here by returning a 550 error; it's not anyone else's job to try to second-guess that or retry in contradiction of the accepted standard.

https://tools.ietf.org/html/rfc5321


Gmail screwed up, but that's beside the point. We're talking about designing robust systems. You don't design a robust system by assuming nobody will screw up!

Re: the RFC, note it says "should not", not "must not". That seems to suggest they acknowledge repeating might actually make sense in some cases. And honestly the practicalities of this situation and the risk-reward tradeoff seriously tilts toward repeating the request later regardless of what the RFC says. The world isn't going to end.


Try delivering to invalid email addresses too many times (too many of course being up to each mail provider), and you will be the one shitlisted (and rightfully so, as you are likely bruteforce enumerating valid email addresses).

For any small provider, getting on the shitlist is catastrophic as unlike the big providers, getting off of it will be hard / impossible.


Rules for thee, not for me


> And honestly the practicalities of this situation and the risk-reward tradeoff seriously tilts toward repeating the request later regardless of what the RFC says. The world isn't going to end.

That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

If you believe the standard is not robust enough to handle problems like this, first work towards a fix to the standard and then implement the solution. Not the other way round.


> That is exactly the thought process that leads to non-standard mess that we see numerous examples of.

I didn't suggest people should apply this thought process in arbitrary cases. I said it should be applied in this case. You can take any thought process that gives a good outcome in one situation and obtain a bad outcome by applying it to the wrong situation. That's not an indictment of the thought process. It's just an indictment of the person failing to correctly judge its applicability.

That said, by all means, do try and go fix the standard; I wasn't trying to imply you shouldn't do that.


Ah I think I did not describe the repercussions of making exceptions (even if they are in highly specialized cases like this). If you allow yourself to make such exceptions, you diminish the motivation for you (or someone else) to fix the problem at the right place. Most workarounds tend to live forever.


There's no clear-cut rule here. Some workarounds stay workarounds and never get standardized. Some become so well-accepted and adopted that people then put them into standards. It's great to put things into standards, so by all means, do try to improve standards. But that shouldn't block you from everything. At the end of the day, standardization is just a means to an end, and the end is what matters here. Nobody cares if their mailman's knocks follows an RFC or not. They just want their mailman to deliver packages with reasonably minimal disruption.


> There's no clear-cut rule here

Exactly, that is why it is important to follow standards. Most engineering decisions are not clear-cut and are born out of tradeoffs. That is why we agree on standards that define those tradeoffs instead of every one of us having our own take on situations.

> Nobody cares if their mailman's knocks follows an RFC or not

If there is a Mailman RFC which says: "If someone opens the door and says `Mike does not live here' then DO NOT attempt delivering the same package"

THEN I expect the mailman to not bother me again, EVEN IF it was actually my mistake that I forgot my roommate Mike actually does live at this address.


I'm tired of arguing about this. Engineers agree on standards for a good reason, yes, but they also agree on "should not" rather than "must not" for a good reason too. I'll leave this as my last comment, but you might want to read the post-mortem. Turns out their implementation of the RFC wasn't even buggy. They just messed up the domain name in the configuration. Which you can only be resilient to by retrying the request sometime later.


But here’s the thing: the standard (like all standards) is obviously not robust enough to physically prevent responses which incorrectly indicate permanent failure.

These incorrect responses could be caused by mistakes which the remote server admins could reasonably avoid, like software bugs. I understand not having much sympathy for that case, especially from an organization with no shortage of resources. But they could also be caused by, for example, hackers or governments exerting control over the remote server temporarily.

A standard which explicitly refuses to acknowledge these possibilities is not what I would describe as “robust.” An obvious better alternative would be to set some standards around what constitutes a polite retry policy.


My understanding is that should not means that you should not try to retry. If I do retry then the other party can rightfully claim that I am DDoSing their service, trying to send emails to deleted accounts, or put me on a spam list. I do not think that ignoring the RFC and trying to cover up for Google is the best course of action here. Maybe, just maybe, this is the right time for people to realise what it really means to have an entity like Google. Because as it stands, we are going to have the DNS infrastructure moved over to them with DoH, and a similar outage is going to be even more devastating. The internet was designed to be resilient to failure because of its distributed nature, and right now this just shows why concentrating resources in one place is bad.


You "should not" repeat delivery in basically the same way the mailman "should not" knock a second time if he's told the recipient doesn't reside at the designated address. What "should not" means in these cases is: "knock only once, and assume you're being told the truth in the absence of further evidence to the contrary". But when you clearly saw the recipient reside there yesterday, it makes sense to try to knock and catch him again tomorrow. Because, you know, maybe something went wrong, e.g. maybe the person who opened the door didn't recognize the name (or whatever). At the end of the day, the mailman's job is to deliver the mail with minimal disruption, not to play hot potato with envelopes.


The terminology is well defined [0], so in this case, retrying is not ignoring the RFC.

It's a difficult one though, because as you rightfully state, covering up for Google is not the best course of action for the system as a whole, yet it's likely a good course of action for those users who didn't get their emails.

[0]: 4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

[1]: https://tools.ietf.org/html/rfc2119


In most Internet Engineering Task Force RFCs, the standard verbiage for "must not" usually is in fact "should not".


The phrase "must not" appears some 60 times in this RFC.


Thanks for pointing that out. I suppose an RFC writing style guide would be helpful to have consistency in language and interpretation.


The standard says “don’t resend,” it doesn’t say “assume the worst and begin removing user from all systems.” That was the mailing list software’s decision.


You generally avoid sending to known bad addresses or your reputation will be destroyed very quickly. The 550 response is (read: was) a clear "you fucked up, this user doesn't exist" prior to this.

I saw someone on Reddit say his SES was suspended for sending tons of bounced emails in a short period of time - it's taken very seriously by ESPs.

E: also user rtx a few comments below


We're not talking about repeating the exact request; a subsequent request for the same recipient would be to deliver a completely different message: whatever subsequent message is sent to the mailing list.


Right. In this case it's already pretty typical for mailing lists to track bounces and retry under some errors, so I imagined that part is mostly done, and the missing piece would be taking more care in checking the error conditions.


Aside - I'm not an expert but systems like MailChimp will get very worked up if your list has lots of undeliverable addresses on it. This can trigger an audit of your list which prevents sending, etc. These audits seem to take quite a while, in my very limited experience.


So what you're saying is, if you're annoyed by "subscribe to our mailing list" modal popups, "doesnotexist@garbage.blah" is better than "jeff@amazon.com"?


In practice, no, it's more nuanced than that. Any mailing list operated through any remotely legitimate ESP will require subscriptions to be confirmed/acknowledged up front before any delivery is attempted to a recipient. If the confirmation step fails, i.e. the "check your email and click a link to verify you really signed up" email bounces, or nobody ever clicks the link, the list owner isn't generally going to be penalized for that.

If you want revenge for modal popups, your best bet is to create a bunch of throwaway email accounts, subscribe to the mailing list from them, and start reporting the individual messages as spam when they arrive. Flag them as junk at the mailbox provider (Gmail, Outlook, etc.) and use the links in the List-Unsubscribe headers to flag them at the ESP's end, too.


If you're trying to get the web site's mail server blacklisted, definitely.


Aka throw the RFC out of the window and implement a broken system because Google did that?


> I said this in another comment but this seems like a naive way

That's the standards-compliant way. Also I'd argue that spec'ing your code to handle cases where Google fails that badly is (was?) a poor allocation of LoCs.


You're entirely missing the point by blaming this on Google. This is meant to detect and handle some failure modes, and they could happen to anyone (including Google), for reasons that can be both inside and outside their control.


I had this issue with GitLab. My email provider returned a permanent error one day (due to an issue on their end), so GitLab silently stopped sending any emails to my address. I checked my email in the preferences many times and had no idea it was blocked on GitLab's end. Eventually, after not getting any notifications, I contacted their customer service and was told of this hidden setting.

So if you are not getting any notifications from GitLab, even though your email is correct, I suggest contacting them and asking if you have been blocked due to an error.


I posted this as a problem in my problem validation platform[1] and a user has built a quick solution by displaying a token if the email service received an email from the sender.

[1]'Check email service status before sending emails' - https://needgap.com/problems/178-check-email-service-status-...


Great point. And email delivery services that have auto-suppression lists to protect reputation could potentially help too; at least they might be able to remove entries on behalf of their customers.


Good. I was hoping this was the case. Unfortunately I already moved to fastmail so there will be little benefit to me.


Oh no.


My account with Amazon went into review because of this. I hope their team is aware of it.


Interesting response. And spot on from the technical integrity side. It’s also more fair to email providers as a whole to treat them all the same and respect their error messages. I mean, maybe there’s even requirements in some jurisdictions to deal with the address not found error in a specific way. As an email sender I think I’d prefer the message get auto re-sent after Gmail comes back online though.


> Because Gmail is sending a permanent failure, our mail servers will not automatically retry sending these messages (this is standard practice at all email services for handling permanent failures).

I fear that this will lead to many lost mails. In my experience, users are often confused by the technical "Mail delivery failed" mails and tend to ignore them or write them off as spam.


> P.S. You might also consider asking your contacts who are still using Gmail to switch to ProtonMail for more private communications


Confirmed likewise.


This feels like a cheap shot at Google. Shit happens, and they're not immune to it even if the servers are located in Zurich. Running a datacenter is no easy task.


I see it as Protonmail explaining to their users that the failure is not on their end and why they can't do much about a remedy. Seems purely factual. A cheap shot would be generalizing from the event, but I don't see them doing that.


I think I got this completely wrong. What you and other responses are saying makes sense.


Being down is okay. Returning an error message that results in the data being thrown away instead of being requeued is not. Block incoming smtp connections until your app layer is fixed.


> Block incoming smtp connections until your app layer is fixed.

Or returning one of the 4xx status codes which indicate less-permanent failure state like:

- 451 Requested action aborted: local error in processing

Which is kinda like an HTTP internal server error as it can mean anything.


For my comment’s purposes, I assume if this was possible with a flag or config setting (and the code path existed), it would’ve already been done. Doesn’t seem like they can, so they should’ve pulled the handbrake and gone “full stop” without throwing everyone’s mail away (hence blocking incoming connections and let the mail sit in all of the external MTA queues).

Another option would’ve been to accept everything with a very lightweight smtp ingest service, journal it all, and play it back to the full frontend after their code fix was pushed out.

Not an SRE so ¯\_(ツ)_/¯ just some thoughts from my time in a similar role and similar pain points (but thankfully not at this scale)


Yeah, this is a particularly pernicious failure given how email works. Many mailing providers will just mark these as blacklisted, now, and lots of unsophisticated users likely won't notice.


I consider myself sophisticated enough, but my Bitwarden has 700 accounts, of which ~30% (the older ones) are registered with a gmail address, and the rest are handled behind G Suite. Granted that last bit might be partly my fault, even though I paid for it. But even for a "sophisticated" user, I have no easy way of knowing if any of these accounts have silently failed to function now, other than by the passage of time and eventually finding out.


Oh, absolutely, even for sophisticated users mitigating may be difficult or impossible depending on exactly what bounced and how. But you at least are aware that this happened, and that you have a problem. Think how many people are out there with no clue what this error meant, or that it signaled an ecosystem problem, or that just had hundreds or thousands of emails silently bounce and unsubscribe.


A lot of people and companies use Gmail. Email providers are definitely getting support requests from users that don't know what's going on.

This is not a cheap shot, but a message to inform users that it's an issue with Google that Protonmail can do nothing about.


More like "if mail to gmail fails it's not us, so please don't flood the support with complaints".

> Running a datacenter is no easy task.

Sure, but then there are very few companies which have more experience with running data centers and (normally) providing reliable email service.

So any outage for more than just a short time is very unusual. I'm really interested in what went wrong.


CEO of an email marketing platform here (EmailOctopus). If anyone's curious, here's a chart showing our bounce rate to Gmail addresses over the course of the week:

https://pbs.twimg.com/media/EpUE20UXYAEa_Uv?format=jpg&name=...

That's a peak of 90% of Gmail inboxes bouncing – and this has been going on for almost 24 hours.


I know this is your livelihood, but as someone who basically never wants marketing emails, all I can think is "nice", hopefully I get auto-unsubscribed from a ton of lists.


If they normally successfully deliver to gmail, it's safe to assume a large number of people who do receive their emails want to receive them.


This is very charitable. How many people live with the nuisance of mailings (they un- or knowingly subscribed to) vs. those who actually go through the trouble of unsubscribing/marking as spam in the hope of ridding their inbox of them?


I normally just delete mailing list mails. I don't even read them.

This year i decided to do "something" about it, so every mailing list mail received in my inbox that i don't want/care for gets an unsubscribe. It has already reduced my daily mails by a somewhat large amount. It's hard to say exactly how much, but i estimate around 10 emails less every day.

Most of the unsubscribed lists are from companies where i've purchased something and the seller took the liberty of subscribing me to their mailing list. Those are mostly pre-GDPR that i've just never gotten around to dealing with.

The exception is of course obvious spam mails, to which unsubscribing will probably do more harm than good.


That conclusion makes zero sense to me unless counting on the nebulous nature of the descriptor, “a large number”. They deliver successfully to my Gmail account on a regular basis so I must want to receive it? Feels like you’re telling me to stop dressing like a slut. ;)


Totally agree, especially as I signed up for exactly zero of them.

Rant: As a side note, I usually try and buy direct when shopping online rather than through Amazon (for all but the most trivial purchases) and this is the 2nd largest drawback (behind filling in CC and shipping info) - because I bought one item from you, once in my life, does not mean send me a daily email, and then when unsubscribing pretend like I signed up for them! For me it's one of the easiest ways to destroy brand loyalty/reputation.


This would affect all email types including emails like receipts, shipment confirmations, password resets, account verification.

Plenty of critical communications get caught in this storm...


How do the public gmail addresses compare to the enterprise (used to be G Suite, now Google Workspace) ones?


I would be very interested to know this as well. I am trying to switch my company over to Google Workspace right now and support has been telling me my signup issues will be "resolved in 48 hours or less."

What a joke. And this after we're leaving AWS Workmail because of bounced emails.

No luck with signing up so far.


Heavily recommend you don't switch your company over to Google. Microsoft seems to understand that in the enterprise world you actually have to have support personnel, not just an opaque AI without chance for appeal


Google has decent support for paying customers.


You can actually appeal things when you start paying.


Consider yourself lucky. I have some ad words in "approval" process for 6 months now. I kid you not - every Friday I receive an email stating that the update will be sent to me on Monday (insert date here). Then nothing happens on Monday until Friday comes and I get exactly the same copy, only the date is different. At this point I literally laugh.

About your query

I gather that you are concerned about your Ads Disapproval for your Google Ads Account.

Observation

I understand that this is taking a bit longer as we are working with a limited staff due to Global pandemic and there is another team who reviews the account so there can be a slight delay in the decision I apologize for the inconvenience caused as I understand this is not the answer which you are looking for but be rest assured I will get back to you on coming Friday 12/18/2020 end of business day.

For any further assistance, I am just an email away.

Sincerely,


SLA of less than 99.5%... Or if there are multiple issues, even sub 99%... That really is a joke...


Anecdotally, my enterprise account seems unaffected.


Also anecdotally, during the outage, test messages from my non-gmail account to my standalone/non-enterprise gmail accounts consistently bounced; test messages from my non-gmail account to my G Suite Business-associated account went through.


Serious question: how would you know that you are receiving ALL emails from ALL senders?


Totally valid, and I wouldn't. The status page indicates that "Google Workspaces" is affected, but I don't know if that is synonymous with what I have (which was Google Apps a decade ago, unsure now). All I can say is I was receiving emails during the affected window.


As an ESP, how much of a headache will this be for you in weeks/months to come? I'm guessing this throws a huge wrench in deliverability techniques--how're you handling it?


It's a real headache but should be fully reversible. @shmoogy hit the nail on the head: we'll run through our events in that timeframe, inspect the raw bounce reason to check it relates to the Gmail outage, then undo the actions that the bounce caused.

The reason why this is so nasty is not because Gmail went down, but because they returned a 5XX permanent failure and not a 4XX temporary failure for these bounces. Literally every email provider will respond to a permanent bounce by suppressing all further emails to that email address (it's permanent, after all!), so the fallout from this will be huge.


I would imagine since it's a known timeframe, domain, and error response, they can cleanly remove the suppression lists.

I logged into our sendgrid and mailgun accounts and manually purged all the failed gmail records.
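For anyone scripting the same cleanup against SendGrid, something along these lines should work with the v3 suppression API (endpoint paths and parameters quoted from memory, so double-check the current docs; the key and time window are placeholders):

    import requests

    API_KEY = "SG.xxxxx"  # placeholder
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}
    START, END = 1607983200, 1607997600  # Unix timestamps bracketing the outage

    # List bounces recorded during the window, keep only the Gmail 5.1.1 ones.
    resp = requests.get("https://api.sendgrid.com/v3/suppression/bounces",
                        headers=HEADERS,
                        params={"start_time": START, "end_time": END})
    affected = [b["email"] for b in resp.json()
                if b["email"].endswith("@gmail.com") and "5.1.1" in b.get("reason", "")]

    # Remove each affected address from the bounce suppression list.
    for email in affected:
        requests.delete(f"https://api.sendgrid.com/v3/suppression/bounces/{email}",
                        headers=HEADERS)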


Might also be affecting GSuite/Workspace emails.


The hard bounce status might be stored outside of your lists. I am not sure customers can easily change a hard bounce status themselves. Do you mean you just deleted those records with intent to re-add to reset the status? On our BigMailer platform this wouldn't work as hard bounce status would get preserved.


We use SendGrid and Mailgun right now, and both of these expose the suppression list, email address, time, and reason code + description. In SendGrid you can filter and mass-select to remove suppressions easily (which was great). In Mailgun I had to export a CSV and just removed them manually as there were not too many across my accounts.

Customers generally cannot change this on their end as far as I can imagine -- this is on the ESP end and is a protection built in because you are sending from their IP / Server and they don't take kindly to that.


+1 what Jonathan said. Typically, when email service providers are down the response code indicates a temporary issue with a soft bounce code, so you can still try to send to that address in the future.

The action for rectifying isn't too difficult, but the implications are still pretty big...


Mailgun added a few new suppressions due to bounced Gmail addresses. Hope ESPs just flush those out.


Thanks for sharing Jonathan, unprecedented situation. And that's just gmail.com addresses we can see data on, while there are all those business domains that use Google Apps for their email that probably experienced a similar issue...


What's this do to your mail-queue size - let's see that chart


Permanent failures, as these are being flagged, don't stay in the queue.


"Type: Permanent; SubType: General; Code: smtp; 550-5.1.1 The email account that you tried to reach does not exist. Please try 550-5.1.1 double-checking the recipient's email address for typos or 550-5.1.1 unnecessary spaces. Learn more at 550 5.1.1 https://support.google.com/mail/?p=NoSuchUser y128si147264pfg.177 - gsmtp"


This is pretty much the worst response possible. Hard bounces mean that email delivery services are going to start automatically removing, or at least stopping delivery to, entire slews of email addresses.

A lot of clean up is going to be needed as a result of this.

To add some more details, when using a 3rd party email delivery service, those services will either black-list or just outright remove email addresses when they get a hard bounce "email address no longer exists" message back.

Some providers make re-adding an address after a hard bounce a non-trivial task, since after all, the authority on that email address just said it doesn't exist.

This is going to be really ugly.


I really cannot believe they did not immediately hack in a new rule to their SMTP server: never return a 5xx (permanent failure), instead return a 421 (temporary failure try again later).

That simple fix buys them 24-72 hours to solve this properly.

Yeah, it burdens servers sending mail to them because now they have to hold on to all mail (including mail that really is permanently undeliverable) for another day or so, but that's still better than what's happening right now.


Why would that be better than just shutting off the delivery stack altogether?


5xx error results in suppression list addition of an email, so future emails won't be delivered (by most ESPs), and not returning MX response would probably be just as bad, or worse (or result in millions/billions of emails being re-queued due to timeouts?)

His solution would mean the exponential retry backoff baked into most services kicks in, which would buy them a few hours, and result in no lost emails and no suppression list additions.


Failure of response from the server is nearly always treated as temp failure, because it could be down to network connectivity, name resolution, etc.

That is a better scenario, than 5xx.


Inability to contact the destination would be treated as a temp-failure by the origin, and taking the service off the air could be effected instantly.


In case less than 100% of gmail is experiencing this bug.


This outage seems to have lasted for about 2.5 hours. Probably this was fixed by rolling back whatever caused it. (I don't think the rollout was finished before they resolved it; my mail server sends a lot of emails to Gmail addresses, and even at peak I was only seeing maybe about 1/3 of mails being rejected.)

There is no way that putting in a hardcoded hack like that would have been faster. Making the change is, of course, fast.

But then you need to review it (and this is a super risky change, so the review can't be rubber stamped). Build a production build and run all your qualification tests. (Hope you found all the tests that depend on permanent errors being signalled properly). And then roll it out globally, which again is a risky operation, but with the additional problem that rolling restarts simply can't be done faster than a certain speed since you can only restart so many processes at once while still continuing to serve traffic.

The kind of thing you describe simply can't be done by changing the SMTP server, in 2.5 hours. The best you could get is if there was some kind of abuse or security related articulation point in the system, with fast pushes as required by the problem domain but still with the sufficient power to either prevent the requests from reaching the SMTP server at all, or intercept and change the response.

As a trivial example, something like blocking the SMTP port with a firewall rule could have been viable. Though it has the cost of degrading performance for everyone rather than just the affected requests.


This has been going on for 2 days, not 2 hours.


The linked status page shows a 2.5 hour duration.

My mail server logs show about 20 failures in all of the last week until yesterday 20:43 CET, then 350 failures between 20:43-00:21, then nothing after that. So fair enough, from the client side rather than the status page it looks like 3.5 hours rather than 2.5.

But still, given that resolution time, the suggested solution of changing the SMTP server is absolutely ludicrous.


Yes. I email hundreds of thousands of Gmail users each week (yes, double opt in, they all want the mails!) and we immediately delete any user for whom any Gmail error comes up at all in order to keep a solid delivery record with them. Sounds like we might have deleted 80% of our list if we'd sent today..!


My sanity tests started acting flaky ~3 hours ago, I never thought it was Gmail...

Kind of happy I had to do something else and I didn't burn hours investigating.


So a new thing to do: quarantine addresses instead of deleting them, and if most addresses for one provider fail, don't drop them; give them another (maybe manually triggered) try later on.

(And if no such provider-wide failure is detected, delete the quarantined mail addresses.)


My guess is that's how most email service providers handle this - they don't actually delete the email and just have a flag on it - bounced, complaint, unsub. This way the list owner can run an export and see all the status codes.


Hope you have a backup just in case.


Yes, we're unusual in not relying on third parties for list management. We can rollback. Or I might just comment out the 'unsub on hard bounce' code for the rest of the week..! :)


Unsub on two consecutive bounces seems more reasonable to catch flukes (or Gmail going down)?


Yes, most likely! That is a common approach for 'soft bounces' in most list management systems (e.g. MailChimp).

The problem here is Gmail has been throwing out "NoSuchUser" errors which are an instant unsub in most systems because Gmail takes repeated delivery to non-existing addresses into account for deliverability purposes.

I'm extremely paranoid about email hygiene, tiny bounce rates and high delivery rates, so we aggressively unsubscribe troublesome addresses (often to the point of getting reader complaints about it) for many reasons beyond that, however.


> Gmail takes repeated delivery to non-existing addresses into account for deliverability purposes.

I think you mean "reputation purposes"?

If so, wow, that sucks. Their opaque rules have conditioned their counterparties to punish Google as hard as possible for a screwup.


> Their opaque rules have conditioned their counterparties to punish Google as hard as possible for a screwup.

Good for karma, bad for everyone though.


> I think you mean "reputation purposes"?

That better describes what I was trying to say, yes. Reputation then affecting deliverability.

Over 80% of our subscribers use Gmail so to say I'm paranoid about maintaining a good record with them is an understatement ;-) Gmail is a huge weak link for us.


Ah, thanks for the explanation.


Logically you'd expect unsubscribe to only act after lots of bounces of this format when the address has been receiving mail fine before. It also seems reasonable not to trust such bounces for the entire domain for a while when this happens to lots of other addresses that have worked fine before. Not that I expect software currently works this way, but it does seem like a common sense thing to code in.


I mean, it's possible, but you'd need to queue up a day's worth of bounces, do the analysis, and then handle the bounces asynchronously later on to do that.

Most systems operate more immediately in isolation on individual addresses than that right now, because such analysis is generally not needed (until today, of course ;-)).


Mail agents already queue emails that bounce though; it's a matter of changing the conditions for when you retry and/or unsubscribe. I imagine you can do the analysis in real time too... just look at the bounce and see if it pertains to an email you sent to in the past, and if so, increment some rolling counter for that domain.
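A toy version of that rolling counter (thresholds made up for illustration): track recent "no such user" bounces per domain, noting whether each address was deliverable before, and once the share of previously-good addresses gets implausibly high, treat further bounces from that domain as a provider outage rather than real removals.

    from collections import defaultdict, deque

    WINDOW = 500          # recent bounce events kept per domain
    MIN_EVENTS = 50
    OUTAGE_THRESHOLD = 0.5

    recent = defaultdict(lambda: deque(maxlen=WINDOW))

    def record_bounce(domain: str, previously_deliverable: bool) -> str:
        events = recent[domain]
        events.append(previously_deliverable)
        share = sum(events) / len(events)
        if len(events) >= MIN_EVENTS and share > OUTAGE_THRESHOLD:
            return "likely provider outage: quarantine and retry later"
        return "normal bounce handling"

    for _ in range(100):
        status = record_bounce("gmail.com", previously_deliverable=True)
    print(status)  # -> likely provider outage: quarantine and retry later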


Their SMTP server being unreachable is a 4xx temporary error. The sender MUST keep trying for at least 24 hours, and 72 hours is recommended.

"Gmail going down" would not have caused this problem. Even if all their SMTP servers went offline.


Yeah, they would have been better off pulling the (metaphorical) plug—maybe block incoming traffic to port 25 or something—until they had this fixed.


Mailgun sent a warning mail about increased bounces from our account. Sure, they know what's going on... but we send a 4-5 digit number of mails per hour - it's a lot of bounces.

That means I can't just resend the emails blindly, because I'm too scared to trigger some sort of automatic suspension...

(I don't do this regularly, so I'm not familiar with all the features... additional mail verification could probably help...)


They should be returning 421 for backend outages so that sending servers queue and retry the emails. 550 can be interpreted by some as deleted [1] or even banned accounts in some cases. Maybe someone here could convince them to change the logic that occurs during an outage.

[1] - https://en.wikipedia.org/wiki/List_of_SMTP_server_return_cod...


Yah. Maybe there's an unexpected way that things can fail resulting in 550's. But maybe at Google's scale you should have some kind of kill switch to stop answering SMTP or to not send permanent errors at all, so that you could flip a switch and prevent the worst consequences of this rather than let it go on for a couple of hours.


Absolutely this.

I am astonished that either (a) this switch has not been flipped yet or (b) this switch does not exist.

Somebody is incompetent here.


Perhaps Gmail is just being discontinued ;)


don't get my hopes up!


A lot of people will lose transactional email messages, because of this.

I'd absolutely hate to be hit by this at this time. Thankfully I made the time investment to run my own mail server years ago. The handful of times it broke down, it either went offline or started returning 4xx codes due to a misconfigured or broken milter after an update. Neither meant lost messages from normal senders that use queuing MTAs.


Same for me, mainly for privacy concerns. And I back it up daily to my local NAS. It's so easy to configure and run your own mail server, that I'm surprised we are the minority in the tech community.


> It's so easy to configure and run your own mail server

Is it? Is dealing with IP reputation, getting your emails accepted by major providers, and being on the hook for fixing everything yourself very easy? I haven't tried, so I don't have personal experience, but I've heard enough horror stories to think that it's not a good use of my time.


Sending side of the MTA can be set up manually in about an hour on a Debian server, with dmarc, dkim, spf, etc. Make that a day if you want to read up on and understand each of the things in more detail, if you haven't configured them before. There's really not much to play with in this direction for a typical personal mail server.

Receiving side is where there is a great range of options, and many things to try and have fun with. You can have anything from a single catchall mailbox with no filtering, no GUI, and a simple IMAP or POP3 access for MUA, to a multi-account, multi-domain setup with server side filtering, database driven mailbox and alias management, proper TLS, web MUA access, etc. It can also be built up gradually, starting from very simple setup to something more complicated so that you never lose account of how things work.


Mine are accepted by Gmail so I am good. Considering how dominant Gmail is, that's all that really matters.

Regarding getting a bad IP rating, normally that's due to having an insecure config, like acting as an open relay, or not having DKIM enabled. There are lots of tutorials online about this, if you know Linux it really is easy.


I had an IP reputation issue and managed to resolve it after some time.

TLDR: Before you spin up a mail server, check if your IP address is on any of the blacklists [0]-[1] as well as Proof Point's list [2]. If it is, then try and get a different IP address.

I spun up a hosted server on Digital Ocean and received an IP address. I checked several black lists from a few email testing/troubleshooting sites [0] and [1] and all was groovy; my IP address wasn't on any list.

I got a bunch of 521 bounces when I tried emailing a neighbor who had an att.net address.

So, I checked the troubleshooting websites, and my IP address was listed as clean.

My logs said I should forward the error to abuse_rbl@abuse-att.net, so I did.

Those emails were never delivered, because abuse-att.net had its own blacklist. I was getting 553 errors. In the logs, the message from their server told me to check https://ipcheck.proofpoint.com.

Proof point runs their own blacklist that some enterprises use (e.g. att and apple [3]). I checked their list, and lo and behold, my IP address from Digital Ocean was blocked [2]. Digital Ocean wasn't able to remove the IP address from their blocklist and suggested I spin up a new droplet with a different IP address.

I didn't want to do that, so I sent Proof Point an email that went unanswered; the email asked them to remove my IP address. I forgot about the issue for five or six months (this is a personal server), and ran into the issue again a few months ago. So I sent Proof Point an email again, this time with different wording emphasizing that "my clients" were having delivery issues. Within a day, they removed my IP address from their block list.

So, my main suggestion is to check if your IP address is on any of the blacklists as well as Proof Point's list before you start on your server. If it is, then try and get a different IP address.

Does anyone have more "enterprise" lists, like Proof Point, to check?

[0]: https://www.mail-tester.com/

[1]: https://mxtoolbox.com/blacklists.aspx

[2]: https://ipcheck.proofpoint.com

[3]: https://www.reddit.com/r/email/comments/6toxzr/ip_blocked_by...
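For the conventional DNS-based lists, the check itself is easy to script: a DNSBL is queried by reversing the IP's octets, appending the list's zone, and doing an A lookup (any answer means the IP is listed). Proofpoint's list isn't a public DNSBL, so that one still has to go through their web checker. A small sketch:

    import socket

    def dnsbl_listed(ip: str, zone: str) -> bool:
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)  # any A record means the IP is listed
            return True
        except socket.gaierror:
            return False

    for zone in ("zen.spamhaus.org", "bl.spamcop.net"):
        print(zone, dnsbl_listed("203.0.113.7", zone))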



It may be helpful to note that Google has acknowledged they are working on similar issues (the description is vague!) with an ETTR of 1900 EST:

https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

On the other hand, their status dashboard reported similar issues yesterday and here we are again: https://www.google.com/appsstatus#hl=en&v=status


Yes, hard bounces even between Gmail addresses.


Just curious, how did you check bounce stats for Gmail?


I also had the same hard bounce (when emailing from a non-gmail address -- fastmail -- to a gmail address). Sent it again minutes later and then it worked.


Incoming Gmail is bouncing, but I'm still able to access all prior received messages.


TL;DR: Don't send your newsletters today if you can avoid it.


Over the past 24 hours, I've had GitHub request that I re-verify my gmail three times (roughly 22 hours ago, 2 hours ago, and now), each time resetting my primary email's status to "Undeliverable" and "Unverified"

The triggering event may be an email bounce. I get a lot of github notifications sent to my email, and the failure of just one/a few may trigger the reverification.


This is another good reason to have email @yourowndomain.tld

When this happens, you can spin up a temporary server and have a mechanism in place to redirect email so you don't go down when your provider does.


I've had way more downtime trying to run my own domain's mailserver for a year than I have with gmail for more than a decade.


That's not what I said. With some emphasis added:

> When this happens, you can spin up a temporary server and have a mechanism in place to redirect email so you don't go down when your provider does.

Use a commercial provider, but fall back to your own server when it goes down without changing your email address.


I see two problems here: The likelihood your service is restored before you spin up your own mail server, and the fact that, not expecting this failure, their DNS may have a fairly lengthy TTL.


https://mailinabox.email/ Can be set up relatively quickly


What about permanent problem, like suspended account?


In that case, owning your own domain is golden. I just don't see "spin up your own mail server" as a short term solution.


Having run my own mail server for over a decade, I have yet to see a single instance where the server responded with a permanent "account does not exist" error and bounced mail.

Losing incoming email is pretty much the worst-case scenario when it comes to configuration errors. It's about as bad as not having backups, in that both cases result in unrecoverable loss of data.


Use a paid email host, just anything but Google. Life's too short to put up with managing your own email server.


It can just as well be Google, just the paid Apps version. Zero time to get used to a different UI. I suspect there must be a solution to easily migrate all your tags and filtering rules. (Tags are the killer feature to me. Outlook sort of has them, but they are less flexible.)


Does the paid Apps version have better uptime? Is it not affected by the current issues?


My company has paid apps, and we have been facing issues same as everyone else.


I switched to a custom domain only when gmail torpedoed one of my secondary gmail accounts.


You can redirect to a commercial service as well.


Not me, and I'm not even paying for the services I've been switching between.


Keep in mind other stuff like DNS will go down randomly. At least they won't result in a permanent address-doesn't-exist error, but you'll be putting out potentially more fires that way.


I just switched to Fastmail before all this.


Except as an academic exercise, trying to roll and maintain your own email is fraught with difficulties.


You can forward handling to a provider, like gmail. The idea is that you own your email address and can switch providers more easily if you are not satisfied with them or they turn out to be evil.


still use gmail to manage email lmao


Yep, there was a very similar event yesterday, approx. 22 hours ago: https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=10...


I figured one major incident for Google was enough for the day! We had a bunch of email bounce to @gmail domains yesterday in that timeframe.


When that happened I panicked a little, realizing how much Google Sheets data I had that wasn't really backed up anywhere since Sheets files in Google Drive are basically just links. I started a Takeout, but it looks like I wasn't the only one - it took well over a day to complete.


Be sure to verify that it worked. Some settings of Takeout don’t download docs/sheets/slides files. I don’t remember what the default is, unfortunately.


Same from LinkedIn


As quite a few Googlers appear to read and write on HN, I'd really welcome some insider info on what has been going on the last few days.

Sure, there will be some internal turmoil going on right now, but isn't there some non-confidential info to share? I can't imagine this would hurt Google's image in either the short or the long run; quite the opposite.


I don’t work at Google, I’m at a different big tech that’s in the news frequently. Sharing inside info on an ongoing incident is a great way to get fired. Big tech companies are way different than startups where everyone can do a bit of anything. There are people whose job it is to handle that communication. You make their job a lot harder if you disclose information. The company is so big that as an engineer you may not know all the factors involved in what would hurt the company long term - undisclosed relevant litigation, compliance commitments, partner obligations, etc.

How much do you hate it as an engineer when sales people make tech promises to customers without asking you? For comms people, engineers leaking info publicly feels the same way.


I am very pleased to see this response, genuinely. Our technical curiosity aside, there are literally people and teams in such big firms dedicated to this.


What you're saying makes sense but I don't think it really applies to anything the OP said. The "non-confidential" qualifier indicates to me that they only want people to share what they can responsibly.


And the parent post’s point is that there are people whose job it is to specifically share that information, and so we should let them do their job. They are the domain expert in this particular task.


For any incident like this there are tons of details that are both

1) Harmless to share
2) Will never be shared by PR teams

I don't see anything wrong with asking people to share what they can.


There's nothing wrong with asking. I'm just explaining that, for a Google employee, sharing such details is poor form.


[flagged]


> These companies wouldn’t hesitate to kick you out on the street if they had to

> Sharing inside info on an ongoing incident is a great way to get fired

You're not disagreeing.


He literally just said they wouldn't hesitate to kick you out on the street if they had to


In lieu of an actual Googler, how about some educated speculation? It blows my mind that Google can even have problems like this. Aren't their apps highly distributed across tons of CDNs? Don't they have world class Devops people that roll out changes in a piecemeal fashion to check for bugs? How exactly can they have an issue that can affect a huge swath of their customers across countries? Insight appreciated.


Googler but nowhere near Gmail, so just educated speculation:

* We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation. (e.g. what if you're in a situation where rolling back could make the problem worse? we might be Google, but we don't have magic wands)

* Debugging new failure modes is a coin flip: maybe your existing tools are sufficient to understand what's happening, but if they're not, getting that visibility can in itself be difficult. And just like everyone else, this can become a trial and error process: we find a plausible root cause, design and execute a mitigation based on that understanding, and then get more information that makes very clear that our hypothesis was incomplete (in the worst case, blatantly wrong).


> We have a lot of automation/tools to prevent incidents when mitigation is straightforward (e.g. roll back a bad flag, quarantine unusual traffic patterns), which means that when something does go wrong it's often a new failure mode that needs custom, specialized mitigation.

As Douglas Adams says, "The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."


Rollback proof bugs are rare, but boy howdy are they exciting. I think I've only seen one so far (unless you count bad data / bad state that persists after a bad change is rolled back... which can also be pretty exciting)


Is "exciting" a synonym for "harrowing" where you're from? :P


The Chrome Web Store has no rollback strategy; there is only roll-forward :(


You can build rollbacks out of rollforwards, although it certainly isn't particularly fun. You patch version N so that its version code is higher than N+1's, and roll out that "N+2" which is really N.


> what if you're in a situation where rolling back could make the problem worse?

Here come the poison pills!


You don't really have to speculate; they disclosed yesterday that yesterday's issue had to do with the automated quota system deciding the auth system had zero quota:

https://status.cloud.google.com/incident/zall/20013#20013003


Thanks for providing this. It's funny to read the speculations when you have read the actual root cause :D

Well, I guess the thing left unanswered for now is why the quota management system reduced the capacity of Google's IMS in the first place.

Maybe we will know someday :)


Maybe they have world class DevOps, but they also have way more things that can go wrong than the vast majority of businesses. It's kind of remarkable that the entire world can be pinging Google services and they have ~99.9% uptime.


> It blows my mind that Google can even have problems like this.

When you operate at Google's scale then everything that can go wrong, will go wrong. Google does an amazing job providing high-availability services to billions of users, but doing so is a constant learning process; they are constantly blazing new trails for which there are no established best practices, and so there will always be unforeseen issues.


Ex-Googler here.

Yes, apps are highly distributed. Yes, roll-outs are staggered and controlled.

But some things are necessarily global. Things like your Google account are global (what went down the other day). Of course you can (and Google does) design such a system such that it's distributed and tolerant of any given piece failing. But it's still one system. And so, if something goes wrong in a new and exciting way... It might just happen to hit the service globally.

When things go down, it's because something weird happened. You don't hear about all the times the regular process prevented downtime... because things don't go down.


I speculate that for many companies, work from home has been, at most, less impactful than they thought.

However, I'd speculate that in this instance, when you get that .0001% problem, having fewer hands on deck makes the work-from-home aspects harder. Akin to fixing somebody's PC remotely rather than standing behind them.

With that premise, I'd speculate that in this instance remote work, whilst not the root cause, may have been a small ripple that led to that root cause and/or led to a slower resolution than they would normally get.

Those speculations aside, it will only highlight that some tooling, design and set-ups need to adjust for remote workers. Water-cooler talk is not just for gossip, and one counter would be more regular online group socialising at a work level, so that not only the companies but also the workers can fully adapt to and embrace the work medium, and so the kinks and areas that need polishing can be polished and made better for all.

Lastly, I'd speculate that I'm totally wrong, and yet what I said may well match the anecdotes of some out there and resonate with others.


You might be right for the smaller company where physical access to the machines in the data center is necessary at a certain point in the troubleshooting process. I work at such a place myself. I would guess, however, that Google moved beyond that quite some time ago. It's simply not practical, with or without having offices with people in them.


All the access to the services is remote, but I'd say having the entire team in the same room does help coordinate incident response.


Agreed. And I'd hope that their plan B of "get the whole team on Hangouts" isn't met with connection / auth issues. Kinda feel bad for the googlers. Hope they get this right.


When I was there they had an IRC network for this reason. I hope they still do. Not quite the same as VoIP but fewer dependencies...


That's why the network folks at Google and AWS use IRC for just that purpose. Simple, no external dependencies, just works.


Software isn't as simple as splitting across different locations to prevent global failures.


I thought SMTP was specifically designed for this (with support for multiple MX entries, queuing on the sender MTA side, etc.), and there's a natural hard boundary at the user-mailbox level you can use to partition your system.

It should not be a problem that Gmail is "down". Unless this kept happening for more than a few days, no one would lose e-mail. The problem is that it's returning a permanent error code instead of a temporary one.


It is pretty clear that accepting a TCP connection and reading the bytes of the email from the sender is not the problem. Google is bouncing messages with an error like "that user doesn't exist". This would lead one to believe that some instances are having trouble looking up users, and that doesn't scale super easily.

If the product guarantees that it will reject invalid email addresses (which is nice of them, not required by any spec), there has to be a globally consistent record of which email addresses are valid, and the accepting server has to look it up and react while the sender is still connected to the mail server. You can't queue those and send the bounce later (there is no reliable "from" field in email; the only way to correctly bounce is while the sender is still connected). This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

They made it hard on themselves by providing messages like "that user doesn't exist", but... it is nice when you email someone and you get the message "they got fired, sorry" instead of silence. So they made their system more complicated than it needed to be, for a better user experience, and now they are fighting a breakage.


I doubt that the delivery stack would 550 for mere trouble looking up an account. This smells more like the identity system was incorrectly returning authoritative denials.


Yeah, that sounds right to me. I would expect to see a temporary rejection with DEADLINE_EXCEEDED or something like that.

I think a lot of time and effort is spent categorizing errors from external systems into transient or permanent, and it's always kind of a one-off thing because some of them depend on the specifics of the calling application. It definitely takes some iteration to get it perfect, and it's very possible to make mistakes.


If it really doesn't want to accept emails for addresses that it doesn't know are valid, a well-behaving email server should send temporary failure codes when it can't look up whether an address is valid, and let the sender retry later, once the address lookup is working again and it can give a definite acceptance or rejection of the email. This is not even remotely a new problem; it comes up in email systems all the time, because even at much smaller than Google scale they tend to be distributed systems. Someone screwed up.
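To make the distinction concrete, here is a minimal, hypothetical sketch (plain Python, not any particular MTA's API; the directory object and LookupUnavailable error are illustrative names) of how an inbound server might map a recipient lookup onto SMTP reply codes, returning 550 only when the directory answers authoritatively and 450 when the lookup itself fails:

    # Hypothetical sketch: choose an SMTP reply for RCPT TO based on a user lookup.
    class LookupUnavailable(Exception):
        # Raised when the user directory cannot be reached or times out.
        pass

    def rcpt_reply(directory, address: str) -> str:
        try:
            exists = directory.lookup(address)   # may raise LookupUnavailable
        except LookupUnavailable:
            # Temporary failure: the sender keeps the message queued and retries.
            return "450 4.2.1 Mailbox temporarily unavailable, try again later"
        if exists:
            return "250 2.1.5 OK"
        # Permanent failure only when the directory answered authoritatively.
        return "550 5.1.1 The email account that you tried to reach does not exist"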


> This basically means that you have on the order of milliseconds to accept or reject the email, so merely starting up a another replica of your SMTP daemon isn't going to mitigate this issue. The chokepoint is querying the list of users to see if you should bounce or accept the email.

You don't have milliseconds. You can take quite some time to handle the client; tens of seconds for sure. For example, the default timeout for the Postfix SMTP client when waiting for HELO is 5 minutes.


If there is something I've learned from AWS outages (they tend to publish detailed post-mortems), it's that no matter how distributed your architecture is, you will always have a single point of failure (SPOF), and sometimes you discover SPOFs you didn't think of.

Sometimes it's a script responsible for deployment that propagates an issue to the whole system. Sometimes it's the routing that goes wrong (for example when AWS routed all production traffic to the test cluster instead of the production cluster).


[flagged]


Your contribution has greatly enhanced this conversation, thank you.


Because, maybe, like in every big company, the thing actually doing the work is some old Oracle database with some huge monolith around it...


Out of all the companies Google might be relying on in their back-end, I think Oracle is probably pretty far down the list.


I can’t imagine what part of Google’s history would lead someone to believe there was any third party system in their production stack anywhere.


Now their corporate/finance stack on the other hand... shudder.


Well, Google did use a bunch of off-the-shelf technologies in the early days, but now it is obvious that there is no vendor on earth that could supply the infrastructure to run Gmail.


Didn't they use GNU/Linux from day one on?


Closed-source like Oracle, I meant. They've been big boosters of all kinds of open-source stuff like Linux, LLVM, MySQL, ...


Hush, you'll scare the shiny-eyed FAANG wannabes away; they aren't supposed to know this until they've been employed for at least two decades.


I would advise anyone to not share any information that his company hasn't explicitly agreed to share.


Your username is rat9988. Been burned in the past?


Management at google are poking in to check up on their staff, to make sure nothing leaks.


[flagged]


There used to be times when people didn't care for technicalities like this because the focus was on the person's contribution to the discussion.

Now that everyone's replaceable, the popular culture desperately tries to shift focus into arguing about pronouns and terms.

Watch out, this is a road to nowhere. Forcing others to use the right pronoun won't build up your retirement fund, but will distract you from worrying about not having one. And the fact that you care about it more than about your opponent's T-shirt color could be an indication that you are being manipulated to not think about the long-term things.


This is a surprisingly profound and insightful comment so deep in the subthread of, more or less, a shitpost.

Thank you, sir, for elevating our collective level of discourse.


> you are being manipulated to not think

This is where it crosses from insightful into conspiracy theory territory for me. People seem perfectly capable of groupthink-deluding themselves. Why cheapen your argument by postulating some master manipulator when it's not necessary for the deeper point you're making?

It will only lead to people focusing the discussion on challenging this particular aspect, or disregarding all you've said, instead of engaging with the actual meat of the argument.


'Their' works fine and has been gender-neutral English for ages.


[flagged]


Okay, so use "their." It is gender neutral, so should work for everyone.


It's also wrong, because it's not singular. Makes for difficult reading.


From https://www.pemberley.com/janeinfo/austheir.html:

'Singular "their" etc., was an accepted part of the English language before the 18th-century grammarians started making arbitrary judgements as to what is "good English" and "bad English", based on a kind of pseudo-"logic" deduced from the Latin language, that has nothing whatever to do with English... And even after the old-line grammarians put it under their ban, this anathematized singular "their" construction never stopped being used by English-speakers, both orally and by serious literary writers.'


It's not "wrong." Language is fluid and singular they is widely accepted. A previous poster linked to an article showing centuries of such usage.


> "so what does it matter anymore"

The same reason it ever mattered how you refer to people, politeness and respect. If someone you consider "him" asks you to refer to them as "her" it's like someone asking you to call them by their full name "Rebecca" instead of "Becky" or "Jonathan" instead of "Jon". If you like and respect them, you do as they request because things which matter to them matter to you, and being polite to them is important to you. If you ignore what they ask, call them what you want, you communicate that you don't respect them and don't want to be polite, that you want to dominate and 'win' instead.

> "Pronouns can mean whatever you want them to mean"

Only one way. A specific person asking you to use a specific pronoun for themselves is wildly different from you unilaterally and universally saying that all women should feel included by the word "him" because "him" has no meaning anymore.


"Ages" is subjective; it came back into popularity only recently.


That varies based on location and regional dialect. Here in the northeast US, I remember using singular they/their since the 80's. It would be interesting to know when this become popular elsewhere.


80s in Australia too, been hearing/using it my whole life.

Though with respect to 'ages' apparently it's been around since at least the 14th century but certain purists tried to stamp it out at various times (just like the singular 'you' which no one currently has grammatical issues with I hope).

https://public.oed.com/blog/a-brief-history-of-singular-they...


I remember some people tried to get BLM into German discussions, which made absolutely zero sense, as we have a completely different history and culture. Now I see this popping up. I really hope Europe can get some cultural distance between itself and the USA in the near future. The time is ripe.


> s/his/her/

s/his/their


I believe you mean:

s/s\/his\/her/s\/his\/their/


s#s/his/her#s/his/their# also works and avoids awkward escaping. The first symbol after s is used as the separator. Works in vim, at least.

In other words:

s%s/s\\/his\\/her/s\\/his\\/their/%s#s/his/her#s/his/their#%


Did you just assume my regex engine is pcre gendered???


Wow, I've never heard this joke before. Original and well-applied to the situation.


    awkward escaping
Or as I've seen it called, "leaning toothpick syndrome".


The question is what exactly the new "feature" that got pushed skipping canary was.


NSA backdoor? <smirk>


Since so little time has passed since the last issue, I am wondering if it could be the same cause. Maybe they didn’t fix it properly the first time.


Or simply trying to roll something out again, same that failed before.


it's got a similar flavor - that was identity management going down, this is "that email account doesn't exist".


I wonder if Gmail is just not a very well maintained codebase. Here's an issue where old emails just become inaccessible. Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.

https://support.google.com/mail/thread/6187016

Maybe time to switch to a more reliable provider.


> Not fixed for over a year and they've locked the thread so I'm starting to wonder if they actually deleted the emails by mistake.

Did you try pulling them down using the API tester?: https://developers.google.com/gmail/api/reference/rest/v1/us...

Some of the internal formatting that Gmail uses has changed over the years, so more likely than not the API that parses the stored message for display in the Gmail UI is just throwing some kind of error.
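If the API route helps, a rough sketch along these lines (assuming the google-api-python-client package and an already-authorized `creds` object with the gmail.readonly scope; the query string is just an illustration) pulls old messages in raw form, bypassing the Gmail UI's rendering path entirely:

    # Sketch: fetch old messages via the Gmail API instead of the web UI.
    from googleapiclient.discovery import build

    service = build("gmail", "v1", credentials=creds)  # creds: pre-authorized OAuth credentials

    # Hypothetical query: anything received before 2012.
    resp = service.users().messages().list(userId="me", q="before:2012/01/01").execute()

    for ref in resp.get("messages", []):
        msg = service.users().messages().get(userId="me", id=ref["id"], format="raw").execute()
        print(ref["id"], len(msg["raw"]), "base64url chars of raw RFC 822 data")

If the raw messages come back fine there, the data is still stored and the problem really is in the display path.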


I didn't but I did try Takeout and they weren't in it.

Either way my point is that this is a pretty serious bug and they haven't even acknowledged it! Not a good look.


I've never had issues over IMAP with old (decade-old) messages in Gmail.


Right but the version of an email message you download via IMAP is different than the version of an email message you see in the Gmail UI. That's my point, that the error is probably in the way Google is processing messages for Gmail, so you wouldn't see it in IMAP or via the API.


Yes, I’ve been hearing about this issue from non-technical friends too. An explanation of “X crashed” helps even if they don’t actually understand what X is. The fact that someone figured it out and knows is reassuring.


Uneducated speculation: some sort of security incident. Whenever there is a major security issue in the wild, one of the big providers tends to have a problem within a few days.


People will suggest running your own mail server, and if you have the time and energy then definitely do that.

But the next best thing you can do is simply just use your own domain. That way, you can at least decide to migrate your email elsewhere. Don't use the free domains you get from things like gmail or other providers, because then you have to _change_ your email address, and not just your MX records.


10 GB of space, one TLD domain of your choice, and 99.9% uptime for 1.85€/month with a setup fee of 10€. Hetzner will take care of everything else for you, as it's managed webhosting.

https://www.hetzner.com/de/webhosting

Just because you can (theoretically) run your own infrastructure does not mean you should. Trust the professionals. You don't do your own surgeries, do you?

(not affiliated to Hetzner, just was the first offer I thought of.)


I've used Fastmail for several years now with no issues apart from a slowdown in the phone app a few months back (website still loaded fine). It's a bit more expensive at $50/year/user (about $4.17/month), but you get 25GB of combined mail and file storage, contact/calendar/note syncing, simple static web hosting that's good in a pinch, a very nice web front end, and superb customer support. Not affiliated with them just a happy customer.

It's still putting all your eggs in one basket in a sense, but being a paid service there's a sense of privacy, security, and permanence that Gmail and the other free providers don't offer. I do own my own domain as well, and I have mail accounts tied to it that I use for certain services and communications, mostly medical and local businesses, but I'm still at the mercy of my hosting provider for that domain. With that said, my provider (Tiger Technologies) has been astoundingly awesome and has never let me down in 12+ years of service.


I agree with this, and setting it up with Fastmail was so easy, I set up two more domains just for fun. Same goes for adding import from Gmail/Softbank/Apple/anywhere. It's like a 1 minute procedure to import an account, literally. Excellent product, glad I migrated off of Gmail.


> People will suggest running your own mail server, and if you have the time and energy then definitely do that.

As a learning experience, sure, but most people are not prepared for what running a 24/7 mail service requires of them.

First of all, a static, non-residential IP is likely needed. The big players will flat out refuse reception if your IP is registered as residential, so that rules out hosting it from your home despite having gigabit internet.

You also need SPF, DMARC and DKIM working, or major players will also flat out refuse reception.

On top of that, you need to implement the infrastructure to actually host a server 24/7, including patching and backups, as well as monitoring it for unauthorized access.

Despite all of the above, you may still find yourself on a spam/block list, and removing yourself from these can also turn into a large task.

Part of the irony of Gmail having outages is that Google and other "large players" have fought long and hard for a decade to make it harder to host your own mail server. It has been done in the name of fighting spam, but I doubt any of them minded that it also makes it harder to run your own.

So yeah, build your own mail server as a learning experience. Then move the domain to someone dedicated to running it.

I purchased a lifetime subscription (limited promo offer) with mxroute.com. 10GB mail storage, unlimited domains and accounts (limited by space only), as well as a Nextcloud instance for all your users. Service and uptime have been nothing but exceptional. Customer support is actually reachable. The only downside is that the spam filter (SpamAssassin IIRC) is not as highly trained as the Gmail one, so more spam comes through.


I think the barriers are overstated a bit. I have email on my own server, part of the stuff I run on a dedicated server. Granted, it costs money, but I'm using the server for more than email anyway. That takes care of the IP address, and since the server's with a data hosting company, they take care of the network infrastructure, hardware maintenance and such.

Downtime hasn't been a major issue - senders will retry sending email, usually multiple times over several days. I've been able to have downtime of 24-48 hours without losing any messages.

An SPF record is just another easy-to-create DNS entry. If you know how to manage DNS, setting up SPF is a matter of minutes. DKIM is just slightly more complicated, with an extra key-generation step. Sites like mxtoolbox.com can help you validate records.
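For a quick scripted sanity check, something like this (a sketch assuming the third-party dnspython package; example.com is a placeholder for your own domain) confirms that SPF and DMARC TXT records are actually published:

    # Sketch: check that a domain publishes SPF and DMARC TXT records.
    # Requires dnspython (pip install dnspython).
    import dns.resolver

    def txt_records(name: str) -> list[str]:
        try:
            return [r.to_text().strip('"') for r in dns.resolver.resolve(name, "TXT")]
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return []

    domain = "example.com"  # placeholder: your own domain
    spf = [r for r in txt_records(domain) if r.startswith("v=spf1")]
    dmarc = [r for r in txt_records("_dmarc." + domain) if r.startswith("v=DMARC1")]

    print("SPF:  ", spf[0] if spf else "missing")
    print("DMARC:", dmarc[0] if dmarc else "missing")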

The biggest problem I think I have with my own server is security. I do patch the machine regularly, but of course I don't have the same kind of security that Google or another big player would. On the other hand, I suspect I might have a smaller attack surface and better security than plenty of small websites.


> First of all, a static, non-residential IP is likely needed.

If you want to send mail directly, that's true. But if you send mail through a smarthost, like your ISP's SMTP server, you can easily receive mail on a dynamic, residential IP.

> implement the infrastructure to actually host a server 24/7,

email is really tolerant of downtime. You can be down for hours without losing mail. The sending servers will retry for a while.


> email is really tolerant of downtime. You can be down for hours without losing mail. The sending servers will retry for a while.

I’m aware senders will retry for days if your server doesn’t reply, but it still requires monitoring and is not just a “fire and forget” solution.

Also, if your server starts bouncing emails, chances are you’ll be missing mails. Again, needs monitoring.


I think you can be down for 4 days by the standard.


> I purchased a lifetime subscription (limited promo offer) with mxroute.com.

At the risk of stating the obvious, note that 'lifetime' refers to the lifetime of the company, not the customer. Which underscores the risk of buying lifetime subscriptions.

And as much as I like the idea of avoiding recurring costs (I have a 'lifetime' Plex pass), it seems to me that these can't be sustainable for the company in the long term.


I’m aware it’s the company’s lifetime (unless my expiration date comes up first), and I act accordingly with nightly backups of all mail.

It's really no different from Google, where a single bad comment somewhere in their vast ecosystem can end up getting your account banned.

In my case I try to stay as far away from Google as I can with my everyday services. I’m also well aware that chances are extremely high that any email I send will make its way to Googles servers.

The “easy” solution would be to self host, and I do that to some extent, but as I work with system administration I really don’t want/need another day job. I’m actively looking for relatively secure, privacy aware and affordable cloud solutions for everyday use. I wrote affordable because nothing is free.


The main problem with self-hosting is indiscriminate blacklisting by Google and Microsoft. You only need to be on the same network as some spam artist to end up shunned. The tech giants are our new overlords.


No, it is not. It really grinds my gears when people spread misinformation about 'being on the same network as a spammer', 'indiscriminate blacklists' and corporate overlords being the reason why their email is not being delivered.

Spam filters have been content-driven for a long time now. IP addresses and domain names are ephemeral, and so are 'blacklists'. With the amount of spam being sent, we would have blacklisted the entire internet by now.

If a spam filter gives false positives, it hurts the receiver just as much as the sender.

The real problem with self-hosting is that the majority of self-hosted e-mail servers are terribly configured. Getting the SMTP server running is one thing, but getting DKIM, DMARC, SPF, TLS and MTA-STS running properly is often overlooked. When was the last time you checked the validity of the TLS certificate of your SMTP server?

Get your server and domain setup properly. Sign your email with DKIM, setup an SPF and DMARC policy and perform DMARC monitoring to spot problems. Setup TLS and an MTA-STS policy service for your incoming email. Throw in SMTP-TLS-reporting for good measure. E-mail servers are not set-and-forget if you want to do it right. And this is not the fault of large corps, it's the spammers who got us in this situation.

It's really easy to blame large services or blaming your email deliverability problems on being on the same IP block as a spammer, but really it's almost always a misconfiguration on your side.

Disclaimer: I'm the founder of Mailhardener (https://www.mailhardener.com), we do e-mail hardening and solve deliverability issues.


My corporate overlord (~10k ppl) has a policy of placing all incoming email from domains less than 30 days old into the Junk folder; it's a tier 1 rule which cannot be overridden or circumvented by user rules. No amount of properly configured mail services will matter in this scenario. :-/


That's probably more of a phishing defense, but not really effective either way. 'Good' spammers will be constantly registering domains and only use the ones that are a few months old since time-based spam policies are fairly common. This type of policy only works for low-barrel spam and shady operations that register domains with stolen credit cards and end up losing their domain within a few weeks once the chargebacks get to the registrar.


Yah, I do not defend it in any way - it's security theatre to me; they also wholesale block entire TLDs (more than one) under the same umbrella, and block access (HTTP) to any domain less than 30 days old as well. It is my experience that most companies of size implement compliance-checkbox solutions and do not really care about internal user experience; you (me, we) are expendable and replaceable. Comply or face sanction/termination of employment; compliance is what matters to the business.


You can self-host the receiving side of your mail server (with spam filtering etc.), but send all your mail through a mail provider with a good reputation. Configure your own SMTP server to use that other server as a relay; you can even do your own DKIM signing before sending off your mail. At least then you shouldn't have a problem with IP reputation.
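As a rough illustration of the relay idea (a sketch only: the hostname, port and credentials are placeholders, and a production setup would normally configure this in the MTA, e.g. Postfix's relayhost, rather than in application code), submitting outbound mail through an authenticated smarthost looks like this:

    # Sketch: submit outbound mail through a reputable relay (smarthost)
    # instead of delivering directly from your own IP.
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = "me@example.com"      # placeholder: sender on your own domain
    msg["To"] = "friend@example.org"    # placeholder recipient
    msg["Subject"] = "Hello via the relay"
    msg.set_content("Delivered through the provider's IPs, not mine.")

    # Placeholder relay details: your provider's submission endpoint and credentials.
    with smtplib.SMTP("smtp.relay.example", 587) as relay:
        relay.starttls()
        relay.login("me@example.com", "app-password")
        relay.send_message(msg)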


I have been doing this. I bought a domain and am using zoho.com to send and receive emails for free. Storage is only 5GB but I can always pay if I want more.


Yep, I've found it to be extremely worthwhile having your own domain. It's more effort, but once it's set up it's way easier to avoid lock-in to a company.

I also run my own and its been easier than expected, especially with maddy: https://github.com/foxcpp/maddy


I'm trying (mailcow was easy to set up), but it seems all self-hosted solutions do not support snoozing mails. To me this is a must have feature.


In most cases, this also makes you more vulnerable to a DNS hijacking attempt.

I'd still suggest putting "secrets" in a free, globally trusted email provider with 2FA.


Good luck getting the same quality of service as gmail with your own mail server. The fact that gmail fails every so often (extremely rarely actually) is a good sign: zero failure would mean that they are over-investing in quality and losing flexibility. Gmail only needs to be as good as the best web service out there.


But you can pay for email from other providers like Fastmail and Proton and just have your own domain with MX pointing at them. Services are better in my opinion when you pay for them and you’re not the product.


> Services are better in my opinion when you pay for them and you’re not the product.

Services also tend to be better if lock-in effects on customers are low. ... and using your own domain for email does reduce lock-in significantly.

unrelated: the link in your profile does not seem to work.


Does gmail have such a big lock-in effect on users since it allows them to use the gmail interface with their own email? (i.e. you can split the service and the UI)


Yes, Gmail is complete lock-in (with current legislation). You cannot move your `@gmail.com` address to another provider, e.g. Proton.

But there is a case for legislation forcing email providers to allow moving email addresses to other providers (how it should be done technically is another question). This is already in effect for telephone numbers in many places.


I was lucky enough to get in on the ground floor of the Apps for Business (or whatever name it has now) service when it was free, and I was able to use my own domain for that.

As such, migrating these email addresses was easy enough.

My older @gmail and @googlemail ones though, not so easy. I've been moving each account I have used with these addresses one-by-one, but you never catch them all and even when you do, some services simply will not let you change your email address.

I recall being so excited when Gmail first launched and was one of the first people to get a Beta invite. I regret ever signing up for them now, given the headache it has been to get off it.


thanks! i stopped using keybase when they were acquired by zoom and didn't update my bio.


Uptimes in Germany are usually 99.9% for most hosters. I think Gmail lost to all of those yesterday.


Parent suggests to just run mail on your own domain.

The main thing here is to avoid a single point of failure, in both domains (politically induced problems) and infrastructure (technically induced problems). If people used more than just a handful of domains/providers for mail, then single failures would not have that big of an impact.

I have been hosting my mail on my own domain for the past 3 years and have not been impacted by this incident. Currently I am happy with ProtonMail, which hosts my mail. But I know that I can easily move on, if service drops, and even self-host.


> the same quality of service

and the weight of google as well!

An individual mail server getting onto a blacklist is more often than not a death sentence for the address, the domain and sometimes even the IP.

But if Google is at fault and email ends up in a permanently-rejected bucket, like in this event, it's in everyone else's interest to play nice and accommodate the fault.

I think people severely underestimate how hard it actually is to consistently deliver email in 2020, between DKIM, SPF and DomainKeys, while tiptoeing around everyone else's IP/email antispam services.


You need to at least hide your domain whois contact information if you are going to do this...


Many domain name registrars offer that as a service.


Yes, but it often is not the default and, as you know, most people don't change defaults (and there is also an extra monthly fee, which is a scam).

I know people here don't think it's a scam, but it is.


Remember, free and open source software is more reliable: (https://www.gnu.org/software/reliability.en.html)

"One reason is that free software gets the whole community involved in working together to fix problems. Users not only report bugs, they even fix bugs and send in fixes. Users work together, conversing by email, to get to the bottom of a problem and make the software work trouble-free."

And Service as a Software Substitute (SaaSS) takes away your freedom: (https://www.gnu.org/philosophy/who-does-that-server-really-s...)

"The basic point is, you can have control over a program someone else wrote (if it's free), but you can never have control over a service someone else runs, so never use a service where in principle a program would do.

With free software, we, the users, take back control of our computing. Proprietary software still exists, but we can exclude it from our lives and many of us have done so. However, we are now offered another tempting way to cede control over our computing: Service as a Software Substitute (SaaSS). For our freedom's sake, we have to reject that too.

With SaaSS, the server operator can change the software in use on the server. He ought to be able to do this, since it's his computer; but the result is the same as using a proprietary application program with a universal back door: someone has the power to silently impose changes in how the user's computing gets done.

Thus, SaaSS is equivalent to running proprietary software with spyware and a universal back door. It gives the server operator unjust power over the user, and that power is something we must resist."


I ran my own mail server for 15 years. It was not more reliable than Google or Microsoft (my current provider).

> never use a service where in principle a program would do

Email is definitely a service.


Ah the free software religion, or as I have taken to call it, GNU minus rationale.

-----------------------------

> One reason is that free software gets the whole community involved in working together to fix problems. Users not only report bugs

This works only for popular open source software, and still doesn't apply to infamous Unix mail servers or the likes of GNOME. The 'community' is often more interested in adding features than fixing bugs.

> so never use a service where in principle a program would do.

Comfort vs. freedom tradeoff. Sometimes data privacy/freedom just isn't critical enough to justify the costs and difficulties of self-hosting.

> Service as a Software Substitute (SaaSS). For our freedom's sake, we have to reject that too.

For what it is worth, SaaS has only benefited freedom of people working in big corporates. It has loosened the grip of enterprise software directly sold to C-suites by wine-and-dine sales methods.

And you know what, software writers have to make a good amount of money too. Just giving away free desktop/server software doesn't work out for most developers. I will happily accept it if open-core software has a value-added SaaS. And "live cheap and write freedom-respecting software for some semi-arbitrary definition of freedom" is just disrespectful to talented software developers.

> someone has the power to silently impose changes in how the user's computing gets done.

Again, a convenience and security etc. thing. In most use cases, you don't care.

Stallman sees everything as black or white. (Linus Torvalds has also written on this)

I'd suggest Stallman once watch 2017 Tamil movie "Vikram Vedha" :-)


> However, we are now offered another tempting way to cede control over our computing: Service as a Software Substitute (SaaSS).

Don't agree with this bit. You are ceding control of your data for sure, but this isn't quite the same as running a proprietary piece of software on your machine when you have no idea what it's doing.

Ceding control of data is also worrisome I agree, but giving control of your data to a custodian you trust in exchange for said custodian promising its careful curation is a trade-off that most people would find acceptable. Some won't, and I respect that, but the advocacy above might be counterproductive to most people who take it and try hosting their own email servers.


Funny how you always see this comment in this type of thread


I generally agree with the GNU stance on most things, but here this is just irrelevant and out of topic.

I run my own mail server, and you can screw up badly with free software. And you probably will, more than Google or the big players, especially if it is not your job. Free software does not make your sysadmin screw-ups or hardware failures something you can "get the whole community involved in working together to fix".

(Fortunately, I haven't screwed up so badly that my server started responding "no such user". Just some regular downtime that, as far as I know, has not made me miss any mail.)

Using someone else's computer and services can be problematic for a lot of reasons, but uptime / reliability is generally not the issue.


Just mistakenly disputed a transaction because of this. I have been working with a mediocre freelancer who didn’t reply to my list of remediation items.

When I followed up today, I got the “this email account does not exist” error from Gmail and proceeded to dispute on PayPal.

I only found out through hacker news that this was a Gmail bug.

Google, I expect a follow up retraction of the incorrect error messages. It’s one thing to give a temporary error. It’s another to say “this email account does not exist”.


Email sender here (on behalf of about 2.5 million domains). We noticed the issue and mitigated it by transforming the 5xx error into a 4xx error so that messages to Google are re-queued instead of being permfailed. But even with this intervention, the ticket volume was insane...
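For anyone curious what that mitigation can look like, here is a hypothetical sketch (plain Python, not any particular MTA or ESP's API) of downgrading Gmail's suspicious permanent failures to temporary ones while a known outage is in effect:

    # Hypothetical sketch: during a known Gmail outage, treat 5xx replies for
    # Gmail recipients as temporary so the queue retries instead of permfailing.
    GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}

    def classify_delivery(recipient: str, smtp_code: int, outage_active: bool) -> str:
        domain = recipient.rsplit("@", 1)[-1].lower()
        if 200 <= smtp_code < 300:
            return "delivered"
        if smtp_code >= 500 and domain in GMAIL_DOMAINS and outage_active:
            return "retry_later"   # downgrade the permanent failure
        if smtp_code >= 500:
            return "bounce"
        return "retry_later"       # ordinary 4xx: normal retry behaviour

    # During the outage, a 550 to a Gmail address is requeued rather than bounced.
    print(classify_delivery("someone@gmail.com", 550, outage_active=True))    # retry_later
    print(classify_delivery("someone@example.net", 550, outage_active=True))  # bounce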


Thanks for being proactive on behalf of your users.


Oh my.. Few will have even thought of this in time and been this proactive :/


Side note: Love the 2010-era web design. Wtf happened? Everything got bloated, buttons got huge and whitespace took over the screen real-estate.


Touch on smaller screens became the primary interface. Fingers fat, buttons big.


Yeah, I get that. But why punish desktop users with a mobile interface? We used to punish mobile users with a desktop interface back then; we just reversed the problem, didn't actually solve it.

They should be 2 separate things: 2 separate CSS files selected with @media queries, 2 separate button sizes and styles.

I know what happened - designers got lazy.


Doing all the work twice is expensive. It's not just the actual work of doing the design, but then you're doubling testing costs too (unless you YOLO one of the modalities). It's more cost-effective, and yields a better-quality result, to choose to make the platforms more uniform rather than trying to optimize for both. This obviously only holds when there's a massive disparity in revenue. If your PC/laptop users make up a non-trivial part of revenue it can start to make sense (but again, generally only if doing so will help you retain that market or get better revenues). The other thing unifying things does is it helps engineering and design teams by removing complexity.

All of the above is general trends of how these decisions are made. There will always be counter examples or situations those aren’t good ideas (or that someone has made a mistake applying a lesson to the wrong situation).

Saying an entire group of people is lazy or dumb is not a particularly insightful way of looking at anything that helps your understanding of the situation or learning what kind of results different incentives yield.


> Doing all the work twice is expensive.

Sure, but they already do all the work twice.

I get a completely different site on my phone than on a desktop/laptop.

In fact, they maintain far more than two designs - in addition to two native apps, there's a mobile website, the desktop website, and the basic HTML version. On top of that, they have multiple display density options for desktop (which admittedly is mostly just adjusting padding), redesigned the desktop site a few months ago, and had Inbox for a few years. Additionally, you can (some of these without a reload) change whether there are separate inboxes for various labels, add/remove a reading pane, and split threads into individual emails.

I don't think Google is lacking in potential to maintain a website.


They don't. The browser just adapts.



Is the plain HTML version under even the most basic maintenance? I was under the impression that it was the old interface and they just kept it around because people liked it / for countries with slower connections.

The mobile vs desktop versions you posted are likely the same codebase with minimal (if any) differences. My understanding is that generally such things are accomplished transparently with flex layouts that automatically adjust to screen size.


> Is the plain HTML version under even the most basic maintenance?

I doubt much work is being done on it, but presumably they at least make sure it works; I mentioned it because a few posts up (edit: you) mention testing (rather than initial design) as the reason why having multiple designs is so difficult.

> The mobile vs desktop versions you posted are likely the same codebase with minimal (if any) differences....

It's entirely possible that they were derived from a similar codebase at some point, but what reaches the browser is significantly different - it's barely responsive, it's served based on user-agent, and it appears to consist of significantly different obfuscated blobs of HTML, CSS, and JS.


I can’t speak to it but just because something is tested occasionally doesn’t mean the testing budgets are the same or serve the same purpose.

For example, the feature set required to support the HTML page could be frozen and the APIs backing them stable with no need to change. So testing isn’t really necessary. Alternatively, there’s just API changes being made to remove dependencies on deprecated code and so the testing coverage comes from the testing that happens of that API surface through other means. Finally, it could be that the HTML page is even fully staffed to support emerging markets. That’s a different budget potentially than the budget for the “rich” UI.

Again, my point isn’t to argue over the specific business pressures and practices Google has for their email UI. This requires a level of knowledge I don’t think either of us possess. All I’m trying to do is illustrate that there could be all kinds of pressures why the system is the way it is, but dismissing it as “laziness” or “stupidness” on the part of the designer is itself a lazy and stupid conclusion to make without concrete evidence. I generally assume that’s not the case and look for the incentives/pressures those people are under until there’s overwhelming evidence those people are actually stupid/incompetent (and even then, the question becomes what structures, incentives, pressures were in place to put those people in positions they shouldn’t occupy).


> Saying an entire group of people is lazy or dumb is not a particularly insightful way of looking at anything that helps your understanding of the situation or learning what kind of results different incentives yield.

Maybe, but when you're talking about the armies of designers at the multi-billion-dollar tech company Google not going through the effort to maintain two stylesheets, I think "laziness" is an accurate description.


No, designers did not get lazy. Devices got weirder.

It used to be that you could check the width and height of the viewport and say something like “320px wide? Must be a touch interface, deploy the big buttons”. Then tablets got big and it was like “1024px? Could be a laptop, but it’s probably an iPad, which has a touch interface, deploy the big buttons”. Then laptops got touch screens, then the Surface Studio came in and was like “HAHAHAHA”.

Now the game is “1920x1080? Could be a big tablet with touch, or a 1080p monitor without touch, or a non-maximized window on a Surface Studio with touch, or maybe it’s a monitor without touch hooked up to a laptop with a touchscreen and our window could get moved between them at any time...”

Nowadays, there’s no single reliable way to tell if a page is going to have to support touch until it gets a touch event, by which point you’ve already rendered the UI and it’s too late.


A simple solution would be to ask the user what they want. I genuinely don't understand why this is not common, instead of trying to guess.


You don't have to ask the user; there is a media rule for querying whether the device is currently using coarse or fine pointer input [0] (though, of course, it relies on the OS not lying, which is not a given).

[0] https://developer.mozilla.org/en-US/docs/Web/CSS/@media/poin...


Some websites probably do have settings around this, but that gets to a point someone else mentioned: you would have to basically design, build, test and maintain two UIs. Except now with the kicker that one of those layouts is only used by the 5% of your userbase that both knows that the option is available and chooses to take you up on it.


Yeah, I like rich, data-heavy, information-dense desktop displays. Mobile UIs are such garbage, not to mention shit like Reddit that is only able to process a request 30% of the time.



Agreed. It drives me absolutely berserk that Reddit forces you to "Click to view more comments" just to see like 3 more comments.


That particular anti-user behavior is a strategy to get you to pay for Reddit gold, which removes that limitation.


That's the "entropy only goes up" or "all available space will be filled with complexity" rule of software orgs. Employees are paid to add new features, regardless of how useless they are. The only situation in which they stop adding features is when hardware or something else doesn't allow it; otherwise we'd see 1GB webapps. But when they reach this boundary, they start redoing existing stuff, because otherwise they'd get fired for inactivity.

This kind of BS is difficult to stop even when you're paying their salaries and monitoring the results. For example, a marketing person will keep adding useless bloat to justify his salary (he really needs his paycheck!), or a programmer will keep refactoring some BS to satisfy his purism (funded by your money, of course). Just think about it: if a competent programmer approached you and explained that the product is mostly finished and the remaining micro-improvements won't add any value to your business, would you continue paying him for doing nothing?


Designers need to justify their continued employment.


I can't see the buttons (mobile)


That's the point: this page is readable and information-dense. There is no bloat.


> this page is readable

Not easily so on mobile - the need to zoom multiple times and scroll in 2 directions to read simple information is a PITA.


Because it's not for mobile. Engineers who look at this tend to work from a computer.


Says who? Information about emergency issues is often spread on mobile phones.


> Wtf happened

Money


Go on...


I'd rather not; I think I completely misread the post. Classic post-before-coffee error.


If this had happened on a self-hosted email server, people would be claiming "this is why you don't self-host email". My company self-hosts email and we have never had an outage of this sort. Ever. All support emails to Gmail accounts are now bouncing for us. It's also been more than 2 hours since this problem started.


The analogous claim here would be: “This is why you don’t entrust critical services to third parties.”


The name for this is "victim blaming." Someone suffers from a catastrophe (or from abuse, but that's not what happened here), and you imagine a way that it could've been avoided, never mind that things could've gone just as badly wrong the other way around.


If everyone who uses Gmail hosted their own email servers, those people would suffer far more frequent and far worse outages than Google has had.


but not all at once


But far more consistently. Most users have no idea how the Internet works, let alone hosting something as complex as email. They'd just stop using email and switch entirely to Facebook.


Also, never ever a 550 5.1.1 error.


On the same server... I was missing an email, so I went to look in the other mailbox that should have forwarded it. Got a very suspicious-looking email...

Good thing Gmail wasn't completely stupid and didn't try to forward that one, getting into an infinite loop...


AWS SES just blocked our account due to a high number of bounces :(


Oh great. One more reason I’m not going to be able to sleep tonight.

Other people’s automation scares me. I’m sure mine scares other people as well.


Fortunately I had email bounce notifications set up, and to a non-Gmail address, so I had time to stop the email queue.


The funny thing is that part of this situation started exactly 24h ago, as I too saw increased bounces from AWS SES towards Gmail addresses. Not for everyone though, to be fair.

If anyone from AWS SES is reading - please do not deliver bounce receipts for Gmail for the time being - it makes everyone's situation much worse.
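On the notification-handling side, a hedged sketch like this (assuming the standard SES bounce notification JSON delivered via SNS; `suppress` is a stand-in for whatever suppression-list store you actually use) shows how one might temporarily skip suppressing Gmail hard bounces so subscribers aren't wrongly removed:

    # Sketch of an SNS-triggered handler for SES bounce notifications.
    # `suppress(address)` is a placeholder for your own suppression-list logic.
    import json

    OUTAGE_DOMAINS = {"gmail.com", "googlemail.com"}  # skip suppression for these, for now

    def handle_sns_event(event, suppress):
        for record in event["Records"]:
            notification = json.loads(record["Sns"]["Message"])
            if notification.get("notificationType") != "Bounce":
                continue
            bounce = notification["bounce"]
            for recipient in bounce["bouncedRecipients"]:
                address = recipient["emailAddress"]
                domain = address.rsplit("@", 1)[-1].lower()
                if bounce["bounceType"] == "Permanent" and domain in OUTAGE_DOMAINS:
                    print("Ignoring outage-era hard bounce for", address)
                    continue
                suppress(address)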


Any email sent to my gmail is getting permanently bounced. Says my email does not exist at gmail.


Now think about if some gmail accounts got Thanos snapped out of existence. What kind of digital death would that be as you would NEVER be able to recover your accounts, at least for most people who don't have 2FA.


> Now think about if some gmail accounts got Thanos snapped out of existence. What kind of digital death would that be as you would NEVER be able to recover your accounts, at least for most people who don't have 2FA.

An interesting problem - I have a gmail address, and also one on my personal domain (which uses g-suite for email). The personal domain's backup is a gmail address, and the gmail's backup is my personal domain. When I set that up ages ago, I genuinely never thought about what happens if gmail itself implodes.

I guess I'm setting up a... icloud??? backup email for a bunch of stuff shortly.


Yep, another one here too. Weird, really weird, when both emails have existed with this forwarding scheme for over 2 years.


Seeing the same. Even email from Google f̶o̶r̶ ̶̶w̶o̶r̶k̶g̶r̶o̶u̶p̶s̶ workspace to gmail is having issues.


It's inconsistent for me; I've had two failures and one success in quick succession.


This just happened to me as well. Not great D-:


seeing the same thing

Edit: oops thought it was a text post


You've linked to the same URL as this HN post


Side note - why don't Google status updates contain a timezone? You'd think that they'd have some awareness of, you know... global business... many users in many timezones.

> We will provide an update by 12/15/20, 10:30 PM


At the bottom: "All times are shown in your local timezone unless otherwise noted."


I bet it adjusts for time zone automatically, as I see:

> We will provide an update by 12/15/20, 5:30 PM


I hate that even more as I can never know if they adjusted it correctly or not, even more so if I've traveled recently or used a VPN.


This is Google. They know where you've been. They know where you're going before you do.

I agree that a uniform time zone would be clearer.


Anyone else having issues in general with "the internet" over the last few days?

The images on different websites I've visited don't load (e.g. Twitter, BBC, etc). When they do load, they load verrry slowly.


Yes, the Internet has felt, for lack of a better word, “weird” for the past couple of days. Nothing specific, but similar to your experience, with random things not loading that are fine after a refresh, etc. Or strangely slow responses from websites.


That's generally DNS tampering by backbone providers (Centurylink and Verizon in the USA).

Use dnscrypt.


Oh, the internet is fine and happy.

The "internet" deserves these outages to make people - and CEOs, CIOs, etc - realize that in-house ~~engineering~~ sysadmins used to exist for good reasons.


I reeeeally don't want to go down this rathole but a few days ago when my kids were having problems connecting to YouTube on both wifi and LTE I had begun to think that Trump pulled the 1. Stop Internet, 2. Attempt Coup thing.


While, unlike many other people, I do _not_ believe Trump is an idiot, or even dumb at all, I do not believe he has any knowledge of how to stop the internet even if he wanted to.

Plus, how would he send tweets if no internet? I think he likes being able to use Twitter more than he likes being President.


My cursory understanding of the whole "internet going dark" conspiracy is that this would not be used by Trump, but be used by the shadow government in co-operation with "Big Tech" in order to create panic and confusion amongst the masses so they can't organise, feel helpless and ultimately ask for "big government" and "big tech" to step in and save them.

For whatever reason, gp seems to be under the impression that it's Trump who would use this to his own advantage, which is not consistent with the prevailing conspiracy theories.


Unless it's a coincidence, I don't think this is just Gmail, as I am seeing slow or failed network connections to services in Google Cloud.


I'm glad to have found this thread, but it's kept me up cleaning out thousands of "bounced" blocks from the last 2 days' suppression lists. They are both @gmail.com and G Suite/Workspace domains, and incidentally a ton of random blocks from Comcast as well, from users with months of open history.

This blew away about 10% of our newsletter and marketing subscribers. I can't imagine the time you're having if you send to millions of Google accounts.
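
For anyone doing the same cleanup: assuming account-level suppression is enabled and the addresses were added with reason BOUNCE, the SES v2 API lets you list and delete suppressed destinations directly. A rough sketch with the AWS SDK for JavaScript v3; review the list before bulk-deleting, and add any Workspace domains of your own to the match:

  import { SESv2Client, ListSuppressedDestinationsCommand,
           DeleteSuppressedDestinationCommand } from "@aws-sdk/client-sesv2";

  const ses = new SESv2Client({});

  // Page through addresses suppressed for bounces during the outage window
  // (start/end are Date objects) and un-suppress the gmail.com ones.
  async function unsuppressGmail(start, end) {
    let NextToken;
    do {
      const page = await ses.send(new ListSuppressedDestinationsCommand({
        Reasons: ["BOUNCE"], StartDate: start, EndDate: end, NextToken,
      }));
      for (const s of page.SuppressedDestinationSummaries || []) {
        if (s.EmailAddress.toLowerCase().endsWith("@gmail.com")) {
          await ses.send(new DeleteSuppressedDestinationCommand({
            EmailAddress: s.EmailAddress,
          }));
        }
      }
      NextToken = page.NextToken;
    } while (NextToken);
  }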

Pardon me for being conspiratorial, but I have to say that the timing of these particular issues is of concern in light of the massive attacks the US government and others have faced this week. Adversaries who got a new toy that could mess with the user/resource permission mapping at Google would have plenty of fun using it to go after inboxes and intercept email confirmations. Even though this is 99% likely to be a DevOps chore that created an SRE nightmare, I'll allow myself to believe there's a 1% chance this was SecOps locking down huge swaths of accounts to mitigate a mass email-verification attack.


Have you considered the possibility that these "attacks" occurred and were detected months before but were only announced recently? Could it be that some people have a motive to spin some stories during these politically sensitive times?


Ya that’s quite possible, and probable given how well they are documented to the senate. Dizzying times we live in. The flow of information is dammed and diverted by so many different actors with different motives.

Still, I have good enough reasons to think that some sort of Pandora’s box of backdoors was opened this fall and its fallout is yet to be felt.


We are using a gmail inbox to process some business-critical emails in bulk. I guess those emails will be lost forever?

Thankfully, we have backups, but we will have to move them to the inbox or elsewhere to have them processed.

Edit: As an update, we usually have at least 100 emails come in every hour, and I am seeing none since 4:02 pm EST


Blimey, why use a free Gmail account for a business critical operation? That's just asking for trouble.


They may be using the paid, enterprise G Suite/Google Workspace Gmail: https://workspace.google.com/products/gmail/.


Can they call someone when there is an issue with Gmail? Is there a contract that gives them some leverage and guarantees? What is the process when a business-critical email is lost and needs to be recovered?

For business-critical operations, leaving such questions unanswered can be pretty risky.


Yes, G Suite has an SLA, telephone support, etc.


Now would be an excellent time to ask people who use Gmail for business-critical operations how well that SLA and telephone support work in practice.

I know, for example, companies that use email for handling customer orders at their stores. I can just imagine the loss if the system were down for hours a few days before Christmas or, worse, actually lost the data.


They have telephone support, but the agents aren't empowered to do very much of anything.


I am not going to try it. I just restored data from a backup email server that these emails get duplicated to.


Yeah, to be fair I've never had to test it.


Google doesn't do support.


yeah, this issue affects gsuite domains as well as @gmail.com


Yup. It also seems to be affecting some of our G Suite accounts - although not all(?)

Not a good week for Google


> I guess those emails will be lost forever

Yes.


github has un-verified my account and I'm unable to merge PRs, leave comments, etc... until it's verified again.


Scary! I hope I’m not going to get any email notification until they bring it back up.


Workspace (G Suite) customers: remember to claim under the SLA, 'cos they don't award it automatically.

Under 99.9% uptime is a 3-day credit, and they're currently at 99.4%.
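
To put those numbers in minutes (a back-of-the-envelope sketch, assuming the SLA is measured over a 31-day calendar month):

  // Monthly uptime percentage -> minutes of downtime, assuming a 31-day month.
  const minutesInMonth = 31 * 24 * 60;                  // 44,640
  const downtimeMinutes = (pct) => minutesInMonth * (1 - pct / 100);
  console.log(Math.round(downtimeMinutes(99.9)));       // ~45 min is the 99.9% budget
  console.log(Math.round(downtimeMinutes(99.4)));       // ~268 min (~4.5 h) at 99.4%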


I sent in a support ticket. Is that the right way to claim SLA?


not sure, but that's what I've done too


I use Helm for email. It’s a silent little server in my living room routed via Amazon (using a TLS cert that lives in my living room). I’d say about every three or four months it goes down for 5 minutes if I need to reboot my wifi router.

I use it for privacy (am a fan) but I feel pretty smug knowing I’m getting better reliability too. At least this month :)

No affiliation with the company.

https://www.thehelm.com/


I remember seeing this when it first popped up on HN. Very cool. Home internet is pretty reliable these days and my sense is that it’s getting better, not worse, as people value their internet connectivity more.

Is your data backed up anywhere in case your Helm box burns in a fire?


Helm comes with free backup, fully encrypted so only you can access it. Still, I'd love to be able to turn it off. I have all my email on my laptop, which is itself backed up. I only keep a few weeks' worth on the Helm to let devices download.


What good is a status page that shows the time without a timezone?


There is a note regarding time zones at the bottom of the page: "All times are shown in your local timezone unless otherwise noted."


It should still say the assumed TZ, as others note. Many people use VPNs, and geo-IP lookup is not infallible anyway.

Edit: Apparently it uses browser settings, which means you need scripting enabled for this page.


Which timezone is my local timezone? A bunch of geolocation systems think I'm in Paris. I'm in Melbourne. (My ISP bought an IP block that used to be registered to a French company).

That's a considerable difference.


It's not geolocation - just a javascript function to ask the browser what the UTC offset of your local time is. See https://stackoverflow.com/questions/1091372/getting-the-clie...


Which JavaScript function? The two main ones return two differing results for my browser.

  Intl.DateTimeFormat().resolvedOptions().timeZone

  > "Australia/Melbourne"
Which is UTC +11.

  new Date().getTimezoneOffset()

  > -710
Which is roughly UTC +11.8.

If you're going to shoebox me into a particular timezone - tell me what it is, and let me change it.
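
For what it's worth, the browser exposes both the IANA zone name and a formatted abbreviation, so a status page could label the timestamp explicitly instead of leaving readers guessing. A small sketch of what that could look like:

  // Ask the browser which zone it is assuming, and render a labelled time.
  const zone = Intl.DateTimeFormat().resolvedOptions().timeZone;      // e.g. "Australia/Melbourne"
  const stamp = new Date().toLocaleString(undefined, { timeZoneName: "short" });
  // stamp looks like "16/12/2020, 9:30:00 am AEDT" - an offset plus a name the reader can verify
  console.log(`${stamp} (${zone})`);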


I'm in Spain and my geolocation is also always wrong.


This irks me even on statuspage.io pages (I think?) where times are in PST or something quite often.

Very surprising it’s not localised.


I've noticed references to my.name@google.com instead of my real address my.name@gmail.com — That doesn't seem good.


LinkedIn is telling me I could lose my LinkedIn account because they can't reach my Gmail address.


One @gmail.com account shows no email received from 2:17pm ET Dec 15 to 6:05am ET Dec 16. I normally receive ~4 emails per hour, throughout the day and night. After I realized there might be a fault, I sent myself test messages around 6pm ET on Dec 15 and Gmail bounced them with the error:

> 550-5.1.1 The email account that you tried to reach does not exist. Please try double-checking the recipient's email address for typos or unnecessary spaces. Learn more at https://support.google.com/mail/?p=NoSuchUser x62si100799otb.139 - gsmtp


It has been fun explaining to customers that Gmail has been having issues.


Google Stadia was also down yesterday for a couple of hours[1]. It appeared to be related to loading user accounts into the game. Some games couldn't load at all, others worked for a short time and others were unaffected.

[1] https://www.reddit.com/r/Stadia/comments/kdr2ps/its_not_just...


I've been using Gmail since the invite only beta. In the past year I've had several occasions where important messages get somehow "lost" in the UI. The ones I've noticed are messages I've been waiting for, so I've gone and dug them out of "all mail" in most cases. They just wouldn't show in the normal inbox despite not being flagged as spam.

This outage has now convinced me I need a new email provider.


It seems that Gmail is no longer a priority for Google, which is actually reasonable. Email provision is hardly a growing business anymore, and the market is full of small players. How many Gmail users will bother to move to other services? Even paying ones? Very few, I think. When will Gmail be irrelevant? Probably when email is irrelevant.


Is there a product you could use, like a cache layer for Gmail? An investor I know would love that product. I understand things go down, but it's actually quite a pain to lose access to your existing emails and calendar events. Could he just use a mail app like Outlook? Yes. But that's too complicated to set up, I think.


SMTP was designed to have primary / secondary MX servers, etc. Google decided they know better and chose not to follow the SMTP standards.
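
For context, that fallback lives in DNS: MX records carry a preference value, and a sending server is supposed to try the lowest-preference host first and fall back to the others if it can't connect. An illustrative zone snippet for a hypothetical domain:

  ; hypothetical example - the lower preference value is tried first
  example.com.   IN  MX  10 mx1.example.com.   ; primary
  example.com.   IN  MX  20 mx2.example.com.   ; backup, used when mx1 is unreachable

Note that a backup MX only helps with connection-level failures; it does nothing when the primary answers and returns a 550, which is what happened here.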


SMTP seems to be down as well. "Login to server smtp.gmail.com failed." using an email client

IMAP seems to be working, though.


SMTP went down for 5 minutes. Our service was able to re-connect automatically and has been working fine, if that helps you.

Could be inconsistent though.


It seems it's even worse for Gmail-to-Gmail communication. Some of our company emails disappeared without any kind of bounce information, which is causing a lot of frustration among our customers. I hope they fix it soon; it's a mess of massive proportions.


Gmail service details: https://www.google.com/appsstatus#hl=en&v=issue&sid=1&iid=a8...

All good now apparently


We are seeing a lot of email bounced back, but otherwise usable.


Still getting an error trying to fetch email from one Gmail account to another via POP3. Seems different from the error described in the linked post.


This wouldn’t have happened with Google Wave.


Started sometime earlier today - we got alerted by Salesforce deliverability team around 11am CST.


Imagine my surprise reading this after a whole day of productive freelance-related email exchanges that seemingly went without issues, and then wondering whether all the important stuff was actually received.

By the end of the thread I was wondering why I wasn't affected, until I remembered my small-business email is actually on Exchange 365 :-D


Is Microsoft's Live Mail (Outlook.com) more reliable than Gmail?


I use Outlook.com for my primary email and I've had a few more outages than with Gmail, but nothing I would consider major. The bigger issue is that Outlook.com's spam filtering is drastically worse than Gmail's: I get so much spam that would be obvious to any rudimentary spam filter, I mark it as spam in Outlook.com, and the next day I get the exact same email again.


I'm considering moving to Outlook.com.


Maybe their spam filtering is worse, but they also block way more legitimate smaller mail servers for no good reason. Every few months I have to send an appeal to them so that I can continue sending mail from my mail server. They claim that there were user complaints about spam, but that's pretty much impossible. I think they just silently blacklist smaller mail servers periodically and hope that they can fight spam that way. Obviously they can't.

So if you want to be able to receive emails from smaller mail servers - don't switch to Microsoft.


It always has been. You also get more spam, but you also receive mail that shouldn't have been marked as spam - which Gmail does mark.


COVID seems to have possessed Google - services are down, Android apps that used to take 24 hours to approve are now taking ages - and, as usual, no support is forthcoming. Is Google unravelling?


In Bangladesh we faced serious problems last night.


The fewer Google services we use, the better.


Google Assistant is also having trouble.


Will there be a public post mortem?


Now showing as fixed fwiw.


What should I use instead?


Resolved as of 3:51 PST.


Same issue here


They are getting ready to be acquired by Microsoft :D



