> We're both being paid to solve different facets of the same problem. Coming at...

minaguib · on Dec 16, 2022

It's interesting that this:

> the right thing to do is to give SRE teams the power to say "no" and refuse to take the pager for any particular service

was a tenet in the original Google SRE material. SREs help operate well-behaved services using engineering best practices. Services that fail to behave well and repeatedly bust their error and support budgets go back to the teams that wrote them.
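
For anyone unfamiliar with the "error budget" idea, here is a rough sketch of the arithmetic. This is my own illustration, not code from the Google material: the function names, the 99.9% SLO, and the 30-day window are all assumptions chosen for the example. An availability SLO implies an allowed amount of downtime per window, and the budget is "busted" once observed downtime exceeds it.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction such as 0.999 (hypothetical target for illustration).
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_busted(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """True if the observed downtime exceeds the error budget for the window."""
    return downtime_minutes > error_budget_minutes(slo, window_days)
```

With these assumed numbers, a 99.9% SLO over 30 days leaves roughly 43 minutes of downtime before the budget is gone.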

GauntletWizard · on Dec 16, 2022

This is important to stress. SRE often takes first-line on-call, but "we" support the service. On a well-behaved service, having SRE support means having an extra debugging-focused engineer when things go wrong. But that engineer is rarely called on, and can cover the common causes of failure (network issues, hardware failures, the things that are outside any one service team's control) for multiple teams.

noiwillnot · on Dec 16, 2022

SREs will end up with the same gatekeeper reputation as traditional IT teams, and a new buzzword will need to be coined to signify the "move fast and break things" cool-kid energy that DevOps and SREs (still?) have.

klodolph · on Dec 16, 2022

Labels are funny.

SRE tends to be less centralized. You have devs and SRE working together. Either 1 dev team + 1 SRE team, or an N:1 setup where one SRE team supports multiple dev teams. The SRE gatekeeping is there exactly because your company has outgrown the "move fast and break things" phase, and is now in the "move fast but please don't break things" phase. The SRE team wants to support the devs with their high-velocity feature rollouts, but the SREs also know that the service is big and important, and outages / data loss / etc will cost the company money and damage the company reputation.

The SRE team is there to try and balance development velocity, reliability, and scalability without breaking the bank. Run a dev team by itself and you may not have enough expertise in reliability and scalability to make the service work the way you want--you may have a high pager load and no strategy to get out of it besides telling your devs to fix more bugs. The SRE team brings strategies and expertise to work your way out of that kind of hole if you're in it. Centralized IT, by contrast, slows things down and tries to get the entire company on a tech stack which is as standard as possible. That makes sense for running legacy services, but not for products with highly active development.

Sometimes what you see in rapidly growing companies is that the workload grows faster than the capacity for the company to hire skilled workers willing to do the work. This happened at some point during Facebook's growth, for example. A core responsibility of SRE teams is to provide scalability not just in terms of computational resources and capital expenditures, but in terms of human labor and operational costs. Doing this well requires working closely with the dev team and requires that the SRE team be able to make code changes or even architectural changes to the service they are supporting. This is outside the scope of centralized IT support.

You can use SRE as a buzzword but I see it as a specific role which solves a specific set of problems which are, at this point, relatively well-understood.

DougBTX · on Dec 16, 2022

In this model, devs can deploy any service they like, so SRE isn't a gatekeeper.

100011_100001 · on Dec 16, 2022

Fellow SRE/DevOps here. I agree with this interpretation.

Specifically, if things are not "working", I expect the developers to understand how their code works and what it needs to function properly. I'm constantly surprised by how much developers don't know about the app they write code for.

I'm not asking for your intellectual capital because I want to sue you. I want to understand how the app comes up.

The main problem I'm usually trying to solve is this: apps are just packages of stuff. If I can deliver 1000 packages with no problem but one doesn't get delivered, is it the package, the address it was sent to, or is this some brand-new kind of package that requires different handling? I need the package sender to explain to me what it contains.

On another note, @klodolph, if you are getting paged a lot, then your SRE needs to improve. Perhaps you were slightly exaggerating, but I consider any escalation to SRE personnel a failure on the SRE side. It's kind of a brutal metric to follow, escalations as close to 0 as possible. An interesting thing that happens if you try hunting for it is you will realize 80% of your calls come from 1 or 2 things. Addressing them will make developers happier and SREs happier.
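
To illustrate the "80% of your calls come from 1 or 2 things" hunt, here is a minimal sketch that tallies escalations by cause and returns the few causes that account for most of them. The cause labels, the `top_causes` name, and the 0.8 threshold are made up for the example; in practice the input would come from whatever your paging or ticketing system exports.

```python
from collections import Counter


def top_causes(escalations, share=0.8):
    """Return the most frequent causes that together cover `share` of escalations.

    `escalations` is a list of cause labels, one per page (hypothetical data).
    """
    counts = Counter(escalations)
    total = sum(counts.values())
    covered = 0
    result = []
    # Walk causes from most to least frequent until the target share is covered.
    for cause, n in counts.most_common():
        if covered / total >= share:
            break
        result.append((cause, n))
        covered += n
    return result


# Made-up sample: 10 escalations with three distinct causes.
pages = ["disk_full"] * 6 + ["cert_expiry"] * 3 + ["oom"]
worst = top_causes(pages)
```

In this made-up sample, two causes cover nine of the ten pages, which is the kind of shortlist worth fixing first.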

m3drano · on Dec 16, 2022

SRE here too. We need to stop building knowledge ex ante and focus on building capabilities to introspect the service ex post, that is, during the outage. Even if you are told how it works when you start supporting it, you'll have 50 other services to support too and will (rightfully) lose track of where things stand by the moment you need it.

Build telemetry and converge on well-known platforms to make response easier.