> We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me.
SRE here.
My takeaway from this is: If you want SRE support running this service, then you need to provide SREs with knowledge of how the system works. As long as only the devs have this knowledge, it's a bit unfair to put the SREs on the hook for supporting it.
Maybe I'm reading between the lines too much--the wording in the article is sloppy at best, and at worst, it doesn't actually say what I'm saying.
It's nice that your code has been through peer review and other people on your team know how it works too. That's less helpful for the SREs running it. SREs bear the burden of the pager--sometimes getting woken up at odd hours of the night to fix problems that were, in a sense, created by developers.
The SOP for getting SRE support for new services should include things like runbooks and design reviews. SREs should be in the loop when you figure out what metrics to expose from your service, because SREs will be the ones using those metrics to figure out the alerting systems. Very few companies have decent "SOP" for SRE support--there are a few companies which are really good at it, like Google, and then a long tail of companies which dump services on SREs without including SREs in the process.
IMO--the right thing to do is to give SRE teams the power to say "no" and refuse to take the pager for any particular service, barring exceptional circumstances. There's a deeper discussion to be had about why this should be the case--basically, devs and SREs have different incentives, and neither team should be put in a subordinate position to the other, because both teams have goals that support the business.
It's interesting that this:
> the right thing to do is to give SRE teams the power to say "no" and refuse to take the pager for any particular service
was a tenet in the original Google SRE material. SREs help operate well-behaved services using engineering best practices. Services that fail to behave well and repeatedly bust their error and support budgets go back to the teams that wrote them.
This is important to stress. SRE often takes first-line oncall, but "we" support the service. On a well-behaved service, having SRE support means having an extra debugging-focused engineer when things go wrong. But that engineer is rarely called up, and can support the common parts of failure (network issues, hardware failures, the things that are out of any one service team's control) for multiple teams.
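For anyone who hasn't worked with error budgets, here is a rough sketch of the arithmetic behind "busting" one. The function name, the 99.9% SLO, and the request counts are all made up for illustration; real setups usually compute this from monitoring data over a rolling window.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is the availability objective, e.g. 0.999 (hypothetical).
    A negative result means the budget is overspent ("busted").
    """
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% SLO or zero traffic leaves no budget to spend.
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests against a 99.9% SLO,
# so roughly 1,000 failures are tolerated.
healthy = error_budget_remaining(0.999, 1_000_000, 250)
busted = error_budget_remaining(0.999, 1_000_000, 1_500)
print(f"healthy service: {healthy:.0%} of budget left")
print(f"misbehaving service: {busted:.0%} of budget left")
```

A service that keeps landing below zero is the kind that, under the policy described above, would be handed back to the dev team.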
SREs will end up with the same gatekeeper reputation as traditional IT teams, and a new buzzword will need to be made to signify the "move fast and break things" cool kid energy that DevOps and SREs (still?) have.
SRE tends to be less centralized. You have devs and SRE working together. Either 1 dev team + 1 SRE team, or an N:1 setup where one SRE team supports multiple dev teams. The SRE gatekeeping is there exactly because your company has outgrown the "move fast and break things" phase, and is now in the "move fast but please don't break things" phase. The SRE team wants to support the devs with their high-velocity feature rollouts, but the SREs also know that the service is big and important, and outages / data loss / etc. will cost the company money and damage the company's reputation.
The SRE team is there to try and balance development velocity, reliability, and scalability without breaking the bank. Run a dev team by itself and you may not have enough expertise in reliability and scalability to make the service work the way you want--you may have a high pager load and no strategy to get out of it besides telling your devs to fix more bugs. The SRE team brings strategies and expertise to work your way out of that kind of hole if you're in it. Centralized IT slows things down and tries to get the entire company on a tech stack which is as standard as possible. Makes sense for running legacy services, but does not make sense for products with highly active development.
Sometimes what you see in rapidly growing companies is that the workload grows faster than the capacity for the company to hire skilled workers willing to do the work. This happened at some point during Facebook's growth, for example. A core responsibility of SRE teams is to provide scalability not just in terms of computational resources and capital expenditures, but in terms of human labor and operational costs. Doing this well requires working closely with the dev team and requires that the SRE team be able to make code changes or even architectural changes to the service they are supporting. This is outside the scope of centralized IT support.
You can use SRE as a buzzword but I see it as a specific role which solves a specific set of problems which are, at this point, relatively well-understood.
Fellow SRE/DevOps here. I agree with this interpretation.
Specifically, if things are not "working", I expect the developers to understand how their code works and what it needs to function properly. I'm constantly surprised by how much developers don't know about the app they write code for.
I'm not asking for your intellectual capital because I want to sue you. I want to understand how the app comes up.
The main problem I'm usually trying to solve is this: apps are just packages of stuff. If I can deliver 1000 packages with no problem but one doesn't get delivered, is it the package, the address it got sent to, or is this some brand-new package requiring a different way of handling? I need the package sender to explain to me what it contains.
On another note, @klodolph, if you are getting paged a lot, then your SRE needs to improve. Perhaps you were slightly exaggerating, but I consider any escalation to SRE personnel a failure on the SRE side. It's kind of a brutal metric to follow, escalations as close to 0 as possible. An interesting thing that happens if you try hunting for it is you will realize 80% of your calls come from 1 or 2 things. Addressing them will make developers happier and SREs happier.
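If you want to go hunting for those 1 or 2 things, a minimal sketch like this is enough to start. It assumes you can export your pages as a list of cause labels; the labels and the `top_page_causes` helper are hypothetical, not from any particular paging tool.

```python
from collections import Counter


def top_page_causes(causes, coverage=0.8):
    """Return the smallest set of most frequent causes that together
    account for at least `coverage` of all pages, as (cause, count) pairs."""
    counts = Counter(causes)
    total = sum(counts.values())
    selected = []
    covered = 0
    # Walk causes from most to least frequent until coverage is reached.
    for cause, n in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append((cause, n))
        covered += n
    return selected


# Hypothetical export of ten pages, labelled by root cause.
pages = ["disk_full"] * 6 + ["cert_expiry"] * 2 + ["oom", "dns"]
for cause, n in top_page_causes(pages):
    print(f"{cause}: {n} pages")
```

The hard part in practice is labelling pages consistently, not the counting.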
SRE here too. We need to stop building knowledge ex ante and focus on building capabilities to introspect the service ex post, that is, during the outage. Even if you are told how it works when you start supporting it, you'll have 50 other services to support too and will (rightfully) lose track of where things stand by the time you need it.
Build telemetry and convergence into well-known platforms to make response easier.