Things I want from Devs as SRE/DevOps (oschvr.com)
198 points by oschvr on Dec 15, 2022 | 143 comments



> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me. Which is much more pleasant than an adversarial environment.

Also: I'm not the only one that knows how it works, it's been peer reviewed in no small part to reduce my bus factor. All documentation requested is perfectly reasonable, and should be part of the organization's standard operating procedure.

If it's not part of the SOP, then no, you won't have those things. You need to work at a cultural level to change that, and for that you're much better off making allies than anything else. Make it clear how those things help you, and what you'll do to make the developers' lives easier when you don't need to worry about the basics. If altruism fails you, you can usually count on people to act in their own best interests.


> We're both being paid to solve different facets of the same problem. Coming at me with "this is your problem" isn't going to foster a collaborative environment with me.

SRE here.

My takeaway from this is: If you want SRE support running this service, then you need to provide SREs with knowledge of how the system works. As long as only the devs have this knowledge, it's a bit unfair to put the SREs on the hook for supporting it.

Maybe I'm reading between the lines too much--the wording in the article is sloppy at best, and at worst, it doesn't actually say what I'm saying.

It's nice that your code has been through peer review and other people on your team know how it works too. That's less helpful for the SREs running it. SREs bear the burden of the pager--sometimes getting woken up at odd hours of the night to fix problems that were, in a sense, created by developers.

The SOP for getting SRE support for new services should include things like runbooks and design reviews. SREs should be in the loop when you figure out what metrics to expose from your service, because SREs will be the ones using those metrics to figure out the alerting systems. Very few companies have decent "SOP" for SRE support--there are a few companies which are really good at it, like Google, and then a long tail of companies which dump services on SREs without including SREs in the process.

IMO--the right thing to do is to give SRE teams the power to say "no" and refuse to take the pager for any particular service, barring exceptional circumstances. There's a deeper discussion to be had about why this should be the case--basically, devs and SREs have different incentives, and neither team should be put in a subordinate position to the other, because both teams have goals that support the business.


It's interesting that this:

> the right thing to do is to give SRE teams the power to say "no" and refuse to take the pager for any particular service

was a tenet in the original Google SRE material. SREs help operate well-behaving services using engineering best-practices. Services that fail to behave well and bust their error and support budgets repeatedly go back to the teams that wrote them.


This is important to stress. SRE often takes first-line oncall, but "We" support the service. On a well-behaved service, having SRE support is an extra debugging-focused engineer when things go wrong. But that engineer is rarely called up, and can support the common parts of failure (network issues, hardware failures, the things that are out of any one service team's control) for multiple teams.


SREs will end up with the same gatekeeper reputation as traditional IT teams, and a new buzzword will need to be made to signify the "move fast and break things" cool kid energy that DevOps and SREs (still?) have.


Labels are funny.

SRE tends to be less centralized. You have devs and SRE working together. Either 1 dev team + 1 SRE team, or an N:1 setup where one SRE team supports multiple dev teams. The SRE gatekeeping is there exactly because your company has outgrown the "move fast and break things" phase, and is now in the "move fast but please don't break things" phase. The SRE team wants to support the devs with their high-velocity feature rollouts, but the SREs also know that the service is big and important, and outages / data loss / etc will cost the company money and damage the company reputation.

The SRE team is there to try and balance development velocity, reliability, and scalability without breaking the bank. Run a dev team by itself and you may not have enough expertise in reliability and scalability to make the service work the way you want--you may have a high pager load and no strategy to get out of it besides telling your devs to fix more bugs. The SRE team brings strategies and expertise to work your way out of that kind of hole if you're in it. Centralized IT slows things down and tries to get the entire company on a tech stack which is as standard as possible. Makes sense for running legacy services, but does not make sense for products with highly active development.

Sometimes what you see in rapidly growing companies is that the workload grows faster than the capacity for the company to hire skilled workers willing to do the work. This happened at some point during Facebook's growth, for example. A core responsibility of SRE teams is to provide scalability not just in terms of computational resources and capital expenditures, but in terms of human labor and operational costs. Doing this well requires working closely with the dev team and requires that the SRE team be able to make code changes or even architectural changes to the service they are supporting. This is outside the scope of centralized IT support.

You can use SRE as a buzzword but I see it as a specific role which solves a specific set of problems which are, at this point, relatively well-understood.


In this model, dev can deploy any service they like, so SRE isn’t a gate keeper.


Fellow SRE/DevOps here. I agree with this interpretation.

Specifically, if things are not "working", I expect the developers to understand how their code works and what it needs to function properly. I'm constantly surprised by how much developers don't know about the app they write code for.

I'm not asking for your intellectual capital because I want to sue you. I want to understand how the app comes up.

The main problem I'm usually trying to solve is this: apps are just packages of stuff. If I can deliver 1000 packages with no problem but one doesn't deliver, is it the package, the address it got sent to, or is this some brand-new package requiring a different way of handling it? I need the package sender to explain to me what it contains.

On another note, @klodolph, if you are getting paged a lot, then your SRE needs to improve. Perhaps you were slightly exaggerating, but I consider any escalation to SRE personnel a failure on the SRE side. It's kind of a brutal metric to follow, escalations as close to 0 as possible. An interesting thing that happens if you try hunting for it is you will realize 80% of your calls come from 1 or 2 things. Addressing them will make developers happier and SREs happier.
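One way to hunt for the one or two things behind most escalations is to tally pages by cause. This is only a minimal sketch, assuming each page has already been labeled with a cause; the `pages` list and its labels are made up for illustration.

```python
from collections import Counter


def top_causes(pages, share=0.8):
    """Return the smallest set of causes that together cover `share` of pages.

    `pages` is a list of cause labels, one per page (hypothetical data).
    """
    counts = Counter(pages)
    total = sum(counts.values())
    covered = 0
    result = []
    # most_common() yields causes from most to least frequent.
    for cause, n in counts.most_common():
        if covered / total >= share:
            break
        result.append((cause, n))
        covered += n
    return result


# Made-up pager history: ten pages with four distinct causes.
pages = ["disk_full"] * 5 + ["cert_expiry"] * 3 + ["oom", "dns"]
print(top_causes(pages))
```

With this sample, two causes account for 8 of the 10 pages, so those are the ones worth fixing first.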


SRE here too. We need to stop building knowledge ex ante and focus on building capabilities to introspect the service ex post, that is, during the outage. Even if you are told how it works when starting to support it, you'll have 50 other services to support too and will (rightfully) lose track of where things stand at the moment you need it.

Build telemetry and convergence into well known platforms to make response easier.


Ownership is important to be explicit about, less as a means of assigning blame, but more as a means of coordination and resource allocation.

The author is using ownership as a tool to avoid responsibility, and is thus creating an `us vs them` mindset rather than an `us vs the problem` mindset.

Having a strong definition of ownership (like committing your organizational structure to your monorepo as a config file) is invaluable for building tooling.

If you have a strong definition of code ownership it allows for things like people less familiar with a particular piece of code being able to make changes with the approval of the owners, while simultaneously notifying them of the change.

Likewise, if you are working on a platform that multiple teams use, you can write tooling that automatically assigns bugs or tickets.
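To make the tooling idea concrete, here is a rough sketch of routing a ticket from an ownership map. The path prefixes, team names, and the `route_ticket` helper are all hypothetical; a real setup would load the map from the config file committed to the repo.

```python
# Hypothetical ownership map: path prefix -> owning team.
OWNERS = {
    "services/billing/": "billing-team",
    "services/": "platform-team",
    "": "infra-team",  # fallback owner; the empty prefix matches every path
}


def owner_for(path, owners=OWNERS):
    """Return the team owning `path`, using the longest matching prefix."""
    best = max((p for p in owners if path.startswith(p)), key=len)
    return owners[best]


def route_ticket(title, path):
    """Build a ticket record assigned to the owner of the affected path."""
    return {"title": title, "path": path, "assignee": owner_for(path)}


print(route_ticket("500s on invoice export", "services/billing/export.py"))
```

The same lookup could drive review requests, so someone outside the owning team can make a change while the owners are notified.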

Ownership problems and "us vs them" are a clear sign of poor leadership. Most devs that experience it become cynical or hostile without being able to understand that it is leadership that failed them.


> Having a strong definition of ownership (like committing your organizational structure to your monorepo as a config file) is invaluable for building tooling.

Having a strong ownership can become toxic very quickly.


Ownership is not toxic. Leadership is toxic.

It is really important to understand that toxicity and hostility is a function of leadership (or lack of it).

I highly recommend the book Extreme Ownership. That book explains the mechanics of toxic environments.


"I've grown unfond of this attitude. I most certainly don't own it. I have no IP rights to it at all. We're both being paid to solve different facets of the same problem."

If you are a dev on the team that owns that service, then it's your and your team's responsibility to answer all of these questions... Even the org's SOP would end up reaching back to the team that owns the service if problems arise...


I hate separate infrastructure teams with a passion for this reason.

Far too frequently you end up in a situation where someone makes an environment change and blows everything up because they have no understanding of the services they're stewarding.

If you want me to take responsibility, my team should be managing the service end to end.

I feel really strongly against this division of responsibility in software teams. It too often leads to holding up progress and hostile interactions due to each team pursuing their own priorities.


> If you want me to take responsibility, my team should be managing the service end to end.

This. I really do not enjoy being called up in the middle of the night to walk a group of people that know absolutely nothing about the system through the steps they need to resolve the issue, because nobody wants to give the “dev” team access to the production environment.

I think the solution that best aligns incentives is the one where the people introducing issues are also the ones called up (and able) to fix them.


Ah, developers empowered to do operations. We should have a catchy name for it... "opsdevs"? :P

Seriously, this is the original idea of the DevOps principles. But they run straight into the CIS requirement that "developers do not have access to production code" and the ISO 27001:2013 requirement of separation of responsibilities. So it'd be great if it happens, it just can't happen in the big B2B spaces.


We allow devs to do things in prod. We are a public company. SOX, HIPAA, ISO 27001, GDPR, and all that. Every dev on my team has access to their prod servers and databases (but usually no access to other teams' stuff). We deploy multiple times a day. We handle our own oncall. We process billions of individual requests daily for millions of users. We have several thousand employees.

Our compliance requires that all code be reviewed and pass quality assurance before merging and that all prod changes be documented.

That means Dev1 writes the code, the unit and integration tests, sets the right configs in each environment, updates the dashboards for any updated metrics, sets up alerts, and updates runbooks. Dev2 reviews the work, pushes back when any of the above needs more work, and then documents on the Jira ticket how they verified stuff. Dev1 or Dev2 merges the code, observes the build, and ensures the code rolls out to prod.

When something goes wrong, the oncall dev on the team is paged and can access all prod systems, and can log in, start and kill things, move files, etc.


All counsel these days, from 'devops' or 'sre' bodies of knowledge is: development and operations are two sides of the same system, they should be integrated better. Companies: got it, create new title/team, in charge of this integration. Seriously?


Agreed - in a previous job people shipped garbage code frequently and when there was a problem they didn't want to hear about it, because "everyone owned the code".


How else would you get promoted? Get with the program.


When you give responsibility to teams themselves the result is O(1) size problems becoming O(teams) sized problems.

  > How can I check the health of the service?
  In the definition of service, you define a field for
  health check script.

  > How can I safely and gracefully restart the service?
  This will exist within the script used to push new code.

  > Does it has any external dependencies?
  This could be defined in the service configuration and 
  used for setting up integration tests and automatically 
  generating a dependency dashboard.

  > Do you have a playbook, or sequence of steps, to bring
    the service back up?
  You could define a field in the service definition to
  automatically generate a dashboard and include the
  playbook link at the top of the page.

  > Do you use appropriate logging levels depending on the
    environments?
  Production could be extremely opinionated about what
  acceptable logging looks like, forced via code review. Log
  level could be defined in service config.

  > Are you logging to stdout?
  Why would any production service get to choose?
  Service owners shouldn't be able to log into machines.
  
  > Are you measuring the RED signals?
  Required fields in service config that could be used to
  generate a service dashboard.

  > Is there any documentation/design specification for the
    service?
  Required config field.

  >  Are you using gRPC or REST?
  Trivial grep.

  > How does the data flow through the service?
  This is complicated, but can probably be easily replaced
  by asking what state your service keeps and how it's
  stored. This is the only question I think the author
  should/needs to ask.

  > Do you have any PII/Sensitive data flowing through the
    service?
  While this question is important, this is one of the
  problems that has to be a particular person's
  responsibility. Any dev that answers anything but
  "probably not, but I don't know" shouldn't be trusted.

  > What is the testing coverage for this service?
  Some form of this would exist in a service config.

I don't think the question of responsibility is as simple as "it's the team's problem."
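As a rough illustration of "required config field", the sketch below models a service definition and reports which fields are still empty. Every field name here, along with the health check path and wiki URL, is an assumption made for the example and is not taken from any real tool.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ServiceDefinition:
    # All field names are illustrative, not taken from any real tool.
    name: str
    health_check_script: str
    playbook_url: str
    design_doc_url: str
    log_level: str = "INFO"
    dependencies: list = field(default_factory=list)
    red_metrics: dict = field(default_factory=dict)  # rate, errors, duration


def missing_fields(svc):
    """List the names of fields that are empty on a service definition."""
    return [f.name for f in fields(svc) if not getattr(svc, f.name)]


# Hypothetical service with no playbook link and no RED metrics yet.
svc = ServiceDefinition(
    name="CoolAppServerName.prod",
    health_check_script="./bin/healthCoolAppServerName.py",
    playbook_url="",
    design_doc_url="https://wiki.example.com/cool-app-design",
    dependencies=["postgres.prod"],
)
print(missing_fields(svc))
```

A check like this could run in code review or CI, so the questions get answered once by the platform instead of once per team.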


Hey, I see “service config” referenced a lot in this thread, but your answer has the most occurrences.

I’m not sure I follow what it is.

Is it a technical construct, like a code template or an API that services implement?

Or a process construct, like an SOP to follow with checkboxes?

Thanks


Succinctly: a service config is the authoritative source of truth for what a service is, in a format that can be (is) consumed by tooling.

A lot of software development is about generating abstractions.

"Service" is a possible abstraction someone might want to generate and develop.

I think a service abstraction can be defined by:

  A blob of code
  A set of machines to run it on
  A way to stop and start it
  A method to load balance to it

So it would make sense to create a YAML config file committed to a repo containing something like:

  services:
    - name: "CoolAppServerName.prod"
      build_script: "./bin/buildCoolAppServerName.py"
      start_script: "./bin/startCoolAppServerName.py"
      stop_script: "./bin/stopCoolAppServerName.py"
      hosts:
        - "host_1"
        - "host_2"
      slb_name: "CoolAppServerName.prod"
    # ... more services

Once you have a definition, it can be extended to meet growing needs. You might choose to do something like:

    - name: "CoolAppServerName.prod"
      key_metrics:
        - "CoolAppServerName.prod.5xx"
        - "CoolAppServerName.prod.latency_percentiles"
      owner: "CoolTeam"
      # ... other fields

And then you could generate a webpage with a dropdown where "CoolAppServerName.prod" is an option and the dashboard including graphs for the time series metrics "CoolAppServerName.prod.5xx" and "CoolAppServerName.prod.latency_percentiles" automatically show up. Maybe instead of having service names in the dropdown you have owner names in the dropdown.
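The dropdown idea can be sketched as a grouping step over the parsed config. This assumes the config has already been loaded into plain dictionaries; `OtherApp.prod` and the `dashboards_by` helper are invented for the example.

```python
from collections import defaultdict

# Hypothetical service configs, already parsed from the config file.
SERVICES = [
    {
        "name": "CoolAppServerName.prod",
        "owner": "CoolTeam",
        "key_metrics": [
            "CoolAppServerName.prod.5xx",
            "CoolAppServerName.prod.latency_percentiles",
        ],
    },
    {
        "name": "OtherApp.prod",  # made-up second service
        "owner": "CoolTeam",
        "key_metrics": ["OtherApp.prod.5xx"],
    },
]


def dashboards_by(key, services):
    """Group each service's key metrics under the chosen dropdown key."""
    grouped = defaultdict(list)
    for svc in services:
        grouped[svc[key]].extend(svc.get("key_metrics", []))
    return dict(grouped)


print(dashboards_by("name", SERVICES))   # one dropdown entry per service
print(dashboards_by("owner", SERVICES))  # one dropdown entry per owner
```

Switching the dropdown from service names to owner names is then just a different grouping key over the same config.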

You could potentially write some code that attempts to validate no significant changes in those metrics and use it to automatically verify that newly pushed code didn't take down the website.
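A very simple version of that check compares metric readings from before and after a push. The sketch below assumes higher values are worse (error counts, latency) and uses an arbitrary 50% threshold; the sample readings are made up, and a real check would need something more robust than a comparison of means.

```python
from statistics import mean


def metric_regressed(before, after, max_relative_increase=0.5):
    """Return True if the mean of `after` exceeds the mean of `before`
    by more than `max_relative_increase` (0.5 means +50%)."""
    baseline = mean(before)
    current = mean(after)
    if baseline == 0:
        # No baseline to scale against: any increase counts as a regression.
        return current > 0
    return (current - baseline) / baseline > max_relative_increase


def push_is_healthy(samples):
    """`samples` maps metric name -> (before, after) lists of readings."""
    return not any(metric_regressed(b, a) for b, a in samples.values())


# Made-up readings: error counts are flat, latency more than doubles.
samples = {
    "CoolAppServerName.prod.5xx": ([2, 3, 2], [2, 2, 3]),
    "CoolAppServerName.prod.latency_percentiles": ([120, 130, 125], [310, 290, 300]),
}
print(push_is_healthy(samples))
```

Because the metric names come from the service config, the same check can run for every service without per-team setup.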

Service config means creating an authoritative service identifier (authoritative because it's the only identifier used in tooling) and then attaching a configuration to it.

Facebook and Google have (or at least at some point had) Tupperware and Borg respectively, which are basically custom versions of the above extended for their infrastructures.


I see, thanks for the detailed answer.

That strongly reminds me of solutions à la Kubernetes.

Where you define the entry point, health check, etc.

A tad more abstract, and larger (AFAIK, k8s doesn’t care how your code is built, for instance).

Never heard of Tupperware. Loosely aware of Borg.

Again, I appreciate the time.


When Kubernetes was released, it was thought it would be a successor to Borg if not the key components of Borg itself, IIRC. From https://en.wikipedia.org/wiki/Kubernetes:

  The design and development of Kubernetes was influenced by 
  Google's Borg cluster manager. Many of its top contributors
  had previously worked on Borg;[15][16] they codenamed Kubernetes
  "Project 7" after the Star Trek ex-Borg character Seven of Nine[17]
  and gave its logo a seven-spoked wheel.

There was a lot of early skepticism about it because it was not Borg. I guess my understanding is that Borg is so integrated into Google tooling that it would have been impossible to generalize.

I haven't used it myself yet because a few of the senior engineers (from google/fb) I respect said "absolutely not in our infra."


What are you using instead and what are the main criticisms of kubernetes from your seniors?


A completely bespoke solution. I was both too busy and too inexperienced with Kubernetes to get into it and have a conversation.

IIRC the main criticisms were that it wouldn't scale to our needs and there were some use cases that wouldn't be handled by Kubernetes easily. The end result would be two different solutions for the same problem, a slow migration to Kubernetes that may or may not stall out, and then a half-finished/perpetual migration that would double support costs.


"> Do you have any PII/Sensitive data flowing through the service? While this question is important, this is one of the problems that has to be a particular person's responsibility. Any dev that answers anything but "probably not, but I don't know" shouldn't be trusted."

GDPR makes it the responsibility of the organisation to know. You can't safely say "I don't know" about PII.


And if an organization wants to know, then they must make a single individual responsible. "Organizational responsibility" means that no one is responsible.

It is important to have one person know the answer, rather than making your devs "guess" the answer. "The devs we asked said there wasn't misuse of PII" is not at all a good guarantee that PII is not abused or lost.

The organization cannot know unless there is an individual who knows.


> We're both being paid to solve different facets of the same problem

Code that is not running correctly in production is worthless. If you write code and haven’t thought about all of the implications that it takes to make it run correctly, you haven’t produced business value.

Yes, I have been a developer for 25+ years professionally. But for the last 10, I’ve also thought through all of the topics that the author has delineated in his article.

Yes, I consider myself to be a competent “DevOps” engineer as long as it is on AWS, and can go from an empty AWS account to a fully functional infrastructure using IaC, a CI/CD pipeline, monitoring, alerting, centralized logging, etc.

Knowing that I will either be the person doing the “DevOps” or working with the person who is informs the design of my development.


> Code that is not running correctly in production is worthless. If you write code and haven’t thought about all of the implications that it takes to make it run correctly, you haven’t produced business value.

I wish this were true, but the number of billion dollar companies running code that doesn't run correctly shows it isn't.


If it brings in revenue it's running correctly, as far as the business is concerned.


Well if you want to use that definition then my point is that "correct" in that sense doesn't (necessarily) require any thinking through.


It's a question of aligned incentives. If your team (not you necessarily) only has a priority to ship features, then your team needs to be alerted for performance and reliability issues, because you are then incentivized to fix said issues. If these issues are thrown over the fence to some other team that doesn't know your codebase and doesn't have the domain knowledge to improve said code, then things (usually) never get fixed.

Likewise, error budgets that say "your service has not been reliable enough this period so the next sprint will be dedicated to improving that and no new features will ship" are another way to make sure quality is not an afterthought.
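The arithmetic behind an error budget is small enough to sketch. The SLO value, request counts, and the "features vs reliability" rule below are illustrative assumptions, not a description of any particular company's policy.

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the period's error budget still unspent.

    slo: target success ratio, e.g. 0.999
    Returns a value <= 1.0; negative means the budget is overspent.
    """
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        # A 100% SLO (or no traffic) leaves no room for any failure.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def next_sprint_focus(slo, total_requests, failed_requests):
    """Pick reliability work when the budget is exhausted, else features."""
    remaining = error_budget_remaining(slo, total_requests, failed_requests)
    return "features" if remaining > 0 else "reliability"


# With a 99.9% SLO over 1,000,000 requests, about 1,000 failures are allowed.
print(next_sprint_focus(0.999, 1_000_000, 400))
print(next_sprint_focus(0.999, 1_000_000, 1_500))
```

The point of encoding the rule is that the decision to pause feature work stops being a negotiation between teams and becomes a number both sides agreed to in advance.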


"Own" in this context means be responsible for. Peer review does not transfer ownership to peers, but it does reduce the risk of your code breaking. Other people should not understand your own code (and intent) better than you do. So long as that is correct, while everyone in dev/ops shares responsibility for delivering the service, you own and are responsible for the part of that service delivery that you authored, simply because you are the person most qualified to resolve any issue that arises from that code breaking. If an SRE is keeping your code running, then they own the code's uptime inasmuch as you have been able to communicate the parameters and configuration of the app. But at the end of the day, when there is a bug in the code, you are the best person to fix it, so you own it.


> "Own" in this context means be responsible for.

I understand this, this is what I was trying to direct my distaste at. Unfortunately, it seems to have been misconstrued by responders.

> you own and are responsible for the part of that service delivery

No, I'm just responsible for that part of the service delivery. I own none of it.

I object to it in the same way that I object to hearing "we're a family" from leadership.


So long as your bosses are ok with that, who cares. Power to you. I was just stating the obvious typical expectation I guess.


Isn't this (you devs own your app, I don't give a shit about it, I just want it to not page me) the classic old school "Sysadmin" approach that supposedly DevOps was supposed to counter? How can anyone say this and claim to be doing anything remotely near DevOps?


I second this. DevOps was supposed to result in developers also being responsible for deploying and monitoring their applications.

It's either a DevOps culture or a classical Dev + SysAdmin setup with clear boundaries and where a developed application is being thrown over the fence to SysAdmins. The latter will always result in mutual animosities between the two parties.


It's part of the deal for modern SaaS software deployment. If the devs want to deploy more often than once a quarter, then they need to be mindful of the operational burden they impose on the ops team. The way to drive that home is for them to carry the pager part of the time.


> DevOps was supposed to result in developers also being responsible for deploying and monitoring their applications

No, it was supposed to be dev and ops collaborating with mutual respect and empathy.


Agree 100%

Although it's sane that developers keep contact with the reality of maintaining the live applications they wrote, it just doesn't scale to ask them to fully support them.

There is an infinite amount of maintenance for any live system. No service of any magnitude just "works" in production indefinitely, at the very least because this service interacts with others that will fail.

If developers are responsible for every live system they publish, they will get stuck after publishing a finite number of services, and leave because of maintenance boredom.

There needs to be a reasonable amount of documentation written, explanations given, level 2 support taken, but that's it, the maintenance is for ops.


I don’t know what you think is magically different about an infra person who can apparently maintain arbitrary numbers of services and is somehow immune to the same boredom you fall under.

And look, infra people are harder to come by and more expensive than devs so no company wants to waste our time bumping dependencies and fixing non-infra related bugs. If you’re big enough to have a team or teams of maintainers then they would go there. If not then it’s ship it maintain it.


This is only true for badly written or architected systems. Usually it is a team. If you move around different teams you are probably only maintaining $currentteam's apps, not everything you touched.

As such, any team should ensure the integrity of its code regardless of people coming in and out. Hiring, levels, code reviews, etc. help with this. A bunch of grads with no source control who write code directly in production without review would be the opposite.

SRE is a hat, not a title or a team, at small companies, so this is how it has to be done. Also, AWS probably works this way from what I have read. The team is like a mini business.


Dealt with this at a company.

Eventually the developers just gave up on all development work and became operations. The actual operations team kept the k8s cluster alive. Developers had to do everything else.

Eventually people would get the hang of the operations side but by that point they were burnt out and quit.


> If it's not part of the SOP, then no, you won't have those things

Isn't this "adversarial" as well? Why would you withhold that information just because the SOP doesn't make you provide it? What will happen then is that eventually the service will break, nobody will know how to fix it, and they will come and ask you.

If you're no longer there, the service will be decommissioned and all your work will have been in vain. I don't see a net benefit for any of this.


I didn't mean it adversarially, but I can see how you came to that conclusion. Communication is hard.

I was trying to convey that they wouldn't exist because it hadn't been asked for. People aren't mind readers, ops has to vocalize their needs. Which is what the post was doing. And I don't think those needs are unreasonable, either. My point was more that the work to make sure those things are available needs to happen higher up than you'll be able to reach by talking to any individual developer.


Aaah, ok! Sorry for the misunderstanding, and thanks for the explanation.

(I was mostly in agreement with the OP and was surprised by the level of opposition to it in the comments.)

Yes, it's better if the information asked for in the original post is in the SOP, but if it's not, then asking for it each time somebody sends something to production, while suboptimal, is better than nothing IMHO.


This questionnaire is kind of foreign to me since I see an SRE's job, more or less, as defining interfaces and then forcing everything to adhere to them (politically or manually).

These are the questions I find useful:

  "How is capacity for the service allocated right now?"
  "How is software updated right now?"
  "How was the last outage handled in as much detail as possible?"

From there, just about everything answers itself with a couple days of reading code and poking at machines, particularly from the output of `lsof` (log files, config files, what the service talks to).

Half of these questions could be answered with grep and once you get proficient at grep, you can answer questions faster, and more importantly, more accurately than the people who work on the services themselves.
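For the `lsof` part, a rough sketch of sorting a process's open files into log files, config files, and network peers. It relies on `lsof -F n`, which prints one field per line with `n` prefixing the file name; the suffix-based bucketing is a crude heuristic and the sample output is made up.

```python
import subprocess


def parse_lsof_names(output):
    """Extract file names from `lsof -F n` output.

    In -F mode lsof prints one field per line, prefixed by a
    one-character field id; 'n' marks the file name.
    """
    return [line[1:] for line in output.splitlines() if line.startswith("n")]


def classify(names):
    """Bucket names into logs, configs, network endpoints, and other."""
    buckets = {"logs": [], "configs": [], "network": [], "other": []}
    for name in names:
        if name.endswith(".log"):
            buckets["logs"].append(name)
        elif name.endswith((".conf", ".yaml", ".yml", ".ini")):
            buckets["configs"].append(name)
        elif "->" in name:  # lsof shows connections as local->remote
            buckets["network"].append(name)
        else:
            buckets["other"].append(name)
    return buckets


def open_files(pid):
    """Run lsof for one process; requires lsof to be installed."""
    result = subprocess.run(
        ["lsof", "-p", str(pid), "-F", "n"], capture_output=True, text=True
    )
    return classify(parse_lsof_names(result.stdout))


# Made-up sample of `lsof -F n` output for a single process.
sample = (
    "p1234\nfcwd\nn/srv/app\n"
    "f3\nn/var/log/app/app.log\n"
    "f7\nn10.0.0.5:44312->10.0.0.9:5432\n"
)
print(classify(parse_lsof_names(sample)))
```

On a real host, `open_files(pid)` gives a first guess at where the service logs, which configs it reads, and which peers it talks to.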

> that YOU wrote, only YOU know how it works, thus YOU own.

I find this attitude pretty toxic. If you are in an SRE vs Product Dev mindset, then you have bigger battles to fight than service manipulation.


> From there, just about everything answers itself with a couple days of reading code and poking at machines, particularly from the output of `lsof` (log files, config files, what the service talks to).

> Half of these questions could be answered with grep and once you get proficient at grep, you can answer questions faster, and more importantly, more accurately than the people who work on the services themselves.

SREs can own the whole development process you mean?

edit: on HN you probably want to use IntelliJ for everything, don't even mention grep please, they don't know what that is


I don't get why SRE is a job (and it was my title for years). The stuff listed is just good software engineering. If a SWE can't figure out that they need to monitor their application (or really anything on this list), they have no business being anything other than a junior programmer.

These kinds of responsibilities create this weird scenario now where the team's SRE is the team's babysitter. Which just leads to the ops vs dev bullshit we've seen before. Toxic right off the bat.


> I don't get why SRE is a job (and it was my title for years). The stuff listed is just good software engineering. If a SWE can't figure out that they need to monitor their application (or really anything on this list), they have no business being anything other than a junior programmer.

Someone has to enforce those good practices. Weak engineers hire more weak engineers and they suck at their job.


> Weak engineers hire more weak engineers and they suck at their job.

I love meeting people from other cultures and backgrounds and experiences. It’s great to get to know each other. I perceive you to be one of those people for me.

See, in my travels, weak engineers usually have little or no say in anything of import, except the yucky boring maths stuff others don’t want to/can’t do. Hiring a personality is something that our management types love to get involved in. They hire for all the wrong reasons. They retain people for all of the wrong reasons. They promote for all the wrong reasons. We have a “hiring as a service” institution here, often known as HR, that manages to meddle with things.

It’s not that any of these bodies hire weak or strong engineers for any malicious purpose (Occam’s razor, ya know), it’s just completely arbitrary for us and you kind of just learn to put up with and cope with the chaos.

Where you’re from, is there a sort of settling function where after a while, all the weak engineers have cohired all of the weak engineers? Do the strong engineers hire the other strong engineers?


Not the OP but yes absolutely. At medium sized USA companies it's common (not universal) for new team members to be discovered and or interviewed and or hired by the team that they will be working with. HR/recruiters are just trying to feed the interview pipeline. Possibly the at-will nature of USA jobs feeds into this. Bad hires are less risky without long term employment contracts.

I'm interested in your experience as well. Even at the largest US orgs there's still the concept of a 'hiring manager', who leads a small team, has a big role in hiring, and can do meaningful technical work on the system. So the interviewing ability of these people, and the people that they trust, is the main determinant of the engineering hiring decisions that get made. Like a common interview loop would be hiring manager, (2x)senior/staff dev on the team, senior dev on another team, director. Lots of exceptions and opportunities for random consultants/reorganizations from above, but the basic idea of line employees and managers being able to identify their successors is pretty baked in around here. Can't speak for the east coast though


What I mean is: say there is a startup with non-technical founders. They outsource development to some web shop to do a proof of concept or something. Later the product proves to be valuable, so they hire a CTO of some sort or even just an engineer. They have no idea what a good CTO or engineer is when they do the hiring. Sometimes it works out well, sometimes not.

In the "not so well" scenario, the company ends up with weak engineers that don't have a lot of experience. They might have 20 years of making marketing websites at a website mill with Drupal or maybe even Django, but zero experience in software development. Those people either don't know what a good engineer is or they are afraid to hire someone noticeably better than them, so the team ends up stuffed with not so good engineers.

Maybe by luck they hire a good engineer eventually, but a good engineer will look at all that mess and will bounce really quick unless the pay is worth it. As an anecdote, I worked with a very good engineer recently, but her team was weak - on her 1:1 she was told to not be so strict during her PR reviews. Reviews she was making, unlike mine, were very polite and people still complained about her being too strict (she wasn't; it was the bare minimum).

When a strong engineer is interviewing a candidate, they know what to look for and are willing to look the other way when it comes to personality (that sometimes ends up in a toxic engineering team culture, but that's another issue). I also worked with strong engineers that would refuse to hire engineers that are better than them (it went unnoticed for some time).

A lot of SRE complaints come up because they have to babysit weak engineers and hold their hands. Management thinks that the solution to this is to hire junior SREs to deal with it, but the real solution is to reduce the need for babysitting. For context, I'm a sysadmin that ventured into software development and is now doing SRE work and honestly, there are a lot of days when I wish I could go back to software development.


That is why you have senior/staff engineers to enforce good practices.


And weak SREs hire more weak SREs. You can be as great a developer as you want, if your app is built on quicksand it’s never going to be good.


That is true. However, even our junior SRE hire that was objectively weak (we needed an extra person to deal with boring things) is stronger than the average developer here. I mean, my bar is very low: can you try doing what the error messages in your service tell you to do without raising an issue with my team?


if you can't hire a good lead/senior/staff/fotm-title software engineer, why can you hire a good sre?

if you can hire a good sre that knows all this stuff, then you should be able to identify the skill your teams are lacking (it'll be obvious because things are shipping slow, and breaking often) and use the same skill you used to hire the sre to hire a good swe


I think the SRE hiring pool is much smaller and that's why it's harder to hire a weak SRE. There are no "30 days nodejs bootcamp"-like courses for SREs.


In my experience outside of enterprise oracle type places it's just a label. You still work on service level code, you also work on architecture, you also do infrastructure and monitoring and really all the stuff. From the places I've worked that aren't red tape bound the title really just means "we need someone who we can give a business problem to and they will solve it in an efficient way without needing a full team or re inventing wheels, and we need to be sure it's going to keep making money for us with stability". All the other rigor is just job description fluff to attract talent.


Yeah sure that sounds great in principle but at the end of the day someone has to be in charge of tracking down where that new user_uuid label in the prometheus metrics came from and find the team responsible and explain why that is a bad idea.
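
For anyone wondering why a per-user label is a bad idea: in Prometheus every distinct combination of label values is its own time series, so an unbounded label like user_uuid multiplies the series count by the number of users. Below is a rough sketch of how one might spot such labels from a dump of label sets. It is plain Python and does not call any Prometheus API; the function name, the sample data and the threshold are all made up for illustration.

```python
from collections import defaultdict


def find_high_cardinality_labels(label_sets, threshold):
    """Return {label_name: distinct_value_count} for labels above threshold.

    label_sets is an iterable of dicts, one dict per time series.
    """
    values_by_label = defaultdict(set)
    for labels in label_sets:
        for name, value in labels.items():
            values_by_label[name].add(value)
    return {
        name: len(values)
        for name, values in values_by_label.items()
        if len(values) > threshold
    }


# Hypothetical data: 1000 users hitting 2 endpoints gives 2000 series.
series = [
    {"endpoint": endpoint, "user_uuid": f"user-{i}"}
    for i in range(1000)
    for endpoint in ("/login", "/search")
]

offenders = find_high_cardinality_labels(series, threshold=100)
print(len(series), offenders)
```

Without the user_uuid label the same metric would be 2 series instead of 2000, which is the conversation someone ends up having with the owning team.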


it's fine, it's not like that someone cannot get another job easily, even in current market


i don't get this scenario? that would be the subject matter expert on prometheus in your organization then. why is an sre required?

it sounds like you're implying there has to be some weird relationship where developers can't be trusted to debug and need to be babysat by the grand sre that knows and watches over everything or something. this is the toxic nonsense that exists.


> I don't get why SRE is a job

I've met CTOs that would agree with you. I no longer work with any of them.


Sure, but how are those companies doing financially and do they still have that CTO?


If we put aside the fact that I don't care, it's also completely irrelevant because it's anecdotal, and even if it wasn't anecdotal, I do not accept the implication that the competence of a specific company's CTO and their tenure time has an obvious relationship with the profitability of a company to begin with.


People are complaining about the idea that the developer is ultimately the owner of any service they wrote.

I don’t see how this is even controversial. Consider the case where a SRE is responsible for 5 or 10 such systems. They could never be expected to know as much about those systems as the people that wrote them.

Now if there is a one to one relationship between SREs and systems then it might make sense to expect that level of understanding from the SRE.

In my experience it would be a great privilege to have a dedicated SRE to your application.


You say that’s not controversial but irl I’ve worked with more than a handful of non-jr engineers and even managers who think developer job ends at seeing green build in ci (btw a ci which some other team is supposed to manage). Sometimes even green build locally


"Owner or not" is a false dichotomy. I get it that the author and many SREs are probably jaded from developers not taking ownership at all.

The right attitude is to figure out processes that let people draw a line when to go to DevOps, and when to escalate to developers. Developers need to understand the costs they impose on devops and organizations need to make sure developers are empowered to fix their own issues, rather than to be constantly chased around to business requirements.

Developers ultimately answer to business priorities, and they don't necessarily own the business processes that demand their support. If developers are given ample resources to keep bugs out of systems, document operational expectations and respond to incidents, then the developers can "own" the processes better. If not, it's a management problem that is just of the same nature as the usual SRE complaint that developers don't want to own anything at all.


For me SRE/DevOps is just support for developers. It's possible that such person has more knowledge/experience in development and operations, but in general their focus is on infrastructure, automation and general troubleshooting.

They might know how to build/test/deliver/monitor some solution, they might know to some degree how to configure solution (but developers should support them with it and describe it well), how to script some operations, however they definitely won't write bugfix themselves.


Can someone explain to me how this is any different of a mentality from system engineers that SRE replaced?

I haven't read the SRE book, but my understanding was that at Google the answer to all this would be that the SRE would act as a software developer and submit pull requests to the codebase in order to implement/fix all of this?

> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

And my own take on this statement which is getting so much traction in the comments is that this seems largely indistinguishable from the wall between Dev and Ops that we had back in the late 90s.


I don't know when exactly "DevOps" became a new role equivalent in every way to the old "Ops"? My original understanding of the term was that developers did their own operations


It was, but I think all the legacy ops teams were worried they’d be out of a job, so they all rebranded their teams as devops.


Devs also quickly rebelled against carrying a pager and being oncall.


I don't think there is a difference. I also think this is part of the reason why so many companies now don't have an Ops department. My own is an example of this, we've outsourced the Operations part to another company because our Developers are doing most of the Ops that isn't related to networking anyway. I'm not saying this is good by the way. It's just that developers tend to get caught in the "company" part of "company vs IT" because developers want to make stuff work first, and work correctly second where as IT has always been the other way around.

I don't necessarily disagree with everything the author writes by the way. There are a lot of good points in there about building things to be operational, but at the same time, what good is an operations department if it can't actually operate its systems? I know the answer to this often becomes buying third-party software or "standard" systems, but as more and more businesses are realizing, that's often worse than simply using a lot of interconnected excel sheets (which doesn't really scale).

It'll be interesting to see what happens once the newer waves of project managers and it-business partners learn from the successes of companies that build in-house software instead of going to "standard" systems where they'd need the inhouse developers anyway to make the un-Godly amount of APIs and data-transfers work. At least here in Denmark, companies like Lego and Vestas are doing some really groundbreaking money making at a much lower cost, by not going to "standard" systems for everything. Not that you should never use standard systems, there are some things that are shared across businesses after all, but there are typically also a lot of things that just won't fit into some internationally shared box well enough for it to work out as a net bonus.


The "not built here" attitude can create a serious sustainability challenge. There are a lot of problems that businesses need to solve for that are well solved already. It's very easy to assume you're the first to do or need x, but often it's not the case. As such, why create a development and support burden when you can simply buy x at $/seat?

Agreed, integration can be a challenge, but the key here is good enterprise IT architecture and avoiding SaaS sprawl.


> The "not built here" attitude can create a serious sustainability challenge. There are a lot of problems that businesses need to solve for that are well solved already. It's very easy to assume you're the first to do or need x, but often it's not the case. As such, why create a development and support burden when you can simply buy x at $/seat?

I don't disagree with your theory on this, and it's what is being preached at a lot of softer IT educations as well as IT management. The unfortunate truth is that it rarely plays out that way in the real world. At least not in my experience.

Not because it couldn't work, but because organisations tend not to do the necessary change-management to actually implement the standard system in a way that allows your "well solved problems" to fit into the box. This is why it's a very lucrative business to sell third party APIs to standard solutions, because organisations always have their own little variations and differences. Maybe those variations and differences could be solved by changing the way the organisation does business, but that's not what happens, instead they buy extra development to get the API to make their square fit into the round hole, and the consequence is an endless line of in-house data services that translate and move data between systems.

Sometimes things get so bad that the business itself actually goes back to solving things in Excel, and then has people manually move that input into the "standard systems" before it eventually becomes a nice RPA project, leading to even more technical debt.

This is not to say that it doesn't work. Because it does. You can run a business like that for a long time, but you're not going to maintain the growth you had before you settled into these.

> Agreed, integration can be a challenge, but the key here is good enterprise IT architecture and avoiding SaaS sprawl.

I wonder if there has ever been a place where the Enterprise Architecture wasn't a bunch of outdated documents that lived in a digital drawer somewhere because the only one who actually used it was the Enterprise Architect.

And I say this as someone who's worked on the national Enterprise Architecture of Denmark. https://www.kl.dk/okonomi-og-administration/digitalisering-o...

I don't think the theory of it is wrong as such, but I do think the academics who came up with it expected the real world to be a better place than it is. You're never going to succeed with IT architecture in a world where every non-IT manager doesn't care, and that's the world we live in. Even here in Denmark in one of the most digitalized countries in the world. You can feel like you're succeeding within your own ivory tower, but if none of your projects reap any form of benefits from it, and the organisation still violates everything you've built because they wanted X and signed the contract before they even asked you if it would fit in, well...

I'm not against standard systems, mind you. I completely agree that if you can find a box that fits, then you should certainly buy it. I just rarely see that outside of support systems like HR systems, because as soon as it's about the actual business of an organisation, then it's never similar to others. Because if it was, then there likely wouldn't be a company to begin with, as you can't compete by doing exactly what others do. At least not until you're big enough to stagnate.


> At least here in Denmark, companies like Lego and Vestas are doing some really groundbreaking money making at a much lower cost, by not going to "standard" systems for everything.

Do you have a link to an article about this somewhere?


Maybe I’m being overly pedantic, but… these are questions DevOps engineers should be able to answer themselves because they’ve contributed to the answers. I understand that DevOps has basically become a euphemism for ops + automation complexity that requires product-equivalent engineering talent + arcane knowledge of a zillion cloud vendors’ … everything. But can we go back to calling that ops?

I actually liked the DevOps-as-in-devs-also-ops as a forcing function to keep deployment relatively simple because it’s very low on the core competency/value proposition spectrums. It also has the benefit of rewarding companies for making that feasible at the expense of a tiny fraction of the cost of dedicated ops roles.


I work as an SRE and while I agree with the "list of questions" as a general template for collaboration, I strongly disagree with the point that developers "own" the applications.

If you work in the same company, you all own the application. The customers don't care that you're "only" the SRE, or "only" the sales guy. This type of attitude is toxic and should be challenged categorically.

If you, the SRE, do not have the information needed (i.e. the "list of questions") then it's as much your responsibility to ask for it as it is the developers jobs to help you answer it.

If you feel that the company culture makes it impossible for you to create these necessary processes so that everyone have the information they need, you need to either work towards changing that culture or get a new job.


This list is exactly what we try to deliver to operations in our firm. All very reasonable.

You know why you "rarely get an answer for straight away"? I assume because they are working on the next ticket/delivery. A lot of this stuff is not estimated properly. A way to get it estimated properly is to work with the devs, cooperatively.

This said, for some reason, this blog post seems adversarial and gives me a bad vibe. Instead of "List of questions I’d like to get an answer from devs", it should be "we should work together to get these things done".


This is exactly the sort of requirements list that a dev group would receive from an ops group back when ops were systems administrators and network engineers.

And I am not objecting to it in the least; these are all good and vital questions.

I am objecting to anyone claiming that DevOps is anything other than "using the kinds of tools that help software development projects to help operations", and I present this as absolute evidence.


This sentiment seems related to an observation I've been making more frequently as of late on this topic.

Before DevOps was en vogue (i.e. was a descriptive term more so than a buzz word), the whole premise was to collapse the bulwark between engineers and sys admins. All SWE's should care about how their application is deployed, monitored, and scaled in production. This leads to far better application engineering outcomes in most efforts in which I've been involved.

The end result of those efforts was often, but not always, engineers writing some amount of operations tooling themselves.

But now we've come full circle. There is a ton of operations tooling you can pull off the shelf, and those tools are generic/complex enough to require administration. So many DevOps roles now as a result, particularly in larger orgs, are mostly administration-focused and less so about building the tooling itself.

It feels like we've reinvented the bulwark we tried to escape previously. There's an open question as to whether, from a practical perspective, we still have gained a net win there irrespective of the logical separation between eng and ops. I'm not sure where I've landed yet on that question.


My answer to about half of those questions would probably be: "How would YOU like it to work. You are the expert on our systems and I would like to know what you consider best practices. Give me some guidance on how we run things here and I will do my best to set it up that way. If my application is very special and need special considerations I will contact you to figure out a way that works for both of us"


I moved from full stack eng to SRE/DevOps a couple of years ago but have the least enjoyable role of straddling the two. And while I think this post surfaces some good points I can tell you that deep in the heart of every SRE/DevOps engineer that didn't come from a software dev background -- all they truly want is to get paid 250K a year to administer a system that literally does nothing and thus never breaks and this desire is the subtext to and informs every interaction they have with the engineering team.


I personally think it’s very difficult to be an SRE who doesn’t come from a software engineering background. To me an SRE needs to come from a background that includes extensive coding and architecture experience.

To me, the worst SREs are the folks who come from the DevOps side whose experience is limited to pipelines and infrastructure as code type stuff. They invent solutions that just don’t work.


This is why I do programming tests before hiring SRE/DevOps for our team. I don’t really care how well you know Ansible or Kubernetes, or how long you’ve been operating software components. If you can’t think like a software engineer you can search for a job somewhere else.

In my experience DevOps coming from a sysadmin background end up poisoning the well. They’re afraid of developing the right kind of abstractions, block any proposals to do these and before you blink you end up with a mediocre VendorOps team that can do nothing but integrate off the shelf solutions with an unmaintainable mess of yaml / hcl “code”.

They will replicate, the more capable DevOps engineers will leave in frustration and your platform will be taken hostage. Don’t let a single one in.


Yes I’ve been in three places like that now that demand coding skills of their “devops/sre” candidates then proceed to saddle them with ton of ops and resist any new software from being written (the “no software is best software” crowd). Then people are wondering why they have to pay $$$ for these roles because “all they do is apply some terraform” lol

Fwiw I completely agree with your second paragraph and have seen it first hand in my last couple of gigs but I don't think ICs are to blame here - fish usually rots from the head down.


and every dev just wants to get paid $250k/year to write some code that solves the problem on their machine, hit deploy, close their laptop and go home.

Random stereotypes might be funny but they are not useful in getting stuff done.


On the contrary, knowing what everyone's incentives are is a big help.

Although in my experience most devs would rather see their code all the way to production; the problem is their line manager wants them to tick off the ticket and move on to the next one as early as possible.


> and every dev

Friend, I have a whole laundry list of issues I have with devs, but this post isn't "things devs want from devops".

> Random stereotypes might be funny but they are not useful in getting stuff done.

You say "stereotype"... I say cynicism, which is ultimately what qualified me for the SRE role.


I am very opinionated about what SRE and DevOps own vs what devs own; and, I didn't really have anything negative to say about my (admitted) skim of the article.

As an SWE, I want to and need to know how to provide metrics on my system to be able to understand its health, and I should have good safeguards in place, or at least have communicated with the SREs what I need to provide to them to help them have good safeguards in place, to make sure the application keeps running. If the application goes down, it's my responsibility to make sure it's not my fault (bug in application code) that caused the system to fail.

What I, an SWE, want out of an SRE, though, is infrastructure management. I want to be able to ask them for some queues, and for a redis instance with high availability. I want them to set up the Kafka cluster, the database. I want us to have a conversation about where the secrets are to be stored. I want to be able to ask them what I need to do in code to get a secret and use it. I want them to be able to give me a good template for k8s deployments - or maybe to pair with them, given the docker containers and sidecars I need for a deployment and the projected scaling I'll need and come out with a best-practices set of k8s deployments.

I would be grateful if they monitor the database for some horrible queries; and, use their knowledge of which deployments made that bad query, to file a ticket to the right team so they fix their code or add an index or whatever is necessary.

Infrastructure, be it k8s or nomad, configuring redis, making rabbitmq highly available, configuring and organizing (especially organizing) k8s deployments into something sane and logical, and so many other things related to infrastructure are as specialized of skills as writing high-performance or unusually architected, large systems. I've seen the systems that come up when SWE-on-assignment create infrastructure; and, I've seen the literal years of work SREs have in their backlog to fix it with best practices.

It's similar to front-end developers: it's an entirely different skill set; and, while each person in each tier can stumble around in the other tiers, it's way better if we are all there, working together toward a common goal, and especially focusing in the areas we have each specialized our craft.

addendum: of course there are exceptions; but I think those exceptions are 1 in 100 or 1 in 1000.


> I would be grateful if they monitor the database for some horrible queries; and, use their knowledge of which deployments made that bad query, to file a ticket to the right team so they fix their code or add an index or whatever is necessary.

What you're asking for is:

  a) SRE wrote an alert for slow queries, because slow queries affect all shared infrastructure users
  b) SRE gets woken up at 2 AM by his alert
  c) SRE sees Dev wrote a bad query
  d) SRE files a ticket for Dev
  e) SRE goes back to sleep
  f) SRE gets woken up at 2:30 AM by his alert, because the query fired again
  g) SRE has a restless night
  h) SRE goes into work, asks Dev to prioritize the ticket to fix the query because sleep
  i) Dev tells SRE that maybe it'll get into the next sprint, in the meantime the alerts are SRE's problem
No thanks.


There are ticketing paradigms that don't require alerting in the middle of the night. All you need to repro and fix a bad query is the query itself and/or an explain of the query itself, and you can work on it during business hours.

I'm sorry you've had bad experiences with alert-all-the-things ticketing paradigms.

addendum: in fact, you can automate all of this so it just shows up in a team's "known issues" query that they may manage in standup or sprint planning.
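
A minimal sketch of what that automation could look like, assuming you already have slow-query records tagged with the owning team (for instance derived from the deployment that issued them). The `SlowQuery` type, the `known_issues` function, the 1000 ms cutoff and the sample rows are all hypothetical; a real setup would read from the database's slow query log and write to whatever ticketing system the teams use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SlowQuery:
    team: str          # owning team, e.g. derived from the deployment name
    query: str         # the offending SQL text
    duration_ms: int   # how long the query took


def known_issues(slow_queries, min_duration_ms=1000):
    """Group slow queries per team, keeping the worst duration per query."""
    worst = defaultdict(dict)
    for sq in slow_queries:
        if sq.duration_ms < min_duration_ms:
            continue  # fast enough, not worth a ticket
        previous = worst[sq.team].get(sq.query, 0)
        worst[sq.team][sq.query] = max(previous, sq.duration_ms)
    # Sort each team's list slowest first so it is easy to triage in standup.
    return {
        team: sorted(queries.items(), key=lambda item: item[1], reverse=True)
        for team, queries in worst.items()
    }


log = [
    SlowQuery("billing", "SELECT * FROM invoices WHERE note LIKE '%x%'", 4200),
    SlowQuery("billing", "SELECT * FROM invoices WHERE note LIKE '%x%'", 5100),
    SlowQuery("search", "SELECT id FROM docs ORDER BY random()", 1500),
    SlowQuery("search", "SELECT 1", 3),
]
report = known_issues(log)
```

Each team ends up with one deduplicated entry per bad query, and nobody gets paged for it.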


I still have the proverbial nightmare from hearing at 2 am: "Hello, this is Nagios" in a creepy robotic voice (we had it set up to go through Asterisk with a TTS spelling out what the alert was about, but it always started with that prompt. Free (as in beer) TTS voice packs in 2010 were .. interesting)


> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

Like this is the single biggest truth in the article, and I'm glad to see it stated so clearly. Shout it from the rooftops, please. It's a direct logical consequence, too — and yet, so many people seem to make decisions that violate this truth.

I field so many questions about "why is service X doing Y?" Have you asked the service owners?

Unfortunately, I've found one more or less has to become proficient in rapidly understanding services you don't own, because getting other people to act logically is a fool's errand.

> Are you logging to stdout ?

Nooooo to stderr, that's literally what it is there for. (As C says, "for writing diagnostic output". Logs are that.) Also, it is sometimes buffered and you don't (IMO) really want that.

Any output producing program requires stdout for the output, and you can't co-mingle logs with that and have piping still work. While it is unlikely that your production service is producing output, there's no reason to do anything different with the logs. (I'd say a part of being a good production service is "don't be needlessly special".)

(But our tooling will just capture and mux the two streams together, too, so it doesn't matter, unless buffering means the error logs don't make it right before your service is killed.)
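
To make the stdout/stderr split concrete, here is a small sketch using Python's standard `logging` module: program output goes to stdout so it can be piped, diagnostics go to stderr. The `make_logger` and `emit_report` names are just for illustration.

```python
import logging
import sys


def make_logger(name="svc"):
    """Logger whose records go to stderr, leaving stdout for real output."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        # Look up sys.stderr at call time so redirection is respected.
        logger.addHandler(logging.StreamHandler(sys.stderr))
    return logger


def emit_report(rows, logger):
    """Write results to stdout (pipeable) and diagnostics to stderr."""
    logger.info("writing %d rows", len(rows))
    for row in rows:
        print(row)  # print() targets sys.stdout by default
```

If a script built on this is piped into something like `wc -l`, only the rows are counted; the "writing N rows" line still shows up on the terminal because it never touched stdout.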

Also, your infra team provides the metrics service, but you need to capture your own metrics. My metrics provider does not have a crystal ball, it cannot peer into your service's memory and pull out critical stats. You must push them yourself. Talk to your infra team, they can show you the API they use… (We collect common, machine level stats, like "CPU in use" or external things about your service that are easily visible, like per-container memory usage. But not your reqs/sec.)
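
As a rough sketch of "capture your own metrics": the service counts its own requests and periodically hands a rate to whatever client the infra team provides. Everything here is an assumption for illustration, including the `RequestMeter` class, the metric name and the `push` callable, which stands in for the real metrics API.

```python
import threading
import time


class RequestMeter:
    """Counts requests in-process so the service can report its own rate."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._count = 0
        self._window_start = clock()

    def record(self):
        """Call once per handled request."""
        with self._lock:
            self._count += 1

    def flush(self, push):
        """Compute reqs/sec since the last flush and hand it to `push`.

        `push` is a stand-in for whatever client the infra team provides.
        """
        with self._lock:
            now = self._clock()
            elapsed = now - self._window_start
            rate = self._count / elapsed if elapsed > 0 else 0.0
            self._count = 0
            self._window_start = now
        push("requests_per_second", rate)
        return rate
```

The request handler calls `record()`, and a background timer calls `flush(...)` every few seconds. The point is only that the number originates inside the service, since nothing outside the process can see it.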


> Nooooo to stderr, that's literally what it is there for.

Bah… use syslog() (or whatever uses the same protocol) and then you get priority, name of the daemon… and if you step it up to journald, then you get to log key:value stuff.

Of course most golang developers have never heard of syslog() and think that logging is done with stdout and then a bunch of parsers to extract information that was there to begin with, had they used a proper logging.
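
For context on what syslog gives you over a bare stream: in the BSD syslog protocol (RFC 3164) every record starts with a priority value computed as facility * 8 + severity, followed by a tag naming the program. The sketch below only builds that prefix by hand to show the idea; the timestamp and hostname fields are left out, the helper name is invented, and real code would call syslog() (or Python's `syslog` module) instead.

```python
# Numeric codes from the BSD syslog protocol (RFC 3164).
FACILITY_DAEMON = 3
SEVERITY = {"err": 3, "warning": 4, "info": 6, "debug": 7}


def syslog_line(daemon, severity, message, facility=FACILITY_DAEMON):
    """Build the '<PRI>tag: message' part of a BSD syslog record.

    Simplified: the timestamp and hostname header fields are omitted.
    """
    pri = facility * 8 + SEVERITY[severity]
    return f"<{pri}>{daemon}: {message}"


print(syslog_line("myapp", "err", "database unreachable"))
```

So the severity and the daemon name travel with the message itself, and nothing downstream has to parse them back out of free text.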


We can argue about exact implementation, but SRE demanding all apps (assuming non-interactive ones run on servers) log to the same channel (with enough tagging info to ID the app/server) is a good thing. In my case we are automagically configured with a Splunk appender for Logback (and the platform also sends the stdout/stderr to Splunk under a different sourceType).


Ah, yeah. My BE work is almost exclusively inside containers, where journald is not available.

(We could perhaps arrange that, but JSON-lines is typically good enough, and easier for devs to understand.)
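
A minimal sketch of JSON-lines logging with Python's standard `logging` module, one JSON object per line on stderr. Passing key/value context through `extra={"kv": {...}}` is a convention invented for this example, as are the class and function names.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Made-up convention: context is passed via extra={"kv": {...}}.
        payload.update(getattr(record, "kv", {}))
        return json.dumps(payload, sort_keys=True)


def make_json_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (default stderr)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    return logger
```

Usage would look like `log.info("order %s created", 42, extra={"kv": {"order_id": 42}})`, and the log pipeline gets the fields without any regex parsing.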

(Note that the KV stuff requires you to speak journald's protocol: syslog in systemd (and really, everything I've ever seen speak syslog) is the old BSD syslog protocol, which doesn't support KV data. Not that journald's protocol is particularly hard to speak.)


Cool, turn it into a set of requirements and put up as part of the definition of done.

Questions in this form always seem condescending. Like “I‘m smarter than you, I thought about it, you didn’t”.


This is exactly what I came to express.

If this isn't standardized in an organization it should be. Otherwise, it's the same repetitive questions, the same finger pointing, and the same miscommunication. If these are the requirements needed to put a service into production, then make it explicit. As the developer, of course I own the service, but (usually) don't have the access. Standardized as requirements, both teams can work together to produce, monitor, and troubleshoot production services smoothly. Then nobody is surprised when it is release day, and asked these questions with an impatient PM who has already publicly set expectations.


Spot on. Otherwise it’s all virtue signalling.


This comment section was exactly what I expected. A mirror of how most folks in the trenches discuss these murky boundaries.

  * SRE/DevOps folks stating the person that wrote the application has the knowledge to debug it.

  * Devs saying that it's SRE/DevOps job to debug it

  * Lots of comments on culture and you should do X

I know most people like the whole grassroots thing, but the only shops I've seen that are actually killing it are the ones who dictate these boundaries and responsibilities from the top down. And I've seen a lot of shops.


This is completely backwards. As someone that has been an SRE and DevOps engineer.

Almost all of the questions can be simply answered with: "This is a NFR that was created by SRE".

The important thing is to collaborate with each team and be there when architectural and design decisions are being made in the first place!

All of these questions are post-hoc, coming after the thing has been built. You would never need to ask these questions, if you help drive initial design.

Embed yourself with your teams. Ask to be part of design discussions. Remember: 50% eng 50% ops. You have no excuse!


> Embed yourself with your teams. Ask to be part of design discussions.

I agree that this should happen, most successful projects have people with all sorts of knowledge contributing to it, without too many silos in place.

> You have no excuse!

However, the Ops people don't always get that power or a say in the matter. In many dysfunctional environments they'll simply be given an apparently finished service and will be told to put it in prod.

Please don't dismiss that these circumstances exist altogether and don't shift the "blame" exclusively on the people who already have their lives be needlessly hard, this isn't likely to encourage a positive outlook.


> Please don't dismiss that these circumstances exist altogether and don't shift the "blame" exclusively on the people who already have their lives be needlessly hard, this isn't likely to encourage a positive outlook.

Au contraire, OP was blaming engineers with a "holier than thou" attitude. That exact attitude is the kind of thing that leads to the dysfunctional environments that you speak of.

Should SWE consider SRE at design time? Absolutely. Should SRE consider SWE at design time? Absolutely.


It seems that these questions should basically be answered once per company.

All services should have common health endpoints and shutdown operations.

Logging should be standardized across all the services of a company.

Having bespoke answers to these questions for each service will rapidly devolve into chaos, when you have multiple services deployed.
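To make "common health endpoints, shutdown operations and standardized logging" concrete, here is a minimal sketch of what a shared service template might contain, using only the Python standard library. The `/healthz` path, the JSON log fields and port 8080 are assumptions picked for illustration; nothing in this thread or the article prescribes them.

```python
import json
import logging
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/healthz" is an assumed company-wide convention, not a standard.
        if self.path == "/healthz":
            status, payload = 200, {"status": "ok"}
        else:
            status, payload = 404, {"status": "not found"}
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Route access logs through the shared logger instead of raw stderr.
        logging.getLogger("access").info(fmt % args)


def make_server(port=0):
    """Build the server; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)


def install_sigterm_shutdown(server):
    """Stop serving cleanly when the process receives SIGTERM."""
    def on_sigterm(signum, frame):
        # shutdown() waits for serve_forever() to exit, so it must run
        # on a different thread than the one serving requests.
        threading.Thread(target=server.shutdown).start()
    signal.signal(signal.SIGTERM, on_sigterm)


def main():
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    server = make_server(8080)  # assumed port
    install_sigterm_shutdown(server)
    server.serve_forever()
    server.server_close()
```

Calling `main()` starts the service; every service built from the same template then answers the health, shutdown and logging questions the same way, which is the point of answering them once per company.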


Are you really DevOps if you need to write such rants? Are you really DevOps if your company has Devs?

I thought that DevOps, by definition, is developer and operations in one. You wrote the service, you support the service, and there is no boundary, so there is no such problem as described in this text, by definition.

DevOps complains about a problem, the proposed solution for which is to be DevOps...


"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." ( Melvin E. Conway, https://en.wikipedia.org/wiki/Conway%27s_law )

This is unfortunately the death knell for DevOps organizational teams on large projects. Primarily, the design specification usually ends up being hammered into the inherent dysfunction the project was intended to solve in the first place.

Best of luck =)


I agree with some of the points, but on the other hand most organizations do not really empower SDEs to reason about architecture. Things like production budgets and production grade monitoring and observability are usually owned fully by SRE/devops, and if some enterprise architect type is involved devs won't even own the spec. At those places, devs can at best make a wild guess of what the expectations are. Responsibility should be proportional to power.


Yeck. I read—then skimmed—through this article. Do others have the same “another mediocre engineer turned manager who I detest?” Don’t want to work where this guy works.

The first sin they embark on is framing their argument, in part, as one of titles/labels. This is usually an institutional smell. And it’s not a pretty odor.

The second is that the person believes their role is to question others. It’s a move that insecure people play. The idea is that you keep your opponents defending themselves against questions you define, and that means there’s no time to address some of the hard questions that might circle your own “role.”

It sounds like the guy feels he knows the answers. If so, why doesn’t he jump in and do them? If he knows better how to do this SRE thing as defined by him, clearly his company has pulled a Peter principle, promoting him from something he did well, to a position where he now harps on others using their nostalgia. Value may have been lost. If he’s really that good, we can use him in the trenches. If not, he’ll learn how to try to explain why some of these PHB questions are actually hard to answer and execute.


I've always hesitated when there are large pushes coming from DevOps. What if all I wanted to do was code and not work on yaml files? I also didn't sign up to be paged in the middle of the night. Some of the points are valid in the article but like some others the tone comes off as hostile.

Truthfully, oftentimes I don't understand how things behave in a production environment.


If you don’t transfer the knowledge, help setup the monitoring/alerting cases, write runbooks for others, periodically review metrics, you absolutely should be paged for a critical outage. Devs can’t just write code and throw the operational part over the wall.


If you want me to be on call you'd better be paying top dollar for it. If I wanted to be responsible for a whole business I'd be working for myself rather than an employer.


No this is all management's problem in any decent sized company. Individual developers don't have full autonomy over what they work on. If management doesn't want to spend dev time on helping operations, then tough luck.

I suggest asking about these things in the hiring process.


Where does it say you'll get paged in the middle of the night?

The questions are actually to prevent that. They are making sure somebody can possibly diagnose and fix a problem that involves[1] this service without having to call you up.

[1] Note "involves" doesn't mean your service is the cause, it might be the previous or next one in the chain.


It was just an example. I've had DevOps engineers say they want us to feel the pain to make a point.

The questions are great for sure.


> I've had DevOps engineers say they want us to feel the pain to make a point.

And thus you'll have an incentive to improve the quality and lower the pain. Of course that's easier said than done, and there may be other factors at play (e.g. priorities set by someone else), but at least if the misery is shared those who can do the most to fix it are fully aware of it.


> I've had DevOps engineers say they want us to feel the pain to make a point.

Yes? What, you want them to keep the pain even though they don't cause it?


The code was not causing issues.


They signed up for it, I didn't.


I’m hearing that what they really need is a Developer who understands operations so the “DevOps” guy just has to take care of operations?

That was supposed to be the definition of “DevOps” in the first place. At any company that has a DevOps role, it is really going to be an operations role by another name.


Honestly if you keep to https://12factor.net/ the only time an SRE will page you is when there’s a cryptic custom error with no runbook.

If only I had a dollar for every time some program dereferences a null.


I think these are good questions to ask but IME SREs are expected to learn and even contribute to these either as they onboard to their team or as they take ownership of reliability for a particular service.

The way this is phrased, it sounds like the author is managing reliability for things where they don’t already know the answers to these questions nor do they have the context or bandwidth (or even access?) to answer it themselves. Seems like a recipe for disaster, or at the very least, a lot of frantic learn-as-you-go.

That said, as a dev, I do think we could do a lot better adding playbooks. Though on the other side of the fence, they’re often ignored with a “I don’t know what’s going on and you wrote this, can you help?”


Yeah, this person isn’t really an SRE if they are asking devs for these questions. SREs should be absolutely capable of doing all of this themselves in the codebase.


I find it interesting that once upon a time, DevOps was a way of doing / organizing things, not a role per se. It slowly morphed back toward systems administration (the things it was intending to replace) as a role, and SRE was a kind of sub-set role of both. Recently, I'm starting to see this SRE / DevOps abbreviation, described as a one and the same common role. So I guess all that is old is new again, just renamed?


This doesn't read like an SRE perspective, it reads like a classic SysAdmin perspective. Which, while useful, is a very different role.


It seems like there is a lot of disagreement and discussion about the role of SRE vs Devs. My team is responsible for our own operations (we are Dev and DevOps, no SRE team), this is a great list of questions to ask ourselves during the planning and estimation phase before we build our new stuff.


> If you’re a Software Engineer/Developer, then consider that a service (at least, for me), is a piece of code running in a live production system, that YOU wrote, only YOU know how it works, thus YOU own.

SE owns the code, but SRE owns the running code

Other than that, I agree with everything in the post


> Are you using gRPC or REST

Is most likely "Are you using gRPC or JSON over HTTP".


Devops is dead?


Again?


Lol if you don’t know these things, your company has Ops, not DevOps


What do you have if the devs didn't bother to implement half these things before deploying to production?


An HTTP 500, 502 or 504 error usually.


I thought the principle of DevOps was “the person who wrote a service should be required to perform the ops duties for it” to force people to learn why these behaviors are necessary.


Next in the trilogy: Things I want from SRE/DevOps as a SysAdmin.

- What specs of a VM do you require?

I'll assume that 16mb of RAM and 512mb of drive space running Slackware is suitable, operating from a 1.44mb floppy.

- What do I do if it doesn't compile?

It works in DevLand, so I assume it'll work anywhere. No, you can't growl at me, you asked for Linux and I gave you Linux. Documentation please.


> - What specs of a VM do you require?

What are my options? How much do those cost the company?

I can run N requests on my laptop with specs of X. CPU usage hits 100% and memory usage hits 50%. This is 10% of our projected load. Therefore I project that I will need 10X CPU and 5X memory resources for this system for our entire userbase. My laptop is pretty powerful, but there are systems with 320GB of RAM and 80 cores. Do we get 2 of them?

Oh, what you have is a bunch of 3 generation old 28 core Xeons with 128gb of RAM. Can I get access to one of them to test throughput? One Xeon core is not as good as one macbook core, especially not that old, so I really can't make any promises of perf without testing on representative hardware. No? Fine, let's add 30% buffer for poor IPC on old hardware.

Oh, you've got a custom wrapper for Java that launches it with a bunch of custom JVM options that the SRE org mandates everyone uses? Got any documentation on what those are?

What's our lead time for getting more? We need to know that for an idea of how much buffer we need.

So apparently lead time is six months? Ok, what's our load going to be in 6 months? I'll ask bizdev how many customers they think we'll have in six months. Oh, they answered with a vague "We want to have lots of users". Fine, we'll add an extra 100% buffer.

What, the operational efficiency team are pissed our VMs are at 25% usage?

While there are certainly dev teams that throw capacity management over the wall to SRE, the inverse is also nonsense as it's the SRE team who usually get informed of hardware options, deployment standards (especially in bigco) and company operational standards.
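The back-of-the-envelope sizing in this comment can be written down as a small worked calculation. This is only a sketch: it assumes load scales linearly from the laptop test and that the 30% and 100% buffers compound multiplicatively, neither of which the comment states. The function name and parameters are made up for illustration.

```python
def required_capacity(measured_share, cpu_util, mem_util, buffers=()):
    """Scale one test machine's utilisation to full load, then add buffers.

    measured_share: fraction of projected load the test covered (0.10 = 10%).
    cpu_util, mem_util: utilisation seen in the test, as fractions of one machine.
    buffers: safety margins applied one after another, e.g. 0.30 means +30%.
    Returns (cpu, mem) in units of "one test machine".
    """
    if not 0 < measured_share <= 1:
        raise ValueError("measured_share must be in (0, 1]")
    # Assumption: resource use grows linearly with request volume.
    scale = 1.0 / measured_share
    cpu = cpu_util * scale
    mem = mem_util * scale
    # Assumption: buffers compound (each one multiplies the running total).
    for margin in buffers:
        cpu *= 1.0 + margin
        mem *= 1.0 + margin
    return cpu, mem


# Laptop test: 10% of projected load, CPU at 100%, memory at 50%.
baseline = required_capacity(0.10, 1.00, 0.50)

# Same test, plus 30% for older hardware and 100% for six months of growth.
padded = required_capacity(0.10, 1.00, 0.50, buffers=(0.30, 1.00))
```

Under those assumptions the baseline matches the comment's 10X CPU and 5X memory, and the two buffers push that to roughly 26X CPU and 13X memory, which shows how quickly stacked guesses inflate a request when no one supplies real numbers.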


> What are my options? How much do those cost the company? Do we get 2 of them?

What options do you want? You come to me with the specifications, and I'll get back to you if you can. I'll work with you to get what you want.

If I am the one with the power to create, the gatekeeper to the virtualization cluster, then a ballpark figure makes my life so much easier. It makes my justification easier too. "Hi Manager, X needs this. I think it needs this. I'm going to set this up and evaluate. No, manager, I don't think you're correct."

"Here's my tech specification for what I'm setting up. Here's the documentation for configuration and setup, and these are the results." Let's ride.

Let me give you 8x core and 64GB, see how the performance spikes and go from there. I'm not stingy and always happy to give more to test performance and then decrease if it's overkill. But don't dispute it if I start to take some away because of that.

> what you have is a bunch of 3 generation old 28 core Xeons with 128gb of RAM. Can I get access to one of them to test throughput?

Sure. Yes you can. Why do you think you can't?

> Lead time

As fast as you require me to set up the VM. If I have all the docs, I can fly by and have this thing set up in under a day at minimum. Heck, I'll even work unpaid overtime to get this for you. I can escalate this on the fly. I'm in the good books with the NetOps, I even know the backup-ops. Tough crowd to please every time, but I manage.

Production lead time? Sure, probably six months to get stake-holder approval and the rest, but I'll try to get it sooner.

Prototype lead: a week.

Just give me figures and I'll do the rest. Enterprise or not. I'll push for what you want but you have to work with me. However I need the figures and documentation before I can. I can't be seen creating the documentation on my own time based on made-up figures when I have an estate of 100 VM's needing security patches. That makes me look bad and if it fails the tests, I'll get the blame.


Most of the suggestions in the article seem like sane things that most projects out there should have (not saying that the ones I have to help out with always do, but rather that I wish people cared enough about those), otherwise you're shipping what's essentially a black box, with relatively little insight into how well it works (or why it suddenly doesn't).

If you make the right technology choices, most of those should also be pretty doable (not necessarily easy, but not overwhelmingly hard):

  - enable some basic healthchecks through configuration value in Spring or whatever you're using
  - then do curl requests against that (or even the HEALTHCHECK command with containers)
  - have something standardized take care of application lifecycle (like systemd services or once again containers)
  - add some instrumentation along the way with something like Sentry or Apache Skywalking
  - integrate with OpenAPI through whatever framework you use (so you don't have to write everything manually, but can use codegen)
  - don't be lazy and write tests for your code, at least the parts that are easy to test (e.g. business logic, not the low level JSON serialization)
  - hopefully have a few Markdown files that describe your service

Of course, those aren't strictly necessary for things to run (hence a lot might be called day 2 concerns), which many will use as an excuse not to care about it, as long as they can get something shipped and it seems to work now. I've seen that more often than I'd like, given that I'm sometimes the person who gets called in to fix the eventual issues.
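As an illustration of the "curl against a healthcheck" item in the list above, here is a rough stand-in for that curl call that could also serve as the command behind a container `HEALTHCHECK`. It assumes the service exposes Spring Boot Actuator's `/actuator/health`, which reports `{"status": "UP"}` when healthy; the default URL, port and timeout are placeholders.

```python
import json
import sys
import urllib.error
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True only if the endpoint answers 200 with status "UP"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors and bad JSON.
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def main(argv):
    # Assumed default: a Spring Boot service listening on localhost:8080.
    url = argv[1] if len(argv) > 1 else "http://localhost:8080/actuator/health"
    # Exit code 0 means healthy, 1 means unhealthy, as HEALTHCHECK expects.
    return 0 if is_healthy(url) else 1
```

Wiring it up is a matter of ending the script with `sys.exit(main(sys.argv))` and pointing the container's healthcheck at it. Treating every failure mode as "unhealthy" is a deliberate choice here: whoever is on call cares that the probe failed, not which exception caused it.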

An excellent group of suggestions for developing applications in ways that can limit headaches is the 12 Factor App site, which can help you avoid some issues ahead of time (it covers configuration, ports, logs and other concerns): https://12factor.net/

Ideally, everyone who has the applicable skills or knowledge would collaborate and work together towards having their software both work now and also keep working, with insights into potential problems ahead of time. And if your technology stacks are sufficiently boring, there's no reason why a lot of that knowledge couldn't be encapsulated into a few concise Wiki pages, Markdown files, code snippets or project templates - so anyone who needs to ship a new service in your org can grab one of those and get up and running quickly and properly.


> Ideally, everyone who has the applicable skills or knowledge would collaborate and work together towards having their software both work now and also keep working

However, we don't live in such an ideal world. I'm given tasks such as "set this up for devs for release of X" with nothing specified. How am I supposed to complete the request? I now have to waste my own time chasing managers, devs and everybody else to set up what's required. My resources are finite, I don't have unlimited resources nor a budget to waste on virtual machines. We use a multitude of OS versions ranging from CentOS to Debian, including FreeBSD and Solaris, so what do you want?

This is SRE playing lax on the case of not defining and just expecting it. I don't have any access to the DevKit side of things, I don't have the ability to find out what is actually required.

You want it to run in production, fine. But it makes my job insane when no documentation is provided and a JIRA ticket consisting of some attachment with "this please" is handed to me. Something breaks and I am the first to get the brunt because it's "my fault". I have to waste time debugging, when it turns out the software isn't designed for what I thought was suitable for the project, because someone decided to use some old version of a Ruby gem that's been cross-compiled and which I had no clue about.

No one thinks about SysAdmins when they're running the actual show. If it wasn't for them, you wouldn't have production, uptime and all the stuff we do. I can recite so many stories from many different jobs where this has been the case. But all the same, HN is biased towards developers and doesn't take into account the flow required for their product to reach production.

/end vent. But you're not allowed to do that on HN, because you get thrown downvotes for expressing issues with how the teams integrate.

16 years of SysAdmin, 33 and I'm more than burnt out because of. The first thing I do every morning is decide the colour of the ethernet cable I wish to desire to hang myself with. I take pride in my work yet done dealing with the shit that's given, yet you still need to lick the plate because life. I've just had three months off, just exhausted all my savings and starting my new job on Monday. But hey-ho, maybe this company has stuff right.


> I'm given tasks such as "set this up for devs for release of X" with nothing specified. How am I supposed to complete the request?

The cut-and-dried answer is: you aren't, because you cannot. What should happen is a polite response to the request along the lines of:

  "Insufficient information has been given for the release. Please see the attached Markdown template with the information that needs to be filled out. This request is on hold until this is done. You can submit any suggestions to improve the template, or reach out with further questions to ..."

> This is SRE playing lax on the case of not defining and just expecting it. I don't have any access to the DevKit side of things, I don't have the ability to find out what is actually required. You want it to run in production, fine. But it makes my job insane when no documentation is provided and a JIRA ticket consisting of some attachment with "this please" is handed to me. Something breaks and I am the first to get the brunt because it's "my fault".

So essentially you have all of the responsibility, without any of the power to actually do your job because of the circumstances that you're pigeonholed into? If you need the money, then I guess that's what you have to tolerate, but otherwise some lines should be drawn somewhere. In most cases, adding documentation/instrumentation and requiring it going forwards would be a good idea that any sane organization would get behind and support: especially if you can reference all of the past incidents that this would have helped guard against.

Quite frankly, that sounds like a dysfunctional environment and absolutely nobody would fault you for quitting a year into it (or even not waiting that long), in search of something better.

I suspect that a part of the problem is that in our market, we have mindsets that don't go beyond any of the following goals:

  - we got paid, regardless of what works or doesn't (can be seen in consulting)
  - we shipped something to meet a deadline and not get contractual penalties, regardless of quality
  - we shipped something that seems to work, though we don't care about much else (day 2 concerns)

Sometimes it's because of ignorance and not knowing better, other times it's because there are cultural issues in the country as a whole, maybe a lot of people viewing development only as a step in the path to becoming a manager, instead of a craft that demands attention and care.

So what you get globally can be companies that range anywhere from "Hey, we want you to be comfortable and not overworked: here are some learning materials or a budget for that, here's our knowledgebase and an overview of our procedures and architecture decision records (ADRs), here's a user group for this particular technology, feel free to reach out if you need anything." to "Hey, ship the software by Monday. Why isn't it done yet? I don't care about the details, get it done."

Maybe I should write a blog post about that some day, just not sure what to title it: "OKRs/KPIs of caring about software" or something like that, probably. I suspect bad mindsets and a bad culture is one of the reasons for sites like this existing: https://devrant.com/ or rather why articles like this ring true: https://www.stilldrinking.org/programming-sucks

At the end of the day, do what you can to take care of yourself!



