Former Netflix engineer and manager here. My advice: Start a greenfield project ...

notJim · on Dec 16, 2021

> Refactor to separate microservices when either: 1) the team is growing and needs to split into multiple teams

I've heard this before, and I just don't get it. I've worked on multiple monoliths where hundreds of engineers contribute, and it's fine. You have to invest a bit in tooling and recommended patterns to keep things from going crazy, but you kind of need to do that either way.

> At 35-50 people a common limiting factor is coordination between engineers

So don't coordinate? If engineers working on different aspects of the codebase need to coordinate, that feels like something is wrong architecturally.

trunnell · on Dec 16, 2021

Ok, maybe a better way to say it is that having teams independently develop services is a good way to reduce the coordination tax if you have high coordination costs. If your environment doesn't have that problem I guess this doesn't apply.

Coordination between engineers was a frequent activity everywhere I've been regardless of how well built the systems were. For example: a new requirement for the customers signing up in a given country to have features X, Y, and Z enabled. In a large organization there are probably a few teams that will be involved that make that happen. The question is how to coordinate them.

Many companies try to solve it with top-down decision making, prioritizing certainty but hampering productivity (some teams have to wait) and strictly limiting risky innovation (nothing can be done without approval).

Independent teams (each developing independent services and acting without top-down approval) is a different way to coordinate development that values productivity (keeping everyone unblocked) and innovation (finding better ways of doing things).

> You have to invest a bit in tooling and recommended patterns to keep things from going crazy, but you kind of need to do that either way.

Aha, here's a difference. If we're talking about the same things, common tools and patterns don't need to be enforced. Independent teams can pursue very different patterns without needing to agree with each other. This is a big advantage if you don't like being told what to do by people who are pretty remote to the problem you're solving. Different people react differently to that. Netflix teams tended to be staffed with very experienced and skilled people (no junior or mid level positions) so there wasn't much value in one engineer dictating architecture, patterns, or tooling to another. Nearly all tooling was opt-in, and the good tools were the de facto standards. But if you came up with a better tool or pattern, you had the freedom to try it out. This is how independence fostered innovation, and why microservices were useful in such an environment.

notJim · on Dec 16, 2021

> Independent teams (each developing independent services and acting without top-down approval) is a different way to coordinate development that values productivity (keeping everyone unblocked) and innovation (finding better ways of doing things).

I've had the opposite experience. In the monolith, anyone can make the necessary changes, because it's all one codebase that everyone is familiar with. At most, you might need some help/pairing/approvals from experts in particular areas, but in general any team can work independently.

By comparison, in the microservices world, many teams either don't want you to touch their service, or are using a tech stack so unfamiliar to you that it would take too long to be productive. And there's a rat's nest of interdependent microservices, so you end up begging other teams to adjust their roadmap to fit you in.

> Independent teams can pursue very different patterns without needing to agree with each other.

I see this as more downside than benefit. If everyone is using different tech stacks, it's harder for people to move between and contribute to different teams. And you end up with situations where one team uses Java, while another uses Scala, which brings in extra complexity to satisfy what are essentially aesthetic preferences.

andrewmutz · on Dec 16, 2021

When you had hundreds of engineers contributing, how did you manage releases?

We have a large number of teams working on a shared monolith, and a large number of teams working with separately releasable services.

One of our main drivers transitioning to the latter is the productivity gains that the teams get when they can release just a single set of changes on their own and release them on-demand (and during business hours).

For us, we release the monolith nightly with everyone's changes in it (not to all customers, but concentric releases). We find that the teams that are able to release on their own are happier and more productive.

notJim · on Dec 16, 2021

At the place where this worked well, we released small changes throughout the day. We released the monolith around 50 times per day. Developers release their own changes in small groups. We deployed the main branch, and only used short-lived (< 1-2 weeks max) branches. Used feature flags to control when customers actually saw the features (as needed.)

IshKebab · on Dec 16, 2021

Presumably the idea is that you make all changes backwards compatible with the currently running version and continuously roll them out in a progressive manner (like 1% of users get the new version, then 10%, etc.).

deterministic · on Dec 18, 2021

What would stop you from splitting the monolith into libraries and having different teams maintain those libraries? It seems to give you the organisational advantages you want without paying the added networking complexity cost. The way I see it, a micro-service is a library + networking. Remove the networking and you have a library that can be compiled into a monolith.

salt-thrower · on Dec 16, 2021

This is exactly how I feel. Great to hear it from someone with Netflix experience. So, so many organizations jump headfirst into microservices before they even realize what that entails just because they heard it's the trendy new thing Netflix is doing.

If you make your code and architecture simple from the get-go, then you can refactor to microservices when you know you really need it.

AtNightWeCode · on Dec 16, 2021

What was the root cause of the dependency horror that Netflix created, now immortalized with all the scary data flow GIFs?

trunnell · on Dec 16, 2021

I assume you're referring to the visualizations like the ones halfway down this post? https://netflixtechblog.com/vizceral-open-source-acc0c32113f...

It wasn't really a horror, and those charts are a little misleading. I'll try to explain.

Plenty of others who were also there at the time might see it differently, but IMO this was a trade off to get higher productivity and higher resiliency at the cost of higher complexity. So it was a feature not a bug.

When the cloud migration began in 2010, we had a straightforward architecture of a handful of Java monoliths and a giant Oracle database all running in a local data center. The DVD service made all the revenue but the future was the streaming service. It would turn out over the next 10 years we would need to grow engineering headcount over 10x to support the growing business complexity. Some of that complexity was foreseen and the intent was to address it with a culture analogous to a cellular ecosystem.

There are many successful designs in nature that evolved to a complexity beyond our current understanding, like the human body or even the protein signalling pathways within individual cells. These systems tend to have excellent homeostasis and survivability in the face of novel stimuli. Notably, each cell is fairly independent: it can receive signals but it acts on its own, and its internal complexity can be hidden from its neighbors.

We created a culture where teams were independent units. The sayings were, "loosely coupled and tightly aligned," and, "no red tape to ship." This is the opposite of an architect making top-down decisions. Instead we prioritized local decision making within teams which tends to be fast and effective, so this worked out well for productivity. But the side effect is that the overall system grew a bit beyond the limit of any one person to easily understand it. Thus the automated tooling to visualize the graph. But the dependency graph itself was rarely a problem. Any given developer was usually able to trace all the requests they were responsible for all the way down the dependency graph, and that's what really matters for development and debugging. Generally, no one needed to grok the zoomed out picture -- even during outages, problems were typically root caused to a particular service, and not related to the whole dependency graph.

But the dependency graph makes for a compelling graphic, so it gets people's attention. The real story is, "productive organizational culture made of independent units."