Uber migrates microservices to multi-cloud platform running Kubernetes and Mesos (uber.com)
176 points by belter on Oct 22, 2023 | 218 comments



> The team used existing tooling to move services between zones in order to ensure they were portable. Firstly, they allowed services to be moved back to the original zone to resolve any portability issues, but once resolved, services would be moved periodically to validate portability and prevent regressions.

This is something that most companies don’t do when they say they want to do $x to “prevent lock in”.

Uber actually is testing for portability along the way.


It's probably more cost effective to negotiate a long-term max price with your cloud provider with a force majeure clause.


Unless you’re crazy enough to work with GCP, the “my cloud provider is going to lock me in and then raise prices” scenario doesn’t happen. AWS has only ever raised prices in a few very obscure cases.

One of which is putting a price on HEAD requests in S3 (?).

AWS already gives long-term price discounts/guaranteed prices via reserved pricing, and big customers already have negotiated contracts.


Price increases are just one way you can get screwed. You can also lose out when your provider doesn't drop prices or pick up operating efficiencies that other providers have.


And when has that happened with respect to either GCP, AWS or Azure at a level that it’s worth migrating?

Even if you have done everything in a “cloud agnostic” way, “infrastructure has weight”. Any large migration isn’t just technical; it involves project management, organizational training, regression testing, compliance testing, security testing, architecture review boards, vendor negotiations, firewall changes, coordination with third parties who may only allowlist certain IP addresses, data migration, etc.

Heck they often have multiple physical network connections to the cloud provider (Direct Connect)

Anyone who thinks they can run everything on K8s and have “cloud agnosticism” has never done a very large scale migration.

You would be amazed how long it takes to do a bog standard lift and shift of hundreds of plain-jane VMs and VM-hosted databases. You can’t get any more cloud agnostic than that.

source: I’ve done a few over the years in both the “real world” and working in the cloud consulting department at AWS (Professional Services). I no longer work at AWS and have no specific loyalty to AWS.


The other thing I'd add is cloud agnosticism doesn't scale. If everyone were prepared for it, there wouldn't be enough elastic capacity with other cloud providers. You'd need enough reserved capacity in another cloud to pull it off, but I guarantee you finance will say "no." What makes the most sense is multi-region work since it's more cost effective, and it's the more likely failure scenario.


There usually isn’t enough elasticity if a region fails. Even then, you really need to have reserved capacity and maybe even a hot standby or active-active setup.


>And when has that happened with respect to either GCP, AWS or Azure at a level that it’s worth migrating?

Egress price makes it worth migrating away from those three.


You’ve never been the one neck to choke when things go wrong have you? If Billy Bob’s cloud provider goes down, you are going to constantly be blamed for making a poor decision. If anything goes wrong they are going to question your decision.

If you choose AWS (or Azure) and a region goes down - everyone else is down too. “No one ever got fired for choosing IBM”.

Choosing the most popular vendor - AWS, Salesforce, ServiceNow, or whatever vendor is in the upper right of the Gartner Magic Quadrant - never gets questioned by the powers that be.


Even if the alternate cloud provider goes offline for an entire day it still would be worth it financially compared to AWS because egress is so expensive there.


And you ignored the entire reply, didn’t you? It’s naive to think at the “one neck to choke” level that all decisions are made for purely technical reasons.

And for you to just say “it’s okay to be down an entire day” because of egress cost tells me that you have never done infrastructure requirements analysis at scale.

First you have to assess the cost of being down for a period of time, then you have to assess RTO and RPO requirements. And not all workloads have high egress costs - especially things like data lakes that may have a lot of ingress and processing costs, but relatively low egress costs.

I’ve done a lot of different cloud projects over the years, from lift and shifts to data lakes to cloud call centers to serverless to ETL jobs. You can’t just blindly repeat “egress costs” in a vacuum without understanding use cases.


I never claimed that it is always worth it to switch because of egress costs, but that egress costs are a reason to switch. If I ran my sites on AWS it would 100x the cost of running them.


These were your words:

> Egress price makes it worth migrating away from those three.

> Even if the alternate cloud provider goes offline for an entire day it still would be worth it financially compared to AWS because egress is so expensive there.

You never qualified either with “in my particular use case”. If you had, I would have had no argument. I haven’t been flown into your company along with SAs, sales, project managers, etc for a week to do a proper “as-is” assessment and to see what your requirements are.

I haven’t assessed the competencies of your staff or determined what is your competitive advantage and what is the “undifferentiated heavy lifting” in your company.

I would never make any blanket statements without knowing your specific use case, or automatically assume “cloud” is always the right or wrong answer.


>And when has that happened with respect to either GCP, AWS or Azure at a level that it’s worth migrating?

This suggests you are looking for a single example where the pricing of the big 3 is comparatively high compared to the competition, to the point where it is worth it to switch. I gave the example that the price of egress is one cost which is not competitive. If I had instead said that SQS was not competitive, obviously that wouldn't matter to businesses that don't use it enough to make a difference.


I’m looking at it from more than just “cost of infrastructure”. You also have to consider reliability, managed vs unmanaged, the competencies and expertise of your team, organizational constraints whether you have a more or less static or dynamic workload…

Microsoft and AWS have versions of the “Cloud Adoption Framework”

https://learn.microsoft.com/en-us/azure/cloud-adoption-frame...

https://aws.amazon.com/cloud-adoption-framework/

And the TOGAF framework has something similar

https://pubs.opengroup.org/architecture/togaf9-doc/arch/chap...

I am saying when considering any “large” implementation there are a lot of considerations outside of infrastructure bills.

I’m not saying that every company should go cloud. But the “lenses” you have to look through are multifaceted


I’ve been in places where total AWS spend was a rounding error compared to revenue. Egress fees weren’t a top-10 cost and weren’t worth optimizing for.

The bosses would’ve blamed me for choosing a tier 2/3 noname provider the first time a day of downtime happens. And they would’ve been right.


Egress prices make migrating away hard. But a lot of products don't need to push much data out of AWS, especially with VPC peering based products.


And for things like transferring data to and from S3 within AWS you use an S3 gateway endpoint so data stays within AWS’s network
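
For the curious, setting one up is roughly this with boto3 (a sketch; the VPC and route table IDs are placeholders):

  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")

  # Gateway endpoint for S3: S3 traffic from this VPC is routed over AWS's
  # network instead of the public internet, so it avoids NAT/egress charges.
  resp = ec2.create_vpc_endpoint(
      VpcEndpointType="Gateway",
      VpcId="vpc-0123456789abcdef0",            # placeholder VPC id
      ServiceName="com.amazonaws.us-east-1.s3",
      RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder route table
  )
  print(resp["VpcEndpoint"]["VpcEndpointId"])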


Whilst there are some nice toys, e.g. Spanner, there's generally little other reason to tolerate the abusive pricing of 'the cloud', IME.


That's a long-term concern, though. The thing to worry about is a rug-pull, but no major provider will do that. They could, but they won't.


The size of the savings you can negotiate is a function of how locked in you are. A big customer will get discounts well beyond reserved instances, in return for (usually) committing to increase their expenditure above where it is.

The better your competing offer, the better the negotiating position. And while Amazon hasn’t gotten more expensive per se, it’s certainly not gotten cheaper.


This article is a recap of the original engineering article by the quoted developer and manager at Uber.

https://www.uber.com/en-GB/blog/up-portable-microservices-re...


Ok, we've changed the URL to that from https://www.infoq.com/news/2023/10/uber-up-cloud-microservic.... Thanks!


Uber microservices were such an inefficient PITA. There was a buzzword soup of half-baked infra pieces and they were always migrating. Every part of the stack was rotten. Udeploy, xterra, tchannel, schemaless, etc etc.

My peak “wtf” moment was when we had a SEV because two services that should communicate actually used different versions of thrift, both hard forked by Uber, with different implementations for sets. Passing a set from one service to another caused everything to break.


> In preparation for the move to the cloud, the company spent two years working towards making all stateless microservices portable so that their placement in zones and regions can be managed centrally without any involvement from the service engineers

I'd like to hear more about how Uber organized the engineering teams over two years to make "stateless microservices portable".

How many teams? What were the requirements to each team? What was the timeline? How did they know it was completed? How was it prioritized along other business priorities of the teams? How long did they think it would take originally? Was it worth it?


Maybe direct these questions to a C-level employee at Uber who could potentially answer them for you?


There are no doubt lots of Uber employees that post here. This is an appropriate forum to ask.


And why do you think they could answer you with any details without going through comms?


because this is the Internet and anyone can make an anonymous account via VPN, if someone were so inclined.


Yes and I would break my NDA to answer a random question on HN for what personal gain?


It's a ridiculously innocuous question. If I were worried about blowback, I'd question working for Uber at all.


TBH I don’t trust HN with my data. They have weird account policies, and I’d not be surprised if they felt like witch-hunting someone down they would…especially in the interest of ycomb alumni


I'm replying to my own comment here since it was so severely downvoted. OP was musing about a bunch of questions that didn't seem useful to the discussion. Who was he asking? And if you're going to say an Uber staff member, why doesn't his comment indicate that? It just didn't seem to add to the discussion at all.


OP here... @s3p: FWIW, I didn't take issue with your comment. Surprised it was downvoted. I do find that people often reply to such comments with something like "Uber team member here..." so it didn't seem ridiculous, but your suggestion seemed authentic and fair to me.


It seems like they’ve gotten to the “holy grail” of deployment where, in theory, developers don’t have to worry about infrastructure at all.

I’ve seen many teams go for simple/leaky abstractions on top of Kubernetes to provide a similar solution, which is tempting because it’s easy and flexible. The problem is then all your devs need to be trained in all the complexities of Kubernetes deployments anyway. Hopefully Uber abstracted away Kubernetes and Mesos enough to be worthwhile, and they have a great infra team to support the devs.


It's not clear to me that being completely unaware of your infrastructure is a good thing. I don't think it's too much trouble to ask an engineer to understand k8s and think about where their service will live, even if it's a ci system that actually deploys. Furthermore, many layers of abstraction, especially in-house abstraction, just mean you have more code to maintain, another system for people to learn, and existing knowledge that you can't leverage anymore.


There is a wide spectrum of infrastructure (and platforms, frameworks, etc) from “allows applications to do just about anything, though it may be very complex” to “severely constrains applications but greatly simplifies doing things within those constraints.” To be clear, by “just about anything” I am not talking about whether some business logic is expressible, but whether you can e.g. use eBPF and cgroups, use some esoteric network protocol, run a stateful service that pulls from a queue, issue any network call to anything on the Internet, etc.

If you are developing application software like Uber, 99.99% of the time you really do not need to be doing anything “fancy” or “exotic” in your service. Your service receives data, does some stuff with it (connects to a db or issues calls to other services), returns data. If you let those 0.01% of things dictate where your internal platform falls on that spectrum, you will make things much more complicated and difficult for the 99.99% of other stuff. That is where leaky abstractions and bugs come from, both from the platform trying to be more general than it needs to be and from pushing poorly understood boilerplate tasks (like configuring auth, certificates, TLS manually for each service) onto infrastructure users.

Being unaware (of course not completely unaware, but essentially not needing to actively consider it while doing things) of infrastructure is actually the ideal state, provided that lack of awareness is because “it just works so well it doesn’t need to be considered”. It means that it lets people get shit done without pushing configuration and leaky abstractions onto them.

I’ll give you one example of something that does an excellent job of this: Linux. Application memory in Linux requires some very complex work under the hood, but it has decent default configurations, with only a couple of commonly changed parameters that most applications don’t need to touch, and it has a very simple API for applications to interface with. Similar with send/receive syscalls and the use of files for I/O ranging from remote networking to IPC to local disk. These are wonderful APIs and abstractions that simplify very hard problems. The problem with in-house abstractions isn’t that they try to do abstraction, but that sometimes they just don’t do a good job, or churn through them faster than it takes them to stabilize.
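
A tiny illustration of that “simple API over hard problems” point (nothing Uber-specific, just standard POSIX-ish calls): the same read/write interface covers a disk file and local IPC.

  import os
  import socket

  # A plain file on disk...
  fd = os.open("/tmp/demo.txt", os.O_CREAT | os.O_RDWR, 0o600)
  os.write(fd, b"hello from disk\n")
  os.lseek(fd, 0, os.SEEK_SET)
  print(os.read(fd, 64))
  os.close(fd)

  # ...and a local socket pair (IPC) look the same to the application.
  a, b = socket.socketpair()
  os.write(a.fileno(), b"hello over a socket\n")  # same write() call
  print(os.read(b.fileno(), 64))                  # same read() call
  a.close(); b.close()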


Well put, 99% of companies don't need to introduce such complexity for their relatively trivial use cases (though well-intentioned but bad engineers will try to invent it anyway).


Part of my point is the goal with such a system is usually to require less infra work/knowledge from your devs, but it backfires if you don’t invest enough in your abstraction.

The implicit goal of these abstractions is really to centralize knowledge and best practices around the underlying tech. Kubernetes itself is trying to free developers from understanding server management, but you could argue it’s not worth using directly vs. just teaching your devs how to manage VMs, for the vast majority of organizations.

I don’t think you’re ever going to stop more and more layers of abstraction, so the best we can hope for is they’re done well. Otherwise you may as well go back to writing raw ethernet frames in assembly on bare metal.


> Part of my point is the goal with such a system is usually to require less infra work/knowledge from your devs, but it backfires if you don’t invest enough in your abstraction.

I disagree that the solution is to simply build more. Often the best thing to do is accept that devs will need to know a little infra, and work with that assumption.

> The implicit goal of these abstractions is really to central knowledge and best practices around the underlying tech.

I agree with that.

> Kubernetes itself is trying to free developers from understanding server management, but you could argue it’s not worth using directly vs. just teaching your devs how to manage VMs for the vast majority of organizations.

The difference is that spinning up a VM and setting it up to have all the features you would want from k8s would be too much to ask from a dev. You would probably just end up re-creating k8s.

> I don’t think you’re ever going to stop more and more layers of abstraction, so the best we can hope for is they’re done well. Otherwise you may as well go back to writing raw ethernet frames in assembly on bare metal.

The problem is that abstractions are not free, and most of the time they aren't done well. Once in a while you'll get one that reduces(hides) complexity and becomes an industry standard, making it a no-brainer to adopt, but most of your in-house abstractions are just going to make your life worse.


I think the biggest “win” with abstractions is that it makes it easier for infra teams to update underlying concretions (is that a word? the concrete version of the abstraction) without having to dig deep into the codebase.

e.g. with Kubernetes, if you have the actual manifests defined by every team, it is a pain to do any sort of k8s updates. A simple abstraction where teams only define the things they are interested in configuring (e.g. Helm values) simplifies this task a lot.
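
A minimal sketch of the idea in Python (not Helm's or Uber's actual mechanism, just the shape of it): teams hand over only the values they care about, and the platform deep-merges them over defaults it owns and can change centrally.

  PLATFORM_DEFAULTS = {
      "replicas": 2,
      "resources": {"cpu": "500m", "memory": "512Mi"},
      "probes": {"liveness_path": "/healthz", "period_seconds": 10},
  }

  def render_service_config(team_values: dict) -> dict:
      """Deep-merge team overrides on top of platform-owned defaults."""
      def merge(base: dict, override: dict) -> dict:
          out = dict(base)
          for key, value in override.items():
              if isinstance(value, dict) and isinstance(base.get(key), dict):
                  out[key] = merge(base[key], value)
              else:
                  out[key] = value
          return out
      return merge(PLATFORM_DEFAULTS, team_values)

  # A team only declares what is special about its service:
  print(render_service_config({"replicas": 6, "resources": {"memory": "2Gi"}}))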


All it takes is for one microservice to start hanging on a gRPC request, server hardware to stop doing some fundamental thing correctly, or some weird network quirk that 10x’s latency to half the switch ports in a rack, and you end up with insane, sophisticated cascade failures.

Because engineers don’t have to understand infra, it often spans geographies and failure domains in unanticipated, undetectable ways. In my opinion the only antidote is a thorough understanding of your stack down to the metal it’s running on.


A single engineer can’t understand everything at scale.

Even in a 100-person startup that I worked for, where I designed the infrastructure and the best practices and wrote the initial proof-of-concept code for about 15 microservices, it got to the point where I couldn’t understand everything and had to hire people to separate out the responsibilities.

We sold access to microservices to large health care organizations for their websites and mobile apps. We aggregated publicly available data on providers like licenses, education, etc.

Our scaling stood up as we added clients that could increase demand by 20% overnight, and when a little worldwide pandemic happened in 2020, causing our traffic to spike.


None of the layers of abstraction are perfect. You have to deal with the whole mess all the way down.

We've had individual EC2 instances go bad where I currently work, with Amazon acknowledging a hardware problem after a ticket is raised. The reality is, quickly resolving the issue means detecting it and moving off of the physical machine.

Naturally our tooling has no convenient way to do that, because we have layers of things trying to pretend physical machines don't matter.


No, the answer is keeping all of your VMs stateless and just using autoscaling with the appropriate health checks, even if you just have a min/max of 1.
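
Roughly this with boto3 (a sketch; the group name, launch template id, subnets and target group ARN are placeholders): even a "fleet" of one gets replaced automatically when it fails its health check.

  import boto3

  asg = boto3.client("autoscaling", region_name="us-east-1")

  asg.create_auto_scaling_group(
      AutoScalingGroupName="my-stateless-service",
      LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0",
                      "Version": "$Latest"},
      MinSize=1,
      MaxSize=1,
      DesiredCapacity=1,
      VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
      HealthCheckType="ELB",        # use the load balancer check, not just EC2 status
      HealthCheckGracePeriod=120,
      TargetGroupARNs=["arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                       "targetgroup/my-service/0123456789abcdef"],
  )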


Describe a health check that can detect any possible hardware problem.

The error rate on the machines was higher in both cases, but many requests still succeeded. Amazon certainly didn't detect an issue right away either.


There is no way that you could record metrics - even custom metrics that get populated to CloudWatch via the CloudWatch Logs agent - and, over a certain threshold of errors, bring another instance up and kill the existing instance? If you can detect sporadic errors, there must be some way to automate it.

I’m assuming this isn’t a web server, if so it’s even simpler.
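
Something along these lines with boto3 (a sketch; the metric names, thresholds and SNS topic are made up), alarming on a custom error count and handing the action off to whatever recycles the instance:

  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

  cloudwatch.put_metric_alarm(
      AlarmName="my-service-errors-high",
      Namespace="MyService",             # custom namespace the app/agent publishes to
      MetricName="RequestErrors",
      Dimensions=[{"Name": "AutoScalingGroupName",
                   "Value": "my-stateless-service"}],
      Statistic="Sum",
      Period=60,
      EvaluationPeriods=3,               # three bad minutes in a row
      Threshold=20,
      ComparisonOperator="GreaterThanThreshold",
      TreatMissingData="notBreaching",
      AlarmActions=["arn:aws:sns:us-east-1:123456789012:recycle-instance"],
  )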


A statistical rule moves you into the realm of deciding what rate of false positives and false negatives you'll tolerate. Based on data from exactly two incidents in this case, which is obviously a bit fraught.


  > abstractions on top Kubernetes
  > abstracted away Kubernetes
I am beginning to think it's not such a bad thing to live and work in a third-world country far away from SV-induced hype cycles. This is genuinely painful to read.


But lots of people here talk positively about services like Heroku or Fly where you just push the code somewhere and it runs without you having to know a lot about the infrastructure.

Not every software development problem is a big scale problem and once you identify such a case you can start optimization work taking all the low level details into account. In reality most scalability problems revolve around databases, caches, concurrency and locks and you probably aren't going to tackle a lot of these in your average stateless service.


Kubernetes works great for larger projects when combined with ArgoCD or similar.

They all use GitOps, which means all infra deployments and changes are tracked and can easily be rolled back on any issues. And the complexity is nothing compared to having to manage your own cloud resources using Terraform etc., which used to be the case.

And these days every developer needs to be on board with DevOps and so there are no real old-school infra teams supporting anyone.


The other "leak" in these abstractions that arises from physical limits is performance, especially when it comes to IO.

This is a major problem for databases and ultimately makes database "portability"/fault tolerance tricky, since databases work best with direct-attached storage that's inherently bound to a single physical machine.


Not to mention there are all sorts of other limits that you can hit at scale just on the compute layer itself (e.g. max PIDs, file descriptors, etc.).

I don’t know if we can truly abstract away the underlying system. The best we can do is give a best effort approximation that works in most cases, but explicitly call out the limits when they are reached.

I suspect that this is just the underlying physical limitation bubbling up: the compute ultimately runs somewhere with finite resources.


Does Uber really need 4000 microservices?


A different (better?) question is, does Uber need 4000 API contracts?

The answer to that is probably yes. APIs let us split work across systems/people/teams/regions, and provide a way for both sides of a split to work together. Uber has a lot of teams, a lot of engineers, and so it makes sense that there are a lot of API boundaries to allow them to work together more efficiently. Sometimes those APIs make sense to package as microservices.


Uber has several different APIs for users. A naive purist might think that's silly until you realize a rider is a user, a driver is a user, a courier is a user, a restaurant owner is a user, a line cook is a user, a doctor's secretary is a user, an Uber employee is a user, a freight broker is a user, an advertising manager is a user... people can simultaneously be multiple types of users and have multiple profiles as a single type of user, and did I mention that you have to properly secure PII due to being in a highly regulated industry? And that's just users.

Don't even get me started on anything money related :)


Plus there's a surprisingly high floor on the number of APIs a large company needs for basic stuff like "set up new hires automatically in all the systems needed".


> a line cook is a user, a doctor's secretary is a user

I was with you on other types of users, but can you elaborate on these particular use cases?


UberEats


Where does a line cook's use case fit in it? From what I know, Uber Eats sends an order to a restaurant, an employee manually punches in the order on their POS system, and the order ticket goes to the kitchen.


You don't think there are a variety of people at a restaurant who might interact with the system? Is this particular detail so very important to the point of the parent comment?


I think the parent comment tried to prove a point by making an extremely frivolous claim and naming every person they could think of as a “user”, which means they are either wrong or they failed to adequately make whatever point they were trying to. Uber doesn’t need an API for line cooks, so using them as a justification for a large number of microservices was not rhetorically sound.


> Uber has a lot of teams, a lot of engineers, and so it makes sense that there are a lot of API boundaries to allow them to work together more efficiently.

Isn’t that the premise of the question? Does Uber need so many engineers?


> Isn’t that the premise of the question? Does Uber need so many engineers?

The only people who can answer that are employees at Uber.


New unrelated databases, frameworks and queue services don’t write themselves duh


I wonder what setting up a local dev instance is like for anything involving more than one or two of those.


I've worked on a couple of extremely large micro services projects.

And the thing is that nobody ever needs to run the entire stack, other than for end-to-end tests, which get run in the cloud.

You just check out the services you need, and because they are designed to be isolated, the dependencies will usually be automatically stubbed out. So it's just a matter of running them, or chaining them together if you have a particular scenario to test.


Question for you: how does performance measurement and optimization work in that environment? Is the key some sort of meta tooling that understands relationships between microservices? How would you express such relationships in the first place?

Q2: How do you ensure the stubbed deps behave like the real thing?

Q3: how do you handle logging and metrics in a unified way across the stack? And related to this: how do you ever get to upgrade services' cross-cutting concerns that ideally are not reinvented in every service?


> how does performance measurement and optimization work in that environment?

SRE here. Generally speaking, each API or each service will have a contract that it must adhere to, depending on upstream and downstream relationships and their fail-safes. Each service (or API) will then be load tested in isolation.

After that, if you want to be really sure about regressions (which would include fail safes) you load test the whole thing put together.

> Is the key some sort of meta tooling that understands relationships between microservices?

This is quite hard to do when you have a lot of transactions. I don't think there's commodity software that does this because you'd need to configure that software to map on keys, then map those keys to services. Generally, the easiest way is to get engineering teams to declare upstreams and downstreams.

> Q2: How do you ensure the stubbed deps behave like the real thing?

Generally, generation. Something like protobuf or Open API generation will do.

> Q3: how do you handle logging and metrics in an unified way across the stack?

You issue high level standards like, "We'll use JSON logging with UTC time formatting". At the end of the day logging is very contextual and in a service ownership model the service owners are usually the ones reading and alerting on their logs.
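
The "JSON + UTC" standard usually ends up as a tiny shared library everyone imports; a minimal Python sketch of what that looks like:

  import json
  import logging
  import time

  class JsonUtcFormatter(logging.Formatter):
      converter = time.gmtime  # render timestamps in UTC

      def format(self, record):
          return json.dumps({
              "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
              "level": record.levelname,
              "logger": record.name,
              "msg": record.getMessage(),
          })

  handler = logging.StreamHandler()
  handler.setFormatter(JsonUtcFormatter())
  logging.basicConfig(level=logging.INFO, handlers=[handler])
  logging.getLogger("payments").info("charge authorized")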

> And related to this: how do you ever get to upgrade services crosscutting concerns that ideally are not invented in every service?

Shared dependencies. I'm not actually sure what counts as a cross-cutting concern; generally services that are this small should be designed to operate mostly independently. They're small, but "microservices" tend to have a lot of fail-safes built in. If you're referring to how we avoid writing 4000 config loaders, there's usually a team that builds a very generic config loader and everyone, or at least a majority, uses it.


> JSON logging with UTC time formatting

Perhaps the simplest, biggest impact in my log life has been adhering to these principles.


I've worked at a place that architected 100s of microservices pretty well, in a similar way that Uber apparently does.

Q1: (perf) these tools exist, the buzzword phrase is "distributed tracing". The relationships are actually not explicitly defined for the tooling to work, but rather inferred. Visualize a network call as a call-stack, where each service is a level in the stack. Jaeger (a CNCF project addressing distributed tracing) was coincidentally started by Uber.

Q2 (stubs): In my experience, mocked responses get you a long, long way. Typically the API response type that you're mocking is generated from a protobuf (or thrift, OpenAPI, etc.) file. If your dependency changes that type in a way that breaks your test, the CI platform will let them know.

If it's a more subtle change (like, it used to deterministically return 18 and now it deterministically returns 20), it's really on the service owners to communicate changes and grep the code base before making the change.
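
To make that concrete, the "generated type + mocked response" pattern looks roughly like this (the client and response types stand in for what protobuf/Thrift codegen would emit; the names are invented):

  from dataclasses import dataclass
  from unittest import mock

  @dataclass
  class GetRiderResponse:          # stand-in for a generated message type
      rider_id: str
      rating: float

  class RiderServiceClient:        # stand-in for a generated client stub
      def get_rider(self, rider_id: str) -> GetRiderResponse:
          raise NotImplementedError("real stub makes a network call")

  def test_pricing_uses_rider_rating():
      # autospec keeps the mock honest: if the generated method or type
      # changes shape, this test fails in CI instead of in production.
      client = mock.create_autospec(RiderServiceClient, instance=True)
      client.get_rider.return_value = GetRiderResponse(rider_id="r1", rating=4.9)

      rider = client.get_rider("r1")
      assert rider.rating > 4.5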

Q3 (logging/metrics): Typically by using a shared "logging" and "metrics" lib for each language. Every service will typically be a gRPC service, and accordingly emit a standardized, generated-from-protobufs set of metrics to Prometheus by default.

Q4 (how to upgrade common libraries): this is definitely a tricky one. The answer is, basically, really carefully. Typically, you'll want your infrastructure to be compatible with vX and vX+1, and give teams a deadline to cut over from logging X to X+1. The couple of weeks before that deadline usually involves a lot of cat-herding and handwringing.


Not OP, but I worked on a large microservices-based system at a leading financial institution, and when we needed to work on a single service, we had docker compose files that pulled the images for that service's dependency services so we could develop what we needed. They all just ran in our local Docker. If we wanted, we could have a massive compose file with all services in it, but typically we only needed the IAM service and a few other small ones, depending on what services we were working on.


Q2 - techniques like contract testing help here, beyond simple stubs. Also, mocked services maintained by the service's original devs that you can work against help.


What you are imagining happens (it's not pretty).


Is there typically a “cold restart” plan to rebuild the whole infra from scratch? I’m thinking of things like circular dependencies when services boot.


I've found that the eventual consistency provided by orchestrators like k8s solve this problem rather well. If the services are written using the right paradigms to handle such a situation, they will also be much more resilient to platform disruptions.

I like to view k8s a lot like erlang's OTP, if something isn't right with the state of a service, I advocate calling 'exit()' and letting the restart with exponential backoff handle the transient.
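
In code that ends up being refreshingly boring; a sketch of the crash-only style (function names are illustrative):

  import logging
  import sys

  def dependencies_healthy() -> bool:
      # e.g. can we reach the database, does the config match the expected schema
      return True

  def main():
      if not dependencies_healthy():
          logging.error("bad state, exiting so the orchestrator retries us")
          sys.exit(1)  # non-zero exit; CrashLoopBackOff supplies the exponential backoff
      # ... serve traffic ...

  if __name__ == "__main__":
      main()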


Nobody needs to if they can't possibly.


In general in a micro service environment, you try to build things so that 1) you don't need to run other things locally, and 2) if you did need to, the services are just containers so it's pretty easy to run one.

But you tend to try to write your service so that it treats everything else it depends on like a vendor-provided API. Like, if you were building a Slack bot, you wouldn't ask Slack to let you pull down and run a local copy of Slack's API to test against. You'd maybe set up a test account in Slack's production system, and run your local bot against that to test it before you deploy it with credentials to run against your real slack account.

In a microservice architecture, you integrate with other internal systems in the same way.


We're not big on microservices, but we do integrate with a lot of other systems.

I find the opaqueness of other services reduces development speed quite drastically. With local code I can view both sides of the fence and easily see if I'm using it wrong or if it's a bug in my colleague's code.

Seems that if you're constantly developing against opaque services you'd end up in the same quagmire quite quickly?


This is exactly why the "microservices" pattern is usually adopted along with the "monorepo" pattern. IMO, it's a strong anti-pattern to have the former without the latter.


You shouldn’t care about anything outside of published contracts for dependencies. You’re always dependent on underlying APIs.


Of course I shouldn't.

But when things don't work as I expect, it's far more efficient to be able to view the code on both sides, rather than only on my own side and try to guess what the other side is doing.

Besides the usual suspect of wrong understanding on my end leading to misuse, this can also be due to lacking or wrong documentation of the other system, or bugs in the other system due to unexpected inputs or similar.

Like just a few days ago we spent an unreasonable amount of time with an API of one of our customers, where we would get an empty list back for some of our queries. Turned out something in their service crashed when handed national characters, despite accepting JSON and hence UTF-8 input, with nothing in the documentation about English letters only. Rather than returning 400 or 500, the service returned 200 with an empty list, leading us to assume we did something wrong.


> But when things don't work as I expect, it's far more efficient to be able to view the code on both sides, rather than only on my own side and try to guess what the other side is doing.

Are you able to view the source code of your platform vendor? Everyone is at some level dependent on Black box APIs.

If you can document where with certain input you don’t get the expected output, you reach out to the team that is responsible for it whether internally or externally and they either explain it or they fix it.

This is the API/SDK I’ve been working with over the past five-plus years - three of them actually working at AWS (Professional Services).

https://boto3.amazonaws.com/v1/documentation/api/latest/inde...

I found a bug in one relatively new API that a service team released, I reached out to the team with a documented scenario and they fixed it.

Other times they explained what I was doing wrong. That’s what any large organization does.

I’ve worked with other vendors and internal teams plenty of times over the years.


When you want to run a local version of a service, you start it locally as normal and configure either local stubs of dependencies and dependents, or you can hook up the actual “production” services to your local service process. If doing the latter, you can create special user accounts and events that change how the network routing happens. Events from those fake users pass through regular production apps up until the service you’re testing, then they are instead routed to your local version, and you can continue the calls to other prod services.

Uber employees’ apps are special and allow us to log in as these fake users and create fake rides or deliveries, and then we can look at the traces and logs to debug and stuff.


Looks like they are shifting away from local development: https://www.uber.com/en-CL/blog/devpod-improving-developer-p...


Something like https://tilt.dev/ where you spin up a subset of the service graph in a cloud environment that hot-reloads based on local edits.


Microservices allow development orgs to scale horizontally which enables businesses to expand to adjacent markets, faster.


Static function calls are also API contracts.


There's an interesting HN comment[1] from 2020 by a former Uber engineer, which discusses the complexity a bit. It's more about UI, but the thread discusses the backend as well. In brief, something that may look super simple to the user (like handling payments) is actually quite complicated when you cover all the markets, different payment types, etc. And all this carries over to the backend as well.

[1] https://news.ycombinator.com/item?id=25376346


And also certain states and localities have different requirements for ride share. I noticed this in NYC and Seattle


Do all of those need to be microservices, or could you instead have one monolithic payment service that handled all those use cases?


Of course you could, just like you could do this in 2000 less-well-defined microservices, or 8000 more finely-grained ones.

The question is what makes you think 1 service is immediately better than however many payment services there are now?


All other things being equal, 1 service is obviously better than 4,000 services to maintain.


Not that obvious; how do you coordinate people from several teams working on it?


IBM managed to coordinate several hundred people on a single software product for decades on end.

Even Microsoft managed to do so for multiple products while also stack ranking the teams.

And I doubt there's a single service, even payments, that's as technically complex as Excel.


Do you really have 4,000 teams working on payments?

And I'd agree with the other child comment that the monolith can always be broken into separate components which are owned by different teams.


And then every time you had a change, you would have to deploy everything and your surface of failure is greater.

What would a monolith buy you?


Not having to evolve or understand or staff 4,000 micro services.

An ability to easily change the boundaries of your conceptual components, because they WILL be wrong now or in the future.


You still have to understand your boundaries when you have a large monolith unless you have one big ball of mud.

Even with a well constructed monolith, you need to have well defined “services” with contractual interfaces.

You don’t have to understand 4000 services to make one change any more than I need to understand the entire boto3 library when I am building on top of it.

https://boto3.amazonaws.com/v1/documentation/api/latest/inde...

You can’t just change your interface in a monolith either without breaking other parts of it.


Managing a numerically large set of services has its own challenges, but it pales in comparison to the complexity of a monolithic service serving the same functionality. As the other poster already pointed out, such a behemoth would be a nightmare to change at all. It would also be a scalability and reliability nightmare. We migrated away from monoliths because they don't work in modern compute architectures.


It might be easier if you have the same API for the payment microservices, but each different implementation in a different service, so approximately 100 times fewer distinct APIs than microservices.


And what service would that be?


Uber is a global company (70+ countries) operating Uber and Uber Eats.

So almost certainly they are duplicating their entire stack per-country if only to get around the vastly different regulatory environments.


I'm guessing it has to do with payment processors. I remember reading an article a while back about why the Uber app is large (100+ MB); most of it is related to payment processors and the taxes of all the places it operates globally.


No, that's not correct (mostly). Services are written in a way that supports global operations.

But that scale introduces a lot of complexity so you can't just have "one service for onboarding drivers"


Do you have knowledge of this?

I find this hard to believe given the regulations from some of the larger countries requiring, by law, customer data be processed in country.


Deployments != unique codebases.


Try and read the full comment chain again to understand the context of our discussion.


I did. Your concern about data needing to be isolated due to regulation doesn’t require you to make a complete copy of the code.


>“So almost certainly they are duplicating their entire stack per-country if only to get around the vastly different regulatory environments.”

Responding comment says no they are not and the services are built to handle global traffic.

I respond and say I doubt that due to on soil laws.

You can argue two regions with different configurations but the same code bases are different services but that’s not what we’re talking about here.


I re-read the chain and don’t follow your argument.

Do you mean that your original point was about deployments to begin with?

FWIW I work in a microservices shop for a global app in an extremely regulation heavy industry, and we run a single codebase per service, segregating regulator-imposed behaviour via flags to deployments


We're discussing whether Uber has on-soil deployments. Unless they are running afoul of on-soil laws, they likely do.

Your FWIW is exactly what we're talking about here, and I'd venture a guess nearly half this site works for some corp with duct tape, hope, and microservices powering their junk. Me too!


Yep. There might be some services that are entirely geo-specific, but I haven't seen them.

The same microservice deployed in multiple geos still counts as one service, so it's considered to be 1 out of the 4000 in this case.


It’s not just countries. I’ve used Uber across the US and you can look on the receipt and see different regulations in play depending on the city


Uber has a really liberal definition of a microservice. Every web UI or dashboard is a service (of which there are many hundreds). Every application anyone builds across their many thousands of engineers is a service. It's rare, I think, for services to have fewer than a few thousand lines of code. In my experience, most companies would have a monolith that serves multiple UIs from the same service. Uber instead ships that monolith as a library which is a framework for building individual UIs. It has its pros and cons, but I quite liked how they did it.


(Worked at Lyft) Our number of active microservices was small in comparison. 4,000 is likely an overblown number to highlight the accomplishment, possibly counting inactive ones.


Isn’t Lyft US only? Uber operates in 80+ countries


Worked at Grab. They had a ton of micro services. It was their way of partitioning databases so that they didn't have to deal with joins. Yes, it caused a lot of problems.


Lyft also doesn’t do food delivery…


From experience working at big tech I’m willing to take a guess.

Maybe a couple dozen will be actual, more complex and meaningful services. Then a few dozen more services that are somewhat more unique.

And then the majority of the long tail will be mostly cookie-cutter services doing X for lots of different use cases, where each use case is a separate deployment counted as a service (for example, systems to process streams of logs related to business logic).


The same binary with a different configuration


I've seen at least one place with many more than that in recent years. If you have one microservice "listener" per queue, another for the database processing and persistence (business logic), and another providing an API for one or more frontend UIs related to it, then the microservice tally goes up very fast. It's kind of surprising to read so many comments indicating HN readers weren't aware of this.


sounds like a massive nightmare


Most are likely limited to some subdomain, with limited communication between domains.


Massive scale*


There's quite a sizing range between monolith and microservice.

If all their IT needs are behind micro "micro" services, that figure is understandable.

Outside of the map, taxi, food, payments, and onboarding, they also have monitoring, deployment, HR, billing, legal, taxes, internationalized stuff, and the usual "..." for what I'm missing.

If you just take a standard ERP, you could easily split it into dozens, even hundreds, of microservices.


> If all their IT needs are behind micro "micro" services, that figure is understandable.

I call them nano services.


And you're pinpointing the problem with such news.

Everybody knows what a monolith is, but nobody really knows what the size of a "micro" service is.

Just for taxes, do you make one service for taxes, or one for each recipient of taxes (in the EU, one for each country; in the US, one for each state plus federal), with a different team managing each service?


But how does your "..." sum up to 4000?


Apparently they started at 1000 and went from there...

"What I Wish I Had Known Before Scaling Uber to 1000 Services" - https://youtu.be/kb-m2fasdDY


It reminds me of this thread about Netflix, with insane amounts of events and logs compared to active users.

https://news.ycombinator.com/item?id=30635369


Yup Pornhub serves much more video than Netflix and they do so without that insane amount of complexity.


Isn't Pornhub free, their only monetization being ads? Also, does Pornhub have personal recommendations, per-country and per-region libraries, and Android, iOS, and Android TV apps? Probably not.

That is a much easier business model and a lot lower level of complexity than Netflix. I imagine running Pornhub is essentially running a large website that hosts video. Probably just the billing side of Netflix is more complicated than the entirety of Pornhub's operation.


PornHub has significantly more content than Netflix, not to mention the ability for users to upload content and have it immediately available worldwide. That alone makes it significantly more complex than Netflix.


Pornhub is not about waving the engineering flag up and down to signal how cool their infra is though.


Honestly Pornhub's stack is genuinely impressive. More start-ups should just use PHP and get shit done


I think one of the reasons behind PornHub's tech stack is that their industry doesn't really lend itself to VC, so building an engineering playground for PH would be a waste of money, as no amount of complexity would net them VC money (nor an invite to a cloud provider conference), whereas most startups live and die based on the VC funding their complexity and buzzwords allow them to grift.


I think the takeaway was that choice of language might be less important than engineering strategy, because PH are successful despite their choice of stack :)


Yup, I remember that. Netflix seems to be the poster child for overengineered architecture - for something that is almost entirely commoditised nowadays (one-way video streaming over the internet).

Uber's problem space is significantly more complex than Netflix, so I'm unsure it's a fair comparison. But they do seem to have quite a lot of overengineering going on. At least that's how I feel each time I read an Uber tech article.

About the only companies which seem to justify their complex architectures are Google/Meta/Amazon imo.


> Uber's problem space is significantly more complex than Netflix

What makes you say this? Netflix serves probably several orders of magnitude more bytes, and online video is hard. At its core Uber is basically a Passenger Service System, and we have had systems like these implemented in software since the 1950s.


There are a lot more use cases in the taxi and food delivery space than in video streaming. At least by an order of magnitude. Consider various user personas for one, legal considerations, and so on. Technically each use case might be less demanding than video streaming, but overall it's much more complex.


What would the engineers be doing otherwise? You get bored if you don’t.


It's insane to hear attrition used as an excuse for architectural decisions, but I've seen it firsthand.


So we're going to get to the point where everyone gets their own microservice, right?


That is unironically a better way to do it than anywhere I've worked that does microservices.

My experience with microservice shops is you have one macro-monolith with 50 people working on it (which has all the problems of a monolith and none of the benefits), 5 actual decent microservices with a team or individual that properly maintains them, and 100 random utility micro"services" that are like 3 lines of code, used by exactly one other service, and you need 40 LOC and a network call to interact with them.

I'll take everyone has their own service any day of the week. At least when I need to interface with 12 different things I can have 12 different people to roast for not properly documenting their API. And tbh literally the only positive I can come up with for microservices is the ability to neatly fire one into the sun and rewrite it from scratch.


Isn’t that basically what happens when you split an API out to the different methods and write a Lambda for each one?


Does it matter how they organize their services? Your experience and environment will be different in so many ways that I doubt it's comparable.


Yes, there are specific business rules for each nation, region/state, and city.


Maybe they meant instances.


How else would engineers demonstrate "impact" for promotions?

/s


I worked at a SV startup (series A) for a while and an EM once mentioned struggling to keep the number of microservices under the number of engineers.


Is there a compelling article about the ideal microservice to engineer ratio (ie less than 1.0)?


I don't know of one, but this thread has some interesting discussion: https://www.reddit.com/r/ExperiencedDevs/comments/x1p5gj/my_...


You jest but also you're not wrong.


My own company has 800+ microservices. I am very familiar with the politics of microservices.


My personal microservice fiefdom is about ten. My company probably has 1000. Is this ratio normal?


Any more than 1 microservice per engineer (as in, 7 microservices for a 7 engineer team) is too much for engineers to handle during on-call incidents.

If management values business SLAs that is.


Dysfunctional organizations lead to absurd solutions to absurd problems.


There's no way that number isn't fiction; Occam's razor says it's out of the range of believable. That's ~2 per eng according to Google. That's absurd. (That eng headcount is also a bit … high.)

This sounds like a figure from someone who saw a single microservice running across 100 pods/instances and counted that as 100 "microservices".


Uber invested heavily in tooling that makes creating and deploying a new service take about 30 minutes. This was before they invested in making it as easy to share code. If you combine that with fast hiring and a big pressure to ship, it makes sense to solve every problem with a new service that calls a few others.


S3 alone is built on top of 300 micro services. I don’t find it unbelievable that Uber needs a lot of them.


I find it highly disturbing that so far I've seen Zero Uber devs on this thread adding any sort of context/info or just confirmation. Wtf is this, the NSA? KGB? Can't they just list/dump the names of said 4000 services, or is that somehow some sort of secret-sauce?


A random engineer at any company is not going to divulge non-public information about the inner workings of their company without permission.

We had to sign something at AWS agreeing not to divulge internal tooling, like the internal account factory we used to create AWS accounts. Literally tens of thousands of people know what this tool is.


You can see someone telling you how many microservices S3 has right above.


Yes that “someone” was me - a former employee who couldn’t remember whether that information was public or whether it was something I was exposed to from the inside.

It in fact was public.

https://aws.amazon.com/blogs/storage/how-automated-reasoning....

I verified that before I posted that little tidbit.


I couldn't find any explanation of where the data would be found. Are they splitting data across clouds, and constantly "porting" that data from cloud to cloud as part of their portability?

Orchestrating the application layer across clouds is interesting, but how does their data layer work?


The title is misleading. I don’t see Mesos mentioned once in the article.

I got so excited about reading how Mesos could help in the multi-cloud world, potentially as the hypervisor for running k8s.


I dislike the Uber business itself (horrible treatment of drivers, poor customer service, poor safety controls, bullying of small businesses with Uber Eats, shitty executive level team with questionable ethics).

But the underlying technology which carried them to this point is a fascinating read.



I believe the dollar amount savings figures, they’re big and worthy of a congratulations to the engineers involved!

IMO, engineering man-hour savings are a lot less trustworthy. This may eliminate or simplify some engineering processes, but IME massive migrations like this simply replace them with a different set of processes; because those are different and theoretically addressable, they’re not counted against the hours saved, as they can be bucketed into bugs / to-be-addressed-by-the-roadmap / legacy behavior migrated from the old system (which is now dangerously-fragile-legacy and not ol’-reliable-legacy). Eventually someone will come along and decide this too is an inherently flawed platform that needs to be entirely replaced at great expense, and the circle of life continues.

This is still a massive undertaking, not just from an engineering perspective but from an organizational/process one though. Whoever pulled this off essentially had to coordinate (or figure out how to simplify/explain things well enough to skip coordination) with almost every engineer and likely almost every production service in a company with thousands of engineers. Those in startups may balk at this kind of thing taking two years, but having done my own two-year projects (at a smaller but comparable scale) in a big company, I can say two years is what I’d consider a highly optimistic and unlikely outcome for a project of this magnitude.


> This may eliminate or simplify some engineering processes but IME massive migrations like this simply replace them with a different set of processes

Yes

> because they’re different

Now I have to learn an entirely new set of tools, processes, etc. that are more useful to someone else but not helpful for me. The old one had its quirks, but I knew it inside out, and now the whole org has to re-learn how to do everything we did before.


Looking forward to the future write-up of how a ZooKeeper issue nuked their entire Mesos stack.


For a company that is basically a taxi service, they seem to invest an awful lot in constant rebuilds of their extremely complex infrastructure, which raises the question of whether that is even remotely necessary or just an exercise in pretending that they are a tech company.


“Basically a taxi service,” except that Uber spans hundreds of cities, coordinates millions of drivers - none of whom work on a fixed schedule - and its only interface with customers is an app that has to be fast, accurate, and reliable at all times.


Even a single taxi event is complex. Tell Uber you want to go to the ATL airport and it'll ask which terminal. If you're catching one from the ATL airport, it'll map and walk you to the rather distant pick up spot. And we haven't even touched on payment yet...


This is just such a bad take that it makes everything else you say after it null.

And Google is just a search engine, they only need like 20 engineers…


They do food delivery, parcel couriers, regular Ubers, plan-ahead Ubers, grocery shopping, and a lot of other stuff. If anything, this is simpler than most silo-driven architectures you'd usually get with such a massively diversified business.


To be fair, all of those listed were also handled by taxis previously, just with a more manual and more distributed process: the dispatcher allocated a cab to the requester and maybe passed an initial message, and then it was directly between you and the driver.


> "Basically a taxi service"

Not defending their tech stack, but I mean that is a lot of realtime data that needs to be accurate - this is not your typical SaaS CRUD app.


Just taking payments and doing reconciliation in dozens of currencies and payment methods alone would be mind-numbingly difficult and complex.


I love these r/iamverysmart takes on HN.

Is this generally a sign of youthful wishful thinking or just plain hubris?


Oh hey, this is the thing I work on.

We're giving a talk about this at KCD Denmark on the 14th of November "Keynote: Uber - Migrating 2 million CPU cores to Kubernetes" if anyone is in the area and has any particular interest in this.


Congrats to the UP team. The platform sounds good. I especially liked the Balancer component.


To save you the deep deep dive: on OCI and GCP.


In 3 years… “Uber saved costs by migrating their microservices to their own colo”, followed by “Uber simplified operations by migrating their microservice platform to a monolith”.


Might be a good guess, there's precedent of them changing fundamental technology in a similar timeframe...

2013: "Migrating Uber from MySQL to PostgreSQL"[1]

2016: "Why Uber Engineering Switched from Postgres to MySQL"[2]

[1] https://www.yumpu.com/en/document/view/53683323/migrating-ub...

[2] https://www.uber.com/en-GB/blog/postgres-to-mysql-migration/


More like “we hired a new principal arch who drove a change they personally liked, and everything was better because it allowed a lot of time to fix tech debt”


In 5 years... "We've discovered a new paradigm for efficiently carving up and distributing computational units for our application. We call it, nanofunctions."


Later "Lowering cost and dramatically reducing complexity with nanofunctions running on monoecosystem".


Don't even start on that. Coming up with things like that isn't funny to me anymore.

I'm currently trying to get out of the industry because I'm drowning in architectural bullshit like this constantly. It is peddled by snakes, bastards and wankers who care nothing for solving problems but want to create new ones.


It's been happening since the dawn of time: fat client/thin client, static link/dynamic link, microservices/monolith, centralised/decentralised. It doesn't just migrate from one to the other; the pendulum swings back and forth, and will do for eternity.

You can be all angry about it, but being angry at the storm doesn’t affect the storm, it only affects you. A lot. Negatively.

The trick is to position yourself to maximally profit from the next trend swing. I've been doing it for 20 years now: if you can predict the next place gold will fall from the sky, then go and stand there with a really big bucket.

I always found it strange that there is a certain type of intellect that is capable of accurate observation of reality but incapable of execution (sometimes called "the disconnected intellect"). They can tell you exactly the problem and the solution, but sit angry/frustrated that the observed world doesn't match some imagined ideal in their head, and rather than adapt their internal model and be entrepreneurial enough to capture the value that generates, they bleat and complain while losing all opportunities - opportunities they can see! I can't imagine being like this.


I have ridden these fads for the last 25 years (well actually longer - I had a different career first!) and made a fuck load of money out of it. But I am tired of it now. I really don't care any more.

I am fed up with solving the same problems again and again. It's more than just the earning; there's an intellectual dishonesty in this, and it's tiring and demotivating.


Sounds like you need to move up Maslow's hierarchy of needs and focus more on self-actualization (or whatever is upwards for you)... because your posts are full of negativity and you sound miserable.


Oh I'm bloody fantastically happy. I am merely cynical about the state of the industry as it stands.

I'm literally 18 months from packing up this shit and doing what I really want to do which involves nothing whatsoever to do with computers.


Good for you and congratulations! I hope to do the same some day, but I need at the very least 10 more years (no, this is not an invitation for FIRE enthusiasts to chip in with tips or help, thank you).


> The trick is to position yourself to maximally profit from the next trend swing. I've been doing it for 20 years now: if you can predict the next place gold will fall from the sky, then go and stand there with a really big bucket.

Where will the pendulum be in 2030? Asking for a friend :D


I'll be out of the market by then, so I'll drop what I think the situation will be:

Put everything on: cost savings, energy reduction, privacy, death of advertising.

Cost savings -> inefficient languages and architectures will die because the main datacentre currency is going to be performance/watt and that's going to cost serious money when transport infra is contending with DC power consumption. Things which are compiled and not interpreted will have a cost benefit then. Rust/C# (with AOT)/Go etc. Half these bloated piles of shit with expensive build toolchains will die too.

Energy reduction -> linked with above, energy usage reduction is going to be a big one. That means reducing workforce, simplification and efficiency are going to be key drivers. This may kill some ML approaches off that consume a lot of energy. So ARM etc.

Privacy -> Confidence in surveillance states and the cloud is declining so privacy first oriented services are going to have a huge uptick. Apple / standalone systems / new opportunities.

Death of advertising -> advertising is in its death throes with AI coming in, as it decreases the signal-to-noise ratio. It becomes less effective, so discovery rather than promotion will be the way to get attention for your product. Portals / landing pages / software catalogues.

Me, I'd concentrate on cloud cost management, code efficiency, and business efficiency as the key areas to invest my time in.


I’m 30+ years in and I don’t think there’s any bad faith. It’s just people who are new seeing all of the problems with the old implementation and thinking their way will be better.


That is bad faith. They're not trying to understand the systems that they feel they can take responsibility for trying to "fix".


That isn't what bad faith typically covers?

Bad faith is most easily seen when they hold the opposition to higher standards than they themselves can hit - in particular, on purpose. It is a smoke-and-mirrors dialogue that is not intended to make progress, but only to waste the other side's time.

In contrast, being wrong and/or not fully grasping the difficulties is normal. And sometimes you get lucky and unexpectedly make progress.


This is my core contention: negligence of bringing one's efforts to suitably address the problem is not necessarily intentional a la mens rea, but rather demonstrates that a person is not appropriately engaging their mental and cognitive faculties and in essence disrespects the entire process to which they've assigned themselves. These processes are bigger than individuals and it behooves one to engage in ways that make sense. This lack of attention constitutes bad faith in that it is the opposite of good faith engagement.

Also, they are assuming their participation is a net positive, which is a position that requires some amount of intention to take, so I disagree that intention as you've cast it is really a relevant perspective here. Bad faith is more than just a rhetorical debate tactic, it's a modus operandi with regard to how someone actually engages with the topic at hand.


I can see the similarities. That said, there is an advantage to allowing a bit of hubris in young teams. It is a gambit that can uncover absurd progress.

There is also a bit of the established problems obfuscating themselves, such that it is easy to see that many new workers have been given the runaround many times.


Chesterton's Fence problems?


Ultimately it's promotion-driven development.


Yeah I call it Resume Driven Development.


Exactly. "What's Kubernetes FOR?" "To the best I can tell, it's a jobs program for our industry."


Kubernetes is great for the problems it is designed for. It's fucking terrible for the other 99%.

Source: juggle lots of clusters full of things that shouldn't be in Kubernetes.


Too many microservices where 90% of the weight is service scaffolding, and 10% is actual meat/product.


I'm not sure why this is an issue with a long-running system. Business requirements change, knowledge changes, cost structures change, etc. Unfortunately the world isn't static. I'm not sure about you, but when the facts change I also try to change.


[flagged]


Yes, no one said they did that.


[flagged]


The "in 3 years" was meant as a forward-looking statement.

The post was being sarcastic that Uber will, in 3+ years' time, claim another victory based on abandoning the cloud.


Thanks - I had misread "in 3 years" as backward-looking ("for over 3 years...") rather than forward-looking ("3 years from now...").


They were making a joke about a hypothetical post in another three years.


[flagged]



Maybe, just maybe, this article isn't aimed at the end user but at the engineer building it?


[flagged]


Might want to tweak the prompt a bit


What did he say?


Profile/settings -> Showdead -> yes

Basically the wording just looks like GPT output



