My EC2 Wishlist (daemonology.net)
132 points by cperciva on June 16, 2021 | 48 comments



AWS has a very data driven culture, almost to a fault. They are extremely customer-centric, but almost too much so.

For any item on this list, if someone at AWS with the necessary authority thinks it's a good idea, they can't just find an engineer and have it implemented.

First, a PM has to talk to a bunch of customers of varying sizes to see if the feature would be useful to them and, if so, what their use cases are. Then they need to scope the project, write the press release, have it run through a committee of Principal Engineers, and come up with a spec. Then they need to take that spec back to the customers for feedback.

After all that, it will finally get onto a list to be added to the roadmap in the future. Then it will get onto a roadmap hopefully. And if it's not "next quarter", there is a really good chance it will fall back off that roadmap.

Then it will finally get implemented and launched in private beta for key customers who expressed interest. Then a wider beta. And only after all that will it get released for everyone else.

I understand the need for some of that process, since when it launches it will be instantly available to millions of customers and will need to scale. But there has to be a middle ground here. Like: customer asks for feature, engineer builds it and makes it available only to that customer with the caveat that it could break or change at any time. Then iterate from there, maybe slowly adding in new customers while at the same time talking to customers and integrating their feedback. And maybe that does happen, but I've been part of some pretty big customers, and I can't think of a single time we got a feature right away that went through major changes. The best we ever got was "special important customer" just before a wider beta. But by then the API was basically set.


AWS has established itself as being highly backwards compatible - and the cost of that seems to be the long-winded process you described. If they were more liberal with what features they implement then they would have to either be more liberal with what features they deprecate (removing a core value proposition of AWS: backwards compatibility) or they would have to suck it up and deal with the bloat which would naturally lead to higher cost and overall slower delivery of new services.

For a counter example see GCP. One of the biggest barriers to entry is trust: no one wants to use it because they don't trust Google to not arbitrarily yeet features out the window with six months notice. Whether or not that's actually what GCP does, it's how people perceive Google.

I'll take AWS's approach for things that I work on professionally and have to support for many years.


GCP seems to be generally better designed, though. Despite how conservative Amazon is described as being here about adding new features, I am often unimpressed by how poorly conceived many things within AWS are.


I totally trust Google to arbitrarily yeet features out the window with six months notice...


The process I described would not in any way harm backwards compatibility though. They can still run the same process, but in parallel, so that iteration can happen faster amongst people who care the most.


Will that not lead to bloat and/or more deprecation of services in the future?

I suppose you can tell people "hey this service shouldn't be relied on! We might get rid of it one day!". But that erodes trust - people just don't fully internalise what that means, they expect the same level of long term support from all AWS services. Soon the common sentiment will turn into "AWS changes its services all the time" when the truth is "AWS changes the services marked as experimental all the time".


If you only give access to a few customers I don't think that would be a problem. Especially if you purposely make breaking changes to the API during the alpha period to make it infeasible to use in production.

The key is to not give wide access. Google's problem was that anyone could turn on "labs" and use them, and then get really upset when they went away. A few customers getting really early access to a feature under NDA would not be nearly the same.


It really depends on how it is rolled out and managed. I have lived through, first hand, cases where you end up with two trains that become impossible to merge without breaking users of one or the other.


> engineer builds it and makes it available only to that customer with the caveat that it could break or change at any time.

This is likely why - you may get people to sign off on “break or change at any time” but the reality is once a customer is using it you’re going to have to support it indefinitely.


Implementing even relatively small features at Amazon takes quite a bit of effort, usually across multiple teams. The codebases are enormous, the number of services involved is inevitably at least a handful, the side effects of a change can be hard to reason about, etc. The idea that a single engineer is going to implement a feature for one customer in some really short timeframe is just not in the ballpark of reality.

And even if that were a fathomable process, one must consider the opportunity cost of assigning engineers to one-off feature work rather than on vetted roadmap features. This sort of thing is frequently done at small companies trying to save an important contract or similar, sometimes to great detriment. Amazon isn't a place where a single customer (apart from maybe US government) has such pull.


It sounds like a place where "hero" PMs who know how to navigate the system and attract buy-in would do well.


Isn't part of a PM's job generally to navigate the system and get buy in?


Depending on how navigable the system is that can be an afterthought, a major component, or their entire job, leaving the other responsibilities to fall by the wayside.


> They are extremely customer-centric

Is someone holding a gun to your head and making you say this?

What you described is exactly my experience interacting with AWS as a (sizable) customer, yes, but it's not remotely what I would consider customer centric. Bureaucracy-centric, perhaps, or bottom-line centric, but certainly not customer centric. Even getting AWS to do something for its own benefit is like pulling teeth. For customer benefit? Fugheddaboudit, unless you are huge and are willing to relentlessly hound them for years and throw gobs of cash at them and deal with the betas that somehow manage to be even more buggy, feature-anemic, and gotcha-landmined than their public-facing services.


My $0.02...

#1 Make EC2 Auto Scaling decisions faster - it's painfully out of sync with the AWS Console table. I don't mind paying for it. And while we're at it please can I get a metadata address REST API to signal CONTINUE?

#2 SSH keys are configured in cloud-init, there's no reason they can't be read from Secrets Manager and rotated out of the box.

#3 Put $HOSTNAME/$env:ComputerName in the EC2 Console table - weird software like Azure AD Application Proxy uses it for registration and I'm fed up using SSM to query all Windows instances to find an errant machine.

... and one more. Please make CloudWatch an out-of-the-box sink for cloud-init logs. SSM Agent doesn't start on Windows until fairly late (after other services), so it can be a pain reading logs out of %TEMP% if you've locked the network down and fully adopted SSM sessions/port-forwarding.

... and a final one - we need a "1-click WTF" button to help solve EBS/KMS issues. There's insufficient info in the UI. Dropping down to check the cli JSON output isn't a user-friendly solution. Azure does this better with their "Diagnose and solve problems" blade in App Service.


I was going to comment about whether you were aware of SSM after reading the ssh keys one, only to be surprised to see the rest of them reference SSM

We don't even add a break-glass KeyName since SSM is available on all client OSes we care about, Ansible has it as a "connection:" mechanism if we need to do things to a lot of instances, and kubelet for everything else

I hear you about the cloud-init output, though

I would love if they used some browser trickery to grab control-w while an SSM session was active, though, because muscle memory is a harsh mistress :-(


control-w is my least favourite twitch. I use it randomly, often. It's not too bad in a Google Doc because you just control-shift-t.


> #2 SSH keys are configured in cloud-init, there's no reason they can't be read from Secrets Manager and rotated out of the box.

If you didn't know it, you may be interested by ec2-instance-connect (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...). It's sadly officially only supported on Amazon Linux / Ubuntu but ephemeral ssh key authorization based on IAM has nice properties in terms of security / auditability / access control / revocation etc.


I would have thought them constraining it to Amazon Linux and (bizarrely) Ubuntu meant there was something inherent to those cloud AMIs which enabled such a thing, but this[0] sounds like just as much work as installing the SSM agent, and having the extra drag of needing to monkey with sshd_config (a fine way to lock oneself out if not careful)

0 = https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...


> #1 Make EC2 Auto Scaling decisions faster

Fuck yeah. The autoscaler is almost certainly a distributed, fault tolerant, highly available system that is deployed across multiple availability zones. Presumably they trade consistency for other attributes so there's an intentional delay to avoid making constant flapping changes. There's also the metrics granularity to consider. Take SQS for example, another distributed system. Many of the useful queue metrics are averaged over time, which adds a delay on top of the metric gathering period. The result is that a scale-to-zero ASG that scales based on SQS queue depth takes about 3 minutes to produce an EC2 instance from a cold state. If you're lucky.


While all the things you mentioned are correct, I would also like to point out one of the fundamental results from control theory here: The time period of the control loop must be proportionate to the time it takes for any changes to the controlled system to propagate, otherwise you will get wild oscillations (in the number of instances in this case).

Since it can take a few minutes for an EC2 instance to fully boot up, start up all the applications it needs and start serving traffic, you can't have a very fast autoscaler loop adding a new instance every second: it would start up way more instances than needed and eventually have to shut most of them down again, etc. This EC2 startup delay is the core problem, if you could reduce that interval the autoscaler could run way faster even with the usual distributed HA, CAP theorem and metrics concerns.
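
To make the oscillation argument concrete, here is a toy simulation. It is my own sketch, not how EC2 Auto Scaling actually works: it assumes a naive controller that only counts instances already serving traffic and ignores the ones still booting. `demand`, `boot_delay`, and `period` are made-up parameters measured in abstract ticks.

```python
def simulate(demand, boot_delay, period, ticks):
    """Return the peak number of instances that exist at once (booting + serving)."""
    in_service = 0
    pending = []  # tick at which each booting instance becomes ready
    peak = 0
    for t in range(ticks):
        # Instances whose boot has finished start serving traffic.
        in_service += sum(1 for ready_at in pending if ready_at <= t)
        pending = [ready_at for ready_at in pending if ready_at > t]

        if t % period == 0:
            # Naive controller: it only sees instances that are already serving.
            shortfall = demand - in_service
            if shortfall > 0:
                pending.extend([t + boot_delay] * shortfall)
            elif shortfall < 0:
                # Surplus instances are terminated immediately.
                in_service = demand

        peak = max(peak, in_service + len(pending))
    return peak
```

With a demand of 10 and a 120-tick boot, a loop that runs every tick peaks at 1200 instances, one that runs every 60 ticks peaks at 20, and one that runs every 120 ticks peaks at exactly 10. A real autoscaler would presumably count pending capacity, but the boot delay still bounds how quickly it can learn whether its last decision was enough.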


I haven't personally tried it, but I wonder if that's what "Warm Pools" is partially designed to help? https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-au...

I guess whether that helps the SQS situation depends on whether it is _the ASG_ decision time that's the problem, or the ASG requests an instance which itself takes multiple minutes to come into service


I suppose what I really need is a way of invoking certain operations through the ASG interface immediately, not at end of a long distributed async chain.


Since we're adding to the wishlist:

AMIs are a pain to publish: you need one for each region, and AWS keeps adding regions, and publishing is a challenge (you need to be a citizen to publish to the Gov regions, and you won't believe how hard it is to reliably publish to the China region, at least back in '18).

AMIs should be S3 blobs like GCP, and that makes everything simpler.

And, while we're at it, Key Pairs should allow elliptic curve (ssh-keygen -t ed25519 ...) instead of forcing us to use RSA.


> AMIs are a pain to publish

If the time urgency allows, use the CLI to poll for new regions once a day, then share the AMI to any regions added since yesterday.
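
A rough sketch of that daily job, in Python rather than the CLI. Everything here is illustrative: `client_for` is a hypothetical factory that returns an EC2 client for a region (with boto3 it could be `lambda r: boto3.client("ec2", region_name=r)`), and `already_copied` is whatever record you keep of regions handled on earlier runs. It assumes the client exposes `describe_regions` and `copy_image` shaped like boto3's.

```python
def list_regions(ec2):
    """Names of the regions this account can currently see."""
    return [r["RegionName"] for r in ec2.describe_regions()["Regions"]]


def copy_ami_to_new_regions(client_for, source_region, ami_id, name, already_copied):
    """Copy ami_id into every region not yet covered; return {region: new AMI id}."""
    current = list_regions(client_for(source_region))
    new_ids = {}
    for region in sorted(set(current) - set(already_copied) - {source_region}):
        # The copy is requested from the destination region's endpoint.
        resp = client_for(region).copy_image(
            Name=name,
            SourceImageId=ami_id,
            SourceRegion=source_region,
        )
        new_ids[region] = resp["ImageId"]
    return new_ids
```

The returned mapping is what you would feed into the per-region AMI lookup, since each copy gets its own ID.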


I've also built a script that does this (copies custom amis to the 4 regions we operate in, if missing), and I bet thousands of others have written similar tools.

Similarly, every CloudFormation/Terraform/etc. config has a region-to-AMI lookup involved, because they get different IDs in each region.

AWS could do something that takes a tiny fraction of the collective effort applied, and make all that work unnecessary. Having amis globally available once uploaded, with zero extra work, would be nice. I'm okay with the first launch in a new region taking an extra few seconds while it copies over for the first time.


I would appreciate serial console output that doesn't take 10x the boot time of an instance to propagate.

Seriously, 30s boot time, 300s (or more!) to see that serial console output via the GUI or API.

This is a PITA if you're using AWS for short-lived CI runners and relying on the console output to check the SSH host key of these newly spun-up nodes.


We have that now: There's an API option to say "give me the latest output instead of what you have cached".

Why that isn't the default, I have no idea.


I want something like a --follow option which acts like tail's follow option.

aws ec2 get-console-output --instance-id xx --follow

When debugging why an EC2 instance is having weirdness on boot, it's a pain to sit there refreshing logs constantly to see if it's got to that point yet.
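
Until something like that exists, a polling loop can fake it. This is only a sketch: `fetch` is a hypothetical callable that returns the full console text so far, and the rest just prints whatever is new since the previous poll. With boto3 I'd expect `fetch` to be roughly `lambda: ec2.get_console_output(InstanceId=instance_id, Latest=True).get("Output", "")`, but treat that wiring as an assumption.

```python
import time


def new_suffix(previous, current):
    """Return the part of current that was not already present in previous."""
    if current.startswith(previous):
        return current[len(previous):]
    # The buffer was truncated or rotated, so fall back to showing everything.
    return current


def follow(fetch, emit=print, interval=5.0, max_polls=None, sleep=time.sleep):
    """Poll fetch() and emit only newly appended output, like `tail -f`."""
    seen = ""
    polls = 0
    while max_polls is None or polls < max_polls:
        current = fetch() or ""
        chunk = new_suffix(seen, current)
        if chunk:
            emit(chunk)
        seen = current
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)
```

It is still polling, so it doesn't fix the propagation delay; it only saves the manual refreshing.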


Strangely, at least from an outside observer's point of view, one would think the user's interests and AWS's interests would overlap for that one -- it doesn't square with my mental model that answering repeated polling requests to an endpoint which surely knows the last emitted text is cheaper than leaving the connection open and using any number of fancy HTTP push mechanisms to send down updated content

Or, if we really wanted to get crazy: use the new SSM protocol and push down the console log that way, since they have obviously already invested a ton of engineering into the protocol and its scaling concerns, extra bonus++ that it then would work from the EC2 Console, too


Seconded (although I admit I haven't tried cperciva's suggestion). This would make it quicker to view the SSH fingerprint of a new instance. AWS has added a web-based SSH client, which is quicker than waiting for the serial console output, but you need to dig out the relevant command to show the fingerprint.

Aside: I may be one of very few people who bother to verify fingerprints. Why it seems to be the norm to hold SSH security to a much lower standard than HTTPS, by blindly trusting on first use, I don't know.

I recently tried out Microsoft Azure, which appears to offer no way to view the RDP thumbprint of a new Windows Server instance. It's not that it's buried somewhere in the web UI, it simply isn't there. (If I'm mistaken on that point, a correction would of course be welcome.)


Re: wish #3 with multiple IAM roles attached to an instance.

imds-filterd looks a bit like kube2iam, squinting a bit. There may be some not-too-terrible alternatives, considering that as prior art.

The first is the possibility of the daemon performing an assume-role. That is, the node has a role which allows it to assume roles a, b & c, and the metadata interceptor looks at the workload, assumes the appropriate role and returns credentials. This is a bit fiddly in terms of handling multiple concurrent requests, caching and races etc.

The second plausible option is that there is related functionality in AWS's replacement for kube2iam - IRSA (IAM roles for service accounts). This approach seems to be AWS's preferred approach for workload identity. It has a few more moving parts (needs an "OIDC Provider" which can just be a bucket) though.


Yep thought the same thing, reminds me of kube2iam, IRSA is a much better solution and obviously where AWS want to go. So yeah this one isn't going to get changed.


My wish: let me design my own instances. There are so many instance types and subtypes now, and comparing them is not straightforward (even with ec2instances.info).

The hardware can obviously be provisioned into all sorts of configs - that's got to be how these proliferated. Just give me the configurator and show me the price, please don't make me pick through all of these instances...


I think that the way these instances are classified is based on the bare metal they run on. So for instance an r4.xlarge is a single VM in an r4.metal server acting as a host.

So only r4 virtual instances (r4.xlarge, r4.2xlarge, etc...) divide up nicely on these r4.metal hosts. (RAM, CPU, Instance Stores, etc...)

If they made fully customizable instances then they could be using 100% of the host machine's RAM, CPU or whatever but only a fraction of the other factors. This causes a metal instance to be occupied entirely by a single VM which would make it cost a TON!

So in order to make customizable instances they would have to make a ton of differently shaped metal instances to match all the possible types of custom instances and make a best fit. In the end they end up with what they have now anyways.
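
The stranded-capacity point is easy to put numbers on. The host below is hypothetical (64 vCPUs and 512 GiB are round numbers, not real r4.metal specs), and the function just asks how much of each resource is left unusable once as many identical VMs as possible are packed onto one host.

```python
def stranded_fraction(host, vm):
    """Fraction of each host resource left unused after packing identical VMs."""
    # The scarcest resource decides how many VMs fit.
    fit = min(host[r] // vm[r] for r in vm)
    return {r: 1 - fit * vm[r] / host[r] for r in vm}


host = {"vcpu": 64, "ram_gib": 512}            # hypothetical host
fixed_shape = {"vcpu": 4, "ram_gib": 32}       # divides the host evenly
custom_shape = {"vcpu": 4, "ram_gib": 512}     # all of the RAM, few of the cores

print(stranded_fraction(host, fixed_shape))    # {'vcpu': 0.0, 'ram_gib': 0.0}
print(stranded_fraction(host, custom_shape))   # {'vcpu': 0.9375, 'ram_gib': 0.0}
```

In the second case one VM occupies the whole host while 60 of its 64 cores sit idle, which is the cost someone has to pay for.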

What would be cool is if Amazon would let you order custom hardware for bare metal to go in their racks. They acquire and set up the server and load their host image. (Sort of an EC2 Colocation service.) Then let you divide it up as you see fit into virtual machines. Sort of a (Build your own instance class) thing.


For that last bit it’s not as crazy as you might think since you can already buy dedicated hardware and you can manage your own hardware on-prem with Outposts. If you smush them together you get something kinda neat.


Agreed. Google Cloud seems to do a better job of this, you can simply pick how many CPUs your VM will have.

For me personally, it's the jump from c5d.4xlarge (16 core) to c5d.9xlarge (36 core) that's a bit hard to work around, but there are surely others.


  - ed25519 key support
  - ability to add more than one MFA token to a single account (I am TRYING to protect my root account, but it won't let me!)


root & IAM login page UX is horrendous. e.g. it doesn't remember your browser (prompting MFA every time), no control for login TTL, too easy to lock yourself out, MFA out of sync


If you set up Amazon SSO for sign-in, it offers some of these features (multi-device MFA, session TTL, etc).


I'm not plugged into EC2 nearly enough to know the exact reason why #3 isn't a thing. But IAM has been around for a long time and is showing its age as AWS continues to push out new services.

A lot of rough edges around services boil down to 'we use IAM as a dependency, and IAM is a giant bottleneck.' I wouldn't be surprised if having multiple instance profiles on a machine is one of those things that is bottlenecked by IAM. Allowing multiple roles on a single EC2 machine would put a lot of additional strain on IAM's backend.


A workaround idea for "Attaching multiple IAM Roles to an EC2 instance": attach a single role which is allowed to assume all the roles that you need. Then have a master process that has access to the instance role credentials, and have that process feed/rotate the credentials of the assumed roles to other processes.
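
A minimal sketch of the credential-caching half of such a master process. The class name, the five-minute refresh margin, and the session name are all made up for illustration. It assumes an STS client whose `assume_role` returns a `Credentials` dict with a timezone-aware `Expiration`, which is how boto3 shapes it. How the credentials get handed to the other processes, and how a request is mapped to a role, is left out.

```python
import threading
from datetime import datetime, timedelta, timezone


class RoleCredentialCache:
    """Hand out assumed-role credentials, refreshing them shortly before expiry."""

    def __init__(self, sts, refresh_margin=timedelta(minutes=5), now=None):
        self._sts = sts
        self._margin = refresh_margin
        self._now = now or (lambda: datetime.now(timezone.utc))
        self._lock = threading.Lock()
        self._cache = {}  # role ARN -> Credentials dict

    def credentials_for(self, role_arn, session_name="role-broker"):
        # One lock serialises callers, so concurrent requests for the same
        # role trigger a single AssumeRole call rather than a stampede.
        with self._lock:
            cached = self._cache.get(role_arn)
            if cached and cached["Expiration"] - self._now() > self._margin:
                return cached
            creds = self._sts.assume_role(
                RoleArn=role_arn,
                RoleSessionName=session_name,
            )["Credentials"]
            self._cache[role_arn] = creds
            return creds
```

A single global lock is the simplest way to avoid races; a per-role lock would let different roles refresh in parallel at the cost of more bookkeeping.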


Re: SSM parameters, my wish is for AWS to implement resource policies in the same way KMS, S3, etc have. The public use case wish is nice, but I'd like to be able to share params within my org as well.

Right now we copy params (AMI IDs) to SSM Param Store in hundreds of accounts. It would be nice to write them once and open them up via policy.


The inability to create custom ELB security policies is a frequent pain I run into. Customers I work with want to hand-pick the supported TLS cipher suites to conform with internal security policies rather than take one of the pre-defined AWS ones, which tend to support the odd "weak" cipher.


Maybe the new ALB and NLB don't support it, I didn't go digging into their docs, but the "normal" ELBs allege to be able to do what you're describing: https://docs.aws.amazon.com/elasticloadbalancing/latest/clas... (search for "To use a custom SSL security policy" because it's not an HTML anchor :-(


Wait, the bidirectional UART console has arrived? Hm, I might've seen that news but forgotten about it.



It’ll get sorted out now it’s on hacker news.



