The REP prefixes are the most common; they just let you repeat the same instruction a variable number of times, taking the count from the CX/ECX/RCX register. This makes many common loops really, really short, especially for moving data around in memory. The memcpy function is often inlined as a single REP MOVS instruction, possibly with an instruction to copy the count into the count register if it isn't already there.
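For a concrete picture, an inlined copy might look something like this (a minimal sketch in Intel syntax; dest, src, and count are placeholders, and in the SysV x86-64 ABI the pointers already sit in RDI/RSI, so often only the count needs moving):

    mov rdi, dest      ; destination pointer
    mov rsi, src       ; source pointer
    mov rcx, count     ; the count register: CX, ECX, or RCX depending on address size
    rep movsb          ; copy RCX bytes from [RSI] to [RDI] (with the direction flag clear)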
I suppose the REX prefix is pretty common too (REX.W selects the 64-bit operand size), since 64-bit programs will want to operate on 64-bit values and addresses pretty frequently.
None of the prefixes toggle things that can be set globally, by the BIOS or otherwise. They all just specify things that the next instruction needs to do.
The ModR/M and SIB prefixes are probably the most common prefixes in instructions. They are so common that assemblers elide their existence when you read code. REX is in the same boat: so common that it's usually elided. The VEX prefix is also really common (all of the V* AVX instructions, like VMOVDQ), and then the LOCK prefix (all atomics).
After all of those, REP is not that uncommon of a prefix to run into, although many people prefer SIMD memcpy/memset to REP MOVSB/REP STOSB. It is slightly unusual.
This isn't correct.
ModR/M and SIB are not prefixes. They are suffixes, and essentially part of the core instruction encoding for certain memory and register access instructions; they are the primary means of encoding the myriad addressing modes of the x86. And their existence is not elided in any meaningful way: their value is explicitly derived from the instruction operands (SIB is scale, index, base), so when you see an instruction like:
mov BYTE PTR [rdi+rbx*4],0x4
The SIB byte is determined by rdi, rbx, and the scale factor 4, all right there in the instruction. Likewise, ModR/M encodes the addressing mode, which is clear from the operands in the assembler listing. Though x86 is such a mess that there are cases where you can encode the same instruction in either a ModR/M form or a shorter form, e.g. PUSH/POP.
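To make the mapping concrete, here is how that mov encodes, to the best of my reading (easy to verify with an assembler):

    mov BYTE PTR [rdi+rbx*4], 0x4   ; c6 04 9f 04
    ; c6   opcode (MOV r/m8, imm8), with /0 in the ModR/M reg field
    ; 04   ModR/M: mod=00, reg=000, r/m=100, i.e. "a SIB byte follows, no displacement"
    ; 9f   SIB: scale=10 (*4), index=011 (rbx), base=111 (rdi)
    ; 04   the immediate 0x4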
REX is a prefix, but it is a bit special as it must be the last one, and repeats are undefined. It is not elided because of commonality but because its presence and value are usually implied by the operands; it is therefore redundant to list it.
For instance, PUSH R12 must use a REX prefix (REX.B with the one byte encoding).
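For comparison (bytes from memory; objdump will confirm):

    push rbx    ; 53       the one-byte form, opcode 50+reg
    push r12    ; 41 54    same form, but r12 needs REX.B (0x41) to extend the register field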
More specifically, they're affixed to certain opcodes that require them. There are a number of byte-sized opcodes that do not require a ModRM or SIB byte (although a number of those got gobbled up to make the REX prefix, but that's another story).
There's a good reason for using vector instructions over REP: Until relatively recently that was how you got maximum performance in small, tight loops. REP is making a comeback precisely because of ERMS and FSRM, so unfortunately this will become a bigger problem going forward.
REP prefixes are pretty rare. Depending on the compiler, they are either used sparingly for a few specific operations (like rep movsd for memcpy) or not at all.
The most common prefixes by far are REX prefixes in x86-64 (64-bit) assembly (don't believe me? Look at 64-bit code in vim and see all those `H` letters around. That's REX.W, and byte 0x48 is ASCII 'H'). Segment override prefixes are another class of prefixes that are used in handwritten assembly (in bootloaders or special runtime functions) but almost never emitted by compilers.
In older code, the most common prefixes are 0x66 (it doesn't even have a mnemonic, there's no way to emit it directly) and maybe 0x67.
They are all used to modify some aspect of the next instruction's execution; for example, the "default" operand size is 32-bit, but you can change it with a prefix. I think an example will help.
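For instance, here's a tiny sketch (bytes from memory; easy to check with an assembler or objdump):

    mov eax, ebx    ; 89 d8       default 32-bit operand size
    mov rax, rbx    ; 48 89 d8    same instruction with REX.W (0x48, ASCII 'H') widening it to 64 bits
    mov ax, bx      ; 66 89 d8    the 0x66 operand-size prefix shrinks it to 16 bits instead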
"Prefixes" in this case mostly expand the instruction encoding space.
So rarely-used addressing modes get a "segment prefix" that causes them to use a segment other than DS. Or x86_64 added a "REX" prefix that added more bits to the register fields allowing for 16 GPRs. Likewise the "LOCK" prefix (though poorly specified originally) causes (some!) memory operations to be atomic with respect to the rest of the system (c.f. "LOCK CMPXCHG" to effect a compare-and-set).
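A minimal compare-and-set sketch along those lines (Intel syntax; the register roles here are my own choice, not a fixed convention):

    ; rdi = pointer, rax = expected value, rdx = desired new value
    lock cmpxchg [rdi], rdx   ; atomically: if [rdi] == rax, store rdx; else load [rdi] into rax
    sete al                   ; ZF reports success, so al = 1 if the swap happened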
All these things are operations other CPU architectures represent too, though they tend to pack them into the existing instruction space, requiring more bits to represent every instruction.
Notably the "REP" prefix in question turns out to be the one exception. This is a microcoded repeat prefix left over from the ancient days. But it represents operations (c.f. memset/memmove) that are performance-sensitive even today, so it's worthwhile for CPU vendors to continue to optimize them. Which is how the bug in question seems to have happened.
x86 was designed in 78, basically for the purpose of running a primitive laser printer (or other similar workloads). The big problem with this is that the encoding space for instructions was "efficiently utilized". When new instructions, or worse, additional registers were later added, you had to fit the new instruction variants in somehow, and you did this by tacking on prefixes.
Nah, x86 goes even earlier in its heritage - it was, effectively, a bolt-on on Intel's way older designs, as a huge part of the 8086 was being ASM source-compatible with the older 8xxx chips, even as the instruction set itself changed [1]. What utterly amazes me is that the original 8086 was mostly designed by hand by a team of not even two dozen people - and today, we got hundreds if not thousands of people working on designing ASICs...
Acckkghtually, if you go back far enough you end up at the Datapoint 2200. If you want to understand where some of the crazier parts of the 8086 originate from, Ken Shirriff has a nice read: http://www.righto.com/2023/08/datapoint-to-8086.html
> x86 was designed in 78, basically for the purpose of running a primitive laser printer
It's interesting that ASCII is transparently just a bunch of control codes for a physical printer/typewriter, combining things like "advance the paper one line", "advance the paper one inch", "reset the carriage position", and "strike an F at the carriage position", all of which are different mechanical actions that you might want a typewriter to do.
But now we have Unicode, which is dedicated to the purpose of assigning ID numbers to visual glyphs, and ASCII has been interpreted as a bunch of glyph references instead of a bunch of machine instructions, and there are the control codes with no visual representation, sitting in Unicode, being inappropriate in every possible way.
It's kind of like if Unicode were to incorporate "start microwave" as part of a set with "1", "2", "3", etc.
ASCII was used by teletypes, not typewriters. They were "cylinder" heads, as compared to IBM's golfball typewriters.
The endless CR/LF/CRLF line-ending problem would have been solved if the RS (Record Separator) ASCII code had been used instead of the physical CR (carriage return, i.e. move the print head back to the start of the line) and LF (line feed, i.e. rotate the paper up one line).
But Unix decided on LF, Apple used CR, Windows used CRLF, and even today, I had to get a guy to stop setting his system to "Windows" because he was screwing up a git repo with extraneous CRs.
It's just because x86 as an ISA has accreted over the course of 40+ years, and has variable-length instructions. Every time they extend the ISA they carve out part of the opcode space to squeeze in a new prefix. This will only continue, considering that Intel has proposed another new scheme this year.
You got some great answers already, but to your first point check out Hennessy and Patterson's books, namely Computer Architecture: A Quantitative Approach and Computer Organization and Design.
The latter is probably more suited to you unless you wanna go on a dive into computer architecture itself. There are older editions available for free (authorized by the authors) on the web.
I first read the 3rd edition of Computer Architecture and, besides it being one of the clearest textbooks I've ever read, it vastly improved my understanding of what's going on in there in relation to OoO speculative execution, etc.
That's a very poor summary of what prefixes are. My advice, just skip the original article which isn't very good or interesting and read taviso's blog that is linked in the top comment (it gives a few concrete examples of these prefixes). They are modifiers that are part of the CPU instruction.
I disagree. We’ve seen what happens when titles have max context: people don’t click the link and they polish their semi adjacent hobby horses in the comments as they would a tweet.
HN goes for a middle ground that promotes intellectual curiosity and link clicking. If you refuse to click the link for obscure titles at least you’re stuck replying to those who did click the link and that’s still better than what we have on the rest of the internet.
Submissions that don’t have the payoff to justify more obscure, whimsical titles fall off the first page unlike this one.
This is very well written. I know little about assembly programming and Intel's ISA, let alone their microarchitectures, but I could follow the explanation and feel like I have a rough understanding of what is going on here.
If the problem really is that the processor is confused about instruction length, I'm impressed that this problem can be fixed in microcode without a huge performance hit: my intuition (which could be totally wrong) is that computing the length of an instruction would be something synthesized directly to logic gates.
Actually, come to think of it, my hunch is that the uOP decoder (presumably in hardware) is actually fine and that the microcoded optimized copy routine is trying to infer things about the uOP stream that just aren't true --- "Oh, this is a rep mov, so of course I need to go backward two uOPs to loop" or something.
I expect Intel's CPU team isn't going to divulge the details though. :-)
I don't understand "ERMS" and "FSRM" and there seems to be nothing good on google about it.
Are these just CPUID flags that tell you that you can use a rep movsb for maximum performance instead of optimized SSE memcpy implementations? Or is it a special encoding/prefix for rep movsb to make it faster? In case of the latter, why would that be necessary? How does one make use of FSRM?
Found this [1], which also links to the Intel Optimization Manual [2].
Seems like ERMS was a cheaper replacement for AVX and FSRM was a better version, for shorter blocks.
> Cheapest versions of later processors - Kaby Lake Celeron and Pentium, released in 2017, don't have AVX that could have been used for fast memory copy, but still have the Enhanced REP MOVSB. And some of Intel's mobile and low-power architectures released in 2018 and onwards, which were not based on SkyLake, copy about twice more bytes per CPU cycle with REP MOVSB than previous generations of microarchitectures.
> Enhanced REP MOVSB (ERMSB) before the Ice Lake microarchitecture with Fast Short REP MOV (FSRM) was only faster than AVX copy or general-use register copy if the block size is at least 256 bytes. For the blocks below 64 bytes, it was much slower, because there is a high internal startup in ERMSB - about 35 cycles. The FSRM feature intended blocks before 128 bytes also be quick.
FSRM is just the name of a cpu optimization that affects existing code.
Choosing optimal instructions and scheduling can be done statically at compile time or dynamically (by choosing one of several library functions at runtime, or by JITting).
In order to detect the optimal instruction selection at runtime you need to know the actual CPU. You could have a table of all CPU models, or you could just ask your OS whether the CPU you run on has that optimization implemented.
Linux had to be patched so that it can _report_ that a CPU does implement that optimization.
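If you want to check for yourself rather than asking the OS, both bits live in CPUID leaf 7 (a sketch; the bit positions, as I recall from the SDM, are ERMS = EBX bit 9 and FSRM = EDX bit 4):

    mov eax, 7          ; CPUID leaf 7: structured extended feature flags
    xor ecx, ecx        ; sub-leaf 0
    cpuid
    bt  ebx, 9          ; carry flag = ERMS (Enhanced REP MOVSB/STOSB)
    setc r8b            ; stash it somewhere
    bt  edx, 4          ; carry flag = FSRM (Fast Short REP MOV)
    setc r9b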
As described it's just a CPU crash exploit that requires local binary execution. Getting to a vulnerability would require understanding exactly how the corrupted microcode state works, and that seems extremely difficult outside of Intel.
It's not super-valuable yet, but it would let you mount a really nasty DoS on cloud providers by triggering hard resets of the physical machines. Some people would probably pay for that, though it's obviously more interesting to push on privilege escalation or exfiltration.
Particularly since the MCEs triggered could prevent an automatic reboot. Would depend what the hardware management system did - do machines presenting MCEs get pulled?
If I'm a cloud provider and somebody's workflow is hard resetting lots of my physical machines, I'm going to give them free access to single tenant machines at the very minimum. If they keep crashing the machines that only they run on, I guess that's ok.
You can exploit this from a single core shared instance.
So you go and find yourself a thousand cheap / free tier accounts, spin up an instance in a few regions each, and boom, you've taken out 10k physical hosts. And run it in a lambda at the same time, and see how well the security mechanisms identify and isolate you.
Causing a near simultaneous reboot of enough hosts is likely to take other parts of the infrastructure down.
I'm curious what part of this scheme involves "not ending up in jail"? Needless to say you can't do this without identifying yourself. To make this an exploitable DoS attack you need to be able to run arbitrary binaries on a few thousand cloud hosts that you didn't lease yourself.
> I'm curious what part of this scheme involves "not ending up in jail"? Needless to say you can't do this without identifying yourself.
Stolen credit cards are a dime a dozen, and nation state actors can just use their domestic banks or agents in the banks of other countries in a pinch to deflect blame or lay false trails.
If I were Russia or China, I'd invest a lot of money into researching all kinds of avenues on how to take out the large three public cloud providers if need be: take out AWS, Google, Microsoft and on the CDN side Cloudflare and Akamai and suddenly the entire Western economy grinds to a halt.
The only ones who will not be affected are the US government cloud services in AWS, as those run separately from other AWS regions - that is, unless the attacker gets access to credentials that allow them to execute code in the GovCloud regions...
> If I were Russia or China, I'd invest a lot of money into researching all kinds of avenues on how to take out the large three public cloud providers
This subthread started with "is this issue a valuable exploit". Needless to say, if you need to invoke superpower-scale cyber warfare to find an application, the answer is "no". Russia and China have plenty of options to "take out" western infrastructure if they're willing to blow things up[1] at that scale.
Countries have proven far more reticent to use kinetic options vs. cyberattacks. Or, put differently, we're all hacking each other left and right and the responses have thus far mostly remained in the digital realm.
> Or, put differently, we're all hacking each other left and right and the responses have thus far mostly remained in the digital realm.
Which is both good and bad at the same time. Cyber warfare has been significantly impacting our economies and our citizens - anything from scam call centers through ransomware to industrial espionage - to the tune of many dozens of billions of dollars a year. And yet, no Western government has ever held the bad actors publicly accountable, which means that they will continue to be a drain on our resources at best and a threat to national security at worst (e.g. the Chinese F-35 hack).
I mean, I'm not calling for nuking Beijing, that would be disproportionate - but even after all that's happened, Russia and China are still connected to the global Internet, no sanctions, nothing.
If clouds use shared servers to run their management workloads and if very important companies use shared servers to run their workloads, they would deserve it.
But I don't believe it. People are not that stupid.
> If clouds use shared servers to run their management workloads and if very important companies use shared servers to run their workloads, they would deserve it.
Why target the management plane? Fire off payloads to take down the physical VM hosts and suddenly any cloud provider has a serious issue because the entire compute capacity drops.
I mean, you kinda can. There's a depressingly thriving market for stolen cards and things like compromised accounts. A card is a couple of dollars. There are many jurisdictions that turn a blind eye to hacking US companies. Look at how hard it's been to rein in the ransomware gangs and even 'booter' (DDoS-for-rent) services.
DoS isn't as lucrative as other things; I assume that most state actors would far prefer to find a way to turn this into a privilege escalation. But being able to possibly take out a cloud provider for a while is still monetizable.
The blogpost describes that unrelated sibling SMT threads can become corrupted and branch erratically. If you can get a hypervisor thread executing as your SMT sibling and you can figure out how to control it (this is not an if so much as a when), that's a VM escape. The Intel advisory acknowledges this too when they say it can lead to privilege escalation. This is hardly a useless bug, in fact it's awfully powerful!
Intel themselves said it could lead to privilege escalation and a friend of mine (who coincidentally was responsible for this Intel-related talk: https://youtu.be/Zda7yMbbW7s) already managed to get privilege escalation with it, though I’m not sure if he’ll want to share any details, at least for now.
It’s anything but a minor bug, and anyone who claims otherwise clearly hasn’t worked with CPUs.
This assumes that either 1. partners and interested state-sponsored actors aren't kept abreast of Intel's microcode backend architecture, or 2. that there hasn't been at least one leak of this information from one of these partners into the hands of interested APT developers. I wouldn't put strong faith in either of these assumptions.
It does, but the same is true for virtually any such crash vulnerability. The question was whether this was a "valuable exploit", not whether it might theoretically be worse.
The space of theoretically-very-bad attacks is much larger than practical ones people will pay for, c.f. rowhammer.
>> Getting to a vulnerability would require understanding exactly how the corrupted microcode state works, and that seems extremely difficult outside of Intel.
Intel knows exactly how their ROB works.
Therefore Intel knows the possible consequences of this bug and how to trigger them.
If there is a privilege escalation path from this, Intel knows. And anyone Intel chose to share it with knew.
Thankfully, since it's public now, the value of that decreases and customers can begin to mitigate.
> If there is a privilege escalation path from this, Intel knows. And anyone Intel chose to share it with knew.
No, or at least not yet. I mean, I've written plenty of bugs. More than I can count. How many of them were genuine security vulnerabilities if properly exploited? Probably not zero. But... I don't know. And I wrote the code!
Did they confirm that it can definitely be used for escalation? The description I saw was "may allow an authenticated user to potentially enable escalation of privilege and/or information disclosure and/or denial of service via local access" which sounds like they're covering all their bases and may not actually know what is and isn't possible.
> Sequence of processor instructions leads to unexpected behavior for some Intel(R) Processors may allow an authenticated user to potentially enable escalation of privilege and/or information disclosure and/or denial of service via local access.
so basically you're saying that the cpu frontend missed the opportunity to ignore the 0x90 because it was an actual instruction which would be converted into an actual nop uop?
Is this still the case or modern intel CPUs optimize out the nop in the frontend decoder?
Some compiler writers thought that was the case, if [0] is related to OP. I don't have a "modern" (after 6th gen) Intel CPU to test it on, but note that most programs are compiled for a relatively generic CPU.
"Looking in the old AMD optimisation guide for the then-current K8 processor microarchitecture (the first implementation of 64bit x86!), there is effectively mention of a “Two-Byte Near-Return ret Instruction”.
The text goes on to explain in advice 6.2 that “A two-byte ret has a rep instruction inserted before the ret, which produces the functional equivalent of the single-byte near-return ret instruction”.
It says that this form is preferred to the simple ret either when it is the target of any kind of branch, conditional (jne/je/...) or unconditional (jmp/call/...), or when it directly follows a conditional branch.
Basically, when the next instruction after a branch is a ret, whether the branch was taken or not, it should have a rep prefix.
Why? Because “The processor is unable to apply a branch prediction to the single-byte near-return form (opcode C3h) of the ret instruction.” Thus, “Use of a two-byte near-return can improve performance”, because it is not affected by this shortcoming."
...
" If a ret is at an odd offset and follows another branch, they will share a branch selector and will therefore be mispredicted (only when the branch was taken at least once, else it would not take up any branch indicator %2B selector). Otherwise, if it is the target of a branch, and if it is at an even offset but not 16-byte aligned, as all branch indicators are at odd offsets except at byte 0, it will have no branch indicator, thus no branch selector, and will be mispredicted.
Looking back at the gcc mailing list message introducing repz ret, we understand that previously, gcc generated: nop, ret
But decoding two instructions is more expensive than the equivalent repz ret.
The optimization guide for the following AMD CPU generation, the K10, has an interesting modification in the advice 6.2: instead of the two byte repz ret, the three-byte ret 0 is recommended
Continuing in the following generation of AMD CPUs, Bulldozer, we see that any advice regarding ret has disappeared from the optimization guide."
TLDR: Blame AMD K8! First x64 CPU. This GCC optimization is outdated and should only be used when specifically optimizing for K8.
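For reference, the encodings being discussed (bytes from memory, easy to confirm with an assembler):

    ret        ; c3         single-byte near return, the form K8 couldn't predict well
    rep ret    ; f3 c3      "repz ret": same semantics, two bytes, predictor-friendly on K8
    ret 0      ; c2 00 00   the K10-era suggestion: a near return that pops 0 extra bytes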
On modern Intel CPUs, I am led to believe, issuing nops is actually slower than adding prefixes. I think there is work in the backend updating retired-instruction counters and other state which still occurs for nops, but decoding prefixes happens entirely in the front end.
When a nop truly is necessary you will see compilers and performance engineers add prefixes to the nop to make it the desired size.
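For example, a sketch of the usual ladder (exact choices vary by toolchain; these match the multi-byte NOP forms recommended in the Intel manual, as far as I recall):

    nop                         ; 90              1 byte
    xchg ax, ax                 ; 66 90           2 bytes: 0x66 prefix on the same nop
    nop dword ptr [rax]         ; 0f 1f 00        3 bytes: the "long nop" opcode 0f 1f
    nop dword ptr [rax+rax+0]   ; 0f 1f 44 00 00  5 bytes: ModRM/SIB/disp8 stretch it; 66 prefixes stretch it further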
Is it even possible to design a CPU with out-of-order and speculative execution that would have no security issues? Does the future lead to a swarm of disconnected A55 cores, each running a single application?
This vulnerability was not caused by OoO or speculative execution. It was caused by the fact that x86 was designed 45 years ago, and has had feature after feature piled on the same base, which has never been adequately rebuilt.
The more proximate cause is that some instructions with multiple redundant prefixes (which is legal, but pointless) have their length miscalculated by some Intel CPUs, which results in wrong outcomes.
> It was caused by the fact that x86 was designed 45 years ago, and has had feature after feature piled on the same base, which has never been adequately rebuilt.
Itanic would like to object! Unfortunately it can’t get through the door.
A more sensible approach for that use-case would be IMO to have well-defined specialized prefixes for padding, instead of relying on the case-by-case behavior of redundant prefixes. (However I understand that there's almost certainly a good historical reason why this was not the way it was done)
The easiest way of doing padding is to add a bunch of `nop` instructions which are one byte each.
If you read the manual, Intel encourages minor variations of the `nop` instructions that can be lengthened into different number of bytes (like `nop dword ptr [eax]` or `nop dword ptr [eax + eax*1 + 00000000h]`).
It is never recommended anywhere in my knowledge to rely on redundant prefixes of random non-nop instructions.
Note that this technique is really only legitimate where the used prefix already has defined behavior with the given instruction ("Use of repeat prefixes and/or undefined opcodes with other Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable behavior."), and of course the REX prefix has special limitations. The key is redundant, not spurious. It is not a good idea to be doing rep add for example. But otherwise, there is no issue.
The prefixes are redundant so it's not really case-by-case behavior. You're just repeating the prefix you would be using anyway in that location.
Using specialized prefixes wastes encoding space for no real gain.
You realize that on most common processors NOP itself is a pseudo-instruction? Even on the apparently meme-worthy (see sibling comment) RISC-V, it's ADDI x0, x0, 0.
Usually, the historical reason is that adding the logic to do something well-defined when unexpected prefixes are used is going to cost ten more transistors per chip, which is going to add to cost to handle a corner case that almost nobody will try to be in anyway. Far better to let whatever the implementation does happen as long as what happens doesn't break the system.
The issue here is their verification of possible internal CPU states didn't account for this one.
(There is, perhaps, an argument to be made that the x86 architecture has become so complex that the emulator between its embarrassingly stupid PDP-11-style single-thread codeflow and the embarrassingly parallel computation it does under the hood to give the user more performance than a really fast PDP-11 cannot be reliably tested to exhaustion, so perhaps something needs to give on the design or the cost of the chips).
Both approaches are viable, but RISC-V's approach is better, as it provides higher code density without imposing a significant increase in complexity in exchange.
Higher code density is valuable. E.g.:
- The decoders can see more by looking at a window of code of the same size, or we can have a narrower window.
- We can have less cache and save area and power. We can also clock the cache higher, enabled by it being smaller, lowering latency cycles.
- Smaller binaries or rom image.
Soon to be available (2024) large, high performance implementations will demonstrate RISC-V advantages well.
Well, the bug in this specific case (based on the article by Tavis O. linked elsewhere in comments) looks to be the regular kind -- probably an off-by-one in a microcode edge case. That is, here it's not the case that the CPU functions correctly but leaves behind traces of things that should be private in timing side channels, as was the case for Spectre.
I think formal methods could help in designing such a machine, if you can write a mathematical statement that amounts to "there is no side channel between A and B".
Or at least put a practical bound on how many bits per second, at most, you can extract from any such side channel (the reasoning being, if you can get at most a bit for each million years, you probably don't have an attack).
Then you verify if a given design meets this constraint
Formal methods are widely used in processor design. It is hard to formalize specs that assert that bugs we haven't thought about don't exist. At least hard while also preserving the property of being a Turing machine.
I know. I mean applying formal methods to this specific problem of proving side channels don't exist (which seems a very hard thing to do and might even require modifying the whole design to be amenable to this analysis)
As a tidbit, this was part of how one of the teams involved in the original Spectre paper found some of the vulnerabilities. Basically the idea was to design a small CPU that could be formally shown to be free of certain timing attacks. In the process they found a bunch of things that would have to change for the analysis to work... maybe in a small system those wouldn't actually lead to vulnerabilities, but they couldn't prove it (or it would require lots of careful analysis). And in big systems, those features do lead to vulnerabilities.
I'm not sure it ever got built! The Spectre stuff was found during the "how would we even begin to do this" phase. I've seen a fair amount of academic work about formally verifying RISC-V cores though.
What would be the typical size of such a constraint-based problem, and do we have the compute power to translate the rules into an implementation? And what if one forgot a rule somewhere… Deeply interesting subject.
I think you'd want it to be a theorem (in Lean, Coq, Isabelle/HOL or whatever) instead of a constraint problem. So it would be more limited by developer effort than by computational power.
Theoretically you can do this from software down to (idealized) gates, but in practice the effort is so great that it's only been done in extremely limited systems.
The REX prefix is redundant for 'movsb', but not 'movsd'/'movsq' (moving either 32- or 64-bit words, depending on the prefix). That may have something to do with the bug, if there is any shared microcode between those instructions?
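Concretely (my reading of the encodings):

    rep movsb    ; f3 a4       byte copy; a REX.W here would be redundant
    rep movsd    ; f3 a5       opcode a5 with the default 32-bit operand size
    rep movsq    ; f3 48 a5    same opcode a5, with REX.W selecting 64-bit elements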
Benchmarking is always problematic -- what is a good representative workload? All the same, I'd be curious if the ucode update that plugs this bug has affected CPU performance, eg, it diverts the "fast short rep move" path to just use the "bad for short moves but great for long moves" version.
In the article by Tavis O. linked elsewhere in comments, he suggests disabling the FSRM CPU feature only as an expensive workaround to be taken only if the microcode can't be updated for some reason. That suggests to me that he, at least, expects the update to do better.
That would be the conservative thing to do. If there's no limit on microcode updates, if I was Intel, I'd consider doing that first and then speeding it up again later. Based on the 5-second guess that people who update everything regularly will care that we did the right thing for security, and people who hate updates won't be happy anyway, so at least the first update will be secure if they never get the next one.
(I think there is a limit on microcode, they seem conservative to release new ones - I don't remember the details)
It's a shame that Google didn't publish numbers. They have very good profiling across all of their servers and probably have incredibly high confidence numbers for the real-world impact on this. (Assuming that your world is lots of copying protocol buffers in C++ and Java)
I've heard Intel does use TLA+ extensively for specifying their designs and verifying their specs. But TLA+ specs are extremely high-level, so they don't capture implementation details that can lead to bugs. And model checking isn't a formal proof, only (tractably small) finite state spaces can be checked with TLC. And even there, you're only checking the invariants you specified.
That said, I'm sure there's some verification framework like SPARK for VHDL, and this feels like exactly the kind of thing it should catch.
Formal methods have been used in CPU design for nearly 40 years [1] but not yet for everything, and the methods tend to not have "round-trip-engineering" properties (e.g. TLA+ is not actually proving validity of the code you will run in production, just your description of its behavior and your idea of exhaustive test cases).
> I’ve written previously about a processor validation technique called Oracle Serialization that we’ve been using. The idea is to generate two forms of the same randomly generated program and verify their final state is identical.
> I found this bug by fuzzing, big surprise [..] In fact, vendors fuzz their own products extensively - the industry term for it is Post-Silicon Validation.
This is such an interesting read, right in the league of "Smashing the stack" and "row hammer". As someone with very little knowledge of security I wonder if CPU designers do any kind of formal verification of the microcode architecture?
Nice find. That indeed sounds terrible for anyone executing external code in what they believe to be sandboxes. Good thing it can be patched (and AFAICT, it seems to be a good fix, rather than a performance-affecting workaround.)
x86 has a built-in memory copy instruction, provided by the combination of the movsb instruction and a rep prefix byte, which says you want the instruction to run in a loop until it runs out of data to copy. This is "rep movsb". This instruction is fairly old, meaning a lot of code still has it, even though there are faster ways to copy memory in x86.
Intel added two features to modern x86 chips that detect rep movsb and accelerate it to be as fast as those other ways. However, those features have a bug. You see, because rep is a prefix byte, you can just keep adding more prefix bytes to the instruction (up to the 15-byte total instruction length limit). x86 has other prefix bytes too, such as rex (used to access registers r8-r15), vex, evex, etc. The part of the processor that recognizes a rep movsb does NOT account for these other prefix bytes, which makes the processor get confused in ways that are difficult to understand. The processor can start executing garbage, take the wrong branch in if statements, and so on.
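To illustrate what that looks like at the byte level (just an illustration of prefix stacking, not necessarily the exact trigger bytes):

    db 0xf3, 0xf3, 0xf3, 0xf3, 0xa4   ; "rep rep rep rep movsb": f3 is the rep prefix, a4 is movsb
    db 0xf3, 0x48, 0xa4               ; "rep rex.w movsb": the rex.w (0x48) is meaningless for a byte copy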
Most disturbingly, when multiple physical cores are executing these "rep rep rep rep movsb" instructions at the same time, they will start generating machine check exceptions, which can at worst force a physical machine reboot. This is very bad for Google because they rent out compute time to different companies and they all need to be able to share the same machine. They don't want some prankster running these instructions and killing someone else's compute jobs. We call this a "Denial of Service" vulnerability because, while I can't read someone else's computations or change them, I can keep them from completing, which is just as bad.
To some extent, anyone with a web browser is sharing their machine with other people. That's Javascript.
If you ever download untrustworthy code and run it in a VM to protect your main set of data, that's another case.
The success of cloud computing is from the idea that multiple people can share the same computer. You only need one core, but CPUs come with 128; with the cloud you can buy just that one core and share 1/128th of the power supply, rack space, motherboard, ethernet cable, sysadmin time, etc. and that reduces your costs. That assumption is all based on virtualization working, though; nobody wants 1/128th of someone else's computer, they want their own computer that's 1/128th as fast. Bugs like these demonstrate that you're just sharing a computer with someone, which is bad for the business of cloud providers.
My point is that for a sufficiently large user, you can probably use enough of the 128 cores by yourself alone, that it's more worthwhile to do that and turn off these mitigations : both because it removes a whole class of threats, and also because the mitigations tend to have a non-negligible performance impact, especially when first discovered, on chips that haven't been designed to protect against them.
I very much agree with that. The reality is that cloud providers can replace entire machines with only a small latency blip in your application (or at least GCP can), so if you are doing things like buying 2 core VMs 64 times to avoid losing more than 1% capacity when a machine dies, you probably don't actually need to do that. You could get a 128 core dedicated machine, and then not share it with anyone, and your availability time in that region/AZ probably wouldn't change much.
That said, machines are really monstrously huge these days, and it can be hard to put them to good use. You also miss out on cost savings like burstable instances, which rely on someone else using the capacity for the 16 hours a day when you don't need it. It's a balance, but I'd say "just buy a computer" would be my starting point for most application deployments.
So your argument is that everyone who wants to run a WordPress blog should be paying $320/mo[0] to rent a whole machine just so we can avoid one specific kind of security problem?
If you don't want to share GCP and AWS both offer ways to rent machines that aren't shared with other users. But for most people the cost isn't worth it because shared machines work well enough and provide much better resource utilization.
Some x86 instructions can have prefixes that modify their behavior in a meaningful way. Such a prefix can be applied generally to any instruction, but it's expected to have no effect when applied to an instruction it doesn't make sense with. But it turns out the CPU actually misbehaves in some cases when this is done. Intel released a CPU firmware update to fix it.
Intel is a known partner of the NSA. If Intel was intentionally creating backdoors at the behest of the NSA, how would they look different from this vulnerability and the many other discovered vulnerabilities before it?
Only the people inserting the backdoor or using it would need to be bound by a National Security Letter's gag order. I doubt anyone at Google (including those subject to NSL gag orders) was made aware of this specific vulnerability.
> Google’s commitment to collaboration and hardware security
> As Reptar, Zenbleed, and Downfall suggest, computing hardware and processors remain susceptible to these types of vulnerabilities. This trend will only continue as hardware becomes increasingly complex. This is why Google continues to invest heavily in CPU and vulnerability research. Work like this, done in close collaboration with our industry partners, allows us to keep users safe and is critical to finding and mitigating vulnerabilities before they can be exploited.
There's a tension between the NSA wanting backdoors and service providers (CPU designers + Cloud hosting) wanting secure platforms. It's possible that by employing CPU and security researchers, Google can tip the scales a bit further in their favor.
the backdoor would just be an encrypted stream of "random" data flowing right out of the RNG. There's some maxim of crypto that encrypted data is indistinguishable from random bytes.
> This bug was independently discovered by multiple research teams within Google, including the silifuzz team and Google Information Security Engineering.
Can we get a better title for this? "Reptar - new CPU vulnerability" or something. I thought it was some random startup ad until I picked up the name somewhere else.
If it is changed to what you suggested a question mark would be warranted, because it is not yet clear what can be done with this "glitch" (as the article calls it).
>A potential security vulnerability in some Intel® Processors may allow escalation of privilege and/or information disclosure and/or denial of service via local access.
This isn't how anyone would backdoor a CPU. An actual backdoor would be done via some instruction sequence that is basically impossible to trigger by accident and hard to detect even when triggered.
Can you give an example of such sequence? Is it really so easy to hide it given that the microcode can be decoded in principle, https://news.ycombinator.com/item?id=32145324? Why is hiding it in a "bug" a worse solution? Why you can't do both?
One approach is to make the backdoor trigger conditional on multiple (unlikely) instructions in sequence. This bug was triggered by a single instruction, so it would have been a pretty easy case for fuzzing. If you need a sequence of 10 specific instructions in a specific order, with no kind of observable side effects for getting just the first 9 right so that nobody can do a guided search? That's not going to be found just by random chance. It doesn't matter what those instructions are, as long as they're not something that would get generated by real compilers on real programs.
The other is to make it dependent on the data rather than just the static instructions. Like, what if you had the SHA1 acceleration instructions trigger a backdoor iff the output of the hash is a certain value? You could probably even arrange for the backdoor to get triggered from managed and sandboxed runtimes like Javascript, rather than needing to get the victim to run native code. And somebody triggering this by accident would be equivalent to a SHA1 preimage collision.
It looks like Intel was cutting corners to be faster than AMD and now all those things come out. How much slower will all those processors be after multiple errata? 10%? 30%? 50%?
In a duopoly market there seems to be no real competition. And yes I know that some (not all) bugs also happen for AMD.
> And yes I know that some (not all) bugs also happen for AMD.
Some of these novel side-channel attacks actually even apply in completely unrelated architectures such as ARM [1] or RISC-V [2].
I think the problem is not (just) a lack of competition (although you're right that the duopoly in desktop/laptop/non-cloud servers for x86 brings its own serious issues, I've written and ranted more often than I can count [3]), it rather is that modern CPUs and SoCs have simply become so utterly complex and loaded with decades worth of backwards-compatibility baggage that it is impossible for any single human, even a small team of the best experts you can bring together, to fully grasp every tiny bit of them.
> and I suspect the situation will worsen when AI will enter the picture.
For now, AI lacks the contextual depth - but an AI that can actually design a CPU from scratch (and not just rehashing prior-art VHDL it has ... learned? somehow), if that happens we'll be at a Cambrian Explosion-style event anyway, and all we can do is stand on the sides, munch popcorn and remember this tiny quote from Star Wars [1].
Not sure what other errata you're referring to, but this looks like an off-by-one in the microcode. I would expect the fix to have zero or minimal penalty.
It's not clear to me this fix will have any performance impact. I strongly suspect it will be negligible or zero.
This seems like a "simple" bug of the type that people write every day, not deep architectural problems like Spectre and the like, which also affected AMD (in roughly equal measure if I recall correctly).
Parent commenter might be thinking of Meltdown, a related architectural bug that only bit Intel and IBM PPC. Everything with speculative execution has Spectre[0], but you only have Meltdown if you speculate across security boundaries.
The reason why Meltdown has a more dramatic name than Spectre, despite being the same vulnerability, is that hardware privilege boundaries are the only defensible boundary against timing attacks. We already expect context switches to be expensive, so we're allowed to make them a little more expensive. It'd be prohibitively expensive to avoid leaking timing from, say, one executable library to a block of JIT compiled JavaScript code within the same browser content process.
(via https://news.ycombinator.com/item?id=38268043, but we merged the comments hither)