The problem isn't execution unit throughput at all; it's decode. Running a second thread on the module bottlenecks the decoder, forcing it to alternate between servicing the two threads. It doesn't matter what the other core is running: the decoder cannot service both cores in the same cycle, regardless of instruction type, even if the first core isn't using all of its decode bandwidth. And if it has to decode a microcoded instruction, it can stall the other thread for multiple cycles.
> Each decoder can handle four instructions per clock cycle. The Bulldozer, Piledriver and Excavator have one decoder in each unit, which is shared between two cores. When both cores are active, the decoders serve each core every second clock cycle, so that the maximum decode rate is two instructions per clock cycle per core. Instructions that belong to different cores cannot be decoded in the same clock cycle. The decode rate is four instructions per clock cycle when only one thread is running in each execution unit.
...
> On Bulldozer, Piledriver and Excavator, the shared decode unit can handle four instructions per clock cycle. It is alternating between the two threads so that each thread gets up to four instructions every second clock cycle, or two instructions per clock cycle on average. This is a serious bottleneck because the rest of the pipeline can handle up to four instructions per clock.
> The situation gets even worse for instructions that generate more than one macro-op each. All instructions that generate more than two macro-ops are handled with microcode. The microcode sequencer blocks the decoders for several clock cycles so that the other thread is stalled in the meantime.
https://www.agner.org/optimize/microarchitecture.pdf
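The effect of the shared decoder is easy to see with a toy model. This is a minimal sketch, not a cycle-accurate simulation: it assumes strict round-robin arbitration and a 4-wide decoder, and ignores every other pipeline limit.

```python
def decode_throughput(active_threads, cycles=1000, width=4):
    """Average instructions decoded per cycle, per thread, under
    round-robin arbitration of one shared decoder (toy model)."""
    decoded = [0] * active_threads
    for cycle in range(cycles):
        # The shared decoder serves exactly one thread per cycle,
        # alternating between them; the other thread gets nothing.
        thread = cycle % active_threads
        decoded[thread] += width
    return [d / cycles for d in decoded]

print(decode_throughput(1))  # one active thread: 4 instructions/cycle
print(decode_throughput(2))  # two active threads: 2 instructions/cycle each
```

With one thread the module decodes at the full 4 instructions per cycle; with two, each thread averages 2, which is why the shared front end, not the duplicated integer cores, sets the ceiling.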