> I also understand that this is possible because the emulator is running on a superscalar processor.
It's also possible because the minimal architecture of the 6502 makes it inherently inefficient. With only three 8-bit registers (A, X, and Y) -- which can't even be used interchangeably! -- and a non-addressable stack, a lot of CPU time on the 6502 is spent shuffling data around. Consider adding two 32-bit numbers, for example. On a 6502, this takes a minimum of 38 cycles even with every operand in zero page (clc + (lda, adc, sta) x4, i.e. 2 + 4 x 9); an x86 can complete the same operation with a single register-to-register add in one cycle, potentially in parallel with other operations.
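
For anyone curious where the 38 comes from, here's a rough Python sketch that mimics the byte-at-a-time add and tallies cycles as it goes. It assumes all operands live in zero page (3 cycles each for lda, adc, and sta, plus 2 for clc); the function name and the cycle table are made up for illustration and aren't taken from any real emulator.

```python
# Cycle costs assuming zero-page addressing (clc uses implied addressing).
CYCLES = {"clc": 2, "lda_zp": 3, "adc_zp": 3, "sta_zp": 3}


def add32_6502(a, b):
    """Add two 32-bit values one byte at a time, the way a 6502 has to.

    Returns (sum modulo 2**32, total cycles spent).
    """
    carry = 0                                   # clc
    cycles = CYCLES["clc"]
    result = 0
    for i in range(4):                          # least significant byte first
        byte_a = (a >> (8 * i)) & 0xFF          # lda a+i
        byte_b = (b >> (8 * i)) & 0xFF
        total = byte_a + byte_b + carry         # adc b+i
        carry = total >> 8                      # carry flag feeds the next byte
        result |= (total & 0xFF) << (8 * i)     # sta result+i
        cycles += CYCLES["lda_zp"] + CYCLES["adc_zp"] + CYCLES["sta_zp"]
    return result, cycles


if __name__ == "__main__":
    total, cycles = add32_6502(0x0000FFFF, 0x00000001)
    print(hex(total), cycles)
```

The loop body is the (lda, adc, sta) triple from above, run four times; the carry handed from one byte to the next is why the steps can't be reordered or skipped. If the operands sit outside zero page, each access costs more, so 38 is a floor rather than a typical figure.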