Hacker News

First off, I was referring primarily to load latency from the L1 cache, which over the past two and a half decades has improved only through the combination of Moore's law making the wires shorter (which is going to end soon, at least for silicon) and rising clock speeds (which effectively ended with the breakdown of Dennard scaling a little over a decade ago). Intel's L1 cache latency has not improved in almost 10 years; it still sits at 4 cycles at best. The only improvement has been capacity: there is more data you can access at L1 speed, but the time from issuing a load to the data hitting your registers has not improved at all.

Our scratchpad (the analogous term for software-managed memory, as opposed to a traditional hardware-managed L1/L2/L3 cache hierarchy), for instance, has single-cycle latency and zero bus turnaround. Combined with our ability to guarantee the memory latency between any two locations in memory, our whole goal is to never have a wasted cycle.


