
Interesting. If an L3 hit takes 15 ns, then based on your argument a hypothetical CPU with only one core (and hence no fabric) would be better off without L3, since a DRAM read can be performed in just 10 ns.


You still need a memory controller, you still need to get to that controller on the edge of the die. And going to RAM more often will surely consume more power.


No, the 10 ns is just the time inside the DRAM. Reading from DRAM would take 20-30 ns even in a very simple chip.


This is the part I don't understand. You're saying that the interval from when the DRAM first receives a read request to when it sends the data back over the channel is about 10ns, at least in fancy gaming RAM. Ok, fine. Where is the other 10-20 ns of latency coming from? Why can't the CPU begin using the data as soon as it arrives? I guess some time is needed to move the data from the memory controller to the actual CPU core. But it seems to me (far from an expert) that this shouldn't take a full 10-20 ns. Or am I mistaken?


Firstly, to clarify, there's nothing very special about 'gaming RAM' other than that the particular chunk of silicon performs better than others, so they stuck a shiny sticker and an oversized heatsink on it.

The problem here is that the latency is state dependent, and it's not clear which case people in this thread are talking about. The memory itself can have a latency of 1-3x the CAS latency number, and you need to understand how DRAM is accessed to appreciate why. That will also clarify why an L3 cache is such a good idea.

> For a completely unknown memory access (AKA Random access), the relevant latency is the time to close any open row, plus the time to open the desired row, followed by the CAS latency to read data from it.

(It's actually worse than that for DDR5.)

https://en.m.wikipedia.org/wiki/CAS_latency

https://en.m.wikipedia.org/wiki/Memory_timings

https://www.anandtech.com/show/3851/everything-you-always-wa...

Then you've got some small time going to and from the controller, which might also be doing some address translation, maybe some access reordering to avoid switching rows. I think 30ns is very optimistic.
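
To put rough numbers on the "1-3x the CAS latency" point, here is a minimal sketch of the three bank states implied by the quoted rule. The timings (DDR4-3200 at CL16-18-18) are a hypothetical example kit, not figures from this thread, and the function names are made up for illustration. It only counts time inside the DRAM, not the controller or the trip across the die.

```python
def cycles_to_ns(cycles, data_rate_mts):
    # DDR moves two transfers per clock, so the clock period in ns
    # is 2000 / (data rate in MT/s).
    return cycles * 2000.0 / data_rate_mts


def access_latency_ns(state, cl, trcd, trp, data_rate_mts):
    # Cycles until the first data word, depending on what the bank is doing.
    cycles = {
        "row_hit": cl,                    # desired row already open
        "row_closed": trcd + cl,          # bank idle: open row, then read
        "row_conflict": trp + trcd + cl,  # close other row, open, then read
    }[state]
    return cycles_to_ns(cycles, data_rate_mts)


# Hypothetical kit: DDR4-3200, CL16, tRCD 18, tRP 18 (assumed values).
for state in ("row_hit", "row_closed", "row_conflict"):
    ns = access_latency_ns(state, cl=16, trcd=18, trp=18, data_rate_mts=3200)
    print(f"{state:13s} {ns:6.2f} ns")
```

With those assumed timings the row-hit case is the familiar 10 ns, while the fully random case (row conflict) is a bit over three times that, before the controller adds anything.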


To read a single cache line from DDR4 (basically the same for DDR5 but I'm less familiar) the memory controller needs to:

  1. send ACT
  2. wait tRCD(RD)
  3. send READ
  4. wait tCL
  5. read the burst from the DQ

The original 10ns number was only taking step 4 into account. tRCD(RD) is just as long, if not longer. Then the burst takes a couple more ns.
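
The five steps above can be laid out as a timeline. This is a rough sketch, assuming a hypothetical DDR4-3200 part with tRCD(RD) = 18 and tCL = 16 clocks and the standard DDR4 burst length of 8; it assumes the bank is already precharged and ignores command-bus and controller overhead. `read_timeline` is an illustrative name, not anything from a real tool.

```python
def read_timeline(trcd, tcl, burst_length, data_rate_mts):
    # Clock period in ns; DDR does two transfers per clock.
    tck = 2000.0 / data_rate_mts
    act = 0.0                                    # 1. send ACT
    read = act + trcd * tck                      # 2. wait tRCD(RD), 3. send READ
    first_data = read + tcl * tck                # 4. wait tCL
    # 5. the burst: burst_length transfers take burst_length / 2 clocks
    last_data = first_data + (burst_length / 2) * tck
    return [
        (act, "ACT sent"),
        (read, "READ sent"),
        (first_data, "first data on DQ"),
        (last_data, "cache line complete"),
    ]


# Assumed example values: DDR4-3200, tRCD(RD) 18, tCL 16, BL8.
for t_ns, event in read_timeline(trcd=18, tcl=16, burst_length=8,
                                 data_rate_mts=3200):
    print(f"{t_ns:6.2f} ns  {event}")
```

Under these assumptions step 4 alone is 10 ns, but the whole sequence is closer to 24 ns, and that is still only the DRAM side of the round trip.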



