
This article contains a fundamental flaw. It estimates the rate of memory upsets due to cosmic ray flux in upsets/bit/hour, but that is the wrong unit to extrapolate from. The upset rate depends on the total physical size of the memory (and thus the total neutron flux through it) and on the sensitivity of each memory cell (bit) to cosmic rays. Sensitivity may increase as memory cells shrink, but not in lock-step with the change in size. A room full of 4 Mbit memory chips will almost certainly have a higher rate of upsets per bit than a single 2GB DIMM. The figures quoted in the article come from studies of computer systems in the 1980s, so their upsets/bit rates are much higher than would be expected with modern RAM (which has 3 orders of magnitude more bits in the same volume).

This error is probably why the article's theorized SEU event rate for modern systems is about 3 orders of magnitude higher than experimental evidence suggests (such as from this Google study): http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf



I don't understand your first paragraph. Shouldn't the cosmic ray flux through each bit be the same for each bit, and unchanged as you increase the total amount of memory? If I'm a given bit, should having a second stick of RAM 10 inches away affect whether I flip during a given time period?


> Shouldn't the cosmic ray flux through each bit be the same for each bit, and unchanged as you increase the total amount of memory?

For identically manufactured RAM, generally yes. The total number of upsets you'll see from a collection of 10 sticks of RAM will be roughly 10x higher than from 1 stick. However, there are huge variations in RAM, especially when you compare modern RAM to RAM manufactured in, say, 1988. The main figure used in the article (1.3e-12 upsets/bit/hour) comes from a study of a Cray Y-MP 8 system whose main memory contained approximately 32,000 SRAM chips. That memory occupied a volume measured in cubic meters, yet today the same number of bits fits on half or a quarter of a single DIMM.

Suffice it to say, the cosmic ray flux through the Cray Y-MP 8's main memory system and the flux through half of a 2GB DIMM differ by orders of magnitude. At the same time, a memory cell in the Y-MP 8 and a memory cell in a 2GB DDR2 DIMM have different sensitivities to cosmic rays, which translates to a different rate of upsets for the same neutron flux per memory cell. However, these two factors don't balance each other out: modern memory cells aren't thousands of times more sensitive to cosmic rays even though they take up thousands of times less space. The result is that a figure in upsets/bit/hour can only be taken as constant so long as the memory technology remains constant. That is most decidedly not the case here. If one were using 4GB of Cray Y-MP RAM (which would likely fill an entire server rack, and more), perhaps you'd see the SEU rates the author calculates. However, most folks these days are using 4GB of RAM in 2 tiny DIMMs whose actual memory chips have a combined cross-sectional area of maybe 16 cm^2 at most. This has non-trivial effects on the SEU rate.
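To put a number on the naive extrapolation being criticized, here is a minimal sketch that treats the 1988-era per-bit figure as a constant and applies it to 4GB. The function name is made up for illustration, "4GB" is assumed to mean 4 GiB, and the divide-by-1000 step only restates the roughly 3 orders of magnitude gap mentioned upthread; it is not a measured rate.

```python
# Per-bit rate quoted in the thread for the Cray Y-MP 8 study.
RATE_PER_BIT_HOUR = 1.3e-12  # upsets/bit/hour

def expected_upsets(bits, hours, rate_per_bit_hour=RATE_PER_BIT_HOUR):
    """Expected upsets if the per-bit rate really were a constant (hypothetical helper)."""
    return bits * hours * rate_per_bit_hour

# Assumption: "4GB" means 4 GiB, i.e. 4 * 2**30 bytes of 8 bits each.
BITS_4GB = 4 * 2**30 * 8

# Naive result: what you get by reusing the 1988 figure unchanged.
naive_per_day = expected_upsets(BITS_4GB, hours=24)

# Same figure scaled down by ~3 orders of magnitude, purely to show
# how large the claimed discrepancy is (illustrative, not measured).
scaled_per_day = naive_per_day / 1000

print(f"naive:  {naive_per_day:.3f} upsets/day")
print(f"scaled: {scaled_per_day:.6f} upsets/day")
```

With these inputs the naive figure works out to roughly one upset per day for 4GB, which is the kind of rate that only makes sense if the RAM is physically as large as the Cray's.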


Oh, OK. I guess then I wouldn't say that upsets/bit/hour is an incorrect unit. (It's clearly what you want to know to calculate the chance of error for a given piece of RAM.) It's just that this parameter varies across time and manufacturers. Using the value from a particular model of RAM manufactured in 1988 is sure to lead to wrong conclusions.

Thanks.


The rate of cosmic ray intersection is most likely directly proportional to the physical cross section of the silicon. The author uses per-bit empirical data from 10 years ago, when memory was 100 times less dense, and then extrapolates to the present. It would likely be more correct to use a per-chip (or per cm^2) rate.


I think his point was that as you get a higher number of bits per volume, the number of flips per GB would go down, basically because for a given number of bits, a higher density would imply they are exposed to less flux.
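A rough sketch of that density argument, assuming upsets scale with the exposed silicon area rather than with the bit count. Every number here is a placeholder picked to make the ratio obvious, not a measured flux or sensitivity, and the function and parameter names are hypothetical.

```python
def upsets_per_bit_hour(strikes_per_cm2_hour, upset_prob_per_strike, area_cm2, bits):
    """Per-bit upset rate under a simple area-proportional model (illustrative only)."""
    # Total upsets depend on how much silicon is exposed, not on how many bits it holds.
    upsets_per_hour = strikes_per_cm2_hour * area_cm2 * upset_prob_per_strike
    return upsets_per_hour / bits

# Placeholder values: same flux, same per-strike sensitivity, same bit count.
# Only the silicon area differs, standing in for a 100x change in density.
old = upsets_per_bit_hour(1.0, 1e-3, area_cm2=100.0, bits=1e9)
new = upsets_per_bit_hour(1.0, 1e-3, area_cm2=1.0, bits=1e9)

# Ratio of per-bit rates between the low-density and high-density parts.
ratio = old / new
print(ratio)
```

Under this model the per-bit rate falls in proportion to the area, unless the per-strike sensitivity rises by the same factor, which is the part the comments above argue does not happen.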


Based on pure gut feeling, I doubt that sensitivity has increased at all for at least a decade now. Anything that penetrates the casing either will not interact with the silicon at all, or has enough energy to flip a bit in both a 180nm and a 32nm process.



