
And you haven't replaced it? The only time I had ECC memory report errors, I was also experiencing undetected errors. The system was not stable. Pulled the memory out, happy ever since. I've always thought of ECC as a warning system (despite the correction ability). Like a spare tire.



Memory errors happen all the time. At least they did several years back, when the Google server farm saw "an average of one single-bit-error every 14 to 40 hours per Gigabit of DRAM": http://www.intelligentmemory.com/support/faq/ecc-dram/how-of...

This translates to "a mean of 3,751 correctable errors per DIMM per year": http://www.zdnet.com/article/dram-error-rates-nightmare-on-d...

I'm not sure how things pan out these days with newer memory types. ECC detects and corrects these single-bit errors, so they're not an issue.
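
As a rough sanity check on how the two quoted figures relate, here is a minimal Python sketch. It assumes a hypothetical 1 GB (8 Gbit) DIMM powered on around the clock; the DIMM size and the function name are my assumptions, not numbers from either linked article.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def errors_per_year(hours_between_errors, gigabits):
    """Expected single-bit errors per year for a module of the given size,
    assuming one error per `hours_between_errors` hours per gigabit."""
    return HOURS_PER_YEAR / hours_between_errors * gigabits

# Assumption: a 1 GB DIMM, i.e. 8 gigabits of DRAM.
DIMM_GIGABITS = 8

low = errors_per_year(40, DIMM_GIGABITS)   # one error every 40 hours per Gbit
high = errors_per_year(14, DIMM_GIGABITS)  # one error every 14 hours per Gbit

print(round(low), round(high))  # 1752 5006
```

Under that assumption the quoted mean of 3,751 correctable errors per DIMM per year sits inside the 1,752 to 5,006 range. The mean itself comes from measured data, not from this arithmetic.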


Citing the results of a paper in which they studied the errors observed mostly on DDR1 memory?

A far more recent study by CMU, based on the entire fleet of Facebook servers, shows that correctable error rates dropped dramatically in the past decade.

http://repository.cmu.edu/cgi/viewcontent.cgi?article=1345&c...


[Unless it's obvious to others,] the paper doesn't conclude that the DDR1-to-DDR3 difference correlates with error rate. It only concludes from empirical measurements that error rate scales with memory density (2).



I don't consider it broken, so I haven't replaced it. The system is rock-solid stable.

Your statement was interesting, though: how do you have ECC memory both reporting errors and letting 'undetected errors' through? At least from a memory perspective, with ECC an 'undetected' error is a multi-bit error that flips bits and still leaves the ECC bits in a legal configuration. That seems like it would be pretty rare.

That said, I've seen motherboards (in our data center) where the memory slots themselves were unreliable (probably bad or weak solder joints on the DIMM sockets or missing terminator resistors). Such a board showed up as a machine with a lot of ECC errors, but the same DIMM in another motherboard gave no errors.


The patient first presented with classic memory corruption symptoms, like random segfaults. Ran memtest. No errors reported, but a flurry of corrections logged in the BIOS. Pulled a pair of DIMMs, everything cleared up. Swapped slots too, so I feel confident blaming the RAM.

It did fall outside my expectation of how ECC works. One-bit errors and three-bit errors, but not two? Some access pattern that memtest strides don't hit? I didn't really need the extra RAM, so I just moved on without it.


SECDED - single error correction, double error detection. Your duff memory module was flipping more than two bits.
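
To make the SECDED behaviour concrete, here is a toy sketch in Python using an extended Hamming(8,4) code. This is an illustration only: real ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names (`encode`, `decode`, `flip`) are mine, not from any memory controller. It shows one flip being corrected, two flips being detected, and three flips being silently "corrected" into the wrong data.

```python
def encode(data):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.
    Index 0 is the overall parity bit; indices 1..7 are Hamming positions."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    p0 = 0
    for bit in word:
        p0 ^= bit              # overall parity makes the total weight even
    return [p0] + word

def decode(code):
    """Return (status, data) where status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos    # XOR of set positions is 0 for a valid codeword
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        status = "uncorrectable"
    return status, [code[3], code[5], code[6], code[7]]

def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out

original = [1, 0, 1, 1]
word = encode(original)

print(decode(flip(word, [5])))        # ('corrected', [1, 0, 1, 1])
print(decode(flip(word, [3, 5])))     # ('uncorrectable', [0, 1, 1, 1])
print(decode(flip(word, [1, 2, 3])))  # ('corrected', [0, 0, 1, 1])
```

The last line is the interesting one: three flipped bits have odd parity, so the decoder treats them as a single-bit error and logs a correction while handing back wrong data. That would fit the pattern of logged corrections plus segfaults without any reported uncorrectable errors.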



