How often does a computer (CPU and RAM) make mistakes?

I'm a hobby programmer currently studying actuarial science, so this seems like a natural question for me to ask! We all know that CPUs are highly reliable and have some degree of error correction for their internal analogue signals (i.e. the currents between transistors), and even redundant systems, but what is the probability that those safeguards fail? While this isn't a pressing question for the everyday programmer (the failure rate of a CPU is surely negligible for everyday purposes), I find it interesting nonetheless. Let's base the question on an example:

Suppose we have a binary operation:

bool a = true;
bool b = false;

bool test = a == b;


What is the probability of test being assigned a value of true? That is to say, what is the probability that the machine assigns the values incorrectly or botches the boolean comparison? I'm looking for an answer as a sigma-event number or a probability (i.e. on average, how many times do I have to run this program before it gets it wrong?).

Let's assume some typical 64-bit PC architecture.

Does anyone have experience / thoughts on this?
I'm pretty sure an optimizing compiler will realize that test is false and will hardcode it thusly.
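For example (just a sketch; the exact output varies by compiler and flags), with optimization enabled the comparison is typically folded to a constant:

// The snippet from the question, wrapped in a function.
bool test_ab() {
    bool a = true;
    bool b = false;
    return a == b;   // an optimizer folds this to the constant false
}

// With optimization (e.g. g++ -O2), the generated x86-64 code is typically
// nothing more than "return false", roughly:
//   xor eax, eax
//   ret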

Also, don't take my word for it, but again I'm pretty sure that errors occurring in the CPU are more likely due to flaws in the design than anything else.

Memory chips, on the other hand, may produce errors, which can be corrected by means of ECC:
http://en.wikipedia.org/wiki/ECC_memory
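To illustrate the idea only (a minimal sketch; real ECC DIMMs use a wider SECDED code over 64-bit words, not this exact scheme), a Hamming(7,4) encoder/corrector looks something like this:

#include <cstdint>
#include <cstdio>

// Hamming(7,4): 4 data bits, 3 parity bits, corrects any single-bit error.
std::uint8_t encode(std::uint8_t d) {            // d: 4 data bits (d3..d0)
    std::uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
    std::uint8_t p1 = d0 ^ d1 ^ d3;              // covers positions 1,3,5,7
    std::uint8_t p2 = d0 ^ d2 ^ d3;              // covers positions 2,3,6,7
    std::uint8_t p3 = d1 ^ d2 ^ d3;              // covers positions 4,5,6,7
    // Codeword layout, bit i = position i+1: p1 p2 d0 p3 d1 d2 d3
    return p1 | (p2 << 1) | (d0 << 2) | (p3 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

std::uint8_t decode(std::uint8_t c) {            // corrects one flipped bit
    auto bit = [&](int pos) { return (c >> (pos - 1)) & 1; };
    int s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    int s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    int s3 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    int syndrome = s1 | (s2 << 1) | (s3 << 2);   // position of the bad bit, or 0
    if (syndrome) c ^= (1 << (syndrome - 1));    // flip it back
    return bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3);  // recover data
}

int main() {
    std::uint8_t word = 0b1011;                  // 4 data bits
    std::uint8_t cw   = encode(word);
    cw ^= (1 << 4);                              // simulate a stray bit flip
    std::printf("decoded: %u (expected %u)\n", (unsigned)decode(cw), (unsigned)word);
}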

> On average, how many times do I have to run this program before it gets it wrong?

Well if you rewrite your code as:

volatile bool a = true;
volatile bool b = false;
volatile bool test = a == b;

Then I guess it becomes possible for errors to creep in, but even then I'm pretty sure they'll be caused by memory and not CPU.

Unless the CPU caches act crazy... although they're not "regular" memory.
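If you really wanted to hunt for such an event, a (practically hopeless) test harness might look roughly like this; the iteration count is arbitrary:

#include <cstdint>
#include <cstdio>

int main() {
    volatile bool a = true;
    volatile bool b = false;
    std::uint64_t mismatches = 0;

    // In practice you would not expect a single hit: only a stray bit flip
    // in a, b or test could produce one.
    for (std::uint64_t i = 0; i < 1'000'000'000ULL; ++i) {
        volatile bool test = (a == b);
        if (test) ++mismatches;
    }
    std::printf("mismatches: %llu\n",
                static_cast<unsigned long long>(mismatches));
    return 0;
}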
I do remember that when the whole faster-than-light neutrino thing was going on, they were looking at possible interference in chips from cosmic rays as a cause (at least, in some of the literature I read). That is a definite way to cause errors in a chip. But how often does a cosmic ray really affect a calculation?

For memory, IBM suggests the answer is 'one cosmic-ray-induced error per 256 megabytes of RAM per month' ( http://www.scientificamerican.com/article.cfm?id=solar-storms-fast-facts ). Memory is generally more susceptible to errors than the CPU and is physically larger, and a cosmic ray would have to strike the part of the CPU actually doing the calculation, at the moment of the calculation, to matter... so I would expect CPU errors from this cause to be significantly rarer.
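As a back-of-the-envelope illustration of that figure (a rough sketch, not a rigorous model), converting it to a per-bit rate looks something like this:

#include <cstdio>

int main() {
    // IBM's figure cited above: roughly 1 cosmic-ray-induced error
    // per 256 MB of RAM per month.
    const double bits_in_256mb     = 256.0 * 1024 * 1024 * 8;  // ~2.15e9 bits
    const double seconds_per_month = 30.0 * 24 * 3600;         // ~2.6e6 s
    const double per_bit_per_second = 1.0 / (bits_in_256mb * seconds_per_month);

    std::printf("per-bit error rate:      %.2e per second\n", per_bit_per_second);
    // Very rough upper bound on one comparison going wrong: the chance that
    // one particular bit flips during a ~1 ns window.
    std::printf("per-bit, per nanosecond: %.2e\n", per_bit_per_second * 1e-9);
    return 0;
}

This prints roughly 1.8e-16 per bit per second, or about 1.8e-25 per bit per nanosecond.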

Sorry, no sigma results for you though. =/
> (ie. On average, how many times do I have to run this program before it gets it wrong?).

On a modern processor architecture, you are unlikely ever to see it get this wrong (the probability of that happening is infinitesimal). If it is a soft error, the processor will detect it and apply error correction, typically via retries; if it is a hard error, the system will halt - kernel panic (Unix) or KeBugCheck (Windows) - with a machine-check exception (MCE).
Relevant Wikipedia article:
http://en.wikipedia.org/wiki/Soft_error