When to use something other than MTBF
As you may suspect, I would say you should never use MTBF.
Given MTBF is prevalent, we may find avoiding MTBF nearly impossible.
Given a choice
When talking about reliability goals, just use reliability. Say what you mean in clear language. For example, if you want 95% of units to survive without failure for 5 years, then say the reliability goal is 95% survival over 5 years (include function and environment if they are not clear from the context).
When asking for reliability information, ask for what you want. If you want the device to last five years without failures or with very few failures, then just saying 5 years can be misunderstood. Couple the duration with the probability of survival to be very clear.
When specifying a test, also be clear: the goal or objective is one statement, and the confidence or statistical uncertainty is something different. Keep them separate.
When not given a choice
When only given MTBF, or only asked for MTBF values, what should you do? Use the value and ask some questions. Remember that MTBF all by itself is just an indication of the average failure rate. It is not a duration and does not convey how long, or over which period of time, the failure rate applies.
I cringe when I hear someone comment on a 50,000 hour MTBF value with, “That is about 5 years, which is long enough for our application.” We really should state MTBF as hours per failure to be a bit clearer.
So, when given 50,000 hours MTBF for an item, I first consider over what duration this applies (if I don’t know, it’s time to ask more questions). So, let’s say we have an electronics box with a fan. It is expected to operate full time for two years, or 17,520 hours of operation.
If the fan assembly data sheet lists an MTBF of 50k hours, and that is the only information I have available, I can estimate the reliability directly by assuming a constant failure rate.
$latex \displaystyle&s=4 \begin{array}{l}R\left( t \right)={{e}^{\frac{-t}{\theta }}}\\R\left( 17,520 \right)={{e}^{\frac{-17520}{50000}}}=0.70\end{array}$
This is the reliability function for the exponential distribution, with θ equal to the MTBF. It estimates that 70% of units survive 2 years. If losing about 30% of units over that period is acceptable, then use the fan; if not, find a better fan or a better estimate of the fan’s reliability.
When only given MTBF, do the math and convert the value into something that is much easier to understand.
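The fan calculation is easy to script. The snippet below is a minimal sketch, assuming a constant failure rate (exponential distribution); the function names are made up for illustration. It also inverts the formula to show how long a 95% reliability goal would hold for the same MTBF.

```python
import math

def reliability_from_mtbf(mtbf_hours: float, duration_hours: float) -> float:
    """R(t) = exp(-t / MTBF). Only valid if the failure rate is constant."""
    return math.exp(-duration_hours / mtbf_hours)

def duration_for_reliability(mtbf_hours: float, target_reliability: float) -> float:
    """Invert R(t): the duration at which reliability drops to the target."""
    return -mtbf_hours * math.log(target_reliability)

mtbf = 50_000.0          # fan data sheet value from the example
two_years = 2 * 8760.0   # 17,520 hours of continuous operation

print(f"R(2 years)        = {reliability_from_mtbf(mtbf, two_years):.3f}")
print(f"R(t = MTBF)       = {reliability_from_mtbf(mtbf, mtbf):.3f}")
print(f"Hours at R = 0.95 = {duration_for_reliability(mtbf, 0.95):.0f}")
```

Under this assumption about 70% of fans survive two years, only about 37% survive to the MTBF itself, and a 95% survival goal is met for only roughly 2,565 hours (under four months).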
Do the same when asked for MTBF. Provide reliability – probability of success over a specific duration. Again, make it clear.
Paul says
Fred, two points seem relevant. There’s no reason you have to assume that a constant failure rate applies. You might ask how your view of reliability would change if you knew that failures were distributed according to the normal or Weibull distributions. Since you know (in your example) that you’re dealing with a fan, you can deal with various assumptions about how wear-out might happen.
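As a rough sketch of how much the distribution assumption matters, the snippet below holds the mean life at 50,000 hours and varies the Weibull shape parameter. The shape values 0.5 and 3 are illustrative assumptions, not fan data, and the function names are hypothetical.

```python
import math

def weibull_scale_for_mean(mean_hours: float, shape: float) -> float:
    """Characteristic life eta such that the Weibull mean equals mean_hours."""
    return mean_hours / math.gamma(1.0 + 1.0 / shape)

def weibull_reliability(t_hours: float, shape: float, scale: float) -> float:
    """R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t_hours / scale) ** shape))

mean_life = 50_000.0   # same mean life in every case
mission = 17_520.0     # two years of continuous operation

# beta < 1: early-life failures; beta = 1: constant rate; beta > 1: wear-out
for shape in (0.5, 1.0, 3.0):
    eta = weibull_scale_for_mean(mean_life, shape)
    r = weibull_reliability(mission, shape, eta)
    print(f"beta={shape}: eta={eta:.0f} h, R(2 years)={r:.3f}")
```

With the same 50,000 hour mean, two-year reliability comes out near 0.43, 0.70, and 0.97 for the three shapes, so the mean alone says little about survival over the mission.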
If you have nothing else to go on but a statement concerning MTBF, you may also be in a position to recommend testing. If you’re lucky enough to have applicable test data in your filing cabinet, you might be able to reuse it in some way. One way that seems particularly appropriate is to estimate the conditional probability of failure based on age. Nelson has suggested some techniques.
A supplier recently suggested that a card has an MTBF of millions of hours, or hundreds of years. Probably the only way I can use that information is to estimate replacement rate and sparing needs for a large fleet (i.e., 1 replacement in X years if I own 100 of these cards). Fortunately, there was field data, and it turned out that we could estimate time to first failure, and knowing the age of a card, we could estimate the conditional probability of failure in any time interval we wanted. Of course it turns out that the median time to first failure of this card (and the 63rd percentile as well, which is the characteristic life for Weibull and MTBF for constant failure rate) can be estimated, and it’s on the order of high single digit years. It still means that this card is not likely to fail during the technology cycle, but we do have a data-driven approach here.
There is every reason to believe that electronics wear out. Knowing this allows an engineer to work backwards from various sets of assumptions to get a feel for what might happen in real life. My general expectation is that over a reasonable time (say up to 5 years, more or less), most problems will be fairly robust to the assumptions used, and this provides some structure for taking appropriate action. If the analysis shows that the design is sensitive to assumptions, then that’s useful too, and it tells you where to do more work.
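The conditional probability mentioned here can be sketched as 1 - R(age + interval) / R(age). The Weibull parameters below (shape 3, characteristic life 8 years) are hypothetical stand-ins, not the fitted values from the field data described above.

```python
import math

def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / scale) ** shape))

def conditional_failure_probability(age: float, interval: float,
                                    shape: float, scale: float) -> float:
    """P(fail in (age, age + interval] | survived to age)."""
    r_now = weibull_reliability(age, shape, scale)
    r_later = weibull_reliability(age + interval, shape, scale)
    return 1.0 - r_later / r_now

# Hypothetical wear-out model, time in years (assumed, not from real data)
shape, scale = 3.0, 8.0

for age in (1.0, 3.0, 5.0):
    p = conditional_failure_probability(age, 1.0, shape, scale)
    print(f"age {age:.0f} y: P(fail in next year) = {p:.4f}")
```

With a wear-out shape the chance of failing in the next year grows with age (about 1.4% at one year, about 16% at five years in this made-up case). With shape 1 it is the same at every age, which is why MTBF alone cannot say when spares will be needed.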
Fred Schenkelberg says
Hi Paul,
Great comments – thanks.
If all you have is MTBF, then you are correct that one should look for more information. Sometimes we have field data, maybe a literature search, etc. All good. The key is that MTBF in and of itself is not all that useful. We should build the reflex that seeing MTBF means needing more information.
Even estimating spares for a fleet is not all that useful using MTBF alone. It may provide a gross number, yet we may be very interested in when those spares are most needed. Like you said, electronics do tend to wear out, and there is the common issue of factory and supply chain induced problems (early life failures). Given only MTBF, we do not know if we need the parts early or late in the life of the units.
The million hour MTBF does not mean 100 years of life. It means there is a 1 / 10^6 chance of failure each hour, and that is all. If they tested 1 million boards for an hour, they may claim they have test data to support the claim. More likely they tested a few boards for thousands of hours to tally up to one million total hours, or they just did a parts count prediction.
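A quick sketch of what a million hour MTBF does and does not say, assuming the constant failure rate holds for the whole period (a big assumption). The helper names are hypothetical, and the 200 boards by 5,000 hours test is an invented example of "a few boards for thousands of hours".

```python
import math

HOURS_PER_YEAR = 8760.0

def survival_fraction(mtbf_hours: float, years: float) -> float:
    """Fraction surviving after `years`, constant failure rate assumed."""
    return math.exp(-years * HOURS_PER_YEAR / mtbf_hours)

def expected_fleet_failures(mtbf_hours: float, units: int, years: float) -> float:
    """Expected failures for a fleet with failed units replaced immediately."""
    return units * years * HOURS_PER_YEAR / mtbf_hours

def mtbf_point_estimate(units: int, hours_each: float, failures: int) -> float:
    """Total unit-hours divided by number of failures."""
    return units * hours_each / failures

mtbf = 1_000_000.0
print(f"Surviving 100 years: {survival_fraction(mtbf, 100):.3f}")
print(f"Failures per year, 100 cards: {expected_fleet_failures(mtbf, 100, 1):.3f}")

# Two very different tests that report the same MTBF
print(mtbf_point_estimate(1_000_000, 1.0, 1))   # a million boards for one hour
print(mtbf_point_estimate(200, 5_000.0, 1))     # 200 boards for 5,000 hours
```

Even taking the constant rate at face value, fewer than half the units would survive 100 years, and both tests report the same number while neither says anything about behavior beyond the hours actually tested.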
What is missing is the duration over which the MTBF is valid. If it’s just five years and ignores early life failures, then it means the unit is probably pretty robust. If the application is a 30 year solar panel installation, I would want more information.
Cheers,
Fred
Paul says
Fred,
I wonder if MTBF can be meaningful unless the constant failure rate distribution applies. If you were to integrate the hazard function and plot failures as a function of the integral, then you could tell. If that curve is concave upward, then infant mortality or reliability growth is indicated. If it is linear, then there is a constant failure rate. If it is concave downward, then wear-out is indicated.
In real life, things aren’t usually so simple and it’s fairly rare that a single distribution captures the life cycle reliability experience. There are often competing failure modes.
As you point out, a prediction is reasonably worthless. Testing 10,000 units for 100 hours isn’t the same as testing 1000 units for 1000 hours or 100 units for 10,000 hours. A reliability demonstration’s validity ends at the clock time of its conclusion. There would quite likely be different failure counts in each of those 3 test plans.
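To illustrate why the three test plans are not equivalent, the sketch below computes the expected number of failures in each plan under two hypothetical life models: a constant failure rate with a million hour MTBF, and a Weibull wear-out model (shape 3, characteristic life 20,000 hours). Both parameter sets are assumptions chosen for illustration only.

```python
import math

def expected_failures(units: int, hours: float, shape: float, scale: float) -> float:
    """Expected failures with no replacement: units * F(hours), Weibull life."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny
    unreliability = -math.expm1(-((hours / scale) ** shape))
    return units * unreliability

# Three plans, each totaling one million unit-hours
plans = [(10_000, 100.0), (1_000, 1_000.0), (100, 10_000.0)]

# Hypothetical life models; neither comes from real data
models = {
    "constant rate (beta=1, eta=1e6 h)": (1.0, 1_000_000.0),
    "wear-out (beta=3, eta=20,000 h)": (3.0, 20_000.0),
}

for label, (shape, scale) in models.items():
    print(label)
    for units, hours in plans:
        n_fail = expected_failures(units, hours, shape, scale)
        print(f"  {units:>6} units x {hours:>6.0f} h: "
              f"expected failures = {n_fail:.4f}")
```

Under the constant rate model all three plans expect about one failure. Under the wear-out model the short test expects essentially none while the long test expects nearly twelve, even though total unit-hours are identical.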
Early failures can have lots of different causes: defects either introduced or not eliminated during manufacture, fast growing defects under field stress that cannot be detected during manufacture, installation errors, and so forth. The reliability engineer needs to be aware of whatever may apply in a given situation, and it’s reasonably unlikely that any measure (predicted or test-based) that is total time over number of failures will help control them. Deeper knowledge is simply required, or a willingness to live with those failures. Partnerships with quality engineers, procedure planners, and so forth are useful here. In any event, it’s more than just reliability.
More and more, I’m tending to a data driven approach. I am challenging suppliers with questions like “what do you mean by an MTBF of a hundred thousand hours?” It cannot possibly have much to do with lifetime, or true probability of failure (given that I’m pretty sure right out of the gate that there’s a measurable return rate given even 1 or 2 years of operation). I’m also asking “how do you know?” I am asking what failure modes I can expect to see, and how “MTBF” is computed. If I get back total time over number of failures or a Telcordia prediction, then I know there’s a lot more work to do. If there’s no FMEA, then the supplier hasn’t thought about the product carefully. And I’m asking how the design is robust in the event expected failures occur. The real reason I’m interested in the rate of failures is that I want to understand how to plan for maintenance or replacement. If I can prevent failures by partnering with quality engineers, design engineers, and procedure planners, then so much the better. It isn’t possible to prevent 100% of failures, of course. In the age of networks and cloud computing, we have to be thinking about a lot more than dividing the number of failures into time. We have to be asking the right questions and getting useful answers (though this is often like pulling teeth with no anesthesia).
Fred Schenkelberg says
Hi Paul,
I’d say try it – if given just MTBF – then the hazard function is just a straight line – it will not show early life or wear out patterns….
I totally agree that if you have the data, use it first. The simplifying assumptions involved with using MTBF often obscure the useful information contained within the data.
Cheers,
Fred