5 Clues That Using MTBF Is Not Helping

Have you ever heard someone claim, “We use MTBF, and it’s working just fine”?

The organization making that claim may be profitable and successful in the marketplace. But is MTBF actually serving them well?

Probably not.

One way to help the folks who claim MTBF is alright is to show that a better reliability metric may provide an improvement over MTBF. Asking a few questions may reveal the inevitable chink in the MTBF armor.

Any early failures due to supplier or assembly errors?
Any testing designed to determine how long the system will last?
Ever notice that maintenance seems to improve or degrade system performance?
Did you fix the time frame for the calculation of MTBF?
Do you plot and track the month-to-month changes in MTBF?

The key is to find indications that they need to understand the changing nature of the underlying failure rates for their product or system. If the product or system only experienced failures that followed an exponential distribution, the answer to every question above would be no. MTBF, begrudgingly, may then work.

Unfortunately, products, components, and systems rarely, if ever, have a time to failure distribution that is exponential. The arrival of failures tends to either taper off or increase over time; failures rarely, if ever, actually arrive at a steady pace.
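One way to see these patterns is through the Weibull hazard (instantaneous failure rate) function, h(t) = (β/η)(t/η)^(β-1). The sketch below is a minimal illustration that assumes a two-parameter Weibull distribution with a hypothetical characteristic life η of 1,000 hours. A shape β below 1 gives a decreasing rate, β = 1 reduces to the exponential distribution's constant rate, and β above 1 gives an increasing rate.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Instantaneous failure rate h(t) of a two-parameter Weibull distribution (t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


ETA = 1000.0  # hypothetical characteristic life, hours
CASES = [
    (0.5, "decreasing (early life)"),
    (1.0, "constant (exponential)"),
    (3.0, "increasing (wear out)"),
]

for beta, label in CASES:
    # Failure rate at three illustrative ages, in failures per hour
    rates = [weibull_hazard(t, beta, ETA) for t in (100.0, 500.0, 2000.0)]
    print(f"beta = {beta}: {label}: " + ", ".join(f"{r:.2e}" for r in rates))
```

Only the β = 1 row stays flat. That flat rate is the single case in which one MTBF value fully describes the failure rate.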

Check the Assumption with Your Data

I once heard the claim that the time to failure distribution was always best described by the exponential distribution. So I asked what evidence they had to support that claim.

They always assumed an exponential distribution, so they collected only the number of failures and the total operating time to calculate MTTF or MTBF. As a result, all their data ‘supported’ using MTBF; it was all they ever measured. In their minds, there was no evidence that they needed anything other than MTBF.

This group routinely scrubbed the data to remove any vestige of time to failure information. They deemed it unnecessary to collect start times or failure times. A count was good enough.
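To illustrate what that scrubbing throws away, here is a minimal sketch with two hypothetical sets of five failure times, where every unit ran until it failed. Both sets have the same total operating time and failure count, so the count-and-tally calculation returns the same value for each. The fraction that failed within the first 100 hours, however, is very different. The function names and data are illustrative only.

```python
def mtbf_from_totals(total_operating_hours: float, failure_count: int) -> float:
    """Count-and-tally estimate: the only two numbers kept once timing data is scrubbed."""
    return total_operating_hours / failure_count


def fraction_failed_by(failure_times: list[float], t: float) -> float:
    """Empirical fraction of units that failed at or before time t."""
    return sum(1 for x in failure_times if x <= t) / len(failure_times)


# Hypothetical failure times in hours; every unit was run until it failed.
early_failures = [10.0, 20.0, 30.0, 40.0, 1900.0]
wear_out = [380.0, 390.0, 400.0, 410.0, 420.0]

for name, times in [("early failures", early_failures), ("wear out", wear_out)]:
    estimate = mtbf_from_totals(sum(times), len(times))
    print(f"{name}: estimate = {estimate} h, failed by 100 h = {fraction_failed_by(times, 100.0):.0%}")
# early failures: estimate = 400.0 h, failed by 100 h = 80%
# wear out: estimate = 400.0 h, failed by 100 h = 0%
```

Once only the count and the tally are stored, nothing in the data can distinguish these two very different products.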

Early Life Failures

Back to the questions above. They regularly worked with suppliers and on their design to eliminate early life failures. Resolving the issues that cause failures is a good thing, yet they saw nothing odd about ignoring the decreasing failure rate and simply adjusting MTBF each time another failure occurred.

Later I discovered they didn’t count failures during the first month of product operation when calculating MTBF. Those ‘quality issues’, even though they occurred while the customer was using the product, were deemed outside the ‘useful life’ region and thus not appropriate for MTBF calculations.
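A quick calculation shows how much that exclusion can flatter the result. This sketch assumes a hypothetical fleet that logged 60,000 operating hours with six failures, three of which occurred within a unit's first 720 hours (about 30 days of continuous use). The numbers are invented for illustration. The excluded failures drop out of the count, but their operating hours stay in the total.

```python
FIRST_MONTH_HOURS = 720.0  # assumption: 30 days x 24 hours of continuous operation


def mtbf(total_hours: float, failure_ages: list[float], exclude_before: float = 0.0) -> float:
    """Total operating hours divided by the number of failures that are counted.

    Failures at unit ages below exclude_before are dropped from the count,
    while all operating hours remain in the numerator.
    """
    counted = [age for age in failure_ages if age >= exclude_before]
    return total_hours / len(counted)


fleet_hours = 60_000.0
failure_ages = [100.0, 300.0, 500.0, 2_000.0, 8_000.0, 15_000.0]  # hypothetical ages at failure, hours

print(mtbf(fleet_hours, failure_ages))                     # 10000.0 (all failures counted)
print(mtbf(fleet_hours, failure_ages, FIRST_MONTH_HOURS))  # 20000.0 (first-month failures dropped)
```

The product did not change, but the reported figure doubled.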

Onset of Wear Out or End of Life Estimates

I asked how long their product would last and how they knew. Did the product have some form of wear out or degradation that we should understand in order to plan for replacements or repairs?

Yes, it did. The team regularly conducted accelerated life testing on two different elements that would limit the life of the product. In both cases, the failure mechanism could feasibly occur after only a month or two of operation under some circumstances.

In most situations, these failure mechanisms would not noticeably increase the probability of failure until closer to five years of use, after which the failure rate would continue to rise slowly. The timing depended on the use conditions and frequency of use. This only meant the ‘useful life’ period was different for each customer. If the product failed after two months, the useful life period was only two months. If it worked for five years before failing, the useful life period was much longer.

So I asked about the useful life, and they showed me the MTBF value again.
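To see why a single MTBF value cannot answer the useful life question, the sketch below compares a wear-out Weibull distribution with an exponential distribution that has the same mean. The shape β = 4 and scale η = 7 years are hypothetical values chosen for illustration. They are not the parameters of the product described above.

```python
import math


def weibull_cdf(t: float, beta: float, eta: float) -> float:
    """Probability a unit has failed by time t (two-parameter Weibull)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def exponential_cdf(t: float, mtbf: float) -> float:
    """Probability of failure by time t if the failure rate were constant at 1/mtbf."""
    return 1.0 - math.exp(-t / mtbf)


# Hypothetical wear-out parameters, in years.
BETA, ETA = 4.0, 7.0
# Mean life of this Weibull: the single number an MTBF-style summary would report.
mean_life = ETA * math.gamma(1.0 + 1.0 / BETA)  # roughly 6.3 years

for years in (2 / 12, 1.0, 5.0, 8.0):
    print(f"{years:5.2f} yr  Weibull F(t) = {weibull_cdf(years, BETA, ETA):.4f}  "
          f"exponential F(t) = {exponential_cdf(years, mean_life):.4f}")
```

The two models share the same mean. Even so, the exponential version predicts noticeably more failures in the first months and fewer late in life. The mean alone therefore says little about when wear out begins.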

Summary

Instead of rationalizing your metric, find one that is useful.

If your organization uses MTBF or your suppliers insist on using MTBF, ask them the questions above. Point out that MTBF is not serving them. No amount of fiddling, adjusting, or accommodating the changing nature of failure arrivals will reveal useful information while using MTBF. Those efforts are a futile attempt to glean a bit of information from a faulty measure.

Instead, use reliability directly. Use non-parametric or parametric models that allow the failure rate to change over time. If nothing else, plot the time to failure data in a histogram (a simple probability density function plot). Fit the time to failure data to a Weibull distribution and check whether the slope (beta) is 1. In other words, check whether the data are actually well described by the exponential distribution before assuming they are.
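Here is a minimal sketch of the Weibull check. It estimates beta and eta by median rank regression, which is a least squares fit on a Weibull probability plot using Bernard's approximation for the median ranks. It assumes complete data, meaning every unit in the sample failed. Censored units that were still running need a different method, such as maximum likelihood estimation. The function name and the simulated data are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple


def fit_weibull_mrr(times: Sequence[float]) -> Tuple[float, float]:
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Assumes complete, uncensored time to failure data (every unit failed).
    """
    t_sorted = sorted(times)
    n = len(t_sorted)
    if n < 2:
        raise ValueError("need at least two failure times")
    xs, ys = [], []
    for i, t in enumerate(t_sorted, start=1):
        f = (i - 0.3) / (n + 0.4)                # Bernard's median rank approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))  # linearized Weibull CDF
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx                  # slope of the probability plot
    intercept = y_bar - beta * x_bar  # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


if __name__ == "__main__":
    random.seed(1)
    # Simulated (hypothetical) data: 50 failure times with true beta = 2.5, eta = 1000 h
    sample = [random.weibullvariate(1000.0, 2.5) for _ in range(50)]
    beta, eta = fit_weibull_mrr(sample)
    print(f"beta = {beta:.2f}, eta = {eta:.0f} h")
```

A fitted beta near 1 is consistent with the exponential assumption. Values clearly below or above 1 point to a decreasing or increasing failure rate. With small samples, judge "near 1" using confidence bounds rather than the point estimate alone.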

Let your data speak to you, and let it guide your understanding of how your product actually fails.

Comments

Paul Franklin says

July 20, 2016 at 5:40 AM

Good article and great point: if you model assuming that the failure rate is constant, then you can’t model changes in the failure rate. I’ve also found that product level behavior rarely matches any standard distribution over long times. The implication here is that data that fits a Weibull or lognormal distribution now may not in the future, and if such changes occur, it is often useful to look for more than one major failure mode.

Don -PLC Failure says

July 21, 2016 at 5:22 PM

As an example, one can look at the PLC MTBF article http://bin95.com/plc-controller-failure-rate.htm
Another example is automated, augmented MTTF/MTBF using a PLC to monitor an event (say, each time a solenoid is told to activate), write a formula to account for application variables such as environmental effects, and then predict the date the solenoid will need to be replaced before it fails. Two solenoids on the same machine will have different estimated failure dates, based on their frequency of use in that particular process.
Another example is a conveyor with varying loads/products: the PLC can monitor how fast product moves from one point on the conveyor to another, adjust the failure time, monitor the rate at which belt slippage increases, etc. All based on the points you outline here in your article.

- Fred Schenkelberg says
  
  July 21, 2016 at 5:29 PM
  
  Thanks for the note Don,
  
  Yes, there is wide application for this approach, the idea being to support rational decision making by making the tradeoffs between reliability and part/component ratings visible.
  
  Cheers,
  
  Fred