6 ways to overcome MTBF stubbornness
Just before I gave a one-hour presentation at a reliability engineering conference, George asked me how to teach others what MTBF really means.
Not having given this much thought before, I asked for more details.
George works as the sole reliability engineer for a small company making specialized network monitoring equipment.
Last week he had to explain to a vendor of electronic parts that MTBF is not a failure-free period. And that day he had talked to a customer with a similar interpretation of MTBF.
He said he wasn’t looking forward to talking to the development team about reliability goals, given the amount of confusion around MTBF. He wondered if there was an easy way to explain what MTBF was and was not.
Time to Present
It was time to present and the host began the introduction. George and his problem were still on my mind.
I smiled at the nearly 150 reliability professionals gathered and asked them if they had ever had an issue with someone not understanding MTBF.
150 hands shot up, lots of head nodding, a murmur arose from the group.
The Common Issues
So, I explained George’s experience and asked the group if they had encountered similar issues.
They had, and others then offered additional ways MTBF could be misunderstood. Some talked about angry customer calls, some about sorting failure data to include only constant failure rate failures (whatever those are), and others about poor business decisions.
The Perils of MTBF articles came from this discussion.
6 situations and ideas to overcome?
Then we talked about what we can do to help others understand the problems and misunderstandings around MTBF.
It is not a failure-free period. Assuming a constant hazard rate, it is the point in time by which about two-thirds (63.2%) of the units are expected to have failed.
It only applies to the flat part of the bathtub curve, that is, to a constant hazard rate. Of all the products I’ve worked with, I’ve never seen a flat part of the curve. Products do experience increasing and decreasing failure rates over time.
It is true that MTBF given alone assumes a constant hazard rate and if that is actually true for the product in question, using MTBF is ok.
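George later explained that the vendor was providing cooling fans, and that they had discussed bearing wear-out as the expected failure mechanism, along with the vendor’s firm belief that an MTBF of 50,000 hours was accurate. Bearing wear-out means an increasing hazard rate, which a single MTBF value cannot describe.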
MTBF alone does not apply over all time. Eventually every product will wear-out (show an increasing hazard rate with age). Thus, MTBF if provided should always include the duration over which it applies.
Small ceramic capacitors, if properly mounted on a circuit board and expected to function for 2 years, have such a small change in hazard rate that it could be considered constant. If the MTBF value is stated without the 2-year duration, one could assume that it applies equally well over 30 years of use or longer.
If we are using a lot of ceramic capacitors, it probably is not safe to assume all of them are properly attached to the circuit board and survive transportation and installation. The ones that do not lead to an apparent decreasing failure rate over the first few months of the product’s life.
It is very difficult to determine the cause of a solder joint failure. And, even if we could conclude the failure was due to a poor soldering process, it’s still a failure for the customer. It counts.
The discussion included at least a dozen people just like George offering similar stories and suggestions around explaining what MTBF is and is not.
Thanks to George for asking the question and to the audience for sharing. Here is a summary of the types of misunderstandings and what we can do about them.
1 – MTBF is a failure free period.
No, do the math. It is the average time to failure (total operating time divided by the number of failures). Because it is the unbiased estimator of the single parameter of the exponential distribution, it also represents the time by which 63.2% of units are expected to have failed.
MTBF is the inverse of the failure rate. If the product failures are well described by a constant hazard rate, there is the same chance of failure in each hour of operation, starting with the first.
2 – MTBF is the time till 50% of units fail
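Closer to accurate, yet based on the common understanding of the average or mean from a normal distribution, where the mean and the median coincide. For life data we often use skewed distributions that do not extend to negative infinity. For the exponential distribution the mean falls at the 63.2nd percentile, not the 50th. For the Weibull distribution it is the characteristic life (the scale parameter) that falls at the 63.2nd percentile; the mean matches it only when the shape parameter is 1, which is the exponential case.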
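A quick numerical check of these percentiles may help. The sketch below uses only the Python standard library. The function names are illustrative, the 50,000-hour figure is simply the fan MTBF from the story above, and the shape values are hypothetical. It assumes the standard two-parameter Weibull form with scale (characteristic life) eta and shape beta, where beta = 1 reduces to the exponential distribution.

```python
import math

def weibull_cdf(t, eta, beta):
    """Expected fraction of units failed by time t (two-parameter Weibull)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def weibull_mean(eta, beta):
    """Mean time to failure of a two-parameter Weibull distribution."""
    return eta * math.gamma(1.0 + 1.0 / beta)

def exponential_median(mtbf):
    """Time by which half the units are expected to fail (constant hazard rate)."""
    return mtbf * math.log(2.0)

mtbf = 50_000.0  # hours; illustrative value taken from the fan example
print(f"Exponential median: {exponential_median(mtbf):.0f} h, about 69% of the MTBF")

# Hypothetical shapes: 0.5 (decreasing hazard), 1.0 (constant), 3.0 (wear-out).
for beta in (0.5, 1.0, 3.0):
    eta = 1.0  # the percentiles do not depend on the scale
    at_mean = weibull_cdf(weibull_mean(eta, beta), eta, beta)
    at_eta = weibull_cdf(eta, eta, beta)
    print(f"beta = {beta}: {at_mean:.1%} failed at the mean, {at_eta:.1%} failed at eta")
```

Under these assumptions the mean lands on the 63.2% point only for beta = 1, while the characteristic life eta sits there for any beta.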
3 – MTBF applies for the flat part of curve
This is true, and thus MTBF doesn’t apply to most products or components. There are literally thousands of ways any product can fail. Each failure mechanism is waiting for the right conditions to invoke a failure.
There are failures generally associated with early life which include faulty components from the vendor, poor assembly, transportation or installation damage or start-up damage. The cause may occur early in the life of a unit, and may result in a failure at any time. It may take years for a cold solder joint to fatigue to failure.
4 – Our customers are asking for MTBF
Give it to them along with a time duration over which it applies. Be clear about the expected dominant failure mechanisms which may apply in their application. Provide a Weibull or other complete description of expected failures over time. Sure, a Weibull curve over a 5-year application can be summarized with a single value, the expected value or MTBF, yet it is not as useful as an accurate description of the time to failure distribution.
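To see how much a single value can hide, here is a minimal sketch comparing three Weibull distributions that share the same mean life. The 50,000-hour mean and the 2-year mission are borrowed from the fan example; the shape values and function names are hypothetical, chosen only for illustration.

```python
import math

def eta_from_mean(mean_hours, beta):
    """Weibull scale (characteristic life) that yields the given mean life."""
    return mean_hours / math.gamma(1.0 + 1.0 / beta)

def weibull_reliability(t_hours, eta, beta):
    """Probability that a unit survives to t_hours."""
    return math.exp(-((t_hours / eta) ** beta))

MEAN_LIFE = 50_000.0   # hours; the same "MTBF" for every case
MISSION = 2 * 8760.0   # two years of 24/7 operation, in hours

# Hypothetical shapes: early-life dominated, constant hazard, wear-out dominated.
for beta in (0.5, 1.0, 3.0):
    eta = eta_from_mean(MEAN_LIFE, beta)
    r = weibull_reliability(MISSION, eta, beta)
    print(f"beta = {beta}: eta = {eta:,.0f} h, R(2 years) = {r:.3f}")
```

All three cases would be reported as a 50,000-hour MTBF, yet the expected fraction surviving two years differs widely, which is why the shape of the distribution belongs with the number.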
For complex products provide a block diagram with expected life distributions for each major subsystem. You may need to include non-parametric and parametric descriptions of reliability over the desired duration. Monte Carlo analysis may be useful.
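As a rough illustration of the block diagram idea, the sketch below runs a small Monte Carlo simulation of a series system, meaning the product fails when its first subsystem fails. The subsystem names and Weibull parameters are hypothetical, not data from any real product, and the series arrangement is itself an assumption. A closed-form check is included because a series system of independent Weibull blocks happens to have one; the simulation earns its keep once the diagram includes redundancy, repair, or non-parametric life data.

```python
import math
import random

# Hypothetical subsystems in series: (name, Weibull scale eta in hours, shape beta).
SUBSYSTEMS = [
    ("cooling fan", 60_000.0, 3.0),     # wear-out dominated
    ("power supply", 150_000.0, 1.0),   # roughly constant hazard
    ("main board", 400_000.0, 0.7),     # early-life dominated
]

def simulate_system_life(rng):
    """One simulated unit: the system lasts until its first subsystem fails."""
    return min(rng.weibullvariate(eta, beta) for _, eta, beta in SUBSYSTEMS)

def estimate_reliability(mission_hours, trials=100_000, seed=1):
    """Fraction of simulated units still working at mission_hours."""
    rng = random.Random(seed)
    survived = sum(simulate_system_life(rng) > mission_hours for _ in range(trials))
    return survived / trials

def analytic_reliability(mission_hours):
    """Closed-form check: product of the independent subsystem reliabilities."""
    return math.prod(
        math.exp(-((mission_hours / eta) ** beta)) for _, eta, beta in SUBSYSTEMS
    )

mission = 2 * 8760.0  # two years of 24/7 operation, in hours
print(f"Simulated R(2 years): {estimate_reliability(mission):.3f}")
print(f"Analytic  R(2 years): {analytic_reliability(mission):.3f}")
```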
5 – Our vendors are only providing MTBF
Ask them for more information. What is likely to fail given our application? What data is available (do the analysis yourself, since maybe they don’t know how)? Do some research on the expected failure mechanisms and whether they are predominantly early life or wear-out, then alter your expectations around when to expect failures (versus using just the MTBF).
By all means do not ask for MTBF by name. Ask for reliability information, including the expected changes in failure rate over the time period your application will use the vendor’s product. Be clear that MTBF is only a failure rate and by itself assumes a constant hazard rate. Ask if that is true, and ask for evidence that it is true over the duration and application in question.
6 – My team only wants to use MTBF
That may be all they have ever used and all they know about reliability. This is the point you become a teacher and provide much of the same information as above.
Lead by example by not using MTBF, rather use probability of success (reliability) and duration (e.g. 98% reliable over two years).
Don’t use MTBF in specifications, calculations, test planning, etc. There are better and easier to understand and use alternatives.
Be ready to convert MTBF to reliability when someone uses an MTBF value. For example, a 50,000 hour MTBF over 2 years of 24/7 use means the probability of a unit surviving 2 years is
R(17,520 hours) = exp [ - 17,520 / 50,000 ] = 0.704
or, we can expect 70% of units to survive 2 years. Is that good enough? Using just 50,000 hours MTBF along with one or more misunderstandings certainly could lead to accepting 50,000 hours MTBF. Doing the simple conversion to reliability helps all involved understand the meaning of the MTBF value.
Those are just a few of the hurdles and suggestions George and I received that day. I’m sure you have experienced a few others. Please continue the discussion and share the misunderstandings and how you recommend overcoming the hurdles of the stubborn use of MTBF.
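The conversion is easy to keep on hand as a pair of small functions. This is a minimal sketch assuming a constant hazard rate (the only case where MTBF alone says anything); the function names are illustrative. The second function runs the conversion in reverse, showing what MTBF a goal such as 98% reliable over two years would imply.

```python
import math

HOURS_PER_YEAR = 8760.0  # 24/7 operation

def reliability_from_mtbf(mtbf_hours, mission_hours):
    """Probability of surviving mission_hours, assuming a constant hazard rate."""
    return math.exp(-mission_hours / mtbf_hours)

def mtbf_for_reliability(target_reliability, mission_hours):
    """MTBF implied by a reliability goal, under the same assumption."""
    return -mission_hours / math.log(target_reliability)

mission = 2 * HOURS_PER_YEAR  # 17,520 hours
print(f"R(2 years) at 50,000 h MTBF: {reliability_from_mtbf(50_000.0, mission):.3f}")
print(f"MTBF implied by 98% over 2 years: {mtbf_for_reliability(0.98, mission):,.0f} h")
```

Under this assumption, a 98% two-year goal implies an MTBF far above 50,000 hours, which tends to make the conversation about the reliability goal more concrete than the MTBF figure ever did.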
Don (Industrial Training) says
Right on Fred. When I wrote the blog about PLC failures and MTBF, your point “complex products provide a block diagram with expected life distributions for each major subsystem” was also my main point. thanks