Please don’t remove MTBF, part 1
A recent forum post found two of my many arguments for the eradication of MTBF to be incorrect or invalid. Maybe the author (HL) has a valid point. Let’s take a closer look at the note and the writer’s reasoning.
“MTBF is not useful”
The first argument in HL’s note disputes my claim that MTBF is not useful. He cites the definition of MTBF as being a mean (a statistical average, or indication of central tendency). This is true, as MTBF is the first moment of the exponential distribution. Additionally, for those rare cases when you want to know the point in time when about 63% of the items have failed, the MTBF value is the go-to value.
Or is it?
The underlying assumption is that the rate of failure is constant. That assumption is the primary reason I find MTBF useless. I know of very few items that truly have a constant failure rate. Furthermore, and more directly regarding the notion that MTBF is the mean of a distribution, it is rare that the mean value alone is useful. The exponential is a single-parameter distribution, yet the comparison to normally distributed measures implies that the first moment alone is sufficient to make decisions. Even to judge whether a class of students with an average height above 2 meters lets us conclude that the population of all students also averages above 2 meters, we need the variance term to make a convincing judgment.
MTBF is a mean value of time-to-failure data. If I review some field return data and calculate the MTBF, it is pretty straightforward. One just needs to tally the total time all units have been in operation and divide by the number of failures. It is an unbiased estimator of the first moment (mean) of the exponential distribution. Now let’s say we want to recommend a maintenance time period (like 2 years, or 20k miles) such that we could improve the reliability of the system with regular maintenance.
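As a brief aside before turning to maintenance intervals, the tally described above can be sketched in a few lines of Python. The unit hours and failure count are invented for illustration, and the function names are hypothetical. The second function shows where the 63% figure comes from, assuming the times to failure really are exponentially distributed.

```python
import math

def estimate_mtbf(operating_hours, failure_count):
    """Total operating time of all units divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("need at least one failure to estimate MTBF")
    return sum(operating_hours) / failure_count

def fraction_failed(t, mtbf):
    """Exponential CDF: expected fraction failed by time t (constant-rate assumption)."""
    return 1.0 - math.exp(-t / mtbf)

# Hypothetical field data: hours accumulated by each of five units
hours = [1200.0, 3400.0, 560.0, 4100.0, 2740.0]
failures = 3

mtbf = estimate_mtbf(hours, failures)       # 12000 / 3 = 4000 hours
at_mtbf = fraction_failed(mtbf, mtbf)       # 1 - exp(-1), about 0.632

print(f"MTBF estimate: {mtbf:.0f} hours")
print(f"Expected fraction failed by t = MTBF: {at_mtbf:.3f}")
```

Note that nothing in this calculation checks whether the constant-rate assumption actually fits the data.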
Keep in mind that the exponential distribution behind MTBF has the interesting property of being memoryless. An MTBF value implies a 1/MTBF chance per hour that an item will fail. That chance is unrelated to, and not conditional on, the age of the item. This is very accurate when the item experiences failure only at a truly constant rate and failure is totally random. Further, even if we set a time period, say 2 years, what changes? If we replace the item with a similar item, even if it’s brand new, the chance of failure in the next hour is still the same value: 1/MTBF. There is no improvement, and there is the very real possibility of damage during the maintenance activity.
So even with the expense of doing a replacement with a new item, there is no improvement in the system reliability. The only maintenance approach that makes any sense, for an item which is accurately represented by MTBF, is to replace the item when it fails. No other approach makes sense to me.
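To see the memoryless property in numbers, here is a minimal sketch, assuming exponentially distributed times to failure and an illustrative MTBF of 4,000 hours (not a value from any real data set). The function names are hypothetical. It computes the chance of failing in the next hour, given that the item has survived to a certain age.

```python
import math

def exp_reliability(t, mtbf):
    """Probability of surviving to time t under the exponential model."""
    return math.exp(-t / mtbf)

def prob_fail_next_hour(age, mtbf):
    """P(fail within the next hour | survived to `age`) = 1 - R(age + 1) / R(age)."""
    r_now = exp_reliability(age, mtbf)
    r_next = exp_reliability(age + 1.0, mtbf)
    return 1.0 - r_next / r_now

mtbf = 4000.0                                      # illustrative value only
new_unit = prob_fail_next_hour(0.0, mtbf)          # brand new replacement
old_unit = prob_fail_next_hour(17520.0, mtbf)      # about 2 years of continuous operation

print(f"new unit: {new_unit:.6f}")
print(f"old unit: {old_unit:.6f}")
print(f"1/MTBF:   {1.0 / mtbf:.6f}")
```

Under this model both values come out approximately equal to 1/MTBF, so the brand new unit is no better than the two-year-old one.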
The last part of the first refutation indicates that if the MTBF value changes over the different time periods under consideration, then using a distribution that includes the rate of change would be more accurate. Yet, for complex systems with relatively constant failure rates over the duration of interest, MTBF is “a quite good estimate”. Given the ease of using the appropriate math for reliability statistics, I wonder whether a short study comparing the results of both methods would reveal the same answers. In the many situations where I’ve been asked to review reliability data, the comparison has been stark. The general conclusion becomes: in the past, using MTBF, ‘mistakes were made’. If maintenance costs, downtime, inventory costs, and customer satisfaction are of little concern, then go ahead, use MTBF and the ‘good enough’ approach. If you want to understand the reliability of your product, save money and time, improve availability, and enjoy the praise of happy customers, then do the math and do it right.
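One way to sketch such a comparison, under loudly stated assumptions: take a Weibull distribution with the same mean as the exponential, so both would report an identical MTBF, and compare the predicted fraction surviving at some early time. The shape values (beta) of 0.5 and 2, the 4,000 hour mean, and the 1,000 hour point of interest are hypothetical choices for illustration, not results from any study.

```python
import math

def exponential_reliability(t, mtbf):
    """Fraction surviving to time t with a constant rate of 1/MTBF."""
    return math.exp(-t / mtbf)

def weibull_reliability(t, mean_life, beta):
    """Fraction surviving to time t for a Weibull with the given mean and shape.

    The scale (eta) is chosen so that the Weibull mean equals mean_life,
    using mean = eta * Gamma(1 + 1/beta).
    """
    eta = mean_life / math.gamma(1.0 + 1.0 / beta)
    return math.exp(-((t / eta) ** beta))

mtbf = 4000.0   # hypothetical mean life, identical for all three models
t = 1000.0      # hypothetical time of interest

r_exp = exponential_reliability(t, mtbf)
r_early = weibull_reliability(t, mtbf, beta=0.5)    # decreasing rate (early failures)
r_wearout = weibull_reliability(t, mtbf, beta=2.0)  # increasing rate (wear-out)

for label, r in [("beta = 0.5", r_early),
                 ("exponential (beta = 1)", r_exp),
                 ("beta = 2", r_wearout)]:
    print(f"{label}: {r:.3f} surviving at {t:.0f} hours")
```

With these inputs, the same MTBF corresponds to roughly 49%, 78%, or 95% of units surviving at 1,000 hours, depending on the shape. Which shape applies is a question only the data can answer.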
I’m interested in anyone’s ability to use MTBF in a beneficial way. Please write to me and let me know how you do it. What are some situations where using MTBF is the best and least error-prone method?
HL also refutes a second argument; let’s explore that next week.
Chet Haibel says
Fred: The above intends to make some good points, but fails to distinguish between the failure rate, which follows an exponential distribution, and the hazard rate, which is constant. MTBF is not 1/failure rate, it is 1/hazard rate. Then by erroneously citing failure rate in a number of places, you find all kinds of things wrong with MTBF.
Also, no one schooled in Reliability Centered Maintenance would proactively replace working components (or subsystems) unless their Weibull Beta is 2 or higher. Only a very uninformed person would proactively replace components (or subsystems) with a constant hazard rate.
Fred Schenkelberg says
Hi Chet,
Guilty. I did confuse the two terms, and often have. Maybe you could draft a short article (blog post) for the NoMTBF site on the difference between hazard rate and failure rate, to help us all understand and use the proper terminology. Also, what could go wrong if we use the concept of constant failure rate when that is not what the math means?
cheers,
Fred