Fixing Early Life Failures Can Make Your MTBF Worse
Let’s say we have 6 months of life data on 100 units. We’re charged with looking at the data and determining the impact of fixing the problems that caused the earliest failures.
The initial look at the data shows 9 failures and 91 suspensions. Other than the nine failed units, all units operated for 180 days. The MTBF is about 24k days. Having heard about Weibull plotting and using the beta value as a guide, we fit the data and find the blue line in the plot. The beta value is less than one, so we start looking for supply chain, manufacturing, or installation-caused failures, as we suspect early life failures dominate the time to failure pattern.
Initial Steps to Improve the Product
Given clues and evidence that some of the products failed early, we investigate and find evidence of damage to units during installation. In fact, it appears the first four failures were due to installation damage. The fix will cost some money, so the director of engineering asks for an estimate of the effect of the change on the reliability of the system.
The organization uses MTBF, as does the customer. The existing MTBF of 24k days exceeds the customer's requirement of 10k days, yet avoiding early problems may be worth it for the customer goodwill. The change is driven by continuous improvement and not by necessity or customer complaints.
Calculation of Impact of Change on Reliability
One way to estimate the effect of removing a failure mechanism is to examine the data without counting the failures that mechanism caused. So, if the change to the installation practice, in the best case, completely prevents the initial four failures observed, we are left with just the 5 other failures that occurred over the 6 months.
Removing the four initial failures and recalculating, we estimate the MTBF will change to about 300 days.
Hmm?
We removed failures and the MTBF got worse?
What Could Cause this Kind of Change?
The classic calculation for MTBF is the total time divided by the number of failures. Taking a closer look at the time to failure behavior of the two different failure mechanisms may reveal what is happening. The early failures have a decreasing failure rate (Weibull beta parameter less than 1) over the first two months of operation. Later, in the last couple of months of operation, 5 failures occur, and they appear to have an increasing failure rate (Weibull beta parameter greater than 1).
By removing the four early failures, the Weibull distribution fit changes from the blue line to the black line (steeper slope).
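The refit can be sketched with median rank regression, one common way to draw the line on a Weibull plot. This is a minimal sketch, not the calculation behind the plot: the failure times in EARLY and LATE are hypothetical (the raw data is not listed here), the function names are made up for illustration, and the sketch assumes the four units that no longer fail would have run to 180 days as suspensions.

```python
import math

def median_rank_fit(failure_times, n_units):
    """Fit a two-parameter Weibull by median rank regression.

    Assumes every suspension happens after the last failure, so the
    failure ranks are simply 1..k out of n_units. Returns (beta, eta).
    """
    xs, ys = [], []
    for rank, t in enumerate(sorted(failure_times), start=1):
        f = (rank - 0.3) / (n_units + 0.4)       # Benard's median rank approximation
        xs.append(math.log(t))                   # x axis of the Weibull plot
        ys.append(math.log(-math.log(1.0 - f)))  # y axis of the Weibull plot
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx                             # slope of the fitted line
    intercept = y_bar - beta * x_bar             # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)            # characteristic life
    return beta, eta

def weibull_mean(beta, eta):
    """Mean time to failure of a Weibull distribution."""
    return eta * math.gamma(1.0 + 1.0 / beta)

# Hypothetical failure times in days, chosen only to mimic the pattern described.
EARLY = [2, 8, 20, 45]             # installation damage
LATE = [140, 150, 160, 170, 175]   # the five later failures
N_UNITS = 100

beta_all, eta_all = median_rank_fit(EARLY + LATE, N_UNITS)
beta_late, eta_late = median_rank_fit(LATE, N_UNITS)
mean_all = weibull_mean(beta_all, eta_all)
mean_late = weibull_mean(beta_late, eta_late)

print(f"all failures:  beta={beta_all:.2f}  mean={mean_all:.0f} days")
print(f"late failures: beta={beta_late:.2f}  mean={mean_late:.0f} days")
```

With these made-up times, the slope for all nine failures comes out below 1, the slope for the five late failures alone comes out well above 1, and the fitted mean drops sharply. The exact values depend entirely on the assumed failure times.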
Recall that the MTBF value represents the point in time when about 63% of units have failed (strictly, when the failure rate is constant). With only 9 failures out of 100 units, only about 10% of units have failed, so the MTBF calculation is a projection into the future when most units have failed. It does not directly provide information about failures at 6 months or less.
In this case, when the four early failures are removed, the slope changes from about 0.7 to about 5. The fitted line rotates counterclockwise on the CDF plot.
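To see why a steeper slope pulls the projected mean in so sharply, here is a rough sketch that draws a Weibull line of a given slope through a single point and reads off the characteristic life and the mean. The anchor points are assumptions made for illustration (about 5% failed at 180 days for the steep line, about 9% for the shallow one), and the function names are hypothetical.

```python
import math

def eta_through_point(beta, t, fraction_failed):
    """Characteristic life of a Weibull with slope beta passing through (t, F)."""
    return t / (-math.log(1.0 - fraction_failed)) ** (1.0 / beta)

def weibull_cdf(t, beta, eta):
    """Fraction of units failed by time t."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def weibull_mean(beta, eta):
    """Mean time to failure of a Weibull distribution."""
    return eta * math.gamma(1.0 + 1.0 / beta)

# Steep line: slope of about 5, assumed to pass through 5% failed at day 180.
eta_steep = eta_through_point(5.0, 180.0, 0.05)
mean_steep = weibull_mean(5.0, eta_steep)

# Shallow line: slope of about 0.7, assumed to pass through 9% failed at day 180.
eta_shallow = eta_through_point(0.7, 180.0, 0.09)
mean_shallow = weibull_mean(0.7, eta_shallow)

# Whatever the slope, about 63% of units have failed by the characteristic life.
fraction_at_eta = weibull_cdf(eta_steep, 5.0, eta_steep)

print(f"steep:   eta={eta_steep:.0f}  mean={mean_steep:.0f} days")
print(f"shallow: eta={eta_shallow:.0f}  mean={mean_shallow:.0f} days")
print(f"fraction failed at eta: {fraction_at_eta:.3f}")
```

Under these assumptions the steep line gives a mean close to 300 days, while the shallow line gives a mean in the thousands of days. A regression fit will give somewhat different values, since the fitted line need not pass through any single chosen point.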
If we used only MTBF, removing the four failures from the data would have made the measured MTBF much worse and would have prevented us from improving the product. By fitting the data to a Weibull distribution we learned to investigate early life failures, and once that failure mechanism was removed, the fit revealed a potentially serious wear-out failure mechanism.
This is an artificial example, of course, yet it illustrates the degree to which an organization is blind to what is actually occurring when it uses only MTBF. Treat the data well and use multiple methods to understand the time to failure pattern.