When to Use MTBF as a Metric?
I will not say ‘never’, which is probably what you expect. There is a rare set of circumstances that may benefit from the use of MTBF as a metric. Of course, this does not include being deceitful or misleading in marketing materials. There may actually be an occasion where the MTBF metric works well.
As you know, MTBF is often estimated by tallying up the total hours of operation of a set of devices or systems and dividing by the number of failures. If no failures occur we assume one failure to avoid dividing by zero (messy business dividing by zero and to be avoided). MTBF is essentially the average time to failure.
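The tally described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample numbers are made up, and the zero-failure branch simply applies the "assume one failure" convention mentioned in the text.

```python
def estimate_mtbf(operating_hours, failures):
    """Point estimate of MTBF: total operating hours / number of failures.

    operating_hours: hours accumulated by each unit (any iterable of numbers).
    failures: total failures observed across all units.
    """
    total_hours = sum(operating_hours)
    # Convention from the text: with zero failures, assume one failure
    # so the division is defined.
    counted_failures = max(failures, 1)
    return total_hours / counted_failures


# Hypothetical data: five units run for 1,000 hours each.
print(estimate_mtbf([1000] * 5, 2))  # 2500.0
print(estimate_mtbf([1000] * 5, 0))  # 5000.0
```

Note that the result carries no information about when in those hours the failures occurred, only how many there were.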
Expected Value as Metric
The metric we select should be measurable and should measure something we have an interest in. We would like to detect changes, measure progress, and possibly make business decisions with our metrics. If we are interested in the expected value of the time to failure for our devices, then MTBF might just be useful.
When making a device we often hear executives, engineers, and customers talk about how long they expect the product to last. An office device may have an expected life of 5 years, a solar power system 30 years, and so on. If by duration we all agree that we expect 5 years of service on average, then using the average as the metric makes sense.
Before adopting MTBF, just make sure that a 5 year life implies half to two thirds of the devices will fail by the stated duration of 5 years. Yes, if the time to failure distribution is actually described by the exponential distribution, about 63% (roughly two thirds) of the units are expected to fail by the MTBF value, and for a few other distributions the fraction is half or more. Thus if we set the goal to 5 years MTBF we imply half or more of the units will fail by 5 years.
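To check the "two thirds" figure, here is a minimal sketch that assumes the time to failure really is exponential. The function names are made up for illustration; the math is just the exponential cumulative distribution, F(t) = 1 - exp(-t / MTBF).

```python
import math


def fraction_failed(t, mtbf):
    """Expected fraction of units failed by time t (exponential assumption)."""
    return 1.0 - math.exp(-t / mtbf)


def median_life(mtbf):
    """Time by which half the units are expected to fail (exponential assumption)."""
    return mtbf * math.log(2)


# With a 5 year MTBF goal:
print(round(fraction_failed(5, 5), 3))  # 0.632, the fraction failed at the MTBF
print(round(median_life(5), 2))         # 3.47, years until half have failed
```

So under this assumption a 5 year MTBF means half the units are expected to fail well before year 4.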
Product Testing Advantages
Having a goal helps the design and development team make decisions and eventually conduct testing to prove the design meets the reliability objectives. Setting the goal at the expected value allows the fewest number of samples for testing. Testing for 99% reliability over 5 years is much tougher. We may require many samples to determine a meaningful estimate of the leading tail (i.e. first 1% or 5% of failures) of the time to failure distribution.
If the time to failure pattern fits an exponential distribution, then testing becomes simplified. We can test one unit for a long time, or many units for a short time, and arrive at the same answer. The test planning can maximize our resources to efficiently prove our design meets the objective. When the chance of failure each hour is the same, every device-hour of testing provides an equal amount of information.
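The "one unit for a long time or many units for a short time" claim can be illustrated with a small sketch. It assumes a constant failure rate, so the chance of seeing no failures depends only on total device-hours. The function name and the 50,000 hour MTBF are hypothetical.

```python
import math


def prob_no_failures(units, hours_each, mtbf):
    """Probability that a test sees zero failures, assuming a constant
    failure rate (exponential time to failure) and independent units."""
    total_device_hours = units * hours_each
    return math.exp(-total_device_hours / mtbf)


# Two hypothetical test plans with the same 10,000 device-hours:
one_long = prob_no_failures(units=1, hours_each=10_000, mtbf=50_000)
many_short = prob_no_failures(units=100, hours_each=100, mtbf=50_000)

print(round(one_long, 3), round(many_short, 3))  # 0.819 0.819
```

The equivalence holds only under the constant hazard rate assumption. If units wear out, the long test and the short test see very different failure behavior.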
Unlike products that wear out or degrade with time, when the design and device exhibit an exponential distribution we do not need any aging studies. We can just apply use or accelerated stress and measure the hours of operation and count the failures. Also any early failures are obviously quality issues and most likely do not count toward failures that represent actual field failures. Or do they?
Metrics Should Have a Common Understanding
When the industry, organization, vendors, and engineering staff already use MTBF to discuss reliability, then management would be wise to establish a metric using MTBF. Makes sense, right? The formula to calculate MTBF is very simple. Even the name implies the meaning (no pun intended). MTBF is the mean time between (or before) failure. It’s an average, which calculators, spreadsheets, smart phones, and possibly even your watch can calculate.
While the spread of the data is often of importance when making comparisons, estimating a sample set of data’s confidence bounds, or estimating the number of failures over the warranty period, if we assume the data actually fits an exponential distribution, we find the mean equals the standard deviation. Great! One less calculation. We have what we need to move forward.
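A quick simulation shows the mean and standard deviation coinciding. This is a sketch with made-up values (a 5 year MTBF, 100,000 simulated units, a fixed seed for repeatability), and it only demonstrates the property for data that truly is exponential.

```python
import random
import statistics


def simulate_lifetimes(mtbf, n, seed=0):
    """Draw n exponentially distributed lifetimes with the given mean."""
    rng = random.Random(seed)
    # expovariate takes the rate (1 / mean), not the mean itself.
    return [rng.expovariate(1.0 / mtbf) for _ in range(n)]


lifetimes = simulate_lifetimes(mtbf=5.0, n=100_000)
sample_mean = statistics.mean(lifetimes)
sample_sd = statistics.stdev(lifetimes)

# Both values should land close to 5 for exponential data.
print(f"mean = {sample_mean:.2f}, standard deviation = {sample_sd:.2f}")
```

Running the same comparison on real field data is a cheap first check: a standard deviation far from the mean suggests the exponential assumption does not hold.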
Nearly every reliability or quality textbook or guideline includes extensive discussions about MTBF, the exponential distribution, and a wide range of reliability related calculations. Our common understanding is generally supported by the plentiful references.
Ask a few folks around you when considering using MTBF. What do they define MTBF as representing? If you receive a consistent answer, you may just have a common understanding. If the understanding is also aligned with the underlying math and assumptions, even better.
When to Use MTBF Checklist
In summary all you need is:
- A business interest in the time till half or more of the products fail
- A design with a fixed chance of failure each hour of operation
- A well educated team that understands the proper use of an inverse failure rate measure
I submit we are rarely interested in the time till the bulk of devices fail; rather, we are interested in the time to the first failures or the time until some small percentage fail.
I suggest that very few devices or systems actually fail with a constant hazard rate. If your product does, prove it without grand waves of assumptions.
I have found that engineers, scientists, vendors, customers, and managers regularly misunderstand MTBF and how to properly use an MTBF value.
So back to the opening statement: it is possible, though not likely, that you will find an occasion to effectively use MTBF as a metric. Instead use reliability: the probability of successful operation over a stated period with stated conditions and definition of success. 98% of office printers will function for 5 years without failure in an office…. Pretty clear. Sure we can fully define the function(s) and environment, and we need to do that anyway.
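To see how far apart an MTBF goal and a reliability goal can be, here is a rough sketch. It assumes an exponential time to failure, which is itself rarely justified, and the function names are made up for illustration.

```python
import math


def reliability(t, mtbf):
    """Probability of surviving to time t (exponential assumption)."""
    return math.exp(-t / mtbf)


def mtbf_for_reliability(target, t):
    """MTBF needed so that a fraction `target` survives to time t
    (exponential assumption)."""
    return -t / math.log(target)


# A 5 year MTBF leaves only about 37% of units working at 5 years.
print(round(reliability(5, 5), 3))  # 0.368

# 98% surviving 5 years corresponds to an MTBF of roughly 247.5 years.
required = mtbf_for_reliability(0.98, 5)
print(round(required, 1))  # 247.5
```

An MTBF of about 247 years says nothing intuitive about a printer, while "98% survive 5 years" says exactly what we care about.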
Mark Powell says
Fred,
You had three bullets it seems for when you could use MTBF. I could not follow your reasoning to come up with these.
Why would an interest in when half would fail have anything to do with MTBF?
Your second bullet, which of course defines a Poisson process which leads to an exponential failure model (that you caveat very appropriately later – not possible in this universe due to the second law of thermodynamics), the only purpose is to estimate the parameter.
And your third, if the team understands the proper use of an inverse failure rate, they won’t use it.
MTBF is an average, and using an average in a decision is guaranteed to produce an irrational decision, so I am still looking for a reason to justify the electrons to compute it.
Mark Powell
Fred Schenkelberg says
Yeah, Mark, I’m reaching a bit with this one – none of the arguments as you notice work well.
Best to just avoid MTBF.
Cheers,
Fred
Don Doan says
As a long time PdM expert and Reliability Engineer, here is my blasphemy:
MTBF is equivalent to Overall Vibration – it is a trending tool with little or no analyzable data, but it is a great trigger to start analysis.
Fred Schenkelberg says
Hi Don,
And, you can do better using the same data.
Cheers,
Fred
Linda Cottrell says
While I understand the points, my biggest challenge with MTBF is fighting the misunderstanding and misapplication.
If I go ahead and use it anyway, no one will ask me why, and I lose an opportunity to educate.
Fred Schenkelberg says
Well said, Linda. Cheers, Fred
Mark Powell says
Linda,
Not sure I understand.
What I observe is that nobody at all questions the use of MTBF. For me, bringing up the flaws of MTBF when nobody is questioning it is seen as just being a troublemaker (vice educator).
Now if folks were questioning it, then I might really have an opportunity to educate.
Mark Powell