There are occasions when we have either field or test data that includes the duration of operation and whether or not the unit failed. Say we have 10 large motors. For the sake of argument, the test ran each motor for 1,000 hours, and when a motor failed it was repaired quickly and returned to the test. There were 3 failures.
Sadly, this is all we need to calculate an estimate for the motor MTBF.
MTBF is the total time divided by the number of failures. In this case, 10 motors times 1,000 hours gives a total time of 10,000 hours. Divide 10,000 by the three failures to find an MTBF of 3,333 hours.
What I find interesting is that I would find the same MTBF value using 10,000 motors each run for one hour, or one motor run for 10,000 hours. If in each case there were three failures, we would find an MTBF of 3,333 hours.
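As a minimal sketch of the arithmetic above, the three scenarios can be compared in a few lines of Python. The function name `mtbf_estimate` is only an illustrative label, not something from a library.

```python
def mtbf_estimate(units, hours_per_unit, failures):
    """Point estimate of MTBF: total operating time / number of failures."""
    if failures <= 0:
        raise ValueError("need at least one failure for a point estimate")
    total_hours = units * hours_per_unit
    return total_hours / failures


# The three scenarios from the text, each assumed to see three failures.
scenarios = {
    "10 motors x 1,000 h": (10, 1_000),
    "10,000 motors x 1 h": (10_000, 1),
    "1 motor x 10,000 h": (1, 10_000),
}

for label, (units, hours) in scenarios.items():
    print(f"{label}: MTBF = {mtbf_estimate(units, hours, 3):,.0f} h")
```

All three lines report the same value, which is the point: the calculation cannot tell a brief test of many motors from a long test of one.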
Now that works perfectly well when there is a constant failure rate, meaning there is an equal chance of failure in each hour of operation. Old motors would have the same chance of failure as brand new motors.
Of course, you know why I chose motors for the example: to reinforce the idea that the chance of failure is not always constant. Be sure to think about the failure mechanisms before using MTBF (or MTTF). If the failure rate is time dependent, then this simple calculation is not useful.
I used this example during a class last week and it seemed to spark a good discussion. How have you explained MTBF to others? Any suggestions on how to best describe what the MTBF value really means, or doesn’t mean?
Bill Chancellor says
In my case, trying to calculate reliability and/or MTBF for a subsystem is very frustrating. The advertised reliability of many components is based on the OEM’s projection or engineering analysis because no one wants to spend the dollars and time required to truly test the component thoroughly. Trying to validate their projection or engineering analysis at the system level usually means that my only data is based on one or two failures for a series of tests that accumulate a total of 30 to 40 hours of operation. For simplicity, I am usually forced to ignore the conditions of testing (e.g., temperature, altitude, load). Components that require a high level of reliability (R) and confidence (C) need several hundred hours of operation to fully demonstrate R&C.
Fred Schenkelberg says
Hi Bill, I feel your pain. Vendors have to deal with many operating conditions and use cases. They tend to do what is requested by the majority. Unfortunately, so many seem happy with very poor information that those who need and request better information are thwarted. I suggest we continue to ask for meaningful information, educate our peers to do likewise, and when all else fails, do the testing ourselves.
Cheers,
Fred
Stefan Verschoor says
Hi Fred,
At the moment I am doing an internship at a company concerning MTBF.
My research keeps forcing the same question into my mind: is it even wise to help them calculate their MTBF? The failure rate is not constant at all, as they mainly produce flowmeters.
For me, MTBF is not an estimate of how long an asset will last at all; it says more about the improvement or decline in the reliability of an asset or system.
Your topic intrigued me, as I am starting to agree very much with your doubts about whether it is wise to use MTBF at all.
I would be very excited if you could tell me about your experience with other, similar and preferably more representative metrics.
Greetings,
Stefan
Fred Schenkelberg says
Hi Stefan, you should be concerned, as MTBF is most likely misleading and does not represent the actual failure rate at any particular time of interest.
Instead, use reliability: the probability of success over a specific duration. For example, 98% reliable over 1 year.
Use multiple points in time or, better, if you have the time to failure information, fit a Weibull distribution (or another appropriate distribution) and get the entire picture of the probability of failure over time in a CDF plot.
Cheers,
Fred
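As a rough illustration of why a single MTBF value can hide so much, the sketch below compares three Weibull life distributions that share the same mean life but have different shape parameters. The 50,000 hour mean life and the shape values are hypothetical numbers chosen only for illustration, and the function names are not from any library.

```python
import math


def weibull_reliability(t, beta, eta):
    """Probability of surviving to time t for a Weibull(beta, eta) life distribution."""
    return math.exp(-((t / eta) ** beta))


def eta_for_mean(mean_life, beta):
    """Scale parameter that gives the requested mean life for a given shape."""
    return mean_life / math.gamma(1.0 + 1.0 / beta)


MEAN_LIFE_H = 50_000.0  # hypothetical mean life, identical in every case
ONE_YEAR_H = 8_760.0    # one year of continuous operation

# beta < 1: decreasing failure rate, beta = 1: constant, beta > 1: increasing
for beta in (0.5, 1.0, 3.0):
    eta = eta_for_mean(MEAN_LIFE_H, beta)
    r = weibull_reliability(ONE_YEAR_H, beta, eta)
    print(f"beta={beta}: eta={eta:,.0f} h, R(1 year)={r:.3f}")
```

The three cases would all be reported as the same MTBF, yet the probability of surviving the first year differs noticeably between them.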
Stefan Verschoor says
Hi Fred,
Thank you a lot! I will have a look into that. This is very useful and fun information for me to work with! Thanks again.
Greetings,
Stefan
Rafal says
Hi Fred, I appreciate your site a lot. One can understand how important its mission is by searching the internet for examples of MTTF and FIT calculations.
I’ve started reading about hazard rate, failure rate, MTTF, etc., but can’t find any advice on how to interpret test data to obtain a failure rate. Let me give an example:
I’m testing 10 devices (nonrepairable system) over e.g. 400 hrs.
Recorded failure times in hrs: {30, 45, 60, 90, 120, 180, 240, 300}, 2 devices survived.
Can I say my failure rate is 8/(30+45+60+90+120+180+240+300+2*400) and the MTTF is the reciprocal of the failure rate? Or maybe I shouldn’t even try to calculate a failure rate from this data? Does this method cause any problems for the reliability calculation? I understand that the same mean value yields a different reliability depending on the distribution, but please advise how to process this data correctly.
I appreciate your feedback in advance.
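The calculation proposed in the question can be written out as below. It is only a sketch of the arithmetic, and the result is meaningful as a failure rate only if the hazard rate is constant, which this data alone does not establish. The variable names are illustrative.

```python
# Data from the question: 10 non-repairable devices on a 400 hour test.
failure_times_h = [30, 45, 60, 90, 120, 180, 240, 300]
survivors = 2          # devices still working at the end of the test
test_length_h = 400

# Total time on test: each failed unit contributes its failure time,
# each survivor contributes the full test length.
total_time_h = sum(failure_times_h) + survivors * test_length_h

failure_rate = len(failure_times_h) / total_time_h  # failures per hour
mttf_h = 1 / failure_rate                           # reciprocal of the rate

print(f"total time on test: {total_time_h} h")
print(f"failure rate: {failure_rate:.5f} per hour")
print(f"MTTF: {mttf_h:.1f} h")
```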
Fred Schenkelberg says
Hi Rafal, thanks for the note and example problem. While you can estimate the failure rate and MTTF as described, it is not all that useful in most cases.
Instead, use a Weibull analysis (with so few data points, Weibull is often a great starting point as it is versatile). This will provide the probability of failure at various points in time.
I’m traveling at the moment and have limited access, so will follow up later when I can either work out the problem for you, or point to a better reference and example.
Cheers,
Fred
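One way a Weibull analysis of the data in the question might look is sketched below. This is not the follow-up mentioned in the reply, only an illustration: a pure-Python maximum likelihood fit of a two-parameter Weibull with the two survivors treated as right-censored at 400 hours. The function name `fit_weibull_mle` and the search bracket for the shape are assumptions made for this example.

```python
import math


def fit_weibull_mle(failures, censored):
    """Two-parameter Weibull MLE with right-censored observations.

    Solves the profile-likelihood equation for the shape (beta) by
    bisection, then computes the scale (eta) in closed form.
    """
    all_times = list(failures) + list(censored)
    r = len(failures)
    mean_log_fail = sum(math.log(t) for t in failures) / r

    def g(beta):
        # Derivative of the profile log-likelihood; increasing in beta.
        s0 = sum(t ** beta for t in all_times)
        s1 = sum(t ** beta * math.log(t) for t in all_times)
        return s1 / s0 - 1.0 / beta - mean_log_fail

    lo, hi = 0.05, 20.0  # assumed bracket for the shape parameter
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    eta = (sum(t ** beta for t in all_times) / r) ** (1.0 / beta)
    return beta, eta


failures = [30, 45, 60, 90, 120, 180, 240, 300]  # hours, from the question
censored = [400, 400]                            # survivors at end of test

beta, eta = fit_weibull_mle(failures, censored)
print(f"beta = {beta:.2f}, eta = {eta:.0f} h")

# Estimated probability of failure (CDF) at a few points in time.
for t in (50, 100, 200, 400):
    cdf = 1 - math.exp(-((t / eta) ** beta))
    print(f"F({t} h) = {cdf:.2f}")
```

With only eight failures the uncertainty on both parameters is wide, so confidence bounds would be needed before drawing conclusions from the fitted values.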
Rafal says
Hi Fred,
Thank you for your interest. Meanwhile, I have been struggling to understand the meaning of failure rate and how to interpret it. As a supplement to my previous question: if the failure rate is 0.004 failures/hr, or in other words 4 failures per 1000 hrs, does that mean 4 units will fail every 1000 hrs, assuming an exponential distribution and a constant hazard rate? Does it mean that if I had 4 components I could expect none of them to be functioning after 1000 hrs, but also that if I had 1000 components, 4 of them would fail within 1000 hrs and 996 would remain healthy into the next 1000 hrs? I read a failure rate example somewhere: FR = 0.1 means 10% of the population fails every time step, which plots as an exponential curve. But in this case, with a specific number of devices, e.g. 1000 components, the count drops linearly to 0 after 250,000 hrs (1000 * 250), where 250 hrs is the mean time to failure. Maybe it shouldn’t be understood so straightforwardly? Maybe it is an average value that follows the assumed exponential pdf, so we can say that on average 4 fail per 1000 hrs, but in our case 37% will fail in the first 250 hrs and the remaining 63% within the next 250,000 - 250 = 249,750 hrs. If this is correct I’m home; if not, I’m lost…
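Under the exponential (constant hazard rate) assumption stated in the question, the fraction of a population expected to have failed by time t is 1 - exp(-λt). A small sketch with the 0.004 failures/hr figure from the question suggests that about 63% (not 37%) would have failed by 250 hrs, and that the population decays exponentially rather than linearly. The 1,000 unit population is the one used in the question; the function name is illustrative.

```python
import math

failure_rate = 0.004          # failures per hour, from the question
mttf_h = 1 / failure_rate     # 250 hours under the exponential assumption
population = 1000


def fraction_failed(t_hours):
    """Expected fraction failed by t_hours for a constant failure rate."""
    return 1 - math.exp(-failure_rate * t_hours)


for t in (250, 500, 1000):
    failed = population * fraction_failed(t)
    print(f"by {t} h: about {failed:.0f} of {population} failed")
```

The constant quantity is the chance that a still-working unit fails in the next hour, so the number of failures per hour shrinks as the surviving population shrinks.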
Jude Nnanna says
Nice work here Fred!!
What is the best formula to use in calculating MTBF? Average time between failures, i.e., total operating time / number of failures, or total time / number of failures?
Fred Schenkelberg says
There are many definitions of what is and is not counted. If I cannot persuade people to use something else, I tend to use the total operating time. If you use total time, you are including the repair or replacement time. It is best to use availability directly if you are interested in uptime.
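The difference between the two denominators, and the availability figure mentioned above, can be sketched as follows. The operating hours and failure count reuse the motor example from the article; the 8 hours per repair is a hypothetical number added only to make the difference visible.

```python
operating_h = 10 * 1_000   # 10 motors x 1,000 h of operation (article example)
failures = 3
repair_h_each = 8.0        # hypothetical repair time per failure

repair_h = failures * repair_h_each
total_h = operating_h + repair_h   # operating time plus time under repair

mtbf_operating = operating_h / failures   # counts only operating time
mtbf_total = total_h / failures           # also counts repair time
availability = operating_h / total_h      # fraction of time in operation

print(f"MTBF (operating time): {mtbf_operating:.1f} h")
print(f"MTBF (total time):     {mtbf_total:.1f} h")
print(f"availability:          {availability:.4f}")
```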
Fred Schenkelberg says
I would use operating time, yet I would also argue that you would be better off not calculating MTBF at all.
Vivek Panchal says
Dear Fred,
I work at a company where we are making a device that uses several electronic components. The devices are in their trial phase and I want to calculate the MTBF. Since it has been no more than 5-6 months since our first device was made, which method of calculating MTBF should I use: estimation or prediction? I have downloaded software for MTBF prediction by ALD. If I want to calculate the MTBF by estimation, how long do I need to collect data to get a good idea of the failure rate?
Fred Schenkelberg says
Hi Vivek,
Pretty much doesn’t matter which method you use to find MTBF, as it is probably not useful for what you want to accomplish.
What is it you need to understand about your device? How long will it last in some environment? How long, under what conditions, etc? Or do you need to provide feedback to your team on the demonstrated reliability to date? Or, do you need to find elements of the design to make it more robust or reliable?
What is it you are trying to accomplish? What question is your team trying to answer?
Real data is always better than a tabulated set of failure rates (ALD or vendor datasets included). If your device is expected to operate for 5 to 6 months and you have some number of devices running (maybe with a few failures, maybe not), you might be able to sort out a life estimate. If the device should run for 20 years, then unless you’ve been running accelerated life tests focused on the dominant failure mechanisms, you have very little data to help you estimate the reliability at 20 years.
Parts count predictions are not useful to estimate the actual field reliability performance. Using just constant failure rates or MTBF values, you learn very little about how it will perform for your customers – most parts count based prediction methods even state they are not useful to estimate field performance. They can be useful to estimate relative changes in design or use conditions, and even that is suspect in many cases.
If the desire is to determine if ‘the design is in the ballpark or close enough on reliability’ then predictions are not the right approach. Instead, determine what is likely to fail. Use a physics of failure approach and, based on what is likely to fail and an understanding of your use conditions, estimate the time to failure information you need.
Bottom line, do not focus on calculating MTBF. Instead, understand the question(s) your team needs answered and work to answer those using reliability, availability, or other meaningful metrics.
Cheers,
Fred
Lisa McCollum says
I have been given this “Awesome” assignment of calculating the MTBF for roughly 4000 producing oil wells. Let’s say we are going to start the calculations at 1-1-2000. Obviously some of these wells have been producing for decades and have anywhere between 3 and 9 failures over the life of the well. I am not sure how to handle this on a monthly chart. The active well count will vary each month as wells are continually drilled or plugged. I am using an application called Spotfire and used a dense ranking function to number the workover dates for each well, and I am calculating the average days between each failure, but I am stuck trying to average the averages and plot it monthly. Please help.
Fred Schenkelberg says
Hi Lisa,
First off, as you may know, my advice would be to avoid using MTBF in any shape or form.
Given it is an assignment, you should ask what the calculated values are used for and what decisions or actions will come from the results. You may find that the interest is in performance over a set time frame, or by well, or some other way to parse the data.
The calculation for MTBF is really just the total operating time divided by the number of failures. Sure, the numerator and denominator will change as wells come on or off line, yet their operating time and number of failures still count.
You, and likely your team, would be better served by calculating mean cumulative functions for each well and overall. It’s a non-parametric plot and function that will help you determine the current status of the group of wells, plus the trend in the rate of failures (getting better or worse with current operation and maintenance practices).
Hope that helps, let me know if you have more questions.
Cheers,
Fred
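A minimal sketch of a non-parametric mean cumulative function (MCF) estimate for a group of repairable units is shown below. The three wells and their failure ages are hypothetical, ages are measured in days from each well's own start, and repair time is assumed to be negligible. The function name is illustrative and not taken from Spotfire or any library.

```python
def mean_cumulative_function(units):
    """Non-parametric MCF estimate for repairable units.

    units: list of (observed_age, [failure ages]) per well, with ages
    measured from each well's own start of production.
    Returns a list of (age, mcf) pairs, one per failure event.
    """
    events = sorted(age for _, fails in units for age in fails)
    mcf, curve = 0.0, []
    for age in events:
        # Wells still under observation at this age are "at risk".
        at_risk = sum(1 for observed, _ in units if observed >= age)
        mcf += 1.0 / at_risk
        curve.append((age, mcf))
    return curve


# Hypothetical wells: (days observed so far, failure ages in days)
wells = [
    (3000, [400, 1500, 2600]),
    (2000, [900, 1800]),
    (1000, [700]),
]

curve = mean_cumulative_function(wells)
for age, mcf in curve:
    print(f"age {age:>4} days: mean cumulative failures per well = {mcf:.2f}")
```

Plotted against age, a curve that bends upward suggests the failure rate is increasing, while one that flattens suggests improvement. Wells that start or stop production at different dates are handled through the at-risk count rather than by averaging averages.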
Keith Megow says
Hello Fred,
We typically run single complex systems in a reliability growth test mode. We are always in a position of determining the “Initial” MTBF (MI) as the ratio of final to initial MTBF (MF/MI) is limited. Average is about 4.0.
So I am just wondering if you have any opinion on how long we should test to determine the initial reliability?
As a concrete example, use a required MTBF of 1000 (FMECA derived). I have seen literature of expected initial MTBF on the order of 25% of the required. This seems to say I should test long enough to prove at least 250 MTBF.
Thanks,
Keith
Fred Schenkelberg says
Hi Keith, I think you’re going about this in a misguided fashion. The reliability is what it is in the design and is in part reflected by prototypes or early production. Guessing or estimating from parts count, FMEA, or other paper studies are all very poor methods to determine the reliability. BTW, MTBF is not reliability; it is a misleading and misunderstood inverse failure rate.
The best thing you can do is avoid using MTBF at every turn.
For estimating reliability, a test based on estimating MTBF will always provide a faulty value. Instead, measure and track the probability of successful operation over some duration. For repairable systems, you can use mean cumulative functions.
If you need to estimate reliability, how well do you need to know the value? What about constraints, such as time, samples, etc.?
How about using a reliability block diagram and using that to estimate availability over some duration, while focusing on non-repairable reliability estimates (from historical or testing data)?
Bottom line, stop using MTBF and your work will become easier and clearer for all involved.
Cheers,
Fred