Reliability Growth and MTBF
Really? Is MTBF the only way to work with reliability growth?
I received this question via LinkedIn (feel free to connect with me there) and hadn’t given it much thought before. I am familiar with a few growth models and have regularly seen MTBF used with them. Thus I had discounted the modeling as an approach of little interest to me or my clients.
MTBF is the inverse of the average failure rate, yet in many cases we really want to know the first or tenth percentile of time to failure. Measuring and tracking the average time to failure provides little information about the onset of the first few failures.
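To make the point concrete, here is a small Python sketch. It assumes each population follows a two-parameter Weibull distribution, and the parameters are hypothetical, chosen only for illustration. All three populations share the same 10,000 h mean life, yet their B1 and B10 lives (the times by which 1% and 10% of units have failed) differ enormously.

```python
import math

def scale_for_mean(mean: float, shape: float) -> float:
    """Weibull scale (eta) that yields the requested mean life for a given shape (beta)."""
    return mean / math.gamma(1.0 + 1.0 / shape)

def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Probability a unit survives past time t: R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / scale) ** shape))

def weibull_percentile(p: float, shape: float, scale: float) -> float:
    """Time by which a fraction p of units is expected to have failed (B-life)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

MEAN_LIFE = 10_000.0  # hours; hypothetical, identical for every population below

# Shapes: early-life failures, constant rate (exponential), wear-out
for shape in (0.5, 1.0, 3.0):
    eta = scale_for_mean(MEAN_LIFE, shape)
    b1 = weibull_percentile(0.01, shape, eta)
    b10 = weibull_percentile(0.10, shape, eta)
    surviving = weibull_reliability(MEAN_LIFE, shape, eta)
    print(f"beta={shape}: B1={b1:,.1f} h, B10={b10:,.1f} h, R(mean)={surviving:.2f}")
```

With beta = 1 (the exponential case behind MTBF), R(mean) is e^-1, about 0.37, so roughly 63% of units have already failed by the time the MTBF is reached.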
Reliability Growth Models
I did a quick check of common reliability growth models and found a few in the NIST Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/apr/section1/apr19.htm).
The Homogeneous Poisson Process (HPP) applies when the failure rate is constant over the time period of interest. It relies on the exponential distribution and assumes a stable, random arrival of failures, which in my experience is rarely true. It’s a convenient assumption because it makes the math a lot simpler, yet it provides only a crude model and poor results.
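One way to check whether the HPP assumption is reasonable before relying on it is a trend test. The sketch below implements the Laplace test for a test stopped at a fixed time T; it is one common choice among several, and the failure times are hypothetical. Under an HPP, U is approximately standard normal, so a strongly negative U suggests failures are thinning out (growth) and a strongly positive U suggests deterioration.

```python
import math
from statistics import NormalDist

def laplace_trend_test(failure_times, test_end):
    """Laplace trend test for a time-terminated test.

    Under an HPP, failure times given n are uniform on (0, T], so U is
    approximately standard normal. Returns (U, two-sided p-value).
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("need at least one failure")
    mean_time = sum(failure_times) / n
    u = (mean_time - test_end / 2.0) / (test_end * math.sqrt(1.0 / (12.0 * n)))
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(u)))
    return u, p_two_sided

# Hypothetical failure times (hours) from a 1,000 h development test
times = [12, 30, 55, 90, 140, 210, 330, 520, 800]
u, p = laplace_trend_test(times, 1000.0)
print(f"U = {u:.2f}, two-sided p = {p:.3f}")
```

A small p-value here says the constant-rate assumption does not fit these data, which is exactly the situation where an HPP model gives poor results.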
The Non-Homogeneous Poisson Process (NHPP) Power Law and Exponential Law models provide information based on the cumulative number of failures over time. These models rely on the notion that any system has a finite number of design errors which, once resolved, leave a system with HPP behavior.
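Here is a minimal sketch of the Power Law model only (not the Exponential Law), fitted by maximum likelihood in the Crow-AMSAA form for a test stopped at a fixed time T. The failure times are hypothetical. The estimates are beta = n / sum(ln(T / t_i)) and lambda = n / T^beta, with expected cumulative failures E[N(t)] = lambda * t^beta; beta below 1 means failures are arriving less often. Some references apply a small-sample bias correction to beta; this sketch uses the plain MLE and reports the current failure intensity rather than inverting it to an MTBF.

```python
import math

def power_law_mle(failure_times, test_end):
    """MLE of the power-law NHPP E[N(t)] = lam * t**beta for a test stopped at test_end.

    Assumes every failure time satisfies 0 < t <= test_end.
    """
    n = len(failure_times)
    beta = n / sum(math.log(test_end / t) for t in failure_times)
    lam = n / test_end ** beta
    return lam, beta

def failure_intensity(t, lam, beta):
    """Instantaneous failure intensity rho(t) = lam * beta * t**(beta - 1)."""
    return lam * beta * t ** (beta - 1.0)

times = [12, 30, 55, 90, 140, 210, 330, 520, 800]  # hypothetical, hours
T = 1000.0
lam, beta = power_law_mle(times, T)
print(f"beta = {beta:.2f} (beta < 1 suggests growth)")
print(f"expected failures by {T:.0f} h: {lam * T ** beta:.1f}")
print(f"current intensity: {failure_intensity(T, lam, beta):.5f} failures/h")
```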
The Duane plot provides a graphical means to show cumulative failures over time. When the arrival of failures slows, the slope of the curve decreases and the curve bends over. This provides a means to estimate the final failure rate (unfortunately, an average).
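Duane's postulate is usually stated as a straight line on log-log axes of cumulative failure rate, N(t)/t = K * t^(-alpha), against cumulative test time; equivalently, cumulative failures N(t) = K * t^(1 - alpha) bend over on linear axes as alpha grows. The sketch below fits that line by ordinary least squares on the logs; the failure times are hypothetical and the function name is mine, not from any library.

```python
import math

def duane_fit(failure_times):
    """Least-squares line through log(cumulative failure rate) vs log(cumulative time).

    The i-th failure at time t_i gives a cumulative rate of i / t_i.
    Returns (K, alpha); alpha > 0 means failures are arriving less often.
    """
    xs = [math.log(t) for t in failure_times]
    ys = [math.log(i / t) for i, t in enumerate(failure_times, start=1)]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    k = math.exp(y_bar - slope * x_bar)
    return k, -slope

times = [12, 30, 55, 90, 140, 210, 330, 520, 800]  # hypothetical, hours
K, alpha = duane_fit(times)
print(f"K = {K:.3f}, growth rate alpha = {alpha:.2f}")
```

Differentiating N(t) = K * t^(1 - alpha) gives an instantaneous rate of (1 - alpha) * K * t^(-alpha), which is how a current rate is usually read off the fitted line.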
What I use instead
Given my dislike of all things MTBF, I’ve not used these models to estimate MTBF. Instead, I stay with the Duane plot and graphically track whether the team is finding and fixing enough faults in the design.
I also tend to use reliability block diagrams (RBDs), with each block modeled by the appropriate reliability distribution. For a series model, all we need to do is multiply the reliability values of the blocks at time t (say the warranty period or mission time) to estimate the system reliability at time t.
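A series RBD takes only a few lines. This sketch assumes each block follows a Weibull distribution with independent failures; the block names and parameters are hypothetical placeholders, not real component data.

```python
import math

def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/eta)^beta) for a two-parameter Weibull block."""
    return math.exp(-((t / scale) ** shape))

# Hypothetical blocks: name -> (Weibull shape beta, scale eta in hours)
blocks = {
    "power supply": (1.2, 150_000),
    "fan": (2.5, 60_000),
    "controller board": (0.9, 400_000),
}

def series_reliability(t, blocks):
    """Series RBD: the system survives only if every block survives."""
    r = 1.0
    for shape, scale in blocks.values():
        r *= weibull_reliability(t, shape, scale)
    return r

warranty = 8_760  # one year of continuous operation, in hours
print(f"R_system({warranty} h) = {series_reliability(warranty, blocks):.4f}")
```

Because each block keeps its own distribution, swapping in a better-fitting model for one block changes only that entry, with no exponential assumption anywhere.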
For complex systems with some redundancy, the RBD does get a bit more complicated. For very complex systems with degraded modes of operation or significant repair times, use Petri nets or Markov models to model them properly.
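The redundancy case is still manageable for simple structures. This sketch assumes independent failures and no repair (repair and degraded modes are where Markov models or Petri nets come in); the block reliabilities at the chosen time t are hypothetical values.

```python
import math

def series(*rs):
    """Series blocks: multiply the reliabilities."""
    return math.prod(rs)

def parallel(*rs):
    """Active redundancy: the group fails only if every branch fails."""
    q = 1.0
    for r in rs:
        q *= 1.0 - r
    return 1.0 - q

def k_out_of_n(k, n, r):
    """At least k of n identical, independent units must survive."""
    return sum(math.comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))

# Hypothetical block reliabilities at the same mission time t
r_fan, r_psu, r_board = 0.90, 0.95, 0.99

# Two redundant fans in parallel, in series with the power supply and board
print(f"{series(parallel(r_fan, r_fan), r_psu, r_board):.4f}")
# Three power supplies, any two of which can carry the load
print(f"{k_out_of_n(2, 3, r_psu):.5f}")
```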
In the vast majority of cases a simple RBD is sufficient to capture and understand the reliability of a system. This allows the team to focus on improving weak areas and to reduce uncertainty through improved reliability estimates. An RBD neither requires nor assumes an exponential distribution, and the math is easy enough to manage, often even in your favorite spreadsheet.
Summary
Reliability growth starts with a model of the estimated number of failures over a time period. Testing then provides a value for that estimate. This does not require the use of MTBF, so instead of assuming a constant failure rate, focus on the failure mechanisms and use a simple RBD to build a system model. Reliability growth is the result of identifying areas for improvement and then making the improvements. In my experience, an RBD provides a great way to show the team where to focus improvements.
Roger H says
Hi Fred, the debate on MT etc. is one that interests many in reliability. I have worked on a number of programmes where reliability growth has been required. Duane and AMSAA methods were used as a means of demonstrating progress towards the overall target, bearing in mind that these rely on point estimates and on growth as a means of showing improvement in the form of a reduced rate of failure. At the end of the day I had doubts, as others did, about the use of MT etc., but recognised it was the chosen method. The real aim, improving reliability, was achieved by examining and eliminating or controlling failure mechanisms, creating fixes that were effective, and retesting to demonstrate that success. I found that CUSUM allowed me to examine short-term trends and, in particular, to identify failures during the early stages after maintenance and failures resulting from poor quality, leading to a step change from 29% to a retest with no failures! Keep up the good work.
Fred Schenkelberg says
Hi Roger,
Thanks for the comment. Yes, you and others should doubt the value and usefulness of MT. Focusing on failure mechanisms is a great plan and seems to have worked well for you.
Cheers,
Fred
Pete Stuart says
Hi Roger, that is a good point to raise. In the modern world of iterative design development processes, alternate models have been developed that seek to address the issue you have highlighted regarding failure mode surfacing. There is certainly not a single universal model that is ideal for all situations; however, the AMSAA Planning Model based on Projection Methodology (PM2) has been the most useful and versatile model for the developmental design projects I have been involved with. I couldn’t agree more with your comments regarding failure mode rectification either. Reliability growth models are indeed just one part of a much broader set of tools and techniques that need to be used in order to realise reliability growth. Perhaps that is a good topic for an article…
Sarath Jayatilleka says
Hi Fred,
J. T. Duane never used the term MTBF in his classic 1964 paper. But later, many famous reliability texts tied MTBF directly to the Duane model. That does not do justice to J. T. Duane or his model, nor to the followers of his model.
– Sarath Jayatilleka
Fred Schenkelberg says
Thanks for the note. So, what can we do to help get back to the Duane concept in a meaningful way?
Roger H says
Hi again. Duane's simple best fit through the points relied on all of the faults being up front, with the test-fix-retest principle reducing faults through a reduced frequency and creating an increase in the slope of the graph, but this is purely y = mx + c. The point estimate is factored by the growth rate to provide an instantaneous point estimate. If the design is immature or infested by quality issues, then the slope will be shallow and the equipment unreliable, unaffordable and less pleasing to the marketplace. These tools are only one way of achieving the goal; dedication to quality designs and production processes, supported by diligent rectification of faults arising during the growth and demonstration programme, is the key. Regards
Roger H says
Thanks Pete, like minds, hey! Applied correctly with a balanced investment, these techniques can make massive savings across the life cycle and cultivate a conjoined stakeholder enterprise with common goals. Cheers