It Started With a Question
The idea is to eradicate MTBF from common use. The first question was:
How do you explain what MTBF is and isn’t to someone who misunderstands MTBF?
How does your equipment fail? How do you plan for spares? Do you use your existing failure data to help refine your maintenance planning?
Given the title of the article, these questions are reasonable. As either a plant reliability or maintenance engineer, do you also rely on gut feel to refine your estimates? If you rely on MTBF or similar metrics, you most likely do not trust the data to provide useful answers.
Guest post by Andrew Rowland, Executive Consultant, ReliaQual Associates, LLC (www.reliaqual.com), in response to the ‘Reliability Predictions’ article.
Hi Fred,
In the section on predictions you mention Dr. Box’s oft-quoted statement that “…all models are wrong, but some are useful.” In the same book Dr. Box also wrote, “Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”
Reliability predictions are intended to be used as risk and resource management tools. For example, a prediction can be used to:
None of these require that the model provide an accurate prediction of field reliability. The absolute values aren’t important for any of the above tasks; the relative values are. This is true whether you express the result as a hazard rate/MTBF or as a reliability. Handbook methods provide a common basis for calculating these relative values; a standard, as it were. The model is wrong, but if used properly it can be useful.
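To make the relative-value argument concrete, here is a minimal Python sketch of a parts-count style comparison between two hypothetical designs. The part names, quantities, and failure rates are invented for illustration, and the simple sum omits the quality and environment factors a real handbook method applies. It shows that if the handbook rates are off by the same factor for every part, the absolute totals move but the ratio between the designs does not.

```python
from dataclasses import dataclass


@dataclass
class PartLine:
    name: str
    quantity: int
    failure_rate: float  # hypothetical generic rate, failures per 10^6 hours


def parts_count_total(parts, bias=1.0):
    """Sum quantity * failure rate over a parts list.

    `bias` multiplies every rate by the same factor, standing in for a
    handbook that is uniformly optimistic (< 1) or pessimistic (> 1).
    """
    return sum(p.quantity * p.failure_rate * bias for p in parts)


design_a = [PartLine("capacitor", 40, 0.02), PartLine("resistor", 60, 0.005),
            PartLine("connector", 4, 0.3)]
design_b = [PartLine("capacitor", 25, 0.02), PartLine("resistor", 50, 0.005),
            PartLine("connector", 2, 0.3)]

for bias in (0.2, 1.0, 5.0):
    a = parts_count_total(design_a, bias)
    b = parts_count_total(design_b, bias)
    # The totals scale with the bias; the A/B ratio stays put.
    print(f"bias={bias}: A={a:.2f}, B={b:.2f}, A/B={a / b:.3f}")
```

A bias that differs by part type would change the ratio, so the comparison is only as good as the handbook’s internal consistency.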
Think about the use of RPNs in certain FMEAs. The absolute value of the RPN is meaningless; the relative value is what’s important. For sure, an RPN of 600 is high, unless every other RPN is greater than 600. Similarly, an RPN of 100 isn’t very large, unless every other RPN is less than 100. The RPN is wrong as a model of risk, but it can be useful.
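A short sketch of that ranking, assuming the common scoring where the RPN is severity × occurrence × detection, each rated 1 to 10. The failure modes and scores are hypothetical. Because the values only mean something relative to each other, the sketch sorts them and reports each as a fraction of the highest.

```python
# Hypothetical failure modes scored as (severity, occurrence, detection), each 1-10.
failure_modes = {
    "seal leak": (7, 4, 5),
    "bearing wear": (6, 6, 3),
    "sensor drift": (4, 3, 8),
}


def rpn(severity, occurrence, detection):
    """Risk priority number: the product of the three ratings."""
    return severity * occurrence * detection


ranked = sorted(failure_modes, key=lambda mode: rpn(*failure_modes[mode]), reverse=True)
top = rpn(*failure_modes[ranked[0]])
for mode in ranked:
    score = rpn(*failure_modes[mode])
    # The share of the highest RPN says more than the raw number does.
    print(f"{mode:12s} RPN={score:4d} relative={score / top:.2f}")
```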
I once worked at an industrial facility where the engineers would dump a load of process data into a spreadsheet. Then they would fit a polynomial trend line to the raw data. They would increase the order of the polynomial until R² = 1 or they reached the maximum order supported by the spreadsheet software. The engineers and management used these “models” to support all sorts of decision making. They were often frustrated because they seemed to be dealing with the same problems over and over. The problem wasn’t with the method; it was with the organization’s misunderstanding, and subsequent misuse, of regression and model building. In this case, the model was so wrong it wasn’t just useless, it was often a detriment.
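The trend-line trap is easy to reproduce. This Python sketch (using NumPy; the data are made up) fits eight noisy points from an underlying straight line y = 2x with a first-order and a seventh-order polynomial. With eight points, the seventh-order fit passes through every one, so its R² is essentially 1, yet its prediction one step beyond the data is nowhere near the true value of 16.

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.arange(8, dtype=float)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3])  # fixed, made-up noise
y = 2.0 * x + noise  # the underlying process is linear: y = 2x


def r_squared(model, x, y):
    """Coefficient of determination of a fitted model on (x, y)."""
    ss_res = np.sum((y - model(x)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


linear = Polynomial.fit(x, y, deg=1)
high = Polynomial.fit(x, y, deg=7)  # 8 points, degree 7: interpolates every point

print(f"R^2 linear = {r_squared(linear, x, y):.6f}, degree 7 = {r_squared(high, x, y):.6f}")
# True value at x = 8 is 16; the linear fit lands close, the degree-7 fit goes negative.
print(f"prediction at x=8: linear = {linear(8.0):.2f}, degree 7 = {high(8.0):.2f}")
```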
Reliability predictions often get bad press. In my experience, this is mostly the result of misunderstanding their purpose and misusing the results. I haven’t used every handbook method out there, but each one I have used states somewhere that the prediction is not intended to represent actual field reliability. For example, MIL-HDBK-217 states, “…a reliability prediction should never be assumed to represent the expected field reliability.”
I think the term “prediction” misleads the consumer into believing the end result is somehow an accurate representation of fielded reliability. When this ends up not being the case, rather than reflecting internally, we prefer to conclude the model must be flawed.
All that said, I would be one of the first to admit the handbooks could and should be updated and improved. We should strive to make the models less wrong, but we should also strive to use them properly. Using them as estimators of field reliability is wrong whether the results are expressed as MTBF or reliability.
Best Regards,
Andrew
Really? Is MTBF the only way to work with reliability growth?
I received this question via LinkedIn (feel free to connect with me there) and hadn’t given it much thought before. I am familiar with a few growth models and have regularly seen them used with MTBF, so I had discounted the modeling as an approach of little interest to me or my clients.
MTBF measures the inverse of the average failure rate, when in many cases we really want to know about the first or tenth percentile of time to failure. Measuring and tracking the average time to failure provides little information about the onset of the first few failures.
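To see how little the average says about early failures, this sketch holds the mean life fixed at a hypothetical 1,000 hours and compares the 1% and 10% failure times (B1 and B10) for Weibull distributions with three assumed shape parameters β. It uses the standard Weibull relations: mean = η·Γ(1 + 1/β), and the time by which a fraction p has failed is t_p = η·(−ln(1 − p))^(1/β).

```python
import math


def weibull_scale_for_mean(mean, beta):
    """Scale eta giving the requested mean: mean = eta * Gamma(1 + 1/beta)."""
    return mean / math.gamma(1.0 + 1.0 / beta)


def weibull_percentile(p, beta, eta):
    """Time by which a fraction p has failed, from F(t) = 1 - exp(-(t/eta)**beta)."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)


mean_life = 1_000.0  # hypothetical mean life, hours (same "MTBF" for every case)
results = {}
for beta in (0.8, 1.0, 3.0):  # assumed shapes: early-life, random, wear-out
    eta = weibull_scale_for_mean(mean_life, beta)
    b1 = weibull_percentile(0.01, beta, eta)
    b10 = weibull_percentile(0.10, beta, eta)
    results[beta] = (b1, b10)
    print(f"beta={beta}: B1 = {b1:.1f} h, B10 = {b10:.1f} h")
```

Same mean in every row, yet the B10 life differs by roughly a factor of ten between the early-life and wear-out cases.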
I did a quick check of common reliability growth models and found a few in the NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/apr/section1/apr19.htm
The Homogeneous Poisson Process (HPP) applies when the failure rate is constant over the time period of interest. It relies on the exponential distribution and the assumption of a stable, random arrival of failures, which in my experience is rarely true. It’s a convenient assumption, as it makes the math a lot simpler, yet it provides only a crude model and poor results.
The Non-Homogeneous Poisson Process (NHPP) Power Law and Exponential Law models provide information based on the cumulative number of failures over time. These models rely on the notion that any system has a finite number of design errors that, once resolved, leave a system with HPP behavior.
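As a sketch of the power-law NHPP, the snippet below writes the expected cumulative failures as M(t) = a·t^b and uses the usual maximum likelihood estimates for a test that ends at a fixed time T: b̂ = n / Σ ln(T/tᵢ) and â = n / T^b̂. The failure times and test length are hypothetical. A b below 1 means the failure intensity is falling, which is what reliability growth looks like in this model.

```python
import math


def power_law_mle(failure_times, test_end):
    """MLE for the power-law NHPP, time-terminated at test_end.

    Expected cumulative failures M(t) = a * t**b; intensity m(t) = a * b * t**(b - 1).
    """
    n = len(failure_times)
    b = n / sum(math.log(test_end / t) for t in failure_times)
    a = n / test_end ** b
    return a, b


times = [12.0, 55.0, 110.0, 230.0, 380.0]  # hypothetical cumulative test hours at failure
T = 400.0                                   # hypothetical test end
a, b = power_law_mle(times, T)
intensity_now = a * b * T ** (b - 1)  # estimated failures per hour at the end of test
print(f"b = {b:.3f}, current intensity = {intensity_now:.5f} per hour")
```

Note the result is a current intensity that changes over time, not a single constant rate.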
The Duane plot provides a graphical means to show cumulative failures over time. When the arrival of failures slows, the curve’s slope decreases and it effectively bends over. This provides a means to estimate the final failure rate (unfortunately, an average one).
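One common way to draw a Duane plot is cumulative failure rate (failures divided by cumulative test time) against cumulative test time on log-log axes, where steady growth shows up as a straight line. This sketch fits that line by least squares on hypothetical failure times. The negative of the slope is the growth rate α, and under the Duane model (1 − α) times the cumulative rate estimates the instantaneous failure rate at the end of the data.

```python
import math


def duane_fit(failure_times):
    """Least-squares line through (ln t_i, ln(i / t_i)) for the i-th cumulative failure."""
    xs = [math.log(t) for t in failure_times]
    ys = [math.log((i + 1) / t) for i, t in enumerate(failure_times)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


times = [10.0, 40.0, 90.0, 160.0, 250.0]  # hypothetical sorted cumulative test hours
slope, intercept = duane_fit(times)
alpha = -slope  # growth rate
T = times[-1]
cumulative_rate = math.exp(intercept) * T ** slope
instantaneous_rate = (1 - alpha) * cumulative_rate
print(f"alpha = {alpha:.2f}, instantaneous rate at T = {instantaneous_rate:.4f} per hour")
```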
Given my dislike of all things MTBF, I’ve not used these models to estimate MTBF. Instead, I stay with the Duane plot and graphically track when the team is finding and fixing enough faults in the design.
I also tend to use reliability block diagrams (RBDs), with each block modeled by the appropriate reliability distribution. For a series model, all we need to do is multiply the reliability values of each block at time t (say, the warranty period or mission time) to estimate the system reliability at time t.
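A minimal sketch of the series calculation, assuming each block follows a two-parameter Weibull, R(t) = exp(−(t/η)^β), and that blocks fail independently. The block names and parameters are hypothetical; `math.prod` needs Python 3.8 or later.

```python
import math


def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)**beta) for a two-parameter Weibull."""
    return math.exp(-((t / eta) ** beta))


# Hypothetical blocks: (shape beta, scale eta in hours)
blocks = {
    "power supply": (1.2, 50_000.0),
    "fan": (2.5, 30_000.0),
    "controller": (0.9, 200_000.0),
}

mission_time = 8_760.0  # one year of continuous operation
block_r = {name: weibull_reliability(mission_time, b, e) for name, (b, e) in blocks.items()}
system_r = math.prod(block_r.values())  # series: every block must survive
for name, r in block_r.items():
    print(f"{name:12s} R = {r:.4f}")
print(f"system       R = {system_r:.4f}")
```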
For complex systems with some amount of redundancy, the RBD does get a bit more complicated. For very complex systems with degraded modes of operation or significant repair times, use Petri nets or Markov models to model the system properly.
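For simple active redundancy the extra complication is modest: a parallel group fails only if every branch fails, so R = 1 − Π(1 − Rᵢ). This sketch assumes independent failures, no repair, and hypothetical Weibull parameters, and compares a single fan with a redundant pair in series with a power supply.

```python
import math


def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)**beta) for a two-parameter Weibull."""
    return math.exp(-((t / eta) ** beta))


def series(*reliabilities):
    """Series blocks: all must survive."""
    return math.prod(reliabilities)


def parallel(*reliabilities):
    """Active redundancy: the group fails only if every branch fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


t = 8_760.0
r_psu = weibull_reliability(t, 1.2, 50_000.0)  # hypothetical parameters
r_fan = weibull_reliability(t, 2.5, 30_000.0)  # hypothetical parameters

single_fan = series(r_psu, r_fan)
dual_fan = series(r_psu, parallel(r_fan, r_fan))
print(f"single fan: {single_fan:.4f}, redundant fans: {dual_fan:.4f}")
```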
In the vast majority of cases a simple RBD is sufficient to capture and understand the reliability of a system. This allows the team to focus on improving weak areas and to reduce uncertainty through improving reliability estimates. An RBD neither requires nor assumes an exponential distribution, and the math is easy enough to manage, often even in your favorite spreadsheet.
Reliability growth starts with a model of the estimated number of failures over a time period. Testing then provides a value for that estimate. This does not require the use of MTBF; instead of assuming a constant failure rate, focus on the failure mechanisms and use a simple RBD to build a system model. The reliability growth is the result of identifying areas for improvement and doing the improvement. RBD, in my experience, provides a great way to communicate with the team where to focus improvements.
Over the past week I’ve seen or received a couple of questions about MTTF. One was on how to use failure data to calculate MTTF, another on how to estimate Weibull parameters after assuming a constant rate of failure.
It is good to see such questions, as it means the person is curious enough to take the time to ask.
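For the Weibull question, one common approach with complete (uncensored) failure data is median rank regression; this is a swapped-in illustration, not necessarily the method the full article uses. The sketch uses Bernard’s approximation (i − 0.3)/(n + 0.4) for the median ranks and regresses ln(−ln(1 − F)) on ln(t). The failure times are hypothetical. Estimating β from the data, rather than assuming a constant failure rate (which forces β = 1), lets the data show whether failures look like early life, random, or wear-out.

```python
import math
from statistics import fmean


def weibull_median_rank_regression(failure_times):
    """Estimate (beta, eta) from complete failure data by rank regression on Y."""
    t = sorted(failure_times)
    n = len(t)
    xs = [math.log(ti) for ti in t]
    # Bernard's approximation to the median rank of the i-th failure
    ys = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    mx, my = fmean(xs), fmean(ys)
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    # The fitted line is y = beta * (x - ln(eta)), so y = 0 where x = ln(eta)
    eta = math.exp(mx - my / beta)
    return beta, eta


hours = [120.0, 190.0, 240.0, 280.0, 330.0, 390.0]  # hypothetical failure times
beta, eta = weibull_median_rank_regression(hours)
print(f"beta = {beta:.2f}, eta = {eta:.0f} h, sample mean = {fmean(hours):.0f} h")
```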
Do you check assumptions? Not all assumptions are equal; some may lead you to a costly decision.
We regularly make assumptions about the uniformity of material, the consistency of part-to-part performance, and many other engineering elements of a design or process. We have to simplify the problems we face in order to work out solutions and make decisions.
Every now and then we need to ask a supplier for a reliability estimate for a component they produce. Our team may be considering adding the part to a system and would like to know if it is reliable enough to meet our needs.
The basic question ‘How long should it last?’ may be the first question you consider related to the reliability of your product or production equipment. Ideally we would like to create a product that will never fail for our customers, or a set of equipment that just keeps running.
Tim Rodgers interviews Fred Schenkelberg, consultant and blogger of NoMTBF, concerning Fred’s work and writing around the perils of MTBF.
We range from what started the site to the common issues caused by using MTBF, then discuss using reliability instead.
Over the past two weeks this site has received over 150 visitors each weekday. From what I can see in the analytics and from a few conversations with folks, the site provides insights and information around the use of MTBF, plus basic information concerning reliability engineering.
Google seems to like the site as well, likely because visitors do.
Given the interest and plenty of encouragement (and helpful suggestions), I’m putting together a book based on the NoMTBF material. It is not just bashing MTBF, although there is plenty of that, but also the steps to use reliability or other measures that provide better information.
I have the basic outline and draft completed and am now ready for some feedback. If you’d like to review the work, on the condition that you provide feedback, suggestions, ideas, and comments, let me know and I’ll send you a draft copy.
The draft needs work on formatting, layout, clean graphics, etc. Yet the outline and basic text are there.
Can you follow the argument, is the writing clear, is there anything missing, how about the order or emphasis?
It’s not a long work: right now about 22,000 words, or about 100 to 120 pages depending on page size, font size, margins, etc. In Word it runs 73 pages without any attention to formatting.
If you have the time and interest, let me know and I’ll send you a copy, but you have to comment, critique, and make suggestions. I really would like this work to be useful for you and for us to use in encouraging others to avoid using MTBF.
This period of reflection concerning the NoMTBF project has reinforced the idea that we need to provide something concrete and positive to do instead of just not doing MTBF. Part of the issue is our education system, standards, and textbooks, as they often include MTBF in examples and discuss it at length.
So, the idea is to create a course for experienced reliability professionals, and for engineers and managers with an interest in reliability, that focuses on reliability metrics from goal setting to tracking performance.
I have the technology to put together an online course that could be self-paced or provided on a fixed schedule (say, weekly). It could include short lectures, discussions, reading material, and quizzes or examples to work.
Here’s a draft outline – what do you think?
Common reliability measures: pros and cons
Reliability and Availability Goal setting – connecting the goal to your business objectives
Estimating reliability for comparison to the goals
Tracking reliability and reporting performance
Reliability testing with results that compare to goals
Reliability modeling that leads to meaningful discussions and decisions
Common mistakes and remedies concerning reliability measures
How to get useful reliability information from vendors
(Plenty of opportunity for bashing MTBF, yet if done in contrast to much better methods and measures, it may provide really practical and useful information.)
So, thoughts? What would you want added or emphasized, and what would you want to be the main takeaways for each topic? What would you like to see in the course for yourself or for those you’d recommend take the course?
If you’d like to participate in the course project, I’m very open to your ideas and suggestions. Maybe help create and present a topic, provide examples, or sample problems or discussion questions.
Anyway, I’m looking for feedback and ideas to make the NoMTBF site much more positive and useful for the reliability engineering community and for anyone interested in reliability.
Hi Fred,
Your website has generated quite a bit of valid conversation about MTBF. I applaud you for that. Honestly, though, I have mixed feelings about some of what you present and thought I’d write this lengthy e-mail to provide some feedback. I hope you take this in the right light, as constructive criticism from someone who, overall, appreciates your efforts.
Let me start with a point I disagree with. In your opening slide show “Thinking about MTBF” I think the “Common Confusion” slide could be better presented. Many viewers would interpret that slide to say that the MTBF is not the mean. Of course MTBF is the mean. Your point is that, while it is the mean, the distribution is not Gaussian. Fair enough. Funny thing is I’ve actually had quality engineers try and tell me the MTBF is not the mean of the distribution and I’m afraid your slide may perpetuate that misunderstanding.
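Scott’s point can be checked with a quick simulation. Under an assumed exponential life distribution with a hypothetical 1,000-hour MTBF, the sample mean of many simulated lifetimes converges to the MTBF, so MTBF really is the mean; but the distribution is skewed, so the median sits near 693 hours and roughly 63% of units fail before reaching the MTBF. The sketch uses NumPy’s `Generator` API with a fixed seed.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
mtbf = 1_000.0  # hypothetical MTBF in hours (exponential lives assumed)
lifetimes = rng.exponential(scale=mtbf, size=200_000)

sample_mean = lifetimes.mean()                 # lands close to the MTBF
sample_median = np.median(lifetimes)           # near mtbf * ln 2, about 693 h
share_before_mean = (lifetimes < mtbf).mean()  # near 1 - exp(-1), about 0.632
print(f"mean {sample_mean:.0f} h, median {sample_median:.0f} h, "
      f"failed before the mean: {share_before_mean:.1%}")
```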
In the same vein, later in the talk, and in the other sections on your site, you seem to indicate that the MTBF is not the expected value (see Perils: “I heard one design team manager explain MTBF as the time to expect from one failure to the next.”). Of course the MTBF is the expected value, in a purely mathematical sense (as you discuss earlier in this section). So I’m confused about your point here. I guess you are commenting on the layman’s feeling for “expected” value. Which leads me to my next section.
It almost appears that one of the premises of NoMTBF is that many people do not understand statistics and therefore we should not confuse them by using MTBF. I disagree with this. For example, many people don’t understand the difference between median and mean, but no one is suggesting we remove those terms. Similarly, the fact that many people incorrectly assume a Gaussian distribution when they hear the term mean is hardly justification for removing the term MTBF. The problem is education, not the definition. Same point for expectation: because the average is some value does not imply all samples will be equal to that value. Anyone who thinks that, in my opinion, needs more education in statistics, and we shouldn’t try to “simplify” to account for a lack of education.
I don’t really accept your implication that using MTBF implies a constant failure rate. The proper definition is the integral form you present in a number of spots, but I agree that many tie these two together. I think one of the themes of your website is that the constant failure rate assumption is not valid. In that, I’m in 100% agreement and applaud your efforts. (I guess the site name would not have the same panache if it were called NoConstantFailureRate.) Clearly the constant failure rate model often does not apply, and reducing all of reliability to one number is a gross simplification.
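The integral form supports Scott’s point that MTBF alone does not imply a constant failure rate: the mean life is ∫₀^∞ R(t) dt for any life distribution. This sketch integrates a Weibull reliability function numerically with the trapezoid rule and compares the result with the closed form η·Γ(1 + 1/β). The parameters are hypothetical, and the upper limit of 40η is an assumption chosen so the truncated tail is negligible for the shapes shown.

```python
import math


def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)**beta) for a two-parameter Weibull."""
    return math.exp(-((t / eta) ** beta))


def mean_life_by_integration(beta, eta, upper_factor=40.0, steps=100_000):
    """Trapezoid-rule estimate of the integral of R(t) from 0 to upper_factor * eta."""
    upper = upper_factor * eta
    h = upper / steps
    total = 0.5 * (weibull_reliability(0.0, beta, eta) + weibull_reliability(upper, beta, eta))
    for i in range(1, steps):
        total += weibull_reliability(i * h, beta, eta)
    return total * h


def mean_life_closed_form(beta, eta):
    """Weibull mean: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)


for beta in (1.0, 2.0, 3.5):  # only beta = 1 is a constant failure rate
    numeric = mean_life_by_integration(beta, 1_000.0)
    exact = mean_life_closed_form(beta, 1_000.0)
    print(f"beta={beta}: integral {numeric:.2f} h, closed form {exact:.2f} h")
```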
So where should people go instead? Just bashing something is not a solution. Your website really has had an impact, but in a strange way it has sometimes had the opposite impact from what I think we would both like. I’ve had quality managers who did not want to gather field failure data, justifying this in part by saying that MTBF is bogus statistics. OK, MTBF is not perfect, but I’m sure we agree that the way to improve reliability is to gather data as a first step.
You have quite a following and, personally, I’d like to see you lead more. Yes, MTBF is a simplification, but I also don’t expect to pick up a data sheet and see a physics of failure paper stapled to the back of it or a chart of reliability over time. Fact is, many complex things get reduced to a few key numbers (e.g., horsepower, MPG, 0 to 60 time for a car). I think your Actions/Alternative Metric is addressing this. Stating a reliability percentage over a time interval is an intriguing alternative. I like it. If that is your alternative, then personally I’d like to see it more clearly emphasized across the site. I’d also like to see you develop it more. How does one determine reliability % and duration from the Weibull parameters? How would one put together a reliability block diagram and estimate overall reliability if subcomponents were specified in this manner? I don’t know the answers to these questions and I’d be interested in reading more.
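Scott’s first question has a direct answer for the two-parameter Weibull, R(t) = exp(−(t/η)^β): reliability at a duration comes straight from the formula, and the duration for a target reliability is t = η·(−ln R)^(1/β). This sketch shows both, plus one possible way to handle a component specified only as ‘R at t’: assume a shape parameter, back out η, and restate the spec at another duration. The parameters and the assumed β are hypothetical, and the restated value is only as good as that β assumption. With every block expressed at the same duration, a series RBD is then the product of the block reliabilities.

```python
import math


def reliability_at(t, beta, eta):
    """Weibull reliability: fraction expected to survive to time t."""
    return math.exp(-((t / eta) ** beta))


def time_at_reliability(r, beta, eta):
    """Duration over which the Weibull reliability stays at or above r."""
    return eta * (-math.log(r)) ** (1.0 / beta)


def eta_from_spec(r, t, beta):
    """Scale parameter implied by a 'reliability r at time t' spec, given an assumed beta."""
    return t / (-math.log(r)) ** (1.0 / beta)


beta, eta = 1.8, 40_000.0  # hypothetical Weibull parameters
r_year = reliability_at(8_760.0, beta, eta)  # reliability over one year
b10 = time_at_reliability(0.90, beta, eta)   # time at which 10% have failed
print(f"R(1 year) = {r_year:.3f}, B10 = {b10:.0f} h")

# A supplier quotes 95% at 8,760 h; assume beta = 1.5 to restate it at 3 years.
eta_part = eta_from_spec(0.95, 8_760.0, 1.5)
r_3yr = reliability_at(3 * 8_760.0, 1.5, eta_part)
print(f"restated: R(3 years) = {r_3yr:.3f}")
```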
As I stated in the beginning, I hope you take this in the right light. While obviously I don’t agree with everything on your site you have many extremely valid points and you are doing a great job stimulating discussion. Thanks for your efforts.
Scott Diamond
Vice President of Quality and Customer Excellence
Surveillance Group
FLIR Systems Inc.
— Ed note:
Thanks, Scott, for the insightful and meaningful feedback. I will be making some adjustments and improvements. Thanks for the careful reading and for taking the time to provide your suggestions and comments. Very much appreciated. Fred
Summer Break
I’m taking a week off from article writing, so in the vein of summer reruns, here is a list of the top five posts from the NoMTBF site.
In no particular order:
Enjoy these again or for the first time.
As engineers laying out a factory or designing a new product, we have to meet the reliability expectations of our customers. It would be great if the system would not fail or need repair, yet that is often not the case.
MTBF has issues. It is commonly misunderstood and misused. I find it hard to interpret and use for any meaningful discussion of reliability.
The entire premise of the NoMTBF site is to encourage you to not use MTBF.
There are exhaustive writings on setting meaningful goals and metrics in the business literature. A couple of tenets seem common.