Two Ways to Think and Talk about Reliability
Neither includes using MTBF, btw.
And, I’m not thinking about the common language definition either.
Plus, I may have this all wrong. Here is the way I think about the reliability of something. More than ‘it should just work’ and different than ‘one can count on it to start’. When I ask someone how reliable a product is, this is what I mean.
By explaining my basic understanding we can compare notes. It is possible, quite possible, that I will learn something. As you may as well. Let’s see.
Definition of Reliability in Four Parts
First, consider the definition of reliability as used by reliability engineers and others in the know.
“The probability that an item will perform a required function without failure under stated conditions for a stated period of time.” (Practical Reliability Engineering, Fifth Edition, Patrick D. T. O’Connor and Andre Kleyner, John Wiley & Sons, Ltd., 2012.)
The product will probably work for some duration. The reliability function for life distributions is the probability of success over a duration.
We define reliability as a probability over a duration. To me, there are two ways that I understand this concept.
- How many will survive?
- Or, what is the chance this one will survive?
Let’s explore these two a bit more.
How Many Will Survive a Duration
If a product that is about to launch offers a warranty, a common question may be, “how many will fail during the warranty period?” The finance team and others want to know so they can plan accordingly.
The complement of how many will fail is how many will survive (it does sound a bit more positive than acknowledging that failures occur). Hence, the ‘survive’ idea.
So, if we create and sell 1,000 widgets we are interested in the reliability of said widget. If the probability the widgets will perform over the warranty period without failure is 90%, then we would expect 900 widgets to have worked without failure over the warranty period.
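The arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, and the spread calculation assumes each widget fails independently of the others (a binomial model), which is my simplifying assumption and not something the example requires.

```python
import math


def expected_survivors(units_sold: int, reliability: float) -> float:
    """Expected number of units still working at the end of the duration."""
    return units_sold * reliability


def survivor_std_dev(units_sold: int, reliability: float) -> float:
    """Standard deviation of the survivor count.

    Assumes independent failures (binomial model), a simplifying assumption.
    """
    return math.sqrt(units_sold * reliability * (1.0 - reliability))


units = 1000        # widgets sold
r_warranty = 0.90   # probability of surviving the warranty period

survivors = expected_survivors(units, r_warranty)
failures = units - survivors
print(f"Expected survivors: {survivors:.0f}")   # 900
print(f"Expected failures:  {failures:.0f}")    # 100
print(f"Rough spread (1 std dev): +/- {survivor_std_dev(units, r_warranty):.1f} units")
```

The expected count is what the finance team plans around. The spread is a reminder that the actual tally will land near 900, not exactly on it.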
This understanding works when we have more than one item under consideration. Say 10 million cell phones, 50 thousand electric vehicles, 768 electric toothbrushes. Given some number put into service, what is the tally of successful widgets, meaning the ones that haven’t failed and are still working at the end of some duration?
So, if our goal is 90% reliable over 2 years and we achieve that goal, we expect 9 out of 10 widgets to function without failure for 2 years, starting when manufactured, sold, or placed into service (whenever we and our customer define time zero for an individual widget).
It’s not calendar time or time since first launch and first sale. Time is relative to the individual widgets. It is the duration over which the item is expected to function that we track.
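A small sketch of tracking time per widget, not per calendar. The unit names and dates are invented for illustration, and time zero is taken here as the in-service date, which is only one of the choices mentioned above.

```python
from datetime import date


def age_in_days(time_zero: date, as_of: date) -> int:
    """Days a unit has been in service, counted from its own time zero."""
    return (as_of - time_zero).days


# Hypothetical units with their own in-service dates (made-up data).
in_service = {
    "widget-A": date(2023, 1, 15),
    "widget-B": date(2023, 11, 1),
}

as_of = date(2024, 1, 15)
for unit, start in in_service.items():
    print(unit, age_in_days(start, as_of), "days in service")
```

On the same calendar day, widget-A is a year into its duration while widget-B is only a couple of months in, so the two sit at very different points on the reliability curve.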
How Likely is This Item to Survive the Duration
As a customer that just buys one widget, not an entire fleet, I’m interested in the chance my widget will survive some duration.
When I bought my current cell phone, I expected it to last at least 3 years, and it would be great if it met all my functional expectations for 4 years. What I care about is the probability that the specific cell phone I purchase will survive over my expected duration. Neither the sales folks nor reliability professionals could tell me the probability that my individual phone, serial number xyz, will survive 3 years or any other duration.
What we might know, or hopefully the product development team and supporting reliability professionals know, is the probability an individual item will survive a duration, in general, or on average (average not being the right word, I think).
If the design and production process create a phone that is 90% reliable over 3 years for all phones they produce, then we can estimate that the phone in my hand has a 90% chance of surviving 3 years.
Many factors contribute to when each phone actually fails: design changes, manufacturing variability, environmental and use differences, etc. In some circumstances, we may estimate the confidence bounds around the probability of surviving three years. Or we may find a range within which there is a 90% chance the individual items will survive. Either way, our best guess for an individual item is the reliability over the duration, R(t).
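One way to put bounds around a survival probability is sketched below, assuming we simply counted how many of a sample survived the full three years. The Wilson score interval is my choice of method here, one of several reasonable options, and the counts are invented for illustration.

```python
import math
from statistics import NormalDist


def wilson_interval(survivors: int, tested: int, confidence: float = 0.95):
    """Wilson score interval for a survival proportion.

    Returns (lower, upper) bounds on the probability of surviving the duration.
    """
    if tested <= 0:
        raise ValueError("tested must be positive")
    # Two-sided z value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = survivors / tested
    denom = 1 + z**2 / tested
    center = (p_hat + z**2 / (2 * tested)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / tested + z**2 / (4 * tested**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical data: 100 phones tracked for 3 years, 90 still working.
low, high = wilson_interval(survivors=90, tested=100)
print(f"Point estimate: 0.90, 95% bounds: {low:.3f} to {high:.3f}")
```

With only 100 units behind the estimate, the bounds are fairly wide, so the 90% figure for the phone in my hand is a best guess, not a promise.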
I once heard a woman buying an inkjet printer ask the clerk to select the box which has a printer most likely to last 3 years. The boxes are not labeled to indicate which has the most robust components, so the clerk selected the box with the fewest blemishes on the box. They laughed and she bought a printer hoping her printer would last 3 years.
How Do You Talk About Reliability?
I use probability as a function of time. The chance of survival over a short time is better than over a longer time, so I use a reliability function to keep track. The Weibull distribution is my go-to, yet there are other options, both parametric and non-parametric, to describe the probability as a function of time.
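As a minimal sketch, the two-parameter Weibull reliability function is R(t) = exp(-(t/eta)^beta), where beta is the shape and eta is the scale (the time by which about 63.2% have failed). The parameter values below are made up for illustration and do not describe any real product.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """Probability of surviving to time t under a two-parameter Weibull model."""
    if t < 0 or beta <= 0 or eta <= 0:
        raise ValueError("t must be >= 0; beta and eta must be > 0")
    return math.exp(-((t / eta) ** beta))


def time_to_fraction_failed(p: float, beta: float, eta: float) -> float:
    """Time by which a fraction p of units is expected to have failed."""
    if not 0 < p < 1:
        raise ValueError("p must be between 0 and 1")
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)


# Hypothetical parameters: shape 1.5 (wear-out behavior), scale 10 years.
beta, eta = 1.5, 10.0

for years in (1, 2, 3, 5):
    print(f"R({years} yr) = {weibull_reliability(years, beta, eta):.3f}")

for p in (0.01, 0.05):
    t_p = time_to_fraction_failed(p, beta, eta)
    print(f"Time to {p:.0%} failures: {t_p:.2f} years")
```

The first loop shows the chance of survival shrinking as the duration grows. The second inverts the same function to report time to 1% or 5% failures, the kind of metric discussed in the comments below.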
Do you use the reliability definition listed in so many textbooks as what you mean when you ask, ‘how reliable is this component?’ A probability over a duration? Shouldn’t we ask for the probability of success over some duration, or set of durations?
What do you think of my two ways of considering ‘reliability’? What does ‘reliability’ mean to you?
Kevin Walker says
Reporting reliability test results and field data analysis in terms of time to 1% or 5% failures has been pretty well received here. The business team definitely gets it when they hear it in those terms – OK, we build the business plan based on cost of x # of failures coming back. They never really knew what to do with MTBF, so when you make someone’s job easier, acceptance comes along readily.
Fred says
Hi Kevin, good for you. I agree using simple time to failure metrics, such as time to 1% failures works well. If the metric is understood it helps it become useful. cheers, Fred