A Question about the Bathtub Myth
Question:
Do all components follow the bathtub curve? Is it possible to generate a bathtub curve for a component?
Thanks for your thoughts and insights!
V.
Response:
This is a good question.
No. No component or product I’ve ever worked with follows the bathtub curve, or looks like those in textbooks. An individual component or product does not follow any curve; it simply fails at some point in time for some particular cause. The probability of failure for that single component (we use a best estimate to predict the failure) is a combination of the many curves describing the various failure mechanisms, or is a combined model based on similar past products.
An entire population of components, such as capacitors, generally has some portion of units that are damaged in production and escape quality checks – they become early life failures or fail due to the latent damage. These failures tend to occur quickly relative to expected lifetimes. Yet, they can also occur very late in the expected life cycle. There is no rule that they have to fail early, but in general latent damage failures occur early.
There is no such thing in my experience as a constant rate or flat part of the bathtub curve – I’ve never, ever seen one with real data. There are times when the change in failure rate is small enough over a short enough duration that it could be reasonably estimated as constant, but that is very rare. Hence my rather vehement objection to MTBF, which assumes constant failure rates.
Wear out is real. It occurs for many reasons and depending on the product, use environment and failure mechanism, it may have a well-defined model to describe the expected failure rate over time. There are many models that describe specific failure mechanisms. The rate of failure varies dramatically based on failure mechanisms. Metals rust or migrate, pn junctions degrade, polymers break down, gaskets become brittle, materials abrade, bearings and grease diminish, etc. For a specific product there are thousands of ways it can fail, and it is a grand race to cause the eventual failure.
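The "grand race" between failure mechanisms can be sketched as a competing-risks model. This is a minimal illustration, not data from any real product: the three mechanisms, their Weibull shape and scale values, and the assumption that mechanisms act independently are all hypothetical. Under that independence assumption the hazard rates of the mechanisms add, and each unit fails at the earliest of its per-mechanism failure times.

```python
import random

# Hypothetical mechanisms: (name, Weibull shape beta, Weibull scale eta in hours).
# These values are invented for illustration only.
MECHANISMS = [
    ("latent defect", 0.5, 200_000.0),
    ("corrosion", 2.0, 60_000.0),
    ("bearing wear", 3.5, 40_000.0),
]

def weibull_hazard(t, beta, eta):
    """Instantaneous hazard rate of a two-parameter Weibull, for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def system_hazard(t):
    """Assuming independent mechanisms, the individual hazards add."""
    return sum(weibull_hazard(t, beta, eta) for _, beta, eta in MECHANISMS)

def simulate_unit(rng):
    """Draw one failure time per mechanism; the earliest one wins the race."""
    # random.weibullvariate takes (scale, shape) in that order.
    draws = [(rng.weibullvariate(eta, beta), name) for name, beta, eta in MECHANISMS]
    return min(draws)

rng = random.Random(42)
winners = {}
for _ in range(10_000):
    _, name = simulate_unit(rng)
    winners[name] = winners.get(name, 0) + 1

for hours in (10, 10_000, 40_000):
    print(hours, system_hazard(hours))
print(winners)
```

With these made-up parameters the combined hazard falls at first and rises later, and it keeps changing throughout rather than settling at a constant value. The tally of winners shows which mechanism most often causes the eventual failure.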
Some products have a dominant failure mechanism. For the brake systems in cars, the brake pads abrade as they create friction and braking force. They wear and eventually will lose the ability to create braking force due to dimensional change, i.e. wear. They are replaced and the remaining equipment within the braking system lasts much longer. When the pads wear out and need replacing we don’t deem the system as failed; although it has, it was predicted and expected.
Mark says
As I have so often observed, Fred is right and provides a very good explanation. My experience is quite similar to Fred’s. The bathtub curve makes for a nice picture in a book and depicts various failure rate possibilities (decreasing, constant, increasing). But it is an idealization that I have never encountered in my work.
Mark
Chet Haibel says
The Bathtub Curve is a summary picture of the behavior of a complex system with Early-Life (infant mortality) failure modes, Random-in-Time (useful life) failure modes, and Wear-out (end-of-life) failure modes. Individual components in a complex system have one or more of these types of failure modes, so the system exhibits the Bathtub Curve behavior.
The Bathtub Curve is often misunderstood because it plots the Hazard Rate, not the histogram of when failures occur (the failure density), which is usually called the Failure Rate. The Hazard Rate shows the fraction (percentage) of the remaining (unfailed) components that fail per unit time. For example, the Failure Rate for an Exponential distribution is a decaying exponential, whereas the Hazard Rate is constant.
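The distinction between the density of failure times and the hazard rate can be checked numerically for the exponential case. This is a small sketch; the rate value `lam = 0.01` per hour is an arbitrary assumption chosen for illustration.

```python
import math

def exp_density(t, lam):
    """Density of failure times: the curve a histogram of failures approximates."""
    return lam * math.exp(-lam * t)

def exp_reliability(t, lam):
    """Fraction of units still unfailed at time t."""
    return math.exp(-lam * t)

def exp_hazard(t, lam):
    """Hazard rate: density divided by the fraction still surviving."""
    return exp_density(t, lam) / exp_reliability(t, lam)

lam = 0.01  # assumed failure rate per hour, for illustration only
for t in (0.0, 100.0, 500.0):
    print(t, exp_density(t, lam), exp_hazard(t, lam))
```

The density column shrinks as time passes because fewer units remain to fail, while the hazard column stays at `lam`.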
Hilaire Perera says
MTBF does not always assume a constant failure rate.
In general, MTBF equals the integral from 0 to infinity of the reliability function R(t), taken over time.
MTBF describes the expected time between two consecutive failures for a repairable system.
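The integral definition can be checked numerically. The sketch below assumes a Weibull reliability function with illustrative parameters (shape 2, scale 1000 hours) and compares a trapezoidal approximation of the integral of R(t) with the known closed form for the Weibull mean, eta * Gamma(1 + 1/beta). Strictly, the integral gives the mean time to first failure; treating it as the time between failures assumes each repair returns the system to as-good-as-new condition.

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) for a two-parameter Weibull distribution."""
    return math.exp(-((t / eta) ** beta))

def mean_from_reliability(reliability, upper, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) from 0 to `upper`."""
    dt = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt

beta, eta = 2.0, 1000.0  # assumed values, for illustration only
numeric = mean_from_reliability(lambda t: weibull_reliability(t, beta, eta), 10 * eta)
closed_form = eta * math.gamma(1.0 + 1.0 / beta)
print(numeric, closed_form)
```

The upper limit of ten times the scale stands in for infinity; the reliability beyond that point is negligible for these parameters.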
Fred Schenkelberg says
Hi Hilaire,
The calculation of MTBF isn’t the issue; it is the use and understanding. If I measure a system’s time to failure or time between failures and calculate MTBF, that is fine. The assumption comes into play when reporting reliability using only MTBF, because when provided with only a single number, any use of that value has to assume a constant failure rate.
As you know, many systems have increasing or decreasing failure rates, and you will get a different MTBF value depending on the period of time (miles, etc.) over which the data is collected. This alone should indicate MTBF is of little value.
If we provide the Weibull parameters, shape (slope) and scale, we provide much more information, including how the failure rate changes over time. Same data, much better information and much less prone to mistaken assumptions.
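A quick sketch of why a single MTBF number hides so much: the three Weibull distributions below all have the same mean of 1000 hours, yet very different reliability at 100 hours. The mean, the evaluation time, and the three shape values are assumptions picked for illustration.

```python
import math

def eta_for_mean(mean, beta):
    """Weibull scale that yields the requested mean for a given shape."""
    return mean / math.gamma(1.0 + 1.0 / beta)

def reliability_at(t, mean, beta):
    """Weibull reliability at time t for a distribution with the given mean."""
    eta = eta_for_mean(mean, beta)
    return math.exp(-((t / eta) ** beta))

MEAN_HOURS = 1000.0  # the same "MTBF" for every case below
for beta in (0.5, 1.0, 3.0):  # decreasing, constant, increasing failure rate
    print(beta, eta_for_mean(MEAN_HOURS, beta), reliability_at(100.0, MEAN_HOURS, beta))
```

A shape below 1 means a decreasing failure rate, a shape of 1 a constant rate, and a shape above 1 an increasing rate, so reporting the shape alongside the scale conveys the trend that MTBF alone cannot.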
For a repairable system, I recommend the mean cumulative function instead – no distribution assumption at all and very easy to understand and use.
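A minimal sketch of a nonparametric mean cumulative function estimate: at each failure age, the estimate steps up by one divided by the number of systems still under observation at that age. The input layout and the three example systems are hypothetical, invented for illustration.

```python
def mean_cumulative_function(systems):
    """Estimate the MCF from a fleet of repairable systems.

    systems: list of (failure_ages, end_of_observation_age) pairs.
    Returns a list of (age, mcf) steps, one per recorded failure.
    """
    events = sorted(age for failures, _ in systems for age in failures)
    mcf = 0.0
    steps = []
    for age in events:
        # Systems still being observed at this age share the increment.
        at_risk = sum(1 for _, end in systems if end >= age)
        mcf += 1.0 / at_risk
        steps.append((age, mcf))
    return steps

# Hypothetical fleet: failure ages in hours and the age at which
# observation of each system stopped.
fleet = [
    ([100.0, 400.0], 500.0),
    ([250.0], 300.0),
    ([], 600.0),
]
for age, value in mean_cumulative_function(fleet):
    print(age, value)
```

Plotting the steps against age shows the average number of repairs per system over time. A curve bending upward suggests an increasing rate of failures, and one bending downward a decreasing rate.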
cheers,
Fred
Emenike Gift Emenike says
Does software failure follow the bathtub curve?
How would you describe software failure characteristics using the bathtub curve, and can you explain which portions of the curve software follows and which it does not?
Fred Schenkelberg says
Hi Emenike,
First – remember that the bathtub curve is a convenient fiction to describe the changing nature of failure rates over time.
Second, yes software does show early life failure, a decreasing failure rate, and wear out. Initially designed and launched software may contain bugs which over time are resolved as updates occur.
Wear out occurs as the technology around the software makes the software obsolete. The range of issues increases as the hardware and expectations around the software advance. Plus, software may have elements that actually age when the package runs much longer than the design team anticipated (memory handling, calibration creep, etc.).
I’m not a software engineer, rather a reliability engineer, and have worked with products that exhibited significantly decreasing failure rates due to early software bug resolution, and with older software showing increasing slowness and an increasing number of errors as it struggled to continue to operate.
Cheers,
Fred
Hilaire Perera says
Emenike
This “SlideShare” presentation shows the difference between Hardware & Software Reliability
http://www.slideshare.net/HPERERA/software-reliability-46662435