The Challenges in Reliability Engineering

What Are the Other Challenges in Reliability?

Creating a product or system that lasts as long as expected, or longer, is a challenge.

It’s a common challenge that reliability engineering and the entire engineering team face on a regular basis. It’s also not our only challenge.

We face and solve a myriad of technical, political, and engineering challenges. Some of our challenges are born and carried forward by our own industry. We have tools suitable for a given purpose altered to ‘fit’ another situation (inappropriately and creating misleading results). We have terms that we, and our peers, struggle to understand.

Sometimes, we, as reliability engineers, have set up challenges that thwart our best efforts to make progress.

Let’s examine a few of these self-made challenges and discuss ways to overcome them, freeing us to tackle the real hurdles in our path.

MTBF and Prediction Are the Two Big Issues

This site has the express goal to ‘eradicate MTBF’. It is the worst four-letter acronym in our world. You already know this, and many readers here have taken steps to see this term relegated to the dust of forgotten history.

Parts count predictions, especially from our favorite military standard, are widely known to be less than useful. Then why do we continue to find requirements to use this method as a basis to estimate actual future field failure rates?

Even 20 years after 217’s retirement/obsolescence, it lives. Again, there are teams working on viable and actually useful alternatives. Physics of failure modeling, improved reliability modeling tools that permit (nay, encourage) the use of appropriate lifetime distributions, and other work are slowly weaning our industry from the folly of parts count predictions.

HALT: “Let’s pass HALT”

This one isn’t discussed too often. Yet, have you heard someone wonder if their product could pass HALT?
How about, ‘Of course it failed, you were testing above the specified use level.’

HALT is the second-worst four-letter acronym.

We have a way to go to make this basic concept clear: HALT is a discovery process, not a test to pass. We employ a stress testing process to identify weaknesses in the design, using elevated stresses to discover problems and margins quickly.

Cost of Failure

Engineers know intuitively that failures are bad. The design effort includes actions to design a robust and reliable product.

One tool that we often avoid employing is the actual or estimated cost of a failure. We tend to focus on failure rates and failure mechanisms, which is fine to a point. Yet, if we do not also include the consequence (safety, warranty, brand loyalty, customer losses, etc.), we have only half the information we need to make great decisions.

Our team needs to work on the potential and actual failures that make a difference when solved. Not all failure modes are the same. Let’s solve the ones that save the most lives, anguish, and money.

Get the information you need for your product to determine the cost per failure. This information, along with an expected shipping volume and estimated failure rates, enables the calculation of the cost of failure.

If you calculate the cost of failure per unit shipped, you have a value that is directly comparable to the cost of the materials and components in the product’s bill of materials. In my experience, the cost of failure per unit shipped is the most expensive line item, or at least among the five most expensive, when set beside the components in a product.
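
The arithmetic described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names are hypothetical, and every figure (cost per failure, shipping volume, failure rate) is a made-up placeholder to be replaced with your own product data.

```python
def total_cost_of_failure(cost_per_failure, units_shipped, failure_rate):
    """Expected total cost: expected number of failures times cost per failure."""
    expected_failures = units_shipped * failure_rate
    return expected_failures * cost_per_failure


def cost_of_failure_per_unit(cost_per_failure, failure_rate):
    """Expected failure cost carried by each unit shipped."""
    return cost_per_failure * failure_rate


# Illustrative placeholder values only, not data from any real product.
cost_per_failure = 250.0  # dollars per failure (warranty, service, shipping, ...)
units_shipped = 100_000   # expected shipping volume
failure_rate = 0.02       # estimated fraction of units that fail

total = total_cost_of_failure(cost_per_failure, units_shipped, failure_rate)
per_unit = cost_of_failure_per_unit(cost_per_failure, failure_rate)

print(f"Total cost of failure: ${total:,.0f}")
print(f"Cost of failure per unit shipped: ${per_unit:.2f}")
```

The per-unit figure can then be placed next to the component costs in the bill of materials to see where it ranks.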

We employ teams of engineers to develop a single critical component or to cost-reduce an expensive one, yet our ignorance of the cost of failure allows wonderful opportunities for savings to remain hidden.

Determine the cost of failure and make that information widely available to your team. Show them how to use the information to weigh the everyday decisions they make during design and development.

Mixed Priorities

I’ve been told product reliability is critical, then asked to use less than half the sample size necessary for an accelerated life test.

Critical, important, and top priority are great terms. They sound great. If they do not come with resources, personnel, budgets, and support, those terms are hollow platitudes that merely suggest our work on reliability is critical, important, or a top priority.

I’m not suggesting, although I often really do believe, that reliability performance is a top priority. Organizations have many priorities, and I get that. The challenge is in the mixed signals. The unclear priorities. The many top priorities.

The remedy is, again, to quantify the cost of failure. Management, mostly, talks in terms of money, so we need to convert a 1% failure rate into dollars lost to warranty per year. We need to quantify the cost of uncertainty, especially when the uncertainty ranges from none to billions in potential losses. A 10% chance that we have a major safety issue for a $100 million product line suggests an expected loss of $10 million unless we reduce the risk. Few other product risks involve such threats to profit and business viability.
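
As a rough sketch of that conversion in Python: the annual volume and cost per warranty claim below are assumed placeholder values (neither is given above), and the function names are hypothetical. The expected-loss function simply multiplies probability by consequence, which is the reasoning behind the $10 million figure.

```python
def annual_warranty_cost(units_per_year, failure_rate, cost_per_claim):
    """Dollars lost to warranty per year for a given failure rate."""
    return units_per_year * failure_rate * cost_per_claim


def expected_loss(probability, consequence):
    """Probability-weighted loss: chance of the event times its cost."""
    return probability * consequence


# Assumed placeholder inputs, not taken from any real product.
units_per_year = 500_000  # assumed annual shipping volume
cost_per_claim = 120.0    # assumed dollars per warranty claim

# The 1% failure rate from the text, expressed in dollars per year.
warranty = annual_warranty_cost(units_per_year, 0.01, cost_per_claim)

# The 10% chance of a major issue on a $100 million product line.
risk = expected_loss(0.10, 100_000_000)

print(f"Warranty cost per year: ${warranty:,.0f}")
print(f"Expected loss from the safety risk: ${risk:,.0f}")
```

Presenting both numbers in dollars puts reliability risk on the same scale as the other priorities management is weighing.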

Part of why reliability isn’t well positioned in the pantheon of priorities is that it is difficult to quantify. At least that is my observation. Difficult doesn’t mean impossible.

Reliability is one of the most important priorities for most organizations to get right. Let’s help our teams align their ability to deliver the expected reliability with the organization’s goals, while properly balancing reliability against other priorities.

Summary

There are challenges in the world of reliability engineering. MTBF and predictions are well known and many are working to help us and our peers move forward.
HALT, Cost of Failure, and Mixed Priorities are 3 of the many challenges you face on a regular basis. What would you add to this list? What can we, as a community of reliability engineers, do to solve them? Add your suggestions and recommendations in the comments section below.

Comments

Rick Kossik says

April 20, 2017 at 9:12 AM

I believe your most important point here is that proper design requires not just modeling failure (and repairs), but modeling the consequences of failures. This is a much more difficult task than traditional reliability modeling, as it requires a “total system model” that not only simulates the components that can fail (and perhaps be repaired), but also models (in detail) the consequences of different types of failures. Only then is it possible to focus on the failures that are important.

A simple example of this is a water resource system. If a pump fails, how does it affect the rest of the system? Is the failure simply an inconvenience or does it lead to catastrophe (e.g., a dam failure)? Perhaps usually it is just the former, but if it fails during a storm event, it could be the latter. Moreover, although storms may be rare, the pump may in fact be more likely to fail during a storm (i.e., failure rates may increase during storm events), and this should be quantitatively represented in the model. So to properly understand the consequences of failure requires that you model the total system (dynamically and probabilistically), representing, for example, storm events, as well as the actual feedback loops that exist in the system.

A few of our customers have done this, including NASA, Sandia National Laboratories, and Los Alamos National Laboratory, but it is the exception, not the rule. I think the primary reason is that doing so requires a team approach. Most reliability engineers lack the background to model the “total system”, and those with the background typically lack the required reliability engineering skills. Hence, modeling such a system properly requires a team of individuals who together possess the necessary skills. This can be time-consuming and expensive, and hence is not typically done (of course, the ultimate cost of failure may be much more expensive, but this is rarely taken into account).

- Fred Schenkelberg says
  
  April 20, 2017 at 10:53 AM
  
  Thanks Rick for the comment and story. As you suggest these models can become rather complex, yet even considering the consequences will go a long way to help sort out priorities. cheers, Fred