Field Failure: A Quality or Reliability Problem

When my car fails to start, as a customer I only know that my car didn’t start.

When my phone fails to turn on, or the dishwasher leaks, or the printer jams, I only know I’ve experienced an unwanted outcome.

I really do not care, at the moment, why the coffee maker is not producing my morning cup of coffee. My first thought is ‘now where do I find a cup of coffee?’ As a reliability engineer, I’m naturally curious about what caused the failure and whether I can fix it immediately to get the morning cup brewing.

My thinking does not classify the failure or the source of the failure as a quality or reliability problem. Then why is it that some organizations split reported field failures thus?

Division of Focus When Addressing Field Failures

In more than one organization I’ve witnessed the process of addressing field failures. Smaller organizations may have a small team dealing with support calls and field returns. Larger organizations may have tiers of teams scattered around the world and across the organization. In both small and large organizations, one of the first steps is to assess the field failure.

The assessment is a form of triage.

Is this a major safety problem that requires swift action to remedy?
Or is the nature of the failure benign?
Is the failure unique or common?
Is the failure cause known or not?

The assessment process also gathers information about the failure. Many teams attempt to gather:

Symptoms
Failure mode
Serial number of failed item
Environmental or use conditions
Warranty status
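As a rough illustration, the triage questions and the information gathered above could be captured in a single record. This is only a sketch: the class name, field names, priority labels, and the sample serial number are all hypothetical, not taken from any particular organization's system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldFailureReport:
    """One reported field failure; all field names are illustrative."""
    symptoms: str                   # what the customer observed
    failure_mode: Optional[str]     # None until analysis identifies it
    serial_number: str              # serial number of the failed item
    use_conditions: str             # environmental or use conditions
    in_warranty: bool               # warranty status
    # Answers to the triage questions
    safety_related: bool = False
    seen_before: bool = False       # common (True) or unique (False)
    cause_known: bool = False


def triage_priority(report: FieldFailureReport) -> str:
    """Return a coarse priority label (a hypothetical scheme)."""
    if report.safety_related:
        return "urgent"             # safety problems need swift action
    if report.seen_before and not report.cause_known:
        return "investigate"        # common failure, cause still unknown
    return "routine"


report = FieldFailureReport(
    symptoms="coffee maker does not heat",
    failure_mode=None,
    serial_number="SN-0001",        # made-up serial number
    use_conditions="home kitchen, daily use",
    in_warranty=True,
    seen_before=True,
)
print(triage_priority(report))
```

Note that nothing in the record asks whether the failure is a quality or a reliability issue; the customer does not supply that label.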

The response to a field failure tends to focus, as it should, on providing a remedy for the customer. The idea is to help the customer continue deriving value from the product, and to keep the customer into the future.

A secondary response tends to explore what caused the failure. For repairable units, the focus is on knowing what to repair. Beyond that, the aim is to understand what could be changed in the design or assembly process to prevent similar failures in the future.

Different teams may focus on fixing the issue, while others may focus on preventing future problems.

We do not ask the customer if the failure is a quality or reliability issue.

The Utility of Classifying a Failure

At some point in the process of dealing with field failures, in some organizations, a specific failure is deemed a quality issue or a reliability issue. This is one way to assign the problem to a team within the organization. It is also a means to track or prioritize or simply count different types of field failures.

Quality problems tend to occur early in the use of a product. Out of box or installation problems tend to become known as quality issues. When the suspected underlying root cause is primarily due to supply chain or assembly process variation, the reported failure becomes a quality issue.

The operations team deals with quality issues as part of establishing a stable supply chain and manufacturing process. Storage, transportation, and installation related issues are often dealt with by the operations team, and thus become quality issues.

Reliability problems tend to occur after some time has passed. Failures that occur after some period of normal operation tend to be considered reliability issues. When the underlying root cause is wear out related, the failures are deemed reliability problems. Sometimes, a failure due to a poor design is deemed a reliability issue regardless of when it occurs.

The design team tends to address reliability issues. At some point improvements in process control will not remedy or prevent future failures, thus requiring a design change.

The initial analysis of field failures attempts to route the information to the appropriate team to make process or design changes to best prevent future problems. This is a helpful practice.
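The routing heuristics described above might look something like the following sketch. The 30-day cutoff for "early" failures, the cause category names, and the team names are assumptions made for illustration; real organizations will draw these lines differently.

```python
from enum import Enum
from typing import Optional, Tuple


class Label(Enum):
    QUALITY = "quality"
    RELIABILITY = "reliability"


# Hypothetical cutoff for "early in the use of a product"
EARLY_LIFE_DAYS = 30

# Hypothetical names for suspected root-cause categories
PROCESS_CAUSES = {"supply_chain", "assembly", "storage",
                  "transportation", "installation"}
DESIGN_CAUSES = {"wear_out", "design"}


def route(days_in_service: int,
          suspected_cause: Optional[str]) -> Tuple[Label, str]:
    """Pick a label and an owning team for one field failure."""
    # Wear out or poor design goes to the design team, whenever it occurs.
    if suspected_cause in DESIGN_CAUSES:
        return Label.RELIABILITY, "design"
    # Supply chain and process variation goes to the operations team.
    if suspected_cause in PROCESS_CAUSES:
        return Label.QUALITY, "operations"
    # Cause not yet known: fall back on when the failure occurred.
    if days_in_service <= EARLY_LIFE_DAYS:
        return Label.QUALITY, "operations"
    return Label.RELIABILITY, "design"


print(route(0, None))            # out-of-box failure, cause unknown
print(route(400, "wear_out"))    # failure after long normal operation
print(route(5, "design"))        # early failure traced to the design
```

The label here only decides who picks up the work. It says nothing about whether the failure should be counted.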

The Problem with Splitting Failures into Quality and Reliability Buckets

The problem is not assigning the work to an appropriate team. The problem is how the organization reports field failures.

The issue I see is that field failures tagged as quality problems are, in some organizations, not counted in reliability metrics. In some organizations, quality-tagged field failures are treated as extensions of factory yield loss. Tracking how the various teams are doing at remedying customer-identified problems is fine. My issue is that splitting the information reduces the apparent magnitude of field failures. Plus, it obscures the changing rate of failures as witnessed in the field.

If 4 of every 10 field failures are tagged as quality issues, the reliability metric counts only 6 of the 10 failures customers experienced, cutting the reported number almost in half. A better practice is to report the number and changing nature of any failure that occurs as experienced by the customer. At least one of your metrics should attempt to reflect the customer experience. Recall that customers do not provide the classification of a failure; they, like me, have to find another way to brew coffee after the coffee maker fails. It does not matter if the failure occurs on the first use or the 1,000th use.
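The arithmetic behind the 4-of-10 example is simple enough to sketch. The population of 1,000 units in the field is an assumed number, used only to turn counts into rates.

```python
from collections import Counter

# Ten hypothetical field failures, tagged as in the example above
tags = ["quality"] * 4 + ["reliability"] * 6
units_in_field = 1000  # assumed population, for illustration only

counts = Counter(tags)
all_failures = sum(counts.values())       # what customers experienced
reliability_only = counts["reliability"]  # what a split metric reports

customer_rate = all_failures / units_in_field
reported_rate = reliability_only / units_in_field
understated_by = 1 - reliability_only / all_failures

print(f"customer-experienced failure rate: {customer_rate:.1%}")
print(f"reliability-only reported rate:    {reported_rate:.1%}")
print(f"understatement:                    {understated_by:.0%}")
```

Keeping the tags for routing while computing the headline metric from all failures gives both views without hiding any failure from the count.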

Another problem with labeling field failures as quality or reliability is that it limits or delays options both to fix the immediate problem and to change the design to prevent future problems.

That is right, it is always a failure of the design when a customer witnesses a failure. If the design were robust, it would survive the perils of manufacturing, storage, installation, and use. If the design included only highly capable components (think process capability or Six Sigma design concepts here), then the design would not incur field failures due to supply chain and assembly process variability.
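For readers less familiar with process capability, here is a minimal sketch of the usual Cpk index and the fraction of parts expected to fall outside the specification limits. It assumes a normally distributed characteristic, and the component values (nominal 10, limits 7 to 13) are made up for illustration.

```python
from statistics import NormalDist


def cpk(mean: float, sigma: float, lsl: float, usl: float) -> float:
    """Distance from the mean to the nearer spec limit, in units of 3 sigma."""
    return min(usl - mean, mean - lsl) / (3 * sigma)


def fraction_out_of_spec(mean: float, sigma: float,
                         lsl: float, usl: float) -> float:
    """Expected fraction outside the limits, assuming a normal distribution."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(lsl) + (1 - dist.cdf(usl))


# Hypothetical component: nominal 10.0, spec limits 7.0 to 13.0
for sigma in (1.0, 0.5):
    index = cpk(10.0, sigma, 7.0, 13.0)
    bad = fraction_out_of_spec(10.0, sigma, 7.0, 13.0)
    print(f"sigma={sigma}: Cpk={index:.2f}, out of spec={bad:.2e}")
```

Halving the variation relative to the same limits drops the expected out-of-spec fraction by several orders of magnitude, which is the sense in which a design built from highly capable components tolerates supply chain and assembly variation.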

The range of solutions for a team without design change capability is limited. Adding another process control step, or another screen or test, adds cost and complexity. These actions may be prudent and cost effective in the short term. To really fix the underlying root cause of any field failure, look to changing the design of the product.

Summary

A failure that occurs when a customer is using your product is a failure.

The focus on providing the customer a remedy for the failure is appropriate. The initial assessment and routing of the specific failure to one of potentially many teams to provide a remedy is appropriate.

What is not ok is the limiting of solutions by virtue of the initial identification of the source of the root cause. What is not ok is the altering of field failure reporting based on the classification of a failure as a quality or reliability issue.

Any failure that a customer experiences is a failure. I do not care what you call it, quality or reliability or whatever label you assign, it is a failure. The metrics you use to track field failures serve you best by representing all failures as witnessed by your customers.