All articles listed in reverse chronological order.

Solving a Reliability Optimization Example

In the previous article, What is Reliability Optimization, we defined the concept. One of the elements of optimization is identifying which elements of a system to focus improvement efforts on.

Simply improving every element of a design may provide an overall improvement of reliability performance.

Given constraints such as time or funding, selecting the specific few elements that would provide the most improvement is key. [Read more…]

by Fred Schenkelberg

Design Reviews with Reliability Matter

On occasion, you and the team sit down to review the design.

The idea is to check the design for any issues with the combined wisdom of the people involved. Or, it may be a status update for the entire team providing a focus on the most important issues and action items.

The review may involve all departments, such as marketing, operations, supplier management, and the design team.

It may involve just you and the electrical engineer in a private meeting. In either case, it is a review and a chance to illuminate salient reliability issues and form a consensus on the appropriate action. [Read more…]

by Kirk Gray

Why HALT is a methodology, not equipment

It is easy to understand why the term HALT (Highly Accelerated Life Test) is so tightly coupled to the equipment called “HALT chambers”. Many do not think they can do HALT processes without a “HALT Chamber”. Many know that Dr. Gregg Hobbs, who coined the terms HALT and HASS (Highly Accelerated Stress Screens), spent much of his life promoting the techniques and was also the founder of two “HALT/HASS” environmental chamber companies. [Read more…]

by nomtbf

What makes the best Reliability Engineer?

Formal education (master’s or Ph.D.) or design/manufacturing engineering experience?

Where do you look when hiring a new reliability engineer? Do you head to the U of Maryland or another university reliability program to recruit the top talent? Or do you promote or assign from within? Where do you find the best reliability people? [Read more…]

by Fred Schenkelberg

What is Reliability Optimization?

Delivering the best reliability performance within the various constraints imposed.

Without constraints such as budget, time to market, customer expectation, product functional capabilities, and product weight, you certainly could design and deliver a highly reliable product.

There always are constraints.

In the Oliver Wendell Holmes poem, The One Hoss Shay, the deacon procures the strongest oak, the supplest leather, and the best of the best materials. Cost was not a constraint. And the shay lasted 100 years to the day.

If the technology permits, there may be stronger or more durable components available for a price, yet cost is often a limiting factor. [Read more…]

by Fred Schenkelberg

Key Elements for Your Project Specific Reliability Plan

A plan is a guide or roadmap for intended action.

A reliability plan is also a collection of specific tasks and milestones, enhanced with a rationale that allows the entire team to fully understand their role in accomplishing the reliability objectives.

The plan is a way to achieve the desired business objectives, meaning the product is reliable enough to meet customer expectations, minimize warranty expenses, and garner market acceptance. The plan is just a plan. It is the accomplishment of the tasks, the decisions that improve the design, and the signals monitored that stabilize the supply chain and assembly process that make the difference.

A plan without action is not worth the effort. [Read more…]

by nomtbf

A World of Constant Failure Rates

What if all failures occurred truly randomly?

The math would be easier.

The exponential distribution would be the only time-to-failure distribution. We wouldn’t need Weibull or other complex multi-parameter models. Knowing the failure rate for an hour would be all we would need to know, over any time frame.

Sample size and test planning would be simpler. Just run the samples at hand long enough to accumulate enough hours to provide a reasonable estimate for the failure rate.
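As a rough sketch of that test-planning arithmetic, assuming a truly constant failure rate: the point estimate is the number of failures divided by the total unit-hours accumulated, and when no failures occur, a one-sided upper bound follows from the exponential survival function. The function names and the test numbers here are illustrative assumptions, not taken from any standard.

```python
import math


def failure_rate_estimate(failures: int, unit_hours: float) -> float:
    """Point estimate of a constant failure rate, in failures per unit-hour."""
    if unit_hours <= 0:
        raise ValueError("unit_hours must be positive")
    return failures / unit_hours


def zero_failure_upper_bound(unit_hours: float, confidence: float) -> float:
    """One-sided upper bound on the failure rate after a failure-free test.

    Solves exp(-rate * unit_hours) = 1 - confidence for rate.
    """
    if unit_hours <= 0:
        raise ValueError("unit_hours must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return -math.log(1.0 - confidence) / unit_hours


# Hypothetical test: 5,000 total unit-hours accumulated across all samples.
total_unit_hours = 5000.0
print(failure_rate_estimate(2, total_unit_hours))         # 2 failures observed
print(zero_failure_upper_bound(total_unit_hours, 0.90))   # no failures observed
```

Only the total accumulated hours matter here, not how they are split among samples, which is exactly why planning would be simpler in such a world.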

Would the Design Process Change?

Yes, I suppose it would. The effects of early life and wear out would not exist. Once a product is placed into service, the chance to fail in the first hour would be the same as in any other hour of its operation. It would fail eventually, and the chance of failing before a year would depend solely on the chance of failure per hour.

A higher failure rate would suggest a lower chance of surviving very long. Yet a product could still fail in the first hour of use, and if it had survived for one million hours, its chance to fail in the next hour would still be the same.
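This "same chance in any hour" behavior is the memoryless property of the exponential distribution. A minimal sketch, assuming a hypothetical failure rate of one per million hours, compares a brand-new unit with one that has already survived a million hours:

```python
import math


def survival(rate_per_hour: float, hours: float) -> float:
    """Chance of surviving `hours` of operation at a constant failure rate."""
    return math.exp(-rate_per_hour * hours)


def conditional_survival(rate_per_hour: float, age_hours: float,
                         extra_hours: float) -> float:
    """Chance of surviving `extra_hours` more, given survival to `age_hours`."""
    return (survival(rate_per_hour, age_hours + extra_hours)
            / survival(rate_per_hour, age_hours))


rate = 1e-6  # hypothetical failure rate, failures per hour

new_unit = survival(rate, 1.0)                          # first hour of use
old_unit = conditional_survival(rate, 1_000_000.0, 1.0)  # hour 1,000,001

# The two values agree to within floating-point rounding.
print(new_unit, old_unit)
```

The age of the unit cancels out of the ratio, so past survival carries no information about the next hour.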

Would Warranty Make Sense?

Since by design we cannot create a product with a low initial failure rate, we would focus only on the overall failure rate, or the chance of failing over any hour. The first hour is convenient and easy to test, yet still meaningful. Any single failure in a customer’s hands could occur at any time and would not alone suggest the failure rate has changed.

Maybe a warranty would make sense based on customer satisfaction. We could estimate the number of failures over a time period and set aside funds for warranty expenses. I suppose it would place a burden on the design team to create products with a lower failure rate per hour. Maybe warranty would still make sense.
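A back-of-the-envelope version of that estimate might look like the following. It assumes a constant failure rate, counts only the first failure of each unit, and treats every unit as operating for the full warranty period; the rate, volume, and cost figures are made up for illustration.

```python
import math


def fraction_failing(rate_per_hour: float, hours: float) -> float:
    """Chance that a unit fails within `hours` at a constant failure rate."""
    # expm1 keeps precision when rate_per_hour * hours is small.
    return -math.expm1(-rate_per_hour * hours)


def warranty_reserve(units_shipped: int, rate_per_hour: float,
                     warranty_hours: float, cost_per_claim: float) -> float:
    """Expected warranty cost, counting one claim per failed unit."""
    expected_claims = units_shipped * fraction_failing(rate_per_hour,
                                                       warranty_hours)
    return expected_claims * cost_per_claim


# Hypothetical product: 10,000 units, 2 failures per million hours,
# one year of continuous operation under warranty, 50 per claim.
print(warranty_reserve(10_000, 2e-6, 8760.0, 50.0))
```

Lowering the failure rate per hour is the only lever in this model, which matches the burden on the design team described above.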

How About Maintenance?

If there are no wear-out mechanisms (this is a make-believe world), changing the oil in your car would not make any economic sense. The existing oil has the same chance of engine seize failure as any new oil. The lubricant doesn’t break down. Seals do not leak. Metal on metal movement doesn’t cause damaging heat or abrasion.

You may have to replace a car tire due to a nail puncture, yet an accident due to worn tire tread would be no more likely than with new tires. We wouldn’t need to monitor tire tread or brake pad wear. Those wouldn’t occur.

If a motor is running now and we know the failure rate, we can calculate the chance of it running for the rest of the shift, even when the motor is as old as the building.

The concepts of reliability centered maintenance, predictive maintenance, or even preventive maintenance would not make sense. There would be no advantage to swapping a part for a new one, as the chance to fail would remain the same.

Physics of Failure and Prognostic Health Management – would they make sense?

Understanding failure mechanisms so we could reduce the chance of failure would remain important. Yet when the failures do not

Accumulate damage
Drift
Wear
Abrade
Diffuse
Degrade
Etc.

Then much of the predictive power of PoF and PHM would not be relevant. We wouldn’t need sensors to monitor conditions that lead to failure, as no specific failure would show a sign or indication of failure before it occurred. Nothing would indicate it was about to fail, as that would imply its chance of failure has changed.

No more tune-ups or inspections. We would pursue repairs when a failure occurs, not before.

A world of random failures, or a world of failures each of which occurs at a constant rate, would be quite different than our world. So, why do we so often make this assumption?

by Fred Schenkelberg

Deciding What Should Have Fault Tolerance

In some circumstances, it is desirable to ensure the system continues to operate even if there is an internal failure. An aircraft navigation system should be able to operate even if an internal dc-dc regulator fails, for example.

Not everything within a system benefits from being fault tolerant.

For example, a failure of a cabin reading light over a passenger seat is not critical to the safe operation of the aircraft, thus is likely not created to be fault tolerant. One criterion to determine what should be fault tolerant is the criticality of the function the system provides.

This also applies to specific subsystems within a system allowing some elements to be created fault tolerant and others within the system not. [Read more…]

by Fred Schenkelberg

The Derating & Safety Margin Manual

Do you have one in your organization? Is it used regularly?

If not, your organization’s products are likely not as reliable as they should be. You are shipping products that are not as robust or reliable as your customers deserve.

Derating and Safety Factors, defined earlier, provide a means to select components or create design features that have sufficient margin to accommodate variation in use and strength over time.

So why are these tools routinely ignored or given only fleeting attention? [Read more…]

by Kirk Gray

Why Parametric Variation Can Lead to Failures and HALT Can Help

Many reliability engineers have discovered that HALT quickly finds the weaknesses and reliability risks in electronic and electromechanical systems, because thermal cycling and vibration create rapid mechanical fatigue in electronic assemblies. Assemblies that have latent defects such as cold or cracked solder joints, loose connectors or mechanical fasteners, or component package defects can be brought to a detectable, or patent, condition, which we can observe and use to improve the robustness of an electronics system.

[Read more…]

by nomtbf

What Does ‘Lifetime’ as a Metric Mean

We talk about lifetimes of plants and animals. Also, you may talk about the lifetime of a product or system.

I expect to have safe and trouble free use of my car over its lifetime. Once in a while I find a warranty that says it is guaranteed over my lifetime — for as long as I own the blender, for example. [Read more…]

by Fred Schenkelberg

Fault Tolerance Basics

Fault tolerance is the property of a system that is resilient to the failure of elements within the system. It also may be called a fail safe design.

A fault tolerant system may continue to operate just fine after one of the power supplies fails, for example. Or it may operate in a reduced or degraded state.

Other systems may have a ‘limp home’ condition, allowing the system to save critical data or allowing you to drive to a safe place to change a flat tire. [Read more…]

by Fred Schenkelberg

Do Your KPIs Adversely Impact Reliability?

Key Performance Indicators (KPIs) are measurable values related to essential business objectives.

A KPI provides a means to monitor the performance of a specific function.

In larger organizations, with sales & marketing, research & development, operations, supply chain and other teams working to bring products to market, each department has a specific role. [Read more…]

by nomtbf

Time to Update Our Standards

Not our personal or moral standards, rather the set of documents we rely upon as a foundation for reliability engineering tools and techniques.

We have a wide array of standards, covering everything from reporting reliability test data to calculating confidence intervals on field returns. We have standards that describe various environmental conditions and appropriate testing levels suitable to evaluate your product. We define terms, concepts, processes, and techniques.

A Missing Element

Despite the many documents and impressive titles of numbers and abbreviations or acronyms, most of the standards related to reliability engineering fail to include sufficient context and rationale concerning when and why to use or modify the standard. If a specific test is to determine the expected lifetime of solder joints, well, for which types of solder joints (shape, size, configuration, material, and process) is the standard appropriate, and when does it not apply? Make the boundaries of applicability clear.

No single test works for all situations.

For example, a wrist watch standard defining how to test for specific water resistance claims does not evaluate the effects of corrosion. The standard has the watch or similar device exposed to a set of water conditions, then evaluates whether the system is operating nearly immediately after the water exposure.

We know that water encourages corrosion, yet corrosion takes time to occur. Water alone on a circuit board is no big deal (much of the time). It’s when the water facilitates the creation of additional and unwanted current paths that there is a problem. Metal migration and rusting take time to occur.

If the standard for water resistance doesn’t evaluate corrosion, and it’s one of the ways your product fails, too bad. You can ‘pass’ the test, meet the standard, add it to your data sheet, and the customer will still experience a failure.

The same holds for many environmental testing, FMEA, life testing, field data analysis, and other standards. They do not include the critical information necessary for appropriate application of the standard to your particular situation.

Connection to Value

Many, not all, standards provide a recipe to accomplish a task or evaluation. One of the values of the standard is that different teams may replicate the results of one team by repeating the steps outlined in the standard.

One of the issues with standards is they do not include how and why to actually accomplish the set of tasks and what to do with the results. In part, we need to clearly connect a task, say testing a product across a range of temperature and humidity conditions, to the information it provides, and take on the task only if that information is meaningful.

Don’t run the test if the information is unnecessary or meaningless.

For example, suppose we expect that exposure to high temperature and humid conditions may increase the chance of product failure. We may want to know

how many failures will occur;
how the product will actually fail;
how the failure will initiate and progress;
when the failures occur under use conditions;

Or there may be any number of other reasons to use the results of the testing. Often we run a standard test with very few samples, experience no failures, and erroneously conclude all is good. Then we are surprised that failures occur anyway when the product is in use.

The standard let us down.

The standard provided only a recipe or outline for a procedure, and not the guidance and rationale on how it may or may not help us and our team resolve very real questions. Testing 3 units that all pass does not mean your solar panel will survive hot and humid conditions for 20 years with no failures. It doesn’t.

Run the test or work to accomplish a process only if it is tied to answering a question. Focus on business decisions and the questions we have to resolve in order to make better decisions (i.e., being wrong less often).
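To put a number on how little three passing units demonstrate, the sketch below uses the common zero-failure (success-run) binomial relationship, R^n = 1 - C. It applies only to the conditions and duration of the test itself; translating a result to 20 years of use would need an acceleration model that is not covered here. The function names and the 90% confidence level are illustrative choices.

```python
import math


def demonstrated_reliability(units_passed: int, confidence: float) -> float:
    """Lower bound on reliability after a test with zero failures.

    Solves reliability ** units_passed = 1 - confidence.
    """
    if units_passed < 1:
        raise ValueError("units_passed must be at least 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return (1.0 - confidence) ** (1.0 / units_passed)


def units_required(reliability: float, confidence: float) -> int:
    """Smallest failure-free sample size that demonstrates `reliability`."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


# Three units, all passing, at 90% confidence: roughly 0.46 reliability.
print(demonstrated_reliability(3, 0.90))

# Units needed to demonstrate 95% reliability at 90% confidence.
print(units_required(0.95, 0.90))
```

With three passing units, the data cannot rule out that more than half of all units would fail the same test.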

Summary

Let’s change the way we read and use standards. You may need to add the how and why, the boundaries, and the connection to value for your situation. It’s not always easy. The people writing the standard often have sufficient experience to include guidelines to help you. When possible, contact them and ask what their thinking was and what the limitations are.

If enough of us avoid simply meeting the requirements of the standard, we will

Enjoy reliable product performance
Create value for our organization with each test or task
And, eventually change how standards are written

by Fred Schenkelberg

Reliability as Part of Every Decision

Concurrent engineering is a common approach that pairs the development of the product design and its supporting manufacturing processes throughout the development process.

Design engineers may require the creation of new manufacturing processes to achieve specific material properties, component performance, or mechanical, electrical or software tolerances. [Read more…]