The Technical Skills of a Good Reliability Engineer

The fundamental technical skills, as I see it, have to include statistics and root cause analysis skills. This skill set is one of three broad areas introduced in the article, What Makes the Best Reliability Engineer?

I would say these are the minimum technical skills for a good reliability engineer: the ability to calculate sample size requirements, understand a dataset, and correctly determine the root causes of a failure.

There are other skills that would be great to include, such as electrical, mechanical and software engineering, plus materials science, physics, and chemistry. Yet, what separates a good reliability engineer from other types of engineers is our ability to plan and analyze life tests and to truly understand how and why failures occur.

Statistics

With respect to skill level, statistics is often considered the equivalent of leaping tall buildings in a single bound.

Few enjoyed their undergraduate statistics class and recently fewer campuses require a stats course. Statistics is the language of variation and is essential for our understanding of the world our products experience.

If every product met the exact specifications of the design and only operated in one set of environmental and use conditions, we would have fewer field failures. If every failure mechanism led to failure exactly the same way within each and every product, we would have far fewer field failures.

Variability may lead to elements of a product being out of spec, or drifting or wearing to an out-of-spec condition, thus failing. Variability may also lead to changes in the stress/strength relationships, again increasing the number of failures over time.

The ability of a good reliability engineer to use available data and statistical techniques to:

Estimate sample size requirements for environmental testing
Analyze vendor life testing results
Summarize field failure and warranty datasets

is just the start of our expected statistical prowess. We also need statistical skills to:

Monitor and control processes
Design and analyze screening and optimization experiments (design of experiments)
Review and identify field failure trends and unique failure mechanisms

Your ability to use the right tool to quickly solve a problem may span statistical process control, hypothesis testing, regression analysis, and life data analysis, all before noon. That may well be a stopping-a-speeding-bullet level of skill.

You may need to master all these elements of statistics if you’re working as a lone reliability engineer, or rely on a trusted colleague if so fortunate. Either way you need to understand enough statistics to know when and how to apply this set of technical skills.
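As one small illustration of the sample size question, here is a minimal sketch using the success-run (zero-failure binomial) relationship n = ln(1 - C) / ln(R). It assumes every unit is tested for the full duration of interest and that no failures are allowed. The function name and the example targets are mine, chosen for illustration only.

```python
import math


def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Units needed to demonstrate `reliability` at `confidence` with zero failures.

    Assumes a pass/fail test in which each unit runs the full duration of
    interest and no failures are allowed (success-run relationship).
    """
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


# Hypothetical targets: 90% reliability demonstrated with 90% confidence.
print(zero_failure_sample_size(0.90, 0.90))  # 22
```

Allowing failures, or testing for longer or shorter than the duration of interest, changes the calculation, so treat this as a starting point rather than a general answer.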

Root Cause Analysis

Failure mechanisms are hard science – even the human factors related failures. Failures occur because something occurs at an atomic, molecular, code or interaction level that causes an error or fault to manifest.

Your technical skill includes understanding the range of possible errors and faults that may occur with your product and how to avoid, minimize or mitigate each one. It may not be possible to anticipate and fully understand every possible failure mechanism, thus we focus on the most likely and common, plus continue to learn about those new (or interesting) failure mechanisms that appear.

A second element to this set of skills is the ability to deduce the root cause of a failure. Given a failure, you should be able to conduct the root cause analysis to determine the underlying failure mechanism and initiating circumstance. This permits the team to take corrective action that actually works.

The skill set includes

Gathering evidence and understanding the relationships and contributing factors
Delving into the unseen elements (microscopes, cross sections, chemical analysis, etc.)
Replicating the failure at will

The root cause analysis skill may rely on tools like x-rays and thermal imaging tools, some operated by specialists, yet you need to know which tools to employ and how to interpret their results. It may be fun to explore failures in a well furnished failure analysis lab, yet you need to focus on solving the mystery of what caused the failure.

You also need to be well versed in how to proceed from the “crime scene” (the location where the failure occurred), through symptoms, to non-destructive and destructive testing. You need to build your “case” based on evidence and logic, plus a healthy dose of engineering knowledge of the fundamental elements involved.

If working as the lone reliability engineer, you certainly need to establish an ongoing relationship with a failure analysis lab. In other words, do not rely on your vendors; do the failure analysis work under your organization’s control with your own lab or contracted facility.

Get the information your team needs to solve problems or to avoid future problems by exercising your technical root cause analysis skills.

Good Reliability Work

To be good, I’m suggesting you have to have robust skills in statistics and root cause analysis. Do you agree? What else would you argue is essential to be a good reliability engineer?

by nomtbf Leave a Comment

Considering WIIFT When Reporting Reliability

WIIFT and Reliability Measures

WIIFT is “what’s in it for them”. It is similar to what’s in it for me, yet the focus is on what value you are providing your audience.

As a reliability engineer you collect, analyze and report reliability measures. You report reliability estimates or results. Do you know how your audience is going to use this information?

Consider WIIFT when reporting reliability. [Read more…]

What makes the best Reliability Engineer?

Formal education (masters or Ph.D) or design/manufacturing engineering experience?

Where do you look when hiring a new reliability engineer? Do you head to U of Maryland or another university reliability program to recruit the top talent? Or, do you promote/assign from within? Where do you find the best reliability people? [Read more…]

A World of Constant Failure Rates

What if all failures occurred truly randomly?

The math would be easier.

The exponential distribution would be the only time to failure distribution. We wouldn’t need Weibull or other complex multi parameter models. Knowing the failure rate for an hour would be all we would need to know, over any time frame.

Sample size and test planning would be simpler. Just run the samples at hand long enough to accumulate enough hours to provide a reasonable estimate for the failure rate.
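In this make-believe world, the test plan reduces to bookkeeping. A minimal sketch, assuming a constant failure rate and using made-up test hours and failure counts, divides the number of failures by the total accumulated unit-hours to get a point estimate:

```python
def estimate_failure_rate(unit_hours: list[float], failures: int) -> float:
    """Point estimate of a constant failure rate: failures per accumulated hour."""
    total_hours = sum(unit_hours)
    if total_hours <= 0:
        raise ValueError("need a positive number of accumulated hours")
    return failures / total_hours


# Hypothetical test: five units, each run for a different number of hours,
# with two failures observed in total.
hours = [1200.0, 950.0, 1500.0, 800.0, 1550.0]
rate = estimate_failure_rate(hours, failures=2)
print(f"{rate:.6f} failures per hour")
```

A point estimate alone says nothing about uncertainty, and with zero failures it returns zero, so a confidence bound would still be needed before making decisions.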

Would the Design Process Change?

Yes, I suppose it would. The effects of early life and wear out would not exist. Once a product is placed into service the chance to fail the first hour would be the same as any hour of its operation. It would fail eventually and the chance of failing before a year would solely depend on the chance of failure per hour.

A higher failure rate would suggest it would have a lower chance of surviving very long. Yet it could still fail in the first hour of use, and if it had survived for one million hours, its chance to fail in the next hour would still be the same.
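This "same chance every hour" behavior is the memoryless property of the exponential distribution. Written out, with λ as the constant failure rate, T as the time to failure, and s as the age already survived (symbols introduced here for illustration):

```latex
R(t) = P(T > t) = e^{-\lambda t}
\qquad
P(T > s + t \mid T > s) = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = R(t)
```

The age s cancels, so a unit with a million hours on it looks exactly like a new one.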

Would Warranty Make Sense?

Since by design we cannot create a product with a low initial failure rate we would only focus on the overall failure rate. Or the chance of failing over any hour, the first hour being convenient and easy to test, yet still meaningful. Any single failure in a customer’s hands could occur at any time and would not alone suggest the failure rate has changed.

Maybe a warranty would make sense based on customer satisfaction. We could estimate the number of failures over a time period and set aside funds for warranty expenses. I suppose it would place a burden on the design team to create products with a lower failure rate per hour. Maybe warranty would still make sense.
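Estimating that warranty reserve would be a short calculation. The sketch below assumes a constant failure rate and counts only each unit's first failure within the warranty period. Every number in it (units shipped, failure rate, cost per claim) is hypothetical.

```python
import math


def expected_warranty_failures(units: int, failure_rate: float, hours: float) -> float:
    """Expected first failures among `units` within `hours`, constant failure rate."""
    probability_of_failure = 1.0 - math.exp(-failure_rate * hours)
    return units * probability_of_failure


# Hypothetical numbers for illustration only.
units_shipped = 10_000
rate_per_hour = 2e-6      # assumed constant failure rate
warranty_hours = 8_760    # one year of continuous operation
cost_per_claim = 150.0    # assumed average cost of a warranty claim

failures = expected_warranty_failures(units_shipped, rate_per_hour, warranty_hours)
print(f"expected failures: {failures:.0f}, reserve: {failures * cost_per_claim:,.0f}")
```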

How About Maintenance?

If there are no wear out mechanisms (this is a make believe world) changing the oil in your car would not make any economic sense. The existing oil has the same chance of engine seize failure as any new oil. The lubricant doesn’t break down. Seals do not leak. Metal on metal movement doesn’t cause damaging heat or abrasion.

You may have to replace a car tire due to a nail puncture, yet an accident due to worn tire tread would be no more likely than with new tires. We wouldn’t need to monitor tire tread or brake pad wear. Those wouldn’t occur.

If a motor is running now and we know the failure rate, we can calculate the chance of it running for the rest of the shift, even when the motor is as old as the building.

The concepts of reliability centered maintenance or predictive maintenance or even preventative maintenance would not make sense. There would be no advantage to swapping a part for a new one, as the chance to fail would remain the same.

Physics of Failure and Prognostic Health Management – would they make sense?

Understanding failure mechanisms so we could reduce the chance of failure would remain important. Yet when the failures do not

Accumulate damage
Drift
Wear
Abrade
Diffuse
Degrade
Etc.

Then much of the predictive power of PoF and PHM would not be relevant. We wouldn’t need sensors to monitor conditions that lead to failure, as no specific failure would show a sign or indication of failure before it occurred. Nothing would indicate it was about to fail, as that would imply its chance of failure has changed.

No more tune-ups or inspections; we would pursue repairs when a failure occurs, not before.

A world of random failures, or a world of failures each of which occurs at a constant rate, would be quite different from our world. So, why do we so often make this assumption?

What Does ‘Lifetime’ as a Metric Mean

We talk about lifetimes of plants and animals. Also, you may talk about the lifetime of a product or system.

I expect to have safe and trouble free use of my car over its lifetime. Once in a while I find a warranty that says it is guaranteed over my lifetime — for as long as I own the blender, for example. [Read more…]

Time to Update Our Standards

Not our personal or moral standards, rather the set of documents we rely upon as a foundation for reliability engineering tools and techniques.

We have a wide array of standards for everything from reporting reliability test data to calculating confidence intervals on field returns. We have standards that describe various environmental conditions and appropriate testing levels suitable to evaluate your product. We define terms, concepts, processes, and techniques.

A Missing Element

Despite the many documents and impressive titles of numbers and abbreviations or acronyms, most of the standards related to reliability engineering fail to include sufficient context and rationale concerning when and why to use or modify the standard. If a specific test is to determine the expected lifetime of solder joints, well, for which types of solder joints (shape, size, configuration, material, and process) is the standard appropriate, and when does it not apply? Make the boundaries of applicability clear.

No single test works for all situations.

For example, a wrist watch standard defining how to test for specific water resistance claims does not evaluate the effects of corrosion. The standard has the watch or similar device exposed to a set of water conditions, then evaluates whether the system is operating nearly immediately after the water exposure.

We know that water encourages corrosion, yet corrosion takes time to occur. Water alone on a circuit board is no big deal (much of the time); it’s when the water facilitates the creation of additional and unwanted current paths that there is a problem. Metal migration and rusting take time to occur.

If the standard for water resistance doesn’t evaluate corrosion, and it’s one of the ways your product fails, too bad. You can ‘pass’ the test, meet the standard, add it to your data sheet, and the customer will still experience a failure.

The same is true for many environmental testing, FMEA, life testing, field data analysis, and other standards. They do not include the critical information necessary for appropriate application of the standard to your particular situation.

Connection to Value

Many, not all, standards provide a recipe to accomplish a task or evaluation. One of the values of the standard is that different teams may replicate the results of one team by repeating the steps outlined in the standard.

One of the issues with standards is they do not include how and why to actually accomplish the set of tasks and what to do with the results. In part, we need to clearly connect a task, say testing a product across a range of temperature and humidity conditions, to the meaningful information it will provide.

Don’t run the test if the information is not needed or is meaningless.

For example, if we expect that exposure to high temperature and humid conditions may increase the chance of product failure, we may want to know

how many failures will occur;
how the product will actually fail;
how the failure will initiate and progress;
when the failures occur under use conditions;

Or any number of reasons to use the results of the testing. Often we run a standard test with very few samples, experience no failures and erroneously conclude all is good. Then we are surprised when failures occur anyway once the product is in use.

The standard let us down.

The standard provided only a recipe or outline for a procedure and not the guidance and rationale on how it may or may not help us and our team resolve very real questions. Testing 3 units that all pass does not mean your solar panel will survive hot and humid conditions for 20 years with no failures. It doesn’t.
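To put a number on what three passing units can support, one option is the zero-failure binomial bound, R_lower = (1 - C)^(1/n). This is a sketch under the assumption that each unit is an independent pass/fail trial under the tested conditions. It says nothing about 20 years unless the test actually represents 20 years of use. The function name is mine.

```python
def demonstrated_reliability(units_passed: int, confidence: float) -> float:
    """Lower confidence bound on reliability after `units_passed` units, zero failures."""
    if units_passed < 1 or not 0.0 < confidence < 1.0:
        raise ValueError("need at least one unit and a confidence between 0 and 1")
    return (1.0 - confidence) ** (1.0 / units_passed)


# Compare three passing units with larger zero-failure sample sizes.
for n in (3, 22, 230):
    print(n, round(demonstrated_reliability(n, confidence=0.90), 3))
```

With three units and 90% confidence the bound is roughly 0.46, which is a long way from a claim of no failures.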

Run the test or work to accomplish a process only if it is tied to answering a question. Focus on business decisions and the questions we have to resolve in order to make better decisions (i.e., being wrong less often).

Summary

Let’s change the way we read and use standards. You may need to add the how and why, the boundaries, and the connection to value for your situation. It’s not always easy. The people writing the standard often have sufficient experience to include guidelines to help you. When possible, contact them and ask what their thinking was and what the limitations are.

If enough of us avoid simply meeting the requirements of the standard, we will

Enjoy reliable product performance
Create value for our organization with each test or task
And, eventually change how standards are written

Finding the Hidden Field Data in Your Organization

Hiding From Your Field Data Reality?

One of the major dilemmas of reliability engineering is one we really need to solve. Too many times we are trapped by our organization’s competing priorities and working with inadequate information.

We generally understand that field failure data provides the best possible representation of our product’s reliability performance. It’s data from our population of products with our customers while they apply all the stresses customers will apply to our product. Customers report the failures they care about, and not failures of little significance. [Read more…]

When Your Supplier Converts Reliability to MTBF

Oh, the trouble that will occur. The mistakes, mishaps and errors and most certainly the inability of the supplier to provide a reliability solution.

If you provide the supplier with a straightforward and complete reliability goal, and they convert it to a single number as an MTBF value, what really could go wrong? Also, why would the supplier degrade the requirement to an MTBF value? [Read more…]

What is MTBF?

The acronym MTBF is commonly known in our field as Mean Time Between Failure.

It is also associated with repairable systems in most textbooks.

It is also denoted as the theta parameter for an exponential distribution.

It is referenced as a metric for reliability, too. Oh, and it is the inverse of the failure rate.

And, it is misunderstood and misused by many. I digress, as there is plenty already written on the perils of MTBF.

What is MTBF? And where and how should it be used, if at all? [Read more…]

Holiday Break and a Few Notes

Thank you

First off I want to say thanks to you the readers of the NoMTBF blog. The notes of thanks, of encouragement, and support all propel me to write to you each week.

I especially like the stories of success helping someone ‘get it’ concerning the common misunderstandings of MTBF. I have to think your work and actions are making a difference across the field of reliability engineering. We’re making progress. [Read more…]

Predicting Failure vs. Reacting to Failure

One of the Twitter notes I sent out a few weeks ago in part read, “Celebrate failures”. And a comment came back that it was a wonderful approach that she had not thought of before. Failure will occur and when it does it is our chance to learn.

And, we need to learn. As reliability professionals, we continue to learn our entire career. New materials fail in novel manners. New assemblies fail in an assortment of ways. New designs fail due to unknown sources of variation. We will see failures. So rather than simply focus on the next try and hope to find success, let’s learn from each failure as we move toward success. [Read more…]

Thoughts on Testing One Sample and No Failures

Reliability Testing with Constraints

In some cases we have to conduct testing and are asked to not break the product. Now, that isn’t all that fun as a reliability engineer. We want to find what fails and understand it. Or, we want to confirm that what we expect to fail actually does.

So, what do we do when confronted with a very small sample size (that is one issue) and are expected to conduct failure free testing (second issue)? Let’s explore each issue separately and come up with a few suggestions on how to proceed.

Thanks to Олег (@OlegV_Ivanov) via Twitter for the article suggestion. Thanks for the idea, Олег. [Read more…]

My Thoughts on the Internet of Things and Reliability

The Impact of IoT on Reliability Engineering

Article inspired by @JillNewberg. Thanks for the suggestion, Jill.

There are two elements to this subject. First there is the reliability of the elements collecting and connecting to the internet. Second is the potential value of the connection and information. [Read more…]

Are You Doing Your Professional Reading?

Professional reading

As reliability engineers we are the local experts. We know the arcane arts of product life and equipment uptime design and maintenance. We are sought after to estimate useful life and time to first failure, and we are consulted when failures occur. [Read more…]

5 Things You Can Do Today to Avoid Using MTBF

Take Action Today to Improve How Your Organization Talks About Reliability

You know the perils of MTBF use. The widespread misunderstanding and misuse. You know about how MTBF treats your data poorly.

You also know everyone around you uses MTBF. Your industry uses MTBF. And, no one likes change, least of all about metrics concerning reliability.

As I said to a friend this morning, “The madness has to stop.”

And, you feel the same way. So, what are you going to do about it? Here are five things you can do today.

Use the data to calculate reliability (probability of success) over a duration of interest along with calculating MTBF, then share the results.
Encourage five of your colleagues to check out and subscribe to this site, www.nomtbf.com.
Ask a vendor how they determined the MTBF value they are presenting on the data sheet. What evidence supports that claim and what assumptions are included (often unstated)?
The next time you hear someone mention MTBF, ask them what they mean. Then ask what percentage of items should survive a year. If they are not consistent, you found a learning opportunity.
Write a blog post for the www.nomtbf.com site. What have you done to encourage better understanding of reliability concepts in your world? Share your hints, tips, stories, and advice here.
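For the first and fourth items, a quick calculation can start the conversation. The sketch below converts an MTBF value to a probability of surviving a duration, assuming an exponential distribution (a constant failure rate), which is the usual unstated assumption behind a lone MTBF number. The MTBF value and function name are hypothetical.

```python
import math


def reliability_from_mtbf(mtbf_hours: float, duration_hours: float) -> float:
    """Probability of surviving `duration_hours`, assuming an exponential distribution."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-duration_hours / mtbf_hours)


mtbf = 50_000.0     # hypothetical data-sheet value, in hours
one_year = 8_760.0  # hours of continuous operation in a year

print(f"R(1 year) = {reliability_from_mtbf(mtbf, one_year):.3f}")
print(f"R(MTBF)   = {reliability_from_mtbf(mtbf, mtbf):.3f}")
```

Under this assumption only about 37% of units survive to the MTBF, which tends to surprise people who read MTBF as a failure-free period.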

Pick one for today and do as many as you can. What would you add to this list? What kind of responses are you receiving when you speak out about the perils of MTBF?

Keep up the effort. Together we are making progress. Thanks for the support.
