The Technical Skills of a Good Reliability Engineer
The fundamental technical skills, as I see it, have to include statistics and root cause analysis skills. This skill set is one of three broad areas introduced in the article, What Makes the Best Reliability Engineer?
I would say these are the minimum technical skills for a good reliability engineer: the ability to calculate sample size requirements, understand a dataset, and correctly determine the root causes of a failure.
There are other skills that would be great to include, such as electrical, mechanical, and software engineering, plus materials science, physics, and chemistry. Yet what separates a good reliability engineer from other types of engineers is our ability to plan and analyze life tests and to truly understand how and why failures occur.
Statistics
Mastering statistics is often regarded as a feat on par with leaping tall buildings in a single bound.
Few enjoyed their undergraduate statistics class, and recently fewer campuses require a stats course. Yet statistics is the language of variation and is essential for our understanding of the world our products experience.
If every product met the exact specifications of the design and only operated in one set of environmental and use conditions, we would have fewer field failures. If every failure mechanism led to failure exactly the same way within each and every product, we would have far fewer field failures.
Variability may lead to elements of a product being out of spec, or drifting or wearing to an out-of-spec condition, thus failing. Variability may also lead to changes in the stress/strength relationships, again increasing the number of failures over time.
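To make the stress/strength point concrete, here is a minimal sketch of the classic interference calculation. It assumes stress and strength are independent and normally distributed, which is my simplification for illustration and not something every product satisfies. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def interference_probability(strength_mean, strength_sd, stress_mean, stress_sd):
    """Probability that stress exceeds strength.

    Assumes (for illustration only) that stress and strength are
    independent and normally distributed, so the margin
    (strength - stress) is also normal.
    """
    margin_mean = strength_mean - stress_mean
    margin_sd = sqrt(strength_sd ** 2 + stress_sd ** 2)
    # Failure occurs when the margin falls below zero.
    return NormalDist(margin_mean, margin_sd).cdf(0.0)


# Hypothetical numbers: same means, but the second case has twice the spread.
baseline = interference_probability(100.0, 5.0, 70.0, 5.0)
wider = interference_probability(100.0, 10.0, 70.0, 10.0)
print(f"baseline: {baseline:.2e}, wider spread: {wider:.2e}")
```

The means are identical in both cases. Only the variability changed, yet the chance of stress exceeding strength rises by orders of magnitude.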
The ability of a good reliability engineer to use available data and statistical techniques to:
- Estimate sample size requirements for environmental testing
- Analyze vendor life testing results
- Summarize field failure and warranty datasets
is just the start of our expected statistical prowess. We also need statistical skills to:
- Monitor and control processes
- Design and analyze screening and optimization design of experiments
- Review and identify field failure trends and unique failure mechanisms
Your ability to use the right tool to quickly solve a problem may span statistical process control, hypothesis testing, regression analysis, and life data analysis, all before noon. That may well be a stopping-a-speeding-bullet level of skill.
You may need to master all these elements of statistics if you’re working as a lone reliability engineer, or you may rely on a trusted colleague if you are so fortunate. Either way, you need to understand enough statistics to know when and how to apply this set of technical skills.
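As one example of estimating sample size requirements, here is a minimal sketch using the success-run (zero-failure) relationship n = ln(1 - C) / ln(R). It assumes a pass/fail test in which no failures are allowed and each unit is an independent trial representing the life of interest. Those are assumptions for illustration, and the function name is hypothetical. Your test plan may call for a different approach.

```python
from math import ceil, log


def success_run_sample_size(reliability, confidence):
    """Units to test with zero failures to demonstrate `reliability`
    at the given `confidence`, under a binomial (pass/fail) model.

    Assumes no failures are allowed and each unit is an independent trial.
    """
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be between 0 and 1")
    # Solve reliability**n <= 1 - confidence for the smallest integer n.
    return ceil(log(1.0 - confidence) / log(reliability))


print(success_run_sample_size(0.90, 0.90))  # 22
print(success_run_sample_size(0.95, 0.90))  # 45
```

Notice how quickly the sample size grows as the reliability target rises. That is often the first conversation to have with the team when planning an environmental test.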
Root Cause Analysis
Failure mechanisms are hard science, even the human-factors-related failures. Failures occur because something happens at an atomic, molecular, code, or interaction level that causes an error or fault to manifest.
Your technical skill includes understanding the range of possible errors and faults that may occur with your product and how to avoid, minimize, or mitigate each one. It may not be possible to anticipate and fully understand every possible failure mechanism, thus we focus on the most likely and common, plus continue to learn about those new (or interesting) failure mechanisms that appear.
A second element of this set of skills is the ability to deduce the root cause of a failure. Given a failure, you should be able to conduct the root cause analysis to determine the underlying failure mechanism and initiating circumstance. This permits the team to take corrective action that actually works.
The skill set includes:
- Gathering evidence and understanding the relationships and contributing factors
- Delving into the unseen elements (microscopes, cross sections, chemical analysis, etc.)
- Replicating the failure at will
The root cause analysis skill may rely on tools like x-ray and thermal imaging, some operated by specialists, yet you need to know which tools to employ and how to interpret their results. It may be fun to explore failures in a well-furnished failure analysis lab, yet you need to focus on solving the mystery of what caused the failure.
You also need to be well versed in how to proceed from the “crime scene” (or instance of failure location), through symptoms, to non-destructive and destructive testing. You need to build your “case” based on evidence and logic, plus a healthy dose of engineering knowledge of the fundamental elements involved.
If working as the lone reliability engineer, you certainly need to establish an ongoing relationship with a failure analysis lab. In other words, do not rely on your vendors. Do the failure analysis work under your organization's control, with your own lab or a contracted facility.
Get the information your team needs to solve problems or to avoid future problems by exercising your technical root cause analysis skills.
Good Reliability Work
To be good, I’m suggesting you have to have robust skills in statistics and root cause analysis. Do you agree? What else would you argue is essential to be a good reliability engineer?