“My software never has bugs. It just develops random features.” Anonymous
More and more mechanical and electrical systems include integrated software. The FMEA methodology applies to software just as well as it applies to hardware. It is possible to include software functionality in the System FMEA as part of the functional descriptions. However, for complex software functionality, such as embedded control systems, it may be useful to perform a separate Software FMEA.
What is a Software FMEA?
Software FMEA is a type of Design FMEA that analyzes the software elements of a system, focusing on potential software-related deficiencies, with emphasis on improving the software design and ensuring that product operation is safe and reliable during its useful life.
Pete Goddard, in his paper “Software FMEA Techniques” (RAMS 2000) wrote, “Software FMEA assesses the ability of the system design, as expressed through its software design, to react in a predictable manner to ensure system safety.”
Software FMEA is similar to System or Design FMEA, with the exception that Software FMEA focuses primarily on software functions.
What are objectives for Software FMEA?
Objectives for software FMEA include:
• Identifying missing software requirements
• Analyzing a system’s behavior as it responds to a request that originates from outside of that system
• Identifying (and mitigating) single-point failures that can result in catastrophic failures
• Identifying features that need fault-handling strategies
• Identifying software response to hardware anomalies
What is the difference between Software FMEA and hardware or electrical FMEAs?
All types of FMEAs are grounded in similar principles and fundamentals. The primary difference with software FMEA is the focus of the analysis.
For example, in the case of a software module that provides a warning for low-level of windshield washer fluid, the Item in the Software FMEA can be “Low washer fluid level warning software” and the Function might be “Communicate low fluid level to instrument panel.” A single-line excerpt from this software FMEA example is in the updated SAE J1739 FMEA Standard, due to be published in January 2021.
Other differences between Hardware and Software FMEAs include:
1. Software failure modes are analyzed from modes of operation that are unique to software, compared to hardware
2. Software has a unique set of failure mode and cause categories
3. Software FMEA analyzes how software reacts to hardware failures
4. There can be minor differences between the Design FMEA and Software FMEA templates
Software Modes of Operation
Understanding the unique modes of operation for software helps ensure nothing important is missed. [1]
1. Functional (new requirements or significant revision)
2. Interface (complex hardware or software interfaces)
3. Detailed (high-risk items)
4. Maintenance (older legacy system prone to errors)
5. Usability (when user misuse can impact system reliability)
6. Serviceability (mass distribution or difficult installation location)
7. Vulnerability (risk from hacking or abuse)
8. Production (system schedule is disrupted by software production process)
[1] Recognition is due to Ann Marie Neufelder, whose book “Effective Application of Software Failure Modes Effects Analysis – 2nd Edition” is an excellent resource for anyone performing Software FMEAs. (Copyright © 2017 by Softrel, LLC., published by Quanterion Solutions, Inc., Utica, New York)
Software Common Cause Categories
What are some of the unique software cause categories? [1]
1. Missing error detection (missing in specifications, design and code)
2. Specifications missing important details (review process can miss what is NOT in code, design or requirements)
3. Faulty state transitions (leads to dead states, inadvertent or prohibited state transitions, etc.; illustrated in the sketch after this list)
4. Faulty logic (when software fails to consider all possibilities from logic perspective)
5. One size fits all error recovery (detecting and recovering from errors should be specific to circumstances)
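To make categories 1 and 3 more concrete, here is a minimal sketch in C of a state-transition function for the hypothetical low-washer-fluid warning module mentioned earlier. The state names, thresholds, and debounce/hysteresis behavior are illustrative assumptions, not taken from any standard; the point is that every (state, input) combination is handled explicitly, so neither error detection nor a needed transition is left missing.

```c
#include <stdio.h>

/* Hypothetical states for a low-washer-fluid warning module (illustrative only) */
typedef enum { WARN_OFF, WARN_PENDING, WARN_ON, WARN_FAULT } warn_state_t;

/* Transition function: every (state, input) pair is handled explicitly.
   Omitting the sensor check would be "missing error detection";
   clearing WARN_ON on a single noisy reading would be a "faulty state transition". */
static warn_state_t next_state(warn_state_t state, int level_percent, int sensor_ok)
{
    if (!sensor_ok) {
        return WARN_FAULT;            /* sensor anomaly: report it, do not guess */
    }
    switch (state) {
    case WARN_OFF:
        return (level_percent < 15) ? WARN_PENDING : WARN_OFF;
    case WARN_PENDING:
        return (level_percent < 15) ? WARN_ON : WARN_OFF;   /* debounce with a second reading */
    case WARN_ON:
        return (level_percent > 25) ? WARN_OFF : WARN_ON;   /* hysteresis before clearing */
    case WARN_FAULT:
    default:
        return WARN_FAULT;            /* unknown or dead state: stay in the fault state */
    }
}

int main(void)
{
    warn_state_t s = WARN_OFF;
    s = next_state(s, 10, 1);   /* first low reading -> WARN_PENDING */
    s = next_state(s, 10, 1);   /* confirmed low     -> WARN_ON      */
    s = next_state(s, 10, 0);   /* sensor anomaly    -> WARN_FAULT   */
    printf("final state = %d\n", s);
    return 0;
}
```

A review against the cause categories above would ask: is there a reachable state with no exit (dead state), is any input combination unhandled, and is the recovery action specific to the circumstance rather than one-size-fits-all?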
Level of detail
Software FMEA can be applied at the system functional level, at the detailed design (logic) level, or at the code level.
Similar to System FMEAs, software system FMEAs should be performed early in the design process, as soon as the software design team has determined the initial software architecture and transferred the functional requirements to the software design. Software FMEAs at the detail level are typically done later in the software design process, when detailed design descriptions and preliminary code exist.
What precedence guidelines can be used to address software problems?
When identifying Recommended Actions, Software FMEA teams can use the following precedence suggestions to ensure the software is fail-safe and accomplishes its functions, with heightened focus on potential hazardous outcomes. Special attention should be paid to identifying any need for new or modified software requirements. A brief sketch of option (b) appears after the list. [2]
a. Design out the failure mode
b. Use redundancy to achieve fault tolerance
c. Go into fail-safe mode (for example, the ability to “limp home”)
d. Implement early prognostic warning
e. Implement training to reduce risk for human error
[2] Reference article titled “Software FMEA: A Missing Link in Design for Robustness,” by Dev Raheja, copyright 2003 by SAE International
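As one concrete illustration of option (b), redundancy to achieve fault tolerance, here is a minimal sketch in C of a 2-out-of-3 voter over redundant sensor readings. The function name, tolerance, and sensor values are hypothetical assumptions for this example; the idea is that a single faulty channel is out-voted, and when no two channels agree the caller is told to fall back to a fail-safe or limp-home mode (option c).

```c
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical 2-out-of-3 voter: accept a value only when at least two
   of the three redundant readings agree within a tolerance. */
static bool vote_2oo3(double a, double b, double c, double tol, double *out)
{
    if (fabs(a - b) <= tol) { *out = (a + b) / 2.0; return true; }
    if (fabs(a - c) <= tol) { *out = (a + c) / 2.0; return true; }
    if (fabs(b - c) <= tol) { *out = (b + c) / 2.0; return true; }
    return false;   /* no two channels agree: caller must fail safe */
}

int main(void)
{
    double t;
    if (vote_2oo3(90.2, 90.4, 140.0, 1.0, &t)) {
        printf("voted value: %.1f (faulty channel out-voted)\n", t);
    } else {
        printf("no agreement: enter fail-safe / limp-home mode\n");
    }
    return 0;
}
```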
What is “fail safe” and how is it used in software?
The software should always go to the desired state, no matter what causes the software to malfunction. If a desired state is not identified in the specification, the software should always go into a fail-safe state. A fail-safe state is one in which, in the event of failure, the system responds in a way that causes minimal harm to other devices and minimal danger to personnel.
Software programmers should consider fail-safe strategies to help ensure systems are safe and robust.
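As a minimal sketch of the fail-safe idea, assume a hypothetical valve-control module that receives commands over a bus. Any command value not named in the specification (corrupted, out of range, or simply never specified) drives the system to its defined fail-safe state instead of being silently ignored; the command names and the fail-safe action below are illustrative assumptions.

```c
#include <stdio.h>

/* Hypothetical actuator commands; closing the valve is assumed to be the fail-safe action. */
typedef enum { CMD_OPEN = 0, CMD_CLOSE = 1, CMD_HOLD = 2 } valve_cmd_t;

static const char *apply_command(int raw_cmd)
{
    switch (raw_cmd) {
    case CMD_OPEN:  return "valve opened";
    case CMD_CLOSE: return "valve closed";
    case CMD_HOLD:  return "valve held";
    default:
        /* A command not identified in the specification leads to the fail-safe state,
           with the fault latched and reported rather than guessed around. */
        return "fail-safe: valve closed, fault latched, operator notified";
    }
}

int main(void)
{
    printf("%s\n", apply_command(CMD_OPEN));
    printf("%s\n", apply_command(42));   /* corrupted or unspecified command */
    return 0;
}
```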
What standard is used?
There is no universally agreed-upon standard for performing software FMEAs, although it is the subject of various standards committees. Practitioners who will be performing software FMEA projects are encouraged to read the software standards and articles that apply to their projects. In addition, it is essential for any software FMEA team to include a subject matter expert in the specific software systems that are being analyzed.
As mentioned above, Ann Marie Neufelder’s book “Effective Application of Software Failure Modes Effects Analysis” is an excellent resource for anyone performing Software FMEAs.
Tips
Understanding software potential trigger events can improve the effectiveness of Software FMEA and ensure important issues are addressed.
1. Consider potentially critical hardware failures or user misuse when performing Software FMEA. Software must be robust to potential hardware failures or user misuse, and must default to a safe condition (a brief sketch follows this list).
2. Potentially critical hardware failures or user misuse can be derived from System FMEA, lower-level FMEAs, Hazard Analysis or Fault Tree Analysis.
3. Lack of robustness or lack of default to a safe condition can be inputs to Software FMEA cause descriptions. Note that Software FMEA causes should be expressed as potential software deficiencies.
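As a brief sketch of tip 1, the C fragment below shows a plausibility check on a hardware input: a reading outside the physically possible range is treated as a hardware anomaly and replaced by a safe default while the fault is recorded. The signal name, range limits, and choice of safe value are illustrative assumptions for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical wheel-speed input in km/h; limits and safe value are assumptions. */
#define SPEED_MIN_KMH   0.0
#define SPEED_MAX_KMH 300.0
#define SPEED_SAFE_KMH  0.0

static double plausible_speed(double raw_kmh, bool *fault)
{
    if (raw_kmh < SPEED_MIN_KMH || raw_kmh > SPEED_MAX_KMH) {
        *fault = true;             /* record the anomaly for diagnostics */
        return SPEED_SAFE_KMH;     /* default to the safe condition      */
    }
    *fault = false;
    return raw_kmh;
}

int main(void)
{
    bool fault;
    printf("%.1f km/h (fault=%d)\n", plausible_speed(57.0, &fault), fault);
    printf("%.1f km/h (fault=%d)\n", plausible_speed(-40.0, &fault), fault);
    return 0;
}
```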
Next Article
Hazard analysis is the process of examining a system throughout its life cycle to identify inherent safety related risks. The next article in the Inside FMEA series discusses the application of Hazard Analysis.
Angel Montañez says
Good evening
Will you have information on an RFMEA, or Reverse FMEA?
Thank you
Angel
Carl Carlson says
Hello Angel,
I am aware of “Reverse FMEA,” but from what I have read and understand, it is not a method that I advocate.
One definition is “Reverse FMEA (R FMEA) is a structured process of continuous improvement that is aimed at ensuring the permanent updating and progress of an FMEA (Failure Mode and Effect Analysis) study. This risk assessment method is based on the actual situation and not predictive reliability.”
My personal view is that if Design FMEA and Process FMEA are done well, there is no need to use this variant.
Carl
Paul Ortais says
Hello Carl,
As a real-time control system architect, most projects I contributed to were badly impacted by software production methods that, seemingly unavoidably, treat every problem as an event-driven issue.
I understand this is culturally linked to the existing consumer computers, but real-time is something quite orthogonal: time dependency first vs data dependency.
Failures (“bugs”) occur when reality didn’t comply with the designer’s expectations = the specification. Everything was ok according to SW FMEA, so I examined how things are specified, coded and checked and, by the way, all is consistent with a list of asynchronous “services”, generally based on extremely thick, opaque OS and libraries. But Time is far from a priority, even when some events are timed.
When a µP and an FPGA are coupled for redundancy, and the FPGA is programmed by a HW engineer, the result is a straight, clean and extremely robust synchronous FSM implementation, where the actual STATE of the system is well known, and entering “else” states is immediately reported in the best designs.
In SW failures, it can take months to discover what the program was doing “around” the moment of the failure; the notion of system state is squarely ignored, as well as what is done in the suspected libraries…
I ended up specifying SW as sync FSMs, with all the observability I need from a HW design, and working this way saved me unimaginable amounts of setbacks and frustration.
I am interested in reading your opinion on this
Thank you,
Paul
Carl Carlson says
Hello Paul,
I appreciate your sharing your software experiences and insights. I do have a few comments, and they will be centered around SW FMEA.
I like your statement, “Failures (“bugs”) occur when reality didn’t comply with the designer’s expectations = the specification.” However, you also say, “Everything was ok according to SW FMEA.” Based on what you write in the post, it seems the SW FMEA could have been more helpful. For example, did the SW FMEA identify missing or incorrect software requirements? This is one of its objectives. You ended up specifying SW as sync FSMs, and mention that saved a lot of time. I’m wondering why the SW FMEA did not help with that improved specification.
Also, you say, “In SW failures, it can take months to discover what the program was doing “around” the moment of the failure.” I’m sure you are right. You go on to say, “the notion of system state is squarely ignored, as well as what is done in the suspected libraries…” SW FMEA should not ignore system state, and should examine the relationship with libraries.
I would take a look at the quality of the SW FMEA. I have not examined the SW FMEA, so this is just a suggestion.
Thanks again for your excellent post.
Carl