Accendo Reliability

Your Reliability Engineering Professional Development Site

by Kirk Gray

Why Parametric Variation Can Lead to Failures and HALT Can Help

Kirk Gray, Accelerated Reliability Solutions, L.L.C.

Many reliability engineers have discovered that HALT quickly finds weaknesses and reliability risks in electronic and electromechanical systems, because thermal cycling and vibration create rapid mechanical fatigue in electronic assemblies. Assemblies with latent defects such as cold or cracked solder joints, loose connectors or mechanical fasteners, or component package defects can be brought to a detectable, or patent, condition, which lets us observe the weakness and potentially improve the robustness of the electronics system. Thermal cycling creates expansion and contraction, stressing interfaces between materials with mismatched thermal coefficients of expansion (TCE). Applying vibration to an assembly, especially the pneumatic repetitive shock of HALT chambers, creates very rapid mechanical fatigue. When Gregg Hobbs, Ph.D., PE created the HALT and HASS methods back in the 1980s, digital systems were not as prevalent and bus speeds were much slower than in today’s electronics. As signal speeds continue to increase and circuit features get smaller, HALT has a potentially significant new benefit for signal integrity (SI) and operational reliability during new product development.

Today’s electronics require bus timing with ten times better resolution than the time it takes light to bounce off your nose and hit your eye, which is about 85 picoseconds. As data bus speeds increase, effects in data transmission that were once second- and third-order are becoming dominant in SI issues. These new variables may be difficult, if not impossible, to model accurately. The continued decrease in metallization dimensions and the rise in bus frequencies will increase sensitivity to fabrication variations, so SI issues are likely to become more dominant in hardware reliability. Yet the effect of these developments on operational reliability may also be more difficult to find and reproduce before thousands or millions of units are sent to the field.
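As a rough check on the 85 picosecond figure, the short Python sketch below computes the light travel time over an assumed nose-to-eye distance of one inch, along with the one-tenth interval mentioned above. The one-inch distance is an assumption for illustration, not a measured value, and the function and variable names are hypothetical.

```python
# Speed of light in vacuum (exact SI value), in metres per second.
C_M_PER_S = 299_792_458.0


def light_travel_time_ps(distance_m: float) -> float:
    """Time for light to cover distance_m in vacuum, in picoseconds."""
    return distance_m / C_M_PER_S * 1e12


# ASSUMPTION: nose-to-eye distance of about one inch (0.0254 m).
nose_to_eye_ps = light_travel_time_ps(0.0254)

# "Ten times better resolution" than that interval.
resolution_ps = nose_to_eye_ps / 10.0

# Rule-of-thumb cross-check: light covers about one foot per nanosecond.
one_foot_ps = light_travel_time_ps(0.3048)

print(f"nose to eye: {nose_to_eye_ps:.1f} ps")
print(f"one tenth of that: {resolution_ps:.2f} ps")
print(f"one foot: {one_foot_ps:.0f} ps")
```

Under that assumption the travel time comes out close to 85 picoseconds, so the timing resolution in question is on the order of single-digit picoseconds.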

SI failures often result in marginal operational reliability, or “soft failures,” where a system can be reset and then operates normally. Depending on how often these operational failures occur, the user may or may not tolerate them. When they are too frequent, the system may be returned to the manufacturer. The returned system may then be broken down and all of its subassemblies subjected to failure analysis. Tested separately, the subsystems will likely be declared “No Fault Found” (NFF), because the marginality may come only from the stack-up of parametric variations or from unique conditions of the original system in its end-use environment. To modify an old adage, “If you cannot find what broke, you cannot fix it,” and so the cause of the marginality and the returns will continue. The result is a churn of “good” parts being returned and “good” parts being sent out to replace them. The returned parts may be sent to a repair depot to be used for repair or replacement. Those parts may or may not work in a different system, depending on that system’s stack-up, and it is likely the manufacturer will never learn one of the real contributors to the high NFF rate. Of course, there are many other causes of NFF returns that are not necessarily related to hardware. If the issues come from SI and timing marginality, thermal stress to operational limits can be a very useful tool for discovering them before mass production.

We know that in mass manufacturing there will be variation in any parameter that is measured. During PWB manufacture, some dimensional variations will occur, although hopefully they are small. Dimensional variations in a PWB can affect impedance, crosstalk, noise, and EMI in the system. Dimensional expansion and contraction of the PWBA is, of course, what induces the thermo-mechanical fatigue damage during thermal cycling that has been a primary focus of HALT and HASS methodology, but dimensional variations also affect SI quality. We know from the SPC teachings of Dr. W. Edwards Deming that reducing manufacturing variation is the path to making a defect-free product, and that “six sigma” production capability is the goal. When we design and build a complex high-speed digital electronics system, we cannot necessarily know how the stack-up of all the real future variations in component manufacturing, circuit board fabrication, solder quality, and second sources of these will affect operational reliability. Yet we do know for sure that parametric variations will be created at all levels of assembly, and that their effect on operational reliability may be discovered only after large numbers of units are produced and sold.
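To make the idea of a parametric stack-up concrete, here is a minimal Python sketch. It assumes a few independent, normally distributed timing contributors and estimates what fraction of a large population would exceed an upper timing limit. Every number and name in it (the contributors, their means and standard deviations, the 540 ps limit) is hypothetical and chosen only for illustration. Real contributors may be neither independent nor normal.

```python
import math

# HYPOTHETICAL timing contributors: (mean delay, standard deviation), in picoseconds.
contributors = {
    "driver_output_delay": (120.0, 6.0),
    "pwb_trace_delay": (310.0, 9.0),
    "receiver_setup": (70.0, 4.0),
}


def stack_up(parts):
    """Return (mean, sigma) of the sum of independent contributors.

    Means add directly; standard deviations combine as root-sum-square.
    """
    mean = sum(m for m, _ in parts.values())
    sigma = math.sqrt(sum(s * s for _, s in parts.values()))
    return mean, sigma


def fraction_beyond(limit, mean, sigma):
    """Fraction of a normal population that falls above `limit`."""
    z = (limit - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


mean_ps, sigma_ps = stack_up(contributors)
upper_limit_ps = 540.0  # HYPOTHETICAL upper operational timing limit
tail = fraction_beyond(upper_limit_ps, mean_ps, sigma_ps)

print(f"stack-up mean {mean_ps:.0f} ps, sigma {sigma_ps:.2f} ps")
print(f"fraction beyond {upper_limit_ps:.0f} ps: {tail:.2e}")
```

Each contributor taken alone can sit comfortably within its own tolerance while the combined distribution still places a small fraction of systems beyond the limit, which is consistent with subassemblies testing as NFF when separated.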

The challenge of finding marginal operation during early product development is illustrated graphically in Figure 1. Early samples of a new electronics product are typically expensive and scarce, and all development teams want the limited samples. The graphic on the left side of Figure 1 represents the parametric timing distribution found with a limited number of units. With a small number of units, parametric variation near the upper and lower limits would likely remain undiscovered before the product is released from development to mass manufacturing.

 

Figure 1. Thermal stress skews timings to discover marginal conditions

The graph on the right side illustrates the larger variation potentially found during mass manufacturing and the higher probability that the stack-up of parametric variations could fall near operational limits, resulting in soft operational failures.
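The contrast between the two sides of Figure 1 can be put in numbers with a simple sketch. Assuming each unit independently has some small probability of being marginal, the chance that a sample of n units contains at least one marginal unit is 1 - (1 - p)^n. The probability of 0.1% and the sample sizes below are hypothetical values for illustration only.

```python
def prob_at_least_one(p_marginal: float, n_units: int) -> float:
    """Probability that at least one of n independent units is marginal."""
    return 1.0 - (1.0 - p_marginal) ** n_units


# HYPOTHETICAL: 0.1% of all units have a stack-up near an operational limit.
p = 0.001

prototype_chance = prob_at_least_one(p, 5)        # a handful of early samples
production_chance = prob_at_least_one(p, 10_000)  # an early production volume
expected_in_million = p * 1_000_000               # expected marginal units shipped

print(f"5 prototypes: {prototype_chance:.4f}")
print(f"10,000 units: {production_chance:.5f}")
print(f"expected marginal units per million: {expected_in_million:.0f}")
```

With these assumed numbers, five prototypes would almost never include a marginal unit, while a production run of ten thousand almost certainly would.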

The benefits of thermal stress in inducing mechanical fatigue to expose mechanical and material weaknesses are well established, but there is another aspect of thermal stimulation that may become more important in the future for assuring reliable operation of high-speed digital systems. A little-known fact to those who have not performed real thermal HALT on digital electronics is that it almost always ends with finding only an operational limit. It is very rare to find a thermal destruct level in digital systems such as IT hardware. Hot and cold thermal stress causes impedance shifts and signal propagation shifts in conductors and semiconductors, resulting in “skewing” of signals throughout the system. This is probably why thermal HALT on most digital systems finds an operational limit and not a destruct limit. At the thermal operational limits the SI fails and a lock-up or shutdown occurs, but the system can easily be reset when the stress is removed.

The graphic in Figure 2 represents how, by stressing a small number of samples to empirical thermal limits, we can skew the system’s signal propagation timings. Higher temperatures slow signals and cold increases signal speeds. By thermally stressing a small number of samples we can observe the hot and cold thermal operating limits, and this can be repeated many times without causing catastrophic damage. Marginal operational reliability may appear later, from the worst-case stack-up of parametric variations in a small percentage of products, when thousands or millions are produced. As manufacturing volumes ramp up, a wider distribution of parametric variations may extend near or over the stable operational limit, as shown in the right graphic of Figure 1. Of course, stimulating timing variations with thermal stress on a system moves all the components’ parametric skew in one direction, either slower or faster. In the larger mass-manufacturing population, lot-to-lot and second-source component variation produces a mix of high- and low-speed distributions. The rapid thermal cycling of HALT chambers helps discover more of this mixing by differentially skewing timings across a PWBA. Very fast air temperature transitions produce thermal gradients across the PWBA: low-mass components change temperature faster than larger-mass or high-wattage components, resulting in a mix of temperatures across the assembly. An even more detailed understanding of the risk from variations in timing distributions could be gained by heating and cooling active components individually, which is also a good way to isolate a limiting component found during thermal HALT.

 

Figure 2. Thermal stress skews signal timings
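The relationship between thermal operating limits and timing margin can be sketched with a deliberately simple model. The Python example below assumes that signal delay changes linearly with temperature (slower when hot, faster when cold) and solves for the temperatures at which a unit's delay reaches its lower and upper timing limits. The linear model, the temperature coefficient, the delays, and the limits are all hypothetical. Real systems will behave in more complicated ways, and their limits are found empirically.

```python
def delay_at(temp_c, delay_25c_ps, tempco_per_c):
    """Delay at temp_c, assuming a linear change around 25 C (an assumption)."""
    return delay_25c_ps * (1.0 + tempco_per_c * (temp_c - 25.0))


def thermal_operating_limits(delay_25c_ps, tempco_per_c, lower_ps, upper_ps):
    """Temperatures where delay reaches the lower (cold) and upper (hot) limits.

    Requires tempco_per_c > 0, i.e. heat slows signals and cold speeds them up.
    """
    cold = 25.0 + (lower_ps / delay_25c_ps - 1.0) / tempco_per_c
    hot = 25.0 + (upper_ps / delay_25c_ps - 1.0) / tempco_per_c
    return cold, hot


# HYPOTHETICAL values: 0.1% delay change per degree C, limits at 460 and 540 ps.
TEMPCO = 0.001
LOWER_PS, UPPER_PS = 460.0, 540.0

# A nominal unit and a unit whose parametric stack-up makes it 20 ps slower.
nominal_cold, nominal_hot = thermal_operating_limits(500.0, TEMPCO, LOWER_PS, UPPER_PS)
slow_cold, slow_hot = thermal_operating_limits(520.0, TEMPCO, LOWER_PS, UPPER_PS)

print(f"nominal unit operates from {nominal_cold:.0f} C to {nominal_hot:.0f} C")
print(f"slow unit operates from {slow_cold:.0f} C to {slow_hot:.0f} C")
```

In this model the slower unit reaches its hot operating limit at a much lower temperature than the nominal unit, so a reduced thermal operating margin acts as a visible proxy for reduced timing margin.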

Examples of the benefits of HALT techniques for finding software issues have been documented by Allied Telesis. Donovan Johnson and Ken Franks of Allied Telesis published a white paper several years ago, “Software Fault Isolation using HALT and HASS,” on how HALT has helped them discover reliability issues due to software. In the paper they give examples of significantly increasing thermal operational margins and limits through software changes alone. Please download and read it. Most companies have not realized that thermal HALT has so much potential for rapid discovery of operational issues, not just catastrophic hardware failures.

The benefits of HALT for finding mechanical issues in electronics assemblies have been well established over the last several decades. As the speed and density of electronics continue to increase, operational reliability may become more sensitive to manufacturing variations that produce parametric variations, leading to marginal SI and operational reliability. Along with the traditional, established benefits of HALT, there is a growing benefit in using thermal HALT to find how the parametric variations that will ultimately occur in mass manufacturing affect operational reliability.

 

Filed Under: Articles, NoMTBF

About Kirk Gray

My Passion for developing reliable products

Why did it fail?

This is the fundamental question that has driven my career, from first repairing electronics in the 1970s to today. It was from this perspective that my passion for reliability engineering grew: investigating, discovering, and understanding why products fail. Starting with how electronics systems actually fail (empirical, not theoretical) gave me a frame of reference for understanding ways to rapidly discover failure mechanisms.


Comments

  1. Mark Powell says

    October 15, 2012 at 12:41 PM

    Kirk,

    My head is not that big. Light travels about 1 foot in 1 nanosecond.

    Mark Powell

    • Kirk Gray says

      October 15, 2012 at 6:41 PM

      Mark, Thanks for catching that error. I have corrected it. It should have been 85 picoseconds.
      Regards, Kirk
