Accendo Reliability

Your Reliability Engineering Professional Development Site


by Larry George

Markov Approximation to Standby-System Reliability

Age-specific reliability of a standby system depends on components’ failure rates. Reliability computation is interesting when part failure rates depend on age, which is what motivates having a standby system. A Markov chain approximates the age-specific reliability and availability, which are complicated to compute exactly unless you assume constant failure rates. Why not use age-specific (actuarial) rates? They are Markov chain transition rates.

Markov chain models of standby systems are not new [Carer et al., Chakravarthy, Pattavina, El-Damcese et al., George 1973 and 2007, Manglik and Ram, and others]. Most of the Markov chain references assume constant transition rates and compute steady state behavior and MTBF, not age-specific system reliability or availability. The reference by El-Damcese et al. describes a standby system with a partial failure mode. The reference by Manglik and Ram uses constant failure rates but general repair time distributions. This article describes a transient Markov chain workbook with age-specific (actuarial) failure rates to approximate age-specific, transient standby system reliability and availability.

Workbook implementation

The workbook computes the age-specific system reliability, availability, and MTBF of the cold-standby system in figure 1. It includes a discrete-time Markov approximation and an exact solution for a continuous-time system [George 1973].

Figure 1. Reliability block diagram for a cold-standby system

Markov chain transitions include, from left to right in figure 2:

  • Successful operation of part 1 for the mission time
  • Failure of part 1 and successful operation of standby part 2 for the remaining mission time
  • Failure of both parts before mission completion

Circular arrows on operating states represent part survival through one transition. The circular arrow on the failure state means there is no repair. This transition diagram doesn’t show that transition rates depend on age. 

For a standby system with a finite mission time, transition from part 1 operating to success occurs when mission time or useful life is over, assuming part 1 doesn’t fail. If part 1 fails, transition from part 2 operating to success cannot occur until the remaining mission time is over, assuming part 2 doesn’t fail.

Figure 2. State transitions of the simple standby system

Computations with constant failure rates

If failure rates were constant, then a three-state (part 1 operating, part 2 standing by or operating, failure) Markov transition matrix would be sufficient to describe the system and compute age-specific reliability and availability. 

Table 1. Markov chain transition matrix P with constant transition (failure) rates “a()” for both parts when operating.

States                        Part 1 Operating   Part 2 Operating   Failure
Part 1 operating              1-a()              a()                0
Part 2 standby or operating   0                  1-a()              a()
Failure                       0                  0                  1

The Markov chain approximation multiplies the state probability vector p(t-1) by the transition matrix P: p(t) = p(t-1)P, t = 1, 2, …, mission time, where p(t) is the state probability vector after t transitions and P is the transition probability matrix with constant transition (failure) rates. System reliability R(t) at age t is the complement of the sum of the failure-state probabilities p(s|Failure): R(t) = 1 - Σ p(s|Failure), summed over s = 1, 2, …, t. (Age-specific availability is the sum of the part 1 and part 2 operating-state probabilities at any time t.) Table 2 shows the data and the discrete Markov chain p(t-1)P system failure rate, reliability, and exact continuous-time reliability.

Table 2. Failure rates (FR) and Markov chain system reliability R(t) for cold standby independent and identical parts. “Exact R(t)” is a numerical integration of R(t)+∫f(s)R(t-s)ds, where the integral is from 0 to t.

Age   Part FR   System FR   System R(t)   Exact R(t)
0     0.1       0.0000      1.0000        1.0000
1     0.1       0.0000      1.0000        1.0000
2     0.1       0.0100      0.9900        0.9900
3     0.1       0.0180      0.9720        0.9729
4     0.1       0.0243      0.9477        0.9500
5     0.1       0.0292      0.9185        0.9227
6     0.1       0.0328      0.8857        0.8920
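
The constant-rate recursion p(t) = p(t-1)P can be sketched in a few lines of Python. This is only an illustration, not the workbook's VBA code; the function names are made up for the example. Because the failure state is absorbing, its probability after t transitions already accumulates the per-transition failure probabilities, so 1 minus that probability equals the sum form of R(t). With a() = 0.1 the sketch should reproduce the "System FR" and "System R(t)" columns of table 2 to four decimals; it does not compute the "Exact R(t)" column.

```python
def step(p, P):
    """One transition: row vector p times square matrix P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def constant_rate_reliability(a, periods):
    """Return (age, system failure probability in that period, R(age)) rows
    for the three-state chain of table 1 with constant failure rate a."""
    P = [
        [1 - a, a, 0.0],    # part 1 operating
        [0.0, 1 - a, a],    # part 2 standby or operating
        [0.0, 0.0, 1.0],    # failure (absorbing, no repair)
    ]
    p = [1.0, 0.0, 0.0]     # start new: part 1 operating
    rows = []
    previous_fail = 0.0
    for t in range(periods + 1):
        if t > 0:
            p = step(p, P)
        rows.append((t, p[2] - previous_fail, 1.0 - p[2]))
        previous_fail = p[2]
    return rows


if __name__ == "__main__":
    for age, fr, rel in constant_rate_reliability(0.1, 6):
        print(f"{age}  {fr:.4f}  {rel:.4f}")
```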

Computations with age-specific, actuarial failure rates

If transition rates depend on age, then the state space could be expanded to include ages of components and actuarial (transition) rates. The computation becomes p(t) = p(t-1)P(t), where the transition matrix P(t) contains actuarial rates conditional on the state of the system at age t. Actuarial rates are failure rates conditional on survival to age t, so they satisfy the Markov property that the transition at time t depends only on the state of the Markov process at age t-1.

Table 3 shows the transition probability matrix for a two-period mission. The first three rows and three columns define the Markov chain states. “Cal Time” stands for calendar time into the mission, and “Res Time” stands for residual mission time. The other rows and columns form the transition matrix, P(t), in terms of the age-specific part failure rates a(1) and a(2), conditional on survival up to the beginning of each age, a(t) = P[t<Life≤t+1|Life>t]. The matrix represents one event per transition.

Table 3. Markov transition matrix for a cold standby system with two-period mission time and actuarial failure rates a(1) and a(2) for parts at ages 1 and 2

Part                            1        1        2        Success  Fail
         Cal Time               1        2        2        3        3
                   Res Time     2        1        1        0        0
1        1         2            0        1-a(1)   a(1)     0        0
1        2         1            0        0        0        1-a(2)   a(2)
2        2         1            0        0        0        1-a(1)   a(1)
Success  3         0            0        0        0        1        0
Fail     3         0            0        0        0        0        1
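
As an illustration, the table 3 matrix can be written out and applied twice to the starting vector. This is a sketch rather than the workbook's implementation: the function names and the values a(1) = 0.1 and a(2) = 0.015 are assumptions chosen for the example. Read as written, the matrix gives a mission success probability of (1-a(1))(1-a(2)) + a(1)(1-a(1)).

```python
def two_period_matrix(a1, a2):
    """Transition matrix of table 3.

    State order: (part 1, cal 1), (part 1, cal 2), (part 2, cal 2),
    Success, Fail. a1 and a2 are the actuarial rates a(1) and a(2).
    """
    return [
        [0.0, 1 - a1, a1, 0.0, 0.0],     # part 1 survives, or part 2 takes over
        [0.0, 0.0, 0.0, 1 - a2, a2],     # part 1 in its second period
        [0.0, 0.0, 0.0, 1 - a1, a1],     # part 2 in its first period
        [0.0, 0.0, 0.0, 1.0, 0.0],       # Success is absorbing
        [0.0, 0.0, 0.0, 0.0, 1.0],       # Fail is absorbing
    ]


def propagate(p, P, steps):
    """Apply p <- pP the given number of times."""
    n = len(P)
    for _ in range(steps):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p


# Illustrative rates only (assumed, not taken from the workbook).
P = two_period_matrix(0.1, 0.015)
p_end = propagate([1.0, 0.0, 0.0, 0.0, 0.0], P, 2)
success, fail = p_end[3], p_end[4]

if __name__ == "__main__":
    print(f"P[Success] = {success:.4f}, P[Fail] = {fail:.4f}")
```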

Exact system reliability with age-dependent failure rates in the transition matrix is

Rsys(t) = R1(t)+∫f1(s)R2(t-s)ds, t ≥ 0,

where integration is from 0 to t, Rsys(t) represents reliability, P[System Life > t], and f1(t) is the probability density function of part 1 life. R1(t) is the probability part 1 survives the mission. The second term is the probability of failure of part 1 at age s and successful operation of part 2, R2(t-s) for the remainder of mission time t-s. The Markov approximation and exact solutions don’t agree exactly, because the Markov approximation allows at most one event per transition.
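
The convolution formula can be evaluated numerically for any part life distribution. The sketch below uses a plain trapezoid rule; it is not the workbook's VBA convolution function, and the function name and the rate of 0.1 are assumptions for illustration. For two identical parts with exponential lives (constant rate λ) the formula has the closed form Rsys(t) = exp(-λt)(1 + λt), which gives a convenient check.

```python
import math


def exact_standby_reliability(t, R1, f1, R2, n=1000):
    """Trapezoid-rule estimate of R1(t) + integral from 0 to t of f1(s) R2(t-s) ds.

    R1 and R2 are the part reliability (survival) functions and f1 is the
    probability density of part 1 life; n is the number of subintervals.
    """
    if t <= 0:
        return R1(0.0)
    h = t / n
    total = 0.5 * (f1(0.0) * R2(t) + f1(t) * R2(0.0))
    for k in range(1, n):
        s = k * h
        total += f1(s) * R2(t - s)
    return R1(t) + h * total


lam = 0.1  # assumed constant failure rate, for the check only


def survival(x):
    return math.exp(-lam * x)


def density(x):
    return lam * math.exp(-lam * x)


approx = exact_standby_reliability(5.0, survival, density, survival)
closed_form = math.exp(-lam * 5.0) * (1 + lam * 5.0)

if __name__ == "__main__":
    print(f"numerical {approx:.6f}  closed form {closed_form:.6f}")
```

Passing Weibull or empirical survival and density functions instead of the exponential ones gives the age-dependent case; no closed form is available to check against then.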

Table 4. Failure rates (FR) and Markov chain system reliability R(t) for cold standby independent and identical parts but with age-specific failure rates.

Age   Part FR   System FR   System R(t)   Exact R(t)
0     0.1       0.0000      1.0000        1.0000
1     0.015     0.0000      1.0000        1.0000
2     0.015     0.0100      0.9900        0.9900
3     0.015     0.0027      0.9873        0.9873
4     0.015     0.0028      0.9845        0.9845
5     0.02      0.0030      0.9815        0.9815
6     0.025     0.0031      0.9784        0.9785
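
One way to sketch the age-specific computation without writing out the expanded matrix is to carry the probability of each (part, age) state forward one transition at a time. This is an illustrative reading of the method, not the workbook's code; the function name and the indexing convention (rates[k] applies to a part of age k, starting at 0 as in the "Part FR" column) are assumptions. With the table 4 part failure rates it should reproduce the "System FR" and "System R(t)" columns to four decimals.

```python
def standby_reliability(rates, periods):
    """Return (age, system failure probability in that period, R(age)) rows.

    rates[k] is the actuarial failure probability of a part at age k.
    At most one event (a part failure) occurs per transition.
    """
    part1 = {0: 1.0}   # age of part 1 -> probability it is operating
    part2 = {}         # age of part 2 -> probability it is operating
    failed = 0.0
    rows = [(0, 0.0, 1.0)]
    for t in range(1, periods + 1):
        new1, new2, new_fail = {}, {}, 0.0
        for age, prob in part1.items():
            new1[age + 1] = prob * (1 - rates[age])              # part 1 survives
            new2[0] = new2.get(0, 0.0) + prob * rates[age]       # switch to part 2
        for age, prob in part2.items():
            new2[age + 1] = new2.get(age + 1, 0.0) + prob * (1 - rates[age])
            new_fail += prob * rates[age]                        # both parts failed
        part1, part2 = new1, new2
        failed += new_fail
        rows.append((t, new_fail, 1.0 - failed))
    return rows


table4_rates = [0.1, 0.015, 0.015, 0.015, 0.015, 0.02, 0.025]

if __name__ == "__main__":
    for age, fr, rel in standby_reliability(table4_rates, 6):
        print(f"{age}  {fr:.4f}  {rel:.4f}")
```

Setting every rate to 0.1 gives back the constant-rate "System R(t)" column of table 2.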

Operating instructions

Open the Markov2.xls workbook and enable the VBA computer program (convolution function) if you want the exact solution. The workbook computes state probabilities for a finite-time mission of eight time units, system reliability and failure rates as functions of age, and MTBF (mean time between failures of successive missions). It graphs failure rates (figure 3).

Figure 3. Part and system failure rates

Table 1 of the Markov2 spreadsheet contains input and output data. Put your discrete, age-specific (actuarial) part failure rates in column B. Columns C, D, and E contain results. If your mission time differs, rescale the rates to eight time units, or add more rows.

Table 2 of the Markov2 workbook contains the state probability vectors, p(t), starting with p(0), the initial probability vector. It currently represents starting new with eight time units to go. You could change it to represent other starting conditions. The other rows of table 2 contain p(t) vectors after time 0, computed by matrix multiplications p(t-1)P. Table 3 of Markov2.xlsx contains the Markov transition matrix, P. 

Table 1 of the Exact spreadsheet implements a discrete integral approximation for the exact solution. Table 2 computes the expected failure time during a failed mission, and

MTBF = 8*E[Number of missions before failure] + E[Time to failure|Mission failure].
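
The MTBF formula can be sketched as follows, under assumptions that are not stated in the article: successive missions are independent, each starts with new parts, and a failure is counted at the end of the period in which it occurs. Under those assumptions the number of successful missions before the first failed one is geometric, with mean R/(1-R), where R is the mission reliability. The function name and the input layout are hypothetical.

```python
def mission_mtbf(failure_pmf, mission_time):
    """MTBF over successive missions.

    failure_pmf maps time t (within one mission) to the probability that the
    system fails at t. Assumes independent missions that each start new.
    """
    p_fail = sum(failure_pmf.values())
    if not 0.0 < p_fail <= 1.0:
        raise ValueError("mission failure probability must be in (0, 1]")
    reliability = 1.0 - p_fail
    # Mean number of successful missions before the first failed mission.
    expected_successes = reliability / p_fail
    # E[Time to failure | Mission failure].
    time_given_failure = sum(t * q for t, q in failure_pmf.items()) / p_fail
    return mission_time * expected_successes + time_given_failure


if __name__ == "__main__":
    # Toy numbers: a two-period mission that fails at time 1 with probability 0.5.
    print(mission_mtbf({1: 0.5}, 2))
```

For the workbook's eight-unit mission, mission_time would be 8 and failure_pmf would hold the system failure probabilities by age.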

Generalizations and limitation

The Markov chain approximation generalizes to: different failure rates for parts 1 and 2, “warm” standby, more redundant parts, other mission times, repair, and parallel subsystems in series. Exact solution of these generalizations is impractical.

A VBA computer program makes the transition matrix from failure rates, part counts, and mission time, does the matrix multiplication, and prints the results in columns. In the real world, failure rates change as parts age, and age-specific failure rates require no unjustifiable, mathematically convenient assumptions to estimate and use.

If you want the Markov2 workbook for doing these computations, let me know. Request the workbook or send field data to pstlarry@yahoo.com, and I will send back estimates of your age-specific failure rates and an implementation of your standby system, free of charge. Please refer to https://sites.google.com/site/fieldreliability/ for field data alternatives.

References

Carer, P., J. Bellvis, M. Bouissou, J. Domergue, and J. Pestourie, “A new method for reliability assessment of electrical power supplies with standby redundancies,” https://www.semanticscholar.org/paper/A-new-method-for-reliability-assessment-of-power-Carer-Bellvis/c7a284a711b7db4cf15962437c050585dd9165f0/, 2002

Chakravarthy, Srinivas R., “Analysis of a k-out-of-n system with spares, repairs and a probabilistic rule,” J. Appl. Math. and Stochastic Analysis, Vol. 2006, Article ID 39093, pp. 1-23, https://www.hindawi.com/journals/ijsa/2006/039093/, 2006

Medhat Ahmed El-Damcese, Naglaa Hassan El-Sodany, “Discrete Time Semi-Markov Model of a Two Non-Identical Unit Cold Standby System with Preventive Maintenance with Three Modes,” American Journal of Theoretical and Applied Statistics, Volume 4, Issue 4, pp. 277-290, doi: 10.11648/j.ajtas.20150404.18, July 2015

George, L. L., “Diffusion Approximation for Two Channel, Poisson-Exponential Service Systems with Dependence,” Ph.D. thesis, University of California, Berkeley, 1973

George, L. L., “Markov Approximation of Standby System Redundancy,” ASQ R&M Tech Briefs, Vol. 1, No. 2, pp. 2-5, Jan. 2007

James Li, “Reliability Comparative Evaluation of Active Redundancy vs. Standby Redundancy,” International Journal of Mathematical, Engineering and Management Sciences, Vol. 1, No. 3, pp. 122–129, https://dx.doi.org/10.33889/IJMEMS.2016.1.3-013, 2016

Monika Manglik and Mangey Ram, “Reliability Analysis of a Two Unit Cold Standby System Using Markov Process,” Mathematical Sciences Research Journal, December 2013

Pattavina, Jeffrey S., “Tutorial on Analyzing High Reliability: Part 2,” Comms. Design, https://www.eetimes.com/tutorial-on-analyzing-high-reliability-part-2/, March 11, 2004


About Larry George

UCLA engineer and MBA, UC Berkeley Ph.D. in Industrial Engineering and Operations Research with minor in statistics. I taught for 11+ years, worked for Lawrence Livermore Lab for 11 years, and have worked in the real world solving problems ever since for anyone who asks. Employed by or contracted to Apple Computer, Applied Materials, Abbott Diagnostics, EPRI, Triad Systems (now http://www.epicor.com), and many others. Now working on survival analysis, epidemiology, and their applications: epidemics, randomized clinical trials, risk-based inspection, and DoE for risk equity.


Comments

  1. Larry George says

    August 18, 2022 at 3:41 PM

    Thanks for publishing the article on using age-specific failure rate functions in Markov models. Next article should be about making simultaneous, nonparametric estimates of age-specific failure rate functions, by failure mode, without life data, for use as transition rates in Markov models. It shows competing-risk modeling, without assuming independent, competing risks. Face it, competing risks are dependent: on age (of course), process, environment, usage, customer, etc.
    BTW: R(t)= 1-Sp(s|Failure) should have been R(t)= 1-SUM[p(s|Failure)]; s=1,2,…,mission time. I’ll try to use English instead of Greek.
