Markov Chain Modeling – Just the Basics

Abstract

Chris and Fred discuss Markov Chain modeling, where we model a system transitioning from one state to another, and which is often used for availability. How and when do I use it? Does it work in today’s reliability applications? Listen here to learn more.

Key Points

Join Chris and Fred as they discuss Markov Chain modeling – which is often taught in universities but may not be as practically useful. A Markov Chain is a series of ‘states’ that could (for example) represent a system fully functional (state 1), degraded (state 2) and failed (state 3). You can have as many states as you like. Then there are transition rates between each state – which must remain constant. This starts to become useful when you want to model failure AND things like repair – where you transition from a failed state to a functional state. Does this work?

Topics include:

How are Reliability Block Diagrams (RBDs) and fault trees different? These traditional modeling mechanisms tend to focus on failure – and not going back to a functional state, degraded state, or any other state you define.
… but Markov Chains are ‘memoryless’ or ‘ageless.’ This means that the transition rates are constant. They don’t change with respect to time – or system age. Nor do they change based on how long your system has been in each state.
So Markov Chains are primarily used for steady-state availability: working out the likely long-term probabilities of your system being in any one of the states you defined.
Are they like Petri nets? No. Petri nets may look like Markov Chains, but instead, the model is based on tokens moving through the chain to understand system behavior.
Other ways? Try Agent-Based Models. Relatively simple rules can then treat your system as an ‘agent’ … and then you can control the way your systems behave in much greater detail.
What about Markov Chain Monte Carlo (MCMC) simulations? If you have heard of this – don’t worry! MCMC is a simulation technique based IN PART on Markov chains to help create a representative sample of random variable values from a given function (like a probability density function). This is useful for solving some complex calculations, including some in reliability engineering, but is not what we are talking about here!
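
To make the steady-state availability idea concrete, here is a minimal Python sketch of the three-state example above (functional, degraded, failed) with a repair path from failed back to functional. The three rates are hypothetical values chosen purely for illustration, and the iteration used (converting the rates into a one-step probability matrix, often called uniformisation) is just one simple way to reach the long-run probabilities, not the method discussed in the episode.

```python
# Hypothetical transition rates (per hour), for illustration only.
RATE_DEGRADE = 0.01  # functional -> degraded
RATE_FAIL = 0.05     # degraded -> failed
RATE_REPAIR = 0.5    # failed -> functional (repair)

# Generator matrix Q: off-diagonal entries are the constant transition rates,
# and each diagonal entry makes its row sum to zero.
Q = [
    [-RATE_DEGRADE, RATE_DEGRADE, 0.0],
    [0.0, -RATE_FAIL, RATE_FAIL],
    [RATE_REPAIR, 0.0, -RATE_REPAIR],
]


def steady_state(Q, steps=5000):
    """Approximate the long-run state probabilities of the chain."""
    n = len(Q)
    # P = I + Q / scale is a valid one-step probability matrix that has
    # the same long-run distribution as the continuous-time chain.
    scale = 2.0 * max(-Q[i][i] for i in range(n))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / scale for j in range(n)]
         for i in range(n)]
    pi = [1.0] + [0.0] * (n - 1)  # start fully functional
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


pi = steady_state(Q)
availability = pi[0] + pi[1]  # functional or degraded both count as "up"
print(f"steady-state availability = {availability:.4f}")
```

With these assumed rates the system spends most of its time functional, a smaller share degraded, and only a small share failed. Whether the degraded state counts as "available" is a modeling choice; here it does.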

Enjoy an episode of Speaking of Reliability, where you can join friends as they discuss reliability topics. Join us as we discuss topics ranging from design for reliability techniques to field data analysis approaches.

Speaking Of Reliability: Friends Discussing Reliability Engineering Topics | Warranty | Plant Maintenance

SOR 495 Markov Chain Modeling - Just the Basics



About Christopher Jackson

Chris is a reliability engineering teacher ... which means that after working with many organizations to make lasting cultural changes, he is now focusing on developing online, avatar-based courses that will hopefully make the 'complex' art of reliability engineering into a simple, understandable activity that you feel confident of doing (and understanding what you are doing).

Comments

David W Coit says

February 5, 2020 at 11:43 AM

I enjoyed the podcast, but as an ivory-tower academic, who occasionally teaches Markov Chains, I need to make a few comments.

First off, it was informative, so no complaints, but I want to make one correction and several comments. The transition rates from state to state are constant, but this does NOT imply the rate of failure (hazard rate) is constant unless there are only two states (working-failed). As the number of states between fully functional and failed increases, and as the transition rates differ from one state-to-state transition to another, there can be a wide variety of hazard rate functions. For example, if there are 3 states (fully working-partial-failed) and the transition rates are the same, then the failure time distribution is gamma (or more specifically k-Erlang). The gamma distribution can behave similarly to a Weibull. As another example, consider 4 states where the state transition rates are not the same: they get higher as the system gets closer to the failed state. Then, once it gets closer to failure (states 2 or 3), failure happens quickly, resulting in an increasing hazard rate.
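
The hazard rate point can be checked numerically. The sketch below is only an illustration under stated assumptions: the function name `hazard_curve`, the rate values, and the simple forward-Euler time stepping are all choices made here, not anything taken from the podcast. It follows a chain with no repair through its working states and reports the hazard rate as the rate of entering the failed state divided by the probability of still working. For the 3-state case with equal rates λ, the result can be compared with the Erlang hazard rate λ²t / (1 + λt).

```python
def hazard_curve(rates, t_end, dt=0.001):
    """Return (time, hazard rate) pairs for a chain 1 -> 2 -> ... -> failed.

    rates[i] is the constant transition rate out of working state i.
    There is no repair, so the failed state is absorbing.
    """
    n = len(rates)
    p = [1.0] + [0.0] * (n - 1)  # start in the first working state
    curve = []
    for step in range(int(round(t_end / dt)) + 1):
        reliability = sum(p)  # probability of not yet having failed
        # Failure can only occur from the last working state.
        curve.append((step * dt, rates[-1] * p[-1] / reliability))
        # Forward-Euler step of the state probability equations.
        inflow = [0.0] + [rates[i] * p[i] for i in range(n - 1)]
        p = [p[i] + dt * (inflow[i] - rates[i] * p[i]) for i in range(n)]
    return curve


# Hypothetical rates (per unit time), for illustration only.
two_state = hazard_curve([0.1], t_end=50)               # working -> failed
three_state = hazard_curve([0.1, 0.1], t_end=50)        # equal rates (Erlang)
four_state = hazard_curve([0.05, 0.2, 0.8], t_end=50)   # rates rise near failure

for name, curve in [("2 states", two_state),
                    ("3 states", three_state),
                    ("4 states", four_state)]:
    samples = [f"{h:.4f}" for _, h in curve[::10000]]  # every 10 time units
    print(name, samples)
```

In this sketch the two-state hazard rate stays flat at the transition rate, while the multi-state chains start at zero and climb over time, even though every individual transition rate is constant.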

Some more comments: For hardware reliability of consumer products, which is of most interest to podcast listeners, the use of Markov chains is limited, as you conclude. However, there are many more applications of multi-state reliability, like computer or telecommunication networks, where repair actions can take place prior to full failure. Another comment: you failed to mention there are 2 basic types of Markov chains, discrete time and continuous time. If you can assume transitions only take place or are only detected on a regular schedule (weekly, monthly), then the mathematics is quite different and easier. Alternatively, when continuous-time Markov chains get big, they are often not practical at all.
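
As a rough illustration of the discrete-time case, the sketch below assumes a system inspected once a week, with transitions described by probabilities per inspection interval rather than by rates. The matrix values, the weekly schedule, and the function names are all hypothetical, picked only to show the mechanics.

```python
# Hypothetical per-week transition probabilities (each row sums to 1).
# States: 0 = functional, 1 = degraded, 2 = failed.
P = [
    [0.95, 0.05, 0.00],  # functional: stays functional or degrades
    [0.30, 0.60, 0.10],  # degraded: restored, stays degraded, or fails
    [1.00, 0.00, 0.00],  # failed: assumed replaced by the next inspection
]


def step(dist, P):
    """Advance the state probabilities by one inspection interval."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def distribution_after(weeks, start=(1.0, 0.0, 0.0)):
    """State probabilities after a number of weekly inspections."""
    dist = list(start)
    for _ in range(weeks):
        dist = step(dist, P)
    return dist


for weeks in (1, 2, 52):
    print(weeks, [f"{p:.4f}" for p in distribution_after(weeks)])
```

Each step is a single vector-matrix multiplication, which is a large part of why the discrete-time mathematics is easier to work with.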

Keep up the good work.

- Christopher Jackson says
  
  February 8, 2020 at 5:53 AM
  
  David – thanks for your feedback and additional comments! You are certainly correct and we had to balance a ‘complete discussion’ on the topic with a 25 minute podcast time limit.
  
  To elaborate on your first point and to provide context to our listeners, there are non-constant hazard rate distributions out there like the gamma distribution that Markov chains can model. The gamma distribution models systems where failure occurs after a certain number (let’s call this number ‘k’) of components have failed – in order. That is, after the 1st component fails, the 2nd component starts to get used, and when the 2nd component fails the 3rd component starts to be used and so on (sort of like an extreme value distribution – but not quite). So if we set up a Markov chain with ‘k’ working states in sequence followed by a failed state, then the time it takes our system to reach the failed state is described by the gamma distribution. However, all components must have a constant and identical hazard rate. BUT – this can be OK if there are lots of them because the Central Limit Theorem pretty much says that for lots of random variables we add together … they tend to look the same.
  
  Markov chains can be better than gamma distributions in that we can have different hazard rates for all ‘k’ components (… which is what I think you were saying David?). And (if I get your gist), we can increase the transition rate as the system moves through successive states to represent things like fatigue, where the rate of degradation increases the closer you get to failure. But again … we can perhaps use the lognormal distribution here. And a Weibull distribution also does a pretty good job of modelling systems where the hazard rate accelerates as you approach failure.
  
  I would still suggest that for the overwhelming majority of reliability engineering problems, we can use things like the Weibull, gamma, lognormal distributions, aggregate models of each and other approaches before Markov chains become the modelling tool of choice – happy for you to challenge this David!
  
  I also agree with you David that Markov chains also allow us to model different repairs for different levels of degradation … we simply have different transition rates from different levels of degradation back to the fully functional state. But again, we still have to use constant transition rates (which are particularly inappropriate for modelling maintenance). So this Markov chain is still mainly useful for steady-state availability applications.
  
  And finally – we did not talk about ‘discrete’ Markov chains. A discrete Markov chain doesn’t involve rates of transition – instead your system can only transition from one state to another at discrete steps in time or usage. This can be quite useful for systems that react to ‘demands’ or specific applications. And I must apologize for not mentioning this type – we might still be recording the podcast if we hadn’t stopped!
  
  So comments like yours David help us fill out the material! Thank you very much for your comment … and any others that flow on from this.
  