Here is a Challenge: Life Data Analysis
Some years ago a few colleagues and I compared notes on the results of a Weibull analysis. Interestingly, we all started with the same data and got different results.
After a recent article on the many ways to accomplish data analysis, Larry mentioned that all one needs is shipments and returns to perform field data analysis.
This got me thinking: What are our common methods and sets of results when we perform life data analysis?
The Life Data Analysis Challenge
So, here’s a challenge: Given the data in this life-data-challenge.csv file, perform an analysis to answer two questions:
- How many returns should we expect next month?
- Is the rate of returns increasing or decreasing?
- [Bonus question] Based on your analysis and experience, what questions should we answer next?
Here is the data, life-data-challenge.csv
Notes About the Data
It is made-up data, kept relatively simple to allow a wide range of analysis approaches. The data represent time to failure in days, counted from shipment until the day the customer reported the failure, including weekends and holidays.
The item is a battery-powered portable hand drill for use by a home workshop or woodworking enthusiast. In other words, not a contractor. The drill is used sporadically for a wide range of tasks and situations around a person's home, office, or workshop.
To keep things very simple, 1,000 units were shipped on one day, and all the failure data come from that one day of shipments. Not all units have failed; only 75 have.
The data are in one column, not sorted or in any particular order.
Reporting Your Results
There are two main points in this challenge.
First, please answer the two (three) challenge questions based on your analysis. Provide a summary of your analysis, graphics, charts, or whatever makes sense for us (me and your peers) to understand your results and how you got them.
Second, please comment on what, if any, assumptions you made for your analysis. For example, if you assume the data is exponentially distributed (please, I really hope not!), list that as an assumption.
Third (I really do have a problem with keeping to two points today), please comment on what additional information, if any, you would like to have available to improve your analysis.
Please add your results to the comments section below, or email them to me (Fred) at fms@nomtbf.com
That is the challenge. Looking forward to your results and analysis.
Thanks for taking part and enjoy.
Jurgita Simaityte says
Hello Fred! I opened the data on my phone, but I do not see any field indicating censored or failed units. Which ones failed, and on which day did these 75 fail?
Fred Schenkelberg says
Hi Jurgita,
All 75 in the data file have failed. There is no column to indicate censored or failed, as every record in the data set is a failure. The value is days till failure. The total units shipped is 1,000, thus 1,000 – 75 = 925 are right censored (one way to encode this is sketched below).
Hope this helps.
Cheers,
Fred
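A minimal sketch of one way to encode this in Python with pandas and NumPy. The single unnamed column matches the file described above; the 3,673-day observation age used for the 925 censored units is an assumption (see the discussion of the fleet age further down the comments):

```python
import numpy as np
import pandas as pd

# The CSV has a single unnamed column: 75 failure times, in days from shipment.
failures = pd.read_csv("life-data-challenge.csv", header=None)[0].to_numpy(dtype=float)

# Units from the same shipment that have not failed are right censored.
n_shipped = 1000
n_censored = n_shipped - len(failures)          # should be 925
censor_age = 3673.0                             # assumed current fleet age in days

times = np.concatenate([failures, np.full(n_censored, censor_age)])
events = np.concatenate([np.ones(len(failures)), np.zeros(n_censored)])  # 1 = failed, 0 = censored

data = pd.DataFrame({"days": times, "failed": events.astype(int)})
print(data["failed"].value_counts())            # expect 925 censored and 75 failed
```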
Jurgita says
OK, now understood, thank you, Fred! One more question: do you have a deadline for this challenge? The date at the top is the 24th of May; does that mean it has finished already? 🙂
Fred Schenkelberg says
No deadline – the date is when the post was published. Cheers, Fred
Oleg Ivanov says
Hi Fred,
I think you will agree that failure times are enough for a statistics engineer but not enough for a reliability engineer. We need to know the failed part, the failure mechanism, and the cause of failure.
Based on this data I can say that 75 products out of 1,000 had some kind of manufacturing defect and have failed. The failure time follows a Weibull distribution (beta = 2.5; eta = 2000). I think this defect has “burned out” and does not appear in the rest of the products.
Thanks for the interesting question.
Fred Schenkelberg says
Hi Oleg, we have very little information in the data concerning failure mechanisms, etc. So, based on the analysis of the available data, what questions do you have?
For your analysis, how did you treat the censored data? What analysis approach did you take? Which software package and what assumptions or settings?
For example, using Weibull++ and ignoring the 925 right-censored points, I get one fit; adding the censored data (assuming the last point in the data is the censor point) and using rank regression or MLE, I get two other answers (a sketch of this comparison follows below).
I have found that other software packages provide different answers as well.
So, two questions: which is right, and why? And based on your analysis, rather than stating conclusions, what questions should one be asking to help reach the right conclusions?
Cheers,
Fred
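A minimal sketch of that comparison in Python with SciPy: a 2-parameter Weibull fitted by MLE to the 75 failures alone, then refitted with the 925 right-censored units included. The 3,673-day censoring age is an assumption, and the likelihood is a generic hand-rolled one, not the Weibull++ implementation:

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import weibull_min

failures = pd.read_csv("life-data-challenge.csv", header=None)[0].to_numpy(dtype=float)
censored = np.full(925, 3673.0)   # assumed censoring age for the 925 survivors

def weibull_nll(params, t_fail, t_cens):
    """Negative log-likelihood of a 2-parameter Weibull with right censoring."""
    beta, eta = params
    if beta <= 0 or eta <= 0:
        return np.inf
    ll = np.sum(weibull_min.logpdf(t_fail, c=beta, scale=eta))   # failures: log density
    ll += np.sum(weibull_min.logsf(t_cens, c=beta, scale=eta))   # censored: log survival
    return -ll

# Fit 1: ignore the 925 right-censored units entirely.
fit_fail_only = minimize(weibull_nll, x0=[1.5, 2000.0],
                         args=(failures, np.array([])), method="Nelder-Mead")

# Fit 2: include the censored units.
fit_with_cens = minimize(weibull_nll, x0=[1.5, 2000.0],
                         args=(failures, censored), method="Nelder-Mead")

print("failures only :", fit_fail_only.x)   # [beta, eta]
print("with censoring:", fit_with_cens.x)
```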
Ricardo says
Hi Fred,
In this example, the operational conditions of one hand drill and another can vary a lot (the item that failed after 515 days and the item that failed after 1460 days may have been used in very different ways: load, duty cycle, environment…). So the first question could be: can we group failures by a certain use pattern? The second question could relate to the failure reporting system (the questions mentioned by Oleg: failure effect? failure mode? potential failure causes? etc.).
With no more data, and from a purely statistical point of view, I can share these three approaches:
A. Parametric estimation approach without taking into account the censored data:
– Rank Time To Failure data
– Benard approximation for the time to failure probability
– Least Squares fit to Weibull (R^2=97.1%): BETA=2.48; ETA=2002
Question 1 response: 925 units x [1- R(3673+30 days)/R(3673 days)] = 84 expected returns during the next month (a worked check of this arithmetic for all three approaches follows approach C below).
Question 2 response: BETA >1 –> increasing failure rate –> increasing return rate.
In this case we do not use the information that 925 units have survived 3673 days, so our estimate could be very conservative… (a sketch of this approach follows below)
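Here is a minimal Python sketch of approach A as described in the bullets above. The single-column CSV layout is assumed, as is the regression direction (Y on X); rank regression on X would differ slightly:

```python
import numpy as np
import pandas as pd

# Ordered failure times; the 925 censored units are ignored in this approach.
failures = np.sort(pd.read_csv("life-data-challenge.csv", header=None)[0].to_numpy(dtype=float))
n = len(failures)

# Benard's approximation for the median rank of the i-th ordered failure.
i = np.arange(1, n + 1)
F = (i - 0.3) / (n + 0.4)

# Weibull probability plot: ln(-ln(1 - F)) is linear in ln(t), with slope beta.
x = np.log(failures)
y = np.log(-np.log(1.0 - F))

slope, intercept = np.polyfit(x, y, 1)     # ordinary least squares, regressing y on x
beta = slope
eta = np.exp(-intercept / beta)
print(f"beta = {beta:.2f}, eta = {eta:.0f}")   # should land near the BETA=2.48, ETA=2002 above
```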
B. Parametric estimation approach with taking into account 925 right censored data:
– Rank Time To Failure data
– Mean order number and Benard approximation for time to failure probability
– Least Squares fit to LogNormal (R^2=94.1%): MEAN=9.86; STANDARD DEVIATION=1.32
Question 1 response: 925 units x [1- R(3673+30 days)/R(3673 days)] = 2 expected returns during the next month.
Question 2 response: Failure rate function is increasing with time –> increasing return rate
In this case, we have used the censored data information, but they represent a big proportion of the data (>90%). Could our estimation be very optimistic? I guess it could, but I think we should use this information in the analysis.
C. Your “beloved” in-service MTBF = 1778 days
– Average failure rate during the period = 1/1778
Question 1 response: (constant failure rate assumption during next periods) 925 units x 1/1778 x 30 days = 16 expected returns during next month
Question 2 response: constant failure rate assumption during next periods –> constant return rate
In this case, the assumption should be checked (if possible…). Here we cannot know whether our approach is conservative or optimistic, as we have made our analysis too simple…
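A minimal sketch of the question-1 arithmetic for all three approaches, plugging in the parameters exactly as reported above. It assumes the lognormal MEAN and STANDARD DEVIATION are the mu and sigma of log-time in days, and that the fleet age is 3,673 days:

```python
import numpy as np
from scipy.stats import weibull_min, lognorm

t_now, dt, survivors = 3673.0, 30.0, 925    # assumed fleet age, next month, surviving units

# A. Weibull fit (failures only): conditional probability of failing within the next 30 days.
R_w = lambda t: weibull_min.sf(t, c=2.48, scale=2002.0)
returns_A = survivors * (1.0 - R_w(t_now + dt) / R_w(t_now))

# B. Lognormal fit (censored units included): same conditional calculation.
#    MEAN and STANDARD DEVIATION above are taken to be the mu and sigma of ln(t), t in days.
R_ln = lambda t: lognorm.sf(t, s=1.32, scale=np.exp(9.86))
returns_B = survivors * (1.0 - R_ln(t_now + dt) / R_ln(t_now))

# C. Constant failure rate from the in-service MTBF of 1778 days.
returns_C = survivors * (1.0 / 1778.0) * dt

print(f"A: {returns_A:.0f}   B: {returns_B:.1f}   C: {returns_C:.0f}")
# Compare with the ~84, ~2, and ~16 reported above; small gaps come from parameter rounding.
```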
I am looking forward to hearing your thoughts.
Cheers,
Ricardo
Fred Schenkelberg says
Thanks, Ricardo, for the detailed analysis – the Weibull with censored data is, I think, the way to go, although I would use the Maximum Likelihood Estimation method given the large amount of censored data… Cheers, Fred
Adrien says
Using a 2-parameter Weibull, with the MLE method, without taking censoring into account:
beta = 2.47
eta = 2010
Rank Regression X or Y methods give about the same results.
But the Kolmogorov–Smirnov test rejects the goodness-of-fit hypothesis.
Using a mixed Weibull (two subpopulations) + MLE gives the following results:
Sub-Pop 1
beta1 = 3.5
eta1 = 1248
p1 = 0.453443475 (proportion of sub-pop 1)
Sub-Pop 2
beta2 = 4.435635
eta2 = 2550.594619
p2 = 0.546556525 (proportion of sub-pop 2)
The Kolmogorov–Smirnov test does not reject the goodness-of-fit hypothesis.
1/ Assuming the end time is 3673 days => the failure rate at 3673 days is 6.09e-3/d, so the number of failures over 30 more days is: 6.09e-3 x 925 x 30 = 169 (a sketch of this calculation follows below).
2/ The lower bound on beta1 at 90% confidence is 2.53, and the lower bound on beta2 at 90% confidence is 2.85 => so the failure rate is increasing.
3/ To take the censored data into account, what is the operating time of all the other units? What is the maintenance policy: are the failed units removed or repaired?
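A minimal sketch reproducing the question-1 arithmetic from the mixture parameters reported above, in Python with SciPy; applying the fitted mixture's hazard to the 925 survivors for 30 days is taken directly from the comment:

```python
import numpy as np
from scipy.stats import weibull_min

# Two-subpopulation mixed Weibull, parameters as reported above: (proportion, beta, eta).
subpops = [(0.453443475, 3.5, 1248.0),
           (0.546556525, 4.435635, 2550.594619)]

def mix_pdf(t):
    return sum(p * weibull_min.pdf(t, c=b, scale=e) for p, b, e in subpops)

def mix_sf(t):
    return sum(p * weibull_min.sf(t, c=b, scale=e) for p, b, e in subpops)

t_end = 3673.0
hazard = mix_pdf(t_end) / mix_sf(t_end)     # failure rate of the fitted mixture at the end time
expected = hazard * 925 * 30                # applied to the 925 survivors over 30 days, as above

print(f"hazard ~ {hazard:.2e} per day, expected returns ~ {expected:.0f}")
# should reproduce the ~6.09e-3/d and ~169 quoted in the comment above
```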
Fred Schenkelberg says
Thanks, Adrien. Which software package (or by hand?) did you use for the analysis? Thanks for taking the challenge. Why consider the regression without the censored data? Does that make sense? Cheers, Fred
Adrien says
The software is Weibull++ (Reliasoft).
Censoring is not well taken into account by regression methods, as they use the rank (median rank) instead of the exact time. So with heavy censoring (here (1000-75)/1000 = 92.5%) the results may be wrong.
MLE uses the exact time of censoring.
Fred Schenkelberg says
Thanks Adrien – I agree that we need to account for the censored units and do so in a meaningful manner, which means using MLE.
Cheers,
Fred
neel sharavana says
Dear sir, greetings from India. I am an engineer with no reliability background. I am seeking your help and advice; I admire your NoMTBF blog.
I work for a firm with several big engineering systems, and we have been using MTBF and MTTR with an exponential distribution assumption to calculate system reliability, etc.
I need your help, sir, to move away from MTBF and implement a better study methodology. Our systems are used for 10-15 years and have non-repairable LRUs, i.e., line-replaceable units, which are replaced from spare stocks whenever a failure occurs. Basically, the main systems are repairable overall, but the spares that fail are non-repairable.
We have a robust failure reporting system, with monthly feedback from users, for carrying out field reliability and maintainability analysis of field failures and field performance. The monthly feedback forms have data such as number of running hours, cause of failure, spare replaced, repair time, system downtime, etc.
Can you kindly help me, sir, with a simple system for standardized field failure data analysis, without too much math and without too much reliance on software tools like RELEX?
I will be very grateful for your kind help, given your expertise and knowledge in the field. Sincere regards, Mr. Sharavana Gowda from Mumbai, India
Mark Powell says
Fred,
I am late to this, but it looks like fun.
I don’t understand how anybody could get any answers though. I am seeing plug and chug without problem analysis.
You said 1,000 were shipped on a single day. When was that day relative to the date you reported the data? (I am presuming, for question one, that you are looking for the next 30 days beyond this report date; please confirm.) Obviously it has to be at least as many days as the largest failure time, but whether the ship date was 25,000 days before the report date or exactly the number of days of the largest failure time makes a huge difference.
These are seemingly innocuous and ordinary questions, but they are not trivial to answer properly.
Mark Powell
Fred Schenkelberg says
Hi Mark, good questions. The longest time to failure is just about 10 years, so let’s say the data was provided to you at 10 years plus the first month. If there were three leap years, then we’d be right at the 10-year mark.
Or you can assume the report is failure truncated: you have as many failure points as are known, and no additional time has elapsed.
I agree that if all failures stopped over the next 10 years or so, it would change the analysis; either way, please make your assumptions clear, as with any analysis.
Cheers,
Fred