Accendo Reliability

Your Reliability Engineering Professional Development Site


by Larry George

Why Kill Controls?


“The effects of chance are the most accurately calculable, and the least doubtful of all factors in the evolutionary situation.”

R. A. Fisher, ca. 1953

COVID-19 vaccination claims have changed from “prevention” to “reduced severity.” The FDA authorized Pfizer’s vaccine on the strength of 95% efficacy relative to the placebo control sample. By the same calculation, Pfizer’s placebo sample had 86% efficacy relative to the US population case rate! Sample subjects resembled each other but not the US population!

Ronald Fisher deserves credit for current randomized clinical trials practice and for the method of enumeration of alternative outcomes’ probabilities. Why would reliability people care? 

Reliability testing is supposed to show what might happen in the field. Field reliability is useful in life testing as well as in other applications: warranty reserves, spares stocks, diagnostics, recalls, etc. 

I had to do life testing on something and found that the Kolmogorov-Smirnov (K-S) test was commonly used. However, it was designed for complete samples, not censored ones, so I modified the K-S test to deal with censored samples. I also used a likelihood ratio test, because nonparametric reliability functions could be estimated from test samples and from population data, with or without life data.

When there are no population life data, just period ships and returns counts (without identifying which cohort the returns came from), I used Ronald Fisher’s enumeration of simulated, grouped life data that matched the observed ships and returns counts. This is an example of “neutrosophic” statistics.

Test results on one sample produce the equivalent of the sound made by one hand clapping. Why not compare test reliability results with population reliability?

Randomized Clinical Trials vs. Single-Arm Trials

Vaccine efficacy is 1 − Risk(vaccinated)/Risk(unvaccinated), where Risk() is infections/sample size. Pfizer received emergency use authorization with vaccine efficacy = 1 − (8/21720)/(162/21728) = 95.06% (8 cases among the vaccinated vs. 162 placebo cases). Suppose that, instead of the 21728-subject placebo control sample, we compare with the unvaccinated US population Risk(). Placebo (saline) efficacy = 1 − (162/21728)/(17.8M/328.2M) = 1 − 0.007455/0.054235 = 86.25%! The difference between the placebo (unvaccinated) sample case rate, 0.75%, and the US population case rate, 5.42%, shows that Pfizer’s sample is not representative of the US population.
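The arithmetic above can be checked with a few lines of Python. This is only a sketch of the efficacy formula as stated in the text; the function name `efficacy` is illustrative, and the vaccinated arm size of 21,720 is an assumption chosen because it reproduces the quoted 95.06%.

```python
def efficacy(cases_treated, n_treated, cases_control, n_control):
    """Return 1 - Risk(treated)/Risk(control), with Risk = cases / size."""
    risk_treated = cases_treated / n_treated
    risk_control = cases_control / n_control
    return 1.0 - risk_treated / risk_control

# Vaccinated sample vs. placebo sample.
# Assumption: 21,720 vaccinated subjects, which reproduces the quoted 95.06%.
vaccine_vs_placebo = efficacy(8, 21720, 162, 21728)

# Placebo sample vs. US population (17.8M cases among 328.2M people).
placebo_vs_population = efficacy(162, 21728, 17.8e6, 328.2e6)

print(f"vaccine vs. placebo:    {vaccine_vs_placebo:.2%}")
print(f"placebo vs. population: {placebo_vs_population:.2%}")
```

The same function serves both comparisons; only the choice of control (placebo sample or population) changes.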

Others recognize this problem [Averitt et al.]. David Moore (former statistics group leader of Abbott Laboratories) told me, “We’re lucky to find 100 subjects with the disease, and we have to split them into control and treatment blocks.” What are the consequences? [Deeney] Translating clinical trials evidence into medical practice may be facilitated by representative sample vs. population comparisons.  Comparing a treated sample vs. untreated population avoids the ethical dilemma of killing controls and removes the bias due to convenience sampling.  

The FDA says, “Real-world data and real-world evidence are playing an increasing role in health care decisions.” https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence. Why not compare treated sample life statistics with untreated population statistics? [Deeney]

Lady Tasting Tea Led to Randomized Clinical Trials Practice

Ronald A. Fisher did experiments at Rothamsted Research Station, on plants! At UC Berkeley, I took engineering statistics from Elizabeth Scott. She taught Ronald Fisher’s lesson about “The Lady Tasting Tea.” [https://www.nbi.dk/~petersen/Teaching/Stat2009/Fisher_ExactTest_LadyTastingTea.pdf/]

The Lady claimed she could tell whether the milk was poured into the teacup before or after the tea. Fisher proposed two sets of cups: one set had the milk poured in first, the other had the tea poured in first. The lady passed the test. I asked Professor Scott what this had to do with engineering. I am ashamed that I asked the question, but I got the answer. The Fisher exact test observes yes-no results, computes the probabilities of all possible combinations, and rejects the null hypothesis of guessing if the observed number of correct answers (identifying whether milk was added first) would be improbable under guessing.
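To make the enumeration concrete, here is a minimal sketch of the Fisher exact calculation. It assumes the classic design of 8 cups with 4 poured milk-first (the text above does not state the number of cups), and the function name `p_at_least` is illustrative.

```python
from math import comb

def p_at_least(correct, milk_first=4, cups=8):
    """P(identify at least `correct` milk-first cups by pure guessing).

    Assumes the lady knows how many cups are milk-first and picks that many,
    so the number correct is hypergeometric.
    """
    total = comb(cups, milk_first)  # all equally likely selections
    favorable = sum(
        comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
        for k in range(correct, milk_first + 1)
    )
    return favorable / total

p_all_four = p_at_least(4)    # 1 of 70 possible selections
p_three_plus = p_at_least(3)  # 17 of 70 possible selections
print(p_all_four, p_three_plus)
```

Under these assumptions only a perfect score is improbable enough (1/70, about 1.4%) to reject guessing at the usual 5% level.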

There is at least one clinical trial to see whether plasma-antibody treatment improves corona-virus case fatality rate (deaths/cases) [Joyner et al]. The clinical trial is to see whether treatment prolongs lifetime (survival) and reduces time to recovery:

Ho: survival function of a treated sample is same as untreated population vs.

Ha: survival function of treated sample is stochastically better than that of untreated sample; i.e., P[Life>t|treated sample] > P[Life>t|untreated sample] for some t. Why not compare with other populations?  

Typical randomized clinical trials presume similar data from randomly selected treated and untreated samples. Treated sample life data differs from untreated population case and death or recovery counts data, although both contain survival function information. Treated sample subjects produce (censored) life data, times from infection to death or recovery, by patient name or unique identifier. The Kaplan-Meier nonparametric maximum likelihood estimator could be used to estimate the treated survival function, P[Life>t|Treated].
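A minimal sketch of the Kaplan-Meier estimator for right-censored life data follows. The function name and the five example lifetimes are hypothetical, for illustration only.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed endpoint.

    times:  follow-up time for each subject
    events: 1 if the endpoint (e.g., death) was observed, 0 if censored
    """
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        leaving = sum(1 for u in times if u == t)  # deaths plus censored at t
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

# Hypothetical treated sample: days to endpoint, with two censored subjects.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects reduce the number at risk without contributing a step, which is what distinguishes this estimator from the empirical survival function of a complete sample.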

What does this have to do with reliability? Engineers do life tests to determine whether changes in design, process, or other factors improve reliability, P[Life > t], for reasonable life t. The life test null hypothesis is that the change(s) cause no difference in life between the changed sample and the unchanged controls. Why have controls when there is a product population already in service, with a field reliability function that you could estimate from the ships and returns counts required by generally accepted accounting principles?

The untreated population produces case and death or recovery counts, without lifetime data [George and Agrawal]. Periodic cohort case and death or recovery counts are statistically sufficient to make nonparametric estimates of population survival functions, P[Life>t|Untreated],  https://sites.google.com/site/fieldreliability/corona-virus-survival-analysis/. 
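To illustrate how periodic counts carry survival information, here is a simplified sketch. It solves, period by period, returns[j] = Σ ships[j − a] · p[a] for the age-specific probabilities p[a]. This moment-matching deconvolution is an assumption made for illustration, not necessarily the maximum likelihood estimator referenced above; the function names are hypothetical, and the inputs are the ships and period sums as read from table 3.

```python
def age_failure_probabilities(ships, returns):
    """Solve returns[j] = sum_a ships[j - a] * p[a] for p, one period at a time."""
    p = []
    for j, r in enumerate(returns):
        # Returns already explained by younger ages of later cohorts.
        explained = sum(ships[j - a] * p[a] for a in range(j))
        # Clip at zero: noisy counts can otherwise give negative probabilities.
        p.append(max(0.0, (r - explained) / ships[0]))
    return p

def survival(p):
    """Cumulative survival function from age-specific failure probabilities."""
    s, out = 1.0, []
    for q in p:
        s -= q
        out.append(s)
    return out

ships = [100, 100, 100]   # cohort sizes by period, as read from table 3
returns = [2, 5, 9]       # period sums (bottom row of table 3)
p = age_failure_probabilities(ships, returns)
surv = survival(p)
print(p, surv)
```

No lifetimes are used: only cohort sizes and period totals, which is the kind of data the untreated population provides.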

The clinical trial hypothesis test could be done by comparing the sample survival function estimate vs. the population survival function estimate using the Kolmogorov-Smirnov (K-S) maximum absolute difference, likelihood ratio, or other test statistic. The FDA would call this a “single-arm” trial and the population an “external” control. Dan Moore (real biostatistician) says, “The FDA does accept ‘historical controls’ as a comparison to treated in phase II trials. You have to show that there has been no change in your endpoint over chronological time.” [LeBlanc and Tangen 2012, Belin et al. 2017, Dean et al. 2020, and others] Death is a clear endpoint, although recovery from corona virus may not be as clear.
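The K-S maximum absolute difference between two step survival functions can be sketched as follows. Curves are assumed to be lists of (time, S) pairs that equal 1 before the first jump; the function names and the sample curve values are hypothetical.

```python
from bisect import bisect_right

def step_value(curve, t):
    """Value at time t of a right-continuous step survival curve [(time, S)]."""
    times = [u for u, _ in curve]
    k = bisect_right(times, t)          # number of jumps at or before t
    return 1.0 if k == 0 else curve[k - 1][1]

def ks_distance(curve_a, curve_b):
    """Maximum absolute difference, checked at every jump time of either curve."""
    grid = sorted({t for t, _ in curve_a} | {t for t, _ in curve_b})
    return max(abs(step_value(curve_a, t) - step_value(curve_b, t)) for t in grid)

# Hypothetical treated-sample and population survival estimates by period.
sample_curve = [(1, 0.90), (2, 0.70), (3, 0.60)]
population_curve = [(1, 0.98), (2, 0.95), (3, 0.91)]
d = ks_distance(sample_curve, population_curve)
print(round(d, 3))
```

Because both curves are piecewise constant, it is enough to evaluate the difference at the jump times.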

Kolmogorov-Smirnov Life Test for Censored Data

My 1999 paper with the same title presumed that both the sample and population data consisted of cases and death counts, without lifetime data. It uses a likelihood ratio test. But life tests generate lifetime data, because sample subjects are tracked by name or unique identifier. Lifetimes give more precise survival function estimates than case and death counts; e.g., the Kaplan-Meier nonparametric maximum likelihood estimator for censored, grouped life data vs. nonparametric maximum likelihood estimator for case and death counts [https://sites.google.com/site/fieldreliability/random-tandem-queues-and-reliability-estimation-without-life-data/]. 

The references by Grover and by Fleming and Harrington deal with censored life tests. Grover’s paper and my 1997 presentation assumed equal size treated and untreated samples of life data. What if you had a small, treated sample of censored life data and a huge untreated population of case and endpoint event count data, in which the case cohorts started at different times (“staggered start”), and the event counts did not identify the cohort they came from? 

This problem falls in the realm of “neutrosophic” statistics [Smarandache], because the population case cohort and endpoint event counts could have come from a variety of lives with the same periodic event counts. Table 1 shows grouped life data and event counts from two cohorts started in two periods. Table 2 shows alternative life data that result in the same event counts in the bottom row. These alternatives don’t have the same probabilities, assuming the population survival function estimate from population case and event counts.

Table 1. Grouped life data and case and endpoint event counts. Period 1 cohort has 2 deaths in period 1 and 3 in period 2. Period 2 cohort has 2 deaths in period 2. The bottom row contains the sums of endpoint event counts by period. More than one period cohort (cross-section) of population cases is needed to reduce length-bias without life data [Chan].

Period        Cases   Deaths Period 1   Deaths Period 2
1                98          2                 3
2               100                            2
Period Sums     198          2                 5

Table 2. Alternative grouped life data endpoint event counts that could have resulted in the same event counts as in the bottom row of table 1. Each pair of columns 1-2 shows alternative grouped life data that give the same column sums as in table 1.

Period   1  2  |  1  2  |  1  2  |  1  2  |  1  2
1        2  2  |  2  1  |  2  0  |  2  4  |  2  5
2           3  |     4  |     5  |     1  |     0
Sums     2  5  |  2  5  |  2  5  |  2  5  |  2  5
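The enumeration of alternatives and their unequal probabilities can be sketched as below. The sketch makes two assumptions: the population age-specific probabilities come from a simple period-by-period solution of the table 1 sums, and each cohort's deaths are multinomial with those probabilities. Function and variable names are illustrative.

```python
from math import comb

cases = [98, 100]       # cohort sizes from table 1
period_sums = [2, 5]    # deaths in periods 1 and 2 (bottom row of table 1)

# Assumed population estimate: solve the period sums for age-specific probabilities.
p1 = period_sums[0] / cases[0]                          # P(death at age 1)
p2 = (period_sums[1] - cases[1] * p1) / cases[0]        # P(death at age 2)

def likelihood(k):
    """Likelihood that cohort 1 contributes k of the period-2 deaths."""
    d1 = period_sums[0]        # cohort 1 deaths at age 1 (period 1)
    d2 = period_sums[1] - k    # cohort 2 deaths at age 1 (period 2)
    cohort1 = (comb(cases[0], d1) * comb(cases[0] - d1, k)
               * p1**d1 * p2**k * (1 - p1 - p2)**(cases[0] - d1 - k))
    cohort2 = comb(cases[1], d2) * p1**d2 * (1 - p1)**(cases[1] - d2)
    return cohort1 * cohort2

# Enumerate every split of the period-2 deaths between the two cohorts.
weights = {k: likelihood(k) for k in range(period_sums[1] + 1)}
total = sum(weights.values())
probs = {k: w / total for k, w in weights.items()}
for k, pr in probs.items():
    print(f"cohort 1: {k}, cohort 2: {period_sums[1] - k}, probability {pr:.3f}")
```

The six splits cover the life data in table 1 and the five alternatives in table 2, all with the same period sums.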

Problem statement

From “To the Man with a Hammer,…” [George 1997]

“I compute nonparametric, age-specific reliability estimators from ships and failures data, without life data. Although they are maximum likelihood estimators, they are not Kaplan-Meier (K-M) estimators because failures are grouped by calendar time intervals regardless of ages-at-failures. Fortunately they are population, not sample, estimators, so their only uncertainty is due to censoring.” 

“The modified (for censored data) Kolmogorov-Smirnov (K-S) test applies to K-M estimators (from life data), not ships and returns estimators. What is the asymptotic distribution of the maximum difference between two reliability functions estimated from grouped ships and returns data [George 1996]? Is the modification in [Gnedenko] still appropriate? Is only power affected, not P[type I error]? I conjecture the modified K-S test still has the same asymptotic distribution, but the numbers of observed failures should be replaced by the numbers of time intervals containing failures. The references by Nikiforov derive the asymptotic distribution of the K-S test statistic and provide a robust program for the K-S test statistic, but not for the modification for table 2 data. Reference by Fleming and Harrington, 1991, describes log-rank statistic alternatives to the K-S test, which may be more powerful than K-S tests when reliability estimates cross.”

Muhammad Aslam [2020] proposed one- and two-sample “neutrosophic” K-S tests (NK-S) where observations are contained in intervals, not known “crisply.” The test is based on an interval containing the K-S difference statistic instead of its exact value. Aslam did not specify how to deal with functions of interval observations: enumeration, interval arithmetic, or simulation.

Solution

Simulate population life data with the same column sums or event counts as in the population data. Compute the K-M estimator from the simulated population life data and its K-S distance from the sample K-M survival function estimate. If the sample K-M estimator K-S distance is less than some percentile of the simulated |population−sample| K-S distance, do not reject the null hypothesis. Naturally, I call this an SNK-S test. 
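The simulation step can be sketched as follows. Several things here are assumptions for illustration, not the author's program: each period's returns are allocated uniformly at random among units still at risk, the population survival function (0.98, 0.95, 0.91) is the value implied by solving the table 3 period sums, and all function names are hypothetical.

```python
import random

ships = [100, 100, 100]                 # cohort sizes, as read from table 3
period_sums = [2, 5, 9]                 # returns per period (bottom row of table 3)
population_surv = [0.98, 0.95, 0.91]    # assumed population estimate from those sums

def simulate_table(rng):
    """Allocate each period's returns at random to units still at risk."""
    n = len(ships)
    at_risk = list(ships)
    table = [[0] * n for _ in range(n)]      # table[cohort][period]
    for period, count in enumerate(period_sums):
        # Only cohorts shipped by this period can contribute returns.
        pool = [c for c in range(period + 1) for _ in range(at_risk[c])]
        for cohort in rng.sample(pool, count):
            table[cohort][period] += 1
            at_risk[cohort] -= 1
    return table

def grouped_km(table):
    """Kaplan-Meier survival by age from a cohort-by-period table."""
    n = len(ships)
    surv, s = [], 1.0
    for age in range(n):
        cohorts = range(n - age)             # cohorts observed at this age
        deaths = sum(table[c][c + age] for c in cohorts)
        risk = sum(ships[c] - sum(table[c][c:c + age]) for c in cohorts)
        s *= 1.0 - deaths / risk
        surv.append(s)
    return surv

def ks(a, b):
    """Maximum absolute difference between two survival functions by age."""
    return max(abs(x - y) for x, y in zip(a, b))

def critical_distance(n_sims=20, level=0.95, seed=1):
    """Empirical percentile of simulated population K-S distances."""
    rng = random.Random(seed)
    d = sorted(ks(population_surv, grouped_km(simulate_table(rng)))
               for _ in range(n_sims))
    index = max(0, min(n_sims - 1, round(level * n_sims) - 1))
    return d[index]

def snks_test(sample_surv, **kwargs):
    """Return (observed distance, True if the null hypothesis is not rejected)."""
    d_obs = ks(population_surv, sample_surv)
    return d_obs, d_obs < critical_distance(**kwargs)
```

A call such as `snks_test([0.97, 0.95, 0.92])`, with a hypothetical sample K-M estimate, returns the observed distance and the decision. Every simulated table has the same period sums as the population data, so the simulated distances reflect only the uncertainty about which cohort each return came from.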

I simulated life data from the population data in table 3 and computed the K-S distance between population and sample data for each of 20 simulations. Figure 1 shows that a lognormal distribution fits pretty well, especially near the upper end. The simulated mean of ln(K-S distance) was -4.23 and the standard deviation was 5.2. The 95th percentile was 0.032. If the K-S distance between a population nonparametric maximum likelihood estimator, from case and death counts, and the sample K-M estimator is less than 0.032, do not reject the null hypothesis at the 5% significance level (95% confidence).

However, the sets of simulated life data are not equally likely, assuming the population survival function estimate from population case and event counts. So I weighted each simulated K-S distance by a normalized Kullback-Leibler (K-L) divergence of its simulated K-M estimator from the population survival function. Figure 2 shows the weighted alternative to figure 1, for the same simulated K-S distances. Simulated mean of ln(weighted K-S distance) was 0.00115 and standard deviation was 0.00112. The 95th percentile was 0.00304. If the weighted |population−sample| K-S distance between a population nonparametric maximum likelihood estimator, from cases and deaths, and the sample K-M estimator is less than 0.00304, do not reject the null hypothesis at the 5% significance level (95% confidence).
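A minimal sketch of the K-L weighting follows, assuming each survival function is converted to interval probabilities plus a surviving tail, and that each K-S distance is multiplied by (K-L divergence)/Σ(K-L divergences). The function names and the two simulated curves are hypothetical.

```python
from math import log

def pmf(surv):
    """Interval probabilities from a survival curve, plus the surviving tail."""
    prev, out = 1.0, []
    for s in surv:
        out.append(prev - s)
        prev = s
    out.append(prev)
    return out

def kl_divergence(p, q):
    """D(P || Q) = sum of p * ln(p / q); terms with p = 0 contribute 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def weighted_distances(population_surv, simulated_survs):
    """Multiply each K-S distance by its normalized K-L divergence."""
    q = pmf(population_surv)
    kl = [kl_divergence(pmf(s), q) for s in simulated_survs]
    total = sum(kl)
    if total == 0:                      # all simulations equal the population
        return [0.0] * len(simulated_survs)
    ks = [max(abs(a - b) for a, b in zip(population_surv, s))
          for s in simulated_survs]
    return [d * k / total for d, k in zip(ks, kl)]

population_surv = [0.98, 0.95, 0.91]
simulated = [[0.98, 0.96, 0.92], [0.97, 0.95, 0.90]]   # hypothetical K-M curves
weighted = weighted_distances(population_surv, simulated)
print(weighted)
```

The normalized weights sum to one, which is why the weighted distances fall on a much smaller scale than the unweighted ones.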

Table 3. Life data for simulation to give same bottom row

Period   Ships   1   2   3
1          100   2   3   4
2          100       2   3
3          100           2
Sums       300   2   5   9

Figure 1. Simulated K-S distances from table 3 data. Distance is maximum absolute difference between nonparametric maximum likelihood estimator from bottom row and the Kaplan-Meier estimator from simulated event counts.


Figure 2. Simulated K-S distances from table 3 data, weighted by K-L divergence from population survival function. Horizontal axis differs from figure 1, because K-S distances are multiplied by ratio of (K-L divergence)/Σ(K-L divergences).

Afterthoughts: Multiple inference and COVID-19 vaccine

“Recognize that any frequentist statistical test has a random chance of indicating significance when it is not really present. Running multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result. Selecting the one “significant” result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading.” Professionalism Guideline 8, Ethical Guidelines for Statistical Practice, American Statistical Association, 1997. [https://web.ma.utexas.edu/users/mks/statmistakes/multipleinference.html/]

1. Simulate the K-S distance for all simulated population K-M estimates and do not reject the null hypothesis if all the K-S distances are small. 

2. Do the likelihood ratio test too, [George 1999 and 2021] using the column sums from the sample life data and the population counts. If somebody gives me some treatment life data and I can find corresponding population case and death (event) counts, I will run both tests.

3. Do log-rank and Gehan-Wilcoxon tests too? [Ed Gehan suggested that to me, 1976.]

4. Do the weighted-difference in cumulative failure rate functions proposed by Fleming and Harrington [1980]. This deals with crossing failure rate functions. Their test statistic has known asymptotic properties. 

5. Does 95% Pfizer COVID-19 vaccine trial efficacy apply to the population? Vaccine efficacy = (Cases(unvacc.)/TTT(unvacc.) − Cases(vacc.)/TTT(vacc.))/(Cases(unvacc.)/TTT(unvacc.)), where TTT() stands for total time on test. TTT() is time since vaccination for the treated subjects; for the unvaccinated, it is total time since February or March 2020, when COVID-19 started, or comparable time since June, when vaccination trials started.
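The rate-based efficacy in item 5 can be sketched as a small function. The function name is illustrative, and the total-time-on-test values in the example are hypothetical placeholders, chosen equal so the result reduces to 1 − (cases ratio).

```python
def rate_efficacy(cases_vacc, ttt_vacc, cases_unvacc, ttt_unvacc):
    """Efficacy from case rates per unit of total time on test (TTT)."""
    rate_vacc = cases_vacc / ttt_vacc
    rate_unvacc = cases_unvacc / ttt_unvacc
    return (rate_unvacc - rate_vacc) / rate_unvacc

# Hypothetical TTT values (equal person-time in both groups).
equal_exposure = rate_efficacy(8, 1000.0, 162, 1000.0)

# Hypothetical: half the exposure time for the vaccinated lowers the efficacy.
half_exposure = rate_efficacy(8, 500.0, 162, 1000.0)
print(equal_exposure, half_exposure)
```

Unlike the risk-based formula, this version changes when the two groups are observed for different lengths of time, which is the point of using TTT.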

Are you running sample life tests? What are you comparing to the sample reliability function estimate? Compare sample with population reliability function estimates; the latter doesn’t have any sample uncertainty!

References

Aslam, Muhammad, “Introducing Kolmogorov−Smirnov Tests under Uncertainty: An Application to Radioactive Data,” ACS Omega 5, 914−917, http://pubs.acs.org/journal/acsodf, 2020

Averitt, Amelia J., Chunhua Weng, Patrick Ryan, and Adler Perotte, “Translating evidence into practice: eligibility criteria fail to eliminate clinically significant differences between real-world and study populations,” Digital Medicine 3:67, https://doi.org/10.1038/s41746-020-0277-8, 2020

Belin, Lisa, Yann De Rycke, and Phillippe Broët, “A two-stage design for phase II trials with time-to-event endpoint using restricted follow-up,” Contemporary Clinical Trials Communications, Volume 8, Pages 127-134, https://doi.org/10.1016/j.conctc.2017.09.010, December 2017

Chan, Kwun Chuen Gary, “Survival analysis without survival data: connecting length-biased and case-control data,” Biometrika 100 (3): 764-770, 2013 

Dean, N., Gsell, P.S., Brookmeyer, R., Crawford, F., Donnelly, C., Ellenberg, S., Fleming, T., Halloran, M. E., Horby, P., Jaki, T., Krause, P., Longini, I., Mulangu, S., Muyembe-Tamfum, J.J., Nason, M., Smith, P., Wang, R., Henao-Restrepo, A., and De Gruttola, V.  “Creating a Framework for Conducting Randomized Clinical Trials During Disease Outbreaks.” The New England Journal of Medicine, 382, 1366-1369, 2020

Deeney, Dianna, “How Many Controls Do We Need to Reduce Risk?” https://lucas-accendo-site-speed.sprod01.rmkr.net/podcast/the-reliability-fm-network/qdd-027-how-many-controls-do-we-need-to-reduce-risk/#more-449094, Sept. 2021

FDA, “Submitting Documents Using Real-World Data and Real-World Evidence to FDA for Drugs and Biologics Guidance for Industry,” May 2019

Fleming, Thomas R. and David P. Harrington, “A Class Of Hypothesis Tests For One and Two Sample Censored Survival Data,” Technical Report Series, No. 9, August 1980

Fleming, T. R. and D. P. Harrington, Counting Processes and Survival Analysis, Wiley-Interscience, New York, 1991

Gehan, E. A., “A generalized Wilcoxon test for comparing arbitrarily singly-censored samples.” Biometrika 52, 203-223, 1965

George, L. L., and A. C. Agrawal, “Estimation of a hidden service distribution of an M/G/∞ system,” Naval Research Logistics, 20: 549–555. doi: 10.1002/nav.3800200314, https://sites.google.com/site/fieldreliability/home/m-g-infinity-service-distribution, 1973 

George, L. L., “Ergodic Theory, Nyquist Samples, and Field Reliability,” Triad Systems Corp., March 1996

George, L. L., “Product Reliability Comparison with Censored Data,” or “To the Man With a Hammer, Everything Looks Like a Nail,” ASQ Reliability Review, Vol. 17, No. 1, March 1997

George, L. L., “Compare Population and Customer Reliability,” Quality and Productivity Research Conference, ASQ and UC Berkeley, Santa Rosa, CA, May 1998

George, L. L., “Why Kill Controls?” https://www.linkedin.com/feed/update/urn:li:activity:6704848960865103872, 1999

Gnedenko, B. V., Yu. K. Belyayev, and A. D. Solovyev, Mathematical Methods of Reliability Theory, Academic Press, New York, pp. 274-276, 1969

Grover, N. B., “Two-sample Kolmogorov-Smirnov test for truncated data,” https://doi.org/10.1016/0010-468X(77)90039-3

Joyner, Michael, et al., “Effect of Convalescent Plasma on Mortality among Hospitalized Patients with COVID-19: Initial Three Month Experience,” MedRxiv preprint, https://doi.org/10.1101/2020.08.12.20169359, Aug. 2020

Koziol, James A.  and  David P. Byar, “Percentage Points of the Asymptotic Distributions of One and Two Sample K-S Statistics for Truncated or Censored Data,” Technometrics, Vol. 17, No. 4, pp. 507-510, doi = 10.1080/00401706.1975.10489380, https://www.tandfonline.com/doi/abs/10.1080/00401706.1975.10489380, 1975

LeBlanc, Michael and Catherine Tangen, “Choosing Phase II Endpoints and Designs: Evaluating the possibilities,” Clin. Cancer Res. 18(8): 2130–2132. Published online 2012 Mar 8. doi: 10.1158/1078-0432.CCR-12-0454, 2012 Apr 15

Nikiforov, A. M.  “Algorithm AS288, Exact Smirnov Two-sample Tests for Arbitrary Distributions,” Appl. Statist, v. 43, No. 1, pp. 265-284, 1994

Nikiforov, A. M., “Subroutine GSMIRN,” statlib@lib.stat.cmu.edu 

Smarandache, Florentin, Introduction to Neutrosophic Statistics, Sitech & Education Publishing, Columbus, Ohio, 2014  


About Larry George

UCLA engineer and MBA, UC Berkeley Ph.D. in Industrial Engineering and Operations Research with minor in statistics. I taught for 11+ years, worked for Lawrence Livermore Lab for 11 years, and have worked in the real world solving problems ever since for anyone who asks. Employed by or contracted to Apple Computer, Applied Materials, Abbott Diagnostics, EPRI, Triad Systems (now http://www.epicor.com), and many others. Now working on survival analysis, epidemiology, and their applications: epidemics, randomized clinical trials, risk-based inspection, and DoE for risk equity.


Comments

  1. Larry George says

    February 23, 2022 at 4:19 PM

    Just read…AAAS Scientific Freedom and Responsibility award given to Ronald Jones…
    “Jones is being honored for his role in exposing one of the biggest medical scandals in New Zealand’s history. He was a part of a group of three Kiwi doctors who exposed ethical abuses in a study examining cervical carcinoma in situ, or CIS.”
    “In 1973, Jones joined the staff of National Women’s Hospital in Auckland as a junior obstetrician and gynecologist. At this time, Professor Herbert Green had been conducting a study into CIS that had been in progress for seven years. Despite common knowledge at the time that CIS was a precursor to cancer, Green had embarked on a study of women with CIS, without their consent, that involved merely observing rather than treating them.”
    “Sadly, many of the women subsequently developed cancer and some died.”

© 2025 FMS Reliability · Privacy Policy · Terms of Service · Cookies Policy