In 1995, the United States Department of Energy (DoE) funded research at Princeton University’s Plasma Physics Laboratory (PPPL). PPPL was developing plasma fusion techniques, and the research in question focused on quality assurance within the laboratory. It investigated the utility of a new type of quality assurance: one that was performance-based.
What was different about this approach? Instead of focusing on the actions of the laboratory staff, the research focused on the outcomes of those actions. When audited, laboratory staff had to explain what they had done to improve the ‘quality’ of their processes. In the past, staff had only to explain how they had complied with a standard or guidebook and what they did when they observed breaches. The results of focusing on outcomes were astonishingly good, and performance-based auditing was introduced permanently. Quality assurance was now conducted by cross-functional teams and not just ‘quality’ personnel.
And why is this relevant? For many industries that deal with quality, safety and reliability, there has long been a shift towards compliance-based frameworks. Not performance-based ones. And as the PPPL example illustrates, compliance does not ensure the best outcomes. Compliance will yield some baseline level of reliability and quality. But it will never result in ‘highly reliable’ or ‘high quality’ products. For that, you need a particular type of organizational culture.
This is the second in a series of articles examining emerging technologies and reliability using small satellites as a platform for discussion. Small satellites are (as their name suggests) smaller versions of their larger historical cousins. But they are fundamentally different in how they are designed. They are miniaturized, need to deploy in space by themselves, and can form ‘constellations’ of satellites that collectively provide a single function.
So why small satellites? This series was initiated by my recent involvement in helping members of the small satellite industry improve their processes. And once again, I was confronted by the same issues I have seen in many other emerging technologies as people try to get their product to market. This prompted me to gather (at least some of) my thoughts and create this series.
Small satellites have been born into a compliance-based framework for reliability and mission success. This article discusses some of the many issues with a compliance-based framework as it relates to reliability, safety and risk. We need to talk about compliance first because it sets the scene for ‘what is wrong’ – which subsequent articles then address. All of this is based on my recent experience helping stakeholders in the small satellite industry, where many of the challenges we often see in emerging technologies are repeating themselves.
So let’s start with listing some observations I made during this experience.
Observation #1: Today’s satellite (and many other technologies’) mission assurance paradigms focus on compliance, not critical thinking or engineering judgment.
Perhaps the most prominent document regarding satellite mission assurance is Aerospace Corporation’s Mission Assurance Guide (MAG). The MAG proposes a ‘Mission Assurance Baseline (MAB)’ which is:
the corporate, configuration-controlled set of tasks performed to increase confidence toward the goal of achieving mission success for a satellite system and associated ground systems.
Doesn’t this sound fine? It does, until it becomes ‘the’ mission assurance strategy. Why? Because it focuses on tasks – not on the probability of mission success. Sure, there may be a link between these tasks and mission success in some scenario somewhere. But what about your satellite? And what if the task is done, but not done well? The MAB doesn’t care – it just wants the task done.
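To make that distinction concrete, here is a minimal sketch (Python, with entirely made-up task names, subsystem names and numbers – nothing below comes from the MAG or any real program). A checklist can be 80 per cent ‘done’ while the estimated probability of mission success tells a very different story:

```python
# Illustrative sketch only: the task names, subsystem names and numbers below
# are invented for this example, not taken from the MAG or any real program.

# The compliance view: count how many baseline tasks have been ticked off.
mission_assurance_tasks = {
    "parts_screening": True,
    "worst_case_analysis": True,
    "thermal_vac_test": True,
    "fmeca": True,
    "vibration_test": False,
}
tasks_complete = sum(mission_assurance_tasks.values()) / len(mission_assurance_tasks)
print(f"Baseline tasks complete: {tasks_complete:.0%}")

# The outcome view: estimate the probability the mission actually succeeds,
# assuming (hypothetically) the subsystems are in series - all must work.
subsystem_reliability = {
    "power": 0.95,
    "comms": 0.97,
    "attitude_control": 0.90,
    "payload": 0.92,
}
p_mission_success = 1.0
for r in subsystem_reliability.values():
    p_mission_success *= r
print(f"Estimated probability of mission success: {p_mission_success:.2f}")
```

Eighty per cent of the checklist says nothing about whether the satellite will work. The second number is the thing we actually care about, however hard it is to estimate.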
NASA likewise issues a Mission Assurance Requirements (MAR) document for each satellite. Even a Class D (or experimental) satellite MAR document is over 50 pages long. It contains a similar set of things to do.
A compliance culture is (unfortunately) thriving because of documents like these. Engineers devote all available energy to completing a checklist of activities instead of brainstorming how to make systems more reliable. They don’t have time.
But there will be some who disagree. Many hard-working satellite designers and engineers would justifiably point out that a lot of thinking goes into the design – making the criticism above unfair. But the design process focuses on the physics and logic of success. Design is a difficult-to-define combination of art, science and technological expertise that cannot be easily verified. The art of ‘making something that can work’ is not the same as ‘making something that will work.’ So it doesn’t matter how long and hard you think about making something that can work if you don’t think long and hard about making something that will.
Reliability, availability and mission assurance revolve around the physics and logic of failure. It is this domain that tends to be outsourced to a compliance framework, with typically underwhelming results (as demonstrated by the examples littered throughout the remainder of this article and this series) that extend well beyond the domain of small satellites.
Observation #2: Compliance is not assurance
Focusing on a set of activities or standards doesn’t guarantee a reliable or available system. Don’t misunderstand me here: there are plenty of customers and agencies that passionately believe that compliance is assurance. But the compliance culture this creates is really, really bad.
For example, passenger vehicles are tested to ensure they comply with emission standards. The idea is that if a vehicle passes an ‘emissions test,’ it complies with emission standards (think about the MAB definition above). But ‘bad’ organizations will inherently fight this link. Volkswagen very famously installed ‘defeat devices’ on its highly polluting vehicles to pass emissions testing. They did not want to use Mercedes-Benz components, so they designed their own catalytic converter – and failed to produce an effective one. The only way they could use their substandard catalytic converter was to cheat the test. And they did.
The compliance culture enabled Volkswagen’s behavior. Their vehicles were able to identify when they were being tested for emissions. When they were, they switched to a much less powerful operating mode. This less powerful mode would not be tolerated by customers but allowed the vehicle to pass the test.
We have this continuing evolutionary battle between those who measure compliance and those who wish to defeat it. An arms race where the better funded commercial entities will typically win. But it gets much worse.
Standards and best practices don’t work with emerging technologies. This is because emerging technologies pioneer new processes and procedures that standards and best practices haven’t evolved to accommodate. But engineers are so suffocated by standards, guidebooks and textbook doctrine that they often feel powerless to challenge them. Engineering standards (in particular) are based on consensus and review, which are inherently retrospective. Standards can quickly diverge from best practice. Everyone knows this.
Elon Musk built Tesla around a vision to fundamentally change motor vehicles. Tesla vehicles have the ‘autopilot’ feature, which introduces a level of autonomy but still relies on the driver. There was substantial criticism, and there were predictions of backlash associated with the inevitable first fatal crash.
That crash occurred in May 2016. But the National Highway Traffic Safety Administration (NHTSA) identified that for 43,781 Tesla vehicles, engaging the autopilot made them 40 per cent less likely to be involved in an accident (when compared under similar driving conditions). This is an astounding improvement in safety. But we are being told to be afraid of machines with autonomy.
On the topic of vehicles, the United States Federal Motor Vehicle Safety Standards (FMVSS) were all enacted through the rear vision mirror. From the first standard relating to seat belts in 1967, they have all mandated something that manufacturers had already come up with, tested, commercialized and offered as standard equipment – before it became ‘the standard.’
Far from leading manufacturers to safety, regulators and standards tend to document what manufacturers have done well in the past. This (of itself) is not a robust assurance framework. Nor does it work with emerging technology.
Observation #3: Compliance is easy, thinking is hard
Lawmakers, regulators and customers are often frustrated when they try to impose some assurance framework onto the designer of a new product, only for the product to be unreliable and unsafe when they get it. This is always because these lawmakers, regulators and customers only want to deal with compliance.
It is easy to demonstrate compliance (or otherwise) in a court of law when there is a contract dispute. We can easily decide not to purchase something, or levy a fine on an organization that does not comply. Governmental customers can review a table of standards and place check marks against those that a system doesn’t satisfy. And there are some instances where compliance is an entirely satisfactory approach. But for the same reason we don’t expect a builder to rely on nothing but a hammer to build a house, we need a suite of techniques beyond compliance to assure a mission.
Contractors become familiar with governmental customer behavior, and they accommodate what they expect the customer to do – not what the contract says the customer will reward. If a governmental customer prioritizes compliance, then they will get a completed checklist (not a reliable system).
The Three Mile Island, Chernobyl and Fukushima nuclear power plants were all ‘safe’ on the day of their respective disasters. As were the Deepwater Horizon, the Exxon Valdez and too many other things that have exploded or sunk. Regulatory frameworks permitted them to operate because they were compliant. So this meant they were ‘safe’ and ‘reliable’ at the time.
This may seem contradictory – how can something be ‘safe’ and ‘reliable’ on the day it exploded, melted down or otherwise failed catastrophically? Because society, and the lawmakers who represent it, ultimately use the easier language of compliance. We only use binary and absolute ideas of safety and reliability. If it complies, it is ‘safe’ or ‘reliable.’
Safety, availability, risk and reliability are measures – not absolutes.
The fact that, in hindsight, something we once thought was ‘safe’ and ‘reliable’ wasn’t on the day it catastrophically failed is irrelevant. Hindsight does nothing to undo the flawed design process or the short-sighted management decisions that prioritized saving developmental money over long-term performance. There is still a smoking hole in the ground where that thing used to be. There is still a massive oil slick in the ocean. The people who were killed will never come back to life.
No one has hindsight about a system they are about to purchase. Customers agree to purchase a system that is ‘safe’ and ‘reliable’ at the point of sale. So we must continue to focus our energies on designing ‘safe’ and ‘reliable’ systems. But we must stop using ‘safe’ and ‘reliable’ as absolutes.
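If it helps to see what ‘a measure, not an absolute’ looks like, here is a minimal sketch (Python; the failure rate is hypothetical and the constant-hazard-rate assumption is a deliberate simplification). The same system has a different reliability for every mission duration – there is no point at which it flips from ‘reliable’ to ‘unreliable’:

```python
import math

# Minimal sketch, assuming a constant failure rate (exponential model).
# The rate below is hypothetical and not from any real system.
failure_rate_per_year = 0.05

def reliability(mission_years: float) -> float:
    """Probability of surviving the mission without failure: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate_per_year * mission_years)

for years in (1, 3, 5, 10):
    print(f"{years:>2}-year mission: R = {reliability(years):.3f}")
```

The only honest statements are of the form ‘this system has roughly a 78 per cent chance of surviving a five-year mission’ – never ‘this system is reliable.’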
Observation #4: Compliance is supposed to set a limit on how ‘bad’ an organization can be
Obvious, right? ‘Good’ organizations are not the focus of compliance-based assurance frameworks. Compliance makes sure there is at least some baseline of mandatory activities, under the (often unfounded) assumption that this can limit how unreliable or unsafe a system can be.
Any set of ‘mission assurance tasks’ is only intended to address ‘bad’ organizations. The ‘good’ organizations, who inherently share the principles behind the compliance framework, do not need external compliance. It is the ‘bad’ organizations that are locked in a perpetual war to defeat the intent of standards and testing.
But resources are not limitless. If the mission assurance tasks are not best practice (for the reasons discussed above), then the resources they consume actively work against their intent.
Observation #5: … but compliance actually imposes a limit on how ‘good’ an organization can be
There is a school of thought (regarding reliability) that believes all potential failures can be prevented by doing something during design or manufacture. This is true in principle – provided you know every potential failure and you know how to prevent each of them in all possible scenarios.
This is impossible – even for mature technologies. We are surprised time and time again by how seemingly benign design changes to something that previously worked introduce new ways for the system to fail.
And even if we knew all of these failure mechanisms, we could not successfully address all of them. We need to address the ones that will occur most often and have the biggest negative impacts. But without a detailed understanding of the physics or logic of failure, we can never accurately predict the hazard rate for every failure mechanism. So we can never be totally sure which ones to address first (if such a list of failure mechanisms even exists).
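To make the prioritization problem concrete, here is a minimal sketch (Python; every failure mechanism, rate and consequence score is made up for the example). Ranking failure mechanisms by ‘how often multiplied by how bad’ is trivial once you have the numbers – the trouble is that the hazard rates are precisely the numbers an emerging technology cannot pin down:

```python
# Minimal sketch of the prioritization problem. Every failure mechanism, hazard
# rate and consequence score below is invented for illustration - in practice
# the rates are exactly the numbers we struggle to know.

failure_mechanisms = [
    # (mechanism, estimated failures per year, consequence score 1-10)
    ("Reaction wheel bearing wear", 0.10, 6),
    ("Solar array hinge jams",      0.02, 9),
    ("Radio PA solder fatigue",     0.05, 7),
    ("Memory corruption (SEU)",     0.30, 3),
]

def expected_risk(rate_per_year: float, consequence: float) -> float:
    """A simple risk measure: how often it happens multiplied by how much it hurts."""
    return rate_per_year * consequence

ranked = sorted(failure_mechanisms, key=lambda fm: expected_risk(fm[1], fm[2]), reverse=True)
for mechanism, rate, consequence in ranked:
    print(f"{expected_risk(rate, consequence):5.2f}  {mechanism}")

# The ranking is only as good as the rate estimates. Halve or double any of the
# rates above (well within typical uncertainty) and the order changes.
```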
These challenges are insurmountable for emerging technologies. You simply cannot have this level of knowledge during the first iteration of design and manufacture.
Yet catastrophic failures tend to see us expand regulatory and compliance frameworks to try to stop them from occurring again. Why do we keep doing this? Well, it is easy (see Observation #3). And there are several other reasons.
Firstly, and as discussed above, a compliance approach only works if the regulators are aware of every way a system can fail.
The ways in which emerging technologies (such as small satellites) can fail evolve as quickly as the technologies used to design them do.
Industry standards and other consensus-driven approaches to ‘best practice’ simply cannot keep up (more on that below).
Secondly, regulators cannot regulate critical thinking through compliance approaches. For example, a failure mode, effects and criticality analysis (FMECA) will always help improve system reliability when done well. A FMECA is essentially a group thinking session on the ways a system can fail. It tells designers what to ‘look out for’ during the design process. But how does a regulator assess whether a FMECA is robust? The answer is – with great difficulty. So too often we see regulators (and customers) simply ask for things like FMECAs to be done, but not invest any time in ensuring they are done well.
Thirdly, a compliance-based framework promotes a culture of compliance – not robust engineering principles (see the Volkswagen example described above).
The focus on compliance has a disproportionate effect on small organizations. Aerospace’s MAG is 547 pages long and outlines hundreds of mission assurance tasks and hundreds of standards. The MAG is intended to be broadly applicable to all aerospace applications. This can be accommodated within multi-billion-dollar, multi-year projects that have large engineering workforces. Small, agile organizations that rely on being small and agile to produce things like small satellites cannot accommodate it.
The comparatively undersized but nimble small satellite manufacturers have to choose:
adherence to industry mission assurance standards or the use of engineering judgment. They can’t do both.
A customer that insists on a suite of standards and compliance tasks has just eliminated engineering judgment. They have just communicated to the manufacturer what they value. And it is the badge, certificate or sticker that says ‘compliant.’ Organizations that cannot adhere to a suffocating compliance framework sometimes walk away from it, leaving nothing in its place. Many organizations do not have the ability (or cannot see that the customer cares) to individually discern the intent of each mission assurance task, work out which are most applicable to the emerging-technology design problem, and design a more reliable system. This must change.
This brings us to the fourth ramification of a singular focus on compliance based assurance:
compliance replaces critical thought.
And while most standards have obligatory disclaimers about how the standard should be considered ‘guidance,’ overly bureaucratic customers often walk away from systems that are ‘non-compliant.’ Standards and mission assurance guides are treated as mandatory, so they become mandatory. And critical thinking disappears from all perspectives.
And perhaps the most important feature of compliance is:
Compliance (by definition) replaces continual improvement.
Don’t waste your time arguing otherwise. Because once you have invented a binary measure of ‘compliant’ or otherwise, any money spent making something more reliable or safer is money lost. The product will still be ‘compliant’ – just more expensive. And customers who want compliance … want compliance.
Even though many standards describe themselves as ‘guidance’ only, they become mandatory in a compliance framework. And standards are too slow to improve over time.
And then there is the request by a customer to ‘tailor an extant standard’ in lieu of incorporating a better process. Why? Because the customer wants the new system to be compliant. Tailoring a standard is a form of mental gymnastics the customer is happy to perform to achieve this. And it happens at the expense of better processes.
Continual improvement (if it happens at all) occurs despite a fixation on standards and compliance-based assurance frameworks – not because of them.
Observation #6: The wrong people are making the wrong standards
In the fourth edition of his book What Went Wrong? Case Histories of Process Plant Disasters, Trevor Kletz quoted the following from an investigation into an accident:
Do not say “It must be safe because we are following the regulations and industry standards.” They may be out of date or not go far enough.
Kletz’s book deals with process plants that involve heat exchangers, valves and many other components whose technology has existed for decades. When compared to emerging technologies such as small satellites, process plants are ‘relatively ancient.’ So if the standards for process plants risk being irrelevant, what does that say about the standards for a relatively new technology?
The uncomfortable reality is that standards are often irrelevant, even when they must be treated as mandatory. World War II pilot and ace Douglas Bader (who had a very successful military career after having both his legs amputated following a flying accident) famously said:
Rules are for the guidance of wise men and the obedience of fools.
But the reality becomes more uncomfortable when we seriously examine those responsible for standards. Standards are not a core focus of the technological pioneers who should be authoring them. Standards involve committees. The British Standards Institution states:
[r]epresentatives of organizations having an interest and expertise in the subject matter are brought together … to form a technical committee to draw up the standard, … Typically, [technical committees] comprise representatives of industry bodies, research and testing organizations, local and central government, consumers and standards users.
So where emerging technology involves pioneers who are separating themselves from the status quo, standards involve everyone – which guarantees a status quo. And standards bodies are (let’s be honest) often made up of out-of-touch people fighting for relevance in a weird, closed-off society of resume fillers. Is this too harsh? I have not met any wildly successful technology pioneers who see the value in taking time away from their (now high-paying) jobs to spend a couple of weeks talking about dendritic growth in semiconductors in some far-off hotel or conference center.
And even if you were able to attract the true pioneers of a new technological field, consensus limits the extent to which they are heard. Consensus is great for things that are already widely and well understood. Not for pioneering something new.
Emerging technology is driven by the exceptional few with the drive to do something both different and better.
Consensus standards are driven by the average many who chronicle the things that have worked well in the past.
If not compliance, then what do we need?
The next article in this series tackles this question. Compliance might be the law, but it is not the answer. Too many catastrophes and failures have occurred in regulated industries for there to be any serious debate about this. The conditions compliance needs in order to work – complete knowledge of every way a system can fail, and standards that keep pace with the technology – are precisely the conditions that never hold.
But despite so much evidence to the contrary, the ease of compliance means it is still around. And this is not to say that there is no role for standards and guidebooks. There clearly is. But their scope is limited by experience and history. What if we are building something that is new, that no one has ever ‘experienced’? What typically happens is we pretend this is not the case and use our backwards-looking standards and guidebooks to try to look forwards. It does not work. Neither does driving by looking in your rear vision mirror. Sooner or later you are going to crash.
The next article talks about the first thing we need to do when escaping the compliance prison. That is to define what ‘good’ is. Or perhaps better yet, what is the metric we are going to use to measure ‘good’? If we are talking mission assurance, then what is the mission? What needs to happen for us to say the mission was a success or otherwise? If we are talking about reliability and small satellites, is ‘success’ declared if a certain satellite works, or if a constellation successfully sends us the data we asked for?
This might sound basic, and it is. The problem is that people think it is ‘so basic’ that entire design teams completely miss this step. And when they do, they forget why they are running a particular test, imposing a particular design rule and so on.
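To illustrate why that definition matters, here is a minimal sketch (Python; the constellation size and per-satellite probability are made up). The same hardware gives a very different ‘probability of success’ depending on whether the mission needs 10 of 12 satellites working or only 8 of 12:

```python
from math import comb

# Minimal sketch of why the definition of 'success' matters. The constellation
# size and the per-satellite probability below are hypothetical.

def constellation_success(n: int, k: int, p_satellite: float) -> float:
    """Probability that at least k of n independent, identical satellites are working."""
    return sum(comb(n, i) * p_satellite**i * (1 - p_satellite)**(n - i)
               for i in range(k, n + 1))

p = 0.85  # hypothetical probability that an individual satellite works
print(f"A single satellite works:      {p:.2f}")
print(f"At least 10 of 12 satellites:  {constellation_success(12, 10, p):.2f}")
print(f"At least 8 of 12 satellites:   {constellation_success(12, 8, p):.2f}")
```

Define ‘good’ first, and every test, design rule and mission assurance task afterwards has something to be measured against.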
Until next time!