by Michael Stamatelatos
Space flight is
one of the riskiest human endeavors. The hardware and systems used by
NASA and other space agencies are among the most complicated ever devised
by humans, and with every added component or complexity comes the added
chance of failure. It is a testament to the hard work and ingenuity of
the engineers working in the space program that such complicated systems
get launched successfully.
Even so, it is important not to simply accept every failure as part of
the price for "pushing the envelope." When risks can be
identified and assessed, engineers can devise ways to find and improve
weak points in a complex system. Indeed, risk assessment can play a crucial
part in the decision-making process, enabling managers to reduce the existing
level of uncertainty when choosing whether to go ahead with a given mission
or with specific features of a mission. Sometimes the decision involves
the need to trade off between options. For example, how can one make the
best allocation between time to perform experiments on the International
Space Station and maintenance time needed to create a safer and healthier
work environment? Quantitative risk assessment can be used to answer this
question. Not every decision so informed by risk assessment is necessarily
the right one, but using this methodology helps make better decisions
more likely.
To the people who study it professionally, risk combines the probability, or
frequency (probability per unit time), and the consequence (severity)
of an undesired event, together with the uncertainties associated with the estimated
probabilities and consequences. For a space mission such as the International
Space Station, now in Earth orbit, the risks that need to be assessed
go far beyond the potential loss or injury of astronauts, though that
is certainly the most important outcome to be avoided. Other undesired
events, or end states as they are called in risk assessment, include such
mishaps as damage to one or more of the station modules, the failure of
an important system, or the inability to complete a scheduled mission.
What's more, in a system as complex as the ISS, almost anything
that could go wrong can be assessed to determine whether it is linked
with one or more of the undesired end states. Take, for instance, the
temporary addition of a logistics module to the station. This module,
which is used to ferry new material and remove old equipment, has its
own power and life support system. But how will these integrate with the
systems aboard the space station? And how likely is a mishap while the
module is installed? Those are the kinds of questions a robust quantitative
risk analysis can help answer.
NASA has adopted a "continuous risk management" process
for all its programs and projects. This process begins with the identification
and analysis of program or project risks that impact success criteria.
The risk management process continues with risk analysis, planning, tracking,
and control. All unacceptable risks are dealt with before a project or
program can proceed.
[Photo caption: Every event, such as this space walk by an astronaut aboard the International Space Station, can be analyzed using probabilistic risk assessment.]
One analytic tool that helps in identifying and analyzing these kinds
of risks quantitatively is a modeling process known as probabilistic risk
assessment (PRA). The most robust use of PRA is to make risk comparisons among
competing choices or to identify major contributors to the overall risk
of a given choice. Once those contributors are identified, decision makers can choose either
to eliminate a given form of risk at all cost (if the likelihood of the
potential end state, such as loss of life, is too great to bear) or to
reduce the risk on the basis of a risk-benefit tradeoff. Not all forms of risk
can be eliminated, however. Taking care of one form of risk can cause
other types of risk to arise or increase.
Engineers use a probabilistic risk assessment to get answers to three
basic questions: What can go wrong? What is the likelihood that such an
outcome will occur? And what are the consequences if the outcome occurs?
To find the answers to those questions using a PRA, engineers follow a
multistep process.
First, engineers must identify the objective of the PRA within the context
of the mission and the potential detrimental end states (usually characterized
by levels of severity) that they want to avoid in order to ensure mission
success. Engineers assessing the potential risks in a system need to immerse
themselves in design, test, and operation information, which includes talking
with experts who have designed, tested, and operated the equipment, or
even physically inspecting the system if at all possible. It's important
to remember that the system being assessed may be quite different from
its blueprints.
The next step is to identify the events or failures that can lead to the
defined end states. From the outset, it isn't always obvious what
can cause an end state to happen. In systems as complex as a space mission,
components may be linked in indirect, yet crucial ways. To trace failures
back to every potential trigger, engineers using PRA employ a technique
called a master logic diagram. The MLD is a special type of fault tree,
with the ultimate failure at the top and specific causes that can trigger
this top failure at the base (basic events). Then each of these causes
is examined, and all the events that can link them to the top failure event
are traced through a series of logic steps (gates).
Once the MLD has been fully developed, the next task of the probabilistic
risk assessment becomes one of modeling each possible accident or mishap
chain of events, or scenario. This step starts with a trigger event, and
follows the sequence of intermediate (pivotal) events that can eventually
lead to an end state.
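The scenario modeling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trigger frequency, the pivotal-event names (`leak_detected`, `module_isolated`), and their failure probabilities are hypothetical, the pivotal events are treated as independent, and every path evaluates every pivotal event (a real event tree usually prunes branches that no longer matter).

```python
from itertools import product

def scenario_frequencies(initiator_frequency, pivotal_failure_probs):
    """Return {outcome tuple: frequency} for every path through the tree.

    Each outcome tuple holds one boolean per pivotal event
    (True = the pivotal event succeeded, False = it failed).
    """
    names = list(pivotal_failure_probs)
    paths = {}
    for outcomes in product((True, False), repeat=len(names)):
        frequency = initiator_frequency
        for name, succeeded in zip(names, outcomes):
            p_fail = pivotal_failure_probs[name]
            frequency *= (1.0 - p_fail) if succeeded else p_fail
        paths[outcomes] = frequency
    return paths

# Hypothetical numbers, for illustration only.
INITIATOR_PER_YEAR = 0.01           # frequency of the trigger event
PIVOTAL = {"leak_detected": 0.1,    # probability that each pivotal event fails
           "module_isolated": 0.05}

paths = scenario_frequencies(INITIATOR_PER_YEAR, PIVOTAL)
for outcomes, frequency in paths.items():
    label = ", ".join(
        f"{name}={'ok' if ok else 'FAILED'}" for name, ok in zip(PIVOTAL, outcomes)
    )
    print(f"{label}: {frequency:.2e} per year")
```

Each printed line is one scenario. The analyst would then assign each path to an end state; the path frequencies always add back up to the trigger frequency.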
Each pivotal event in the scenario is a branch point leading to success
or failure; the likelihood of failure is evaluated using a fault tree.
The fault tree uses logic symbols (gates) and intermediate events to link
the top event with some basic components that have failed (basic events).
A basic event might be a failed switch or a capacitor. And if you have analyzed the
system to its fullest extent, you'll be able to trace through the
fault tree how the failure of components can, under the right combination
of circumstances, lead to the top (pivotal) event in the mishap scenario,
which may be the failure of a crucial system that can prevent or mitigate
the progression of the scenario.
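A minimal sketch of how a fault tree turns basic-event probabilities into a top-event probability might look like the following. It assumes the basic events are statistically independent, and the tree itself (a primary unit that fails if its switch or capacitor fails, backed up by a redundant unit) and all of the numbers are hypothetical.

```python
from math import prod

def and_gate(probabilities):
    """Output fails only if every input fails (independent inputs assumed)."""
    return prod(probabilities)

def or_gate(probabilities):
    """Output fails if at least one input fails (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical basic-event probabilities, for illustration only.
P_SWITCH = 1e-3      # switch fails
P_CAPACITOR = 2e-3   # capacitor fails
P_BACKUP = 5e-2      # redundant backup unit fails

# The primary unit fails if either of its components fails.
p_primary = or_gate([P_SWITCH, P_CAPACITOR])
# The system (the pivotal event) fails only if primary and backup both fail.
p_top = and_gate([p_primary, P_BACKUP])

print(f"primary unit failure: {p_primary:.6f}")
print(f"top (pivotal) event:  {p_top:.7f}")
```

The resulting top-event probability is what would be attached to the failure branch of the corresponding pivotal event in the mishap scenario.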
Up to this point, the analysis has been essentially qualitative: understanding
the entire system and modeling potential event sequences and system components
that could lead to an undesirable end state. But that sort of analysis
has limited value in helping decision makers determine how to proceed
or assisting engineers in identifying and remedying weak points in the
system. That's why the next step is to quantify all probabilities
in the model described so far. These numbers can come from past experience,
databases, or some other estimates including expert judgment, and there's
a level of uncertainty associated with each piece of data. Essentially,
engineers wind up with a probability, bounded by a range of error, for the
failure of each part of the system analyzed by this modeling method.
At the end, you can calculate the relative contribution of various systems
and different end states to the overall risk value. You may find out,
for instance, that two or three systems contribute up to 90 percent of
the total risk. In an outcome such as that, the calculation becomes a
red flag to engineers that those systems should be reexamined, modified,
or perhaps redesigned.
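The relative-contribution calculation is simple arithmetic, as the sketch below suggests. The system names and risk values are hypothetical, chosen only so that the two largest contributors add up to 90 percent of the total.

```python
def risk_shares(risk_by_system):
    """Return each system's fraction of total risk, largest first."""
    total = sum(risk_by_system.values())
    shares = {name: risk / total for name, risk in risk_by_system.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))

# Hypothetical per-system risk values (same units), for illustration only.
RISK_BY_SYSTEM = {
    "thermal": 5e-5,
    "life_support": 6e-4,
    "communications": 5e-5,
    "power": 3e-4,
}

shares = risk_shares(RISK_BY_SYSTEM)
cumulative = 0.0
for name, share in shares.items():
    cumulative += share
    print(f"{name:15s} {share:6.1%}  (cumulative {cumulative:6.1%})")
```

Reading down the cumulative column shows how few systems account for most of the risk, which is the red flag the analysis is meant to raise.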
The risk analysis isn't complete until uncertainty and sensitivity
analyses are performed, since these help identify how trustworthy the
analysis results are. One way to propagate the uncertainties through the
PRA model is to run a Monte Carlo simulation,
essentially letting a computer run through the model thousands of times.
For each analysis parameter needed, a random number generator selects
a value from a probability distribution associated with that parameter.
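The Monte Carlo procedure can be sketched as follows, using only the Python standard library. Several things here are assumptions rather than anything stated in the article: the toy system model in `top_event`, the choice of lognormal distributions described by a median and an "error factor" (95th percentile divided by median), and every numeric value.

```python
import math
import random
import statistics

def top_event(p_switch, p_capacitor, p_backup):
    """Hypothetical model: primary unit (switch OR capacitor) AND backup."""
    p_primary = 1.0 - (1.0 - p_switch) * (1.0 - p_capacitor)
    return p_primary * p_backup

def sample_lognormal(rng, median, error_factor):
    """Draw one value from a lognormal given its median and error factor."""
    sigma = math.log(error_factor) / 1.645  # 1.645 = z-score of the 95th percentile
    return min(rng.lognormvariate(math.log(median), sigma), 1.0)  # cap at 1

# Hypothetical (median, error factor) for each uncertain parameter.
PARAMETERS = {
    "p_switch": (1e-3, 3.0),
    "p_capacitor": (2e-3, 3.0),
    "p_backup": (5e-2, 2.0),
}

def propagate(n_trials=10_000, seed=0):
    """Run the model n_trials times with freshly sampled inputs each time."""
    rng = random.Random(seed)
    results = sorted(
        top_event(**{name: sample_lognormal(rng, median, ef)
                     for name, (median, ef) in PARAMETERS.items()})
        for _ in range(n_trials)
    )
    return {
        "mean": statistics.fmean(results),
        "p05": results[int(0.05 * n_trials)],
        "p50": results[n_trials // 2],
        "p95": results[int(0.95 * n_trials)],
    }

summary = propagate()
for key, value in summary.items():
    print(f"{key}: {value:.2e}")
```

The spread between the 5th and 95th percentiles gives a rough picture of how much the input uncertainties matter to the final answer.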
[Photo caption: In systems as complex as the International Space Station, shown here in Earth orbit, adding new components carries the risk of inadvertently disabling older ones.]
Sometimes it is necessary to conduct a sensitivity analysis, in which
one varies the value of one analysis component, keeping all others fixed,
in order to see how greatly that component value affects the
overall analysis results. Engineers can thus uncover whether uncertainties
in the analysis pertain to systems that are most mission critical.
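A one-at-a-time sensitivity study of this kind could be sketched as below. The system model, the baseline probabilities, and the choice to swing each input by a factor of 10 in each direction are all hypothetical and serve only to illustrate the mechanics.

```python
def top_event(p):
    """Hypothetical model: primary unit (switch OR capacitor) AND backup."""
    p_primary = 1.0 - (1.0 - p["switch"]) * (1.0 - p["capacitor"])
    return p_primary * p["backup"]

def one_at_a_time(model, baseline, factor=10.0):
    """Vary each input down and up by `factor`, holding the others fixed.

    Returns the change in model output (high minus low) for each input.
    """
    swings = {}
    for name, value in baseline.items():
        low = {**baseline, name: value / factor}
        high = {**baseline, name: min(value * factor, 1.0)}  # keep it a probability
        swings[name] = model(high) - model(low)
    return swings

# Hypothetical baseline failure probabilities, for illustration only.
BASELINE = {"switch": 1e-3, "capacitor": 2e-3, "backup": 5e-2}

swings = one_at_a_time(top_event, BASELINE)
for name, swing in sorted(swings.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} output swing: {swing:.2e}")
```

Inputs with the largest swing are the ones whose uncertainty deserves the most attention.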
Finally, probabilistic risk analysis can include evaluation of the relative
risk importance measures associated with various components and systems
within the overall system being analyzed. These risk importance measures
can be used in risk rankings that can be presented to a manager to help
in the decision-making process.
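The article does not name specific importance measures. As an illustration, the sketch below computes two that are widely used in PRA practice, Fussell-Vesely importance and risk achievement worth, for a hypothetical three-component model with hypothetical probabilities.

```python
def top_event(p):
    """Hypothetical model: primary unit (switch OR capacitor) AND backup."""
    p_primary = 1.0 - (1.0 - p["switch"]) * (1.0 - p["capacitor"])
    return p_primary * p["backup"]

def importance_measures(model, baseline):
    """Compute two common importance measures for each component.

    Fussell-Vesely: fraction of baseline risk that disappears if the
    component never fails. Risk achievement worth: factor by which risk
    grows if the component is certain to fail.
    """
    base_risk = model(baseline)
    measures = {}
    for name in baseline:
        risk_if_perfect = model({**baseline, name: 0.0})
        risk_if_failed = model({**baseline, name: 1.0})
        measures[name] = {
            "fussell_vesely": (base_risk - risk_if_perfect) / base_risk,
            "risk_achievement_worth": risk_if_failed / base_risk,
        }
    return measures

# Hypothetical baseline failure probabilities, for illustration only.
BASELINE = {"switch": 1e-3, "capacitor": 2e-3, "backup": 5e-2}

measures = importance_measures(top_event, BASELINE)
ranking = sorted(measures, key=lambda n: measures[n]["fussell_vesely"], reverse=True)
for name in ranking:
    m = measures[name]
    print(f"{name:10s} FV={m['fussell_vesely']:.3f}  RAW={m['risk_achievement_worth']:.1f}")
```

A ranked list like this is the kind of summary that can be handed to a manager: it shows which components drive the current risk and which would hurt most if they were lost.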
One could make the mistake of believing that quantitative risk assessments
are impossible, or at least not very useful, when many input
data are unavailable or have large uncertainties. In fact, that is precisely
the situation when a PRA is most useful. We need a PRA when we are faced
with the greatest uncertainty. Why would we need a PRA if all
the necessary information were available? And if an area of analysis has
fairly large levels of uncertainty, it is a sign that more research needs
to be done on that area's rate of failure.
Probabilistic risk assessments are useful in every phase of a mission
life cycle, not just at design or before launch. A PRA performed in the
design phase can help identify the risks associated with systems and components
and with technological options. Quantitative risk associated with different
design or technological options can be compared and the results used as
input into the management decision and tradeoff process. This can be done
even if some mission-specific data do not yet exist.
During operation, a PRA can help optimize resource allocations and can
help predict detrimental effects to the program. When it's time
to upgrade, either because of aging and obsolete equipment or to add
an enhancement to the system, PRA can identify technologically acceptable
options that minimize risks and provide a consistent and unbiased assessment
tool to evaluate the risks and benefits of each upgrade.
In the recent past, for example, a shuttle flight to resupply the ISS
presented the following choice between two safety improvement options:
There was room to bring up either a new window cover for protection against
micrometeorite impact or an additional carbon dioxide removal system, but
not both. It was critical to find out which addition would best improve
safety on board the station. A PRA showed that the new window cover
could wait because the carbon dioxide remover provided a greater safety
value (more risk reduction) to the ISS in the near term.
When an asset such as a satellite is at the end of its useful life, its
disposal must be carried out safely and cost-effectively. A PRA can help
find dismantling and disposal options that minimize risk.
It may appear that the most risk-free thing to do would be to do nothing
at all. But that is not true because there are risks in doing nothing.
While there are risks in exploring space, there are also risks in not
doing it. Moreover, our nation has made a commitment to exploring space,
so risk (risk of mission failure, risk affecting human lives) cannot
be avoided. Thanks to the analysis tools we have at hand, however, we
can help decision makers determine what options or tradeoffs are acceptable,
and how to push forward the options involving the least amount of risk
and cost.
Michael Stamatelatos is director of the Safety and
Assurance Requirements Division of NASA's Office of Safety and Mission
Assurance in Washington, and is past chair of ASME's Safety Engineering
and Risk Analysis Division.
© 2005 by The American Society
of Mechanical Engineers