The Hard Thing About Safe Things

Information security needs a more accurate metaphor to represent the systems we secure. Invoking castles, fortresses and safes implies a single, or at best layered, attack surface for security experts to strengthen. This fortified-barrier mindset has led to the crunchy shell and soft, chewy center decried by those same experts. Instead of this candy shell, a method from safety engineering, System Theoretic Process Analysis, provides a way to deal with the complexity of the real-world systems we build and protect.


A Brief Background on STPA

Occasionally referred to as Stuff That Prevents Accidents, System Theoretic Process Analysis (STPA) was originally conceived to help design safer spacecraft and factories. STPA is a toolbox for securing systems that allows the analyst to efficiently find vulnerabilities and the optimal means to fix them. STPA builds upon systems theory, providing the flexibility to choose the levels of abstraction appropriate to the problem. Nancy Leveson’s book, Engineering a Safer World, details how the analysis of such systems of systems can be done in an orderly manner, leaving no possible failure unexamined. Because STPA focuses on the interactions within and across systems, it can be applied far outside the scope of software, hardware and network topologies to also include the humans operating the systems and their organizational structure. With STPA, improper user action, like clicking through a phishing email, can be included in the analysis of the system as much as vulnerable code.


Benefits of STPA

At its core, STPA provides a safety-first approach to security engineering. It encourages analysts to diagram a specific process or tool, surfacing potential hazards and vulnerabilities that might otherwise go unnoticed in the daily push toward production and deadlines. There are several key benefits to STPA, described below.


Diagrammatically, the two core pieces of systems theory are the box as a system and the arrow as a directional connection for actions. There is no dogmatic view of what things must be called. Just draw boxes and arrows to start, and if you need to break it down, draw more or zoom into a box. The exhaustive analysis works on the labeled actions and the systems’ responses to them. The networked system of systems can be approached one connection at a time. An otherwise unmanageable cascading failure becomes a sequence of steps, each a set of simultaneous states. Research has shown that this part of STPA can be done programmatically with truth tables and logic solvers. The diagram below illustrates a simple framework and good starting point for building out the key components of a system and their interdependencies.



The diagram of systems and their interconnections allows you to exhaustively check the possible hazards that could be triggered by actions.  Human interactions are modeled the same way as other system interactions, allowing for rogue operators to be modeled as well as attackers. This is an especially useful distinction for infosec, which often fails to integrate the insider threat element or human vulnerabilities into the security posture. As you see below, the user is an interconnected component within a more exhaustive depiction of the system, which can be useful to extensively evaluate vulnerabilities and hazards.
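Separately from the diagrams, the box-and-arrow model and the truth-table enumeration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the STPA procedure itself: the component names, the two context variables, and the toy `is_hazardous` rule are all hypothetical. Note that the user is just another box with an outgoing arrow.

```python
from itertools import product

# Boxes are components; arrows are (source, action, target) triples.
# All names below are hypothetical, chosen only for illustration.
ARROWS = [
    ("user", "open_attachment", "mail_client"),
    ("mail_client", "launch_file", "os"),
    ("os", "write_file", "file_server"),
]

# Boolean context variables that form the truth table columns (assumed).
CONTEXT_VARS = ("attachment_malicious", "sandbox_enabled")


def is_hazardous(action, provided, ctx):
    """Toy hazard rule standing in for an analyst's judgment."""
    if action in ("open_attachment", "launch_file"):
        return provided and ctx["attachment_malicious"] and not ctx["sandbox_enabled"]
    if action == "write_file":
        return provided and ctx["attachment_malicious"]
    return False


def truth_table(arrows, context_vars, rule):
    """Enumerate every arrow, provided or not, in every context."""
    rows = []
    for source, action, target in arrows:
        for provided in (True, False):
            for values in product((False, True), repeat=len(context_vars)):
                ctx = dict(zip(context_vars, values))
                rows.append(
                    (source, action, target, provided, ctx, rule(action, provided, ctx))
                )
    return rows


rows = truth_table(ARROWS, CONTEXT_VARS, is_hazardous)
hazards = [row for row in rows if row[-1]]

for source, action, target, provided, ctx, _ in hazards:
    print(f"{source} -> {target}: {action} (provided={provided}) in context {ctx}")
```

Because every arrow is checked against every combination of context values, nothing is skipped by intuition. In a real analysis the hazard rule would come from the analyst, or from a logic solver as the research mentioned above suggests.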


Clear Prioritization

Engineering a Safer World - and STPA more broadly - urges practitioners and organizations to step back and assess one thing: What can I not accept losing? In infosec, for example, loss could mean exfiltrated, maliciously encrypted or deleted data, or a system failure leading to downtime. During system design, if the contents of a box can be lost acceptably without harming the boxes connected to it, you don’t have to analyze it. An alternative method is to estimate the likelihood of possible accidents and assign probabilities to risks. Ranking by probability, however, makes it more likely that low-likelihood problems will be deprioritized in order to handle the higher-likelihood and seemingly more impactful events. But since the probabilities of failure for new, untested designs can’t be trusted, the resulting triage is meaningless. Instead, treating all losses as either unacceptable or acceptable forces analysts to treat all negative events seriously regardless of likelihood. Black swan events that seemed unlikely have taken down many critical systems, from Deepwater Horizon to Fukushima Daiichi. Treating unacceptable loss as the only factor, not probability of loss, may seem unscientific, but it produces a safer system. As a corollary, the more acceptable loss you can build into your systems, the more resilient they will be.

How much of a system you need to build out varies with each use case. In some cases, a simple diagram is sufficient, while in others, a more exhaustive framework is required. Depending on your specific situation, you could arrive at a system diagram that falls in between those extremes, and clearly prioritizes components based on acceptable loss, as the diagram depicts below.
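Apart from the diagram, the scoping rule above can be sketched as a graph walk: start from the boxes whose loss is unacceptable and keep only the boxes whose actions can reach them. This is a rough sketch that assumes harm travels only along the drawn arrows; every component name is hypothetical.

```python
from collections import deque

# Each arrow is (source, target): the source acts on the target.
# All component names are hypothetical, for illustration only.
ARROWS = [
    ("user", "mail_client"),
    ("mail_client", "os"),
    ("os", "customer_db"),
    ("os", "scratch_cache"),
    ("metrics_agent", "dashboard"),
]

# Boxes whose loss cannot be accepted (an assumption for this example).
UNACCEPTABLE = {"customer_db"}


def in_scope(arrows, unacceptable):
    """Return the unacceptable-loss boxes plus every box upstream of them."""
    upstream_of = {}
    for source, target in arrows:
        upstream_of.setdefault(target, set()).add(source)

    seen = set(unacceptable)
    queue = deque(unacceptable)
    while queue:
        box = queue.popleft()
        for source in upstream_of.get(box, ()):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


scope = in_scope(ARROWS, UNACCEPTABLE)
all_boxes = {box for arrow in ARROWS for box in arrow}
print("analyze:", sorted(scope))
print("skip:", sorted(all_boxes - scope))
```

Here the cache, the metrics agent and the dashboard fall out of scope because nothing they do can reach the customer database. Note that no probability appears anywhere in the calculation; the only input is the list of unacceptable losses.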



Still, accidents happen, and we must then recover. Working on accident investigation teams, Dr. Leveson found that the rush to place blame hindered efforts to repair the conditions that made the accident possible. Instead, STPA focuses the investigation on the connected systems, turning the chain of cause and effect into more of a structural web of causation. To blame the user for clicking on a malicious link and say you’ve found the root cause of their infection ignores the fact that users click on links in email as part of their job. The solutions to such problems require more than a blame, educate, blame cycle. We must look at the whole system of defenses, from their OS, to their browser, to their firewall. No longer artificially constrained to simply checking off the root cause, responders can address the systemic issues, making the whole structure more resilient.


Challenges with STPA

Although designed for safety, STPA has recently been expanded to security and privacy. Colonel William E. Young, Jr. created STPA-Sec in order to directly apply STPA to the military’s need to survive attack. Stuart Shapiro, Julie Snyder and others at MITRE have worked on STPA-Priv for privacy-related issues. Designing safe systems from the ground up, or analyzing existing systems, using STPA requires first defining unacceptable loss and working outwards. While there are clear operational benefits, STPA does come with some challenges.

Time Constraints

STPA is the fastest way to perform a full systems analysis, but who has the luxury of a full systems analysis when half the system isn’t built yet and the other half is midway through an agile redesign? It may be difficult to work as cartographer, archeologist and safety analyst when you have other work to get done. Also, who has the time to read Engineering a Safer World? To address the time constraint, I recommend the STPA Primer. Even when time can be found, the scope of a project design and the analysis to be done may look like a never-ending task. If a project has 20 services, 8 external API hits and 3 user types, the vital systems can be whittled down to perhaps 4 services and 1 user type, simply by defining unacceptable loss properly. Then, within those systems, separate the hazardous from the harmless. Now the system under analysis contains only the components and connections relevant to failure and unacceptable loss. While there may be a somewhat steep learning curve, once you get the hang of it, STPA can save time and resources while baking in safe engineering practices.

Too Academic

STPA may be cursed by an acronym and a wordiness that hide the relative simplicity beneath. The methodology may seem too academic at first, but it has been used in the real world from Nissan to NASA. I urge folks to play around with the concepts, which stretch beyond this cursory introduction. Getting buy-in doesn’t require shouting the fun-killing SAFETY word and handing out hard hats. It can be as simple as jumping to a whiteboard while folks are designing a system and encouraging a systematic discussion of the inputs and outputs of a single service. I bet a lot of folks inherently do that already, but STPA provides a framework to do this exhaustively for full systems if you want to go all the way.


From Theory to Implementation: Cybersecurity and STPA

STPA grew from the change in classically mechanical or electromechanical systems like plants, cars and rockets as they became computer controlled. The layout of analog systems was often laid bare to the naked eye in gears or wires, but these easy-to-trace systems became computerized. The hidden complexity of digital sensors and actuators was missed by the standard chain-of-events models. What was once a physical problem now had an additional web of code, wires, and people that could interact in unforeseen ways.

Cybersecurity epitomizes the complexity and the systems-of-systems problems that STPA is ideal for. If we aren’t willing to methodically explore our systems piece by piece to find vulnerabilities, there is an attacker who will. However, such rigor rarely goes into software development planning or quality assurance. This contributes to the assumed insecurity of hosts, servers, and networks and the “Assume Compromise” starting point from which we operate at Endgame. Secure software systems outside of the lab continue to be a fantasy. Instead, defenders must continually face the challenge of detecting and eradicating determined adversaries who break into brittle networks. STPA will help people design the systems of the future, but for now we must secure the systems we have.