Resilience, Recovery and Antifragility in Security

Resilience and recovery are undervalued aspects of security. When it comes to security, popular culture and politicians focus too much on control. They should be focusing on resilience and recovery. Control strategies contribute to security theater. They give the impression that someone is in charge and that things are under control, when that is always a bit of a facade. Our ability to recover is more important and more effective, especially when it comes to extremely rare events. Take two kinds of failure of control: school shootings and terrorism. Fortunately, both are extremely rare.

Being able to respond quickly to school shootings appears to have reduced their impact. Control-based strategies haven’t been successful in this regard. Limiting access to guns sounds good, but the proliferation of guns in the United States makes it unlikely that a ban would be effective. There is a lot of emotional rhetoric around assault rifle bans, but the reality is that even those bans wouldn’t have covered the weapon used by Adam Lanza, a weapon his mother acquired legally.

Another control-based topic is how we treat the mentally ill. I’ll agree that our society’s way of dealing with the mentally ill is ineffective, occasionally inhumane, and often… dreadful. But that doesn’t mean a policy of proactively jailing potential school shooters would be effective. Despite the growing number of shootings, there just isn’t enough data and actual science on the subject to make good decisions.

We have made progress on becoming more resilient. New practices for schools and responders appear to have limited the damage shootings have caused. The same is true for terrorism. The NSA’s shockingly unconstitutional intrusions didn’t stop the Boston bombings. The authoritarian, someone-is-in-charge spying didn’t stop it, but the city’s preparedness saved lives and limbs and caught the perpetrators.

Focusing on prevention while ignoring recovery will give bad results. Resilience is important: something will get through the defenses.

In tech and identity, the modern plague is identity theft. Some forms are harder to recover from than others. If someone steals your credit cards or your Social Security number, you can usually recover reasonably well.[1] But if someone steals your email or your social media accounts, usually the master keys to your identity online, you’re probably totally screwed. And they’re regularly stolen. The recovery side is almost non-existent. The companies in question find it easier and cheaper to have people create a new account instead of providing secure customer service.[2] They often have “recovery” built in, where you can add multiple email addresses or some such thing, but in cases like those above these systems often help the attackers or are easily evaded by them. We need a better way to recover from having our online identities stolen.

There’s a concept related to resilience, and that’s Nassim Nicholas Taleb’s concept of antifragility. An antifragile system benefits from stress. In this case, it would become more secure the more you try to hack it. If you don’t believe antifragile systems exist, you should read his book. Let’s take an example of the antifragile in the tech world. Netflix’s operations trend toward the antifragile because of their practices and their use of Chaos Monkey.

What is Chaos Monkey? Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group.

Netflix’s Tech Blog

It randomly kills stuff in Netflix. All the time. This means Netflix becomes better and better at dealing with failures. I’ve seen companies create “Recovery Plans” and other grand schemes to deal with the loss of systems or datacenters. The last time I was somewhere that took the “Recovery Plan” strategy instead of the “Chaos Monkey” strategy, they lost an entire datacenter and spent more than a day rebuilding it. They essentially didn’t use their plan, because they knew it was worthless. It was just another document they had to create. They weren’t antifragile; they weren’t even resilient. They were fragile.
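To make the idea concrete, here is a minimal sketch of the behavior described in the quote above: look at each group and terminate an instance in it. It is a toy simulation, not Netflix’s implementation. The function name `chaos_round`, the dictionary of groups, and the choice of exactly one victim per group are all assumptions made for illustration, and nothing here talks to AWS.

```python
import random


def chaos_round(groups, rng=random):
    """Terminate one randomly chosen instance in each non-empty group.

    `groups` is a hypothetical stand-in for Auto Scaling Groups: a dict
    mapping a group name to a list of instance ids. The lists are
    modified in place. Returns a dict of group name -> terminated id.
    """
    terminated = {}
    for name, instances in groups.items():
        if not instances:
            continue  # nothing to kill in an empty group
        victim = rng.choice(instances)
        instances.remove(victim)  # simulated termination only
        terminated[name] = victim
    return terminated


if __name__ == "__main__":
    # Made-up fleet; group names and instance ids are illustrative.
    fleet = {
        "api": ["i-001", "i-002", "i-003"],
        "web": ["i-101", "i-102"],
    }
    print("terminated:", chaos_round(fleet))
    print("remaining:", fleet)
```

The point of running something like this continuously is that recovery stops being a document and becomes something the system does every day.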

You can't treat recovery as an abstraction when Chaos Monkey is... causing chaos.

But that seems to be how we deal with identity theft online. The big email and social media companies create “plans,” but they are rarely the ones who suffer for a bad plan or have to deal with the consequences. Every year, after a well-advertised attack, they get a little better, but the attackers seem to be outpacing them. We need antifragility in our internet identities. Imagine if the Googles, Facebooks and Twitters of the world had a Chaos Monkey for their employees’ and executives’ online identities. Perhaps they need a Chaos Monkey to help make the identities they manage antifragile… or at least resilient.


  1. The average cost in dollars isn't bad, but in time and peace of mind it's pretty bad. It shouldn’t be as bad as it often is, but that’s another discussion.
  2. A pretty big challenge in itself.