The repercussions of recent cloud outages (AWS’s S3 crash and Azure’s Active Directory cascading failure) linger in IT departments and show up as lost revenue. But the bigger story is that the next outage is around the corner: unpredictable, and liable to hit on a random Tuesday. Whether businesses host web services and backends with cloud providers, in on-premises data centers, or in hybrid setups, infrastructure failures are a fact of life and have to be on our radar as a matter of routine. That makes architecting for failure and for the future, from the start, one of the most pressing imperatives for business IT departments.

The next five years will see the rise and democratization of centralized control systems for cloud ops, with fault tolerance architected into the very fabric of those systems. Configuration management is being reinvented and taken to new levels of automated action, where the machines take responsibility for failure and do the right thing as part of their continuous tasking. The cloud’s scalability, elasticity, distributed resources, and potential cost savings increasingly make it the preferred choice for enterprises. Unlike with on-premises data centers, in the cloud the pieces are all there to help us withstand outages and their fallout. The challenge is to figure out how to stack, manage, and tune those pieces to automate resilience, and to keep doing that as the pieces change over time.

  • Be Honest About the Weakest Links
  • Is Disaster Recovery an Antiquated Notion?
  • Separate Concerns and Determine Priorities
  • Do No Harm: Architect a Responsive “Circuit Breaker”
  • Humans Forget, Computers Remember

Read all about these points in this blog post.