Failure is not an option; it’s a requirement

05 Jul

Software engineering involves a complex balance of problem solving and risk management. In most solutions, you start by addressing the happy path, and once the system works as expected, you switch your focus to buttoning up the failure scenarios. How deep do you go in your quest to dodge pitfalls?

Let’s consider a solution that requires a pipeline of messages to be exchanged between two systems, a Producer and a Consumer. If collecting every individual message is deemed mission critical, the team might propose a near-real-time solution based on a message queue.

How should the system respond if the message pipeline fails to deliver a message? There are many options, so let’s create a contingency chain by listing options and repeatedly asking “and what if that fails?”. Continue the contingency planning until “we accept failure” is an acceptable answer. It is important to list out the contingency chain even when it has only a few options, as this lets you gauge the inherent risk of the proposed solution at every step.
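As a rough illustration, a contingency chain can be sketched as an ordered list of delivery options that are tried one after another. This is only a minimal sketch: the function names `deliver`, `queue_send` and `backfill_send`, and the idea that each option reports success as a boolean, are assumptions made for the example and not part of any particular queueing product.

```python
from typing import Callable, List, Optional, Tuple

# Each contingency is a (name, attempt) pair; attempt returns True on success.
Contingency = Tuple[str, Callable[[str], bool]]


def deliver(message: str, chain: List[Contingency]) -> Optional[str]:
    """Try each contingency in order and return the name of the one that worked.

    Returning None marks the end of the chain: the point where
    "we accept failure" has to be an acceptable answer.
    """
    for name, attempt in chain:
        try:
            if attempt(message):
                return name
        except Exception:
            # A raised error counts as this option failing; try the next one.
            continue
    return None


# Hypothetical delivery options, stubbed out for illustration only.
def queue_send(message: str) -> bool:
    raise ConnectionError("queue unavailable")


def backfill_send(message: str) -> bool:
    return True


chain = [("message queue", queue_send), ("backfill", backfill_send)]
used = deliver("example message", chain)
```

Writing the chain down this way makes its length, and the spot where failure is finally accepted, explicit rather than implied.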

blog-failure-image1

That is clearly a completely inadequate solution. One basic approach and no safety net. Let’s expand on it by adding another contingency option.

blog-failure-image2

This still feels like an inadequate solution. Adding weekly and monthly SFTP backfill processes might add more contingency options to the mix, but it would leave 3 of the 4 items susceptible to a single SFTP server failure. Take care to diversify the contingency options so that they don’t share the same failure points.
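One way to check for shared failure points is to note what each contingency depends on and look for dependencies that appear more than once. The sketch below is hedged accordingly: the option names and dependency labels ("queue broker", "sftp server") are assumptions chosen to mirror the SFTP example above, and `shared_failure_points` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_failure_points(chain: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each dependency to the contingencies that rely on it,
    keeping only dependencies shared by more than one contingency."""
    users: Dict[str, List[str]] = defaultdict(list)
    for option, dependencies in chain.items():
        for dependency in sorted(dependencies):
            users[dependency].append(option)
    return {dep: opts for dep, opts in users.items() if len(opts) > 1}


# Assumed dependencies for the four-item chain discussed above.
chain = {
    "message queue": {"queue broker"},
    "SFTP backfill": {"sftp server"},
    "weekly SFTP backfill": {"sftp server"},
    "monthly SFTP backfill": {"sftp server"},
}

shared = shared_failure_points(chain)
for dependency, options in shared.items():
    print(f"{len(options)} of {len(chain)} items depend on {dependency}")
```

A non-empty result suggests the chain is longer than it is diverse, which is the situation the SFTP example describes.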

blog-failure-image3

This seems like a much more thorough solution for a mission-critical system. Not every software development effort needs this sort of contingency planning, but for service-based behaviors or critical middleware components, working out exactly when failure is an option is an important exercise.

In Summary:

  • Create a contingency chain
  • Repeatedly ask “and what if that fails?” until the answer is “we accept failure”
  • Diversify the contingencies
Larry Klug
lklug@bandwidth.com

Larry Klug has been a Software Development Manager at Bandwidth since 2012. His passion for computer software started in 1982 with the arrival of an Atari 400 home computer and a BASIC programming language cartridge, both of which he keeps on display (still working) in his office in Raleigh. When away from the computer keyboard, Larry enjoys collecting musical instruments, singing and performing acoustic music in local venues.
