Failure Driven Design
Why planning for failure, not success, leads to better software
April 23, 2021; revisited April 5, 2025; revisited October 8, 2025
Failure Is the Best Way to Learn
There’s a profound truth hiding in plain sight: failure teaches us far more than success ever could. Yet our entire software development culture is built around the opposite assumption.
Two Ways of Looking at Development
When approaching any software project, you face a fundamental choice in mindset:
It’s going to succeed – Plan for everything to work the first time
It’s going to fail – Expect multiple iterations and build recovery into your workflow
Your outlook determines your workflow. How you write software when convinced it will work the first time differs dramatically from how you write it expecting failure.
The Current Workflow: Assuming Success
Today’s dominant development approach assumes success from the outset. We build systems expecting them to work on the first attempt and ignore the possibility of change. We invest heavily in type checking and correctness proofs, constructing elaborate type systems early and letting them calcify into our code.
But what happens when that original type system proves inadequate? Often, nothing changes. The types become too ingrained in the code and design to modify without massive rewrites.
This is the waterfall mindset—the antithesis of Failure Driven Development. It’s built on overconfidence, assuming the design will succeed without planning for recovery from failure. Requirements and understanding of the problem space are invariably incomplete at the start. Early attempts inevitably fail to completely solve the problem, yet no recovery mechanism is built into the workflow.
It’s similar to premature optimization: we optimize for the wrong things too early, before we truly understand the problem.
The Alternative: Assuming Failure
Failure Driven Development (FDD) takes the opposite approach. Instead of building systems that must work perfectly the first time, we build in easy recovery from changes and failures. We embrace meta-design—designing systems that are easy to redesign.
The wisdom of Fred Brooks’ The Mythical Man-Month still holds true: plan to throw one away, because you will anyway. Brooks observed a pattern across countless projects:
First try: Fails because understanding is incomplete
Second try: Fails because of over-engineering
Third try: “Just right”
Rather than fight this reality, FDD embraces it and asks: How do we fail fast? How do we recover quickly?
Learning from Failure
Consider this: when software works, we “abandon” it—we ship it and move on. But when software fails, we continue working on it. This means most of our time is spent working with failing code and failing designs.
Failure can occur at multiple levels: requirements, design, architecture, engineering, implementation, and testability. The first several attempts at solving any significant problem will fail. That’s not pessimism; it’s realism.
So what do we need to learn? We need to discover what the requirements actually are, understand all aspects and gotchas of the problem space, and ensure testability. Just as automobiles should be designed for serviceability—making repairs easier—software should be designed for modifiability.
Planning for Failure: Practical Tactics
FDD isn’t just philosophy; it’s a collection of concrete practices that make failure less painful:
Isolation
Create units of software that are fully isolated from one another. Unlike OOP-style encapsulation, which only encapsulates data, true isolation also encapsulates control flow. Draw bounding boxes around separate units, with ports that allow data to flow in and out. Function calls must not cross these boundaries.
Each isolated Part has exactly one input queue and one output queue. This single-queue design prevents low-level deadlock and maintains events in order of arrival, making it possible to reason about the relative ordering of events.
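To make this concrete, here is a minimal sketch of an isolated Part in Python. The names (Part, Message, send, step) and the single-threaded step loop are illustrative assumptions, not the author’s actual implementation; the point is that data crosses the boundary only through the two queues, never through a function call.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Message:
    port: str   # which port the data arrived on, or should leave by
    data: str   # payload kept deliberately simple (plain text)

@dataclass
class Part:
    """An isolated unit: one input queue, one output queue, no calls across the boundary."""
    name: str
    handler: callable                       # step logic: (part, message) -> None
    inq: deque = field(default_factory=deque)
    outq: deque = field(default_factory=deque)

    def send(self, port, data):
        # The only way data leaves a Part: enqueue on the single output queue.
        self.outq.append(Message(port, data))

    def step(self):
        # Handle at most one message; the single queue preserves arrival order.
        if self.inq:
            self.handler(self, self.inq.popleft())

# Example: a Part that upper-cases whatever arrives.
def upcase(part, msg):
    part.send("out", msg.data.upper())

shouter = Part("shouter", upcase)
shouter.inq.append(Message("in", "hello"))
shouter.step()
print(shouter.outq.popleft())   # Message(port='out', data='HELLO')
```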
Divide and Conquer—For Real
Chop problems into sub-problems and use different paradigms and notations for each, as appropriate. Function calling creates tight coupling, making it difficult to truly divide a problem into smaller, isolated pieces. Programmers think they are dividing and conquering, but tight coupling produces clockwork fragility and interdependence.
Code Generation
Don’t write code—write code that writes code. When you find a bug, fix it at the design level, push a button, and regenerate everything. Build transpilers that map Solution-Centric Notations (SCNs) to existing languages rather than writing full compilers.
Tools like OhmJS are game changers here. OhmJS extends PEG (Parsing Expression Grammars) and makes it possible to create nano-DSLs—what we call SCNs—orders of magnitude faster than with traditional techniques. This enables entirely new workflows for software development.
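A hedged sketch of the workflow, without OhmJS: the mini-grammar, the SCN text, and the generated target code below are all invented for illustration. The loop is the point: edit the SCN (the design), push a button (rerun the transpiler), and everything downstream is regenerated.

```python
import re

# A made-up nano-DSL (an "SCN") describing how parts are wired together:
SCN_SOURCE = """
reader -> parser
parser -> emitter
"""

def transpile(scn: str) -> str:
    """Map each 'a -> b' line of the SCN onto code in an existing language (Python here)."""
    out = ["connections = []"]
    for ln in scn.strip().splitlines():
        m = re.fullmatch(r"(\w+)\s*->\s*(\w+)", ln.strip())
        if not m:
            raise SyntaxError(f"bad SCN line: {ln!r}")
        out.append(f"connections.append(({m.group(1)!r}, {m.group(2)!r}))")
    return "\n".join(out)

generated = transpile(SCN_SOURCE)   # fix the design, regenerate everything
ns = {}
exec(generated, ns)
print(ns["connections"])            # [('reader', 'parser'), ('parser', 'emitter')]
```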
Layering Over Sprawl
Instead of an infinite canvas where units are strongly coupled and bigger systems mean wider canvases, use layers where units are loosely coupled. Bigger systems mean more layers, but each layer remains small—about 7±2 units, matching human cognitive limits.
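A small illustrative sketch (the names and the exact budget are assumptions, not a prescription): each layer is a container of at most 7±2 child units, and a bigger system grows by nesting layers rather than by widening one diagram.

```python
MAX_UNITS_PER_LAYER = 9   # 7 + 2: a human-scale budget per layer

class Layer:
    """A container whose children are leaf units (named by strings here) or nested Layers."""
    def __init__(self, name, children):
        if len(children) > MAX_UNITS_PER_LAYER:
            raise ValueError(f"{name}: {len(children)} units exceeds 7±2; split into sub-layers")
        self.name = name
        self.children = children

    def depth(self):
        nested = [c.depth() for c in self.children if isinstance(c, Layer)]
        return 1 + max(nested, default=0)

# Growth shows up as more layers (depth), not a wider canvas:
app = Layer("app", [
    Layer("ingest", ["reader", "splitter"]),
    Layer("transform", ["parser", "normalizer", "emitter"]),
])
print(app.depth())   # 2
```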
Multiple Paradigms and Languages
Compose programs using many paradigms and languages. Yes, this sounds complex, but the complexity of solving hard problems remains regardless of how many languages you use. Using multiple paradigms makes it easier to express and understand different parts of the problem.
The transport layer must be simple—just strings or bytes. This is how UNIX commands work: lines of text ending with “\n”, regardless of the programming language used. JSON serves the same purpose for modern systems.
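One way to picture it (my example, not the author’s): newline-delimited JSON passed between processes, so a stage written in any language can sit on the other end of the pipe. The sketch assumes a UNIX-like system where `cat` is available.

```python
import json
import subprocess

# The contract between stages is only "one JSON object per line" on stdin/stdout,
# just as UNIX filters agree only on lines of text ending in "\n".
record = {"part": "parser", "port": "out", "data": "hello"}
line = json.dumps(record) + "\n"   # nothing but plain text crosses the boundary

# Round-trip through `cat` to show the transport layer carries only bytes;
# any downstream stage, in any language, could replace it.
proc = subprocess.run(["cat"], input=line, capture_output=True, text=True)
decoded = json.loads(proc.stdout)
print(decoded["data"])             # hello
```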
Write the Code Twice
This sounds inefficient, but with appropriate languages it is actually more efficient than the current workflow (a small sketch contrasting the two passes follows the list):
First version is design, not code: Iterate and refine using dynamic languages with design-level types (not machine-level types), REPLs, prototypal inheritance, and control-flow languages like StateCharts where appropriate.
When design is finalized, rewrite it: Use efficiency-biased languages like Rust, Python, or whatever fits your needs, taking advantage of libraries and frameworks.
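Here is the sketch promised above, kept in a single language for brevity (which understates the idea): the first pass uses loose, design-level shapes to explore the problem; the second pass is the rewrite with explicit types once the design has stopped moving. The order/line-item example is invented for illustration.

```python
from dataclasses import dataclass

# Pass 1 — design, not code: a throwaway prototype with design-level "types"
# (plain dicts), written to explore the problem, not to harden it.
def prototype_total(order):
    return sum(item["price"] * item["qty"] for item in order["items"])

# Pass 2 — the rewrite, once the design has settled: explicit types, ready to be
# ported to an efficiency-biased language (shown in Python only for brevity).
@dataclass(frozen=True)
class LineItem:
    price: float
    qty: int

@dataclass(frozen=True)
class Order:
    items: tuple

def total(order: Order) -> float:
    return sum(item.price * item.qty for item in order.items)

draft = {"items": [{"price": 2.5, "qty": 4}, {"price": 1.0, "qty": 2}]}
final = Order(items=(LineItem(2.5, 4), LineItem(1.0, 2)))
assert prototype_total(draft) == total(final) == 12.0
```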
Fail Fast
Divide the problem, then attack the biggest or riskiest unknown first. If that unknown becomes known, set it aside and move on to the next riskiest. If it proves impossible, backtrack and redefine the problem.
This mirrors the scientific method: a scientific theory must be falsifiable. You can’t prove a theory correct—data points can only support it. But you can disprove a theory with a single data point. Strive to kill theories early; fail sooner rather than later.
The Scientific Method Applied to Software
Testing cannot prove that a solution works, but it can prove that a solution doesn’t work. Aim testing at disproving the solution: fail fast, sooner rather than later. Make it easy to go “back to the drawing board” by scripting everything so you can rebuild at the push of a button.
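One concrete way to aim tests at disproof (my suggestion, not something the author prescribes) is property-based testing, for example with the Hypothesis library: instead of confirming a few hand-picked happy paths, it actively searches for a counterexample that kills the current theory. The function under test is invented for illustration.

```python
from hypothesis import given, strategies as st

def de_duplicate(xs):
    return list(dict.fromkeys(xs))   # the "theory" under attack

@given(st.lists(st.integers()))
def test_theory_survives_attack(xs):
    result = de_duplicate(xs)
    assert len(result) == len(set(xs))    # no duplicates remain
    assert all(x in xs for x in result)   # nothing was invented
    # A single failing input disproves the theory; then repair, regenerate, try again.
```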
When a design fails:
Repair requirements
Repair design
Regenerate
Try again
Practical Tools and Techniques
Factbases and Logic Programming
Create SCNs that transpile code to flat factbases—logical assertions like Prolog facts. Normalize everything to triples (relation-subject-object structures). Write logical inferencing rules to walk through and infer relationships, typing, etc. Let the machine do the heavy lifting through engines like Prolog, Datalog, or miniKanren.
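A minimal sketch of the shape of this, using Python in place of Prolog/Datalog/miniKanren (the relation names and the single inference rule are invented): facts are flat triples, and a rule walks them to derive new facts.

```python
# Facts as flat triples: (relation, subject, object) — the same shape as the
# Prolog fact parent_of(carol, bob).
facts = {
    ("parent_of", "carol", "bob"),
    ("parent_of", "bob", "alice"),
}

def infer_ancestors(facts):
    """Base: ancestor(X, Y) if parent_of(X, Y).
    Rule: ancestor(X, Z) if ancestor(X, Y) and ancestor(Y, Z)."""
    derived = {("ancestor", s, o) for (r, s, o) in facts if r == "parent_of"}
    changed = True
    while changed:
        changed = False
        snapshot = list(derived)
        for (_, x, y) in snapshot:
            for (_, y2, z) in snapshot:
                if y == y2 and ("ancestor", x, z) not in derived:
                    derived.add(("ancestor", x, z))
                    changed = True
    return facts | derived

for triple in sorted(infer_ancestors(facts)):
    print(triple)   # includes the derived ('ancestor', 'carol', 'alice')
```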
Portability Through Syntactic Composition
Use tools like OhmJS and PEG to build many SCNs per project. The front-end SCN is human-readable; remaining SCNs are machine-readable intermediate representations. Compose solutions by bolting stages together with SCNs as couplers. Keep the front end common and specialize back ends for different targets.
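An illustrative sketch (stage names and output formats are invented): every stage is text-in/text-out, so stages bolt together like UNIX filters; the front end stays common while the back end is swapped per target.

```python
def front_end(scn_text):
    """Common, human-readable SCN -> machine-readable intermediate representation."""
    return [tuple(part.strip() for part in ln.split("->"))
            for ln in scn_text.strip().splitlines()]

def backend_python(ir):
    return "\n".join(f"connect({a!r}, {b!r})" for a, b in ir)

def backend_graphviz(ir):
    body = "\n".join(f"  {a} -> {b};" for a, b in ir)
    return "digraph g {\n" + body + "\n}"

def compile_to(scn_text, backend):
    return backend(front_end(scn_text))   # bolt the stages together

scn = "reader -> parser\nparser -> emitter"
print(compile_to(scn, backend_python))    # target: Python calls
print(compile_to(scn, backend_graphviz))  # same front end, different target
```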
Normalization
Normalization makes automation easier. RTL (Register Transfer Language, used by gcc) and OCG (Cordy’s Orthogonal Code Generator) both rely on normalization for portability. Even projectional editing is fundamentally about normalization.
The Bottom Line
After 50+ years of type-checking research, we still cannot guarantee correct programs. The current state of the art resists change because manual work is not recoverable: when the design changes, the time already spent is lost.
Automated work, by contrast, accommodates change. You regenerate all code at the push of a button. The system is never “finished” but becomes shippable at some point.
Failure Driven Development means:
Automate, automate, automate
Plan for failure, not just success
Build tooling that makes it easy to reason about and recover from failure
Accept that the number of failures will far exceed the number of successes
The paradox is that by planning for failure, we ultimately achieve better success. We learn faster, iterate more effectively, and build systems that can evolve rather than calcify.
⸻
What’s your experience with failure-driven approaches? How do you build systems that can recover from their own inevitable failures?
Previous Versions
https://guitarvydas.github.io/2021/04/23/Failure-Driven-Design.html
See Also
Email: ptcomputingsimplicity@gmail.com
Blog: guitarvydas.github.io
References: guitarvydas.github.io/2024/01/06/References.html
Videos: youtube.com/@programmingsimplicity2980
Discord: discord.gg/65YZUh6Jpq


