Revisiting the Natural Language Compiler
Dijkstra was right, but in the wrong way
1. Dijkstra's paradox
In 1978, a certain Edsger Dijkstra (yes, that Dijkstra) wrote a treatise “on the foolishness of natural language programming”:
It may be illuminating to try to imagine what would have happened if [...] our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that [...] computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system. We would need all the intellect in the world to get the interface narrow enough to be usable, and, in view of the history of mankind, it may not be overly pessimistic to guess that to do the job well enough would require again a few thousand years.
The irony is that over the next 47 years, the "black art" would instead consist of bootstrapping from formal systems to ones that could handle natural language.
The reason it didn't take thousands of years is precisely because formal systems abide by simple rule sets and produce verifiable output. To quote Dijkstra again, they are "an amazingly effective tool for ruling out all sorts of nonsense that, when we use our native tongues, are almost impossible to avoid."
Dijkstra frames this as a fatal flaw of natural language. In the age of coding agents, it seems the truth is more complicated.
2. The spectrum of coding languages
Within formal coding languages, we still have a spectrum:
Machine language <-- Assembly <-- Systems languages <-- Managed-memory/OOP <-- Dynamic languages
--> increasing abstraction / expressiveness
<-- increasing power of specification (control over low-level details)
These languages "do exactly what you tell them to do" in the sense that they produce repeatable recipes. But the "expressiveness" of a high-level language lets you describe high-level intent while being less prescriptive about details like memory management. Consider print("hello world"): this simple recipe requires an astonishing amount of work (syscalls, encoding, buffering, etc.), all of which is delegated to the compiler and runtime.
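To make the delegation concrete, here is a rough Python sketch of the work hiding behind that one line; real runtimes differ in the details (buffering policy, error handling), so treat this as an illustration rather than a faithful trace.

```python
import os

# Roughly what the runtime does on your behalf for print("hello world"):
message = "hello world" + "\n"     # the runtime appends the line separator
data = message.encode("utf-8")     # encoding: text must become bytes
os.write(1, data)                  # eventually, a write(2) syscall on fd 1 (stdout)
```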
But that's okay! This works because you trust the priors baked in by the compiler designers, and all that matters is that your intent was carried out.
To be clear, that is a lot of trust to put in the compiler, and it's worth reflecting on why it's warranted: compilation is a fundamentally mathematical operation, a semantics-preserving transformation grounded in program equivalence.
3. Extending the spectrum
A coding agent is based on probability, not program equivalence, so it's not a compiler in the traditional sense. But you can plausibly extend the same coding spectrum to natural language:
Code <---[ "compiler" ]--- Implementation notes <-- Component design <-- Architectural vision
            ▲
            │
<-- FORMAL LANGUAGE │ NATURAL LANGUAGE -->
                    │ more details "left unspecified" -->
Let's say you are a project lead and you independently delegate implementation of the same design to two engineers. Odds are the implementations will vary, and depending on the scope, they may vary a lot. It all depends on how much was left unspecified. Those gaps are filled in by the engineer who implements the design, drawing on their past experience and judgment. Doing so requires you to trust the engineers' priors.
The same is true of a coding agent: when you underspecify a prompt, the model's priors fill in the gaps. Those priors are laid down in pretraining and shaped by RLHF and other post-training.
Neither of these examples involves you writing code. But this is obviously still engineering! You are specifying intent and trusting an inherently probabilistic system (biological or digital) to do the rest. Engineering skill is often measured by one's ability to specify a problem, and to know which parts of the problem require the most specification.
4. The art of underspecification
But even a lack of specificity can be a good thing. An underrated property of documentation is what it doesn't say. For documentation to survive in an evolving system, it must take care not to overspecify.
This becomes especially clear with modern techniques like recursive language models, which decompose a high-level prompt into sub-problems, reason through each, and synthesize results. When you give such a model a specification, it generates intermediate specifications of its own: sub-goals, decomposition plans, internal constraints. Your top-level spec spawns a tree of machine-generated sub-specs, each one an incomplete formalism over the layer below it.
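As a toy illustration of that decomposition, consider the sketch below; the helper methods (is_atomic, complete, decompose, synthesize) are hypothetical stand-ins for model calls, not any specific RLM API.

```python
# A toy sketch of recursive decomposition. All `llm.*` helpers are
# hypothetical stand-ins for model calls; no real API is implied.
def solve(spec: str, llm, depth: int = 0, max_depth: int = 3) -> str:
    if depth == max_depth or llm.is_atomic(spec):
        return llm.complete(spec)             # base case: solve the leaf directly
    sub_specs = llm.decompose(spec)           # machine-generated sub-specs
    results = [solve(s, llm, depth + 1, max_depth) for s in sub_specs]
    return llm.synthesize(spec, results)      # combine partial solutions
```

Each level of this recursion is another incomplete formalism over the layer below it.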
Overspecification is actively harmful here. Excessive detail in your top-level prompt doesn't just cause "context rot": it constrains the model's ability to decompose the problem intelligently. You end up fighting the model's planning rather than guiding it. A prescriptive spec that dictates how to decompose a problem often produces worse results than a precise spec that defines what the solution must satisfy.
There is skill in knowing what to specify: the constraints that matter, the invariants that must hold, the outcomes that should be verified. Everything else is delegated — not just to the model's priors, but to the model's own recursive formalization process.
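One way to specify the "what" without dictating the "how" is to write those constraints as executable checks. A minimal sketch, where my_sort is a hypothetical function whose implementation is left entirely to the agent:

```python
# A "what" spec as executable invariants. `my_sort` is hypothetical;
# only its observable behavior is constrained, not its implementation.
def test_my_sort_invariants():
    data = [3, 1, 2, 2]
    out = my_sort(data)
    assert all(a <= b for a, b in zip(out, out[1:]))   # outcome: ordered
    assert sorted(out) == sorted(data)                 # invariant: permutation of input
    assert data == [3, 1, 2, 2]                        # invariant: input not mutated
```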
5. Formal language as an output format
The thing that saves this from infinite regress is that the recursion eventually bottoms out in code — and code is verifiable.
Prompting a reasoning model with access to tools is a little like writing a script in natural language. However, it is still probabilistic in nature. Formal language, on the other hand, is remarkably useful as the output of a generative process, since it is verifiable and repeatably executable. This is Dijkstra's insight from nearly fifty years ago, still proving its worth.
Prompting a reasoning model is probabilistic, but the code it outputs is formal. This creates a verifiable loop: because code is executable, it can be tested, linted, and compiled.
NL Prompt
(intent + constraints)
│
│ held as context
▼
┌─────────────────┐
│ Coding Agent │◄──── error feedback ───┐
│ (harness / │ │
│ orchestrator) │ │
└────────┬────────┘ │
│ │
invokes with │
NL constraints │
│ │
▼ │
┌─────────────────┐ │
│ Coding Model │ │
│ (LLM / RLM) │ │
└───────┬─┬───────┘ │
│ │ │
generates │ │ generates │
┌───────────┘ └───────────┐ │
│ │ │
▼ ▼ │
┌──────────┐ ┌───────────┐ │
│ Code │ │ Tests │ │
└─────┬────┘ └─────┬─────┘ │
│ │ │
│ ┌──────────────┐ │ │
└───►│ Verify │◄───┘ │
│ compile / │ │
│ test / lint │ │
└──────┬───────┘ │
│ │
pass / fail │
│ │ │
│ └──────────────────────────────┘
▼
Grounded Output
If the agent generates "nonsense," the formal system (the compiler or the test suite) rejects it immediately. We can then pipe that deterministic error back into the probabilistic model, creating a self-correcting cycle. In this sense, formal language acts as a "grounding wire" for natural language. We use the "vibe" to describe the intent, but we use the unit test to prove the reality.
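A minimal sketch of that self-correcting cycle, with hypothetical model and toolchain objects standing in for a real agent harness:

```python
# A minimal sketch of the grounding loop. `model.generate` and
# `toolchain.verify` are hypothetical; verify() is assumed to return
# an object with .ok (bool) and .errors (str).
def grounded_generate(prompt: str, model, toolchain, max_iters: int = 5) -> str:
    context = prompt
    for _ in range(max_iters):
        code = model.generate(context)      # probabilistic step
        report = toolchain.verify(code)     # deterministic: compile / lint / test
        if report.ok:
            return code                     # grounded output
        # pipe the deterministic error back into the probabilistic model
        context = prompt + "\n\nFix these errors:\n" + report.errors
    raise RuntimeError("did not converge on verified code")
```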
6. The next rung of the ladder
The explosion of "vibecoding" has made software creation accessible to people who have never written a function. The software they produce is real code: repeatable, verifiable, deployable. But it relies entirely on the priors of the coding models.
Consider what this looks like at the frontier. Sophisticated agentic implementations like OpenClaw or even Cursor's experimental web browser are written almost entirely by agents running and checking their own work. The generated code works. But it works because humans imparted information to narrow the scope: humans wrote the formal specs that define correctness and reviewed the test suites that detect regressions. And particularly in the case of OpenClaw, humans are now building sandboxes to contain the scope of failures, or even rogue behavior.
Dijkstra predicted it would take "a few thousand years" to bootstrap from natural language to something usable, because you would "need all the intellect in the world to get the interface narrow enough." He was right about the narrowing. He was wrong about the timeline, and wrong about the direction. We didn't narrow natural language into formalism. We built formalism first, then used it to constrain natural language from the outside.
That constraint is itself incomplete — every specification is a partial model of intent, every test suite a partial model of correctness. This is the oldest problem in computer science, and in mathematics before it. But incomplete formalism has always been the game. A C compiler is an incomplete formalism over assembly. The type system is an incomplete formalism over runtime behavior. The specification is an incomplete formalism over intent.
We've added a rung to the abstraction ladder, but the stack is fundamentally the same. The question, as always, is whether this layer's incompleteness is manageable — whether the nonsense it fails to rule out is the kind we can live with, or the kind that accumulates.