Revisiting the Natural Language "Compiler"
thoughts on incomplete formalisms
1. Dijkstra's paradox
In 1978, Edsger Dijkstra wrote a treatise “on the foolishness of natural language programming”:
It may be illuminating to try to imagine what would have happened if [...] our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that [...] computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system. We would need all the intellect in the world to get the interface narrow enough to be usable, and, in view of the history of mankind, it may not be overly pessimistic to guess that to do the job well enough would require again a few thousand years.
The irony is that over the next 48 years, the "black art" would instead consist of bootstrapping from formal systems to ones that could handle natural language (NL).
The reason it didn't take thousands of years is precisely because formal systems abide by simple rule sets and produce verifiable output. To quote Dijkstra again, they are "an amazingly effective tool for ruling out all sorts of nonsense that, when we use our native tongues, are almost impossible to avoid."
And he's right! Natural language can be irrelevant, imprecise, or contradictory -- terrible for a system that demands reliability. Dijkstra frames this as a fatal flaw. In the age of coding agents, it seems the truth is more complicated.
2. The spectrum of coding languages
Let's start by revisiting formal languages like code, which follow a small, consistent set of grammar rules. Programming languages build up abstraction levels for different purposes:

Machine language <-- Assembly <-- Systems languages <-- Managed-memory/OOP <-- Dynamic languages

--> increasing abstraction / expressiveness
<-- increasing precision of specification
But these languages "do exactly what you tell them to do" in the sense that they produce repeatable recipes. The "expressiveness" of a high-level language lets you describe high-level intent while staying less prescriptive about details like memory management. Consider print("hello world"): this simple recipe triggers an astonishing amount of work (syscalls, encoding, buffering, etc.), all of it delegated to the compiler and runtime.
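To make that delegation concrete, here is a minimal Python sketch of roughly what print("hello world") hands off to the runtime. This assumes a POSIX-style stdout on file descriptor 1; a real print additionally routes through sys.stdout's buffering layer.

```python
import os

# Roughly the work hidden behind print("hello world"):
text = "hello world" + "\n"      # print appends a line separator for you
data = text.encode("utf-8")      # encoding: str -> bytes
written = os.write(1, data)      # syscall: write bytes to stdout (fd 1)

# A real runtime also handles buffering, flushing, and short writes.
assert written == len(data)
```

Every line here is a decision the runtime normally makes on your behalf: the separator, the encoding, the buffering policy. That is the work you delegate when you write one line of high-level code.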
And that's okay! It works because you trust the priors baked in by the compiler designers; all that matters is that your intent was carried out.
To be clear, that is a lot of trust to place in the compiler, and it's worth reflecting on. We place similar trust in our libraries and runtimes. We can do this because compilation is a fundamentally mathematical operation, grounded in equivalence classes of program behavior, and execution follows a more-or-less deterministic trajectory given the same source code.
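That determinism is easy to see with CPython's built-in compile as a stand-in for a compiler: the same source maps to the same recipe, and the recipe executes the same way every time.

```python
src = "total = sum(range(10))"

# Compiling the same source twice is a deterministic, mathematical mapping:
code_a = compile(src, "<demo>", "exec")
code_b = compile(src, "<demo>", "exec")
assert code_a.co_code == code_b.co_code   # identical bytecode "recipe"

# And execution follows the same trajectory given the same source:
ns = {}
exec(code_a, ns)
assert ns["total"] == 45
```

This repeatability is exactly the property that a probabilistic generator lacks, which is why the distinction in the next section matters.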
3. Extending the spectrum
A coding agent is built on probability, not equivalence classes, so it's not a compiler in the traditional sense. But you can plausibly extend the same spectrum into NL:
Code <---[ "compiler" ]--- Implementation notes <-- Component design <-- Architectural vision

<-- FORMAL LANGUAGE | NATURAL LANGUAGE -->
                    | more details "left unspecified" -->
Natural language is inherently imprecise, but critically, you can dial that precision up and down.
Say two engineers independently implement the same design. Odds are, the implementations will vary, and depending on the scope, they may vary a lot. How much depends on how much was left unspecified. The gaps are filled by whoever implements the design, drawing on past experience and judgment. Relying on this in practice requires you to trust the engineers' priors.
The same is true of a coding agent: when you underspecify a prompt, the model's priors fill in the gaps. Those priors are shaped mainly by RLHF and other post-training, but the arrangement is no different in kind from trusting a human engineer.
Neither of these examples involves you writing code. But this is still engineering! You are specifying intent and trusting an inherently probabilistic system (biological or digital) to do the rest. After all, engineering skill is often measured by one's ability to specify a problem and to know which parts of it require the most careful specification.
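As a toy illustration (the spec and function names here are invented for this sketch): hand two implementers the one-line spec "remove duplicates from a list of IDs," and both can be faithful to it while disagreeing on a detail the spec never mentioned.

```python
# Spec (natural language): "remove duplicates from a list of IDs"
# Two faithful implementations -- the spec never said whether order matters.

def dedupe_a(ids):
    return list(set(ids))            # implementer A: priors say order is irrelevant

def dedupe_b(ids):
    return list(dict.fromkeys(ids))  # implementer B: priors say keep first-seen order

ids = [3, 1, 3, 2, 1]
# Both satisfy the letter of the spec...
assert set(dedupe_a(ids)) == set(dedupe_b(ids)) == {1, 2, 3}
# ...but only one pins down an ordering; that gap was filled by priors.
assert dedupe_b(ids) == [3, 1, 2]
```

If downstream code quietly depends on the ordering, the two "correct" implementations diverge in behavior. That divergence is the cost, and sometimes the benefit, of what was left unspecified.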
4. The art of underspecification
Even a lack of specificity can be a good thing. An underrated property of documentation is what it doesn't say: for documentation to survive in an evolving system, it must take care not to overspecify.
This is particularly true with modern techniques like recursive language models, which decompose a high-level prompt into sub-problems, reason through each, and synthesize results. When you give such a model a specification, it generates intermediate specifications of its own: sub-goals, decomposition plans, internal constraints. Your top-level spec spawns a tree of machine-generated sub-specs, each one an incomplete formalism over the layer below it.
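That tree of sub-specs can be sketched in a few lines; the goals below are hypothetical, and each node stands for a machine-generated sub-spec of its parent.

```python
from dataclasses import dataclass, field

@dataclass
class Spec:
    """A goal plus the machine-generated sub-specs derived from it."""
    goal: str
    children: list["Spec"] = field(default_factory=list)

    def leaves(self):
        # Leaf sub-specs are the ones that eventually bottom out in code.
        if not self.children:
            return [self.goal]
        return [g for child in self.children for g in child.leaves()]

# A hypothetical top-level prompt decomposed by the model:
top = Spec("add CSV export", children=[
    Spec("serialize rows", children=[Spec("escape delimiters"),
                                     Spec("encode UTF-8")]),
    Spec("stream to disk"),
])

assert top.leaves() == ["escape delimiters", "encode UTF-8", "stream to disk"]
```

Each level of this tree is an incomplete formalism over the one below it: "serialize rows" says nothing about delimiters until the next layer decides it must.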
Excessive detail in user prompts doesn't just cause "context rot": it can constrain the model's ability to decompose the problem intelligently. An overly detailed prompt can end up fighting the model's planning rather than guiding it. A prescriptive spec that dictates how to decompose a problem can produce more errors than a precise spec that defines what the solution must satisfy: if the prescribed approach conflicts with the actual state of the codebase, it may simply not work, forcing the agent to improvise.
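One way to picture the difference, with illustrative names: a "what" spec can be written as properties the result must satisfy, leaving the "how" entirely to the agent.

```python
# A "what" spec for sorting: properties any correct solution must satisfy,
# without dictating the algorithm or the decomposition.

def satisfies_spec(sort_fn, xs):
    out = sort_fn(xs)
    is_ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_items = sorted(out) == sorted(xs)   # output is a permutation of input
    return is_ordered and same_items

# The agent is free to choose *how* (quicksort, mergesort, a library call);
# the spec only constrains the result.
assert satisfies_spec(sorted, [3, 1, 2])
assert not satisfies_spec(lambda xs: [], [3, 1, 2])
```

A spec in this shape cannot contradict the codebase's internal structure, because it never mentions it.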
This is doubly true for saved agentic memory, which, once it drifts out of sync with the underlying code, can steer agents in obsolete directions and be challenging to debug. The remedy is to keep such memories as high-level declarations of intent and let the code speak for itself about actual behavior.
5. Formal language as an output format
One reason planning/reasoning models and RLMs work so well is that the recursion eventually bottoms out in code, and code is a verifiable substrate for intent.
You can still write "scripts" in NL by giving a model a prompt and access to tools, but such a script remains probabilistic in nature. Formal language, by contrast, is remarkably useful as the output of a generative process: it is verifiable and repeatably executable.
This creates a verifiable loop: because code is executable, it can be tested, linted, and compiled.
NL Prompt
(intent + constraints)
│
│ held as context
▼
┌─────────────────┐
│ Coding Agent │◄──── error feedback ───┐
│ (harness / │ │
│ orchestrator) │ │
└────────┬────────┘ │
│ │
invokes with │
NL constraints │
│ │
▼ │
┌─────────────────┐ │
│ Coding Model │ │
│ (LLM / RLM) │ │
└───────┬─┬───────┘ │
│ │ │
generates │ │ generates │
┌───────────┘ └───────────┐ │
│ │ │
▼ ▼ │
┌──────────┐ ┌───────────┐ │
│ Code │ │ Tests │ │
└─────┬────┘ └─────┬─────┘ │
│ │ │
│ ┌──────────────┐ │ │
└───►│ Verify │◄───┘ │
│ compile / │ │
│ test / lint │ │
└──────┬───────┘ │
│ │
pass / fail │
│ │ │
│ └──────────────────────────────┘
▼
Grounded Output
This was Dijkstra's insight from ~50 years ago, and it still proves its worth today.
If the agent generates "nonsense," the formal system (the compiler or the test suite) rejects it immediately. We can then pipe that deterministic error back into the probabilistic model, creating a self-correcting cycle. In this sense, formal language acts as a "grounding wire" for NL. We use the "vibe" to describe the intent, but tests provide an existence proof of the real behavior.
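The loop can be sketched in a few lines of Python. The "model" here is a deterministic stand-in (a canned list of candidates, purely illustrative), but the grounding mechanics are the real point: compile and test act as the formal gate, and the deterministic error is piped back into the next generation.

```python
# Self-correcting cycle: probabilistic generator + formal verifier.

CANDIDATES = [
    "def double(x): return x + 1",   # first attempt: compiles, but wrong
    "def double(x): return x * 2",   # "corrected" attempt after feedback
]

def fake_model(prompt, error=None):
    # Stand-in for an LLM call; returns the fixed version once it sees an error.
    return CANDIDATES[1] if error else CANDIDATES[0]

def verify(source):
    ns = {}
    exec(compile(source, "<agent>", "exec"), ns)   # formal gate: must compile
    assert ns["double"](3) == 6                    # test suite: must pass
    return ns["double"]

error = None
for _ in range(2):
    code = fake_model("write double(x)", error)
    try:
        double = verify(code)
        break                        # grounded output: verified code
    except AssertionError as exc:
        error = repr(exc)            # deterministic error fed back to the model

assert double(5) == 10
```

The nonsense never escapes the loop: the first candidate is rejected by the test, and only the verified version reaches the caller.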
6. The next rung of the ladder
The explosion of "vibecoding" has made software creation accessible to people who have never written a function. The software they produce is real code: repeatable, verifiable, deployable. But it relies entirely on the priors of the coding models.
Consider what this looks like at the frontier. Sophisticated agentic projects like OpenClaw, or even Cursor's experimental web browser, are written almost entirely by agents that run and check their own work. The generated code works, but it works because humans imparted information to narrow the scope: humans wrote the specs that define correctness and reviewed the test suites that detect regressions. And increasingly, humans are building sandboxes as failsafes to contain the scope of failures or even rogue behavior.
Dijkstra predicted it would take "a few thousand years" to bootstrap from natural language to something usable, because you would "need all the intellect in the world to get the interface narrow enough." Instead, we bootstrapped from formalism, then used it to constrain NL from the outside -- and that only took 50 years.
That constraint is itself incomplete — every specification is a partial model of intent, every test suite a partial model of correctness. This is the oldest problem in computer science, and in mathematics before it. But incomplete formalism has always been the game. A C compiler is an incomplete formalism over assembly. The type system is an incomplete formalism over runtime behavior. The specification is an incomplete formalism over intent.
We've added a rung to the abstraction ladder, but the stack is fundamentally the same. And the incompleteness of this layer raises a more profound question of what might come next.