> But like humans — and unlike computer programs — they do not produce the exact same results every time they are used. This is fundamental to the way that LLMs operate: based on the "weights" derived from their training data, they calculate the likelihood of possible next words to output, then randomly select one (in proportion to its likelihood).
This is emphatically not fundamental to LLMs! Yes, the next token is selected randomly; but "randomly" could mean "chosen using an RNG with a fixed seed." Indeed, many APIs used to support a "temperature" parameter that, when set to 0, would result in fully deterministic output. These parameters were slowly removed or made non-functional, though, and the reason has never been entirely clear to me. My current guess is that it is some combination of A) 99% of users don't care, B) perfect determinism would require not just a seeded RNG, but also fixing a bunch of data races that are currently benign, and C) deterministic output might be exploitable in undesirable ways, or lead to bad PR somehow.
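To make the seeded-RNG point concrete, here is a minimal sketch in Python. It assumes a toy, fixed list of logits standing in for a real model's output; `softmax`, `sample_token`, and `generate` are hypothetical names for illustration, not any provider's API. It only shows that sampling is reproducible once the seed is fixed, and that temperature 0 reduces to picking the most likely token.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng, temperature=1.0):
    """Pick a token index; temperature 0 means greedy (argmax) decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def generate(seed, steps=20, temperature=1.0):
    """Sample a sequence from a toy 'model' whose logits never change."""
    rng = random.Random(seed)        # fixed seed: same random stream every run
    logits = [2.0, 1.0, 0.5, 0.1]    # assumption: stand-in for real model output
    return [sample_token(logits, rng, temperature) for _ in range(steps)]

if __name__ == "__main__":
    print(generate(seed=42) == generate(seed=42))  # same seed, same tokens
    print(generate(seed=0, temperature=0))         # greedy: always token 0
```

This covers only the sampling step. It says nothing about whether the logits themselves are bit-identical across runs on real hardware, which is the harder part.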
Deterministic output is incompatible with batching, which in turn is critical to high utilization on GPUs, which in turn is necessary to keep costs low.
Batching doesn't mean the computation suddenly becomes non-deterministic. Ideally, it just means you perform the same computation on multiple token streams in the batch simultaneously, without the values interacting with each other. Vectorization, basically.
Batching leads to cross-contamination in practice because of things like MoE load-balancing within the batch, or supporting different batch sizes with different kernels that have different numerical behavior. But a careful implementation could avoid such issues while still benefiting from the higher efficiency of batching.
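The "different kernels that have different numerical behavior" part comes down to floating-point addition not being associative, so a different reduction order can give a different result. A small sketch in plain Python, with the two functions standing in (as an assumption) for two kernels that reduce the same values in a different order:

```python
def sum_left_to_right(values):
    """Accumulate strictly in order, like a simple sequential kernel."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    """Tree-style reduction, like a kernel that splits the work in halves."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

if __name__ == "__main__":
    values = [0.1, 0.2, 0.3]
    print(sum_left_to_right(values))  # 0.6000000000000001
    print(sum_pairwise(values))       # 0.6
```

Both answers are valid roundings of the same sum. A difference this small can still flip which token has the highest score when two candidates are nearly tied, and after that the outputs diverge.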
> This is emphatically not fundamental to LLMs! Yes, the next token is selected randomly; but "randomly" could mean "chosen using an RNG with a fixed seed."
This. Thanks for saying that, because now I don't need to read the article, since if the author doesn't even get that, I'm not interested in the rest.
LLMs are, fundamentally, compressed lookup tables that map input -> input + next token. Or, if you like, input -> input + list of possible next tokens with probabilities.
The temperature parameters largely went away when we moved towards reasoning models, which output lots of reasoning tokens before you get to the actual output tokens. I don’t know if it was found that reasoning works better with a higher temperature, or that having separate temperatures for reasoning vs. output wasn’t practical, but that’s my observation of the timing, anyway. And to the other commenter’s point, even a temperature of 0 is not deterministic if the batches are not invariant, which they’re not in production workloads.
Like many, the author seems to be confusing determinism with unrelated LLM phenomena. He talks about two entirely unrelated things:
1. Same input = same output. This can be called determinism, and it's technically rather trivial to achieve in the lifetime of a single model snapshot - it's just a matter of business need, because you pay extra for worse batching. It's harder if you need to extend the guarantee into the future, as you need to keep the snapshot and inference method the same. It's also a relatively niche thing, only required for build reproducibility, supply chain security, this kind of stuff.
2. Zero error rate with arbitrary inputs and outputs. This is not determinism and it's also NOT achievable in any model at all because the domain LLMs (and humans!) operate in is fundamentally ambiguous. If you want to enforce the formal rules, verify your inputs and outputs formally! Trying to solve it purely with intelligence (human or machine) is a fool's errand. You can keep the error rate low enough, but you can't guarantee the absence of errors due to the nature of intelligence.
I have been struggling with that. Thanks! Let me reword it: natural language lacks strict semantics, so programs for the LLM machine (i.e. prompts) cannot have strict semantics either. LLMs always have to project from all possible semantics onto one (are there any experiments with superpositions?)
At what point does this just wrap all the way back around to being genetic algorithms?
I'm also reminded of the old software called Formulize, which could take in a set of arbitrary data and find a function that described it. http://nutonian.wikidot.com/
The genetic algorithm comparison is actually pretty apt. Generate variations, evaluate fitness, keep the survivors. The main difference is that LLMs have a much richer prior about what "good" looks like, so the search space is dramatically smaller than random mutation.
But it raises an interesting question about where the fitness function comes from. In traditional GAs you define it explicitly. With LLM-generated code, the fitness function is often just "does it pass the tests" - which means the quality of your tests becomes the actual bottleneck, not the quality of the code generation.
I wonder if that shifts the core skill of programming from "write correct code" to "write correct specifications." And if so, is that actually a new problem, or is it the same problem formal methods people have been working on for decades, just wearing a different hat?
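A toy sketch of "the tests are the fitness function", assuming hand-written candidates in place of LLM output (all names here are made up for illustration). The selection step can only be as good as the tests it is given:

```python
def passes(candidate, tests):
    """Fitness check: does the candidate match every (input, expected) pair?"""
    return all(candidate(x) == expected for x, expected in tests)

def select(candidates, tests):
    """Keep the names of the candidates that survive the test suite."""
    return [c.__name__ for c in candidates if passes(c, tests)]

# Two candidate implementations of absolute value; only one is correct.
def abs_correct(x):
    return x if x >= 0 else -x

def abs_identity(x):
    return x  # wrong for negative inputs

candidates = [abs_correct, abs_identity]
weak_tests = [(0, 0), (3, 3)]          # never exercises negative inputs
strong_tests = weak_tests + [(-3, 3)]  # adds the case that matters

if __name__ == "__main__":
    print(select(candidates, weak_tests))    # ['abs_correct', 'abs_identity']
    print(select(candidates, strong_tests))  # ['abs_correct']
```

With the weak suite the wrong candidate survives, so the suite, not the generator, decides what counts as "correct".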
Taking the metaphor further, the traditional way of programming was to manually encode the logic, and the new way is to program the environment and context to let the correct program emerge through the constraints. The stricter and more precise the constraints, the closer the result is to what you want.
So then, as you say, being able to specify exactly what you want becomes the central skill of programming - I mean, describe the behavior not in terms of the final code, which is an implementation detail, but how it interacts with a given environment. That was always the case since in higher-level languages, including C, what we write is not the final code, which is technically the compiled result.
A difference I notice is that, now, even junior devs are expected to be the "mentor" to language models - teaching and guiding them to generate well-written code with plenty of tests, asserts, and other guardrails. In another comment someone said, breaking down a large program into smaller modules is useful - which is common sense, but we now have to guide an LLM to know and apply best practices, design patterns, useful tricks to improve code organization or performance, etc.
That means, it would be valuable to codify best practices, as documentation in Markdown as well as described in code, as specs and tests. Programming is becoming meta-programming. We're shifting emphasis from assembling genetic code manually to preparing the environment for such code to evolve.
If you extend this line of thinking a lot, given we traditionally author the software, everything kind of boils down to a genetic algorithm.
Or, we could just use deterministic seeds in our LLM calls and solve the problem at the root.
Obviously this won't work if your tools are not deterministic, but reproducible builds is a well-trodden discipline.
This is actually a feature that OpenAI offers via the API. It doesn't work the way you want it to, though. It makes the output less random, not deterministic, and they even warn you of that in the docs.
> The Solution is Code-Checking Code
I'm finding code falls into two categories. Code that produces known results and code that produces results that are not known. For example, creating a table with a pagination component with a backend that loads the first 30 rows ordered by date descending from the database on page 1 and the second set of 30 rows on page 2. We know what the code is supposed to output, we know what the result looks like. On the other hand, there is code that does statistical analysis on the 30 rows of data. This is different because we don't know what the result is.
The known result code is easy to use an LLM with. I have a skill that will iterate with an OODA loop — observe, act, and validate. It will in the validate step take screenshots and even without telling it, it will query the database from the CLI, compare the rendered row data to the database data. It will more surprisingly make sure that all the components are responsive and render beautifully on mobile. I'm orders of magnitude past linting here which is solved with Biome.
The statistical analysis is different. The only way I can know for sure of the result is by writing the code painstakingly by hand. The LLM will always produce specious lies. It will fabricate and show me what I want to see, not the truth. This is because until it is written manually by hand, there is no ground truth. In this case, there is no code checking code.
OODA: Observe, Orient, Decide, Act.
> There is no need for determinism to guarantee the job will be done identically every time if we only plan to do it once.
So can't you just save the conversation transcript and replay it with the tools? Seems a lot more efficient than regenerating the whole thing. And, also, no risk of branching when a tool reply is slightly different. (Of course, errors can occur on subsequent runs.)
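Roughly what I have in mind, as a sketch only: the transcript format, the `write_file` tool, and the dict standing in for a filesystem are all assumptions for illustration. The model is involved once to produce the transcript; the replay just re-executes the recorded tool calls in order and never asks the model what to do next.

```python
def make_tools(files):
    """Toy tools that operate on an in-memory dict instead of a real disk."""
    def write_file(path, content):
        files[path] = content
        return "ok"
    return {"write_file": write_file}

def replay(transcript, tools):
    """Re-run recorded tool calls in order, with no model in the loop."""
    results = []
    for step in transcript:
        tool = tools[step["tool"]]            # KeyError if the tool is gone
        results.append(tool(**step["args"]))  # tool errors propagate to the caller
    return results

# Assumption: a transcript saved from an earlier, model-driven session.
transcript = [
    {"tool": "write_file", "args": {"path": "a.txt", "content": "hello"}},
    {"tool": "write_file", "args": {"path": "b.txt", "content": "world"}},
]

if __name__ == "__main__":
    files = {}
    print(replay(transcript, make_tools(files)))
    print(files)
```

Since the sequence of calls is fixed, a slightly different tool reply cannot send the run down another branch, but a tool can still fail on a later run, as noted.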
Here is my theory about weaving deterministic code and prompts: https://github.com/zby/llm-do/blob/main/docs/theory.md . Plus a library that realises the unified call space that I propose.
I think co-recursion between prompts and code is crucial, but I also think that the ephemeral nature of code in Recursive Language Models is impeding deployment time learning (https://github.com/zby/llm-do/blob/main/kb/notes/deploy-time...).
I wrote a version of this post a while back that gets into a bit more detail as to HOW to bolt on the determinism.
I'm glad to see others talking about it. One day we'll look back on this era the same way folks look back at the time before we validated inputs.
https://www.stevenathompson.com/effective-vibe-coding-best-p...
How does writing tests, or in the new fashion, stealing tests from somewhere else make anything deterministic?
LLMs really cause diminished reasoning, or in terms that LLM people might understand: Your minds have been quantized!
this is a long article that doesn't say much at all. likely generated by AI?
it goes on for ages just to reach the point of "write the tests first"
We really need to add "please don't write comments witch-hunting articles for AI usage" into the guidelines at this rate
It is useful for those of us always checking the comments first, to decide if the article is worth reading.
I'd argue that another key aspect is to break programs up into small independent units that can be verified in isolation, and to compose them into larger programs with contracts between them. I've had a pretty good experience using Claude with a framework where I express the program as a state graph, and each node is treated like a microservice that gets some input and produces some output. Then the workflow engine verifies that the output matches the declared schema and then decides which step to execute next. https://github.com/yogthos/mycelium
As the state travels across the graph, I keep a trace of the steps that were executed. When an error happens, the agent has a lot more information than it normally would: it can see which decision points the code has already passed through, cross-reference that with the declared workflow, and quickly find where it screwed up.
The idea of workflow engines has been around for a long time, but they feel too awkward to use when you're writing code by hand. Writing conditional logic directly in the code keeps you in your flow, and having to jump out and declare it in config somewhere feels awkward. Coding agents completely change the dynamic though because they don't have that problem. If the LLM is writing the code, then I can just focus on ensuring the code meets the contract, while the agent can deal with the implementation details.
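For anyone who wants the shape of the idea without reading the repo, here is a stripped-down sketch in Python. It is not the mycelium API; the node layout, the `run_workflow` name, and the schema format (key mapped to expected type) are assumptions made up for this example.

```python
def run_workflow(nodes, start, state):
    """Run nodes in sequence, checking each output against its declared schema."""
    trace = []
    current = start
    while current is not None:
        node = nodes[current]
        output = node["run"](state)
        # Contract check: every declared key must be present with the right type.
        bad = [key for key, expected in node["schema"].items()
               if not isinstance(output.get(key), expected)]
        if bad:
            raise ValueError(f"node {current!r} broke its contract on {bad}; "
                             f"steps completed so far: {trace}")
        trace.append(current)
        state = {**state, **output}
        current = node["next"](state)  # the engine decides the next step
    return state, trace

# Hypothetical two-step workflow: parse a string, then double the number.
nodes = {
    "parse": {
        "run": lambda s: {"number": int(s["raw"])},
        "schema": {"number": int},
        "next": lambda s: "double",
    },
    "double": {
        "run": lambda s: {"result": s["number"] * 2},
        "schema": {"result": int},
        "next": lambda s: None,
    },
}

if __name__ == "__main__":
    final_state, trace = run_workflow(nodes, "parse", {"raw": "21"})
    print(final_state["result"], trace)  # 42 ['parse', 'double']
```

The error message carries the trace, which is the part that helps an agent see how far the run got before a contract was violated.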
Is English deterministic and/or predictable?
Aubergine. I'm guessing no one could have predicted that word would be next. If the universe is a deterministic simulation (of what?) that could be run backward and forward predictably, then of course the next word was always going to be "aubergine" with 100% certainty. In that case, all we need is the entire state of the universe to predict the next moment.
Maybe if we use a subset of English with a very specific set of rules to make it more deterministic? Some specific words or combination of words can have special meaning. And use symbols to make it a little bit easier to type the prompts and save on tokens/context size.
soon
The verifier doesn't need to be deterministic, just to output a proof artifact that can be independently validated for correctness.
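A small illustration of that, using sorting as a stand-in problem (the function names are hypothetical). The producer can be as unreliable as you like; it hands back its answer together with a certificate, here the permutation it claims to have applied, and a short deterministic checker validates both.

```python
def untrusted_sort(items):
    """Stand-in for an unreliable producer: returns output plus a certificate."""
    perm = sorted(range(len(items)), key=lambda i: items[i])
    return [items[i] for i in perm], perm

def validate(items, output, perm):
    """Independent check: output is a sorted rearrangement of items."""
    n = len(items)
    # The certificate must be a genuine permutation of 0..n-1.
    if len(output) != n or sorted(perm) != list(range(n)):
        return False
    # Every output element must be the input element the certificate names.
    if any(output[k] != items[perm[k]] for k in range(n)):
        return False
    # The output must be in nondecreasing order.
    return all(output[k] <= output[k + 1] for k in range(n - 1))

if __name__ == "__main__":
    data = [3, 1, 2]
    output, perm = untrusted_sort(data)
    print(validate(data, output, perm))            # True
    print(validate(data, [1, 2, 4], [1, 2, 0]))    # False: 4 is not in the input
```

The checker never needs to know how the answer was produced, only that the artifact holds up.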