How LLMs Actually Work: Tokens, Context, and Probability
A production-minded explanation of what LLMs actually do under the hood, and why tokens, context windows, and probability matter for cost, latency, and reliability.
Tokens, Context, and Probability — What Engineers Really Need to Know
TL;DR
Large Language Models (LLMs) do not “understand” language the way humans do. They do not reason, verify facts, or possess internal knowledge. At their core, LLMs perform one operation only: estimating the probability of every possible next token, given the sequence of previous tokens (the context), and emitting one of them. Everything that looks intelligent emerges from this probabilistic process.
1. LLMs Do Not “Understand” — and Why That Matters
When engineers first work with LLMs, it is natural to anthropomorphize them:
- “The model understands my request.”
- “It knows this concept.”
- “It reasoned its way to that answer.”
This assumption is the most dangerous mistake when building real systems with AI.
An LLM:
- Has no concept of truth
- Has no awareness of correctness
- Does not know when it is guessing
What it does instead is simple but powerful:
Given the tokens it has already seen, it predicts the most likely next token.
If you internalize this, many confusing AI behaviors stop being mysterious—and start becoming predictable.
2. What Is a Token (and Why Tokens Are Not Words)
LLMs do not operate on words. They operate on tokens—subword units learned during training.
A token can be:
- A full word
- Part of a word
- Punctuation
- Even whitespace
Example:
"unbelievable" → ["un", "believ", "able"]
Why tokens matter to engineers
- Cost: most LLM APIs charge per token, input and output alike (see the rough estimate below).
- Latency: more tokens generally mean slower inference.
- Context limits: context windows are measured in tokens, not characters or words.
- RAG and chunking: retrieval strategies must respect token boundaries, not human-friendly text length.
If you are designing AI systems without thinking in tokens, you are optimizing blindly.
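As a rough illustration of the cost side, the sketch below estimates the price of a single call from its token counts. The per-token prices are hypothetical placeholders, not real rates:

```python
# Sketch: estimate the cost of one LLM call from its token counts.
# The per-1k-token prices are hypothetical placeholders; check your provider.
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  price_per_1k_input: float = 0.0005,
                  price_per_1k_output: float = 0.0015) -> float:
    """Return an estimated cost in dollars for a single request."""
    return (prompt_tokens / 1000) * price_per_1k_input \
         + (completion_tokens / 1000) * price_per_1k_output

# Example: a 3,000-token prompt (instructions + history + retrieved docs)
# producing a 500-token answer.
print(f"${estimate_cost(3000, 500):.4f}")
```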
3. Context Window: The Hard Limit of Every LLM
LLMs have no long-term memory.
The only information they can use is what exists inside the current context window:
- System instructions
- User input
- Conversation history
- Retrieved documents (RAG)
Once information falls outside that window, it effectively does not exist.
Practical implications
- Long conversations cause earlier details to be forgotten
- “Memory” in chatbots is just re-injecting data into context
- Stuffing more context does not always improve quality
A context window is a scarce resource, closer to RAM than to persistent storage.
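A minimal sketch of the bookkeeping this forces on you: trimming conversation history so it fits a fixed token budget. The whitespace-based token count and the budget value are stand-in assumptions; a real system would count with the model's own tokenizer:

```python
# Sketch: keep only as much recent history as fits in a fixed token budget.
# Token counting is approximated with a whitespace split here; in practice
# use the model's own tokenizer for accurate counts.
def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def fit_history(messages: list[str], budget: int) -> list[str]:
    """Walk backwards from the newest message, dropping whatever no longer fits."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # everything older than this is simply gone
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["first message " * 50, "second message " * 50, "latest question?"]
print(fit_history(history, budget=60))
```

Anything that falls off the end of this budget never reaches the model, which is exactly what "forgetting" looks like in practice.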
4. Probability Is the Core of Every Answer
At each generation step, an LLM:
- Reads the entire context
- Assigns probabilities to all possible next tokens
- Selects one token based on a sampling strategy
There is no internal step for:
- Fact-checking
- Logical validation
- Cross-referencing knowledge
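To make that loop concrete, here is a toy version of a single generation step over a four-token vocabulary. The logits are invented numbers standing in for what a real model would compute from the full context:

```python
# Sketch: one generation step reduced to its essentials.
# The logits are invented scores over a tiny vocabulary; a real model would
# produce them from the entire context.
import math
import random

vocab = ["Paris", "London", "Berlin", "banana"]
logits = [4.2, 2.1, 1.9, -3.0]

# Softmax: turn raw scores into a probability distribution over the vocabulary.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Sample the next token from that distribution. Note what is missing:
# no fact-checking, no validation, just weighted random selection.
next_token = random.choices(vocab, weights=probs, k=1)[0]

print([(t, round(p, 3)) for t, p in zip(vocab, probs)])
print("next token:", next_token)
```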
Why answers sound convincing
LLMs are trained on massive amounts of human-written text. They are extremely good at producing sequences that sound correct, even when they are factually wrong.
Fluency is not accuracy.
5. Sampling and the Illusion of Control
Engineers often assume:
“If I tune temperature or top-p correctly, the model will be more reliable.”
In reality:
- Sampling controls diversity, not correctness
- Lower temperature → more stable outputs, still potentially wrong
- Higher temperature → more creative outputs, higher risk
Sampling does not give the model new knowledge. It only changes how probabilities are explored.
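A toy illustration of this point: temperature rescales the same invented logits from the previous sketch, sharpening or flattening the distribution without adding any information to it. Top-p similarly truncates the low-probability tail of the distribution before sampling:

```python
# Sketch: temperature rescales logits before softmax. It changes how the
# probability mass is spread, not what the model "knows".
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    scaled = [x / temperature for x in logits]
    exps = [math.exp(x) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.2, 2.1, 1.9, -3.0]  # same invented scores as before
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
# Low temperature concentrates mass on the top token; high temperature
# flattens the distribution. Neither makes the top token more correct.
```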
6. Why Models Sound Confident Even When They Are Wrong
LLMs do not have a concept of uncertainty.
They do not know:
- When they are guessing
- When information is missing
- When an answer is unreliable
If a sequence of tokens has high probability, it will be generated—even if it is entirely fabricated.
This explains why:
- Hallucinations are confident and well-written
- Wrong answers rarely include hesitation
- “I don’t know” must be explicitly engineered
7. Design Consequences for Production Systems
For engineers, these facts lead to unavoidable conclusions:
- LLMs are not sources of truth: they must be grounded via tools, retrieval, or rules.
- Outputs must be constrained and validated: free-form text is dangerous in production (see the validation sketch below).
- Context must be managed deliberately: more context is not always better.
- Prompts do not add knowledge: they only shape probability distributions.
- LLMs are probabilistic components: they must be treated as non-deterministic parts of the system.
Ignoring these realities leads to fragile systems and expensive failures.
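One concrete way to act on "constrain and validate": treat the model's output as untrusted input and parse it before anything downstream consumes it. The expected fields here (product_id, quantity) are hypothetical; adapt the checks to your own schema:

```python
# Sketch: validate a model response before letting it drive business logic.
# The expected fields and types are hypothetical; adapt them to your schema.
import json

def parse_order_intent(raw_output: str) -> dict | None:
    """Return a validated dict, or None so the caller can retry or fall back."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # free-form text: refuse it rather than guess

    if not isinstance(data, dict):
        return None
    if not isinstance(data.get("product_id"), str):
        return None
    if not isinstance(data.get("quantity"), int) or data["quantity"] <= 0:
        return None
    return data

print(parse_order_intent('{"product_id": "sku-42", "quantity": 2}'))  # valid dict
print(parse_order_intent("Sure! I ordered it for you."))              # rejected -> None
```

If validation fails, the caller can retry, fall back to deterministic logic, or escalate to a human, rather than letting free-form text flow into the rest of the system.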
8. Common Misplaced Expectations
You are likely misusing an LLM if you expect it to:
- Remember information across sessions
- Verify factual correctness on its own
- Replace deterministic business logic
- Be reliable without evaluation or monitoring
LLMs are powerful—but only when their limitations are respected.
Conclusion
Understanding how LLMs work does not reduce their usefulness. It makes them deployable. Every advanced technique—prompting, RAG, agents, evaluation—rests on one foundation:
An LLM predicts tokens based on probability, constrained by context.
If you design with that reality in mind, you will build AI systems that are more robust, predictable, and trustworthy.