this post was submitted on 08 Jun 2025

836 points (95.4% liked)

Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. (archive.is)

submitted 1 year ago* (last edited 1 year ago) by Allah@lemm.ee to c/technology@lemmy.world

LOOK MAA I AM ON FRONT PAGE

[–] sev@nullterra.org 49 points 1 year ago (2 children)

Just fancy Markov chains with the ability to link bigger and bigger token sets. It can only ever kick off processing as a response and can never initiate any line of reasoning. This, along with the fact that its working set of data can never be updated moment-to-moment, means that it would be a physical impossibility for any LLM to achieve any real "reasoning" processes.

[–] kescusay@lemmy.world 18 points 1 year ago (2 children)

I can envision a system where an LLM becomes one part of a reasoning AI, acting as a kind of fuzzy "dataset" that a proper neural network incorporates and reasons with, and the LLM could be kept real-time updated (sort of) with MCP servers that incorporate anything new it learns.

But I don't think we're anywhere near there yet.

[–] riskable@programming.dev 9 points 1 year ago

The only reason we're not there yet is memory limitations.

Eventually some company will come out with AI hardware that lets you link up a petabyte of ultra fast memory to chips that contain a million parallel matrix math processors. Then we'll have an entirely new problem: AI that trains itself incorrectly too quickly.

Just you watch: The next big breakthrough in AI tech will come around 2032-2035 (when the hardware is available) and everyone will be bitching that "chain reasoning" (or whatever the term turns out to be) isn't as smart as everyone thinks it is.

[–] homura1650@lemm.ee 2 points 1 year ago (1 children)

LLMs (at least in their current form) are proper neural networks.

[–] kescusay@lemmy.world 1 points 1 year ago

Well, technically, yes. You're right. But they're a specific, narrow type of neural network, while I was thinking of the broader class and more traditional applications, like data analysis. I should have been more specific.

[–] auraithx@lemmy.dbzer0.com 8 points 1 year ago (3 children)

Unlike Markov models, modern LLMs use transformers that attend to full contexts, enabling them to simulate structured, multi-step reasoning (albeit imperfectly). While they don’t initiate reasoning like humans, they can generate and refine internal chains of thought when prompted, and emerging frameworks (like ReAct or Toolformer) allow them to update working memory via external tools. Reasoning is limited, but not physically impossible, it’s evolving beyond simple pattern-matching toward more dynamic and compositional processing.

[–] spankmonkey@lemmy.world 6 points 1 year ago (1 children)

Reasoning is limited

Most people wouldn't call zero of something 'limited'.

[–] auraithx@lemmy.dbzer0.com 11 points 1 year ago (2 children)

The paper doesn’t say LLMs can’t reason, it shows that their reasoning abilities are limited and collapse under increasing complexity or novel structure.

[–] technocrit@lemmy.dbzer0.com 3 points 1 year ago

The paper doesn’t say LLMs can’t reason

Authors gotta get paid. This article is full of pseudo-scientific jargon.

[–] spankmonkey@lemmy.world 3 points 1 year ago (1 children)

I agree with the author.

If these models were truly "reasoning," they should get better with more compute and clearer instructions.

The fact that they only work up to a certain point despite increased resources is proof that they are just pattern matching, not reasoning.

[–] auraithx@lemmy.dbzer0.com 7 points 1 year ago (1 children)

Performance eventually collapses due to architectural constraints, this mirrors cognitive overload in humans: reasoning isn’t just about adding compute, it requires mechanisms like abstraction, recursion, and memory. The models’ collapse doesn’t prove “only pattern matching”, it highlights that today’s models simulate reasoning in narrow bands, but lack the structure to scale it reliably. That is a limitation of implementation, not a disproof of emergent reasoning.

[–] technocrit@lemmy.dbzer0.com -1 points 1 year ago (1 children)

Performance collapses because luck runs out. Bigger destruction of the planet won't fix that.

[–] auraithx@lemmy.dbzer0.com 2 points 1 year ago (2 children)

Brother you better hope it does because even if emissions dropped to 0 tonight the planet wouldn't stop warming and it wouldn't stop what's coming for us.

[–] MCasq_qsaCJ_234@lemmy.zip -1 points 1 year ago (1 children)

If the situation gets dire, it's likely that the weather will be manipulated. Countries would then have to be convinced not to use this for military purposes.

[–] auraithx@lemmy.dbzer0.com 2 points 1 year ago

This isn’t a thing.

[–] riskable@programming.dev 5 points 1 year ago

I'm not convinced that humans don't reason in a similar fashion. When I'm asked to produce pointless bullshit at work my brain puts in a similar level of reasoning to an LLM.

Think about "normal" programming: An experienced developer (that's self-trained on dozens of enterprise code bases) doesn't have to think much at all about 90% of what they're coding. It's all bog standard bullshit so they end up copying and pasting from previous work, Stack Overflow, etc because it's nothing special.

The remaining 10% is "the hard stuff". They have to read documentation, search the Internet, and then (after all that effort to avoid having to think) they sigh and actually start thinking in order to program the thing they need.

LLMs go through similar motions behind the scenes! Probably because they were created by software developers. But they still fail at that last 10%: the stuff that requires actual thinking.

Eventually someone is going to figure out how to auto-generate LoRAs based on test cases combined with trial and error that then get used by the AI model to improve itself and that is when people are going to be like, "Oh shit! Maybe AGI really is imminent!" But again, they'll be wrong.

AGI won't happen until AI models get good at retraining themselves with something better than basic reinforcement learning. In order for that to happen you need the working memory of the model to be nearly as big as the hardware that was used to train it. That, and loads and loads of spare matrix math processors ready to go for handling that retraining.

[–] vrighter@discuss.tchncs.de 2 points 1 year ago (1 children)

previous input goes in. Completely static, prebuilt model processes it and comes up with a probability distribution.

There is no "unlike Markov chains". They are Markov chains. Ones with a long context (a Markov chain also makes use of all the context provided to it, so I don't know what you're on about there). LLMs are just a (very) lossy compression scheme for the state transition table. Computed once, applied blindly to any context fed in.
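
For readers who haven't built one, here is a minimal sketch of the classic "state transition table" being referred to, assuming a plain order-k chain estimated from token counts. The name `build_table` and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def build_table(tokens, order):
    """Count which token follows each length-`order` state, then normalize to probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        counts[state][tokens[i + order]] += 1
    return {
        state: {tok: n / sum(ctr.values()) for tok, n in ctr.items()}
        for state, ctr in counts.items()
    }

# Hypothetical toy corpus; real chains are built from far more text.
table = build_table("a b a b a c".split(), order=1)
```

Note that a state that never appeared in the corpus simply has no row here. Whether that gap is an essential difference from an LLM or just a limit of this particular way of filling the table is what the rest of the thread argues about.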

[–] auraithx@lemmy.dbzer0.com 5 points 1 year ago (1 children)

LLMs are not Markov chains, even extended ones. A Markov model, by definition, relies on a fixed-order history and treats transitions as independent of deeper structure. LLMs use transformer attention mechanisms that dynamically weigh relationships between all tokens in the input—not just recent ones. This enables global context modeling, hierarchical structure, and even emergent behaviors like in-context learning. Markov models can't reweight context dynamically or condition on abstract token relationships.

The idea that LLMs are "computed once" and then applied blindly ignores the fact that LLMs adapt their behavior based on input. They don’t change weights during inference, true—but they do adapt responses through soft prompting, chain-of-thought reasoning, or even emulated state machines via tokens alone. That’s a powerful form of contextual plasticity, not blind table lookup.

Calling them “lossy compressors of state transition tables” misses the fact that the “table” they’re compressing is not fixed—it’s context-sensitive and computed in real time using self-attention over high-dimensional embeddings. That’s not how Markov chains work, even with large windows.

[–] vrighter@discuss.tchncs.de 2 points 1 year ago* (last edited 1 year ago) (1 children)

their input is the context window. Markov chains also use their whole context window. LLMs are a novel implementation that can work with much longer contexts, but as soon as something slides out of its window, it's forgotten. just like any other markov chain. They don't adapt. You add their token to the context, slide the oldest one out and then you have a different context, on which you run the same thing again. A normal markov chain will also give you a different output if you give it a different context. Their biggest weakness is that they don't and can't adapt. You are confusing the encoding of the context with the model itself. Just to see how static the model is, try setting temperature to 0, and giving it the same context. i.e. only try to predict one token with the exact same context each time. As soon as you try to predict a 2nd token, you've just changed the input and run the thing again. It's not adapting, you asked it something different, so it came up with a different answer.
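
The generation loop described above can be sketched roughly like this. It is a toy, not a real LLM: `next_token_scores` is a made-up stand-in for frozen weights (it just hashes the context), and `VOCAB` and `WINDOW` are arbitrary assumptions.

```python
import hashlib

VOCAB = ["a", "b", "c", "d"]  # hypothetical toy vocabulary
WINDOW = 3                    # hypothetical context window length

def next_token_scores(context):
    """Stand-in for a frozen model: a pure function of the context, nothing else."""
    scores = []
    for tok in VOCAB:
        digest = hashlib.sha256(("|".join(context) + "->" + tok).encode()).digest()
        scores.append(digest[0])
    return scores

def greedy_step(context):
    """Temperature 0: keep only the last WINDOW tokens and pick the top-scoring token."""
    window = tuple(context[-WINDOW:])
    scores = next_token_scores(window)
    return VOCAB[scores.index(max(scores))]

def generate(prompt, steps):
    """Append the chosen token and run the same fixed function again on the new context."""
    out = list(prompt)
    for _ in range(steps):
        out.append(greedy_step(out))
    return out
```

Running `generate` twice on the same prompt gives the same sequence, and anything that has slid out of the window has no effect on the next step.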

[–] auraithx@lemmy.dbzer0.com 6 points 1 year ago (1 children)

While both Markov models and LLMs forget information outside their window, that’s where the similarity ends. A Markov model relies on fixed transition probabilities and treats the past as a chain of discrete states. An LLM evaluates every token in relation to every other using learned, high-dimensional attention patterns that shift dynamically based on meaning, position, and structure.

Changing one word in the input can shift the model’s output dramatically by altering how attention layers interpret relationships across the entire sequence. It’s a fundamentally richer computation that captures syntax, semantics, and even task intent, which a Markov chain cannot model regardless of how much context it sees.
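
A rough sketch of what "attention patterns that shift dynamically" means mechanically, under heavy simplifying assumptions: one head, no learned projections, and hand-picked 2-d embeddings in `EMB` that are purely hypothetical.

```python
import math

# Hypothetical 2-d embeddings; real models learn these and use many more dimensions.
EMB = {
    "the": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "dog": [2.0, 0.5],
    "sat": [1.0, 1.0],
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def last_token_attention(tokens):
    """Scaled dot-product attention weights of the last token over every token in the input."""
    vecs = [EMB[t] for t in tokens]
    query = vecs[-1]
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in vecs]
    return softmax(scores)

w_cat = last_token_attention(["the", "cat", "sat"])
w_dog = last_token_attention(["the", "dog", "sat"])
```

Swapping "cat" for "dog" changes the weight on every position, including "the", because the softmax renormalizes over the whole sequence. The function computing those weights is itself fixed, though, which is the point the other side of this thread keeps making.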

[–] vrighter@discuss.tchncs.de 1 points 1 year ago (1 children)

an llm also works on fixed transition probabilities. All the training is done during the generation of the weights, which are the compressed state transition table. After that, it's just a regular old markov chain. I don't know why you seem so fixated on getting different output if you provide different input (as I said, each token generated is a separate independent invocation of the llm with a different input). That is true of most computer programs.

It's just an implementation detail. The markov chains we are used to have a very short context, due to combinatorial explosion when generating the state transition table. With llms, we can use a much much longer context. Put that context in, it runs through the completely immutable model, and out comes a probability distribution. Any calculations done during the calculation of this probability distribution are then discarded, the chosen token added to the context, and the program is run again with zero prior knowledge of any reasoning about the token it just generated. It's a separate execution with absolutely nothing shared between them, so there can't be any "adapting" going on.

[–] auraithx@lemmy.dbzer0.com 2 points 1 year ago* (last edited 1 year ago) (1 children)

Because transformer architecture is not equivalent to a probabilistic lookup. A Markov chain assigns probabilities based on a fixed-order state transition, without regard to deeper structure or token relationships. An LLM processes the full context through many layers of non-linear functions and attention heads, each layer dynamically weighting how each token influences every other token.

Although weights do not change during inference, the behavior of the model is not fixed in the way a Markov chain’s state table is. The same model can respond differently to very similar prompts, not just because the inputs differ, but because the model interprets structure, syntax, and intent in ways that are contextually dependent. That is not just longer context; it is fundamentally more expressive computation.

The process is stateless across calls, yes, but it is not blind. All relevant information lives inside the prompt, and the model uses the attention mechanism to extract meaning from relationships across the sequence. Each new input changes the internal representation, so the output reflects contextual reasoning, not a static response to a matching pattern. Markov chains cannot replicate this kind of behavior no matter how many states they include.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago (1 children)

an llm works the same way! Once it's trained, none of what you said applies anymore. The same model can respond differently with the same inputs specifically because after the llm does its job, sometimes we intentionally don't pick the most likely token, but choose a different one instead. RANDOMLY. Set the temperature to 0 and it will always reply with the same answer. And llms also have a fixed order state transition. Just because you only typed one word doesn't mean that that token is not preceded by n-1 null tokens. The llm always receives the same number of tokens. It cannot work with an arbitrary number of tokens.

all relevant information "remains in the prompt" only until it slides out of the context window, just like any markov chain.
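
A minimal sketch of the sampling step being described, assuming the model has already produced a list of logits. The function name `sample` and the example logits are made up, and treating temperature 0 as a plain argmax is a common convention rather than any specific library's behavior.

```python
import math
import random

def sample(logits, temperature, rng):
    """Pick a token index from logits; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.0, 3.0, 2.0]  # hypothetical model output for one fixed context
greedy = [sample(logits, 0, random.Random(seed)) for seed in range(10)]
rng = random.Random(0)
sampled = {sample(logits, 1.0, rng) for _ in range(200)}
```

With temperature 0 every call returns the same index regardless of the random seed. With a nonzero temperature the same logits yield different picks, so the variety comes from the sampler, not from the function that produced the logits.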

[–] auraithx@lemmy.dbzer0.com 2 points 1 year ago (1 children)

You're conflating surface-level architectural limits with core functional behaviour. Yes, an LLM is deterministic at temperature 0 and produces the same output for the same input, but that does not make it equivalent to a Markov chain. A Markov chain defines transitions based on fixed-order memory and static probabilities. An LLM generates output by applying a series of matrix multiplications, activations, and attention-weighted context aggregations across multiple layers, where the representation of each token is conditioned on the entire input sequence, not just on recent tokens.

While the model has a maximum token limit, it does not receive a fixed-length input filled with nulls. It processes variable-length input sequences up to the context limit, and attention masks control which positions are used. These are not hardcoded state transitions; they are dynamically computed weightings over continuous embeddings, where meaning arises from the interaction of tokens, not from simple position or order alone.

Saying that output diversity is just randomness misunderstands why random sampling exists: to explore the rich distribution the model has learned from data, not to fake intelligence. The depth of its output space comes from how it models relationships, hierarchies, syntax, and semantics through training. Markov chains do not do any of this. They map sequences to likely next symbols without modeling internal structure. An LLM’s output reflects high-dimensional reasoning over the prompt. That behavior cannot be reduced to fixed transition logic.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago* (last edited 1 year ago) (1 children)

the probabilities are also fixed after training. You seem to be conflating running the llm with different input with the model somehow adapting. The new context goes into the same fixed model. And yes, it can be reduced to fixed transition logic, you just need to have all possible token combinations in the table. This is obviously intractable due to space issues, so we came up with a lossy compression scheme for it. The table itself is learned once, then it's fixed. The training goes into generating a huge markov chain. Just because the table is learned from data, doesn't change what it actually is.

[–] auraithx@lemmy.dbzer0.com 2 points 1 year ago* (last edited 1 year ago) (1 children)

This argument collapses the entire distinction between parametric modeling and symbolic lookup. Yes, the weights are fixed after training, but the key point is that an LLM does not store or retrieve a state transition table. It learns to approximate the probability of the next token given a sequence through function approximation, not by memorizing discrete transitions. What appears to be a "table" is actually a deep, distributed representation compressed into continuous weight matrices. It is not indexing state transitions, it is computing probabilities from patterns in the input space.

A true Markov chain defines transition probabilities over explicit states. An LLM embeds tokens into high-dimensional vectors, then transforms them repeatedly using self-attention and feedforward layers that can capture subtle syntactic, semantic, and structural features. These features interact in nonlinear ways that go far beyond what any finite transition table could express. You cannot meaningfully represent an LLM’s behavior as a finite Markov model, even in principle, because its representations are not enumerable states but regions of a continuous latent space.

Saying “you just need all token combinations in a table” ignores the fact that the model generalizes to combinations never seen during training. That is the core of its power. It doesn’t look up learned transitions; it constructs responses by interpolating through an embedding space guided by attention and weight structure. No Markov chain does this. A lossy compressor of a transition table still implies a symbolic map; a neural network is a differentiable function trained to fit a distribution, not to encode it explicitly.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago* (last edited 1 year ago) (1 children)

yes, the matrix and several levels are the "decompression". At the end you get one probability distribution, deterministically. And the state is the whole context, not just the previous token. Yes, if we were to build the table manually with only available data, lots of cells would just be 0. That's why the compression is lossy. There would actually be nothing stopping anyone from filling those 0 cells out, it's just infeasible. you could still put states you never actually saw, but are theoretically possible in the table. And there's nothing stopping someone from putting thought into it and filling them out.

Also you seem obsessed by the word table. A table is just one type of function mapping a fixed input to a fixed output. If you replaced it with a function that gives the same outputs for all inputs, then it's functionally equivalent. It being a table or some code in a function is just an implementation detail.

As a thought exercise imagine setting temperature to 0, passing in every combination of input tokens, and recording the output for every single one of them. Put them all in a "table" (assuming you have practically infinite space) and you have a markov chain that is 100% functionally equivalent to the neural network with all its layers and complexity. But it does it without the neural network, and gives 100% identical results every single time in O(1). Because we don't have infinite time and space, we had to come up with a mapping function to replace the table. And because we have no idea how to make a good approximation of such a huge function, we use machine learning to come up with a suitable function for us, given tons of data. You can introduce some randomness in the sampling of that, and you now have nonzero temperature again.

Ex. A table containing the digits of pi, in order, could be transparently replaced with a spigot algorithm that calculates the nth digit on-demand. Output would be exactly the same
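
The thought exercise can be sketched at toy scale. Everything here is an assumption for illustration: a 3-token vocabulary, a context length of 2, and an arbitrary arithmetic `model` standing in for a frozen network.

```python
import itertools

VOCAB = (0, 1, 2)   # hypothetical token ids
CONTEXT_LEN = 2     # hypothetical context length

def model(context):
    """Stand-in for a frozen network: maps a context to a next-token distribution."""
    raw = [
        1 + (sum((i + 1) * t for i, t in enumerate(context)) + 2 * tok) % 5
        for tok in VOCAB
    ]
    total = sum(raw)
    return tuple(r / total for r in raw)

# Run the function once on every possible context and record the result.
ALL_CONTEXTS = list(itertools.product(VOCAB, repeat=CONTEXT_LEN))
TABLE = {ctx: model(ctx) for ctx in ALL_CONTEXTS}

def table_lookup(context):
    """Same input/output behavior as model(), with no computation at lookup time."""
    return TABLE[tuple(context)]
```

From the outside, `table_lookup` and `model` are indistinguishable on this finite input space. What the sketch cannot settle is whether that input/output equivalence is the only thing that matters, which is the actual disagreement here.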

[–] auraithx@lemmy.dbzer0.com 1 points 1 year ago* (last edited 1 year ago) (1 children)

This is an elegant metaphor, but it fails to capture the essential difference between symbolic enumeration and neural computation. Representing an LLM as a decompression function that reconstructs a giant transition table assumes that the model is approximating a complete, enumerable mapping of inputs to outputs. That’s not what is happening. LLMs are not trained to reproduce every possible sequence. They are trained to generalize over an effectively infinite space of token combinations, including many never seen during training.

Your thought experiment—recording the output for every possible input at temperature 0—would indeed give you a deterministic function that could be stored. But this imagined table is not a Markov chain. It is a cached output of a deep contextual function, not a probabilistic state machine. A Markov model, by definition, uses transition probabilities based on fixed state history and lacks internal computation. An LLM generates the distribution through recursive transformation of continuous embeddings with positional and attention-based conditioning. That is not equivalent to symbolically defining state transitions, even if you could record the output for every input.

The analogy to a spigot algorithm for pi misses the point. That algorithm computes digits of a predefined number. An LLM doesn't compute a predetermined output. It computes a probability distribution conditioned on a context it was never explicitly trained on, using representations learned across many dimensions. The model encodes distributed knowledge and compositional patterns. A Markov table does not. Even a giant table with manually filled hypothetical entries lacks the inductive bias, generalization, and emergent capabilities that arise from the structure of a trained network.

Equivalence in output does not imply equivalence in function. Replacing a rich model with an exhaustively recorded output set may yield the same result, but it loses what makes the model powerful: the reasoning behavior from structure, not just output recall. The function is not a shortcut to a table. It is the intelligence.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago* (last edited 1 year ago) (1 children)

"lacks internal computation" is not part of the definition of markov chains. Only that the output depends only on the current state (the whole context, not just the last token) and no previous history, just like llms do. They do not consider tokens that slid out of the current context, because they are not part of the state anymore.

And it wouldn't be a cache unless you decide to start invalidating entries, which you could just not do. It would be a table with token_alphabet_size^context_length rows, with each entry being a vector of size token_alphabet_size. Because that would be too big to realistically store, we do not precompute the whole thing, and just approximate what each table entry should be using a neural network.
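
To get a feel for "too big to realistically store", here is a back-of-the-envelope sketch. The vocabulary size and context length are assumed example values, not figures for any particular model, and the helper name is made up.

```python
import math

def table_rows_digit_count(vocab_size, context_len):
    """Number of decimal digits in vocab_size ** context_len (the number of table rows)."""
    return math.floor(context_len * math.log10(vocab_size)) + 1

VOCAB_SIZE = 50257   # assumption: an example vocabulary size
CONTEXT_LEN = 4096   # assumption: an example context length

digits = table_rows_digit_count(VOCAB_SIZE, CONTEXT_LEN)
```

Under these assumptions the row count is a number with roughly nineteen thousand decimal digits, so the table is finite but nowhere near storable.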

The pi example was just to show that how you implement a function (any function) does not matter, as long as the inputs and outputs are the same. Or to put it another way if you give me an index, then you wouldn't know whether I got the result by doing some computations or using a precomputed table.

Likewise, if you give me a sequence of tokens and I give you a probability distribution, you can't tell whether I used an NN or just consulted a precomputed table. The point is that given the same input, the table will always give the same result, and crucially, so will an llm. A table is just one type of implementation for an arbitrary function.

There is also no requirement for the state transition function (a table is a special type of function) to be understandable by humans. Just because it's big enough to be beyond human comprehension, doesn't change its nature.

[–] auraithx@lemmy.dbzer0.com 1 points 1 year ago (1 children)

You're correct that the formal definition of a Markov process does not exclude internal computation, and that it only requires the next state to depend solely on the current state. But what defines a classical Markov chain in practice is not just the formal dependency structure but how the transition function is structured and used. A traditional Markov chain has a discrete and enumerable state space with explicit, often simple transition probabilities between those states. LLMs do not operate this way.

The claim that an LLM is "just" a large compressed Markov chain assumes that its function is equivalent to a giant mapping of input sequences to output distributions. But this interpretation fails to account for the fundamental difference in how those distributions are generated. An LLM is not indexing a symbolic structure. It is computing results using recursive transformations across learned embeddings, where those embeddings reflect complex relationships between tokens, concepts, and tasks. That is not reducible to discrete symbolic transitions without losing the model’s generalization capabilities. You could record outputs for every sequence, but the moment you present a sequence that wasn't explicitly in that set, the Markov table breaks. The LLM does not.

Yes, you can say a table is just one implementation of a function, and from a purely mathematical perspective, any function can be implemented as a table given enough space. But the LLM’s function is general-purpose. It extrapolates. A precomputed table cannot do this unless those extrapolations are already baked in, in which case you are no longer talking about a classical Markov system. You are describing a model that encodes relationships far beyond discrete transitions.

The pi analogy applies to deterministic functions with fixed outputs, not to learned probabilistic functions that approximate conditional distributions over language. If you give an LLM a new input, it will return a meaningful distribution even if it has never seen anything like it. That behavior depends on internal structure, not retrieval. Just because a function is deterministic at temperature 0 does not mean it is a transition table. The fact that the same input yields the same output is true for any deterministic function. That does not collapse the distinction between generalization and enumeration.

So while yes, you can implement any deterministic function as a lookup table, the nature of LLMs lies in how they model relationships and extrapolate from partial information. That ability is not captured by any classical Markov model, no matter how large.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago (1 children)

yes you can enumerate all inputs, because they are not continuous. You just raise the finite number of different tokens to the finite context size and that's exactly the size of the table you would need. finite^finite = finite. You are describing training, i.e. how the function is generated. Yes correlations are found there and encoded in a couple of matrices. Those matrices are what are used in the llm and none of what you said applies. Inference is purely a markov chain by definition.

[–] auraithx@lemmy.dbzer0.com 1 points 1 year ago (1 children)

You can say that the whole system is deterministic and finite, so you could record every input-output pair. But you could do that for any program. That doesn't make every deterministic function a Markov process. It just means it is representable in a finite way. The question is not whether the function can be stored. The question is whether its behavior matches the structure and assumptions of a Markov model. In the case of LLMs, it does not.

Inference does not become a Markov chain simply because it returns a distribution based on current input. It becomes a sequence of deep functional computations where attention mechanisms simulate hierarchical, relational, and positional understanding of language. That does not align with the definition or behavior of a Markov model, even if both map a state to a probability distribution. The structure of the computation, not just the input-output determinism, is what matters.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago (1 children)

no, not any computer program is a markov chain. only those that depend only on the current state and ignore prior history. Which fits llms perfectly.

Those sophisticated methods you talk about are just a couple of matrix multiplications. Those matrices are what's learned. Anything sophisticated happens during training. Inference is so not sophisticated. It's just multiplying some matrices together and taking the rightmost column of the result. That's it.

[–] auraithx@lemmy.dbzer0.com 1 points 1 year ago (1 children)

Yes, LLM inference consists of deterministic matrix multiplications applied to the current context. But that simplicity in operations does not make it equivalent to a Markov chain. The definition of a Markov process requires that the next output depends only on the current state. You’re assuming that the LLM’s “state” is its current context window. But in an LLM, this “state” is not discrete. It is a structured, deeply encoded set of vectors shaped by non-linear transformations across layers. The state is not just the visible tokens—it is the full set of learned representations computed from them.

A Markov chain transitions between discrete, enumerable states with fixed transition probabilities. LLMs instead apply a learned function over a high-dimensional, continuous input space, producing outputs by computing context-sensitive interactions. These interactions allow generalization and compositionality, not just selection among known paths.

The fact that inference uses fixed weights does not mean it reduces to a transition table. The output is computed by composing multiple learned projections, attention mechanisms, and feedforward layers that operate in ways no Markov chain ever has. You can’t describe an attention head with a transition matrix. You can’t reduce positional encoding or attention-weighted context mixing into state transitions. These are structured transformations, not symbolic transitions.

You can describe any deterministic process as a function, but not all deterministic functions are Markovian. What makes a process Markov is not just forgetting prior history. It is having a fixed, memoryless probabilistic structure where transitions depend only on a defined discrete state. LLMs don’t transition between states in this sense. They recompute probability distributions from scratch each step, based on context-rich, continuous-valued encodings. That is not a Markov process. It’s a stateless function approximator conditioned on a window, built to generalize across unseen input patterns.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago (1 children)

the fact that it is a fixed function that depends only on the context, AND that there is a finite number of possible discrete inputs, does make it equivalent to a huge, finite table. You really don't want this to be true. And again, you are describing training. Once training finishes, nothing you said applies anymore and you are left with fixed, unchanging matrices, which in turn means that it is a mathematical function of the context (by the mathematical definition of "function": stateless and deterministic) which also has the property that the set of all possible inputs is finite. So the set of possible outputs is also finite, and no larger than the set of possible inputs. This means the actual function that the tokens are passed through CAN be precomputed in full (in theory), making it equivalent to a conventional state transition table.

This is true whether you'd like it to be or not. The training process builds a Markov chain.
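
To put a rough number on "huge, finite table", here is a minimal sketch that counts how many distinct contexts such a table would need a row for. The function name `num_contexts` and the vocabulary/window sizes are illustrative assumptions, not figures for any particular model.

```python
import math


def num_contexts(vocab_size: int, max_len: int) -> int:
    """Count every token sequence of length 0..max_len over the vocabulary."""
    return sum(vocab_size ** k for k in range(max_len + 1))


# Toy case: 2 tokens, window of 3 -> 1 + 2 + 4 + 8 sequences.
print(num_contexts(2, 3))  # 15

# Assumed sizes, for scale only: a 50257-token vocabulary and a 1024-token window.
rows = num_contexts(50257, 1024)
print("decimal digits in the row count:", int(math.log10(rows)) + 1)
```

The count is finite, which is all the argument needs, but it is far too large to ever materialize, which is why the table exists only "in theory".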

[–] auraithx@lemmy.dbzer0.com 1 points 1 year ago (1 children)

You’re absolutely right that inference in an LLM is a fixed, deterministic function after training, and that the input space is finite due to the discrete token vocabulary and finite context length. So yes, in theory, you could precompute every possible input-output mapping and store them in a giant table. That much is mathematically valid. But where your argument breaks down is in claiming that this makes an LLM equivalent to a conventional Markov chain in function or behavior.

A Markov chain is not simply defined as “a function from finite context to next-token distribution.” It is defined by a specific type of process where the next state depends on the current state via fixed transition probabilities between discrete states. The model operates over symbolic states with no internal computation. LLMs, even during inference, compute outputs via multi-layered continuous transformations, with attention mixing, learned positional embeddings, and non-linear activations. These mechanisms mean that while the function is fixed, its structure does not resemble a state machine—it resembles a hierarchical pattern recognizer and function approximator.
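
For reference, a conventional Markov chain of the kind described above can be sketched in a few lines: discrete states, one fixed row of transition probabilities per state, and nothing computed beyond a row lookup. The three states and all probabilities below are made up for illustration.

```python
import random

# Hypothetical states and transition probabilities (each row sums to 1).
TRANSITIONS = {
    "the": {"the": 0.0, "cat": 0.9, "sat": 0.1},
    "cat": {"the": 0.2, "cat": 0.0, "sat": 0.8},
    "sat": {"the": 0.7, "cat": 0.3, "sat": 0.0},
}


def step(state: str, rng: random.Random) -> str:
    """Sample the next state using only the current state's fixed row."""
    row = TRANSITIONS[state]
    return rng.choices(list(row), weights=list(row.values()), k=1)[0]


def walk(start: str, n_steps: int, seed: int = 0) -> list[str]:
    """Run the chain for n_steps from a starting state."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path


print(walk("the", 5))
```

Here the transition table is the model itself; there are no layers or learned transformations behind it.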

Your claim is essentially that “any deterministic function over a finite input space is equivalent to a table.” This is true in a computational sense but misleading in a representational and behavioral sense. If I gave you a function that maps 4096-bit inputs to 50257-dimensional probability vectors and said, “This is equivalent to a transition table,” you could technically agree, but the structure and generative capacity of that function are not Markovian. That function may simulate reasoning, abstraction, and composition. A Markov chain never does.

You are collapsing implementation equivalence (yes, the function could be stored in a table) with model equivalence (no, it does not behave like a Markov chain). The fact that you could freeze the output behavior into a lookup structure doesn’t change that the lookup structure is derived from a fundamentally different class of computation.

The training process doesn’t “build a Markov chain.” It builds a function that estimates conditional token probabilities via optimization over a non-Markov architecture. The inference process then applies that function. That makes it a stateless function, yes—but not a Markov chain. Determinism plus finiteness does not imply Markovian behavior.

[–] vrighter@discuss.tchncs.de 1 points 1 year ago

you wouldn't be "freezing" anything. Each possible combination of input tokens maps to one output probability distribution. Those values are fixed and they are what they are whether you compute them or not, or when, or how many times.

Now you can either precompute the whole table (theory), or somehow compute each cell value every time you need it (practice). In either case, the resulting function (table lookup vs matrix multiplications) takes in only the context, and produces a probability distribution. And the mapping they generate is the same for all possible inputs. So they are the same function. A function can be implemented in multiple ways, but the implementation is not the function itself. The only difference between the two in this case is the implementation, or more specifically, whether you precompute a table or not. But the function itself is the same.
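
A toy version of that comparison, as a sketch only: the tiny vocabulary, the two-token window, and the made-up "weights" below are all assumptions chosen so the full table fits in memory. Nothing here resembles a real model's architecture; it just shows one mapping implemented two ways.

```python
import itertools
import math

VOCAB = (0, 1, 2)   # toy vocabulary (assumption)
CONTEXT_LEN = 2     # toy context window (assumption)

# Fixed, made-up "weights" standing in for frozen trained parameters.
EMBED = {0: (0.5, -1.0), 1: (1.5, 0.25), 2: (-0.75, 2.0)}
OUT = ((1.0, 0.0), (0.0, 1.0), (-1.0, 1.0))  # one row per output token


def next_token_distribution(context):
    """Compute the next-token distribution from the context alone."""
    hidden = [0.0, 0.0]
    for pos, tok in enumerate(context):
        for d in range(2):
            hidden[d] += (pos + 1) * EMBED[tok][d]  # position-weighted sum
    hidden = [math.tanh(h) for h in hidden]          # non-linearity
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in OUT]
    z = sum(math.exp(x) for x in logits)
    return tuple(math.exp(x) / z for x in logits)    # softmax


# Precompute one table row for every possible context.
TABLE = {
    ctx: next_token_distribution(ctx)
    for ctx in itertools.product(VOCAB, repeat=CONTEXT_LEN)
}


def lookup_distribution(context):
    """The same mapping, implemented as a table lookup."""
    return TABLE[tuple(context)]


for ctx in itertools.product(VOCAB, repeat=CONTEXT_LEN):
    assert lookup_distribution(ctx) == next_token_distribution(ctx)
print(len(TABLE), "contexts, identical outputs either way")
```

Whether the rows are computed on demand or read from `TABLE`, every context maps to the same distribution.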

You are saying that your choice of implementation for that function will somehow change the function. Which means that, according to you, precomputing (or caching; full precomputation is just a cache big enough to hold every possible context) individual mappings somehow magically changes whether any deep insight is involved. It does not. We have already established that it is the same function.