The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a substantially lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models.
Code: https://github.com/shaochenze/calm
Project: https://shaochenze.github.io/blog/2025/CALM
Large Language Models (LLMs) have revolutionized the field of artificial intelligence, demonstrating unprecedented capabilities in understanding, generating, and reasoning with human language (Achiam et al., 2023; Google, 2025; DeepSeek-AI, 2025). However, this remarkable success is shadowed by a critical challenge: their immense computational demands. The training and inference of state-of-the-art LLMs demand massive computational resources, leading to prohibitive expenses and significant environmental concerns (Strubell et al., 2019; Bender et al., 2021). At the heart of this inefficiency lies the foundational paradigm of these models: an autoregressive generation process that operates on a sequence of discrete tokens. Because the computational cost scales with the length of the sequence, generating long-form text or processing extensive contexts remains a fundamental bottleneck, limiting the scalability and accessibility of these powerful models.
The now-ubiquitous use of discrete tokens in LLMs is the result of a pivotal evolution from earlier modeling paradigms. Initially, models that operated at the character level struggled with the computational burden of extremely long sequences (Sutskever et al., 2011; Kim et al., 2016). The subsequent shift to modern subword tokenization (Sennrich et al., 2016) was driven by a crucial insight: increasing the information density of each text unit reduces sequence length and dramatically boosts model efficiency. This historical success suggests a clear path for unlocking the next order of magnitude in efficiency: continue to increase the semantic bandwidth of each predictive unit.
We argue, however, that this path has reached a fundamental limit, constrained by the very nature of discrete representation. With typical vocabularies in modern LLMs ranging from approximately 32,000 to 256,000 entries, each token carries a surprisingly small amount of information—merely 15 to 18 bits (e.g., log2(32,000) ≈ 15). To increase this capacity—for instance, to represent a whole phrase—the vocabulary size would need to grow exponentially, making the final softmax computation over this vocabulary an untenable bottleneck. This reveals a critical limitation: the information density of discrete tokens is not scalable. Consequently, a profound mismatch has emerged: while model capacity has scaled to unprecedented levels, the task itself—predicting low-information discrete units one at a time—has not evolved. We are now deploying models of immense representational power on a task that fundamentally limits their throughput, forcing them to laboriously predict simple, low-information tokens one by one.
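The arithmetic behind this claim can be checked directly: a token drawn from a vocabulary of size |V| carries at most log2(|V|) bits, and packing a multi-token phrase into one discrete unit multiplies the required vocabulary exponentially. A minimal sketch (the vocabulary sizes are the ones cited above; the 4-token phrase is an illustrative choice):

```python
from math import log2

# Bits carried by a single token: at most log2(|V|) per generative step.
bits_32k = log2(32_000)      # ≈ 15 bits
bits_256k = log2(256_000)    # ≈ 18 bits
print(f"32K vocab:  {bits_32k:.2f} bits/token")
print(f"256K vocab: {bits_256k:.2f} bits/token")

# A discrete unit spanning a 4-token phrase would need |V|**4 vocabulary
# entries -- a softmax over roughly 1e18 classes, which is untenable.
phrase_vocab = 32_000 ** 4
print(f"4-token phrase vocabulary: {phrase_vocab:.1e} entries")
```

Note how an 8x larger vocabulary (32K to 256K) buys only 3 extra bits per token, while the phrase-level vocabulary explodes combinatorially.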
In this work, we confront this limitation directly by introducing a paradigm shift from discrete tokens to a continuous-domain representation. Central to our approach is an autoencoder trained to compress a chunk of K tokens into a single, dense continuous vector and, crucially, reconstruct the original tokens from this vector with high fidelity. Unlike the discrete paradigm, where increasing information density requires an exponential growth in vocabulary size, our continuous representation offers a scalable path forward: the vector’s information capacity can be gracefully expanded by simply increasing its dimensionality to accommodate a larger K. This design directly reduces the number of autoregressive steps by a factor of K. Ultimately, it allows us to reframe language modeling from a task of next-token prediction on discrete token sequences to next-vector prediction on continuous vector sequences, as conceptually illustrated in Figure 1.
[Figure 1 contrasts a conventional LM performing next-token prediction over the sequence "The cat sat on the mat" (sequence length T) with CALM performing next-vector prediction over vectors (sequence length T/K), where an autoencoder maps K = 3 tokens to 1 vector.]

Figure 1: Comparison between conventional token-by-token generation and our proposed vector-by-vector framework (CALM). By compressing K tokens into a single vector, we reduce the sequence length K-fold, fundamentally improving computational efficiency.
However, shifting to the continuous domain introduces a significant challenge: without a finite vocabulary, a model cannot compute an explicit probability distribution over all possible outcomes using a standard softmax layer. To address this, we develop a comprehensive, likelihood-free framework for our Continuous Autoregressive Language Models (CALM). Our primary contributions, which structure the remainder of this paper, are as follows:
• A Powerful and Lightweight Autoencoder (Section 2): We first introduce an efficient autoencoder architecture designed to produce robust vector representations. We demonstrate that this model can be both compact and powerful, ensuring high-fidelity reconstruction of the original tokens, which is a prerequisite for the downstream language modeling task.
• Likelihood-Free Language Modeling (Section 3): To perform generative modeling in the continuous vector space, we employ a lightweight generative head that conditions on the last hidden state to generate the output vector. While the generative head can be any continuous generative model, options like Diffusion (Ho et al., 2020; Li et al., 2024) or Flow Matching (Lipman et al., 2023) rely on an iterative sampling process, re-introducing a significant inference bottleneck. Our framework therefore adopts the Energy Transformer (Shao et al., 2025b), a recent architecture designed for efficient, single-step generation of continuous vectors, which we empirically show delivers superior generation quality.
• Likelihood-Free LM Evaluation (Section 4): The absence of explicit likelihoods makes traditional metrics like Perplexity inapplicable. We address this by proposing BrierLM, a novel metric for language modeling based on the Brier score (Brier, 1950). We show that BrierLM is strictly proper, theoretically ensuring a fair comparison of model capabilities. Crucially, BrierLM can be estimated without bias using only samples drawn from the model, making it perfectly suited for CALM, where likelihoods are intractable.
• Likelihood-Free Temperature Sampling (Section 5): Controlled generation via temperature sampling is an indispensable feature of modern LLMs, yet it relies on the explicit manipulation of a probability distribution. We introduce a principled, likelihood-free sampling algorithm that can, in theory, draw samples from the exact temperature distribution, and we accompany it with a highly efficient batch approximation.
We empirically validate our CALM framework on standard language modeling benchmarks, where it demonstrates a superior performance-compute trade-off. For instance, a CALM model grouping K = 4 tokens delivers performance comparable to strong discrete baselines, but at a significantly lower computational cost. These findings highlight a new design axis for language models: rather than solely scaling parameters and data for performance, one can now scale the information capacity of each step as a powerful new lever for computational efficiency.
The foundational component of our CALM framework is an autoencoder tasked with learning a bijective mapping between a chunk of discrete tokens and a continuous vector. Formally, we seek an encoder E: V^K → ℝ^d and a decoder D: ℝ^d → V^K, where V is the vocabulary, such that for a given token sequence x = (x_1, …, x_K), the reconstruction x̂ = D(E(x)) closely approximates x. For simplicity and computational efficiency, we design our autoencoder to be context-free, meaning it processes each token chunk independently of its surrounding sequence. A context-aware autoencoder that also conditions on previous vector representations is a natural and promising next step, which we leave for future exploration.
The encoder begins by mapping the input sequence x = (x_1, …, x_K) to token embeddings. Each embedding is independently processed by a position-wise feed-forward network (FFN). The resulting K hidden states are then flattened and compressed by a linear layer. This unified representation is passed through a second FFN and a linear projection to produce the d-dimensional latent vector z.
The decoder architecture mirrors the encoder. It first transforms z using a linear layer and an FFN to obtain a hidden state, which is then expanded by another linear layer and reshaped into a sequence of K hidden states. Each of these states is passed through a second FFN, followed by a projection to vocabulary logits using the tied input embedding matrix. Finally, the tokens are reconstructed by applying an argmax operation over these logits.
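The shape bookkeeping of this architecture can be sketched in a few lines. The toy version below uses pure Python with random, untrained weights and omits the FFNs, so it only demonstrates the compress/expand/argmax dataflow, not faithful reconstruction; all dimensions (K, H, D, V) are illustrative:

```python
import random

random.seed(0)

K, H, D, V = 4, 8, 16, 100   # chunk size, embedding dim, latent dim, vocab (toy values)

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

emb = rand_mat(V, H)             # tied input/output embedding matrix
W_compress = rand_mat(D, K * H)  # encoder: flattened embeddings -> latent
W_expand = rand_mat(K * H, D)    # decoder: latent -> K hidden states

def encode(tokens):
    """Compress a chunk of K token ids into one D-dimensional vector (FFNs omitted)."""
    flat = [v for t in tokens for v in emb[t]]   # embed each token, then flatten
    return matvec(W_compress, flat)

def decode(z):
    """Reconstruct K token ids: expand, reshape, project via tied embeddings, argmax."""
    flat = matvec(W_expand, z)
    states = [flat[i * H:(i + 1) * H] for i in range(K)]
    return [max(range(V), key=lambda i: matvec(emb, h)[i]) for h in states]

z = encode([3, 14, 15, 92])
recon = decode(z)
print(len(z), recon)   # latent is D-dimensional; K token ids come back
```

With trained weights (and the FFNs restored), the encode–decode round trip is what the paper reports recovering with over 99.9% accuracy.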
The autoencoder is trained to minimize the reconstruction error by optimizing the standard cross-entropy loss across all token positions:
$$\mathcal{L}_{\text{recon}} = -\sum_{i=1}^{K} \log p\left(x_i \mid \mathbf{z}\right) \tag{1}$$
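This objective is the standard sum of per-position negative log-likelihoods under a softmax over the decoder's logits. A minimal numerical sketch (the logits and targets are made-up toy values):

```python
from math import exp, log

def softmax(logits):
    m = max(logits)                           # subtract max for numerical stability
    exps = [exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reconstruction_loss(logits_per_position, targets):
    """Sum of -log p(x_i | z) over the K positions of the chunk (Eq. 1)."""
    return sum(-log(softmax(logits)[t])
               for logits, t in zip(logits_per_position, targets))

# Toy example: K = 2 positions, vocabulary of 3 tokens.
logits = [[2.0, 0.5, -1.0], [0.0, 3.0, 0.0]]
targets = [0, 1]
loss = reconstruction_loss(logits, targets)
print(round(loss, 4))
```

Driving this loss toward zero pushes the argmax in the decoder to recover every token of the chunk, which is what makes the latent vector a usable substitute for the discrete tokens downstream.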