Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We argue that MTP’s exact future token prediction is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP’s multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction

1 Introduction


Figure 1: An overview of Token Order Prediction (TOP). Given an input token sequence, a vocabulary, a sequence length of 4 and window size of 4, a TOP target sequence is constructed via Algorithm 1. The output hidden representation of the final layer goes to two separate unembedding heads for NTP and TOP. The final loss to optimize is a sum of the NTP and TOP loss.
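To make the construction described in the caption concrete, the following Python sketch builds a dense TOP target matrix by scoring every vocabulary token at each position according to how soon it next occurs within the window; tokens that do not occur in the window score zero. The function name build_top_targets and the exact scoring scheme are illustrative assumptions, not necessarily the paper's Algorithm 1.

import torch

def build_top_targets(tokens: torch.Tensor, vocab_size: int, window: int) -> torch.Tensor:
    # Illustrative TOP targets (an assumption, not the exact Algorithm 1):
    # targets[t, v] is higher the sooner token v next appears after position t,
    # and zero if v does not appear within the window at all.
    seq_len = tokens.shape[0]
    targets = torch.zeros(seq_len, vocab_size)
    for t in range(seq_len):
        # Walk the window from farthest to nearest so nearer occurrences overwrite farther ones.
        for d in range(min(window, seq_len - 1 - t), 0, -1):
            targets[t, tokens[t + d]] = window - d + 1
    return targets

# Toy example matching the caption's setup: sequence length 4, window size 4.
targets = build_top_targets(torch.tensor([2, 5, 2, 1]), vocab_size=6, window=4)

The TOP head's logits can then be trained against such targets with a learning-to-rank loss, while the NTP head keeps its usual cross-entropy loss, and the two losses are summed as in the figure.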

Current large language models (LLMs) are trained to predict the next token in a sequence, an unsupervised learning task often referred to as next-token prediction (NTP). Although simple, NTP has been very successful in creating powerful language models that can solve complex tasks and even reason over their context.

However, NTP has received various criticisms over the past few years. A notable argument by LeCun (2024) claims that NTP accumulates errors at every time step during inference, so that accuracy inevitably degrades as generation proceeds. This was, however, refuted by Bachmann & Nagarajan (2024), who argue that the main issue with NTP lies not in inference-time error accumulation, but rather in teacher forcing being unable to learn an accurate next-token predictor in the first place.

Building on ideas such as ProphetNet (Qi et al., 2020), Multi-Token Prediction (MTP) (Gloeckle et al., 2024) has emerged as a relatively successful auxiliary learning task for improving NTP in LLM training. MTP adds multiple heads to the end of a transformer, each predicting a different offset of tokens ahead. All MTP heads share the same trunk of transformer layers, the hope being that these auxiliary heads lead the model to learn better internal representations that account for not only the immediate next token but also the tokens that follow it. MTP has been shown to improve the performance of language models on certain generative tasks that require some degree of look-ahead, such as coding and summarization. This method was used in the training of DeepSeek-V3 (DeepSeek-AI et al., 2024), although with sequential instead of parallel MTP heads.
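As a rough sketch of this parallel-head design (the module name and construction details here are illustrative, not taken from either paper), each head is a single transformer layer applied to the shared trunk's hidden states, with one unembedding matrix shared across heads:

import torch.nn as nn

class ParallelMTPHeads(nn.Module):
    # Sketch only: each head is one extra transformer layer on top of the shared
    # trunk output, and all heads project through one shared unembedding matrix.
    def __init__(self, d_model: int, vocab_size: int, n_future: int, make_layer):
        super().__init__()
        self.heads = nn.ModuleList([make_layer() for _ in range(n_future)])
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, trunk_hidden):
        # trunk_hidden: (batch, seq, d_model); logits[k][:, t] predicts the token k+1 steps ahead.
        return [self.unembed(head(trunk_hidden)) for head in self.heads]

Here make_layer stands in for whatever constructs a transformer layer in the host codebase.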

However, the original MTP paper (Gloeckle et al., 2024) showed that MTP does not generally improve language modeling performance on every standard NLP task, as is evident from the downstream task results in their Appendix G. It is also important to note that MTP does not seem to help smaller models: on generative tasks, MTP harms performance and only starts to gain an advantage over NTP for models beyond roughly 1-3 billion parameters. The number of future tokens to predict is also a hyperparameter that must be set before training. This matters because increasing the future-token count requires adding more heads, which means more parameters to train and more compute. Furthermore, the paper shows that increasing the number of future tokens does not guarantee better performance even on the tasks that benefit: predicting 4 future tokens performs better than predicting 8 on coding.

We aim to improve upon MTP by introducing a different auxiliary training objective that shares the same goal: enhancing next-token prediction performance through better internal representations. However, instead of exactly predicting multiple future tokens, we propose that a better training objective is to predict the order of upcoming tokens in the sequence with a learning-to-rank loss. In this paper, we contribute the following:

    We introduce Token Order Prediction (TOP), a novel auxiliary training loss in addition to NTP to improve language modeling in general.

    For each of the three training strategies—NTP, MTP, and TOP—we pretrain language models with sizes of 340M, 1.8B, and 7B parameters to better understand the impact of these strategies on models of different scales.

    We evaluate these models on standard NLP benchmarks and show that TOP improves upon NTP and MTP even at scale.

Next-token prediction (NTP) is the standard training objective for present-day language models. This task is learned by optimizing the cross-entropy loss over the sequence length. Given sequence length $T$, model dimension $d$, vocabulary size $V$, and $x = (x_1, \dots, x_T)$ as the input token sequence, this loss is written as

$$\mathcal{L}_{\mathrm{NTP}} = -\sum_{t=1}^{T} \log P_\theta(x_{t+1} \mid x_{\le t}) \qquad (1)$$

where $P_\theta$ is the output probability given by the language model with parameters $\theta$. The probability of the next token given this model is written as

$$P_\theta(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(W_{\mathrm{NTP}} h_t)_{x_{t+1}} \qquad (2)$$

where the hidden representation $h_t \in \mathbb{R}^{d}$ is generated by a transformer up to the final layer conditioned on $x_{\le t}$, and the NTP head $W_{\mathrm{NTP}} \in \mathbb{R}^{V \times d}$ is a linear unembedding layer that projects $h_t$ onto the vocabulary. The probability is taken at the index of the target token $x_{t+1}$.
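In code this is the familiar shifted cross-entropy; the sketch below assumes logits of shape (batch, seq, vocab) obtained by applying the NTP unembedding to the final hidden states.

import torch
import torch.nn.functional as F

def ntp_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # Eq. (1): the prediction at position t is scored against the target token x_{t+1}.
    shifted_logits = logits[:, :-1, :]   # predictions for positions 1..T-1
    shifted_targets = tokens[:, 1:]      # the next tokens they should match
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
    )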

Multi-Token Prediction (MTP) (Gloeckle et al., 2024) was proposed as an architectural modification that adds additional MTP heads in the form of parallel, singular transformer layers that each output a future token prediction at an offset position. Given $n$ as the number of future tokens to predict (including the next token), the MTP loss can be written as

$$\mathcal{L}_{\mathrm{MTP}} = -\sum_{t=1}^{T} \sum_{k=1}^{n} \log P_\theta(x_{t+k} \mid x_{\le t}) \qquad (3)$$

If we define $z_t \in \mathbb{R}^{d}$ as the hidden representation before the last transformer layer, take $H_k$ for $k = 1, \dots, n$ as the MTP heads in the form of singular transformer layers, one for each future token, and let all heads share the same unembedding layer (the NTP head) $W_{\mathrm{NTP}}$, then