
A bit of signal processing swiftly reveals that diffusion models and autoregressive models aren’t all that different: diffusion models of images perform approximate autoregression in the frequency domain!
This blog post is also available as a Python notebook in Google Colab , with the code used to produce all the plots and animations.
Last year, I wrote a blog post describing various different perspectives on diffusion. The idea was to highlight a number of connections between diffusion models and other classes of models and concepts. In recent months, I have given a few talks where I discussed some of these perspectives. My talk at the EEML 2024 summer school in Novi Sad, Serbia, was recorded and is available on YouTube. Based on the response I got from this talk, the link between diffusion models and autoregressive models seems to be particularly thought-provoking. That’s why I figured it could be useful to explore this a bit further.
In this blog post, I will unpack the above claim, and try to make it obvious that this is the case, at least for visual data. To make things more tangible, I decided to write this entire blog post in the form of a Python notebook (using Google Colab). That way, you can easily reproduce the plots and analyses yourself, and modify them to observe what happens. I hope this format will also help drive home the point that this connection between diffusion models and autoregressive models is “real”, and not just a theoretical idealisation that doesn’t hold up in practice.
In what follows, I will assume a basic understanding of diffusion models and the core concepts behind them. If you’ve watched the talk I linked above, you should be able to follow along. Alternatively, the perspectives on diffusion blog post should also suffice as preparatory reading. Some knowledge of the Fourier transform will also be helpful.
Below is an overview of the different sections of this post. Click to jump directly to a particular section.

Autoregression and diffusion are currently the two dominant generative modelling paradigms. There are many more ways to build generative models: flow-based models and adversarial models are just two possible alternatives (I discussed a few more in an earlier blog post).
Both autoregression and diffusion differ from most of these alternatives, by splitting up the difficult task of generating data from complex distributions into smaller subtasks that are easier to learn. Autoregression does this by casting the data to be modelled into the shape of a sequence, and recursively predicting one sequence element at a time. Diffusion instead works by defining a corruption process that gradually destroys all structure in the data, and training a model to learn to invert this process step by step.
This iterative refinement approach to generative modelling is very powerful, because it allows us to construct very deep computational graphs for generation, without having to backpropagate through them during training. Indeed, both autoregressive models and diffusion models learn to perform a single step of refinement at a time – the generative process is not trained end-to-end. It is only when we try to sample from the model that we connect all these steps together, by sequentially performing the subtasks: predicting one sequence element after another in the case of autoregression, or gradually denoising the input step-by-step in the case of diffusion.
Because this underlying iterative approach is common to both paradigms, people have often sought to connect the two. One could frame autoregression as a special case of discrete diffusion, for example, with a corruption process that gradually replaces tokens by “mask tokens” from right to left, eventually ending up with a fully masked sequence. In the next few sections, we will do the opposite, framing diffusion as a special case of autoregression, albeit approximate.