Importance-based Token Merging for Diffusion Models

In Submission


Haoyu Wu, Jingyi Xu, Hieu Le, Dimitris Samaras
Stony Brook University

Paper    Code

Abstract

Diffusion models excel at high-quality image and video generation. However, a major drawback is their high latency. A simple yet powerful way to speed them up is to merge similar tokens for faster computation, though this can result in some quality loss. In this paper, we demonstrate that preserving important tokens during merging significantly improves sample quality. Notably, the importance of each token can be reliably determined using the classifier-free guidance magnitude, as this measure is strongly correlated with the conditioning input and corresponds to output fidelity. Since classifier-free guidance incurs no additional computational cost and requires no extra modules, our method can be easily integrated into most diffusion-based frameworks. Experiments show that our approach significantly outperforms the baseline across various applications, including text-to-image synthesis, multi-view image generation, and video generation.

Method


Overview. We propose an importance-based token merging method. The importance of each token is determined by the magnitude of classifier-free guidance. These scores, visualized with colors ranging from light to dark (indicating less to more important tokens), are used to construct a pool of important tokens. We randomly select a set of destination (dst) tokens from this pool, and the remaining important tokens become source (src) tokens. Bipartite soft matching is then performed between the dst and src tokens. src tokens without a suitable match are kept as independent tokens (ind.). All other src tokens, together with the unimportant tokens, are merged into the dst tokens for subsequent computation.
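The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name, the merge/dst ratios, the similarity threshold, and the mean-pooling merge rule are all assumptions chosen for clarity. Importance is taken as the per-token norm of the difference between conditional and unconditional noise predictions, standing in for the classifier-free guidance magnitude.

```python
import numpy as np

def importance_token_merge(tokens, cond_eps, uncond_eps,
                           merge_ratio=0.7, dst_frac=0.25,
                           sim_thresh=0.5, seed=0):
    """Hypothetical sketch of importance-based token merging.

    tokens:     (n, d) token features
    cond_eps:   (n, d) conditional noise prediction per token
    uncond_eps: (n, d) unconditional noise prediction per token
    """
    n, d = tokens.shape
    # Importance = magnitude of the classifier-free guidance difference.
    importance = np.linalg.norm(cond_eps - uncond_eps, axis=1)

    # Pool of important tokens (top-(1 - merge_ratio) fraction).
    n_keep = max(1, int(round(n * (1 - merge_ratio))))
    order = np.argsort(-importance)
    important, unimportant = order[:n_keep], order[n_keep:]

    # Randomly pick dst tokens from the pool; the rest are src tokens.
    rng = np.random.default_rng(seed)
    n_dst = max(1, int(round(n_keep * dst_frac)))
    dst = rng.choice(important, size=n_dst, replace=False)
    src = np.setdiff1d(important, dst)

    def cosine(a, b):
        a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
        return a @ b.T

    # Bipartite soft matching between src and dst tokens.
    sim = cosine(tokens[src], tokens[dst])
    best = sim.argmax(axis=1)
    matched = sim.max(axis=1) >= sim_thresh
    independent = src[~matched]  # src tokens without a suitable match

    # Merge matched src tokens and all unimportant tokens into their
    # best-matching dst token (here: simple mean pooling).
    merged = tokens[dst].copy()
    counts = np.ones(n_dst)
    for s, b in zip(src[matched], best[matched]):
        merged[b] += tokens[s]; counts[b] += 1
    sim_u = cosine(tokens[unimportant], tokens[dst])
    for u, b in zip(unimportant, sim_u.argmax(axis=1)):
        merged[b] += tokens[u]; counts[b] += 1
    merged /= counts[:, None]

    # Reduced token set: merged dst tokens plus independent src tokens.
    return np.concatenate([merged, tokens[independent]], axis=0)
```

With a merge ratio of 0.7, roughly 30% of the tokens survive (dst plus independent src), so downstream attention layers operate on a much shorter sequence; actual ratios and the unmerging step for the output are details of the full method.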

Results


Qualitative comparison of text-to-image generation. The first row shows results from Stable Diffusion (SD); subsequent rows show SD combined with various token merging methods, including ATC, ToMeSD, and our proposed method. Our approach consistently produces finer details with coherent structures. Note that ATC requires minutes to generate an image, whereas the other methods, including ours, complete the task in seconds. The token merging ratio is 0.7. Please refer to the supplementary for detailed prompts. Best viewed zoomed in for clarity.

Videos

Comparison of text-to-video diffusion.

Acknowledgements

  • This work was supported in part by the NASA Biodiversity Program (Award 80NSSC21K1027), and NSF Grant IIS-2212046.
  • The website template was borrowed from Mip-NeRF 360 and VolSDF.