SkipSR: Faster Super-Resolution with Token Skipping

Anonymous Authors

Browser Note: If videos don't load properly in Chrome, please try Safari for the best viewing experience.

Abstract

Diffusion-based super-resolution (SR) is a key component in video generation and video restoration, but is slow and expensive, limiting scalability to higher resolutions and longer videos. Our key insight is that many regions in video are inherently low-detail and gain little from refinement, yet current methods process all pixels uniformly. To take advantage of this, we propose SkipSR, a simple framework for accelerating video SR by identifying low-detail regions directly from low-resolution input, then skipping computation on them entirely, only super-resolving the areas that require refinement. This simple yet effective strategy preserves perceptual quality in both standard and one-step diffusion SR models while significantly reducing computation. In standard SR benchmarks, our method achieves up to 60% faster end-to-end latency than prior models on 720p videos with no perceptible loss in quality.

Method

In this work, we focus on accelerating diffusion-based super-resolution, which is particularly expensive due to its larger inputs and outputs, yet crucial to video generation pipelines. Most prior works speed up diffusion transformers by reducing the number of sampling steps or by modifying the attention mechanism. We instead find that we can skip computation altogether for certain "simple" patches, accelerating inference significantly. We identify these patches with a lightweight CNN that predicts a binary mask, which is then used to route them entirely around the transformer. The un-skipped patches that are refined retain their relative positions through a mask-aware rotary positional encoding. Extensive experiments and ablations demonstrate that our method preserves visual quality while significantly reducing end-to-end generation time. Our method is illustrated in the figure below, and several examples, along with the predicted mask, are visualized at the bottom of the page.
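The routing step described above can be sketched as follows. This is a minimal single-sample illustration, not the paper's implementation: the function and argument names are assumptions, a generic callable stands in for the diffusion transformer, and passing the kept patches' original indices alongside them stands in for the mask-aware rotary positional encoding (which preserves the un-skipped patches' relative positions).

```python
import numpy as np

def skip_forward(tokens, positions, skip_mask, transformer, passthrough):
    """Sketch of mask-based token routing (illustrative names/shapes).

    tokens:    (N, D) patch tokens from the low-resolution input
    positions: (N,) original patch indices; forwarding the kept patches'
               indices lets a position encoding keep relative positions
    skip_mask: (N,) bool, True = low-detail patch predicted skippable
    """
    keep = ~skip_mask
    # Only un-skipped patches go through the expensive transformer.
    refined = transformer(tokens[keep], positions[keep])
    out = np.empty_like(tokens)
    out[keep] = refined
    # Skipped patches bypass the transformer via a cheap path
    # (identity or naive upsampling in this sketch).
    out[skip_mask] = passthrough(tokens[skip_mask])
    return out
```

With a skip mask covering most patches, transformer cost scales with the number of kept tokens rather than the full frame, which is where the end-to-end speedup comes from.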

Detailed Video Comparisons

On AI-generated videos, SkipSR is able to detect and skip a large proportion of patches, preserving visual quality while drastically speeding up super-resolution. This capability is helpful for cascaded diffusion pipelines, since AI-generated videos tend to contain simpler textures and smoother motion.

Real-World Examples (VideoLQ)

On more dynamic, real-world videos, SkipSR still reduces a large proportion of tokens, and is able to preserve quality even when scenes are dynamic and contain rapid motion or camera shake.

Failure Cases

In the first example, a city skyline, we observe the mask flickering even though the night sky is not visibly changing; the mask predictor could likely skip significantly more patches here. In the second example, an old-timey video, seams from patch skipping are visible on close inspection. This is a case where the mask predictor skips patches it should not, and the diffusion super-resolution model cannot fully compensate. Empirically, however, this phenomenon is very infrequent.