Seeing Fast and Slow:
Learning the Flow of Time in Videos

Yen-Siang Wu1,2 Rundong Luo2 Jingsen Zhu2 Tao Tu2 Ali Farhadi3
Matthew Wallingford3 Yu-Chiang Frank Wang1 Steve Marschner2 Wei-Chiu Ma2
1National Taiwan University 2Cornell University 3University of Washington

This project explores how to perceive and manipulate the flow of time in videos through four complementary tasks:

  • Speed-change detection locates the exact moments when playback speed shifts.
  • Video speed estimation infers how much a video has been sped up or slowed down.
  • Extreme temporal super-resolution converts low-FPS, blurry videos into high-FPS, clear counterparts.
  • Speed-conditioned video generation synthesizes the same event at user-specified temporal speeds.

Together, these capabilities highlight fine-grained temporal perception alongside controllable video generation.

We also introduce SloMo-44K, the largest generic slow-motion video dataset to date. It consists of 44,632 slow-motion videos, each ranging from 5 seconds to several minutes in duration, totaling approximately 167 hours and 18 million frames. The dataset is collected from YouTube and Vimeo and covers a wide variety of scenarios and motion patterns recorded with high-speed cameras.
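As a quick sanity check, these totals are mutually consistent: roughly 18 million frames over 167 hours works out to about 30 FPS, i.e., slow-motion footage captured at high frame rates but exported at a standard playback rate. The numbers below are taken directly from the description above.

```python
# Sanity check on the stated SloMo-44K totals (all numbers from the text above).
num_videos   = 44_632
total_hours  = 167
total_frames = 18_000_000          # "approximately 18 million frames"

total_seconds = total_hours * 3600
avg_fps      = total_frames / total_seconds   # ~29.9 -> standard ~30 FPS playback
avg_duration = total_seconds / num_videos     # ~13.5 s average clip length

print(f"average playback FPS: {avg_fps:.1f}")
print(f"average clip length:  {avg_duration:.1f} s")
```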

Abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos.

We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Finally, using this rich, diverse data, we develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback rates, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

Learning to Detect Temporal Speed Changes — Applying the Speed-Change Detector to the X-Men Kitchen Scene

By training on self-supervised speed-change data, we build a detector sensitive to temporal speed changes. When applied to the iconic X-Men kitchen scene, it successfully identifies the time-freeze moments.
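Such training data is free to generate from ordinary videos. One plausible recipe (a sketch under our own assumptions, not necessarily the exact pipeline used here) is to splice a real clip at a random point and subsample the frames after it, so the splice index becomes a ground-truth speed-change label:

```python
import numpy as np

def make_speed_change_clip(frames, speedup=4, rng=None):
    """Splice a normal-speed clip into a 'normal, then sped-up' example.

    frames : (T, H, W, C) array of decoded video frames at their native rate.
    Returns the spliced clip and the frame index where the speed changes.
    Illustrative recipe only; the actual training pipeline may differ.
    """
    rng = rng or np.random.default_rng()
    T = len(frames)
    switch = int(rng.integers(T // 4, 3 * T // 4))  # random change point
    before = frames[:switch]            # first segment plays at 1x
    after  = frames[switch::speedup]    # keep every k-th frame -> k-x speed
    clip = np.concatenate([before, after], axis=0)
    return clip, switch                 # detector learns to localize `switch`
```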

Learning to Infer the Speed of Time — Applying the Speed Estimator to Slow-Motion Videos

Trained with these self-supervised cues, our speed estimator can reliably predict a video's playback speed.
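Playback speed is another label that videos provide for free: resampling a clip at a known stride yields a speed-labeled training example without any annotation. Below is a minimal sketch of this setup using simple frame striding; it omits the multimodal (e.g., audio) cues that the full model also exploits, and is not the exact pipeline.

```python
import numpy as np

STRIDES = [1, 2, 4, 8]  # speed-up factors that frame striding can simulate

def sample_speed_clip(frames, clip_len=16, rng=None):
    """Resample a real video at a random stride; the stride is a free label.

    Striding only simulates speed-ups; genuine sub-1x (slowed-down) speeds
    require high-FPS source footage, which is where slow-motion data helps.
    Illustrative sketch of the self-supervised setup, not the exact pipeline.
    """
    rng = rng or np.random.default_rng()
    stride = int(rng.choice(STRIDES))
    span = clip_len * stride
    assert len(frames) >= span, "video too short for this stride"
    start = int(rng.integers(0, len(frames) - span + 1))
    idx = start + np.arange(clip_len) * stride
    return frames[idx], stride          # train a model to predict `stride`
```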

Learning to Infer the Speed of Time — The Speed-Guess Game

To better appreciate the difficulty of estimating playback speed, we encourage you to try the interactive Speed-Guess Game below.

Downstream Task: Conventional Temporal Super-Resolution

The diversity, scale, and slow-motion content of SloMo-44K also benefit conventional temporal super-resolution. We further train an 8× temporal super-resolution model, which generates videos with smoother, more natural motion dynamics than existing approaches (comparison below).
[Video comparison: ours (left) vs. baseline (right)]
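For orientation, one common way to structure N× temporal super-resolution is recursive 2× mid-frame interpolation; 8× then corresponds to three doubling passes. The sketch below assumes a hypothetical learned `interpolate_midframe(a, b)` function and only illustrates this recursive structure, not our model's architecture.

```python
def upsample_2x(frames, interpolate_midframe):
    """Insert one synthesized frame between each consecutive pair."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        out.append(interpolate_midframe(a, b))  # learned mid-frame synthesis
    out.append(frames[-1])
    return out

def upsample_8x(frames, interpolate_midframe):
    """8x temporal super-resolution via three recursive 2x passes (2^3 = 8)."""
    for _ in range(3):
        frames = upsample_2x(frames, interpolate_midframe)
    return frames  # N input frames -> 8*N - 7 output frames
```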