Abstract
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos.
We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Finally, using this rich, diverse data, we develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback rates, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal detail. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening the door to temporally controllable video generation, temporal forensics, and potentially richer world models that understand how events unfold over time.
Learning to Detect Temporal Speed Changes — Leveraging the Principle of Time-Frequency Scaling
⚠️ Audio is used only to collect training samples; the final detector is purely visual!
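The principle behind this heading is a physical one: uniformly retiming a signal by a speed factor s rescales every frequency component by s, so speed changes manifest in audio as pitch shifts, which is what makes audio a convenient source of training labels even though the resulting detector is visual. Below is a minimal, self-contained sketch of the principle itself (our own illustration under assumed parameters, not the paper's pipeline): a tone "played back" 2x faster shows its dominant frequency doubled.

```python
# A minimal sketch (not the authors' code) of time-frequency scaling:
# playing a signal back at speed s multiplies every frequency component by s.
# The sample rate, tone frequency, and speed factor are illustrative assumptions.
import numpy as np

SR = 16_000          # sample rate in Hz (assumed)
F0 = 440.0           # test tone frequency in Hz (assumed)
SPEED = 2            # integer playback speed factor (assumed)

t = np.arange(SR) / SR                  # 1 second of sample times
tone = np.sin(2 * np.pi * F0 * t)       # original audio: a 440 Hz tone

# "Speeding up" playback by an integer factor s = keeping every s-th sample,
# i.e., the same waveform traversed s times faster.
sped_up = tone[::SPEED]

def dominant_freq(x, sr):
    """Return the frequency bin with the largest magnitude in x's spectrum."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return freqs[np.argmax(spectrum)]

print(dominant_freq(tone, SR))      # ~440 Hz
print(dominant_freq(sped_up, SR))   # ~880 Hz: frequency scaled by the speed factor
```

In other words, a pitch shift in the soundtrack is a measurable fingerprint of retiming, which is what allows speed labels to be mined from audio alone.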
Learning to Detect Temporal Speed Changes — Applying the Speed Change Detector to the X-Men Kitchen Scene
Learning to Infer the Speed of Time — Leveraging the Equivariance of Speed Estimation
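Equivariance here means that retiming the input should retime the prediction: if a clip's frames are subsampled by a factor k, a correct speed estimator's output should scale by k, which provides a label-free consistency signal for training. The sketch below is a hypothetical PyTorch illustration, not the paper's model; `SpeedNet`, its layers, and the log-space loss are all assumptions made for the example.

```python
# A minimal PyTorch sketch (our illustration, not the paper's method) of an
# equivariance constraint for self-supervised speed estimation: subsampling a
# video's frames by a factor k should multiply the predicted speed by k.
import torch
import torch.nn as nn

class SpeedNet(nn.Module):
    """Toy speed regressor: frames (B, T, C, H, W) -> positive speed (B,)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(16, 1),
        )

    def forward(self, frames):
        # (B, T, C, H, W) -> (B, C, T, H, W) for Conv3d; exp() keeps speed > 0
        x = frames.permute(0, 2, 1, 3, 4)
        return self.backbone(x).squeeze(-1).exp()

def equivariance_loss(model, frames, k=2):
    """Predicted speed of a k-x subsampled clip should be k times the original's."""
    s_full = model(frames)              # speed estimate on the original clip
    s_fast = model(frames[:, ::k])      # same clip with every k-th frame kept
    log_k = torch.log(torch.tensor(float(k)))
    # Compare in log space so the multiplicative constraint becomes additive.
    return ((s_fast.log() - s_full.log() - log_k) ** 2).mean()

frames = torch.randn(4, 16, 3, 32, 32)  # dummy batch: 4 clips of 16 frames
loss = equivariance_loss(SpeedNet(), frames, k=2)
loss.backward()
```

Working in log space turns the ratio constraint s(v_k) = k * s(v) into a simple difference, which keeps the loss well-scaled across a wide range of speed factors.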