Input 1: Sparse Keyframes | Output 1: Scene-Aware Motion In-betweening
Input 2: Noisy Keyframes | Output 2: Robustness on Noisy Keyframes
Input 3: Noisy Real-World Data | Output 3: Generalization to Real-World
Input 4: Monocular RGB Video | Output 4: Video-based HSI Reconstruction
Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening, a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize to noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and its generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability to HSI reconstruction from monocular videos.
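To make this formulation concrete, the sketch below illustrates, under our own assumptions rather than the authors' released code, how a diffusion denoiser can be conditioned on sparse keyframes plus dual scene descriptors: a global descriptor shared across the sequence and a local, per-frame descriptor. The toy network, feature dimensions, and conditioning-by-concatenation scheme are illustrative placeholders, not SceneMI's actual architecture.

```python
# Minimal sketch (not SceneMI's actual implementation) of a keyframe- and
# scene-conditioned diffusion denoiser. Dimensions, the toy MLP, and
# conditioning-by-concatenation are assumptions for illustration only.
import torch
import torch.nn as nn

T, D = 120, 263          # assumed sequence length and per-frame motion feature size
G, L_LOCAL = 128, 64     # assumed global / local scene descriptor sizes

class ToyDenoiser(nn.Module):
    """Stand-in denoiser that predicts the clean motion x0 from the noisy x_t."""
    def __init__(self):
        super().__init__()
        in_dim = D + D + 1 + G + L_LOCAL + 1   # x_t, keyframes, mask, scene (global+local), timestep
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.SiLU(), nn.Linear(512, D))

    def forward(self, x_t, keyframes, key_mask, g_scene, l_scene, t):
        t_emb = t.float().reshape(1, 1, 1).expand(x_t.shape[0], x_t.shape[1], 1)
        feats = torch.cat([x_t, keyframes, key_mask, g_scene, l_scene, t_emb], dim=-1)
        return self.net(feats)                 # predicted clean motion x0

# Toy inputs: one sequence with keyframes every 60 frames (the sparse setting).
x_t = torch.randn(1, T, D)                     # noisy motion at diffusion step t
keyframes = torch.zeros(1, T, D)
key_mask = torch.zeros(1, T, 1)
key_mask[:, ::60] = 1.0                        # which frames are observed keyframes
keyframes[:, ::60] = torch.randn(1, 2, D)      # observed keyframe poses
g_scene = torch.randn(1, 1, G).expand(1, T, G) # global scene descriptor (sequence-wide)
l_scene = torch.randn(1, T, L_LOCAL)           # local scene descriptor (per frame)

x0_pred = ToyDenoiser()(x_t, keyframes, key_mask, g_scene, l_scene, torch.tensor(50))
print(x0_pred.shape)                           # torch.Size([1, 120, 263])
```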
Keyframes - Sparse, Clean | Scene - Hand-Crafted
Dataset - TRUMANS (Clean)
We visualize SceneMI's results on the clean TRUMANS dataset in the classical motion in-betweening setting, now placed within surrounding environments. Here, we use a sparse keyframe interval of 60 frames, corresponding to 2-second motion segments between consecutive keyframes. This presents a challenging scenario that requires scene awareness for motion synthesis.
SceneMI is the first to tackle motion in-betweening within 3D scenes, demonstrating the ability to generate realistic transitions that adhere to both keyframe and environmental constraints.
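For reference, the 60-frame interval and 2-second spacing quoted above imply a 30 fps frame rate. A small helper sketch for choosing uniformly spaced keyframes is below; including the final frame as a keyframe is our own illustrative choice, not necessarily the paper's protocol.

```python
# Helper sketch for choosing uniformly spaced keyframes. The 60-frame interval
# and 2-second spacing come from the text (implying 30 fps); keeping the final
# frame as a keyframe is an assumed, illustrative choice.
def keyframe_indices(num_frames: int, interval: int = 60, fps: int = 30):
    idx = list(range(0, num_frames, interval))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)        # keep the last pose as a keyframe
    return idx, interval / fps            # indices and spacing in seconds

print(keyframe_indices(180))              # ([0, 60, 120, 179], 2.0)
```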
Comparisons (two examples): MDM | CondMDI | Without Scene-awareness | Ours (SceneMI)
Further Results: Ours (SceneMI), Cases #1–#4
Keyframes - Dense, Noisy | Scene - Hand-Crafted
Dataset - TRUMANS (Synthetic Noisy)
We provide video comparisons with baselines for scene-aware motion in-betweening under noisy keyframe conditions, using a synthetic noisy TRUMANS test set. Here, we visualize performance in a dense, noisy keyframe setting (interval of 3 frames), an extreme case for motion in-betweening that requires handling significant noise.
SceneMI demonstrates robustness to keyframe quality by effectively handling noise during the diffusion process.
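As a rough sketch of this setting, dense noisy keyframes can be simulated from a clean sequence as follows. The zero-mean Gaussian perturbation with a fixed standard deviation is our assumption; the actual synthetic-noise protocol may differ.

```python
# Sketch of simulating dense, noisy keyframes (interval 3) from a clean motion.
# The Gaussian noise model and its magnitude are assumptions, not the paper's
# exact synthetic-noise recipe.
import numpy as np

def make_noisy_keyframes(motion: np.ndarray, interval: int = 3, noise_std: float = 0.05):
    """motion: (T, D) clean per-frame features -> (keyframe indices, noisy keyframe poses)."""
    idx = np.arange(0, motion.shape[0], interval)
    noisy = motion[idx] + np.random.normal(0.0, noise_std, size=(idx.shape[0], motion.shape[1]))
    return idx, noisy

clean = np.zeros((120, 263), dtype=np.float32)   # toy clean sequence
idx, noisy_keys = make_noisy_keyframes(clean)
print(idx[:5], noisy_keys.shape)                 # [ 0  3  6  9 12] (40, 263)
```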
Comparisons (two examples): Input Noisy Keyframes | MDM | CondMDI | Without Noise-awareness | Ours (SceneMI)
Keyframes - Noisy (Real-World Noise from IMU Sensor) | Scene - Phone-Scanned
Dataset - GIMO (Real-World Data)
We demonstrate the robust generalization of SceneMI on the real-world GIMO dataset. The GIMO motion sequences are affected by IMU sensor inaccuracies, and its phone-scanned scenes differ significantly from the training set. Here, we uniformly select keyframes at intervals of 30 or 60 frames, corresponding to one keyframe every 1 or 2 seconds.
By recognizing human-scene interactions from real-world data, SceneMI synthesizes higher-quality motions that enhance the raw sequences while faithfully preserving the original semantics of the interactions.
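A hedged usage sketch of this enhancement loop is shown below: keyframes are sampled directly from the raw noisy recording, and an in-betweener regenerates the frames between them. The scene_mi_infill function is a hypothetical stand-in for the model's inference call, not an actual SceneMI API.

```python
# Usage sketch for enhancing a noisy real-world sequence. `scene_mi_infill` is a
# hypothetical placeholder for the diffusion in-betweener's inference call.
import numpy as np

def scene_mi_infill(key_idx, key_poses, scene_descriptor, num_frames):
    """Hypothetical stub: would run scene-aware diffusion in-betweening."""
    out = np.zeros((num_frames, key_poses.shape[1]), dtype=np.float32)
    out[key_idx] = key_poses                      # stub keeps keyframes, fills nothing else
    return out

def enhance_sequence(noisy_motion, scene_descriptor, interval=30):
    idx = np.arange(0, noisy_motion.shape[0], interval)   # keyframes from the raw recording
    return scene_mi_infill(idx, noisy_motion[idx], scene_descriptor, noisy_motion.shape[0])

enhanced = enhance_sequence(np.zeros((300, 263), dtype=np.float32), scene_descriptor=None)
print(enhanced.shape)                             # (300, 263)
```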
Scenes: Seminar Room | Bedroom-1 | Bedroom-2 | Classroom | Lab - front | Lab - side
Each scene is shown as Noisy Real-World Data alongside the corresponding SceneMI Results.
Keyframes - Noisy (Real-World Noise from RGB Estimation) | Scene - Video-Based Reconstruction
Dataset - PROX (Real-World Video)
We highlight the potential of SceneMI within a complete monocular video-based human-scene interaction (HSI) reconstruction pipeline. This demonstration showcases human-scene interactions derived from monocular video inputs and underscores SceneMI’s crucial role in the process.
Initially, the reconstructed scenes and motions often exhibit penetration artifacts and jitter due to independent estimation and occlusions in the input video. By integrating SceneMI with the reconstructed scene geometry and extracted keyframes, we enhance physical plausibility, generate natural motions, and complete the video-based HSI reconstruction pipeline.
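The sketch below lays out this pipeline at a high level. Every function is a hypothetical stub standing in for an external component (scene reconstruction, per-frame pose estimation, SceneMI inference); none of them are actual APIs from the paper or its codebase.

```python
# High-level sketch of the video-based HSI reconstruction pipeline. All functions
# are hypothetical stubs; real implementations would plug in here.
import numpy as np

def reconstruct_scene(frames):                        # stub: independent scene reconstruction
    return {"vertices": np.zeros((0, 3), dtype=np.float32)}

def estimate_per_frame_pose(frames):                  # stub: jittery per-frame pose estimates
    return np.zeros((len(frames), 263), dtype=np.float32)

def scene_aware_inbetween(scene, key_idx, key_poses, num_frames):
    out = np.zeros((num_frames, key_poses.shape[1]), dtype=np.float32)  # stub in-betweener
    out[key_idx] = key_poses
    return out

def reconstruct_hsi_from_video(frames, keyframe_interval=30):
    scene = reconstruct_scene(frames)                     # scene geometry from the video
    raw = estimate_per_frame_pose(frames)                 # may jitter / penetrate the scene
    key_idx = np.arange(0, len(raw), keyframe_interval)   # keyframes extracted from raw motion
    refined = scene_aware_inbetween(scene, key_idx, raw[key_idx], len(raw))
    return scene, refined

scene, motion = reconstruct_hsi_from_video([None] * 90)
print(motion.shape)                                       # (90, 263)
```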
Examples (two videos): Input Monocular Video | Initial reconstruction | Final reconstruction (applying SceneMI) | SceneMI results with input
@misc{hwang2025scenemimotioninbetweeningmodeling,
  title={SceneMI: Motion In-betweening for Modeling Human-Scene Interactions},
  author={Inwoo Hwang and Bing Zhou and Young Min Kim and Jian Wang and Chuan Guo},
  year={2025},
  eprint={2503.16289},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.16289},
}