Input 1: Sparse Keyframes | Output 1: Scene-Aware Motion In-betweening
Input 2: Noisy Keyframes | Output 2: Robustness on Noisy Keyframes
Input 3: Noisy Real-World Data | Output 3: Generalization to Real-World
Input 4: Monocular RGB Video | Output 4: Video-based HSI Reconstruction
Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening, a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize to noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and its generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability to HSI reconstruction from monocular videos.
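To make this formulation concrete, the sketch below illustrates, under our own assumptions rather than the authors' released code, how a diffusion denoiser can be conditioned on sparse keyframes plus dual scene descriptors: a global descriptor shared across the sequence and a local, per-frame descriptor. The toy network, feature dimensions, and conditioning-by-concatenation scheme are illustrative placeholders, not SceneMI's actual architecture.

```python
# Minimal sketch (not SceneMI's actual implementation) of a keyframe- and
# scene-conditioned diffusion denoiser. Dimensions, the toy MLP, and
# conditioning-by-concatenation are assumptions for illustration only.
import torch
import torch.nn as nn

T, D = 120, 263          # assumed sequence length and per-frame motion feature size
G, L_LOCAL = 128, 64     # assumed global / local scene descriptor sizes

class ToyDenoiser(nn.Module):
    """Stand-in denoiser that predicts the clean motion x0 from the noisy x_t."""
    def __init__(self):
        super().__init__()
        in_dim = D + D + 1 + G + L_LOCAL + 1   # x_t, keyframes, mask, scene (global+local), timestep
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.SiLU(), nn.Linear(512, D))

    def forward(self, x_t, keyframes, key_mask, g_scene, l_scene, t):
        t_emb = t.float().reshape(1, 1, 1).expand(x_t.shape[0], x_t.shape[1], 1)
        feats = torch.cat([x_t, keyframes, key_mask, g_scene, l_scene, t_emb], dim=-1)
        return self.net(feats)                 # predicted clean motion x0

# Toy inputs: one sequence with keyframes every 60 frames (the sparse setting).
x_t = torch.randn(1, T, D)                     # noisy motion at diffusion step t
keyframes = torch.zeros(1, T, D)
key_mask = torch.zeros(1, T, 1)
key_mask[:, ::60] = 1.0                        # which frames are observed keyframes
keyframes[:, ::60] = torch.randn(1, 2, D)      # observed keyframe poses
g_scene = torch.randn(1, 1, G).expand(1, T, G) # global scene descriptor (sequence-wide)
l_scene = torch.randn(1, T, L_LOCAL)           # local scene descriptor (per frame)

x0_pred = ToyDenoiser()(x_t, keyframes, key_mask, g_scene, l_scene, torch.tensor(50))
print(x0_pred.shape)                           # torch.Size([1, 120, 263])
```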
Keyframes - Sparse, Clean | Scene - Hand-Crafted
Dataset - TRUMANS (Clean)
We visualize SceneMI's results on the clean TRUMANS dataset in the classical motion in-betweening setting, now placed within surrounding environments. Here, we use a sparse keyframe interval of 60 frames, corresponding to 2-second motion segments between consecutive keyframes. This presents a challenging scenario that requires scene awareness for motion synthesis.
SceneMI is the first to tackle motion in-betweening within 3D scenes, demonstrating the ability to generate realistic transitions that adhere to both keyframe and environmental constraints.
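For reference, the 60-frame interval and 2-second spacing quoted above imply a 30 fps frame rate. A small helper sketch for choosing uniformly spaced keyframes is below; including the final frame as a keyframe is our own illustrative choice, not necessarily the paper's protocol.

```python
# Helper sketch for choosing uniformly spaced keyframes. The 60-frame interval
# and 2-second spacing come from the text (implying 30 fps); keeping the final
# frame as a keyframe is an assumed, illustrative choice.
def keyframe_indices(num_frames: int, interval: int = 60, fps: int = 30):
    idx = list(range(0, num_frames, interval))
    if idx[-1] != num_frames - 1:
        idx.append(num_frames - 1)        # keep the last pose as a keyframe
    return idx, interval / fps            # indices and spacing in seconds

print(keyframe_indices(180))              # ([0, 60, 120, 179], 2.0)
```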
Comparisons (two examples): MDM | CondMDI | Without Scene-awareness | Ours (SceneMI)
Further Results: Ours (SceneMI), Cases #1–#4
Keyframes - Dense, Noisy | Scene - Hand-Crafted
Dataset - TRUMANS (Synthetic Noisy)
We provide video comparisons with baselines for scene-aware motion in-betweening under noisy keyframe conditions, using a synthetic noisy TRUMANS test set. Here, we visualize performance in a dense, noisy keyframe setting (interval of 3 frames), an extreme case for motion in-betweening that requires handling significant noise.
SceneMI demonstrates robustness to keyframe quality by effectively handling noise during the diffusion process.
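As a rough sketch of this setting, dense noisy keyframes can be simulated from a clean sequence as follows. The zero-mean Gaussian perturbation with a fixed standard deviation is our assumption; the actual synthetic-noise protocol may differ.

```python
# Sketch of simulating dense, noisy keyframes (interval 3) from a clean motion.
# The Gaussian noise model and its magnitude are assumptions, not the paper's
# exact synthetic-noise recipe.
import numpy as np

def make_noisy_keyframes(motion: np.ndarray, interval: int = 3, noise_std: float = 0.05):
    """motion: (T, D) clean per-frame features -> (keyframe indices, noisy keyframe poses)."""
    idx = np.arange(0, motion.shape[0], interval)
    noisy = motion[idx] + np.random.normal(0.0, noise_std, size=(idx.shape[0], motion.shape[1]))
    return idx, noisy

clean = np.zeros((120, 263), dtype=np.float32)   # toy clean sequence
idx, noisy_keys = make_noisy_keyframes(clean)
print(idx[:5], noisy_keys.shape)                 # [ 0  3  6  9 12] (40, 263)
```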
Comparisons (two examples): Input Noisy Keyframes | MDM | CondMDI | Without Noise-awareness | Ours (SceneMI)
Keyframes - Noisy (Real-World Noise from IMU Sensor) | Scene - Phone-Scanned
Dataset - GIMO (Real-World Data)
We demonstrate the robust generalization of SceneMI on the real-world GIMO dataset. The GIMO motion sequences are affected by IMU sensor inaccuracies, and its phone-scanned scenes differ significantly from the training set. Here, we uniformly select keyframes at intervals of 30 or 60 frames, corresponding to one keyframe every 1 or 2 seconds.
By recognizing human-scene interactions from real-world data, SceneMI synthesizes higher-quality motions that enhance the raw sequences while faithfully preserving the original semantics of the interactions.
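A hedged usage sketch of this enhancement loop is shown below: keyframes are sampled directly from the raw noisy recording, and an in-betweener regenerates the frames between them. The scene_mi_infill function is a hypothetical stand-in for the model's inference call, not an actual SceneMI API.

```python
# Usage sketch for enhancing a noisy real-world sequence. `scene_mi_infill` is a
# hypothetical placeholder for the diffusion in-betweener's inference call.
import numpy as np

def scene_mi_infill(key_idx, key_poses, scene_descriptor, num_frames):
    """Hypothetical stub: would run scene-aware diffusion in-betweening."""
    out = np.zeros((num_frames, key_poses.shape[1]), dtype=np.float32)
    out[key_idx] = key_poses                      # stub keeps keyframes, fills nothing else
    return out

def enhance_sequence(noisy_motion, scene_descriptor, interval=30):
    idx = np.arange(0, noisy_motion.shape[0], interval)   # keyframes from the raw recording
    return scene_mi_infill(idx, noisy_motion[idx], scene_descriptor, noisy_motion.shape[0])

enhanced = enhance_sequence(np.zeros((300, 263), dtype=np.float32), scene_descriptor=None)
print(enhanced.shape)                             # (300, 263)
```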
Scenes: Seminar Room | Bedroom-1 | Bedroom-2 | Classroom | Lab - front | Lab - side
Each scene is shown as Noisy Real-World Data alongside the corresponding SceneMI Results.
Keyframes - Noisy (Real-World Noise from RGB Estimation) | Scene - Video-Based Reconstruction
Dataset - PROX (Real-World Video)
We highlight the potential of SceneMI within a complete monocular video-based human-scene interaction (HSI) reconstruction pipeline. This demonstration showcases human-scene interactions derived from monocular video inputs and underscores SceneMI’s crucial role in the process.
Initially, the reconstructed scenes and motions often exhibit penetration artifacts and jitter due to independent estimation and occlusions in the input video. By integrating SceneMI with the reconstructed scene geometry and extracted keyframes, we enhance physical plausibility, generate natural motions, and complete the video-based HSI reconstruction pipeline.
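The sketch below lays out this pipeline at a high level. Every function is a hypothetical stub standing in for an external component (scene reconstruction, per-frame pose estimation, SceneMI inference); none of them are actual APIs from the paper or its codebase.

```python
# High-level sketch of the video-based HSI reconstruction pipeline. All functions
# are hypothetical stubs; real implementations would plug in here.
import numpy as np

def reconstruct_scene(frames):                        # stub: independent scene reconstruction
    return {"vertices": np.zeros((0, 3), dtype=np.float32)}

def estimate_per_frame_pose(frames):                  # stub: jittery per-frame pose estimates
    return np.zeros((len(frames), 263), dtype=np.float32)

def scene_aware_inbetween(scene, key_idx, key_poses, num_frames):
    out = np.zeros((num_frames, key_poses.shape[1]), dtype=np.float32)  # stub in-betweener
    out[key_idx] = key_poses
    return out

def reconstruct_hsi_from_video(frames, keyframe_interval=30):
    scene = reconstruct_scene(frames)                     # scene geometry from the video
    raw = estimate_per_frame_pose(frames)                 # may jitter / penetrate the scene
    key_idx = np.arange(0, len(raw), keyframe_interval)   # keyframes extracted from raw motion
    refined = scene_aware_inbetween(scene, key_idx, raw[key_idx], len(raw))
    return scene, refined

scene, motion = reconstruct_hsi_from_video([None] * 90)
print(motion.shape)                                       # (90, 263)
```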
Examples (two videos): Input Monocular Video | Initial reconstruction | Final reconstruction (applying SceneMI) | SceneMI results with input
@misc{hwang2025scenemimotioninbetweeningmodeling,
  title={SceneMI: Motion In-betweening for Modeling Human-Scene Interactions},
  author={Inwoo Hwang and Bing Zhou and Young Min Kim and Jian Wang and Chuan Guo},
  year={2025},
  eprint={2503.16289},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.16289},
}