ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

1Seoul National University 2Snap Inc. 3Meta Reality Labs

1. Text-to-Motion Generation

1.1. Motion Generation via Next-Scale Token Map Prediction

Given a text prompt $c$, we autoregressively predict each next-scale token map in $\{q^v\}_{v=0}^{V}$, conditioned on the accumulated coarser-scale token maps. Note that the 2D skeletal-temporal token maps are flattened into 1D for ease of visualization.
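
The coarse-to-fine loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_token_maps`, `random_predictor`, the scale shapes, and the one-bit-per-cell tokens are all hypothetical stand-ins for the learned model and its actual token layout.

```python
import random


def generate_token_maps(prompt, scale_shapes, predict_next_scale):
    """Coarse-to-fine loop: each scale is predicted from the prompt and all coarser maps."""
    token_maps = []
    for shape in scale_shapes:
        # The predictor sees the prompt plus every coarser-scale map generated so far.
        q_v = predict_next_scale(prompt, tuple(token_maps), shape)
        token_maps.append(q_v)
    return token_maps


def random_predictor(prompt, coarser_maps, shape, seed=0):
    """Hypothetical stand-in for the learned model: returns random binary tokens."""
    n_frames, n_parts = shape  # assumed layout: temporal rows, skeletal columns
    rng = random.Random(seed + len(coarser_maps))
    return [[rng.randint(0, 1) for _ in range(n_parts)] for _ in range(n_frames)]


# Assumed (temporal, skeletal) resolutions for three illustrative scales.
scale_shapes = [(1, 1), (2, 2), (4, 4)]
maps = generate_token_maps("a person walks forward", scale_shapes, random_predictor)
for v, q_v in enumerate(maps):
    print(f"scale {v}: {len(q_v)} x {len(q_v[0])} tokens")
```

The key design point is that the predictor is called once per scale rather than once per token, so the number of autoregressive steps grows with the number of scales rather than with the sequence length.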

1.3. Comparisons to Previous Works

2. Zero-Shot Text-Driven Motion Editing

2.1. Motion Editing Overview

Starting from the source prompt $c_s$, we autoregressively generate the source token maps $\{q_s^v\}_{v=0}^{V}$. Given an additional target prompt $c_t$ and source-token preservation masks $\{\mathcal{M}^v\}_{v=0}^{V}$, we predict the edited token maps $q'^{v}_t$ at each scale, conditioned on the remaining source motion context and $c_t$. The target token maps $\{q_t^v\}_{v=0}^{V}$ are then obtained by blending the source token maps $q_s^v$ with the predicted token maps $q'^{v}_t$.
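
A minimal sketch of the blending step, under stated assumptions: the page does not give the exact blending rule, so this example assumes a binary mask $\mathcal{M}^v$ where 1 means "keep the source token" and 0 means "take the newly predicted token". The function names and the `all_ones_predictor` stand-in are hypothetical.

```python
def blend_token_maps(source_map, predicted_map, keep_mask):
    """Keep the source token where keep_mask is 1; otherwise use the predicted token."""
    return [
        [s if keep else p for s, p, keep in zip(s_row, p_row, m_row)]
        for s_row, p_row, m_row in zip(source_map, predicted_map, keep_mask)
    ]


def edit_token_maps(source_maps, keep_masks, target_prompt, predict_next_scale):
    """Scale-by-scale editing: predict with the target prompt, then blend with the source."""
    target_maps = []
    for q_s, mask in zip(source_maps, keep_masks):
        shape = (len(q_s), len(q_s[0]))
        # The predictor sees the target prompt and the already blended coarser maps,
        # which still contain the preserved source tokens.
        q_pred = predict_next_scale(target_prompt, tuple(target_maps), shape)
        target_maps.append(blend_token_maps(q_s, q_pred, mask))
    return target_maps


def all_ones_predictor(prompt, coarser_maps, shape):
    """Hypothetical stand-in for the learned model: always predicts token 1."""
    rows, cols = shape
    return [[1] * cols for _ in range(rows)]


# Two illustrative scales; source tokens are all 0 so edits are easy to spot.
source_maps = [[[0]], [[0, 0], [0, 0]]]
keep_masks = [[[1]], [[1, 0], [0, 1]]]
edited = edit_token_maps(source_maps, keep_masks, "raise both arms", all_ones_predictor)
print(edited)
```

Because the blended maps are fed back as context for the next scale, the preserved source tokens keep influencing the finer-scale predictions, which is one plausible reading of "conditioned on the remaining source motion context".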

2.2. Motion Editing Gallery

For each example, we show the original motion with its source text, alongside the edited motion with its edit prompt. Our method supports a variety of editing operations, including semantic alterations, joint-level modifications, and temporal changes.



Original Motion: The person walks forward at a relaxed pace with an upright posture. Their arms swing naturally at their sides in opposition to the legs. The steps are even and steady, with a smooth heel-to-toe motion. The head faces forward, and the torso remains stable throughout the walk.

Edited Motion: The person walks forward at a relaxed pace with an upright posture. Their arms lift upward away from the sides, raising the hands to about shoulder height or higher. The steps are even and steady, with a smooth heel-to-toe motion. The head faces forward, and the torso remains stable throughout the walk.

Original Motion: The person walks forward at a steady pace with arms swinging naturally at the sides. Both legs move evenly with a smooth heel-to-toe motion, and the torso remains upright and stable throughout.

Edited Motion: The person walks forward at a steady pace with arms swinging naturally at the sides. The right leg moves with a heavy limp, landing with less weight and a shortened stride, while the left leg carries more of the body's load. The torso sways gently to the left with each step to compensate.

Original Motion: The person walks forward steadily in a relaxed pace, pauses in the middle, stands still briefly, then resumes walking forward at the same steady pace.

Edited Motion: The person walks forward steadily in a relaxed pace, then bends the knees and lowers into a squat position midway, rises back upright, and resumes walking forward at the same steady pace.

Original Motion: The person stands in a fighting stance and throws a quick jab forward with the left fist, fully extending the left arm, then immediately follows with a straight cross punch with the right fist, extending the right arm forward with more power. Both fists then retract back to guard position.

Edited Motion: The person stands in a fighting stance and throws a quick jab forward with the right fist, fully extending the right arm, then immediately follows with a straight cross punch with the left fist, extending the left arm forward with more power. Both fists then retract back to guard position.

3. Multi-Scale Skeletal-Temporal Token Map


Given an input motion sequence $\mathbf{m}$, the encoder $\mathcal{E}$ maps it to a continuous skeletal-temporal latent grid $f$. The latent is decomposed into a hierarchy of residual components $\{q^v\}_{v=0}^V$ via binary multi-scale residual quantization, where each scale has its own temporal resolution and skeletal partition. The quantized residuals are then upsampled and accumulated to form the reconstructed latent $\hat{f}$, which the decoder $\mathcal{D}$ converts back into a full-resolution motion sequence.
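
To make the decompose-then-accumulate idea concrete, here is a deliberately simplified sketch. It assumes a 1D temporal latent only (the skeletal axis is omitted), sign-based binary quantization, average-pool downsampling, nearest-neighbor upsampling, and hand-picked per-scale step sizes. None of these choices are confirmed by the page, and all function names are hypothetical.

```python
def downsample(signal, length):
    """Average-pool a 1D list down to `length` bins (length must divide len(signal))."""
    step = len(signal) // length
    return [sum(signal[i * step:(i + 1) * step]) / step for i in range(length)]


def upsample(tokens, length):
    """Nearest-neighbor upsampling of a 1D list to `length` entries."""
    step = length // len(tokens)
    return [t for t in tokens for _ in range(step)]


def residual_quantize(latent, scale_lengths, step_sizes):
    """Quantize the running residual at each scale and accumulate the upsampled result."""
    residual = list(latent)
    recon = [0.0] * len(latent)
    token_maps = []
    for length, delta in zip(scale_lengths, step_sizes):
        coarse = downsample(residual, length)
        q_v = [1 if x >= 0 else -1 for x in coarse]  # binary token: sign of the residual
        token_maps.append(q_v)
        up = [delta * t for t in upsample(q_v, len(latent))]
        recon = [r + u for r, u in zip(recon, up)]
        # The next scale only has to explain what the coarser scales missed.
        residual = [f - r for f, r in zip(latent, recon)]
    return token_maps, recon


# Toy latent with two time steps, two scales, and assumed step sizes.
tokens, recon = residual_quantize([3.0, 1.0], scale_lengths=[1, 2], step_sizes=[2.0, 1.0])
print(tokens, recon)
```

Stopping the loop early yields a partial sum over the coarser scales only, which corresponds to a coarse reconstruction of the latent.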

$\sum_{v=0}^{\mathbf{2}}\mathcal{I}_{up}^{(v)} \left( q^v,n,j \right)$ (Scale 3 / 11)
$\sum_{v=0}^{\mathbf{5}}\mathcal{I}_{up}^{(v)} \left( q^v,n,j \right)$ (Scale 6 / 11)
$\sum_{v=0}^{\mathbf{8}}\mathcal{I}_{up}^{(v)} \left( q^v,n,j \right)$ (Scale 9 / 11)
$\sum_{v=0}^{\mathbf{10}}\mathcal{I}_{up}^{(v)} \left( q^v,n,j \right)$ (Full Scale)

To understand our multi-scale skeletal-temporal motion representation, we visualize the step-by-step reconstruction of a motion by progressively accumulating token maps across the skeletal-temporal hierarchy. In the earlier, coarse-level token maps, the motions of paired limbs, such as both arms or both legs, are grouped and encoded into shared representations. As we progress rightward toward finer scales, the token maps recursively split according to our predefined skeletal hierarchy.
Consequently, the coarse-scale token maps capture global semantic movements, which gradually disentangle into highly articulated dynamics for individual joints at finer scales. This hierarchical progression restores the full realism and fine-grained details of the original motion (the rightmost character).

4. Citation


If you find our work useful, please consider citing:

@misc{hwang2026scalemogenautoregressivenextscaleprediction,
      title={ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation},
      author={Inwoo Hwang and Hojun Jang and Bing Zhou and Jian Wang and Young Min Kim and Chuan Guo},
      year={2026},
      eprint={2605.11704},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.11704},
}