Motion: An egg rolling from the right to left on the table.
CogVideoX-5B
Ours (Video Sketch)
Ours (Final Video)
Motion: A helicopter gracefully descending to land.
CogVideoX-5B
Ours (Video Sketch)
Ours (Final Video)
Numeracy: Three bears fish in a river surrounded by mountains.
CogVideoX-5B
Ours (Video Sketch)
Ours (Final Video)
Numeracy: Six penguins waddle together across an icy landscape.
CogVideoX-5B
Ours (Video Sketch)
Ours (Final Video)
Spatial: A gorilla sitting on the left side of a vending machine in a forest.
CogVideoX-5B
Ours (Video Sketch)
Ours (Final Video)
Spatial: A child building a sandcastle on the right of a beach umbrella.
CogVideoX-5B
Ours (Video Sketch)
Ours (Final Video)
@article{li2025video-msg,
author = {Jialu Li and Shoubin Yu and Han Lin and Jaemin Cho and Jaehong Yoon and Mohit Bansal},
title = {Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization},
year = {2025},
journal = {arXiv preprint arXiv:2504.08641},
}