Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin

Sun Yat-Sen University

"a Monet rabbit"

"a robotic camel in snowy day"

"a robotic bull in heavy rain."

"a man handstand, van Gogh style."

"a train in lava."

aluminum swan."

frozen city."

Depth control

"a robot play basketball, cartoon"

"a bear walking through stars"

"an elephant's back, oil painting"

Canny (edge) control

"a bus, cyberpunk style"

"a robot walking, star night"

"a dog, comicbook style"

HED (soft-edge) control

"Chinese ink dog running"

"person riding horse, watercolor"

"a flamingo in snowy day"

Abstract

This paper presents a controllable text-to-video (T2V) diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model by incorporating a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy is proposed that enables the model to generate videos transferred from the image domain as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy to introduce a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet achieves resource-efficient convergence and generates high-quality, consistent videos with fine-grained control. Extensive experiments demonstrate its success in various video generation tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality.
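The abstract does not spell out how first-frame conditioning yields arbitrary-length videos, so the following is only a minimal sketch of one plausible reading: generate fixed-length clips and condition each new clip on the last frame of the previous one. Every name here (`generate_long_video`, `generate_clip`, `toy_generate_clip`, `clip_len`) is hypothetical and not taken from the paper or its code, and the sketch assumes that a clip conditioned on a first frame returns that frame as its frame 0.

```python
from typing import Any, Callable, List, Optional, Sequence

Frame = Any    # stand-in for an image tensor
Control = Any  # stand-in for one edge or depth map

# Hypothetical clip-level generator (an assumption, not the paper's API):
# (prompt, control maps for one clip, optional first frame) -> one frame per control map.
ClipGenerator = Callable[[str, Sequence[Control], Optional[Frame]], List[Frame]]


def generate_long_video(
    generate_clip: ClipGenerator,
    prompt: str,
    controls: Sequence[Control],
    clip_len: int = 8,
    first_frame: Optional[Frame] = None,
) -> List[Frame]:
    """Chain fixed-length clips, conditioning each on the previous clip's last frame."""
    if clip_len < 2:
        raise ValueError("clip_len must be at least 2 so consecutive clips can share a frame")
    if not controls:
        return []

    video: List[Frame] = []
    anchor = first_frame  # None means the first clip is generated unconditioned
    start = 0
    while True:
        window = controls[start:start + clip_len]
        clip = generate_clip(prompt, window, anchor)
        # Later clips start with the anchor frame, which is already in the video.
        video.extend(clip if start == 0 else clip[1:])
        start += clip_len - 1  # consecutive clips overlap by exactly one frame
        if start >= len(controls) - 1:
            break
        anchor = clip[-1]
    return video


def toy_generate_clip(
    prompt: str, window: Sequence[Control], first: Optional[Frame]
) -> List[Frame]:
    """Toy stand-in for a diffusion model: echoes control maps and keeps the anchor as frame 0."""
    frames = list(window)
    if first is not None and frames:
        frames[0] = first
    return frames
```

With the toy generator, `generate_long_video(toy_generate_clip, "a dog", list(range(10)), clip_len=4)` stitches three overlapping clips into ten frames. A real model would replace `toy_generate_clip`, and the one-frame overlap is a design choice of this sketch rather than something the abstract states.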

BibTeX


    @misc{chen2023controlavideo,
        title={Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models}, 
        author={Weifeng Chen and Jie Wu and Pan Xie and Hefeng Wu and Jiashi Li and Xin Xia and Xuefeng Xiao and Liang Lin},
        year={2023},
        eprint={2305.13840},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }