
References: 1

Vx2Text: End-to-end learning of video-based text generation from multimodal inputs

We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus ...

2026-01-20 00:00:00

Wang, J.; Lin, X.; Bertasius, G.; Chang, S.F.

Related keywords