Huang P.Y. x 2

References: 2

We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach l...

2022-12-15

Large vision-language models are generally applicable to many downstream tasks, but come at an exorb...