
Tencent & Sydney U’s GPT4Video: A Unified Multimodal Large Language Model Significantly Elevates LLMs’ Video Generative Capabilities


Recent strides in Multimodal Large Language Models (MLLMs) have brought notable progress in input-side multimodal comprehension, yet a clear gap remains on the output side: multimodal content generation.

Addressing this gap, a collaborative effort between Tencent AI Lab and The University of Sydney introduces GPT4Video in the new paper GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation. GPT4Video is a unified multimodal framework that endows Large Language Models (LLMs) with the ability to both understand and generate video.

The primary contributions of the research team can be summarized as follows:

  1. Introduction of GPT4Video, a versatile framework enriching LLMs with capabilities for both video understanding and generation.
  2. Proposal of a straightforward and effective fine-tuning method designed to enhance the safety of video generation, offering an appealing alternative to the commonly employed RLHF approach.
  3. Release of datasets to facilitate future endeavors in the realm of multimodal LLMs.

GPT4Video responds to a limitation of existing MLLMs, which excel at processing multimodal inputs but fall short when it comes to generating multimodal outputs. The architecture of GPT4Video comprises three integral components:

  1. A video understanding module utilizing a video feature extractor and a video abstractor to encode and align video information within the LLM’s word embedding space.
  2. The LLM body, drawing from the structure of LLaMA and employing Parameter-Efficient Fine-Tuning (PEFT) methods, specifically LoRA, while preserving the original pre-trained parameters.
  3. A video generation component that conditions the LLM to generate prompts for a model from the Text-to-Video Model Gallery, via a meticulously constructed instruction-following dataset.
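Taken together, these components form a pipeline that turns a video into a handful of tokens the language model can read, and turns the model’s text back into prompts for an off-the-shelf video generator. Below is a minimal sketch of the middle piece, the video abstractor, assuming a cross-attention pooling design; the class name, dimensions, and query count are illustrative assumptions rather than the paper’s exact configuration.

```python
# Illustrative sketch (not the authors' code): a small cross-attention
# "abstractor" pools frame-level features into a fixed number of tokens
# projected into the LLM's word-embedding space. All dimensions, the query
# count, and the class name are assumptions for illustration only.
import torch
import torch.nn as nn

class VideoAbstractor(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the video across time and space.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, frame_feats):
        # frame_feats: (batch, frames * patches, vis_dim) from a frozen vision encoder
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, frame_feats, frame_feats)  # cross-attend over all frames
        return self.proj(pooled)  # (batch, num_queries, llm_dim) video tokens for the LLM

# Toy usage with random tensors standing in for ViT features:
feats = torch.randn(2, 8 * 256, 1024)      # 2 videos, 8 frames x 256 patches each
video_tokens = VideoAbstractor()(feats)    # -> (2, 32, 4096)
```

The resulting video tokens can be prepended to the text embeddings so the LLM reads the video as if it were a short prefix of words.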

The pipeline begins with a frozen ViT-L/14 model that extracts raw video features, which a video abstraction module then condenses along the temporal and spatial axes. GPT4Video’s core is a frozen LLaMA model, efficiently fine-tuned via LoRA on custom video-centric and safety-aligned data. This equips it to comprehend videos and to generate suitable video prompts, which are then passed to the Text-to-Video Model Gallery to produce videos.
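For the language backbone, parameter-efficient fine-tuning with LoRA typically means freezing the pre-trained weights and training only small low-rank adapter matrices. Below is a hedged sketch of such a setup using Hugging Face’s transformers and peft libraries; the checkpoint name, rank, and target modules are assumptions, not the paper’s reported configuration.

```python
# Hedged sketch: attach LoRA adapters to a frozen LLaMA-style backbone using
# Hugging Face `peft`. Checkpoint name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed checkpoint
for p in base.parameters():
    p.requires_grad = False  # preserve the original pre-trained parameters

lora_cfg = LoraConfig(
    r=16,                                  # low-rank update dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapt the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)      # only the LoRA matrices remain trainable
model.print_trainable_parameters()
```

Training then proceeds on the video-centric and safety-aligned instruction data, with only a small fraction of the parameters being updated.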

Experimental results across various multimodal benchmarks, including open-ended question answering, video captioning, and text-to-video generation, validate the effectiveness and generality of GPT4Video. Moreover, GPT4Video demonstrates that it can harness the strong contextual summarization and textual expression capabilities of LLMs to generate detailed video prompts.

In essence, GPT4Video significantly elevates Large Language Models by integrating advanced video understanding and generation capabilities, and its strong performance across multimodal benchmarks underscores that effectiveness.

The code is available on the project’s GitHub. The paper GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation is on arXiv.


Author: Hecate He | Editor: Chain Zhang


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.


