MMEvol

Empowering MLLMs with Evol-Instruct

Run Luo1,2*, Haonan Zhang3*, Longze Chen1,2*, Ting-En Lin3*, Xiong Liu3, Yuchuan Wu3, Min Yang1,2, Minzheng Wang2, Pengpeng Zeng4, Lianli Gao5, Heng Tao Shen4, Yunshui Li1,2, Xiaobo Xia6, Fei Huang3, Jingkuan Song4, Yongbin Li3
1SIAT, 2UCAS, 3Alibaba, 4TONGJI, 5Independent, 6USYD

Abstract

The development of Multimodal Large Language Models (MLLMs) has seen significant advancements, driven by increasing demands in various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches attempt to enhance MLLM capabilities through diverse architectures, their gains have become increasingly marginal. Data-driven methods, which scale up image-text instruction data, are more effective but face the challenge of limited data diversity and complexity. The absence of high-quality data thus constitutes a significant development barrier for MLLMs. To address this bottleneck, we propose MMEvol, a novel multimodal instruction data evolution framework. MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that empowers MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we use MMEvol to systematically broaden the diversity of instruction types, extend visual reasoning steps to improve cognitive reasoning abilities, and thoroughly mine fine-grained information within images to enhance visual understanding and robustness. To comprehensively evaluate the effectiveness of our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained with the initial seed data, our method achieves an average accuracy improvement of 3.1 percentage points. Furthermore, it reaches state-of-the-art (SOTA) performance on nine tasks while using significantly less data than other leading models.

Overview of MMEvol: Instruction evolution and instruction elimination synergistically collaborate through multiple rounds to enhance the diversity and complexity of instruction data.
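To make this loop concrete, below is a minimal Python sketch of one possible evolve-then-eliminate pipeline. The directive wording, the `generate` and `judge` callables, and the round count are illustrative assumptions for exposition, not the authors' released implementation.

    import random
    from typing import Callable

    # Three evolution directions described above; the directive wording is an
    # assumption, not the paper's actual prompt text.
    EVOLUTION_TYPES = {
        "fine_grained_perception": "Rewrite the instruction to probe finer visual details of the image.",
        "cognitive_reasoning": "Rewrite the instruction so that answering requires more visual reasoning steps.",
        "interaction": "Rewrite the instruction into a new, more diverse interaction format.",
    }

    def mmevol_loop(seed_data: list,
                    generate: Callable[[str], dict],
                    judge: Callable[[dict], bool],
                    num_rounds: int = 3) -> list:
        """Evolve image-text instructions over several rounds, eliminating failures."""
        pool = list(seed_data)
        for _ in range(num_rounds):
            next_pool = []
            for sample in pool:
                directive = random.choice(list(EVOLUTION_TYPES.values()))
                prompt = f"{directive}\n\nOriginal instruction: {sample['instruction']}"
                candidate = generate(prompt)  # e.g. a call to an instruction-evolving LLM
                # Instruction elimination: keep the evolved sample only if the
                # judge accepts it; otherwise fall back to the previous version.
                next_pool.append(candidate if judge(candidate) else sample)
            pool = next_pool
        return pool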

Methodology

Prompt Template

The prompt templates for Cognitive Reasoning Evolution, Interactive Evolution, Fine-grained Perceptual Evolution, and Instruction Elimination.
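As an illustration only, a template for the cognitive reasoning direction might be structured as follows; every line here is a placeholder, and the exact wording differs from the templates presented with the paper.

    # Hypothetical skeleton of a cognitive-reasoning evolution prompt. The
    # field names and phrasing are placeholders, not the paper's template.
    COGNITIVE_REASONING_TEMPLATE = """\
    You are an instruction-evolution assistant.
    Given an image description and an existing question-answer pair, rewrite
    the question so that answering it requires at least one additional step of
    visual reasoning, then write the new step-by-step answer.

    Image description: {caption}
    Original question: {question}
    Original answer: {answer}

    Evolved question and answer:"""

    prompt = COGNITIVE_REASONING_TEMPLATE.format(
        caption="Two cyclists wait at a rainy crosswalk.",
        question="How many people are in the image?",
        answer="Two.",
    )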


Evolution Case: MMEvol continuously enhances instruction data complexity and diversity beyond the original Evol-Instruct. The sample is from SEED-163K. We mark fine-grained visual information in red, new instruction forms in green, and longer reasoning steps in blue.

Instruction Diversity Comparison


Task Categories: The root verbs (inner circle) and their top noun objects (outer circle) of the seed data (left) and the evolved data (right). MMEvol significantly enhances the diversity of instruction data.
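This verb-noun statistic follows the diversity analysis popularized by Self-Instruct; a small sketch of how such pairs can be extracted is below. The spaCy model choice and dependency labels are assumptions about one reasonable implementation, not the analysis code used for the figure.

    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def root_verb_and_object(instruction: str):
        """Return the (root verb, direct object) pair of an instruction, if any."""
        doc = nlp(instruction)
        for token in doc:
            if token.dep_ == "ROOT" and token.pos_ == "VERB":
                objects = [c.lemma_ for c in token.children if c.dep_ in ("dobj", "obj")]
                return token.lemma_, objects[0] if objects else None
        return None, None

    instructions = ["Describe the scene in the image.", "Count the red cars."]
    print(Counter(root_verb_and_object(i) for i in instructions).most_common())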

Fine-grained Visual Objects Comparison


The long-tail distribution of 200 visual objects in the seed and evolved data: MMEvol significantly improves the long-tail distribution of visual objects in the seed data, providing more fine-grained visual information and thereby boosting the model's generalization ability and robustness against hallucinations.
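A sketch of how such an object-frequency distribution can be tallied; the naive substring matching and the tiny vocabulary here are simplifying assumptions (a real pipeline would extract visual objects more carefully, and the actual analysis covers 200 objects).

    from collections import Counter

    # Illustrative object vocabulary; the actual analysis covers 200 objects.
    VOCAB = ("traffic light", "fire hydrant", "person", "car")

    def object_frequencies(texts):
        """Count mentions of each visual object via naive substring matching."""
        counts = Counter({obj: 0 for obj in VOCAB})
        for text in texts:
            lowered = text.lower()
            for obj in VOCAB:
                counts[obj] += lowered.count(obj)
        return counts

    seed = ["A person stands next to a car."]
    evolved = ["A person in a red coat waits by a fire hydrant while a car passes the traffic light."]
    print(object_frequencies(seed))
    print(object_frequencies(evolved))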

Instruction Complexity and Difficulty Comparison

Experiment Results


Comparison with state-of-the-art methods on 13 vision-language benchmarks: Our MMEvol consistently improves LLaVA-NeXT in a head-to-head comparison using the same prompts and the same base LLM, showing the effectiveness of enhanced instruction data quality. We mark the best performance in bold and the second-best with underline.


Examples of image-text dialogues with our Evol-8B MLLM: Our model, trained on the evolved data, exhibits strong visual reasoning, instruction following, and fine-grained perception capabilities. It also identifies nuances in meme content, validating the effectiveness and efficiency of MMEvol.

Citation


      @article{run2024mmevol,
              title={MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct},
              author={Run Luo and Haonan Zhang and Longze Chen and Ting-En Lin and Xiong Liu and Yuchuan Wu and Min Yang and Minzheng Wang and Pengpeng Zeng and Lianli Gao and Heng Tao Shen and Yunshui Li and Xiaobo Xia and Fei Huang and Jingkuan Song and Yongbin Li},
              journal={arXiv preprint arXiv:2409.05840},
              year={2024}
      }