Visual Prompting in LLMs for Enhancing Emotion Recognition

¹Australian National University ²Quriosity Pty Ltd
³Webumate Pty Ltd ⁴Curtin University ⁵Yale University
EMNLP Main 2024

Abstract

Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing; however, the potential of visual prompts for emotion recognition in these models remains largely unexplored. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. We propose a novel Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face count and emotion categorization while preserving the enriched image context. Through comprehensive experimentation and analysis of recent commercial or open-source VLLMs, we evaluate the SoV model's ability to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.

Introduction

The proposed Set-of-Vision (SoV) prompting approach for enhancing facial expression recognition in Vision Large Language Models (VLLMs). SoV progressively incorporates (1) bounding boxes to identify and locate faces, (2) numbered boxes to ground and differentiate faces, and (3) facial landmarks to analyze spatial relationships for fine-grained emotion classification. This multi-stage visual prompting strategy enables VLLMs to accurately detect and recognize emotions in real-world images while preserving global context.

In the bottom of the figure, SoV prompts (numbering each face, placing bounding boxes, and identifying facial landmarks) allow for a more precise analysis. The correct number of faces is identified (18), and the emotions are categorized into more nuanced groups: 'Neutral Emotion', 'Mildly Positive Emotion', and 'Smiling or Happy'. This gives a clearer and more detailed breakdown of each individual's emotional state based on visible facial expressions. The comparison highlights the importance and effectiveness of integrating visual prompts into VLLM analysis for more accurate and detailed recognition and categorization of human emotions in images.
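
The three kinds of marks described above (bounding boxes, face numbers, and facial landmarks) can be illustrated with a small sketch. This is not the authors' implementation: it assumes face boxes and landmark points have already been produced by some detector, and it draws onto a character grid rather than a real image so that it runs with the Python standard library alone. The names `Face`, `blank_canvas`, `draw_sov_marks`, and `build_prompt`, as well as the prompt wording, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Face:
    # Hypothetical container for one detected face.
    # box is (left, top, right, bottom), inclusive, and is assumed to lie
    # fully inside the canvas. landmarks are (x, y) points.
    box: tuple[int, int, int, int]
    landmarks: list[tuple[int, int]] = field(default_factory=list)


def blank_canvas(width: int, height: int, fill: str = ".") -> list[list[str]]:
    """Character grid standing in for an image, one cell per 'pixel'."""
    return [[fill] * width for _ in range(height)]


def draw_sov_marks(canvas: list[list[str]], faces: list[Face]) -> list[list[str]]:
    """Overlay the marks in place; the rest of the canvas is left untouched,
    which mirrors the idea of keeping the global image context."""
    height, width = len(canvas), len(canvas[0])
    for number, face in enumerate(faces, start=1):
        left, top, right, bottom = face.box
        # (1) Bounding box outline.
        for x in range(left, right + 1):
            canvas[top][x] = canvas[bottom][x] = "#"
        for y in range(top, bottom + 1):
            canvas[y][left] = canvas[y][right] = "#"
        # (2) Face number, written along the top edge from the top-left corner.
        for offset, digit in enumerate(str(number)):
            if left + offset <= right:
                canvas[top][left + offset] = digit
        # (3) Landmark points; points outside the canvas are skipped.
        for x, y in face.landmarks:
            if 0 <= x < width and 0 <= y < height:
                canvas[y][x] = "+"
    return canvas


def build_prompt(faces: list[Face]) -> str:
    """Text prompt that refers to faces by the numbers drawn on the image.
    The wording is illustrative only."""
    return (
        f"Faces in the image are marked with boxes numbered 1 to {len(faces)} "
        "and with landmark points. For each numbered face, name the emotion "
        "it expresses."
    )


faces = [
    Face(box=(1, 1, 5, 4), landmarks=[(2, 2), (4, 2), (3, 3)]),
    Face(box=(7, 1, 11, 4), landmarks=[(8, 2)]),
]
for row in draw_sov_marks(blank_canvas(13, 6), faces):
    print("".join(row))
print(build_prompt(faces))
```

In a real pipeline the same three passes would be drawn on the photograph itself with an image library, and the annotated image would be sent to the VLLM together with the text prompt.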

BibTeX

@inproceedings{zhang2024visual, title={Visual Prompting in LLMs for Enhancing Emotion Recognition}, author={Zhang, Qixuan and Wang, Zhifeng and Zhang, Dylan and Niu, Wenjia and Caldwell, Sabrina and Gedeon, Tom and Liu, Yang and Qin, Zhenyue}, booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing}, pages={4484--4499}, year={2024} }
