Visual Prompting in LLMs for Enhancing Emotion Recognition

1Australian National University 2Quriosity Pty Ltd
3Webumate Pty Ltd 4Curtin University 5Yale University
EMNLP Main 2024

Workflow diagram for enhanced face recognition and emotion analysis using the Set-of-Vision (SoV) prompting approach: a multi-step process involving face detection, face numbering, landmark extraction, and spatial relationship analysis for emotion classification. Each detected face is analyzed and identified by its facial landmarks, such as the positions of the nose, eyes, mouth, and other facial features.

Abstract

Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing; however, the potential of visual prompts for emotion recognition in these models remains largely unexplored. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. We propose a novel Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face counting and emotion categorization while preserving the enriched image context. Through comprehensive experimentation and analysis of recent commercial and open-source VLLMs, we evaluate the SoV model's ability to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.

Introduction

Proposed Set-of-Vision (SoV) prompting approach for enhancing facial expression recognition in Vision Large Language Models (VLLMs). SoV progressively incorporates (1) bounding boxes to identify and locate faces, (2) numbered boxes to ground and differentiate faces, and (3) facial landmarks to analyze spatial relationships for fine-grained emotion classification. This multi-stage visual prompting strategy enables VLLMs to accurately detect and recognize emotions in real-world images while preserving global context.
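
The three prompt layers listed in this caption (bounding boxes, face numbers, landmarks) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Face` and `VisualPrompt` types, the `build_sov_prompts` function, and the top-to-bottom, left-to-right numbering order are all assumptions made for illustration, and the face boxes and landmarks are assumed to come from some external detector. The sketch only builds an overlay plan and does not draw pixels.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Point = Tuple[int, int]           # (x, y) in pixels


@dataclass
class Face:
    """Output of a hypothetical face detector (not specified by the source)."""
    box: Box
    landmarks: List[Point] = field(default_factory=list)


@dataclass
class VisualPrompt:
    """One overlay item: a numbered box plus its landmark points."""
    number: int
    box: Box
    label_anchor: Point           # where the face number would be drawn
    landmarks: List[Point]


def build_sov_prompts(faces: List[Face]) -> List[VisualPrompt]:
    """Turn detections into an ordered overlay plan.

    Assumption: faces are numbered by the top edge of their box, then by
    the left edge. The source does not state which ordering is used.
    """
    ordered = sorted(faces, key=lambda f: (f.box[1], f.box[0]))
    prompts = []
    for number, face in enumerate(ordered, start=1):
        left, top, right, bottom = face.box
        # Keep only landmarks that fall inside the face box.
        inside = [(x, y) for (x, y) in face.landmarks
                  if left <= x <= right and top <= y <= bottom]
        prompts.append(VisualPrompt(number, face.box, (left, top), inside))
    return prompts


if __name__ == "__main__":
    # Illustrative coordinates only; not taken from the paper.
    detections = [
        Face((100, 50, 140, 90), [(120, 70), (300, 300)]),
        Face((10, 20, 50, 60), [(30, 40)]),
    ]
    for prompt in build_sov_prompts(detections):
        print(prompt)
```

Rendering the plan onto the image is left out here. An image library such as Pillow could draw each item with the `rectangle`, `text`, and `ellipse` methods of an `ImageDraw.Draw` object, which keeps the full image (and therefore the global context) intact while marking each face.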

In the bottom panel of the figure, SoV prompts, such as numbering each face, placing bounding boxes, and identifying facial landmarks, allow for a more precise analysis. The correct number of faces is identified (18), and the emotions are accurately categorized into more nuanced groups: "Neutral Emotion", "Mildly Positive Emotion", and "Smiling or Happy". This method provides a clearer and more detailed breakdown of each individual's emotional state based on visible facial expressions. The comparison highlights the importance and effectiveness of integrating visual prompts into VLLM analysis for more accurate and detailed recognition and categorization of human emotions in images.
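
Because each face carries a number, a per-face answer from the model can be turned into a face count and a per-category breakdown like the one described above. The following is a minimal sketch under stated assumptions: the `summarize_emotions` helper and the input format (a mapping from face number to emotion label) are hypothetical, and the sample labels are illustrative rather than results from the paper.

```python
from typing import Dict, List


def summarize_emotions(per_face: Dict[int, str]) -> Dict[str, List[int]]:
    """Group face numbers by the emotion label assigned to each face.

    `per_face` maps a face number (as drawn on the image) to a label.
    The format is an assumption; the source does not specify how model
    answers are parsed.
    """
    groups: Dict[str, List[int]] = {}
    for number in sorted(per_face):
        label = per_face[number].strip()   # tolerate stray whitespace
        groups.setdefault(label, []).append(number)
    return groups


if __name__ == "__main__":
    # Illustrative labels only; not taken from the paper's figure.
    answers = {
        1: "Smiling or Happy",
        2: "Neutral Emotion",
        3: "Mildly Positive Emotion",
        4: "Neutral Emotion",
    }
    groups = summarize_emotions(answers)
    print("faces counted:", sum(len(v) for v in groups.values()))
    for label, numbers in groups.items():
        print(f"{label}: faces {numbers}")
```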

Performance Comparison Chart

Results

Video Presentation

Poster

Paper

BibTeX

@inproceedings{zhang2024visual,
  title={Visual Prompting in LLMs for Enhancing Emotion Recognition},
  author={Zhang, Qixuan and Wang, Zhifeng and Zhang, Dylan and Niu, Wenjia and Caldwell, Sabrina and Gedeon, Tom and Liu, Yang and Qin, Zhenyue},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  pages={4484--4499},
  year={2024}
}