Prompting Visual-Language Models for Dynamic Facial Expression Recognition
A CLIP-based visual-language model called DFER-CLIP for in-the-wild recognition of dynamic facial expressions of emotion.
Descriptions of expression-related facial behaviour are used as the textual input, capturing the relationship between facial expressions and the facial behaviour underlying them.
Facial expression is an important aspect of daily conversation and communication, and the recognition of dynamic facial expressions has become an increasingly important research area within computer vision and affective computing. This work presents a novel visual-language model called DFER-CLIP, which is based on the CLIP model and designed for in-the-wild Dynamic Facial Expression Recognition (DFER). Specifically, the proposed DFER-CLIP consists of a visual part and a textual part. For the visual part, a temporal model consisting of several Transformer encoders is introduced on top of the CLIP image encoder to extract temporal facial expression features, and the final feature embedding is obtained as the output of a learnable "class" token. For the textual part, we use as inputs textual descriptions of the facial behaviour related to the classes (facial expressions) that we are interested in recognising; these descriptions are generated using large language models, such as ChatGPT. In contrast to works that use only the class names, this more accurately captures the relationship between the expressions and the facial behaviour underlying them. Alongside the textual description, we introduce a learnable token that helps the model learn relevant context information for each expression during training. Extensive experiments demonstrate the effectiveness of the proposed method and show that our DFER-CLIP achieves state-of-the-art results compared with current supervised DFER methods on the DFEW, FERV39k, and MAFW benchmarks.
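To make the described architecture concrete, the sketch below outlines the two-branch design in PyTorch: per-frame features from an image encoder, a temporal Transformer with a learnable "class" token on the visual side, and class descriptions prepended with learnable context tokens on the textual side, matched by cosine similarity. This is a minimal illustration based only on the abstract; the CLIP image and text encoders are replaced by small stand-in modules so the snippet runs standalone, and all layer sizes and module names are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DFERCLIPSketch(nn.Module):
    """Rough sketch of the two-branch design described in the abstract."""

    def __init__(self, embed_dim=512, num_classes=7, num_context=8):
        super().__init__()
        # --- visual branch ---
        # Stand-in for the CLIP image encoder (would be pretrained and frozen).
        self.image_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 224 * 224, embed_dim))
        # Temporal model: Transformer encoders over per-frame features,
        # with a learnable "class" token whose output is the video embedding.
        enc_layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.visual_cls = nn.Parameter(torch.zeros(1, 1, embed_dim))

        # --- textual branch ---
        # Learnable context tokens prepended to each class description
        # (the descriptions themselves would come from an LLM such as ChatGPT).
        self.context = nn.Parameter(0.02 * torch.randn(num_classes, num_context, embed_dim))
        # Stand-in for the CLIP text encoder.
        txt_layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(txt_layer, num_layers=2)

        self.logit_scale = nn.Parameter(torch.tensor(100.0).log())

    def encode_video(self, frames):
        # frames: (B, T, 3, 224, 224) -> per-frame features (B, T, D)
        b, t = frames.shape[:2]
        feats = self.image_encoder(frames.flatten(0, 1)).view(b, t, -1)
        tokens = torch.cat([self.visual_cls.expand(b, -1, -1), feats], dim=1)
        return self.temporal(tokens)[:, 0]  # output of the "class" token

    def encode_text(self, desc_embeds):
        # desc_embeds: (C, L, D) token embeddings of the class descriptions
        tokens = torch.cat([self.context, desc_embeds], dim=1)
        return self.text_encoder(tokens).mean(dim=1)  # (C, D)

    def forward(self, frames, desc_embeds):
        v = F.normalize(self.encode_video(frames), dim=-1)
        t = F.normalize(self.encode_text(desc_embeds), dim=-1)
        return self.logit_scale.exp() * v @ t.t()  # (B, C) similarity logits


# Illustrative usage with random tensors in place of real clips/descriptions.
model = DFERCLIPSketch()
frames = torch.randn(2, 16, 3, 224, 224)   # two 16-frame face clips
desc = torch.randn(7, 20, 512)             # token embeddings for 7 class descriptions
logits = model(frames, desc)               # shape (2, 7)
```

Classification then amounts to taking the argmax over the per-class similarity logits, mirroring the CLIP-style matching of a video embedding against the embeddings of the expression descriptions.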