In the April issue of the SMPTE Motion Imaging Journal, Dominic Rüfenacht and Appu Shanji of Berlin, Germany's Mobius Labs present their work on "Customized Facial Expression Analysis in Video" using AI and machine learning. While existing computer vision-based emotion recognition systems are trained to classify face images into a very limited number of emotions, theirs takes a different approach, training a convolutional neural network (CNN) to distinguish subtle differences in facial expressions. Key to their work is adapting the model to work reliably on video content by filtering out faces unsuited for expression analysis using a novel automatic face quality assessment.
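The article does not give implementation details for the face quality assessment, but the filtering idea can be illustrated with a minimal sketch. Here, `quality_fn` is a hypothetical stand-in for the authors' learned quality model, and the score names are assumptions for illustration only:

```python
def filter_faces(faces, quality_fn, threshold=0.5):
    """Keep only face crops whose quality score clears a threshold.

    faces      -- list of face crops (any representation)
    quality_fn -- hypothetical callable returning a quality score in [0, 1];
                  a stand-in for the paper's learned quality assessment
    threshold  -- minimum score for a face to be used in expression analysis
    """
    return [face for face in faces if quality_fn(face) >= threshold]


# Toy example: faces represented as dicts with a precomputed quality score.
detected = [
    {"id": "frame_001", "quality": 0.9},  # sharp, frontal face
    {"id": "frame_002", "quality": 0.2},  # blurred or occluded face
    {"id": "frame_003", "quality": 0.7},
]
usable = filter_faces(detected, quality_fn=lambda f: f["quality"])
```

In a video pipeline, a gate like this discards blurred, occluded, or extreme-pose faces before they reach the expression model, which is what makes per-frame analysis reliable.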
Facial expressions are one of the most important forms of non-verbal human expression and a key driver of social interaction. But as the authors note, while an emotion is an internal feeling, a facial expression is external and objective, referring to the positions and motions of the muscles beneath the skin of the face. Expressions can indicate our emotions but do not always reflect them, as when we smile while feeling sad. For this reason, Rüfenacht and Shanji set out to train a system to distinguish facial expressions rather than emotions.
Their aim was to objectively distinguish expressions at a very fine granularity, training a system to model the 32 facial muscle actions called Action Units (AUs) defined by the Facial Action Coding System and then combine them into facial expressions. Since it takes a trained human expert one hour to score one minute of video for 32 AUs, the most popular automatic approaches still resort to classifying faces into only six or seven basic expressions. The authors address this complexity by instead using a Facial Expression Comparison framework that trains the system on triplets of faces, each annotated with which face is most dissimilar to the other two. This approach provides an intrinsic way to distinguish different combinations of AUs without the need to explicitly label the faces.
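Triplet-based comparison training of this kind is commonly optimized with a hinge-style triplet loss that pulls the two similar faces together in embedding space and pushes the dissimilar one away. The article does not specify the loss used, so the following is a minimal sketch of the general technique, with the margin value chosen arbitrarily:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss over embedding vectors.

    anchor/positive -- embeddings of the two faces annotated as similar
    negative        -- embedding of the face annotated as most dissimilar
    margin          -- how much farther the negative must be (assumed value)

    The loss is zero once the negative is at least `margin` farther
    from the anchor (in squared Euclidean distance) than the positive.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)


# A satisfied triplet (negative already far away) incurs no loss;
# a violated one (positive farther than negative) is penalized.
a = np.zeros(3)
loss_ok = triplet_loss(a, positive=np.zeros(3), negative=np.ones(3))
loss_bad = triplet_loss(a, positive=np.ones(3), negative=np.zeros(3))
```

Training a CNN to minimize this loss over many annotated triplets yields an embedding in which distance reflects expression similarity, without any explicit AU labels, which matches the "intrinsic" comparison described above.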
To demonstrate its practical usage, the authors applied their approach to searching and tagging video content for specific expressions, using a single face as the query. They were also able to generate expression statistics and summaries for various individuals appearing in a video, comparing the appearance rates of specific facial expressions of the two candidates in the 2020 U.S. presidential debates. These facial expression features can be extended to other diverse applications by adding further classification, ranking, or clustering layers. They suggest applications such as advanced visual similarity search for video that matches and ranks actor expressions, and automatic highlighting of video segments by identifying expressions that particularly capture viewer interest.
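Querying a video with a single face, as described above, amounts to a nearest-neighbor search in the learned expression embedding space. The article does not describe the retrieval step, so this is a generic cosine-similarity sketch under that assumption:

```python
import numpy as np

def rank_by_expression(query_emb, gallery_embs):
    """Rank gallery faces by cosine similarity to a single query face.

    query_emb    -- 1-D expression embedding of the query face
    gallery_embs -- 2-D array, one embedding per detected face in the video

    Returns gallery indices ordered from most to least similar, which is
    enough to tag or retrieve video segments matching the query expression.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                  # cosine similarity per gallery face
    return np.argsort(-sims)      # best match first


# Toy 2-D embeddings: index 1 matches the query exactly.
query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0],
                    [1.0, 0.0],
                    [1.0, 1.0]])
order = rank_by_expression(query, gallery)
```

The per-person statistics mentioned above (e.g. for the debate candidates) then follow by counting how often each identity's faces fall near a given expression in this space.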
Read the complete article in this month's SMPTE Motion Imaging Journal (https://ieeexplore.ieee.org/document/9749801) to learn more about Rüfenacht and Shanji's pioneering work.
#AI, #artificial intelligence, #machine learning, #convolutional network, #CNN, #facial expression analysis, #action unit, #AU, #triplet prediction