Sebastijan Trojer, Zoja Anžur, Mitja Luštrek and Gašper Slapničar
Abstract
This paper presents a comparative analysis of feature- and embedding-based approaches for audio-visual emotion classification. We compare the performance of traditional handcrafted features, using MediaPipe for visual features and Mel-frequency cepstral coefficients (MFCCs) for audio features, against neural network (NN)-based embeddings obtained from pretrained models suitable for emotion recognition (ER). The study employs separate unimodal datasets for the audio and visual modalities to rigorously assess the performance of each feature set on each modality. Results demonstrate that for visual data, NN-based embeddings significantly outperform handcrafted features in terms of accuracy and F1 score when training a traditional classifier, whereas for audio data, performance is similar across all feature sets. Handcrafted features, such as facial blendshapes computed from MediaPipe keypoints, as well as MFCCs, remain relevant in resource-constrained settings due to their lower computational demands. This research provides insights into the trade-offs between traditional feature extraction methods and modern deep learning techniques, offering guidance for the development of future emotion classification systems.
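To make the handcrafted-feature pipeline concrete, the following minimal sketch extracts per-clip MFCC statistics and trains a traditional classifier on them. It assumes librosa and scikit-learn are available; the file names, labels, and hyperparameters are hypothetical placeholders and do not reflect the paper's actual datasets or configuration.

```python
# Illustrative sketch: MFCC summary statistics per audio clip, fed to a
# traditional classifier. Paths, labels, and hyperparameters are
# placeholders, not the setup used in the paper.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Summarize a clip by the mean and std of each MFCC coefficient."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical list of (wav_path, emotion_label) pairs.
clips = [("clip_001.wav", "happy"), ("clip_002.wav", "sad")]
X = np.stack([mfcc_features(path) for path, _ in clips])
y = [label for _, label in clips]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("macro F1:", f1_score(y_te, pred, average="macro"))
```

A corresponding visual pipeline would replace the MFCC statistics with MediaPipe-derived descriptors (e.g., blendshape coefficients), while the embedding-based variants would substitute features from a pretrained model; the classifier and evaluation metrics stay the same.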