Human personality traits (PT) reflect individual differences in patterns of thinking, feeling, and behaving. Knowledge of PT may be useful in many applied tasks in our everyday life. In this paper, we present the first open-source multimodal framework, called OCEAN-AI, for PT assessment (PTA) and automation of HR processes. Our framework performs PTA by analyzing three modalities (audio, video, and text) and includes three modality-specific processing modules. Each module extracts heterogeneous (deep neural and hand-crafted) features and uses them for a complex analysis of human behavior. A fourth module aggregates the resulting six feature sets with a Siamese neural network equipped with a gated attention mechanism. Our framework was tested on two freely available corpora, First Impressions v2 and our MuPTA, and achieved the best results. Using our framework, a user can automate the solution of several practical tasks, such as ranking potential candidates by professional responsibilities, forming efficient work teams, and so on.
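To illustrate how such an aggregation can work, below is a minimal PyTorch sketch of gated attention fusion over six pre-extracted feature sets; the module names, dimensions, and the sigmoid trait head are illustrative assumptions, not the actual OCEAN-AI implementation.

```python
# Minimal sketch: gated attention fusion of six feature sets (illustrative only).
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    def __init__(self, dim: int = 128, n_traits: int = 5):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # per-set feature gate
        self.attn = nn.Linear(dim, 1)                                 # importance score per set
        self.head = nn.Linear(dim, n_traits)                          # Big Five regression head

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_sets, dim), six heterogeneous feature sets projected to a common dim
        gated = feats * self.gate(feats)                   # suppress uninformative dimensions
        weights = torch.softmax(self.attn(gated), dim=1)   # weight each feature set
        fused = (weights * gated).sum(dim=1)               # weighted aggregation
        return torch.sigmoid(self.head(fused))             # trait scores in [0, 1]

scores = GatedAttentionFusion()(torch.randn(4, 6, 128))    # 4 samples, 6 feature sets
print(scores.shape)  # torch.Size([4, 5])
```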
People tend to judge others by assessing their personality traits, relying on life experience. This is especially evident when making an informed hiring decision, which should consider not only skills but also fit with a company's values and culture. Based on this assumption, we use a Siamese Network (SN) to assess the five personality traits by pairwise analyzing and comparing people simultaneously. To this end, we propose the OCEAN-AI framework based on a Gated Siamese Fusion Network (GSFN), which comprises six modules and enables the fusion of hand-crafted and deep features across three modalities (video, audio, and text). We use the ChaLearn First Impressions v2 (FIv2) and Multimodal Personality Traits Assessment (MuPTA) corpora and show that all six feature sets and their combinations, owing to their different information content, allow the framework to flexibly adjust to heterogeneous input data. The experimental results show that the pairwise comparison of people with the same or different Personality Traits (PT) during training enhances the performance of the proposed framework. The framework outperforms the State-of-the-Art (SOTA) systems based on three modalities (video-face, audio, and text) by a relative value of 1.3% (0.928 vs. 0.916) in terms of the mean accuracy (mACC) on the FIv2 corpus. We also outperform the SOTA system in terms of the Concordance Correlation Coefficient (CCC) by a relative value of 8.6% (0.667 vs. 0.614) using two modalities (video and audio) on the MuPTA corpus. We make our framework publicly available so that it can be integrated into various applications such as recruitment, education, and healthcare.
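For reference, the CCC reported above can be computed with the standard formula shown in the NumPy sketch below; this is the generic metric definition, not code taken from the framework.

```python
# Concordance Correlation Coefficient (CCC) between reference and predicted trait scores.
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return float(2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2))
```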
Compound Expression Recognition (CER), a subfield of affective computing, is a novel task in intelligent human-computer interaction and multimodal user interfaces. We propose a novel audio-visual method for CER. Our method relies on emotion recognition models that fuse modalities at the emotion probability level, while decisions regarding the prediction of compound expressions are based on the pair-wise sum of weighted emotion probability distributions. Notably, our method does not use any training data specific to the target task, so the problem is a zero-shot classification task. The method is evaluated in multi-corpus training and cross-corpus validation setups. We achieved F1 scores of 32.15% and 25.56% on the AffWild2 and C-EXPR-DB test subsets without training on the target corpus and the target task, respectively. Therefore, our method is on par with methods trained on the target corpus or for the target task.
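The decision rule can be sketched as follows; the compound classes, weights, and probabilities here are purely illustrative assumptions meant to show the pair-wise summation idea, not the exact configuration used in the experiments.

```python
# Sketch: compound expression scores as pair-wise sums of weighted emotion probabilities.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
# Hypothetical compound classes defined as pairs of basic emotions.
COMPOUNDS = {
    "happily_surprised": ("happiness", "surprise"),
    "fearfully_surprised": ("fear", "surprise"),
    "sadly_angry": ("sadness", "anger"),
}

def compound_scores(probs: np.ndarray, weights: np.ndarray) -> dict:
    idx = {e: i for i, e in enumerate(EMOTIONS)}
    weighted = probs * weights                      # per-emotion weighting
    return {c: weighted[idx[a]] + weighted[idx[b]]  # pair-wise sum
            for c, (a, b) in COMPOUNDS.items()}

probs = np.array([0.05, 0.10, 0.30, 0.05, 0.10, 0.40])   # fused audio-visual probabilities
scores = compound_scores(probs, weights=np.ones(len(EMOTIONS)))
print(max(scores, key=scores.get))                        # -> "fearfully_surprised"
```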
This article presents a research methodology for audio-visual speech recognition (AVSR) in driver assistive systems. For safety reasons, such systems require ongoing voice-controlled interaction with drivers while driving. The article introduces a novel audio-visual speech command recognition transformer (AVCRFormer) specifically designed for robust AVSR. We propose (i) a multimodal fusion strategy based on spatio-temporal fusion of audio and video feature matrices, (ii) a regulated transformer based on an iterative model refinement module with multiple encoders, and (iii) a classifier ensemble strategy based on multiple decoders. The spatio-temporal fusion strategy preserves the contextual information of both modalities and achieves their synchronization. The iterative model refinement module can bridge the gap between acoustic and visual data by leveraging their impact on speech recognition accuracy. The proposed multi-prediction strategy demonstrates superior performance compared to a traditional single-prediction strategy, showcasing the model's adaptability across diverse audio-visual contexts. The proposed transformer achieves the highest speech command recognition accuracy, reaching 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively. This research has significant implications for advancing human-computer interaction. The capabilities of AVCRFormer extend beyond AVSR, making it a valuable contribution to the intersection of audio-visual processing and artificial intelligence.
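A minimal PyTorch sketch of the general idea of fusing temporally aligned audio and video feature matrices before a transformer encoder is given below; the dimensions, interpolation-based alignment, and single classification head are assumptions for illustration and do not reproduce AVCRFormer's iterative refinement or multi-decoder ensemble.

```python
# Sketch: spatio-temporal fusion of audio and video feature matrices (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalFusion(nn.Module):
    def __init__(self, a_dim=256, v_dim=512, d_model=256, n_classes=62):
        super().__init__()
        self.a_proj = nn.Linear(a_dim, d_model)            # project audio features
        self.v_proj = nn.Linear(v_dim, d_model)            # project video features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(d_model, n_classes)           # speech command logits

    def forward(self, audio, video):
        # audio: (B, Ta, a_dim), video: (B, Tv, v_dim); align video to the audio time axis
        video = F.interpolate(video.transpose(1, 2), size=audio.size(1)).transpose(1, 2)
        fused = self.a_proj(audio) + self.v_proj(video)    # synchronized element-wise fusion
        return self.cls(self.encoder(fused).mean(dim=1))

logits = SpatioTemporalFusion()(torch.randn(2, 100, 256), torch.randn(2, 25, 512))
print(logits.shape)  # torch.Size([2, 62])
```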
Earlier psychological and neurological studies suggested that a personality type can be determined by the whole face as well as by its sides. This article presents novel research using deep neural networks that address the features of both sides of the face (hemifaces) to assess the human Big Five personality traits (PT). For this purpose, we have developed a real-time approach called EmoFormer with cross-hemiface attention. The novelty of the presented approach lies in the confirmation that each hemiface on its own exhibits high predictive capability for PT assessment. Our approach is based on a novel mid-level emotional feature extractor for each hemiface and a cross-hemiface attention fusion strategy for hemiface feature aggregation. The subsequent fusion of both hemifaces outperforms the use of the whole face by a relative value of 3.6% in terms of the Concordance Correlation Coefficient (0.634 vs. 0.612) on the ChaLearn First Impressions V2 corpus. The proposed approach also outperforms all existing state-of-the-art approaches for PT assessment based on the face modality. We have also analyzed the "best hemiface", the one that predicts PT more accurately, with respect to demographic characteristics (gender, ethnicity, and age). We have found that the best hemiface for two of the five PT (Openness to Experience and Non-Neuroticism) differs depending on demographic characteristics. For the other three traits, the right hemiface is dominant for Extraversion, while the left one is more indicative of Conscientiousness and Agreeableness. These findings support previous psychological and neurological research. In addition, we provide an open-source framework referred to as OCEAN-AI that can be seamlessly integrated into expert systems with practical applications in various domains, including healthcare, education, and human resources.
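The fusion strategy can be sketched as a pair of cross-attention blocks in which each hemiface attends to the other; the shapes, head count, and pooling below are illustrative assumptions rather than the actual EmoFormer architecture.

```python
# Sketch: cross-hemiface attention fusion of left/right hemiface features (illustrative only).
import torch
import torch.nn as nn

class CrossHemifaceAttention(nn.Module):
    def __init__(self, dim=256, n_traits=5):
        super().__init__()
        self.l2r = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.r2l = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * dim, n_traits)

    def forward(self, left, right):
        # left, right: (B, T, dim) mid-level emotional features of each hemiface
        l_att, _ = self.l2r(left, right, right)   # left queries attend to the right hemiface
        r_att, _ = self.r2l(right, left, left)    # right queries attend to the left hemiface
        fused = torch.cat([l_att.mean(dim=1), r_att.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.head(fused))    # Big Five scores in [0, 1]

traits = CrossHemifaceAttention()(torch.randn(4, 16, 256), torch.randn(4, 16, 256))
print(traits.shape)  # torch.Size([4, 5])
```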
Audio-visual speech recognition (AVSR) is gaining increasing attention as an important part of human-machine interaction. However, the publicly available corpora are limited, particularly for driving conditions with prevalent background noise. Data collected so far come from constrained environments and thus cannot reflect the true performance of AVSR systems in real-world scenarios. Moreover, data for languages other than English are often unavailable. To meet the demand for research on AVSR in unconstrained driving conditions, this paper presents a corpus collected in-the-wild. We propose a cross-modal attention method enhancing multi-angle AVSR for vehicles, leveraging visual context to improve accuracy and noise robustness. Our proposed model achieves state-of-the-art (SOTA) results with 98.65% accuracy in recognizing driver voice commands.
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be a very important part of modern human-computer interaction systems. Currently, audio and video modalities are easily accessible by the sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in the fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora, LRW and AUTSL, and outperformed existing methods on both the audio-visual speech recognition and gesture recognition tasks. We achieved an AVSR accuracy of 98.76% on the LRW dataset and a gesture recognition rate of 98.56% on the AUTSL dataset. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures with the sensors of mobile devices.
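To make the fusion levels concrete, the snippet below contrasts prediction-level and feature-level fusion in a few lines (model-level fusion would instead exchange intermediate representations inside a joint network); the weights and shapes are illustrative assumptions, not the trained configuration.

```python
# Sketch: prediction-level vs. feature-level fusion for AVSR (illustrative only).
import torch

def prediction_level_fusion(audio_logits, video_logits, w_audio=0.5):
    # Combine class probabilities produced by separate unimodal classifiers.
    return w_audio * audio_logits.softmax(-1) + (1 - w_audio) * video_logits.softmax(-1)

def feature_level_fusion(audio_feats, video_feats, joint_classifier):
    # Concatenate modality embeddings and classify them jointly.
    return joint_classifier(torch.cat([audio_feats, video_feats], dim=-1))

fused_probs = prediction_level_fusion(torch.randn(2, 500), torch.randn(2, 500))
print(fused_probs.shape)  # torch.Size([2, 500])
```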
Automatic personality traits assessment (PTA) provides high-level, intelligible predictive inputs for subsequent critical downstream tasks, such as job interview recommendations and mental healthcare monitoring. In this work, we introduce a novel Multimodal Personality Traits Assessment (MuPTA) corpus. Our MuPTA corpus is unique in that it contains both spontaneous and read speech collected in the mid-resourced Russian language. We present a novel audio-visual approach for PTA that is used to establish baseline results on this corpus. We further analyze the impact of both spontaneous and read speech on PTA predictive performance. We find that for the audio modality, the PTA predictive performance on short signals is almost equal regardless of the speech type, while PTA using the video modality is more accurate with spontaneous speech than with read speech, regardless of the signal length.
DAVIS is a driver's audio-visual assistive system intended to improve the accuracy and robustness of speech recognition of the most frequent drivers' requests in natural driving conditions. Speech recognition in driving conditions is highly challenging due to acoustic noise, active head turns, pose variation, distance to recording devices, lighting conditions, etc. Therefore, we rely on multimodal information and use both an automatic lip-reading system for the visual stream and an ASR system for audio stream processing. We have trained the audio and video models on our own RUSAVIC dataset containing in-the-wild audio and video recordings of 20 drivers. The recognition application comprises a graphical user interface and modules for audio and video signal acquisition, analysis, and recognition. The obtained results demonstrate the rather high performance of DAVIS and also the fundamental possibility of recognizing speech commands using the video modality, even in such difficult natural conditions as driving.
In this paper, we present a new multimodal corpus called Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS), which is designed for analyzing the voice and facial characteristics of persons wearing various masks, as well as for developing automatic systems for bimodal verification and identification of speakers. In particular, we tackle the multimodal mask type recognition task (6 classes). As a result, audio, visual, and multimodal systems were developed, which achieved UAR of 54.83%, 72.02%, and 82.01%, respectively, on the Test set. These results serve as baselines for the BRAVE-MASKS corpus, against which follow-up approaches can be compared.
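For clarity, the Unweighted Average Recall (UAR) reported above averages per-class recall over the six mask classes regardless of class frequency; a minimal NumPy sketch of the metric (not the evaluation code of the corpus) is given below.

```python
# Unweighted Average Recall (UAR): mean of per-class recalls.
import numpy as np

def uar(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))
```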
In this paper, we present a novel multimodal interaction application that helps car drivers and increases their road safety. MIDriveSafely is a mobile application that provides the following functions: (1) detecting dangerous situations based on video information from a smartphone front-facing camera, such as drowsiness/sleepiness, phone usage while driving, eating, smoking, an unfastened seat belt, etc., and giving feedback to the driver; (2) providing entertainment (e.g., a rock-paper-scissors game based on automatic speech recognition); (3) providing voice control capabilities for the navigation/multimedia systems of a smartphone (and potentially vehicle systems such as lighting/climate control). Speech recognition in driving conditions is highly challenging due to acoustic noise, active head turns, pose variations, distance to recording devices, etc. MIDriveSafely incorporates the driver's audio-visual speech recognition (DAVIS) system and uses it for multimodal interaction. Along with this, the original DriveSafely system is used for dangerous state detection. MIDriveSafely improves upon existing driver monitoring applications by using multimodal (mainly audio-visual) information. MIDriveSafely motivates people to drive in a safer manner by providing feedback to drivers and by creating a fun user experience.
Visual speech recognition, or automated lip-reading, is a field of growing attention. Video data have proved useful in multimodal speech recognition, especially when acoustic data are heavily noised or even inaccessible. In this paper, we present a novel method for visual speech recognition. We benchmark it on the well-known LRW lip-reading dataset, outperforming the existing approaches. After a comprehensive evaluation, we adapt the developed method and test it on the RUSAVIC corpus we recorded in-the-wild for vehicle drivers. The results obtained demonstrate not only the high performance of the proposed method, but also the fundamental possibility of recognizing speech using only the video modality, even in such difficult natural conditions as driving.
We present a new audio-visual speech corpus (RUSAVIC) recorded in a car environment and designed for noise-robust speech recognition. Our goal was to produce a speech corpus that is natural (recorded in real driving conditions), controlled (providing different SNR levels via windows open/closed, moving/parked vehicle, etc.), and of adequate size (the amount of data is sufficient to train state-of-the-art NN approaches). We focus on the problem of audio-visual speech recognition: using automated lip-reading to improve the performance of audio-based speech recognition in the presence of severe acoustic noise caused by road traffic. We also describe the equipment and procedures used to create the RUSAVIC corpus. Data were collected synchronously with several smartphones located at different angles and equipped with a FullHD video camera and a microphone. The corpus includes recordings of 20 drivers with a minimum of 10 recording sessions each. Besides providing a detailed description of the dataset and its collection pipeline, we evaluate several popular audio and visual speech recognition methods and present a set of baseline recognition results. At the moment, RUSAVIC is a unique audio-visual corpus for the Russian language recorded in in-the-wild conditions, and we make it publicly available.
This paper introduces a new methodology for creating an in-the-wild multimodal corpus, designed to be comfortable for the driver, for audio-visual speech recognition in driver monitoring systems. The presented methodology is universal and can be used for corpus recording in different languages. We present an analysis of speech recognition systems and voice interfaces for driver monitoring systems based on the analysis of both audio and video data. Multimodal speech recognition allows using audio data when video data are useless (e.g., at nighttime), as well as applying video data in acoustically noisy conditions (e.g., on highways). Our methodology identifies the main steps and requirements for multimodal corpus design, including the development of a new framework for audio-visual corpus creation. We identify the main research questions related to the speech corpus creation task and discuss them in detail in this paper. We also consider the main use cases that require speech recognition in a vehicle cabin for interaction with a driver monitoring system, as well as other important use cases in which the system detects dangerous states such as driver drowsiness and starts a question-answer game to prevent dangerous situations. Finally, based on the proposed methodology, we developed a mobile application that allows us to record a corpus for the Russian language. Using the developed mobile application, we created the RUSAVIC corpus, which is at the moment a unique audio-visual corpus for the Russian language recorded in in-the-wild conditions.
In this paper, a new Russian sign language multimedia database, TheRuSLan, is presented. The database includes lexical units (single words and phrases) from Russian sign language within one subject area, namely "food products at the supermarket", and was collected using an MS Kinect 2.0 device in both FullHD video and depth map modes, which provides new opportunities for the lexicographical description of the Russian sign language vocabulary and enhances research in the field of automatic gesture recognition. Russian sign language has an official status in Russia, and over 120,000 deaf people in Russia and neighboring countries use it as their first language. Russian sign language has no writing system, is poorly described, and belongs to the low-resource languages. The authors formulate the basic principles of annotation of sign words based on the collected data and describe the content of the collected database. In the future, the database will be expanded to comprise more lexical units. The database is explicitly made for the task of creating an automatic system for Russian sign language recognition.