J. Tan
National University of Singapore,
Singapore
Keywords: explainable artificial intelligence, emotion recognition
Summary:
In an increasingly digitised world, data has become more accessible than ever. With this greater availability, learning-based artificial intelligence (AI) has made great strides across fields such as computer vision, natural language processing and speech processing. However, these models are often complex and opaque, which limits their real-world deployment. Making them easier to explain and understand would allow them to be adopted more widely. Audio prediction, in particular, requires explanations that are relatable and familiar to users. Current explanation techniques for audio present saliency maps over audiograms or spectrograms, which are technical and can be difficult for lay users to interpret. As audio applications become more common in smart homes and healthcare institutions, there is a growing need for AI models whose predictions can be related to examples that users understand. To truly earn users' trust, AI models should also draw inspiration from human decision-making and reason in ways that mirror how humans think. New systems must therefore be easy to relate to, explain and understand.

RexNet, or Relatable Explanation Network, is a modular multi-task deep learning model with dedicated modules for multiple explanation types. The technology improves prediction performance while producing relatable explanations for audio-based prediction models, and it can potentially be extended to image-based and other AI perception tasks. Designed for vocal emotion recognition, RexNet can be deployed in smart homes and mental health settings to identify stress, user engagement and more.
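This summary does not detail RexNet's internals, but a modular multi-task design of this kind is commonly realised as a shared encoder feeding one head per task. The sketch below is a minimal, hypothetical PyTorch illustration of that pattern, not the published architecture: the module names, layer sizes and the two example explanation types (per-band saliency and relatable vocal cues) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskExplainerSketch(nn.Module):
    """Hypothetical multi-task model: a shared audio encoder feeding
    one emotion-prediction head and several explanation heads."""
    def __init__(self, n_mels=64, n_emotions=6, n_cues=8):
        super().__init__()
        # Shared encoder over a log-mel spectrogram (sizes are assumptions).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Task head: vocal emotion prediction.
        self.emotion_head = nn.Linear(32, n_emotions)
        # Explanation heads, one module per explanation type (hypothetical).
        self.saliency_head = nn.Linear(32, n_mels)  # per-frequency-band saliency scores
        self.cue_head = nn.Linear(32, n_cues)       # relatable vocal cues (e.g. pitch, tempo)

    def forward(self, spec):  # spec: (batch, 1, n_mels, time)
        z = self.encoder(spec)
        return {
            "emotion": self.emotion_head(z),
            "saliency": self.saliency_head(z),
            "cues": self.cue_head(z),
        }

# Multi-task training step: the task loss and an explanation loss are
# combined with a hypothetical weighting, so all modules train jointly.
model = MultiTaskExplainerSketch()
spec = torch.randn(4, 1, 64, 128)  # dummy batch of spectrograms
out = model(spec)
emotion_loss = nn.functional.cross_entropy(
    out["emotion"], torch.randint(0, 6, (4,)))
cue_loss = nn.functional.binary_cross_entropy_with_logits(
    out["cues"], torch.rand(4, 8))
loss = emotion_loss + 0.5 * cue_loss
loss.backward()
```

Because the explanation heads share the encoder with the prediction head, the explanation objectives act as auxiliary supervision, which is one way a multi-task design of this kind can improve prediction performance while also producing explanations.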