X. Liu, R. Seiilova-Olson, A. Shamei, H. Sinan
Tenvos Inc.,
United States
Keywords: voice biomarkers, large acoustic models, impairment detection, occupational safety
Summary:
Human speech is a complex system that involves synthesis and synergy between specific brain areas, vocal cords, audio-sensory processing, and knowledge of a language. The effect of intoxication and hazardous fatigue on motor function is a well-known phenomenon. Speech production, a fine motor skill that involves over 100 vocal articulator muscles, is affected by the speaker's state as well. For instance, the production of sounds such as “b”, “d”, and “g” has been found to change as a result of reduced sleep, and mispronunciation of “s” as “sh” has been observed under alcohol intoxication. In addition to specific phoneme production, higher-level speech characteristics such as speech rate and pause duration are also affected by the speaker's state. While the classification of voice samples into various states has been studied extensively using explainable features and classical machine learning approaches, large acoustic models (LAMs) have been explored only to a limited extent. Although less transparent, LAMs outperform classical ML models in most cases. In this paper, we look under the hood and analyze the embeddings to gain insight into what exactly is encoded in the 1024-dimensional vectors in the context of alcohol intoxication. We use the Alcohol Language Corpus (ALC) and the Wav2Vec2.0 self-supervised learning model for our experiments. According to the Bureau of Labor Statistics, 38% of workers' compensation claims are due to substance use. The National Safety Council estimates that 13% of workplace injuries can be attributed to fatigue. Brief, non-invasive, and affordable voice-based impairment screening that can be used before every shift can significantly improve safety by ensuring fitness for duty among safety-sensitive workers.
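As a rough illustration of what analyzing 1024-dimensional embeddings can involve, the sketch below mean-pools frame-level vectors into one vector per utterance and then ranks dimensions by their correlation with a binary sober/intoxicated label. This is not the procedure used in the paper, which the summary does not specify. The data are synthetic random numbers standing in for Wav2Vec2.0 hidden states, and the function names, the pooling choice, and the planted signal in dimension 7 are all assumptions made for the example.

```python
import numpy as np

EMBED_DIM = 1024  # embedding size mentioned in the summary


def pool_utterance(frames: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level vectors (num_frames, dim) into one vector (dim,)."""
    if frames.ndim != 2:
        raise ValueError("expected an array of shape (num_frames, dim)")
    return frames.mean(axis=0)


def rank_dimensions(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Sort dimension indices by |Pearson correlation| with labels y, strongest first."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Guard against constant dimensions (zero variance) to avoid division by zero.
    corr = (Xc * yc[:, None]).sum(axis=0) / np.where(denom == 0, 1.0, denom)
    return np.argsort(-np.abs(corr))


rng = np.random.default_rng(0)

# Synthetic stand-in data: 40 utterances, alternating labels (0 = sober, 1 = intoxicated).
labels = np.array([0, 1] * 20)
utterances = []
for lab in labels:
    num_frames = int(rng.integers(50, 200))  # utterances differ in length
    frames = rng.normal(size=(num_frames, EMBED_DIM))
    frames[:, 7] += 2.0 * lab  # hypothetical: plant a label-dependent shift in dimension 7
    utterances.append(pool_utterance(frames))

X = np.stack(utterances)          # shape (40, 1024)
top = rank_dimensions(X, labels)  # dimension indices, most label-correlated first
print("most label-correlated dimensions:", top[:5])
```

With real data, the synthetic `frames` arrays would be replaced by the model's frame-level outputs for each recording. A per-dimension correlation is only a first look, since information in such embeddings is often distributed across many dimensions.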