SPIIRAS Proceedings, Volume 3, Issue 58, 2018, Pages 27-52

HMM-based whisper recognition using μ-law frequency warping (Article) (Open Access)

  • Galić, J.N.,
  • Jovičić, S.T.,
  • Delić, V.D.,
  • Marković, B.R.,
  • Šumarac Pavlović, D.S.,
  • Grozdić, D.T.
  • (a) School of Electrical Engineering, University of Belgrade, Serbia
  • (b) Faculty of Electrical Engineering, University of Banja Luka, 5, Patre Republic of Srpska, Banja-Luka, 78000, Bosnia and Herzegovina
  • (c) Department of Telecommunications, School of Electrical Engineering, University of Belgrade, 73, Bul. Kralja Aleksandra, Belgrade, 11120, Serbia
  • (d) Department of Power, Electronic and Telecommunications Engineering, Faculty of Technical Sciences, University of Novi Sad, 6, Trg Dositeja Obradovića, Novi Sad, 21000, Serbia
  • (e) School of Electrical Engineering, University of Belgrade, Čačak Technical College, 65, Svetog Save, Čačak, 32000, Serbia
  • (f) School of Electrical Engineering, University of Belgrade, 73, Bul. Kralja Aleksandra, Belgrade, 11120, Serbia
  • (g) Fincore Ltd., 7, Mutapova, Belgrade, 11000, Serbia

Abstract

Because too little whisper data is available for training, whispered speech recognition is a serious challenge for state-of-the-art Automatic Speech Recognition (ASR) systems. Owing to the large acoustic mismatch between neutral and whispered speech, ASR systems suffer a significant drop in performance when applied to whisper. In this paper, we analyze neutral and whispered speech recognition based on the traditional Hidden Markov Model (HMM) framework, in both Speaker Dependent (SD) and Speaker Independent (SI) cases. Special attention is paid to neutral-trained recognition of whispered speech (the N/W scenario). The ASR system is developed for recognition of isolated words from a real database (Whi-Spe) of neutral-whisper speech pairs. In the N/W scenario, a meaningful gain in robustness is achieved with the proposed frequency warping, which was originally developed for compressing and expanding speech signals in digital telecommunication systems. At the same time, good performance in recognition of neutral speech is retained. Compared to baseline recognition with Mel-frequency Cepstral Coefficients (MFCC), word recognition accuracy with cepstral coefficients using the proposed frequency warping (denoted μFCC) improves by 7.36% (SD) and 3.44% (SI), absolute. In addition, the F-measure (the harmonic mean of precision and recall) for μFCC feature vectors increases by 6.90% (SD) and 3.59% (SI). Statistical tests confirm the significance of the achieved improvement in recognition accuracy. © 2018 St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences. All rights reserved.
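
The abstract does not give the warping formula or its parameters, so the following is only an illustrative sketch, not the authors' implementation. It applies the standard μ-law compression curve from telephony, F(x) = ln(1 + μx) / ln(1 + μ), to a normalized frequency axis and places filter-bank center frequencies uniformly on the warped scale, as is commonly done with the mel scale. The value μ = 255, the 8000 Hz upper frequency, the number of filters, and all function names are assumptions chosen for illustration. The F-measure function follows the definition given in the abstract.

```python
import math


def mu_law_warp(f_hz, f_max_hz=8000.0, mu=255.0):
    """Map a frequency in [0, f_max_hz] to a warped value in [0, 1].

    Uses the standard mu-law compression curve; mu and f_max_hz are
    assumed values, not taken from the paper.
    """
    if not 0.0 <= f_hz <= f_max_hz:
        raise ValueError("frequency outside [0, f_max_hz]")
    x = f_hz / f_max_hz
    return math.log1p(mu * x) / math.log1p(mu)


def mu_law_unwarp(w, f_max_hz=8000.0, mu=255.0):
    """Inverse of mu_law_warp: map a warped value in [0, 1] back to Hz."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("warped value outside [0, 1]")
    return f_max_hz * math.expm1(w * math.log1p(mu)) / mu


def mu_law_filter_centers(n_filters, f_max_hz=8000.0, mu=255.0):
    """Center frequencies (Hz) spaced uniformly on the warped axis."""
    step = 1.0 / (n_filters + 1)
    return [mu_law_unwarp(i * step, f_max_hz, mu)
            for i in range(1, n_filters + 1)]


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

With these assumed parameters, `mu_law_filter_centers(26)` yields centers that are densely packed at low frequencies and increasingly sparse toward `f_max_hz`; how closely this matches the filter bank actually used for μFCC cannot be determined from the abstract alone.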

Author keywords

Automatic speech recognition; Feature extraction; Hidden Markov models; Human voice; Speech processing; Whisper
  • ISSN: 20789181
  • Source Type: Journal
  • Original language: English
  • DOI: 10.15622/sp.58.2
  • Document Type: Article
  • Publisher: St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences


© Copyright 2018 Elsevier B.V., All rights reserved.

Cited by 2 documents

Galic, J., Marković, B.
The Recognition of Bimodal Produced Speech based on Multi-style Training
(2020) 2020 Zooming Innovation in Consumer Technologies Conference, ZINC 2020

Galić, J., Popović, B., Pavlović, D.Š.
Whispered speech recognition using hidden Markov models and support vector machines
(2018) Acta Polytechnica Hungarica