Integration of Spatial Cue-Based Noise Reduction and Speech Model-Based Source Restoration for Real Time Speech Enhancement
IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
Online ISSN : 1745-1337
Print ISSN : 0916-8508
Regular Section
Integration of Spatial Cue-Based Noise Reduction and Speech Model-Based Source Restoration for Real Time Speech Enhancement
Tomoko KAWASEKenta NIWAMasakiyo FUJIMOTOKazunori KOBAYASHIShoko ARAKITomohiro NAKATANI
Author information
JOURNAL RESTRICTED ACCESS

2017 Volume E100.A Issue 5 Pages 1127-1136

Details
Abstract

We propose a microphone array speech enhancement method that integrates spatial-cue-based source power spectral density (PSD) estimation and statistical speech model-based PSD estimation. The goal of this research was to clearly pick up target speech even in noisy environments such as crowded places, factories, and cars running at high speed. Beamforming with post-Wiener filtering is commonly used in many conventional studies on microphone-array noise reduction. For calculating a Wiener filter, speech/noise PSDs are essential, and they are estimated using spatial cues obtained from microphone observations. Assuming that the sound sources are sparse in the temporal-spatial domain, speech/noise PSDs may be estimated accurately. However, PSD estimation errors increase under circumstances beyond this assumption. In this study, we integrated speech models and PSD-estimation-in-beamspace method to correct speech/noise PSD estimation errors. The roughly estimated noise PSD was obtained frame-by-frame by analyzing spatial cues from array observations. By combining noise PSD with the statistical model of clean-speech, the relationships between the PSD of the observed signal and that of the target speech, hereafter called the observation model, could be described without pre-training. By exploiting Bayes' theorem, a Wiener filter is statistically generated from observation models. Experiments conducted to evaluate the proposed method showed that the signal-to-noise ratio and naturalness of the output speech signal were significantly better than that with conventional methods.

Content from these authors
© 2017 The Institute of Electronics, Information and Communication Engineers
Previous article Next article
feedback
Top