A Novel Unsupervised Audio Clustering Approach in Noisy Environments

Navy SBIR FY2011.2

Sol No.:	Navy SBIR FY2011.2
Topic No.:	N112-163
Topic Title:	A Novel Unsupervised Audio Clustering Approach in Noisy Environments
Proposal No.:	N112-163-0451
Firm:	SIGNAL PROCESSING, INC., 13619 Valley Oak Circle, ROCKVILLE, Maryland 20850-3563
Contact:	Chiman Kwan
Phone:	(240) 505-2641
Web Site:	http://www.signalpro.net
Abstract:	Detecting conversations in a noisy environment is challenging. We propose a three-stage framework for audio clustering. First, we apply computational auditory scene analysis (CASA) as a front end to separate speech from non-speech background noise. Inspired by auditory perception, CASA typically segregates speech from noise by producing a binary time-frequency (T-F) mask, which is then used to reconstruct clean speech. Second, since the reconstructed speech may contain more than one speaker's voice, we propose an unsupervised audio clustering approach to separate the speakers. Unreliable T-F units in simultaneous streams are reconstructed using a speech prior, and cepstral features are then derived for clustering. We search for the two clusters exhibiting the largest speaker difference, measured by the trace of the ratio of the between-cluster to within-cluster scatter matrices. To speed up the search, a genetic algorithm (GA) is employed. Third, after extracting the audio stream of each speaker, we apply the latest speaker identification algorithm developed by our team to each separated voice stream. A robust algorithm is needed at this stage because residual noise may remain in the separated voice streams.
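The first two stages of the abstract can be illustrated with a minimal sketch. It is not the proposal's implementation. It assumes that the binary mask keeps a T-F unit when its local speech-to-noise ratio exceeds 0 dB, and that the clustering criterion is trace(S_w^-1 S_b), one common reading of "the trace of the between- and within-cluster scatter matrix ratio". The function names `binary_mask` and `scatter_trace_ratio`, the 0 dB threshold, and the small ridge term are all hypothetical choices made for illustration. In this toy the speech and noise energies are given, whereas a CASA front end would have to estimate them.

```python
import numpy as np


def binary_mask(speech_energy, noise_energy, threshold_db=0.0):
    """Return 1.0 for T-F units whose local SNR exceeds threshold_db, else 0.0.

    Both inputs are arrays of per-unit energies with the same shape
    (for example, frequency channels by time frames).
    """
    eps = 1e-12  # avoids division by zero and log of zero
    speech_energy = np.asarray(speech_energy, dtype=float)
    noise_energy = np.asarray(noise_energy, dtype=float)
    snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    return (snr_db > threshold_db).astype(float)


def scatter_trace_ratio(features, labels, ridge=1e-6):
    """Score a cluster labeling by trace(S_w^-1 S_b); larger means better separated.

    features: (n_samples, n_dims) array, e.g. cepstral feature vectors.
    labels:   (n_samples,) array of cluster indices.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    dim = features.shape[1]
    mean_all = features.mean(axis=0)
    s_w = np.zeros((dim, dim))  # within-cluster scatter
    s_b = np.zeros((dim, dim))  # between-cluster scatter
    for k in np.unique(labels):
        group = features[labels == k]
        mean_k = group.mean(axis=0)
        centered = group - mean_k
        s_w += centered.T @ centered
        diff = (mean_k - mean_all)[:, None]
        s_b += len(group) * (diff @ diff.T)
    # Assumption: a small ridge keeps S_w invertible for tiny clusters.
    s_w += ridge * np.eye(dim)
    return float(np.trace(np.linalg.solve(s_w, s_b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "speakers" with different feature means.
    feats = np.vstack([rng.normal(0.0, 1.0, (50, 4)),
                       rng.normal(5.0, 1.0, (50, 4))])
    true_labels = np.repeat([0, 1], 50)
    shuffled = rng.permutation(true_labels)
    print("true labeling score:    ", scatter_trace_ratio(feats, true_labels))
    print("shuffled labeling score:", scatter_trace_ratio(feats, shuffled))
```

A search procedure such as the GA mentioned in the abstract would propose candidate labelings and keep those with the highest score; the sketch only shows how a single candidate could be scored.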
Benefits:	The proposed audio processing system has great potential for separating mixed voices and achieving high-performance speaker identification and speech recognition. Beyond these applications, the system can be applied to speech enhancement in communication centers, conference rooms, aircraft cockpits, cars, buses, etc. It can be used for security monitoring in airport terminals and in bus and train stations, where it can pick up multiple conversations from different people and different angles. It can also serve as a front-end processor for any automatic speech recognition system. We expect the new system to significantly improve speech quality in noisy, multi-speaker environments. We estimate the combined market for the applications above at about 10 million dollars, based on a unit cost of $200 and a market size of 50,000 units.
