Automated Audio Clustering
Navy SBIR 2011.2 - Topic N112-163
ONR - Mrs. Tracy Frost - [email protected]
Opens: May 26, 2011 - Closes: June 29, 2011

N112-163 TITLE: Automated Audio Clustering

TECHNOLOGY AREAS: Information Systems, Sensors

ACQUISITION PROGRAM: PM Intel

OBJECTIVE: Provide a system that can autonomously cluster a large database of audio files by speaker.

DESCRIPTION: Advances in the open availability of audio data and in collection technology are contributing to the overall large data problem for the DoD. As the gap between collection capacity and analytic throughput grows, so does the need for automated analysis. An important enabler is the ability to use sound characteristics to cluster audio files by unique speaker. There is both military and commercial value in being able to rapidly search a large library of previously untagged audio files and retrieve all additional comments made by a newly discovered speaker of interest.

Related technology exists, such as voice print matching, which is used as a biometric to establish identity using text-dependent matchers. Reliable text-independent speaker ID algorithms are limited in that they generally rely on the availability of training data collected under controlled conditions. The goal of the topic is to support research that can cluster a large data store of audio files by unique speaker using sound characteristics, without the availability of training data.

The topic will require a performer to demonstrate that algorithms such as vector quantization, mixture models, self-organizing maps, or artificial intelligence techniques can be successfully employed to cluster very noisy frequency-based data. It is possible that sound will first have to be automatically converted to phonemes or words before clustering algorithms can be applied. A successful performer will develop a system that can cluster files with useful true and false positive rates.
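A minimal Python sketch of the idea above, assuming each audio file has already been reduced to one fixed-length feature vector. It clusters the vectors by vector quantization (Lloyd's algorithm) without training data, then scores the result with pairwise true and false positive rates. The function names `vq_cluster` and `pairwise_rates`, the farthest-point initialization, and the toy two-dimensional features are illustrative assumptions, not requirements of the topic. The ground-truth speaker labels are used only for scoring, never for clustering.

```python
import math
from itertools import combinations


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(vec, codebook[k]))


def vq_cluster(vectors, k, iters=20):
    """Lloyd's algorithm: learn a k-entry codebook and label each vector."""
    # Deterministic farthest-point initialization (an assumption, chosen so
    # the sketch is reproducible without a random seed).
    codebook = [list(vectors[0])]
    while len(codebook) < k:
        far = max(vectors, key=lambda v: min(math.dist(v, c) for c in codebook))
        codebook.append(list(far))
    labels = [nearest(v, codebook) for v in vectors]
    for _ in range(iters):
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cell becomes empty
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [nearest(v, codebook) for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, codebook


def pairwise_rates(labels, speakers):
    """True and false positive rates over all pairs of files.

    A pair counts as 'positive' when both files fall in the same cluster.
    """
    tp = fp = same = diff = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = labels[i] == labels[j]
        if speakers[i] == speakers[j]:
            same += 1
            tp += same_cluster
        else:
            diff += 1
            fp += same_cluster
    tpr = tp / same if same else 0.0
    fpr = fp / diff if diff else 0.0
    return tpr, fpr


# Hypothetical per-file features for two speakers (toy data, not real audio).
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
speakers = ["A", "A", "A", "B", "B", "B"]

labels, codebook = vq_cluster(features, k=2)
tpr, fpr = pairwise_rates(labels, speakers)
print(f"labels={labels} TPR={tpr:.2f} FPR={fpr:.2f}")
```

On this well-separated toy data the two speakers fall into separate clusters. Real audio features are far noisier, and the number of speakers `k` is generally unknown, which is part of what makes the topic hard.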
Both text-dependent and text-independent techniques can be considered, but if text-dependent algorithms are used, the system must use one standard set of phonemes/words that can be identified automatically with high confidence. The objective system should assign a unique ID to each cluster. When the system discovers new audio data, the new audio files should automatically be assigned to an existing cluster or given a new cluster assignment. Periodically, the system should re-run clustering across the entire data set.

Challenges for this topic include:
1) Optimization of extractable voice features for downstream clustering.
2) Implementation of the optimized text-independent audio feature extraction algorithms in both batch and streaming data architectures.
3) Development of a word list whose entries can be easily and reliably recognized and are also useful for extracting voiceprints.
4) Demonstration of the viability of vector quantization, self-organizing maps, mixture models, or a related technique to perform accurate audio clustering using text-independent features, text-dependent features, or both, without training data.
5) Extraction of features from a cluster of audio files that can be used as training data for subsequent matches.

Advances in voice print matching and speaker ID technology can be leveraged, along with recent work in clustering multi-dimensional data, to provide a capability responsive to the topic.

PHASE I: Complete a feasibility study, research plan, and component algorithm testing in order to mature an approach for the development of an audio file clustering system that can be run in batch mode and kept current in streaming mode. Identify the critical technology issues that must be overcome to achieve success. Technical work should focus on the reduction of key risk areas. For a constrained set of audio files, demonstrate through the Phase I risk reduction work that a full implementation of the approach is technically tractable.
Prepare a revised research plan for Phase II that addresses critical issues.

PHASE II: Produce a prototype audio file clustering service that can produce accurate clusters with defining metadata. The prototype should enable a demonstration of the capability to be conducted using relevant data sources, some of which may be classified. The prototype should be capable of operating in both batch and real-time streaming modes. The prototype should be relevant to both DoD and commercial use cases.

PHASE III: Produce a system capable of deployment in an operational setting of interest against relevant data loads. Test the system in a relevant setting in stand-alone mode and as a component of a larger system (programs of record). The work should focus on tailoring the developed capability in order to achieve a transition to a program of record in one or more of the military Services. The system should provide metrics for performance assessment.

REFERENCES:
2. S. Arya, D. M. Mount. "Algorithms for Fast Vector Quantization". Proc. Data Compression Conference, J. A. Storer and M. Cohn, eds., Snowbird, Utah, 1993, IEEE Computer Society Press, 381-390.
3. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. "Speaker Verification Using Adapted Gaussian Mixture Models". Digital Signal Processing, 10, pp. 19-41 (2000).

KEYWORDS: Clustering, Audio, Speaker ID, Voice Prints, Vector Quantization, Self Organizing Maps, Mixture Models