APE Segmentation for Variable Speed Speech Synthesis
Navy SBIR FY2008.2


Sol No.: Navy SBIR FY2008.2
Topic No.: N08-149
Topic Title: APE Segmentation for Variable Speed Speech Synthesis
Proposal No.: N082-149-0097
Firm: Perceptral LLC
1436 Monroe Ave
Racine, Wisconsin 53405-3248
Contact: Jonathan Pearl
Phone: (805) 259-9665
Web Site: www.perceptral.com
Abstract: We propose to apply proprietary technology to meet the needs of variable speed speech synthesis. Our approach involves a new method of sound segmentation that provides not only finer control of speech timing but also extensive control and variability over all manner of prosodic features, including intonation and accent, without undue degradation of sound quality or intelligibility. Current systems fail to provide these capabilities adequately because of the methods by which units are selected, stored, and recalled for concatenation; the means by which these units are manipulated; and the absence of the markup standards required to control prosody in synthesis. The final deliverables of this project will be a stand-alone software package that gives the end user variable control of speech rate in synthesis, and a module integrating this functionality into a new generation of off-the-shelf speech synthesis engines.
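For context on the rate-control problem the abstract describes, the sketch below shows a conventional overlap-add (OLA) time-stretching baseline of the kind current concatenative systems approximate; it is not the proprietary segmentation method proposed here, only a minimal illustration assuming NumPy is available, and the function name ola_time_stretch and its parameters are hypothetical.

import numpy as np

def ola_time_stretch(signal, rate, frame_len=1024, overlap=0.5):
    """Naive overlap-add time-scale modification.

    rate > 1.0 speeds speech up; rate < 1.0 slows it down.
    """
    hop_out = int(frame_len * (1.0 - overlap))   # synthesis hop (output spacing)
    hop_in = max(1, int(round(hop_out * rate)))  # analysis hop (input spacing)
    window = np.hanning(frame_len)

    n_frames = max(1, (len(signal) - frame_len) // hop_in + 1)
    out_len = (n_frames - 1) * hop_out + frame_len
    output = np.zeros(out_len)
    norm = np.zeros(out_len)

    for i in range(n_frames):
        # Take a windowed frame from the input at the analysis position...
        frame = signal[i * hop_in : i * hop_in + frame_len]
        if len(frame) < frame_len:
            frame = np.pad(frame, (0, frame_len - len(frame)))
        # ...and overlap-add it at the (differently spaced) synthesis position.
        start_out = i * hop_out
        output[start_out:start_out + frame_len] += frame * window
        norm[start_out:start_out + frame_len] += window

    norm[norm < 1e-8] = 1.0          # avoid division by zero at the edges
    return output / norm             # normalize the overlapping windows

Example use: ola_time_stretch(waveform, rate=0.8) yields speech roughly 25% slower. Because this kind of fixed-frame manipulation ignores where the perceptually meaningful segment boundaries fall, it introduces the artifacts the abstract attributes to current systems, which is the limitation the proposed segmentation approach is intended to address.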
Benefits: The applications that result from Perceptral's technological advancements will give end users the ability to vary the speed, timbre, and accent of synthesized voice output. They will provide a wealth of flexibility to producers of educational and training platforms, permitting a virtually unlimited range of utterances to be created from a single database, as well as blended utterances produced from multiple databases. They will allow a user to alter the structural components of linguistic units, facilitating changes in speech rate, accent, timbre, and intonation. These capabilities will lead to a new generation of speech synthesis engines that move beyond the text to incorporate naturalistic speech prosody in ways unimaginable today. The implications for team training, education (including second-language learning and accent reduction), informational services (email readers, potentially leading to automated dubbing in multiple dialects or languages), and accessibility (screen readers and customized prosthetic voices) are immense. Beyond training, educational, and accessibility applications, further untapped arenas are animated feature films and the still-unsaturated area of voice and sound design in video games. Our technology will make it possible to produce customized synthetic voices that shadow the capabilities of graphic animation, greatly extending the range of voices and associated sounds available for animated features, shorts, and video games. It will become possible not only to morph a single voice or blend two human voices, but also to produce intermediate, perceptually coherent blends of a human voice with animal calls or environmental sounds.
