Yesterday, in a blog post, Microsoft chief research officer Rick Rashid described a demonstration in which spoken English was translated into spoken Mandarin Chinese in real time.
The Chinese text-to-speech output incorporated elements of the speaker's own voice, drawing cheers from the crowd of 2,000 mostly Chinese students.
The system works in two steps. The first takes the spoken English words and finds their Chinese equivalents; the second reorders those words into natural Chinese word order and converts the resulting text to speech.
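The two steps can be sketched as a toy pipeline. Everything here is an illustrative stand-in, not Microsoft's actual code: the word dictionary, the identity reordering rule, and the join-characters "synthesizer" are placeholders for the real translation, reordering, and text-to-speech models.

```python
# Toy sketch of the two-step pipeline described above (illustrative only).

# Step 1: find the Chinese equivalent of each English word.
# A real system uses a learned translation model, not a fixed dictionary.
EN_TO_ZH = {"I": "我", "love": "爱", "you": "你"}

def find_equivalents(english_words):
    return [EN_TO_ZH[w] for w in english_words]

# Step 2a: reorder the words for Chinese grammar. "I love you" happens to
# keep subject-verb-object order in Chinese, so this toy rule is the
# identity; a real system learns reorderings from data.
def reorder(chinese_words):
    return chinese_words

# Step 2b: stand-in for text-to-speech. The demoed system rendered audio
# in a voice built from the speaker's own recordings; here we just join
# the characters into a sentence.
def synthesize(chinese_words):
    return "".join(chinese_words)

def translate(english_sentence):
    words = english_sentence.split()
    return synthesize(reorder(find_equivalents(words)))

print(translate("I love you"))  # → 我爱你
```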
Microsoft’s technology is built on Deep Neural Networks (DNNs), a technique patterned on aspects of human brain behavior that yields better speech recognizers. According to Rashid, this has reduced the error rate by over 30 percent compared with the previous hidden-Markov-model approach.
Until recently, even the best speech systems had word error rates of 20-25% on arbitrary speech. Where older models made an error once in every four or five words, the DNN-based system errs on roughly one word in seven or eight.
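As a quick arithmetic check on these figures: going from "one error in four or five words" to "one in seven or eight" is, at the midpoints of those ranges, roughly a 40% relative reduction, which is consistent with the "over 30 percent" claim.

```python
# Sanity check of the quoted error rates.
old_low, old_high = 1 / 5, 1 / 4    # one error in 4-5 words = 20-25%
new_low, new_high = 1 / 8, 1 / 7    # one error in 7-8 words ≈ 12.5-14.3%

old_mid = (old_low + old_high) / 2  # 22.5%
new_mid = (new_low + new_high) / 2  # ≈ 13.4%

reduction = 1 - new_mid / old_mid   # relative reduction in error rate
print(f"{reduction:.0%}")           # → 40%
```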