Microsoft Creates History; Reports Speech Recognition System With Human-Level Accuracy

Microsoft, led by Satya Nadella, the Indian-born American business executive, has made history by announcing a breakthrough in the field of speech recognition. The team of researchers and engineers at Microsoft Artificial Intelligence and Research reported that they have created a technology that recognizes the words in a conversation as well as a person does.

The team behind the breakthrough (image source: Microsoft)

The team of researchers and engineers, in a paper published on Monday, reported a speech recognition system that makes the “same or fewer errors” than professional transcriptionists. How exactly is the accuracy of a speech recognition system measured? Accuracy is generally measured with a common metric called Word Error Rate (WER). Measuring performance is not entirely straightforward, because the recognized word sequence can have a different length from the reference word sequence (the reference being the correct transcript), so the two sequences must first be aligned before errors can be counted.
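As a rough illustration of how WER handles sequences of different lengths, it is usually computed as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. The sketch below is a generic textbook implementation with made-up sample sentences, not code from Microsoft's system:

```python
def wer(reference, hypothesis):
    """Word Error Rate = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein alignment."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One word ("the") dropped out of a six-word reference: WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A lower WER is better; a system at 5.9 percent makes roughly one such error for every 17 reference words.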

The breakthrough lies in the fact that the team at Microsoft achieved a Word Error Rate (WER) of 5.9 percent, down from the 6.3 percent the same team reported last month. Remember, when it comes to WER, the lower the percentage the better, because we are talking about an “error” rate. An improvement of 0.4 percentage points in just one month is significant. But the real significance of the technology does not lie in that 0.4-point improvement – it lies in the overall WER of 5.9 percent, which is about equal to that of people who were asked to transcribe the same conversation. “We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is an historic achievement.”

The WER of 5.9 percent is also the lowest ever recorded against the industry-standard Switchboard speech recognition task. This means that – for the first time ever – a computer can recognize the words in a conversation as well as a person would. We must repeat that: “as well as a person would.” Getting a machine to recognize speech as well as a person would is a significant step in the field of Artificial Intelligence. Speech recognition has decades of research history behind it, beginning in the 1970s with DARPA-funded work. Later, many major tech companies and research organizations took up speech recognition as a central goal and began their pursuit of technology powerful enough to match human speech recognition abilities.

Deep Neural Networks Behind The Breakthrough

Deep neural networks use large amounts of data to teach computer systems to recognize patterns in inputs such as images or sounds. To reach a WER of 5.9 percent, the research team at Microsoft used its homegrown deep learning system, the Computational Network Toolkit (CNTK). CNTK’s ability to quickly run deep learning algorithms across multiple computers equipped with GPUs sped up the team’s experiments, eventually allowing them to reach human parity. The researchers also used neural language models in which words are represented as continuous vectors in space, so that words like “fast” and “quick” lie close together.

Implications of The Breakthrough

This breakthrough will have broad implications for consumer and business products – including but not limited to consumer entertainment devices, accessibility tools and personal digital assistants like Cortana – that can be significantly augmented by speech recognition, according to Microsoft’s blog. “Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, the executive vice president who heads the Microsoft Artificial Intelligence and Research group. “This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum added. However, the breakthrough doesn’t mean that the computer recognized every word perfectly. It just means that its error rate is the same as you would expect from a person hearing the same conversation.

Shum noted that we are now moving from a world where people must understand computers to a world in which computers must understand us. However, he added that general artificial intelligence is still on the distant horizon. AI will eventually reach levels humanity has never imagined. And when it does, we had better be prepared!