IBM this week said its speech recognition system set an industry record with a 5.5% word error rate, a level at which a computer understands human conversation almost as well as the average person does.
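Word error rate is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a system's transcript and a human reference transcript, divided by the number of words in the reference. A minimal sketch in Python of the standard calculation (the function name and comments are illustrative, not drawn from IBM's or Microsoft's systems):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

By this measure, one wrong word in every 20 reference words yields a 5% error rate, which puts the 5.5% and 5.1% figures in the story in concrete terms.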
According to IBM, human parity had been considered a 5.9% word error rate, but IBM, in partnership with Appen, a speech and language technology services provider, reassessed the industry benchmark and determined that human parity is lower than anything yet achieved: 5.1%.
“Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the ultimate industry goal. Others in the industry are chasing this milestone alongside us, and some have recently claimed reaching 5.9% as equivalent to human parity…but we’re not popping the champagne yet. As part of our process in reaching today’s milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1%,” wrote George Saon, principal research scientist at IBM, in a blog post on the subject.
That reassessment, however, might ruffle some feathers: in October, Microsoft’s Artificial Intelligence and Research group said its speech recognition system had attained “human parity” and made fewer errors than a human professional transcriptionist.
“The error rate of professional transcriptionists is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state-of-the-art, and edges past the human benchmark. This marks the first time that human parity has been reported for conversational speech,” the researchers wrote in their paper. Switchboard is a standard set of conversational speech and text used in speech recognition tests.
The 5.9% error rate is about equal to that of people who were asked to transcribe the same conversation, and it is the lowest ever recorded against the industry-standard Switchboard speech recognition task, Microsoft wrote on its website.
IBM’s Saon wrote: “We also realized finding a standard measurement for human parity across the industry is more complex than it seems. Beyond SWITCHBOARD, another industry corpus, known as “CallHome,” offers a different set of linguistic data that can be tested, which is created from more colloquial conversations between family members on topics that are not pre-fixed. Conversations from CallHome data are more challenging for machines to transcribe than those from SWITCHBOARD, making breakthroughs harder to achieve. (On this corpus we achieved a 10.3 percent word error rate – another industry record – but again, with Appen’s help, measured human performance in the same situation to be 6.8 percent).”
Also from the IBM blog, Julia Hirschberg, professor and chair of the Department of Computer Science at Columbia University, commented on the challenge of speech recognition:
“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It’s also difficult to define human performance, since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated,” she shared.
These breakthroughs come after decades of speech recognition research, beginning in the early 1970s with DARPA-funded work, Microsoft wrote. Over time, most major technology companies and many research organizations have developed speech recognition technologies, including BBN, Google, Microsoft, Hewlett Packard and IBM.