Professor Steve Young's pioneering speech technology work recognised
For pioneering contributions to the theory and practice of automatic speech recognition and statistical spoken dialogue systems.
The award citation
Professor Steve Young is the 2015 recipient of the IEEE James L. Flanagan Speech and Audio Processing Award.
The annual prize is given to an individual, or a team of up to three, for “an outstanding contribution to the advancement of speech and/or audio signal processing”.
Professor Young is the Senior Pro-Vice-Chancellor, responsible for Planning and Resources, and Professor of Information Engineering in the Information Engineering Division.
The award citation reads: “For pioneering contributions to the theory and practice of automatic speech recognition and statistical spoken dialogue systems.”
Professor Young works in speech technology, focussing in particular on developing systems which allow a human to interact with a machine using voice.
This involves machines such as mobile phones recognising the user’s words, understanding what the words mean, deciding what to do and how to respond, and then converting the textual response back into speech.
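As a rough illustration of that pipeline, the sketch below chains four toy stages together in Python. Every function is a placeholder invented for this example, not part of any real product; in practice each stage is a large statistical model in its own right.

```python
from datetime import datetime

# Toy stand-ins for the four stages of a spoken dialogue system described
# above. All names and behaviours are illustrative only.

def recognise(audio: str) -> str:
    """Speech recognition (ASR): audio in, text out (audio faked as text here)."""
    return audio.lower().strip("?!. ")

def understand(text: str) -> dict:
    """Language understanding: text in, structured meaning out."""
    return {"intent": "get_time"} if "time" in text else {"intent": "unknown"}

def decide(meaning: dict) -> str:
    """Dialogue management: meaning in, textual response out."""
    if meaning["intent"] == "get_time":
        return f"It is {datetime.now():%H:%M}."
    return "Sorry, I didn't catch that."

def synthesise(text: str) -> bytes:
    """Text-to-speech: here it just encodes the text rather than making audio."""
    return text.encode("utf-8")

print(synthesise(decide(understand(recognise("What time is it?")))))
```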
One speech recognition “toolkit” called HTK, developed by Professor Young more than 20 years ago, became a global standard for benchmarking systems and the basis of many commercial systems, and it is still widely used.
David Stevenson, assistant editor at National Health Executive magazine, interviewed Professor Young. The following extracts are taken from the interview article, which originally appeared in the September/October 2014 issue of National Health Executive and on the magazine’s website.
Big data helping speech recognition become mainstream
Steve Young, Professor of Information Engineering at the University of Cambridge, and a global expert in speech recognition technologies, gives his thoughts on the advances and challenges facing this ‘growing’ research area. David Stevenson reports.
He told NHE that research in this area made steady but not spectacular progress from the mid-1980s to the mid-2000s. “But over the last five to 10 years we’ve seen really quite significant acceleration in progress,” he said. “And that is why we are now seeing speech recognition coming into the mainstream with services like Apple Siri and Google Now, and the new smart watches that do speech recognition.”
Prof Young added that modern systems are built on statistical models that represent the data.
“So the way you build a speech recogniser, essentially, is that you get some data, which is people speaking, you transcribe the data and then you try to model the data and find a way to automatically generate the transcriptions yourself – and then you have a speech recogniser,” he said. “The key to all of that is some quite sophisticated statistical modelling algorithms and the availability of the data.”
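To make the “model the data” idea concrete, here is a minimal sketch in Python. It is not how HTK or any modern recogniser actually works (those use hidden Markov models and neural networks over feature sequences); it simply fits one Gaussian per word to feature vectors from transcribed recordings, then recognises a new utterance by picking the word whose model assigns it the highest likelihood. All data here is synthetic.

```python
import numpy as np

def train(features_by_word):
    """Fit a diagonal Gaussian (mean, variance) to each word's feature vectors."""
    return {w: (f.mean(axis=0), f.var(axis=0) + 1e-6)
            for w, f in features_by_word.items()}

def log_likelihood(x, mean, var):
    # Log density of a diagonal-covariance Gaussian at feature vector x.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def recognise(x, models):
    # Pick the word whose model best explains the observed features.
    return max(models, key=lambda w: log_likelihood(x, *models[w]))

# Synthetic stand-in for transcribed speech: each row is one utterance's features.
rng = np.random.default_rng(0)
data = {
    "yes": rng.normal(0.0, 1.0, size=(50, 4)),
    "no":  rng.normal(3.0, 1.0, size=(50, 4)),
}
models = train(data)
print(recognise(rng.normal(3.0, 1.0, size=4), models))  # almost certainly "no"
```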
Big data
The expert told us that it is the nature of data, and its wide availability nowadays, that has changed the speech recognition landscape. “When you speak into your phone, the signal is being routed to a server farm somewhere in North Carolina if you’re Apple or South Carolina if you’re Google, and it is being processed there and the result is being fed back to your phone,” said Prof Young.
This allows two things to happen. Firstly, it opens up the possibility of using some very powerful computing to recognise people’s voices. Secondly, and more importantly, the companies are capturing the data.
“When Siri was first launched, for example, it wasn’t that great,” said Prof Young, “but as more people started using it the company was capturing huge amounts of data. And then by using and collecting the data and upgrading the models, people found the recognition improved so they used the system more, so they gave more data. That has happened over a wide range of fields, and it is the ‘big data’ paradigm that we are hearing a lot about.”
Dictation and voice recognition in healthcare
Dictation in the medical area has been one of the mainstays of commercial speech recognition applications, he added.
He noted that doctors have persevered – “and in some cases had to, particularly in the US where everything has to be recorded” – and that because of this perseverance the dictation systems “have made progress and been widely used”.
He stated that advances have also been made in transcription and more general conversation systems, where a computer can listen in on a conversation between two humans or be one of the participants.
In fact, Prof Young feels that the challenge in developing these technologies is now shifting from transcribing audio into words, which has been the focus for the last 30 years, to understanding what the words mean and the semantics behind them, especially in conversational systems.
“I would expect that what has been started by Siri and Google Now is going to expand and we’re going to see a whole plethora of agents being available for having conversations about booking hotels and restaurants,” he said, “but particularly in healthcare, as this is the field which is ripe for providing this type of service.
“I think we’ll start to see these coming in within the next few years in focused application areas and then becoming more and more general and widely acceptable over the next decade.”
Conversational systems
Currently, Prof Young is working on developing conversational systems – not specifically in healthcare yet – to access tourist information.
“For example, finding a restaurant or hotel,” he told us, “and we’ve been working with some automobile companies to develop in-car voice recognition.”
He explained that, with many people now used to satnavs, drivers may in future be able to talk to their cars and say: ‘I’d like to stop off and have a meal, what is there in the local area?’ The car would then search for and book a suitable restaurant, after a conversation with the driver about the available options.
“We’re working on that now and many of the algorithms we’re starting to develop are not rule based,” said Prof Young.
“Traditionally these types of things have been developed by a programmer sitting down and writing rules, such as ‘what would the user ask?’ And ‘how should the system respond?’ But this doesn’t scale and the system you deploy doesn’t get any better. What we want to do is deploy systems that learn from their own users and get better and more competent automatically, and that really is the focus of my work now.”
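A minimal sketch of that contrast, assuming nothing about Prof Young’s actual systems: instead of hard-coded rules, the dialogue policy below keeps a success estimate for each candidate response and updates it from (simulated) user feedback, so its behaviour improves with use. A simple epsilon-greedy bandit stands in here for the far richer statistical policies used in real research systems.

```python
import random

class LearningPolicy:
    """Chooses among candidate responses and learns from user feedback."""

    def __init__(self, responses, epsilon=0.1):
        # [successes, trials] per response; the 1/2 prior avoids division by zero.
        self.stats = {r: [1, 2] for r in responses}
        self.epsilon = epsilon

    def choose(self):
        if random.random() < self.epsilon:      # occasionally explore
            return random.choice(list(self.stats))
        return max(self.stats, key=lambda r: self.stats[r][0] / self.stats[r][1])

    def feedback(self, response, success):
        s = self.stats[response]
        s[0] += int(success)
        s[1] += 1

policy = LearningPolicy(["ask_cuisine_first", "list_nearby_restaurants"])
for _ in range(200):                            # 200 simulated dialogues
    r = policy.choose()
    # Pretend users succeed more often when shown nearby restaurants directly.
    policy.feedback(r, random.random() < (0.8 if r == "list_nearby_restaurants" else 0.5))

best = max(policy.stats, key=lambda r: policy.stats[r][0] / policy.stats[r][1])
print(best)  # almost certainly "list_nearby_restaurants"
```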
He added that conversational systems are particularly interesting, and he believes that the use of automation, if it is done “sensibly and effectively”, could make a big impact on the future care of the elderly and the management of an ageing population.
Despite dedicating 35 years to research in the field of speech recognition, and with his research helping to set global standards for benchmarking systems and being the basis of many commercial systems, Prof Young remains modest about his award, joking that organisations sometimes feel they have to give them out “just because someone has been around long enough”.
Nevertheless, he said he is “humbled” to become the 2015 recipient of the IEEE James L Flanagan Speech and Audio Processing Award.