It is becoming more common to interact with our technology using voice interfaces, but the way we speak to computers – and the way they respond – can be a laboured process.
Dr Milica Gašić of the Machine Intelligence Group is looking for a way for this interaction to be more natural and to accommodate different types of speech. She spoke with Graihagh Jackson of the University of Cambridge’s Naked Scientists programme about her work.
Milica: You’ve probably all heard of Siri on the iPhone or other personal systems, but these systems can be used more widely, in situations like banking or for providing healthcare information for elderly people, for instance. These systems normally have three components. The first component, which is called speech understanding, tries to extract the meaning from the speech. The second component, which is called dialogue management, tries to decide what the best response, or what we call the action, is to give to the user. The final component then turns this response into speech.
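To make the three components concrete, here is a minimal sketch in Python. It is an illustration only, not the system discussed in the interview: the function names, the keyword lists and the response templates are all assumptions, and it works on text that is assumed to have already been transcribed from speech.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """What the system has learned about the user's request so far."""
    slots: dict = field(default_factory=dict)


def understand(text: str) -> dict:
    """Speech understanding (toy version): spot keywords in transcribed text."""
    lowered = text.lower()
    meaning = {}
    for food in ("chinese", "italian", "indian"):  # hypothetical vocabulary
        if food in lowered:
            meaning["food"] = food.capitalize()
    for area in ("centre", "north", "south"):  # hypothetical vocabulary
        if area in lowered:
            meaning["area"] = area
    return meaning


def choose_action(state: DialogueState) -> str:
    """Dialogue management: a hand-written rule standing in for a learned policy."""
    if "food" not in state.slots:
        return "request_food"
    if "area" not in state.slots:
        return "request_area"
    return "offer_restaurant"


def generate(action: str, state: DialogueState) -> str:
    """Response generation: fill a text template (speech synthesis is omitted)."""
    templates = {
        "request_food": "What kind of food would you like?",
        "request_area": "Which part of town would you like?",
        "offer_restaurant": "I found a {food} restaurant in the {area}.",
    }
    return templates[action].format(**state.slots)


def respond(text: str, state: DialogueState) -> str:
    """Run one user turn through all three components in order."""
    state.slots.update(understand(text))
    return generate(choose_action(state), state)
```

Calling respond repeatedly with the same DialogueState carries the context from one turn to the next, so a user who first names a cuisine and then an area ends up with an offer.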
Graihagh: None of this is trivial. Putting speech into text, understanding that and then deciding what the best action is and turning text back into speech: it’s quite complicated, especially that middle step of understanding and actioning.
Currently, systems like Siri and Google all operate on a series of rules. Someone has literally sat down and thought about all the possible things you could ever want to ask your smartphone, written it into code and voilà! Sounds painstakingly protracted …
Milica: Now this is obviously suboptimal, because a human can’t think of all possible situations and it’s very expensive to develop such systems. So what we are doing is trying to use machine learning to tackle this problem and to make the systems better.
Graihagh: When you say machine learning, what do you mean? Are you literally sitting down a computer and saying this is X and this is Y?
Milica: Not really. The idea of machine learning is that the machine can analyse data, build a model and then, based on that model, make predictions. So the prediction could be what the user wants, or the prediction could be what the system should say back to the user. A particular machine learning method which is very useful for building dialogue systems is reinforcement learning, and reinforcement learning is all about trial and error. In machine learning we normally have two ways of learning. One is supervised learning, and you can think of that as having a teacher who is teaching you. So the teacher shows you how to do something, and then you try to imitate your teacher and do it as well as the teacher does.
Reinforcement learning is very different. In reinforcement learning, you explore the different possibilities. You don’t have a teacher, but you have, say, a parent who will give you occasional rewards based on what you did and whether you did it well. In the same way that children try to get as many presents from their parents as possible, the system also tries to maximise its reward and, indeed, many reinforcement learning algorithms were inspired by nature and by how biological systems learn.
Graihagh: You just made me think of kids wanting treats. I mean is there a chocolate equivalent for a machine?
Milica: Oh absolutely! The reward is numerical and it is an equivalent of chocolate.
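As a rough illustration of learning by trial and error from a numerical reward, the sketch below uses an epsilon-greedy strategy, one of the simplest forms of reinforcement learning. It is not the method used in the system described here: the three actions, their reward values and the parameter settings are all made up for the example.

```python
import random

# Hypothetical "chocolate": a fixed numerical reward for each possible action.
REWARDS = {"repeat_question": 0.2, "offer_restaurant": 1.0, "say_goodbye": 0.0}


def learn_by_trial_and_error(rewards, steps=500, epsilon=0.1, seed=0):
    """Estimate the value of each action purely from the rewards received."""
    rng = random.Random(seed)
    actions = list(rewards)
    values = {action: 0.0 for action in actions}  # current reward estimates
    counts = {action: 0 for action in actions}    # how often each was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try an action at random.
            action = rng.choice(actions)
        else:
            # Exploit: otherwise pick the action that looks best so far.
            action = max(actions, key=values.get)
        reward = rewards[action]  # no teacher, only a number coming back
        counts[action] += 1
        # Move the estimate towards the running average of observed rewards.
        values[action] += (reward - values[action]) / counts[action]
    return values


learned = learn_by_trial_and_error(REWARDS)
best_action = max(learned, key=learned.get)
```

Nothing tells the learner which action is correct. It simply tries things, keeps a running average of the reward each action earned, and gradually favours whichever action has paid off best.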
And speaking of food, we decided to dial up the dialogue system in search of some lunch.
CRIS: What kind of food would you like?
Milica: Hi, I’m looking for a Chinese restaurant in the centre please.
CRIS: HK Fusion is a nice place: it serves Chinese food.
Milica: Is it in the centre?
CRIS: HK Fusion is a nice place. Their address is 21 Burner Street city centre.
Milica: Can you tell me their phone number?
CRIS: HK Fusion is a nice place. Their phone number is 01223 355 909.
Milica: Thank you. Goodbye.
CRIS: Thank you. Goodbye.
Graihagh: You can very easily see that to and fro in that conversation; it’s got the context of what you’re asking there. So it’s learning every time someone dials and asks for a specific food (whether that’s Chinese or pizza), but what will it be learning from that conversation, and how will it be learning from it?
Milica: So, from time to time it asks somebody to enter their feedback. At other times, it tries to estimate the feedback on its own, and then it analyses which actions it took and what feedback it got. One thing it wants to maximise is the chance of success. When it provides all the information that the user asked for, that counts as a successful dialogue, but that is not the only thing it is trying to optimise. It also tries to offer as much information as possible in as few turns as possible because users generally don’t like to hang around and talk to dialogue systems forever. So it tries to adjust its actions so that it optimises these two objectives.
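One common way to fold these two objectives into a single number is to give a bonus for a successful dialogue and subtract a small penalty for every turn. The sketch below assumes that form; the function name and the values 20 and 1 are illustrative choices, not figures from the interview.

```python
def dialogue_reward(success: bool, num_turns: int,
                    success_bonus: float = 20.0,
                    turn_penalty: float = 1.0) -> float:
    """Score one finished dialogue: reward success, penalise length.

    success_bonus and turn_penalty are assumed values for illustration.
    """
    bonus = success_bonus if success else 0.0
    return bonus - turn_penalty * num_turns


short_success = dialogue_reward(success=True, num_turns=4)
long_success = dialogue_reward(success=True, num_turns=10)
short_failure = dialogue_reward(success=False, num_turns=4)
```

Under this scoring a short successful dialogue beats a long successful one, and both beat a failed dialogue of the same length, which pushes the system towards giving users what they asked for quickly.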
Graihagh: So it sort of goes away and reflects, not unlike a human, on what was good and what was bad about that conversation?
Milica: Yes exactly, that’s a very good comparison.
Graihagh: So in the future do you envisage this being much broader than just ordering Chinese food in the city centre?
Milica: Yes. My goal is to model a richer conversation. In particular, one idea that I have is to build a dialogue system that can be used for the prevention of mental illness. The idea would be to develop a dialogue system that everybody could access on their phone whenever they like, so that whenever they have a problem they could get anonymous, instant support. I think that would certainly have a huge impact, but also, from a scientific point of view, these dialogues would be much richer: it wouldn’t be about ordering Chinese food but rather about trying to model real conversation.
The interview can be heard on The Naked Scientists website.