https://www.youtube.com/watch?v=3-JKvP7eBXc
(TRANSCRIPT) ‘Deep Learning for Language Understanding’
Quoc Le, Research Scientist, Google
Manually transcribing talks and lectures can really focus the mind, and it drives home the difficulties natural language processing systems face in understanding words spoken in different accents, along with the punctuation, stresses and self-corrections that are best omitted from the target text. Here is a transcript of Quoc Le’s talk at the Deep Learning Summit 2015, San Francisco.
Le gives an insight into some of the work the Google Brain team is doing with neural networks for language understanding, and into the applications of the technologies they are building.
“I joined the Google Brain project in 2011 and our goal was to take machine perception as far as possible, so we wanted to use Google infrastructure to take some big strides in machine perception. If you look closely at some of the algorithms used for image recognition, you take an image, you crop it to 200 x 200 pixels, and then you try to map it to a specific category, for example ‘cat’. Or for speech recognition, you take some waveform, you crop it, and then you try to classify whether the person is speaking the word ‘cat’ or not. So if you look inside the box, you see it’s trying to map some kind of fixed-length input, in the case of an image or in the case of a waveform, into some categorical value.
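In code, that fixed-length setting looks roughly like this (a minimal sketch in PyTorch, which is an assumption of mine, not the speaker’s code; the layer sizes and category count are made up):

```python
# A tiny fixed-length classifier: a 200 x 200 image is flattened and
# mapped to a score for each of a fixed set of categories.
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),                # 200 x 200 pixels -> vector of 40,000
    nn.Linear(200 * 200, 512),
    nn.ReLU(),
    nn.Linear(512, 1000),        # scores over 1,000 categories (hypothetical)
)

image = torch.randn(1, 1, 200, 200)  # one grayscale image
scores = classifier(image)           # shape: (1, 1000)
print(scores.argmax(dim=1))          # index of the predicted category, e.g. 'cat'
```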
There is one area that I’m very excited about but where we haven’t made a lot of progress, and that is language understanding. In language understanding, most often what you see is that you require a mapping from one sequence, like a sequence of words, to another sequence of words. Let’s take machine translation as an example: you want to map a sequence of words like ‘I love music’ to another sequence of words in French. Or in question answering, you want to map one sequence of words, the question, to another sequence of words, the answer.
It’s the kind of problem that we would like to solve, but we don’t have the technology to solve it yet, so we would like to develop that technology. If you look closely at machine translation as it’s implemented today, the way it works is that you write a program that translates one word at a time from the source language to the target language. Then, because there is some grammatical structure in the target language, you might write another computer program to reorder the target sentence.
Then you have to design a bunch of rules to cope with corner cases; for example, the phrase ‘New York’ is not ‘new’ plus ‘York’, it’s actually ‘New York’ the city, so the translation is different. Machine translation then becomes really complicated, because there are a lot of these rules that you need to deal with in natural languages.
So we would like to take a very different approach to this. The way that we deal with this problem is that we use a recurrent neural network to encode the source sentence, and then we use another recurrent network to decode the target sentence. The take-home picture that you would like to see is something like this (go to 4:00). For example, if you want to map the sequence ABC to the sequence WXYZ, then what you have to do is run a recurrent neural network over ABC, and whenever it hits the token ‘end of sequence’ it will start producing WXYZ in order.
In that context the first part of the network is called the encoder network, and the second part is the decoder network. During training time we can just use backpropagation to learn all the connections inside the network: you present the source sentence and the target sentence, and the network learns as best it can to map from ABC to WXYZ. But at test time you don’t have the ground truth, so whenever you see the end-of-sequence token, you start making a prediction, and then you take that prediction and put it back inside the loop to predict the next word. You keep feeding the output back into the input of the system until you hit the end-of-sequence symbol, and then you stop producing.
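In code, the encoder-decoder idea looks roughly like this (a minimal sketch in PyTorch; the vocabulary size, hidden size and token ids are all assumptions of mine, not the talk’s actual model):

```python
import torch
import torch.nn as nn

VOCAB, EOS = 32, 0  # hypothetical vocabulary size and end-of-sequence id

class Seq2Seq(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, VOCAB)

    def forward(self, src, tgt):
        # Training: encode "ABC", then predict "WXYZ" with the ground
        # truth fed in as decoder input; backpropagation does the rest.
        _, state = self.encoder(self.embed(src))
        hid, _ = self.decoder(self.embed(tgt), state)
        return self.out(hid)  # logits for a cross-entropy loss

    @torch.no_grad()
    def generate(self, src, max_len=20):
        # Test time: no ground truth, so feed each prediction back in.
        _, state = self.encoder(self.embed(src))
        token = torch.tensor([[EOS]])  # start at the end-of-sequence token
        result = []
        for _ in range(max_len):
            hid, state = self.decoder(self.embed(token), state)
            token = self.out(hid).argmax(-1)
            if token.item() == EOS:
                break  # the network has learned to halt itself
            result.append(token.item())
        return result
```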
So the program has to learn how to halt itself; it will understand the structure of natural language, and at some point it will stop producing.
You can be a bit more clever by using the idea of beam search. Whenever you hit end-of-sequence on the input side, what you can do is start producing, say, five candidates. You take these five candidates and put them back in, one candidate at a time, on the input side, and for every candidate you produce more output, so it’s going to expand your beam. But you can trim the output by keeping only the sequences that have the highest probability, and it turns out that using such a simple idea we did very well on some of the tasks we experimented with.
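A sketch of that expand-and-trim loop (plain Python; `step` is a hypothetical function that, given a partial sequence, returns the possible next tokens with their log-probabilities):

```python
def beam_search(step, beam_width=5, max_len=20, eos=0):
    # Each hypothesis is (cumulative log-probability, token sequence so far).
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == eos:       # finished hypotheses stay as-is
                candidates.append((score, seq))
                continue
            for token, logp in step(seq):    # expand each candidate one step
                candidates.append((score + logp, seq + [token]))
        # Trim: keep only the highest-probability sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]
```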
For example, in a machine translation experiment we benchmarked on a dataset called the WMT dataset, which is very small compared to the Google dataset. The state-of-the-art approach, which I explained earlier, uses a lot of rules and took many years to develop, I would say about twenty years. It is evaluated with a metric called the BLEU score, which compares the ground truth with the machine output by counting how many pairs of words match. The state-of-the-art approach got 37%, which is a very remarkable number, but with our method, three people in one year got 37.5%, which is better than the state of the art.
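As a toy illustration of that word-pair matching (the real BLEU metric combines several n-gram sizes with clipping and a brevity penalty; the sentences here are made up):

```python
def bigram_precision(candidate, reference):
    cand = list(zip(candidate, candidate[1:]))   # pairs of adjacent words
    ref = list(zip(reference, reference[1:]))
    matched = sum(1 for pair in cand if pair in ref)
    return matched / len(cand)

truth = "i love music very much".split()
output = "i love music a lot".split()
print(bigram_precision(output, truth))  # 0.5: two of the four word pairs match
```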
There are two things that I want to emphasise about this result. Number one is that it took one year and three people to develop a very general technology that can be used in many other applications in language understanding. At the same time, when it comes to a specific application like machine translation, it also works better than a program that took many years to develop, and I think it is a very exciting new way to represent variable-sized input and generate variable-sized output.
If you use this technology and take the hidden state of a neural network trained for image classification, you can use it for image captioning. A couple of months back there was a New York Times article on image captioning: taking an image and generating a description of that image. The technology behind that is very simple: you train a convolutional net up to the top, and then you take that hidden state and start generating. The structure used for generation is based on the technology that I explained earlier, and because this technology is so general you can use it for other applications like image captioning.
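A minimal sketch of that recipe (PyTorch; the conv net here is a stand-in, and feeding its top hidden state into the decoder of the earlier Seq2Seq sketch is my assumption about the wiring, not the published architecture):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(                    # stand-in for a trained conv net
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 64),                  # top hidden state, sized to the decoder
)

image = torch.randn(1, 3, 224, 224)
features = cnn(image)                   # (1, 64) summary of the image
# Use `features` as the decoder's initial hidden state, then generate the
# caption word by word exactly as in the translation sketch above.
state = (features.unsqueeze(0), torch.zeros_like(features).unsqueeze(0))
```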
One thing a lot of people are very concerned about with this method is that it reads a variable-length structure into one fixed-length vector, the red vector that I am showing you here, and then uses that vector to slowly decode the output. One concern is: what if one sentence is short and some other sentence is extremely long? It turns out that we don’t have that problem at all; when we have one very long sentence, it will still happily decode the language.
The good thing is that we can also visualise the structure in that red vector, which captures the semantics of the whole sentence. Here’s one example of a visualisation that we have. We give the following sentence, ‘I was given a card by her in the garden’, then a different sentence, ‘In the garden she gave me a card’, and a different one again, ‘She was given a card by me in the garden’, which swaps the subject and the object. The conventional way of machine understanding makes all the representations look the same. With our method there is a clear distinction between ‘I was given a card by her in the garden’ and ‘she was given a card by me in the garden’, so you see the different paraphrase structure in this picture.
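The visualisation itself can be as simple as projecting each sentence’s encoder vector to two dimensions (a sketch in Python with NumPy; `encode` is a hypothetical stand-in for running the trained encoder and taking its final hidden state):

```python
import numpy as np

def pca_2d(vectors):
    # Project the vectors onto their top two principal components.
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

sentences = [
    "I was given a card by her in the garden",
    "In the garden she gave me a card",
    "She was given a card by me in the garden",
]
# points = pca_2d([encode(s) for s in sentences])
# Paraphrases land near each other; swapping subject and object moves the point.
```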
You can check more of these results in the paper that we reported last winter, Sequence to Sequence Learning with Neural Networks (PDF), and you’ll see some more visualisations and results in that paper.”