The goal of neural machine translation is to build a single neural network that can be jointly tuned to maximize translation performance
Most models proposed for neural machine translation fall under the umbrella of the encoder-decoder family
Goal of the paper is to show that the use of a fixed-length vector hinders the performance of these encoder-decoder models; it proposes a new architecture that lets the model (soft-)search for the parts of a source sentence relevant to a target word, without having to form those parts as explicit hard segments
The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.
Issues with this approach: the network has to compress all of the information in the source sentence into a single fixed-length vector. This bottleneck becomes apparent with long sentences; performance degrades sharply on sentences longer than those in the training corpus.
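A minimal sketch of that compression step (the parameter names W, U, b are mine, not the paper's): a plain RNN reads the whole sentence but hands on only its final hidden state.

```python
import numpy as np

def encode_to_fixed_vector(embeddings, W, U, b):
    """Toy RNN encoder: however many tokens come in,
    only one fixed-length vector comes out."""
    h = np.zeros(U.shape[0])
    for x in embeddings:               # one step per source token
        h = np.tanh(W @ x + U @ h + b)
    return h                           # the single summary vector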
The fix for the bottleneck: learning to align and translate jointly, i.e., translate while simultaneously figuring out which parts of the source sentence each output word corresponds to.
For each target word the model generates:
it (soft-)searches for the set of positions in the source where the most relevant information is concentrated (see the sketch after this list)
it predicts the target word from a context vector built from the encoder states at those positions
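A rough NumPy sketch of that soft search, following the paper's additive alignment model a(s, h) = v^T tanh(Ws + Uh); the variable names W_a, U_a, v_a are mine:

```python
import numpy as np

def soft_search(prev_state, enc_states, W_a, U_a, v_a):
    """Score each source position against the decoder's previous
    state, then average the encoder states under those scores."""
    # alignment score e_j for every source annotation h_j
    scores = np.array([v_a @ np.tanh(W_a @ prev_state + U_a @ h)
                       for h in enc_states])
    # softmax -> weights alpha_j that sum to 1 over source positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # context vector c_i: weighted sum of the annotations
    context = (weights[:, None] * enc_states).sum(axis=0)
    return context, weights
```

The decoder then conditions on a fresh context vector c_i for every target word instead of one shared vector.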
With this improvement, the paper shows that translation quality no longer degrades sharply as input length increases
Paper states that the most important difference is that the model does not encode the input into a single vector; it encodes it into a sequence of vectors and chooses a subset of them adaptively while decoding
Background: the basics of neural machine translation:
RNN Encoder-Decoder Model:
Once the encoder has turned the source input into a vector, the decoder predicts each next word from that context vector and the previously predicted words.
Decoder job: define the probability of the translation by decomposing the joint probability into ordered conditionals, one per word given all the previous words in the sentence.
The single context vector (the encoder's final hidden state) passed to the decoder at every step of the translation process is the bottleneck.
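In the paper's notation, the basic encoder-decoder factorizes the translation probability as:

```latex
p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c),
\qquad
p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)
```

Here c is the single fixed-length vector from the encoder, s_t is the decoder's hidden state, and g is a nonlinear function outputting a distribution over the vocabulary; the proposed model replaces the shared c with a distinct c_i per target word.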
Learning to Align and Translate:
Bidirectional RNN as an encoder
A decoder that emulates searching through the source while decoding
The bidirectional RNN, which acts as the encoder:
The forward RNN calculates the forward hidden states, reading the sentence from start to end
The backward RNN reads the sentence in reverse to calculate the backward hidden states
The two are concatenated at each position into an annotation that the decoder uses to generate the output (sketch below)
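A sketch of that bidirectional encoder (step_fwd/step_bwd stand in for any RNN cell; the paper uses gated units):

```python
import numpy as np

def bidirectional_encode(embeddings, step_fwd, step_bwd, hidden_size):
    """Run one RNN start->end and another end->start, then
    concatenate the two hidden states at every position."""
    T = len(embeddings)
    h_fwd = np.zeros(hidden_size)
    h_bwd = np.zeros(hidden_size)
    fwd, bwd = [], [None] * T
    for x in embeddings:                  # forward pass
        h_fwd = step_fwd(x, h_fwd)
        fwd.append(h_fwd)
    for j in range(T - 1, -1, -1):        # backward pass
        h_bwd = step_bwd(embeddings[j], h_bwd)
        bwd[j] = h_bwd
    # annotation h_j summarizes both preceding and following words
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Because each annotation is centered on position j, the soft search above has something position-specific to attend to.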
Experiment in the paper:
Each model was trained twice: once on sentences of up to 30 words and once on sentences of up to 50 words
Major takeaway: the proposed model computes alignment scores from the RNN's hidden states, which lets it focus on the most relevant parts of the input. This improves performance on long translation jobs, preserving the message and meaning of input sentences much better than the fixed-vector baseline.
Project idea: train a model on a bunch of album covers scraped from Spotify, then use it to generate captions for pictures