Problems of Representation in Encoder-Decoders
- problems
- long-distance dependencies
- storing the information of arbitrarily long sentences in a single hidden vector of fixed size
Attention
- steps:
- Hidden Matrix
$$ \overrightarrow h_j^{(f)}=RNN(embed(f_j),\;\overrightarrow h_{j-1}^{(f)}) $$
$$ \overleftarrow h_j^{(f)}=RNN(embed(f_j),\;\overleftarrow h_{j+1}^{(f)}) $$
$$ h_j^{(f)}\;=\lbrack\overleftarrow h_j^{(f)},\overrightarrow h_j^{(f)}\rbrack $$
$$ H^{(f)}=\mathrm{concat\_col}(h_1^{(f)},\;\dots,\;h_{\left|F\right|}^{(f)}) $$
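A minimal NumPy sketch of the bidirectional encoder above, with a plain Elman cell standing in for the RNN; the function names (`rnn_step`, `encode`), random parameters, and dimensions are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def rnn_step(x, h_prev, W, U, b):
    """One simple (Elman) RNN step; stands in for any RNN/LSTM/GRU cell."""
    return np.tanh(W @ x + U @ h_prev + b)

def encode(embeddings, d_h, rng):
    """Run forward and backward RNNs and stack [backward; forward] per source word."""
    d_e = embeddings.shape[1]
    # Illustrative random parameters; a real model would learn these.
    Wf, Uf, bf = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
    Wb, Ub, bb = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)

    n = len(embeddings)
    fwd, bwd = [None] * n, [None] * n
    h = np.zeros(d_h)
    for j in range(n):                      # left-to-right pass
        h = rnn_step(embeddings[j], h, Wf, Uf, bf)
        fwd[j] = h
    h = np.zeros(d_h)
    for j in reversed(range(n)):            # right-to-left pass
        h = rnn_step(embeddings[j], h, Wb, Ub, bb)
        bwd[j] = h
    # h_j^(f) = [backward; forward]; H^(f) has one column per source word.
    H = np.stack([np.concatenate([bwd[j], fwd[j]]) for j in range(n)], axis=1)
    return H                                # shape: (2*d_h, |F|)

rng = np.random.default_rng(0)
src_embeddings = rng.normal(size=(5, 8))    # |F| = 5 source words, embedding size 8
H_f = encode(src_embeddings, d_h=6, rng=rng)
print(H_f.shape)                            # (12, 5)
```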
- attention vector $\alpha_t$ (non-negative weights over the source words that sum to 1)
$$ c_t=H^{(f)}\alpha_t $$
- intuition: how much we are “focusing” on a particular source word at a particular time step
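A small sketch of $c_t=H^{(f)}\alpha_t$ with made-up numbers, just to show that $\alpha_t$ is a softmax over source positions and $c_t$ is the corresponding weighted average of the encoder columns.

```python
import numpy as np

H_f = np.random.randn(12, 5)                      # columns h_1^(f) ... h_5^(f)
scores = np.random.randn(5)                       # unnormalized attention scores a_t
alpha_t = np.exp(scores) / np.exp(scores).sum()   # softmax: non-negative, sums to 1
c_t = H_f @ alpha_t                               # context vector: weighted average of source states
print(alpha_t.sum(), c_t.shape)                   # ~1.0, (12,)
```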
- Calculating Attention Scores
$$ h_t^{(e)}=enc(\lbrack embed(e_{t-1});c_{t-1}\rbrack,\;h_{t-1}^{(e)}) $$
$$ a_{t,j}\;=\;\mathrm{attn\_score}(h_j^{(f)},\;h_t^{(e)}) $$
$$ \alpha_t\;=\;softmax(a_t) $$
$$ p_t^{(e)}\;=\;softmax(W_{hs}\lbrack h_t^{(e)};c_t\rbrack\;+b_s) $$
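A toy walk-through of one decoder step under the equations above, assuming a dot-product `attn_score`, a simple tanh cell in place of `enc`, and random placeholder weights; the vocabulary size and dimensions are invented for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, V = 12, 50                              # hidden size and target vocab size (assumed)
H_f = rng.normal(size=(d, 5))              # encoder matrix H^(f) from the previous step
W_dec = rng.normal(size=(d, 2 * d)) * 0.1  # toy decoder weights
W_hs = rng.normal(size=(V, 2 * d)) * 0.1   # output projection over [h_t; c_t]
b_s = np.zeros(V)

h_prev = np.zeros(d)                       # h_{t-1}^(e)
c_prev = np.zeros(d)                       # c_{t-1}
e_prev = rng.normal(size=d)                # embed(e_{t-1})

# h_t^(e) = enc([embed(e_{t-1}); c_{t-1}], h_{t-1}^(e))  -- toy cell
h_t = np.tanh(W_dec @ np.concatenate([e_prev, c_prev]) + h_prev)
a_t = H_f.T @ h_t                          # attn_score = dot product with each column of H^(f)
alpha_t = softmax(a_t)
c_t = H_f @ alpha_t
p_t = softmax(W_hs @ np.concatenate([h_t, c_t]) + b_s)
print(p_t.argmax(), p_t.sum())             # predicted word id, ~1.0
```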
Ways of Calculating Attention Scores
Dot product
- advantage
- no additional parameters
- efficient: concatenating the source hidden vectors into a matrix lets all scores be computed with one matrix multiplication (GPU-friendly)
- disadvantage
- forces the input and output encodings to lie in the same space
- the score magnitude depends on the size of the hidden vector, and larger vectors make the softmax distribution peakier
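For concreteness, a dot-product scoring sketch (sizes are assumed): one matrix–vector product scores every source word at once.

```python
import numpy as np

H_f = np.random.randn(12, 5)   # source hidden vectors as columns
h_t = np.random.randn(12)      # decoder hidden state (must have the same size)
a_t = H_f.T @ h_t              # all dot-product scores in one matrix-vector product
print(a_t.shape)               # (5,)
```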
Scaled dot product
- advantage
- dividing the score by the square root of the hidden-vector size counteracts the peakiness of the plain dot product, making training more stable across hidden vector sizes
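A sketch of the scaling, with the same toy sizes as above: dividing by the square root of the hidden size keeps the scores roughly O(1), so the softmax is less peaky for large hidden vectors.

```python
import numpy as np

d = 12
H_f, h_t = np.random.randn(d, 5), np.random.randn(d)
a_plain  = H_f.T @ h_t
a_scaled = a_plain / np.sqrt(d)    # divide by sqrt(hidden size) to keep scores O(1)
print(a_plain.std(), a_scaled.std())
```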
Bilinear functions
- advantage
- the learned linear transform allows the input and output encodings to lie in different spaces
- disadvantage
- introduces quite a few extra parameters: the matrix has size
$$ \left|h^{(f)}\right|\times\left|h^{(e)}\right| $$
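A bilinear-score sketch with assumed (and deliberately different) source and target hidden sizes; `W` holds the $\left|h^{(f)}\right|\times\left|h^{(e)}\right|$ extra parameters.

```python
import numpy as np

d_f, d_e = 12, 8                     # source and target hidden sizes may differ
H_f = np.random.randn(d_f, 5)
h_t = np.random.randn(d_e)
W   = np.random.randn(d_f, d_e)      # the |h^(f)| x |h^(e)| parameter matrix
a_t = H_f.T @ (W @ h_t)              # bilinear score h_j^(f)' W h_t^(e) for every j
print(a_t.shape)                     # (5,)
```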
Multi-layer perceptrons
- advantage
- fewer parameters than the bilinear method, and generally provides good results.
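A sketch of an MLP score of the common form $w_2^{\top}\tanh(W_1\lbrack h_t^{(e)};h_j^{(f)}\rbrack)$, with an assumed small attention hidden size `d_a`; the parameter names are placeholders.

```python
import numpy as np

d_f, d_e, d_a = 12, 8, 16            # d_a: small attention hidden size
H_f = np.random.randn(d_f, 5)
h_t = np.random.randn(d_e)
W1  = np.random.randn(d_a, d_f + d_e)
w2  = np.random.randn(d_a)
# Score each source position j: w2' tanh(W1 [h_t^(e); h_j^(f)])
a_t = np.array([w2 @ np.tanh(W1 @ np.concatenate([h_t, H_f[:, j]]))
                for j in range(H_f.shape[1])])
print(a_t.shape)                     # (5,)
```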
Others
- RNN
- Tree-structured networks
- CNN
- Structured models
Copying and Unknown Word Replacement
- when the output contains an “unk” token, replace it with the source word that received the highest attention at that time step
- use an alignment model to obtain a translation dictionary P(e|f), where f is the attended source word, and output its most probable translation (or copy f through if it has no dictionary entry)
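A hypothetical end-to-end illustration of unknown-word replacement; the sentences, attention weights, and dictionary below are all made up.

```python
# Hypothetical example: replace "<unk>" outputs using attention and a dictionary.
src_words = ["le", "chat", "noir", "dort"]
out_words = ["the", "<unk>", "cat", "sleeps"]
# alpha[t][j]: attention that output position t put on source position j (made up here)
alpha = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],   # the <unk> attended mostly to "noir"
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Translation dictionary built from an alignment model, argmax_e P(e|f):
dictionary = {"noir": "black"}

fixed = []
for t, w in enumerate(out_words):
    if w == "<unk>":
        j = max(range(len(src_words)), key=lambda j: alpha[t][j])  # most-attended source word
        f = src_words[j]
        w = dictionary.get(f, f)   # use the dictionary translation, else copy the source word
    fixed.append(w)
print(" ".join(fixed))             # the black cat sleeps
```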
Intuitive Priors on Attention
Purpose: improve the accuracy of the estimated attention by incorporating prior probabilities
Position Bias
- intuition: when the two languages have similar word order, the alignments (and thus the attention) should fall roughly along the diagonal
Markov Condition
- intuition: if two words are contiguous in the target, the source words they align to are likely to be contiguous as well
- basic idea: discourage large jumps and encourage local, step-by-step movements of the attention
- example: the local attention model (see the sketch below)
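A toy sketch in the spirit of local attention: the soft weights are damped by a Gaussian centred on a predicted focus position `p_t`, which encodes the position/locality prior (the final renormalization is only for illustration; the scores and window width are made up).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.random.randn(7)               # toy scores for 7 source words at one step
p_t, sigma = 3.0, 1.0                     # predicted focus position and window width
positions = np.arange(len(scores))
# Multiply the softmax weights by a Gaussian centred on p_t: far-away positions
# are damped, so the attention prefers small, local moves.
alpha = softmax(scores) * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
alpha = alpha / alpha.sum()               # renormalize (for illustration only)
print(alpha.round(3))
```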
Fertility
- specific mapping: some words in one language are consistently translated into a certain number of words in the other (their fertility)
- intuition: penalize source words that receive too little or too much total attention (see the sketch below)
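A sketch of a fertility/coverage-style penalty, assuming for simplicity that each source word should receive roughly one unit of total attention; the attention matrix is made up.

```python
import numpy as np

# alpha[t, j]: attention on source word j at output step t (toy values).
alpha = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1,  0.1 ],
    [0.7, 0.2,  0.1 ],
])
coverage = alpha.sum(axis=0)              # total attention each source word received
# Penalize words attended far more or less than once (fertility ~ 1 assumed here);
# adding this term to the training loss discourages over- and under-translation.
fertility_penalty = np.sum((1.0 - coverage) ** 2)
print(coverage, fertility_penalty)
```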
Bilingual Symmetry
- alignment symmetry: attention computed source-to-target and target-to-source should roughly agree
- reported to be effective among the priors above
Further Reading
Hard Attention
- make a hard, discrete decision about which source position(s) to focus on, instead of a soft weighting
Supervised Training of Attention
- add a loss that penalizes the model when its attention disagrees with gold-standard alignments
Other Ways of Memorizing Input
- memory networks
More papers
- structured attention networks