CMU 11-731 (MT & Seq2Seq) Neural MT (2): Attentional Neural MT

Problems of Representation in Encoder-Decoders

  • problems
    • long-distance dependencies are hard to capture
    • the encoder must compress arbitrarily long sentences into a single hidden vector of fixed size

Attention

  • steps:
      1. Hidden Matrix
        $$ \overrightarrow h_j^{(f)}=RNN(embed(f_j),\;\overrightarrow h_{j-1}^{(f)}) $$
        $$ \overleftarrow h_j^{(f)}=RNN(embed(f_j),\;\overleftarrow h_{j+1}^{(f)}) $$
        $$ h_j^{(f)}\;=\lbrack\overleftarrow h_j^{(f)},\overrightarrow h_j^{(f)}\rbrack $$
        $$ H^{(f)}=\operatorname{concat\_col}(h_1^{(f)},\;\dots,\;h_{\left|F\right|}^{(f)}) $$
      2. Context vector from the attention weights (which sum to 1)
        $$ c_t=H^{(f)}\alpha_t $$
        • intuition: each entry of the attention vector says how much we are “focusing” on a particular source word at a particular output time step
      3. Calculating attention scores and the output distribution (a minimal sketch of these steps follows this list)
        $$ h_t^{(e)}=enc(\lbrack embed(e_{t-1});c_{t-1}\rbrack,\;h_{t-1}^{(e)}) $$
        $$ a_{t,j}\;=\;\operatorname{attn\_score}(h_j^{(f)},\;h_t^{(e)}) $$
        $$ \alpha_t\;=\;softmax(a_t) $$
        $$ p_t^{(e)}\;=\;softmax(W_{hs}\lbrack h_t^{(e)};c_t\rbrack\;+b_s) $$
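
To make the three steps concrete, below is a minimal numpy sketch of a single attentional decoding step with a dot-product score; the variable names (`H_f`, `h_t_e`, `W_hs`, `b_s`) mirror the equations above, while the shapes and random initialization are purely illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(H_f, h_t_e, W_hs, b_s):
    """One decoding step with dot-product attention.

    H_f   : (d, |F|) matrix of concatenated encoder states h_j^(f)
    h_t_e : (d,)     current decoder state h_t^(e)
    W_hs  : (V, 2d)  output projection, b_s : (V,) output bias
    """
    a_t = H_f.T @ h_t_e                 # attention scores a_{t,j}
    alpha_t = softmax(a_t)              # attention weights, sum to 1
    c_t = H_f @ alpha_t                 # context vector: weighted sum of source states
    p_t = softmax(W_hs @ np.concatenate([h_t_e, c_t]) + b_s)  # p_t^(e)
    return p_t, alpha_t, c_t

# Toy usage with random values, just to check the shapes.
d, F, V = 4, 6, 10
rng = np.random.default_rng(0)
p_t, alpha_t, _ = attention_step(rng.normal(size=(d, F)), rng.normal(size=d),
                                 rng.normal(size=(V, 2 * d)), rng.normal(size=V))
assert np.isclose(alpha_t.sum(), 1.0) and np.isclose(p_t.sum(), 1.0)
```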

Ways of Calculating Attention Scores

  • Dot product

    • advantage
      • no additional parameters
      • efficient to compute: the scores for all source words come from a single matrix operation over the concatenated encoder states (GPU-friendly)
    • disadvantage
      • forces the input and output encodings to lie in the same space
      • score magnitudes grow with the size of the hidden vector, so the softmax distribution becomes peakier for larger hidden sizes
  • Scaled dot product

    • advantage
      • dividing the dot product by the square root of the hidden size reduces its peakiness, making training more stable across hidden vector sizes
  • Bilinear functions

    • advantage
      • the linear transform allows the input and output encodings to lie in different spaces
    • disadvantage
      • introduces quite a few parameters; the weight matrix alone has size
        $$ \left|h^{(f)}\right|\times\left|h^{(e)}\right| $$
  • Multi-layer perceptrons

    • advantage
      • has fewer parameters than the bilinear method and generally provides good results (a sketch of these score functions follows this list)
  • Others

    • RNN
    • Tree-structured networks
    • CNN
    • Structured models
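
Below is a hedged sketch of the four parametric choices above (dot product, scaled dot product, bilinear, MLP) written as plain numpy functions; the parameter names and the particular MLP form are assumptions, chosen so that encoder and decoder states may have different sizes.

```python
import numpy as np

def dot_score(h_f, h_e):
    # Requires |h_f| == |h_e|: both encodings must live in the same space.
    return h_f @ h_e

def scaled_dot_score(h_f, h_e):
    # Dividing by sqrt(d) keeps score magnitudes (and softmax peakiness) stable as d grows.
    return (h_f @ h_e) / np.sqrt(h_e.shape[-1])

def bilinear_score(h_f, h_e, W):
    # W has shape (|h_f|, |h_e|), i.e. |h_f| x |h_e| extra parameters.
    return h_f @ W @ h_e

def mlp_score(h_f, h_e, W_f, W_e, w_out):
    # Small MLP: project both states, apply a nonlinearity, then a single output weight vector.
    return w_out @ np.tanh(W_f @ h_f + W_e @ h_e)

# Toy usage with different encoder/decoder sizes (not possible for the plain dot product).
df, de, da = 5, 4, 3
rng = np.random.default_rng(0)
h_f, h_e = rng.normal(size=df), rng.normal(size=de)
print(bilinear_score(h_f, h_e, rng.normal(size=(df, de))))
print(mlp_score(h_f, h_e, rng.normal(size=(da, df)), rng.normal(size=(da, de)), rng.normal(size=da)))
```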

Copying and Unknown Word Replacement

  • map the output back to the input: replace each generated “unk” token with the source word that received the highest attention weight at that time step (a sketch follows this list)
  • alternatively, use an alignment model to obtain a translation dictionary based on P(e|f), where f is a word in the source sentence, and output that word’s most probable translation instead of a direct copy
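
A minimal sketch of this replacement step is below; the helper name `replace_unks` and the optional `dictionary` argument (mapping a source word f to its most probable translation under P(e|f), e.g. from an external aligner) are assumptions for illustration.

```python
import numpy as np

def replace_unks(output_tokens, attention_weights, source_tokens, dictionary=None):
    """Replace each "unk" in the output with the source word it attended to most.

    output_tokens     : list of generated target tokens
    attention_weights : (T_out, T_src) array, row t holds alpha_t
    source_tokens     : list of source tokens
    dictionary        : optional dict mapping source word f -> argmax_e P(e|f)
    """
    fixed = []
    for t, tok in enumerate(output_tokens):
        if tok == "unk":
            j = int(np.argmax(attention_weights[t]))  # most-attended source position
            src = source_tokens[j]
            # Copy the source word directly, or translate it with the dictionary if available.
            fixed.append(dictionary.get(src, src) if dictionary else src)
        else:
            fixed.append(tok)
    return fixed

# Toy usage: the "unk" at step 2 attends most to source position 1 ("maison").
print(replace_unks(["the", "red", "unk"],
                   np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.2, 0.7],
                             [0.1, 0.8, 0.1]]),
                   ["la", "maison", "rouge"],
                   {"maison": "house"}))
# -> ['the', 'red', 'house']
```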

Intuitive Priors on Attention

  • Purpose: improve the accuracy of the attention estimates by incorporating prior probabilities

  • Position Bias

    • intuition: if the source and target languages have similar word order, alignments should fall roughly along the diagonal
  • Markov Condition

    • intuition: if two words are contiguous in the target, the words they align to in the source are also likely to be contiguous
    • basic idea: discourage large jumps and encourage local steps in attention
    • instance: local attention model
  • Fertility

    • each source word tends to be translated into a certain number of target words (its fertility)
    • intuition: penalize source words that receive too little or too much total attention (a sketch of the position-bias and fertility priors follows this list)
  • Bilingual Symmetry

    • encourage the source-to-target and target-to-source attention to agree (alignment symmetry)
    • reported to be the most effective of the priors above
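
The exact formulations of these priors vary across papers; below is a hedged numpy sketch of two of them (a position bias added to the scores before the softmax, and a fertility-style penalty on total attention per source word), where the specific bias and penalty forms are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

def position_biased_attention(scores, t, T_out, T_src, strength=1.0):
    # Position bias: favor source positions near the diagonal j ~ t * T_src / T_out.
    expected_j = t * T_src / max(T_out, 1)
    bias = -strength * (np.arange(T_src) - expected_j) ** 2
    return softmax(scores + bias)

def fertility_penalty(attention_matrix, target_fertility=1.0):
    # Penalize source words whose total attention over all output steps is
    # far from the expected fertility (too little or too much attention).
    coverage = attention_matrix.sum(axis=0)   # total attention per source word
    return np.sum((coverage - target_fertility) ** 2)

# Toy usage: 3 output steps attending over 4 source words with flat base scores.
alpha = np.vstack([position_biased_attention(np.zeros(4), t, 3, 4) for t in range(3)])
print(alpha.round(2))
print(fertility_penalty(alpha))
```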

Further Reading

  • Hard Attention

    • make a discrete decision about whether to focus on a source word or not
  • Supervised Training of Attention

    • penalize the model when its attention disagrees with gold-standard alignments
  • Other Ways of Memorizing Input

    • memory networks

More Papers

  • structured attention networks