Problems of Representation in Encoder-Decoders
- problems
- long-distance dependencies
- storing the information of arbitrarily long sentences in a single hidden vector of fixed size
Attention
- steps:
- Hidden Matrix
$$ \overrightarrow h_j^{(f)}=RNN(embed(f_j),\;\overrightarrow h_{j-1}^{(f)}) $$
$$ \overleftarrow h_j^{(f)}=RNN(embed(f_j),\;\overleftarrow h_{j+1}^{(f)}) $$
$$ h_j^{(f)}\;=\lbrack\overleftarrow h_j^{(f)},\overrightarrow h_j^{(f)}\rbrack $$
$$ H^{(f)}=\mathrm{concat\_col}(h_1^{(f)},\;\dots,\;h_{\left|F\right|}^{(f)}) $$
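A minimal NumPy sketch of the bidirectional encoder above, with a plain Elman cell standing in for the RNN; the function names (`rnn_step`, `encode`), random parameters, and dimensions are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def rnn_step(x, h_prev, W, U, b):
    """One simple (Elman) RNN step; stands in for any RNN/LSTM/GRU cell."""
    return np.tanh(W @ x + U @ h_prev + b)

def encode(embeddings, d_h, rng):
    """Run forward and backward RNNs and stack [backward; forward] per source word."""
    d_e = embeddings.shape[1]
    # Illustrative random parameters; a real model would learn these.
    Wf, Uf, bf = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
    Wb, Ub, bb = rng.normal(size=(d_h, d_e)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)

    n = len(embeddings)
    fwd, bwd = [None] * n, [None] * n
    h = np.zeros(d_h)
    for j in range(n):                      # left-to-right pass
        h = rnn_step(embeddings[j], h, Wf, Uf, bf)
        fwd[j] = h
    h = np.zeros(d_h)
    for j in reversed(range(n)):            # right-to-left pass
        h = rnn_step(embeddings[j], h, Wb, Ub, bb)
        bwd[j] = h
    # h_j^(f) = [backward; forward]; H^(f) has one column per source word.
    H = np.stack([np.concatenate([bwd[j], fwd[j]]) for j in range(n)], axis=1)
    return H                                # shape: (2*d_h, |F|)

rng = np.random.default_rng(0)
src_embeddings = rng.normal(size=(5, 8))    # |F| = 5 source words, embedding size 8
H_f = encode(src_embeddings, d_h=6, rng=rng)
print(H_f.shape)                            # (12, 5)
```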
- attention vector $\alpha_t$ (non-negative weights over the source words that sum to 1)
$$ c_t=H^{(f)}\alpha_t $$
- intuition: how much we are “focusing” on a particular source word at a particular time step
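A small sketch of $c_t=H^{(f)}\alpha_t$ with made-up numbers, just to show that $\alpha_t$ is a softmax over source positions and $c_t$ is the corresponding weighted average of the encoder columns.

```python
import numpy as np

H_f = np.random.randn(12, 5)                      # columns h_1^(f) ... h_5^(f)
scores = np.random.randn(5)                       # unnormalized attention scores a_t
alpha_t = np.exp(scores) / np.exp(scores).sum()   # softmax: non-negative, sums to 1
c_t = H_f @ alpha_t                               # context vector: weighted average of source states
print(alpha_t.sum(), c_t.shape)                   # ~1.0, (12,)
```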
- Calculating Attention Scores
$$ h_t^{(e)}=enc(\lbrack embed(e_{t-1});c_{t-1}\rbrack,\;h_{t-1}^{(e)}) $$
$$ a_{t,j}\;=\;\mathrm{attn\_score}(h_j^{(f)},\;h_t^{(e)}) $$
$$ \alpha_t\;=\;softmax(a_t) $$
$$ p_t^{(e)}\;=\;softmax(W_{hs}\lbrack h_t^{(e)};c_t\rbrack\;+b_s) $$
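A toy walk-through of one decoder step under the equations above, assuming a dot-product `attn_score`, a simple tanh cell in place of `enc`, and random placeholder weights; the vocabulary size and dimensions are invented for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, V = 12, 50                              # hidden size and target vocab size (assumed)
H_f = rng.normal(size=(d, 5))              # encoder matrix H^(f) from the previous step
W_dec = rng.normal(size=(d, 2 * d)) * 0.1  # toy decoder weights
W_hs = rng.normal(size=(V, 2 * d)) * 0.1   # output projection over [h_t; c_t]
b_s = np.zeros(V)

h_prev = np.zeros(d)                       # h_{t-1}^(e)
c_prev = np.zeros(d)                       # c_{t-1}
e_prev = rng.normal(size=d)                # embed(e_{t-1})

# h_t^(e) = enc([embed(e_{t-1}); c_{t-1}], h_{t-1}^(e))  -- toy cell
h_t = np.tanh(W_dec @ np.concatenate([e_prev, c_prev]) + h_prev)
a_t = H_f.T @ h_t                          # attn_score = dot product with each column of H^(f)
alpha_t = softmax(a_t)
c_t = H_f @ alpha_t
p_t = softmax(W_hs @ np.concatenate([h_t, c_t]) + b_s)
print(p_t.argmax(), p_t.sum())             # predicted word id, ~1.0
```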
Ways of Calculating Attention Scores
Dot product
- advantage
- no additional parameters
- efficient: concatenating the source hidden vectors into a matrix lets all scores be computed with one matrix multiplication (GPU-friendly)
- disadvantage
- forces the input and output encodings to lie in the same space
- the score magnitude depends on the size of the hidden vector, and larger vectors make the softmax distribution peakier
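For concreteness, a dot-product scoring sketch (sizes are assumed): one matrix–vector product scores every source word at once.

```python
import numpy as np

H_f = np.random.randn(12, 5)   # source hidden vectors as columns
h_t = np.random.randn(12)      # decoder hidden state (must have the same size)
a_t = H_f.T @ h_t              # all dot-product scores in one matrix-vector product
print(a_t.shape)               # (5,)
```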
Scaled dot product
- advantage
- dividing the score by the square root of the hidden-vector size counteracts the peakiness of the plain dot product, making training more stable across hidden vector sizes
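A sketch of the scaling, with the same toy sizes as above: dividing by the square root of the hidden size keeps the scores roughly O(1), so the softmax is less peaky for large hidden vectors.

```python
import numpy as np

d = 12
H_f, h_t = np.random.randn(d, 5), np.random.randn(d)
a_plain  = H_f.T @ h_t
a_scaled = a_plain / np.sqrt(d)    # divide by sqrt(hidden size) to keep scores O(1)
print(a_plain.std(), a_scaled.std())
```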
Bilinear functions
- advantage
- the learned linear transform allows the input and output encodings to lie in different spaces
- disadvantage
- introduces quite a few extra parameters: the matrix has size
$$ \left|h^{(f)}\right|\times\left|h^{(e)}\right| $$
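A bilinear-score sketch with assumed (and deliberately different) source and target hidden sizes; `W` holds the $\left|h^{(f)}\right|\times\left|h^{(e)}\right|$ extra parameters.

```python
import numpy as np

d_f, d_e = 12, 8                     # source and target hidden sizes may differ
H_f = np.random.randn(d_f, 5)
h_t = np.random.randn(d_e)
W   = np.random.randn(d_f, d_e)      # the |h^(f)| x |h^(e)| parameter matrix
a_t = H_f.T @ (W @ h_t)              # bilinear score h_j^(f)' W h_t^(e) for every j
print(a_t.shape)                     # (5,)
```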
Multi-layer perceptrons
- advantage
- fewer parameters than the bilinear method, and generally provides good results.
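A sketch of an MLP score of the common form $w_2^{\top}\tanh(W_1\lbrack h_t^{(e)};h_j^{(f)}\rbrack)$, with an assumed small attention hidden size `d_a`; the parameter names are placeholders.

```python
import numpy as np

d_f, d_e, d_a = 12, 8, 16            # d_a: small attention hidden size
H_f = np.random.randn(d_f, 5)
h_t = np.random.randn(d_e)
W1  = np.random.randn(d_a, d_f + d_e)
w2  = np.random.randn(d_a)
# Score each source position j: w2' tanh(W1 [h_t^(e); h_j^(f)])
a_t = np.array([w2 @ np.tanh(W1 @ np.concatenate([h_t, H_f[:, j]]))
                for j in range(H_f.shape[1])])
print(a_t.shape)                     # (5,)
```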
Others
- RNN
- Tree-structured networks
- CNN
- Structured models
Copying and Unknown Word Replacement
- when the output contains an “unk” token, replace it with the source word that received the highest attention at that time step
- use an alignment model to obtain a translation dictionary P(e|f), where f is the attended source word, and output its most probable translation (or copy f through if it has no dictionary entry)
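A hypothetical end-to-end illustration of unknown-word replacement; the sentences, attention weights, and dictionary below are all made up.

```python
# Hypothetical example: replace "<unk>" outputs using attention and a dictionary.
src_words = ["le", "chat", "noir", "dort"]
out_words = ["the", "<unk>", "cat", "sleeps"]
# alpha[t][j]: attention that output position t put on source position j (made up here)
alpha = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],   # the <unk> attended mostly to "noir"
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Translation dictionary built from an alignment model, argmax_e P(e|f):
dictionary = {"noir": "black"}

fixed = []
for t, w in enumerate(out_words):
    if w == "<unk>":
        j = max(range(len(src_words)), key=lambda j: alpha[t][j])  # most-attended source word
        f = src_words[j]
        w = dictionary.get(f, f)   # use the dictionary translation, else copy the source word
    fixed.append(w)
print(" ".join(fixed))             # the black cat sleeps
```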
Intuitive Priors on Attention
Purpose: improve the accuracy of the estimated attention by incorporating prior probabilities
Position Bias
- intuition: when the two languages have similar word order, the alignments (and thus the attention) should fall roughly along the diagonal
Markov Condition
- intuition: if two words are contiguous in the target, the source words they align to are likely to be contiguous as well
- basic idea: discourage large jumps and encourage local, step-by-step movements of the attention
- example: the local attention model (see the sketch below)
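A toy sketch in the spirit of local attention: the soft weights are damped by a Gaussian centred on a predicted focus position `p_t`, which encodes the position/locality prior (the final renormalization is only for illustration; the scores and window width are made up).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.random.randn(7)               # toy scores for 7 source words at one step
p_t, sigma = 3.0, 1.0                     # predicted focus position and window width
positions = np.arange(len(scores))
# Multiply the softmax weights by a Gaussian centred on p_t: far-away positions
# are damped, so the attention prefers small, local moves.
alpha = softmax(scores) * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
alpha = alpha / alpha.sum()               # renormalize (for illustration only)
print(alpha.round(3))
```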
Fertility
- specific mapping: some words in one language are consistently translated into a certain number of words in the other (their fertility)
- intuition: penalize source words that receive too little or too much total attention (see the sketch below)
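A sketch of a fertility/coverage-style penalty, assuming for simplicity that each source word should receive roughly one unit of total attention; the attention matrix is made up.

```python
import numpy as np

# alpha[t, j]: attention on source word j at output step t (toy values).
alpha = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1,  0.1 ],
    [0.7, 0.2,  0.1 ],
])
coverage = alpha.sum(axis=0)              # total attention each source word received
# Penalize words attended far more or less than once (fertility ~ 1 assumed here);
# adding this term to the training loss discourages over- and under-translation.
fertility_penalty = np.sum((1.0 - coverage) ** 2)
print(coverage, fertility_penalty)
```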
Bilingual Symmetry
- alignment symmetry: attention computed source-to-target and target-to-source should roughly agree
- reported to be effective among the priors above
Further Reading
Hard Attention
- make a hard, discrete decision about which source position(s) to focus on, instead of a soft weighting
Supervised Training of Attention
- add a loss that penalizes the model when its attention disagrees with gold-standard alignments
Other Ways of Memorizing Input
- memory networks
More papers
- structured attention networks