Encoder-decoder Models
- formulas (sketched in code after the equations)
$$ m_t^{(f)}=M_{\cdot,f_t}^{(f)} $$
$$ h_t^{(f)}=\begin{cases}RNN^{(f)}(m_t^{(f)},h_{t-1}^{(f)}) & \text{if }t\geq1 \\ 0 & \text{otherwise}\end{cases} $$
$$ m_t^{(e)}=M_{\cdot,e_{t-1}}^{(e)} $$
$$ h_t^{(e)}=\begin{cases}RNN^{(e)}(m_t^{(e)},h_{t-1}^{(e)}) & \text{if }t\geq1 \\ h_{\left|F\right|}^{(f)} & \text{otherwise}\end{cases} $$
$$ p_t^{(e)}=softmax(W_{hs}h_t^{(e)}+b_s) $$
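A minimal numpy sketch of the equations above, assuming a vanilla tanh RNN cell; the parameter names (`W_xh`, `W_hh`, `b_h`, the `enc`/`dec` parameter dicts) are illustrative, not from any particular library.

```python
import numpy as np

def rnn_step(W_xh, W_hh, b_h, m_t, h_prev):
    # Vanilla RNN cell: h_t = tanh(W_xh m_t + W_hh h_{t-1} + b_h)
    return np.tanh(W_xh @ m_t + W_hh @ h_prev + b_h)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode(F, M_f, enc):
    # Run the encoder RNN over the source word ids F; h_0^(f) = 0.
    h = np.zeros(enc["W_hh"].shape[0])
    for f_t in F:
        m_t = M_f[:, f_t]                       # m_t^(f) = M^(f)_{., f_t}
        h = rnn_step(enc["W_xh"], enc["W_hh"], enc["b_h"], m_t, h)
    return h                                    # h_{|F|}^(f), used as h_0^(e)

def decode_step(e_prev, h_prev, M_e, dec, W_hs, b_s):
    # One decoder step: embed the previous output word, update the state,
    # and compute the output distribution p_t^(e) = softmax(W_hs h_t^(e) + b_s).
    m_t = M_e[:, e_prev]
    h_t = rnn_step(dec["W_xh"], dec["W_hh"], dec["b_h"], m_t, h_prev)
    p_t = softmax(W_hs @ h_t + b_s)
    return p_t, h_t
```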
Generating Output
Random Sampling
- usage: get a variety of outputs for a particular input (e.g., in a dialogue system)
- Ancestral sampling: at each step, sample a word from $$ P(e_t\vert F,\widehat e_1^{t-1}) $$
- calculate the sentence probability
$$ P\left(\widehat E\vert F\right)=\prod_{t=1}^{\left|\widehat E\right|}P({\widehat e}_t\vert F,\widehat E_1^{t-1}) $$
- problem: numerical precision, so take the log of each term and sum instead (as in the sketch below)
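A sketch of ancestral sampling, reusing the hypothetical `encode`/`decode_step` helpers from the encoder-decoder sketch above; the `bos`/`eos` ids and `max_len` are illustrative.

```python
import numpy as np

def sample_translation(F, params, bos=1, eos=2, max_len=100):
    h = encode(F, params["M_f"], params["enc"])        # h_0^(e) = h_{|F|}^(f)
    e_prev, out, log_prob = bos, [], 0.0
    for _ in range(max_len):
        p_t, h = decode_step(e_prev, h, params["M_e"], params["dec"],
                             params["W_hs"], params["b_s"])
        e_t = int(np.random.choice(len(p_t), p=p_t))   # e_t ~ P(e_t | F, e_1^{t-1})
        log_prob += np.log(p_t[e_t])                   # sum logs rather than multiply probs
        if e_t == eos:
            break
        out.append(e_t)
        e_prev = e_t
    return out, log_prob                               # sampled E_hat and log P(E_hat | F)
```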
Greedy 1-best Search
- just like ancestral sampling, except that instead of sampling we pick the most probable word at each step:
$$ \widehat{e_t}\;=\;\underset i{argmax}\;P_{t,i}^{(e)} $$
- not guaranteed to find the translation with the highest overall probability, a typical weakness of greedy search (see the sketch below)
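For contrast with the sampling sketch, greedy 1-best decoding only changes the word-selection line (argmax instead of sampling); same hypothetical helpers and illustrative ids as above.

```python
import numpy as np

def greedy_translation(F, params, bos=1, eos=2, max_len=100):
    h = encode(F, params["M_f"], params["enc"])
    e_prev, out, log_prob = bos, [], 0.0
    for _ in range(max_len):
        p_t, h = decode_step(e_prev, h, params["M_e"], params["dec"],
                             params["W_hs"], params["b_s"])
        e_t = int(np.argmax(p_t))                      # e_t = argmax_i P_{t,i}^(e)
        log_prob += np.log(p_t[e_t])
        if e_t == eos:
            break
        out.append(e_t)
        e_prev = e_t
    return out, log_prob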
Beam Search
- heuristic search: expand the $$ b $$ best partial hypotheses at each time step
- pruning: discard all other hypotheses to keep the search tractable (see the sketch below)
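A minimal beam-search sketch under the same assumptions as the earlier decoding sketches (hypothetical `encode`/`decode_step` helpers, illustrative `bos`/`eos` ids); finished-hypothesis bookkeeping is simplified for brevity.

```python
import numpy as np

def beam_search(F, params, b=5, bos=1, eos=2, max_len=100):
    h0 = encode(F, params["M_f"], params["enc"])
    # Each hypothesis: (log probability, output word ids, previous word, decoder state)
    beam, finished = [(0.0, [], bos, h0)], []
    for _ in range(max_len):
        candidates = []
        for score, out, e_prev, h in beam:
            p_t, h_new = decode_step(e_prev, h, params["M_e"], params["dec"],
                                     params["W_hs"], params["b_s"])
            for e_t in np.argsort(p_t)[-b:]:           # expand only the b best next words
                e_t = int(e_t)
                candidates.append((score + np.log(p_t[e_t]), out + [e_t], e_t, h_new))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:b]:                    # prune to the b best hypotheses
            (finished if cand[2] == eos else beam).append(cand)
        if not beam:
            break
    best = max(finished + beam, key=lambda c: c[0])
    return best[1], best[0]                            # word ids (incl. eos), log probability
```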
Length normalization
- problem
- search tends to prefer shorter sentences, because every additional word multiplies in another probability $$ \leq1 $$
- beam search with a larger beam size has a significantly stronger length bias towards short sentences
- solution
- Prior knowledge: the length of the target sentence correlates with the length of the source sentence (Tree-to-Sequence Attentional Neural Machine Translation)
$$ P(|E|\;|\;|F|) $$
$$ \widehat E\;=\;\underset E{argmax}\;\log P(\left|E\right|\;\vert\;\left|F\right|)\;+\;\log P(E\;\vert\;F) $$
- how to get the prior? estimate it from length counts:
$$ P(\vert E\vert\;\vert\;\vert F\vert)\;=\;\frac{c(\vert E\vert,\;\vert F\vert)}{c(\left|F\right|)} $$
- or normalize by length, picking the hypothesis with the highest average log probability per word (both rules are sketched after this list):
$$ \widehat E\;=\;\underset E{argmax}\;\frac{\log P(E\;\vert\;F)}{\left|E\right|} $$
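Two rescoring sketches for the solutions above; `hyps` is assumed to be a list of `(word_ids, log_prob)` pairs from beam search, and `length_prior[(|E|, |F|)]` a hypothetical table of log priors estimated from the count ratio above.

```python
def rescore_with_prior(hyps, src_len, length_prior):
    # argmax_E  log P(|E| | |F|) + log P(E | F)
    return max(hyps, key=lambda h: length_prior[(len(h[0]), src_len)] + h[1])

def rescore_per_word(hyps):
    # argmax_E  log P(E | F) / |E|   (highest average log probability per word)
    return max(hyps, key=lambda h: h[1] / max(len(h[0]), 1))
```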
Bidirectional Encoders
- reverse encoder
- motivation: reversing the source puts its first words next to the first words of the output, which helps for language pairs with similar ordering (e.g., English-French)
- bi-directional encoder (more robust for typologically distinct language pairs); formulas and a sketch below
$$ \overrightarrow h_t^{(f)}=\begin{cases}\overrightarrow{RNN}^{(f)}(m_t^{(f)},\overrightarrow h_{t-1}^{(f)}) & \text{if }t\geq1 \\ 0 & \text{otherwise}\end{cases} $$
$$ \overleftarrow h_t^{(f)}=\begin{cases}\overleftarrow{RNN}^{(f)}(m_t^{(f)},\overleftarrow h_{t+1}^{(f)}) & \text{if }t\leq\left|F\right| \\ 0 & \text{otherwise}\end{cases} $$
- the two hidden-state sequences can be combined flexibly, e.g. to initialize the decoder:
$$ h_0^{(e)}=\tanh(W_1{\overrightarrow h}_{\vert F\vert}^{(f)}+W_2{\overleftarrow h}_1^{(f)}+b_e) $$
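A sketch of the bidirectional encoder equations, reusing the hypothetical `rnn_step` helper from the first sketch; the `fwd`/`bwd` parameter dicts are illustrative, while `W1`, `W2`, `b_e` follow the formula above.

```python
import numpy as np

def bi_encode(F, M_f, fwd, bwd, W1, W2, b_e):
    T = len(F)
    d = fwd["W_hh"].shape[0]
    h_fwd = np.zeros((T + 1, d))   # h_fwd[t]: forward state after reading word t
    h_bwd = np.zeros((T + 2, d))   # h_bwd[t]: backward state at word t (h_bwd[T+1] = 0)
    for t in range(1, T + 1):      # left to right
        h_fwd[t] = rnn_step(fwd["W_xh"], fwd["W_hh"], fwd["b_h"],
                            M_f[:, F[t - 1]], h_fwd[t - 1])
    for t in range(T, 0, -1):      # right to left
        h_bwd[t] = rnn_step(bwd["W_xh"], bwd["W_hh"], bwd["b_h"],
                            M_f[:, F[t - 1]], h_bwd[t + 1])
    # h_0^(e) = tanh(W1 h_fwd[|F|] + W2 h_bwd[1] + b_e)
    h0_dec = np.tanh(W1 @ h_fwd[T] + W2 @ h_bwd[1] + b_e)
    return h_fwd, h_bwd, h0_dec
```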
Sentence Embedding Methods
- Auto-encoding
- Semi-supervised Sequence Learning
- re-generate the input sentence from its encoding
- Language modeling
- Predicting context
- Skip-Thought Vectors
- predict the surrounding sentences
- fixed length embedding
- prevent overfitting
- pretraining
- different software
- fine-tune embedding
- increasing expressivity
- Predicting paraphrases
- Towards universal paraphrastic sentence embeddings
- similar sentences have similar embeddings (see the sketch below)
- PARANMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
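A minimal sketch of one simple model in this line of work (word averaging plus cosine similarity), which the paraphrastic-embedding papers report as a strong baseline; `M_e` (word embedding matrix) and the word ids are illustrative.

```python
import numpy as np

def avg_sentence_embedding(word_ids, M_e):
    # Columns of M_e are word vectors; the sentence embedding is their mean.
    return M_e[:, word_ids].mean(axis=1)

def cosine(u, v):
    # Paraphrases should receive a high cosine similarity.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```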
Predicting sentence features
Contextual embedding
Misc
Further Reading
Several studies on natural language and back-propagation
- first proposed the idea of performing translation using neural networks
Learning recursive distributed representations for holistic computation
- further expanded to recurrent networks
Recurrent Continuous Translation Models
- first example of fully neural models for translation
Sequence to Sequence Learning with Neural Networks
- popularized neural MT due to impressive empirical performance
Learning to Decode for Future Success
- about search
On the Properties of Neural Machine Translation: Encoder–Decoder Approaches