CMU 11-731 (MT & Seq2Seq): Seq2Seq Models

Encoder-decoder Models

  • formula
    $$ m_t^{(f)}=M_{\cdot,f_t}^{(f)} $$
    $$ h_t^{(f)}=\begin{cases}RNN^{(f)}(m_t^{(f)},h_{t-1}^{(f)})&\text{if }t\geq1\\0&\text{otherwise}\end{cases} $$
    $$ m_t^{(e)}=M_{\cdot,e_{t-1}}^{(e)} $$
    $$ h_t^{(e)}=\begin{cases}RNN^{(e)}(m_t^{(e)},h_{t-1}^{(e)})&\text{if }t\geq1\\h_{\left|F\right|}^{(f)}&\text{otherwise}\end{cases} $$
    $$ p_t^{(e)}=softmax(W_{hs}h_t^{(e)}+b_s) $$
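
The equations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the notes leave $$ RNN(\cdot) $$ unspecified, so a simple Elman-style tanh cell is assumed here; embeddings are stored as rows rather than columns for convenience; the weights are random and untrained; and all function and parameter names (`encode`, `decode_probs`, `init_params`, ...) are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rnn_step(cell, m, h_prev):
    """Assumed Elman update: tanh(W_x m + W_h h_prev + b)."""
    a, c = matvec(cell["W_x"], m), matvec(cell["W_h"], h_prev)
    return [math.tanh(ai + ci + bi) for ai, ci, bi in zip(a, c, cell["b"])]

def softmax(z):
    top = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def init_cell(rng, dim):
    return {"W_x": rand_matrix(rng, dim, dim),
            "W_h": rand_matrix(rng, dim, dim),
            "b": [0.0] * dim}

def init_params(rng, vocab_f, vocab_e, dim):
    return {
        "dim": dim,
        "M_f": rand_matrix(rng, vocab_f, dim),   # row i = embedding of source word i
        "M_e": rand_matrix(rng, vocab_e, dim),   # row i = embedding of target word i
        "rnn_f": init_cell(rng, dim),
        "rnn_e": init_cell(rng, dim),
        "W_hs": rand_matrix(rng, vocab_e, dim),
        "b_s": [0.0] * vocab_e,
    }

def encode(params, F):
    """Run the encoder over source word ids F and return h_{|F|}^{(f)}."""
    h = [0.0] * params["dim"]                    # h_0^{(f)} = 0
    for f_t in F:
        m = params["M_f"][f_t]                   # m_t^{(f)}
        h = rnn_step(params["rnn_f"], m, h)      # h_t^{(f)}
    return h

def decode_probs(params, h_F, prev_words):
    """Return p_t^{(e)} for each step, given the previous target words e_{t-1}."""
    h = h_F                                      # h_0^{(e)} = h_{|F|}^{(f)}
    probs = []
    for e_prev in prev_words:
        m = params["M_e"][e_prev]                # m_t^{(e)} looks up e_{t-1}
        h = rnn_step(params["rnn_e"], m, h)      # h_t^{(e)}
        scores = [s + b for s, b in zip(matvec(params["W_hs"], h), params["b_s"])]
        probs.append(softmax(scores))            # p_t^{(e)}
    return probs

rng = random.Random(0)
params = init_params(rng, vocab_f=5, vocab_e=6, dim=4)
h_F = encode(params, [1, 3, 2])
# Word id 0 plays the role of the sentence-start symbol in this toy setup.
probs = decode_probs(params, h_F, [0, 4, 2])
print(len(probs), len(probs[0]))
```

Note that the decoder at step t consumes the previous word $$ e_{t-1} $$, which is why `decode_probs` takes the shifted target sequence as input.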

Generating Output

  • Random Sampling

    • usage: get a variety of outputs for a particular input (e.g., in a dialogue system)
    • ancestral sampling: sample each word, one at a time, from the distribution $$ P(e_t\vert\widehat e_1^{t-1}) $$
    • calculate the sentence probability
      $$ P\left(\widehat E\vert F\right)=\prod_{t=1}^{\left|\widehat E\right|}P({\widehat e}_t\vert F,\widehat E_1^{t-1}) $$
      • problem: numerical precision (the product of many small probabilities underflows), so take the log of each term and add them together
  • Greedy 1-best Search

    • just like ancestral sampling, except that instead of sampling we pick the most probable word at each step:
      $$ \widehat{e_t}\;=\;\underset i{argmax}\;p_{t,i}^{(e)} $$
    • not guaranteed to find the translation with the highest probability (a common limitation of greedy algorithms)
  • Beam Search

    • pruning
    • heuristic search
  • Length normalization

    • problem
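      • the model tends to prefer shorter sentences, because every additional word multiplies in another probability less than or equal to 1
      • beam search with a larger beam size shows a significant length bias towards short sentences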
    • solution
      • prior knowledge: the length of a target sentence correlates with the length of the source sentence (Tree-to-Sequence Attentional Neural Machine Translation)
        $$ P(|E|\;|\;|F|) $$
        $$ \widehat E\;=\underset E{\;argmax}\;\log P(\left|E\right|\;\vert\;\left|F\right|)\;+\;\log P(E\;\vert\;F) $$
      • how to get the prior? estimate it from counts $$ c(\cdot) $$ of sentence lengths in the training corpus:
        $$ P(\vert E\vert\;\vert\;\vert F\vert)\;=\;\frac{c(\vert E\vert,\;\vert F\vert)}{c(\left|F\right|)} $$
      • or normalize by length, choosing the sentence with the highest average log probability per word:
        $$ \widehat E\;=\underset E{\;argmax}\;\frac{\log P(E\;\vert\;F)}{\left|E\right|} $$
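
The generation strategies in this section can be compared on a toy example. The sketch below is a hedged illustration, not the course's implementation: `TABLE` is a made-up next-word distribution (the source sentence F is folded into it), the beam search is a simplified variant that sets finished hypotheses aside as soon as they end, the sentence length counts the end-of-sentence symbol, and all names are hypothetical.

```python
import math
import random
from collections import Counter

EOS = "</s>"

# Hypothetical next-word distributions P(e_t | prefix) for a fixed source sentence.
TABLE = {
    (): {"a": 0.4, "b": 0.6},
    ("a",): {EOS: 0.9, "a": 0.05, "b": 0.05},
    ("b",): {"a": 0.4, "b": 0.35, EOS: 0.25},
}

def next_word_probs(prefix):
    # Any prefix not in the table ends the sentence with probability 1.
    return TABLE.get(tuple(prefix), {EOS: 1.0})

def sentence_log_prob(sentence):
    """Sum of per-word log probabilities instead of a product of probabilities."""
    return sum(math.log(next_word_probs(sentence[:t])[w])
               for t, w in enumerate(sentence))

def ancestral_sample(rng, max_len=10):
    out = []
    while len(out) < max_len and (not out or out[-1] != EOS):
        probs = next_word_probs(out)
        words = list(probs)
        out.append(rng.choices(words, weights=[probs[w] for w in words])[0])
    return out

def greedy_search(max_len=10):
    out = []
    while len(out) < max_len and (not out or out[-1] != EOS):
        probs = next_word_probs(out)
        out.append(max(probs, key=probs.get))  # argmax instead of sampling
    return out

def beam_search(beam_size, max_len=10):
    """Return finished hypotheses as (log_prob, sentence) pairs."""
    beam, finished = [(0.0, [])], []
    for _ in range(max_len):
        candidates = [(score + math.log(p), prefix + [w])
                      for score, prefix in beam
                      for w, p in next_word_probs(prefix).items()]
        # Pruning: keep only the beam_size best candidates at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, hyp in candidates[:beam_size]:
            (finished if hyp[-1] == EOS else beam).append((score, hyp))
        if not beam:
            break
    return finished

def best_raw(hyps):
    return max(hyps, key=lambda h: h[0])[1]

def best_length_normalized(hyps):
    # Highest average log probability per word.
    return max(hyps, key=lambda h: h[0] / len(h[1]))[1]

def length_prior(length_pairs):
    """Estimate P(|E| | |F|) = c(|E|, |F|) / c(|F|) from (|F|, |E|) pairs."""
    joint = Counter(length_pairs)
    marginal = Counter(f_len for f_len, _ in length_pairs)
    return {pair: n / marginal[pair[0]] for pair, n in joint.items()}

hyps = beam_search(beam_size=2)
print(greedy_search())               # ['b', 'a', '</s>'], probability 0.24
print(best_raw(hyps))                # ['a', '</s>'], probability 0.36
print(best_length_normalized(hyps))  # ['b', 'a', '</s>']
print(ancestral_sample(random.Random(0)))
```

In this toy table, greedy search commits to "b" and ends with probability 0.24, while a beam of size 2 recovers the higher-probability "a </s>" (0.36). Dividing the log probability by the length flips the preference back to the longer hypothesis, which illustrates how length normalization counteracts the bias towards short outputs.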

Bidirectional Encoders

  • reverse encoder
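    • motivation: for language pairs with similar word order (e.g., English-French), reading the source in reverse places the first source words close to the first target words
  • bi-directional encoder (more robust for typologically distinct languages)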
    $$ \overrightarrow h_t^{(f)}=\begin{cases}\overrightarrow{RNN}^{(f)}(m_t^{(f)},\overrightarrow h_{t-1}^{(f)})&\text{if }t\geq1\\0&\text{otherwise}\end{cases} $$
    $$ \overleftarrow h_t^{(f)}=\begin{cases}\overleftarrow{RNN}^{(f)}(m_t^{(f)},\overleftarrow h_{t+1}^{(f)})&\text{if }t\leq\left|F\right|\\0&\text{otherwise}\end{cases} $$

    • flexible combination of the two final hidden state vectors

      $$ h_0^{(e)}=tanh\;(W_1{\overrightarrow h}_{\vert F\vert}^{(f)}+W_2{\overleftarrow h}_1^{(f)}\;+\;b_e) $$
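
A bidirectional encoder and the combined initial decoder state can be sketched as follows. As with any such illustration, the details are assumptions: the RNN is taken to be a simple Elman-style tanh cell, the inputs are already-embedded word vectors, the weights are random, and the function names are hypothetical.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def init_cell(rng, dim):
    return {"W_x": rand_matrix(rng, dim, dim),
            "W_h": rand_matrix(rng, dim, dim),
            "b": [0.0] * dim}

def rnn_step(cell, m, h_prev):
    """Assumed Elman update: tanh(W_x m + W_h h_prev + b)."""
    a, c = matvec(cell["W_x"], m), matvec(cell["W_h"], h_prev)
    return [math.tanh(ai + ci + bi) for ai, ci, bi in zip(a, c, cell["b"])]

def run_rnn(cell, inputs, dim):
    """Return the hidden state after each input, starting from a zero state."""
    h, states = [0.0] * dim, []
    for m in inputs:
        h = rnn_step(cell, m, h)
        states.append(h)
    return states

def bidirectional_encode(fwd_cell, bwd_cell, embeddings, dim):
    forward = run_rnn(fwd_cell, embeddings, dim)
    # The backward RNN reads the sentence right to left; reversing its outputs
    # again aligns backward[t] with position t of the sentence.
    backward = run_rnn(bwd_cell, embeddings[::-1], dim)[::-1]
    return forward, backward

def initial_decoder_state(W_1, W_2, b_e, forward, backward):
    """h_0^{(e)} = tanh(W_1 fwd_{|F|} + W_2 bwd_1 + b_e)."""
    a, c = matvec(W_1, forward[-1]), matvec(W_2, backward[0])
    return [math.tanh(x + y + z) for x, y, z in zip(a, c, b_e)]

rng = random.Random(1)
dim = 3
fwd_cell, bwd_cell = init_cell(rng, dim), init_cell(rng, dim)
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
forward, backward = bidirectional_encode(fwd_cell, bwd_cell, sentence, dim)
W_1, W_2 = rand_matrix(rng, dim, dim), rand_matrix(rng, dim, dim)
h0 = initial_decoder_state(W_1, W_2, [0.0] * dim, forward, backward)
print(len(forward), len(backward), len(h0))
```

The two states fed into the decoder are the ones that have each seen the whole sentence: the last forward state and the first backward state.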

Sentence Embedding Methods

Further Reading
