Long Distance Dependencies in Language
- examples
  - grammatical constraints (e.g. subject–verb agreement across intervening words)
  - selectional preferences
  - topic / register
RNN
$$ h_t=\begin{cases}\tanh\left(W_{xh}x_t+W_{hh}h_{t-1}+b_h\right)&\text{if }t\geqslant 1\\0&\text{otherwise}\end{cases} $$
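A minimal sketch of this recurrence in PyTorch; the toy dimensions and variable names are placeholder choices, not from the notes:

```python
import torch

# Toy sizes, assumed for illustration.
input_size, hidden_size, seq_len = 4, 8, 5

# Parameters of the recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

x = torch.randn(seq_len, input_size)   # a toy input sequence
h = torch.zeros(hidden_size)           # h_0 = 0 for t < 1

for t in range(seq_len):
    h = torch.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print(h.shape)  # torch.Size([8])
```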
- language model using RNN
$$ m_t=M_{\cdot,e_{t-1}} $$
$$ h_t\;=\;RNN\left(m_t,\;h_{t-1}\right) $$
$$ p_t\;=\;softmax\left(W_{hs}h_t\;+\;b_s\right) $$
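Putting the three equations together, a sketch of an RNN language model; the class name, vocabulary size, and dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """m_t = M[:, e_{t-1}];  h_t = RNN(m_t, h_{t-1});  p_t = softmax(W_hs h_t + b_s)."""
    def __init__(self, vocab_size=1000, embed_size=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)   # the matrix M
        self.rnn = nn.RNNCell(embed_size, hidden_size)      # the tanh recurrence
        self.proj = nn.Linear(hidden_size, vocab_size)      # W_hs, b_s

    def forward(self, prev_tokens):
        # prev_tokens: (seq_len,) indices of the previous words e_{t-1}
        h = torch.zeros(self.rnn.hidden_size)
        log_probs = []
        for e_prev in prev_tokens:
            m = self.embed(e_prev)                                        # m_t
            h = self.rnn(m.unsqueeze(0), h.unsqueeze(0)).squeeze(0)       # h_t
            log_probs.append(torch.log_softmax(self.proj(h), dim=-1))     # log p_t
        return torch.stack(log_probs)                                     # (seq_len, vocab_size)

lm = RNNLM()
tokens = torch.tensor([3, 17, 42])    # toy token ids
print(lm(tokens).shape)               # torch.Size([3, 1000])
```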
The Vanishing Gradient and Long Short-term Memory
- RNN problems
  - vanishing gradients
  - exploding gradients
  - gradient clipping (the standard fix for the exploding case; see the sketch after the derivation below)
- why do gradients vanish?
$$ \frac{\partial h_t}{\partial h_k}=\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}=\prod_{j=k+1}^{t}\tanh'\cdot W_{hh}\quad\text{where }\tanh'\leqslant 1 $$
  - multiplying many factors with tanh' <= 1 shrinks the gradient toward zero over long spans; if W_hh is large the same product can instead blow up (exploding gradients)
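The exploding case is usually handled by clipping the gradient norm. A minimal PyTorch sketch; the model, the placeholder loss, and the threshold of 5.0 are my own choices:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)              # any recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 1, 8)                                 # (seq_len, batch, input_size)
output, h_n = model(x)
loss = output.pow(2).mean()                               # placeholder loss

optimizer.zero_grad()
loss.backward()
# Rescale the whole gradient vector if its norm exceeds the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```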
- how to solve the vanishing side? -> LSTM (design the recurrence so that the derivative of the recurrent function is exactly one)
  - add a memory cell c_t with a purely additive update, so
$$ \frac{\partial c_t}{\partial c_{t-1}}=1 $$
  - add two gates
    - input gate i_t (how much of the candidate update to write)
    - output gate o_t (how much of the cell to expose as h_t)
  - key equation (sketched in code below)
$$ c_t=i_t\odot u_t+c_{t-1} $$
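A sketch of this early LSTM-style cell, with only the input and output gates and the purely additive cell update above; the class name and sizes are mine:

```python
import torch
import torch.nn as nn

class SimpleLSTMCell(nn.Module):
    """LSTM cell as described above: input gate, output gate, additive memory cell."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.update = nn.Linear(input_size + hidden_size, hidden_size)  # candidate u_t
        self.i_gate = nn.Linear(input_size + hidden_size, hidden_size)  # input gate i_t
        self.o_gate = nn.Linear(input_size + hidden_size, hidden_size)  # output gate o_t

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([x_t, h_prev], dim=-1)
        u_t = torch.tanh(self.update(z))
        i_t = torch.sigmoid(self.i_gate(z))
        o_t = torch.sigmoid(self.o_gate(z))
        c_t = i_t * u_t + c_prev          # dc_t/dc_{t-1} = 1: gradient flows through unchanged
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t

cell = SimpleLSTMCell(input_size=4, hidden_size=8)
h = c = torch.zeros(8)
for x_t in torch.randn(5, 4):             # a toy 5-step sequence
    h, c = cell(x_t, h, c)
```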
Other RNN Variants
- Add a forget gate f_t, so the cell can clear its memory when justified:
$$ c_t=i_t\odot u_t+f_t\odot c_{t-1} $$
  - problem
    - if f_t is near zero early in training, the model forgets everything before it has learned when forgetting helps
  - solution: initialize the forget-gate bias b_f to a large value (e.g. 1) so the gate starts open (a PyTorch sketch follows this list)
  - idea: the model can then learn to forget gradually during training
GRU
- fewer parameters than an LSTM
- no separate concept of a "cell": the hidden state itself carries the memory (see the GRU sketch after this list)
- Recurrent highway networks
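For the forget-gate bias trick above: in PyTorch's built-in nn.LSTM the gate parameters are packed in the order input, forget, cell, output, so the forget-gate slice of each bias vector can be set to 1 after construction. The sizes here are toy values:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32)

# Bias layout in nn.LSTM: [input gate | forget gate | cell | output gate], each of size hidden_size.
H = lstm.hidden_size
for name, bias in lstm.named_parameters():
    if "bias" in name:
        bias.data[H:2 * H].fill_(1.0)   # start training with the forget gate open
```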
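And a sketch of a GRU-style cell for comparison: two gates (update and reset), no separate memory cell, and fewer parameter matrices than an LSTM. All names and sizes are mine:

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """GRU: update gate z_t, reset gate r_t; the hidden state doubles as the memory."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.z_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.r_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        z = torch.sigmoid(self.z_gate(torch.cat([x_t, h_prev], dim=-1)))
        r = torch.sigmoid(self.r_gate(torch.cat([x_t, h_prev], dim=-1)))
        h_tilde = torch.tanh(self.cand(torch.cat([x_t, r * h_prev], dim=-1)))
        return (1 - z) * h_prev + z * h_tilde   # interpolate between old and new state

cell = GRUCellSketch(input_size=4, hidden_size=8)
h = torch.zeros(8)
for x_t in torch.randn(5, 4):
    h = cell(x_t, h)
```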
Stacked RNNs
- recurrent in time and deep across layers: the output sequence of one RNN layer is the input of the next
- higher layers can progressively extract more abstract features (e.g. POS -> voice/tense)
- problem
  - vanishing gradient problem in the vertical (depth) direction
- solution: residual networks
  - idea: add the output of the previous layer to the next layer's output (a skip connection)
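A sketch of the residual idea for stacked RNNs: each layer's output is added to its input, giving the gradient a direct path through the depth. The layer count and sizes are assumptions:

```python
import torch
import torch.nn as nn

class ResidualStackedRNN(nn.Module):
    """Stack of RNN layers with residual (skip) connections between layers."""
    def __init__(self, hidden_size=64, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.RNN(hidden_size, hidden_size, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden_size)
        for rnn in self.layers:
            out, _ = rnn(x)
            x = x + out        # residual connection in the vertical (depth) direction
        return x

model = ResidualStackedRNN()
x = torch.randn(2, 10, 64)     # toy batch of 2 sequences of length 10
print(model(x).shape)          # torch.Size([2, 10, 64])
```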
Online, Batch, and Minibatch Training
- online learning: perform an update on a single example at a time
- batch learning: compute the gradient over the entire dataset before each update
- minibatching: a middle ground between online and batch; update on small groups of examples
  - benefit
    - can use efficient vector processing instructions over the whole minibatch
- When batching, how do we handle sentences of different lengths?
  - sentence padding: append "eos" (padding) symbols to the shorter sentences
  - masking: mark the padded positions so they do not contribute to the loss
  - problem
    - a lot of computation is wasted on the padded symbols
  - solution
    - sort the sentences in the corpus by length, so each minibatch groups sentences of similar length and needs little padding
  - PyTorch: Matchbox (automatic minibatching)
    - benefit: write the model as if it processed one example at a time; batching is handled automatically
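A sketch tying sentence padding and masking together in PyTorch; the toy vocabulary, model, and targets are placeholders:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # index reserved for the padding symbol

# A toy minibatch of token-id sequences with different lengths.
batch = [torch.tensor([5, 2, 9, 4]), torch.tensor([7, 3]), torch.tensor([8, 1, 6])]
padded = pad_sequence(batch, batch_first=True, padding_value=PAD)   # (3, 4)
mask = (padded != PAD).float()                                      # 1 for real tokens, 0 for padding

embed = nn.Embedding(10, 16, padding_idx=PAD)
rnn = nn.RNN(16, 32, batch_first=True)
proj = nn.Linear(32, 10)

out, _ = rnn(embed(padded))                  # (3, 4, 32); padded steps are still computed
logits = proj(out)                           # (3, 4, 10)

# Masked loss: per-token cross-entropy, zeroed at the padded positions.
targets = padded                             # placeholder targets, just for the shapes
loss_per_tok = nn.functional.cross_entropy(
    logits.reshape(-1, 10), targets.reshape(-1), reduction="none"
).reshape_as(mask)
loss = (loss_per_tok * mask).sum() / mask.sum()
loss.backward()
```

PyTorch's pack_padded_sequence can additionally let the RNN skip the padded time steps entirely.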
Further Reading
- What can recurrent neural networks learn?
- RNN Regularization
  - why is standard dropout not a good idea for RNNs?
    - sampling a fresh mask at every time step repeatedly corrupts the hidden state, so after enough steps the network loses its ability to remember
  - how to generalize dropout to RNNs? -> recurrent dropout
    - sample a single dropout mask for each sentence and reuse it at every time step (sketched after this list)
- Regularizing and Optimizing LSTM Language Models (AWD-LSTM)
  - DropConnect on the hidden-to-hidden weight matrices (drop weights instead of activations)
  - recurrent dropout
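A minimal sketch of the per-sentence mask idea: sample one dropout mask when the sequence starts and reuse it at every time step, instead of resampling per step. Sizes, class name, and the dropout rate are placeholder choices:

```python
import torch
import torch.nn as nn

class VariationalDropoutRNN(nn.Module):
    """RNN whose hidden-state dropout mask is sampled once per sequence, not per step."""
    def __init__(self, input_size=16, hidden_size=32, p=0.3):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.p = p

    def forward(self, x):                       # x: (seq_len, batch, input_size)
        seq_len, batch, _ = x.shape
        h = x.new_zeros(batch, self.cell.hidden_size)
        # One mask per sentence in the batch, reused at every time step.
        mask = torch.bernoulli(
            x.new_full((batch, self.cell.hidden_size), 1 - self.p)
        ) / (1 - self.p)
        outputs = []
        for t in range(seq_len):
            h = self.cell(x[t], h)
            if self.training:
                h = h * mask                    # same mask at each step
            outputs.append(h)
        return torch.stack(outputs)

rnn = VariationalDropoutRNN()
out = rnn(torch.randn(10, 2, 16))               # (10, 2, 32)
```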
Other RNN architectures
Sensitivity to Hyper-parameters
Normalization
- Batch normalization: normalize each feature over the examples in the minibatch
- Layer normalization: normalize over the features of a single position
  - the variant most used in machine translation (batch statistics are awkward for variable-length sequences)
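A small sketch of layer normalization: statistics are computed over the feature dimension of each token, so it works the same way regardless of batch size or sequence length. The toy shapes are mine:

```python
import torch
import torch.nn as nn

hidden = torch.randn(3, 7, 32)                     # (batch, seq_len, hidden_size)

# Layer norm: statistics over the last (feature) dimension, separately for every token.
ln = nn.LayerNorm(32)
out = ln(hidden)                                   # same shape, (3, 7, 32)

# Essentially the same computation written out by hand (before the learned gain/bias).
mean = hidden.mean(dim=-1, keepdim=True)
var = hidden.var(dim=-1, keepdim=True, unbiased=False)
normed = (hidden - mean) / torch.sqrt(var + 1e-5)
print(torch.allclose(out, normed, atol=1e-4))      # True: gain=1, bias=0 at init
```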
Murmur
So busy recently …….