Long Distance Dependencies in Language
- examples
  - grammatical constraints (e.g. subject–verb agreement across intervening words)
  - selectional preferences
  - topic / register
RNN
$$ h_t=\begin{cases}\tanh\left(W_{xh}x_t+W_{hh}h_{t-1}+b_h\right)&\text{if }t\geqslant 1\\0&\text{otherwise}\end{cases} $$
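A minimal sketch of this recurrence in PyTorch; the toy dimensions and variable names are placeholder choices, not from the notes:

```python
import torch

# Toy sizes, assumed for illustration.
input_size, hidden_size, seq_len = 4, 8, 5

# Parameters of the recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

x = torch.randn(seq_len, input_size)   # a toy input sequence
h = torch.zeros(hidden_size)           # h_0 = 0 for t < 1

for t in range(seq_len):
    h = torch.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print(h.shape)  # torch.Size([8])
```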
- language model using RNN
$$ m_t=M_{\cdot,e_{t-1}} $$
$$ h_t\;=\;RNN\left(m_t,\;h_{t-1}\right) $$
$$ p_t\;=\;softmax\left(W_{hs}h_t\;+\;b_s\right) $$
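Putting the three equations together, a sketch of an RNN language model; the class name, vocabulary size, and dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """m_t = M[:, e_{t-1}];  h_t = RNN(m_t, h_{t-1});  p_t = softmax(W_hs h_t + b_s)."""
    def __init__(self, vocab_size=1000, embed_size=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)   # the matrix M
        self.rnn = nn.RNNCell(embed_size, hidden_size)      # the tanh recurrence
        self.proj = nn.Linear(hidden_size, vocab_size)      # W_hs, b_s

    def forward(self, prev_tokens):
        # prev_tokens: (seq_len,) indices of the previous words e_{t-1}
        h = torch.zeros(self.rnn.hidden_size)
        log_probs = []
        for e_prev in prev_tokens:
            m = self.embed(e_prev)                                        # m_t
            h = self.rnn(m.unsqueeze(0), h.unsqueeze(0)).squeeze(0)       # h_t
            log_probs.append(torch.log_softmax(self.proj(h), dim=-1))     # log p_t
        return torch.stack(log_probs)                                     # (seq_len, vocab_size)

lm = RNNLM()
tokens = torch.tensor([3, 17, 42])    # toy token ids
print(lm(tokens).shape)               # torch.Size([3, 1000])
```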
The Vanishing Gradient and Long Short-term Memory
- RNN problems
  - vanishing gradients
  - exploding gradients
  - gradient clipping (the standard fix for the exploding case; see the sketch after the derivation below)
- why do gradients vanish?
$$ \frac{\partial h_t}{\partial h_k}=\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}=\prod_{j=k+1}^{t}\tanh'\cdot W_{hh}\quad\text{where }\tanh'\leqslant 1 $$
  - multiplying many factors with tanh' <= 1 shrinks the gradient toward zero over long spans; if W_hh is large the same product can instead blow up (exploding gradients)
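The exploding case is usually handled by clipping the gradient norm. A minimal PyTorch sketch; the model, the placeholder loss, and the threshold of 5.0 are my own choices:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)              # any recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 1, 8)                                 # (seq_len, batch, input_size)
output, h_n = model(x)
loss = output.pow(2).mean()                               # placeholder loss

optimizer.zero_grad()
loss.backward()
# Rescale the whole gradient vector if its norm exceeds the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```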
- how to solve the vanishing side? -> LSTM (design the recurrence so that the derivative of the recurrent function is exactly one)
  - add a memory cell c_t with a purely additive update, so
$$ \frac{\partial c_t}{\partial c_{t-1}}=1 $$
  - add two gates
    - input gate i_t (how much of the candidate update to write)
    - output gate o_t (how much of the cell to expose as h_t)
  - key equation (sketched in code below)
$$ c_t=i_t\odot u_t+c_{t-1} $$
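A sketch of this early LSTM-style cell, with only the input and output gates and the purely additive cell update above; the class name and sizes are mine:

```python
import torch
import torch.nn as nn

class SimpleLSTMCell(nn.Module):
    """LSTM cell as described above: input gate, output gate, additive memory cell."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.update = nn.Linear(input_size + hidden_size, hidden_size)  # candidate u_t
        self.i_gate = nn.Linear(input_size + hidden_size, hidden_size)  # input gate i_t
        self.o_gate = nn.Linear(input_size + hidden_size, hidden_size)  # output gate o_t

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([x_t, h_prev], dim=-1)
        u_t = torch.tanh(self.update(z))
        i_t = torch.sigmoid(self.i_gate(z))
        o_t = torch.sigmoid(self.o_gate(z))
        c_t = i_t * u_t + c_prev          # dc_t/dc_{t-1} = 1: gradient flows through unchanged
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t

cell = SimpleLSTMCell(input_size=4, hidden_size=8)
h = c = torch.zeros(8)
for x_t in torch.randn(5, 4):             # a toy 5-step sequence
    h, c = cell(x_t, h, c)
```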
Other RNN Variants
- Add a forget gate f_t, so the cell can clear its memory when justified:
$$ c_t=i_t\odot u_t+f_t\odot c_{t-1} $$
  - problem
    - if f_t is near zero early in training, the model forgets everything before it has learned when forgetting helps
  - solution: initialize the forget-gate bias b_f to a large value (e.g. 1) so the gate starts open (a PyTorch sketch follows this list)
  - idea: the model can then learn to forget gradually during training
GRU
- fewer parameters than an LSTM
- no separate concept of a "cell": the hidden state itself carries the memory (see the GRU sketch after this list)
- Recurrent highway networks
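For the forget-gate bias trick above: in PyTorch's built-in nn.LSTM the gate parameters are packed in the order input, forget, cell, output, so the forget-gate slice of each bias vector can be set to 1 after construction. The sizes here are toy values:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32)

# Bias layout in nn.LSTM: [input gate | forget gate | cell | output gate], each of size hidden_size.
H = lstm.hidden_size
for name, bias in lstm.named_parameters():
    if "bias" in name:
        bias.data[H:2 * H].fill_(1.0)   # start training with the forget gate open
```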
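And a sketch of a GRU-style cell for comparison: two gates (update and reset), no separate memory cell, and fewer parameter matrices than an LSTM. All names and sizes are mine:

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """GRU: update gate z_t, reset gate r_t; the hidden state doubles as the memory."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.z_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.r_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev):
        z = torch.sigmoid(self.z_gate(torch.cat([x_t, h_prev], dim=-1)))
        r = torch.sigmoid(self.r_gate(torch.cat([x_t, h_prev], dim=-1)))
        h_tilde = torch.tanh(self.cand(torch.cat([x_t, r * h_prev], dim=-1)))
        return (1 - z) * h_prev + z * h_tilde   # interpolate between old and new state

cell = GRUCellSketch(input_size=4, hidden_size=8)
h = torch.zeros(8)
for x_t in torch.randn(5, 4):
    h = cell(x_t, h)
```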
Stacked RNNs
- recurrent in time and deep across layers: the output sequence of one RNN layer is the input of the next
- higher layers can progressively extract more abstract features (e.g. POS -> voice/tense)
- problem
  - vanishing gradient problem in the vertical (depth) direction
- solution: residual networks
  - idea: add the output of the previous layer to the next layer's output (a skip connection)
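A sketch of the residual idea for stacked RNNs: each layer's output is added to its input, giving the gradient a direct path through the depth. The layer count and sizes are assumptions:

```python
import torch
import torch.nn as nn

class ResidualStackedRNN(nn.Module):
    """Stack of RNN layers with residual (skip) connections between layers."""
    def __init__(self, hidden_size=64, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.RNN(hidden_size, hidden_size, batch_first=True) for _ in range(num_layers)]
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden_size)
        for rnn in self.layers:
            out, _ = rnn(x)
            x = x + out        # residual connection in the vertical (depth) direction
        return x

model = ResidualStackedRNN()
x = torch.randn(2, 10, 64)     # toy batch of 2 sequences of length 10
print(model(x).shape)          # torch.Size([2, 10, 64])
```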
Online, Batch, and Minibatch Training
- online learning: perform an update on a single example at a time
- batch learning: compute the gradient over the entire dataset before each update
- minibatching: a middle ground between online and batch; update on small groups of examples
  - benefit
    - can use efficient vector processing instructions over the whole minibatch
- When batching, how do we handle sentences of different lengths?
  - sentence padding: append "eos" (padding) symbols to the shorter sentences
  - masking: mark the padded positions so they do not contribute to the loss
  - problem
    - a lot of computation is wasted on the padded symbols
  - solution
    - sort the sentences in the corpus by length, so each minibatch groups sentences of similar length and needs little padding
  - PyTorch: Matchbox (automatic minibatching)
    - benefit: write the model as if it processed one example at a time; batching is handled automatically
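A sketch tying sentence padding and masking together in PyTorch; the toy vocabulary, model, and targets are placeholders:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # index reserved for the padding symbol

# A toy minibatch of token-id sequences with different lengths.
batch = [torch.tensor([5, 2, 9, 4]), torch.tensor([7, 3]), torch.tensor([8, 1, 6])]
padded = pad_sequence(batch, batch_first=True, padding_value=PAD)   # (3, 4)
mask = (padded != PAD).float()                                      # 1 for real tokens, 0 for padding

embed = nn.Embedding(10, 16, padding_idx=PAD)
rnn = nn.RNN(16, 32, batch_first=True)
proj = nn.Linear(32, 10)

out, _ = rnn(embed(padded))                  # (3, 4, 32); padded steps are still computed
logits = proj(out)                           # (3, 4, 10)

# Masked loss: per-token cross-entropy, zeroed at the padded positions.
targets = padded                             # placeholder targets, just for the shapes
loss_per_tok = nn.functional.cross_entropy(
    logits.reshape(-1, 10), targets.reshape(-1), reduction="none"
).reshape_as(mask)
loss = (loss_per_tok * mask).sum() / mask.sum()
loss.backward()
```

PyTorch's pack_padded_sequence can additionally let the RNN skip the padded time steps entirely.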
Further Reading
- What can recurrent neural networks learn?
- RNN Regularization
  - why is standard dropout not a good idea for RNNs?
    - sampling a fresh mask at every time step repeatedly corrupts the hidden state, so after enough steps the network loses its ability to remember
  - how to generalize dropout to RNNs? -> recurrent dropout
    - sample a single dropout mask for each sentence and reuse it at every time step (sketched after this list)
- Regularizing and Optimizing LSTM Language Models (AWD-LSTM)
  - DropConnect on the hidden-to-hidden weight matrices (drop weights instead of activations)
  - recurrent dropout
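A minimal sketch of the per-sentence mask idea: sample one dropout mask when the sequence starts and reuse it at every time step, instead of resampling per step. Sizes, class name, and the dropout rate are placeholder choices:

```python
import torch
import torch.nn as nn

class VariationalDropoutRNN(nn.Module):
    """RNN whose hidden-state dropout mask is sampled once per sequence, not per step."""
    def __init__(self, input_size=16, hidden_size=32, p=0.3):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.p = p

    def forward(self, x):                       # x: (seq_len, batch, input_size)
        seq_len, batch, _ = x.shape
        h = x.new_zeros(batch, self.cell.hidden_size)
        # One mask per sentence in the batch, reused at every time step.
        mask = torch.bernoulli(
            x.new_full((batch, self.cell.hidden_size), 1 - self.p)
        ) / (1 - self.p)
        outputs = []
        for t in range(seq_len):
            h = self.cell(x[t], h)
            if self.training:
                h = h * mask                    # same mask at each step
            outputs.append(h)
        return torch.stack(outputs)

rnn = VariationalDropoutRNN()
out = rnn(torch.randn(10, 2, 16))               # (10, 2, 32)
```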
Other RNN architectures
Sensitivity to Hyper-parameters
Normalization
- Batch normalization: normalize each feature over the examples in the minibatch
- Layer normalization: normalize over the features of a single position
  - the variant most used in machine translation (batch statistics are awkward for variable-length sequences)
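A small sketch of layer normalization: statistics are computed over the feature dimension of each token, so it works the same way regardless of batch size or sequence length. The toy shapes are mine:

```python
import torch
import torch.nn as nn

hidden = torch.randn(3, 7, 32)                     # (batch, seq_len, hidden_size)

# Layer norm: statistics over the last (feature) dimension, separately for every token.
ln = nn.LayerNorm(32)
out = ln(hidden)                                   # same shape, (3, 7, 32)

# Essentially the same computation written out by hand (before the learned gain/bias).
mean = hidden.mean(dim=-1, keepdim=True)
var = hidden.var(dim=-1, keepdim=True, unbiased=False)
normed = (hidden - mean) / torch.sqrt(var + 1e-5)
print(torch.allclose(out, normed, atol=1e-4))      # True: gain=1, bias=0 at init
```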
Murmur
So busy recently …….