Model Formulation
- probability of the next word given the previous n-1 words
$$ P(e_t \mid e_{t-n+1}^{t-1}) $$
Calculating Features
- feature functions
$$ \phi(e_{t-n+1}^{t-1}) = x \in \mathbb{R}^N $$
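As a concrete illustration, here is a minimal sketch of such a feature function, assuming the simplest case of one-hot indicators for each context-word position (the toy vocabulary and window size are purely hypothetical):

```python
import numpy as np

# Hypothetical toy vocabulary and context window (n-1 = 2 previous words).
vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3}
V = len(vocab)
n_minus_1 = 2

def phi(context):
    """Map the n-1 previous words to a feature vector x in R^N.

    Here N = (n-1) * |V|: one one-hot block per context position.
    """
    x = np.zeros(n_minus_1 * V)
    for i, word in enumerate(context[-n_minus_1:]):
        x[i * V + vocab[word]] = 1.0
    return x

x = phi(["the", "cat"])
print(x.shape)  # (8,) for this toy setup
```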
Calculating Scores
- bias vector (how likely each word in the vocabulary is overall)
$$ b \in \mathbb{R}^{|V|} $$
- weight matrix (relationship between feature values and scores)
$$ W \in \mathbb{R}^{|V| \times N} $$
- equation
$$ s = Wx + b $$
- special case: for one-hot/sparse vectors, just look up the columns of W for the features active in this instance, and add them together
$$ s = \sum_{j: x_j \neq 0} W_{\cdot,j}\, x_j + b $$
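A minimal sketch of the score computation under the same one-hot setup, with W and b initialized randomly purely for illustration; the second path shows the sparse lookup-and-sum special case:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 4, 8                             # toy vocabulary size and feature dimension
W = rng.normal(scale=0.1, size=(V, N))  # weight matrix in R^{|V| x N}
b = np.zeros(V)                         # bias vector in R^{|V|}

x = np.zeros(N)
x[[1, 6]] = 1.0                         # two active one-hot features

# Dense computation: s = Wx + b
s_dense = W @ x + b

# Sparse special case: only look up the columns for the active features
active = np.flatnonzero(x)
s_sparse = W[:, active] @ x[active] + b

print(np.allclose(s_dense, s_sparse))   # True
```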
Calculating Probabilities
- softmax
$$ p = \mathrm{softmax}(s) $$
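A small sketch of the softmax step; the max-subtraction is a standard numerical-stability trick, not something prescribed by the notes:

```python
import numpy as np

def softmax(s):
    """Turn a score vector s into a probability distribution p."""
    z = s - s.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1, -1.0]))
print(p, p.sum())          # non-negative entries summing to 1
```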
Learning Model Parameters
Loss function
- negative log likelihood
$$ -\sum_{E \in \mathcal{E}_{\mathrm{train}}} \log P(E; \theta) $$
- word level
$$ l(e_{t-n+1}^t, \theta) = -\log P(e_t \mid e_{t-n+1}^{t-1}; \theta) $$
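A sketch of the word-level loss, assuming `p` is the softmax output and `e_t` is the index of the observed next word; the corpus-level loss is just this summed over every position in the training data:

```python
import numpy as np

def word_nll(p, e_t):
    """l(e_{t-n+1}^t, theta) = -log P(e_t | context; theta)."""
    return -np.log(p[e_t])

p = np.array([0.7, 0.2, 0.1])   # toy predicted distribution
print(word_nll(p, 0))            # small loss: the observed word got high probability
print(word_nll(p, 2))            # larger loss: the observed word was unlikely
```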
Optimizer
- SGD
$$ \theta \leftarrow \theta - \eta \frac{d\, l(e_{t-n+1}^t, \theta)}{d\theta} $$
(a concrete update sketch follows this list)
- SGD with momentum (exponentially decaying average of past gradients)
- AdaGrad (frequently updated parameters get smaller updates; infrequently updated parameters get larger updates)
- Adam
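A minimal sketch of the vanilla SGD update applied to the bias vector, using the gradient p - onehot(e_t) derived in the next section; the learning rate here is an arbitrary illustrative value:

```python
import numpy as np

def sgd_step(theta, grad, eta=0.1):
    """theta <- theta - eta * dl/dtheta (vanilla SGD; eta is illustrative)."""
    return theta - eta * grad

b = np.zeros(3)                       # bias vector for a 3-word toy vocabulary
p = np.array([0.7, 0.2, 0.1])         # model's predicted distribution
onehot = np.array([0.0, 0.0, 1.0])    # observed next word e_t = index 2
grad_b = p - onehot                   # dl/db, derived in the next section
b = sgd_step(b, grad_b)
print(b)                              # bias for the observed word is pushed up
```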
reference: An overview of gradient descent optimization algorithms
- Tips for stable training
- Adjusting the learning rate (learning rate decay)
- Early stopping
- Shuffling training order (reduces bias from the order of the training examples)
Derivatives for Log-linear Models
Exercise:
$$ \frac{d\, l(e_{t-n+1}^t, W, b)}{db} = \frac{dl}{dp}\frac{dp}{ds}\frac{ds}{db} = \left(-\frac{1}{p}\right) \times \left[p\,(\mathrm{onehot}(e_t) - p)\right] \times 1 = p - \mathrm{onehot}(e_t) $$
$$ \frac{d\, l(e_{t-n+1}^t, W, b)}{dW_{\cdot,j}} = \frac{dl}{dp}\frac{dp}{ds}\frac{ds}{dW_{\cdot,j}} = \left(-\frac{1}{p}\right) \times \left[p\,(\mathrm{onehot}(e_t) - p)\right] \times x_j = x_j\,\left[p - \mathrm{onehot}(e_t)\right] $$
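A sketch that implements both gradients and checks one entry against a finite-difference estimate; the model sizes, feature vector, and tolerance are arbitrary illustrative choices:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def loss(W, b, x, e_t):
    """Word-level negative log likelihood for a log-linear LM."""
    p = softmax(W @ x + b)
    return -np.log(p[e_t])

rng = np.random.default_rng(0)
V, N = 5, 7
W = rng.normal(size=(V, N))
b = rng.normal(size=V)
x = np.zeros(N); x[[1, 4]] = 1.0      # sparse feature vector
e_t = 3                               # index of the observed next word

p = softmax(W @ x + b)
onehot = np.eye(V)[e_t]
grad_b = p - onehot                   # dl/db = p - onehot(e_t)
grad_W = np.outer(p - onehot, x)      # dl/dW_{.,j} = x_j * (p - onehot(e_t))

# Finite-difference check on one entry of b
eps = 1e-6
b_plus = b.copy(); b_plus[0] += eps
numeric = (loss(W, b_plus, x, e_t) - loss(W, b, x, e_t)) / eps
print(np.isclose(numeric, grad_b[0], atol=1e-4))  # True
```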
Other Features for Language Modeling
- Context word features
- Context class
- Brown Clustering
- Entry for each class
- Context suffix features
- e.g. “…ing” or other common suffixes
- Bag-of-words features
- lose positional information
- gain word co-occurrence tendency
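A minimal sketch of how these feature types could be combined into a single sparse feature map; the feature-name scheme, the suffix list, and the example class id are hypothetical, and real Brown clusters would come from a separate clustering step over the corpus:

```python
def extract_features(context, word_class=None, suffixes=("ing", "ed", "s")):
    """Return a dict of active (binary) features for the given context words.

    - context word features: identity of each previous word, with its position
    - context class features: e.g. a Brown cluster id for the previous word
    - context suffix features: common suffixes of the previous word
    - bag-of-words features: previous words without positional information
    """
    feats = {}
    for i, w in enumerate(context):
        feats[f"word_pos{i}={w}"] = 1.0       # positional context word feature
        feats[f"bow={w}"] = 1.0               # bag-of-words (position dropped)
    prev = context[-1]
    if word_class is not None:
        feats[f"class={word_class}"] = 1.0    # e.g. Brown cluster of the previous word
    for suf in suffixes:
        if prev.endswith(suf):
            feats[f"suffix={suf}"] = 1.0      # e.g. "...ing"
    return feats

print(extract_features(["the", "running"], word_class="C17"))
```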