CMU 11-731 (MT & Seq2Seq): Log-linear Language Models

Model Formulation

$$ P(e_t | e^{t-1}_{t-n+1}) $$

  • Calculating Features

    • feature functions
      $$ \phi(e_{t-n+1}^{t-1}) = x \in \mathbb{R}^N $$
  • Calculating Scores

    • bias vector (how likely each word in the vocabulary is overall)
      $$ b \in \mathbb{R}^{|V|} $$
    • weight matrix (relationship between feature values and scores)
      $$ W \in \mathbb{R}^{|V| \times N} $$
    • equation
      $$ s = Wx + b $$
      • special case: for one-hot/sparse feature vectors, just look up the weight columns for the features active in this instance and add them together (see the sketch after this list)
        $$ \sum_{j: x_j \neq 0} W_{\cdot,j} x_j + b $$
  • Calculating Probabilities

    • softmax
      $$ p = \mathrm{softmax}(s) $$
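
A minimal NumPy sketch of the computation above, putting features, scores, and probabilities together; the sizes and variable names are illustrative assumptions, not values from the lecture. The sparse special case is handled by summing only the columns of W for active features.

```python
import numpy as np

V = 1000   # vocabulary size |V| (illustrative)
N = 5000   # number of features N (illustrative)

W = np.zeros((V, N))   # weight matrix: one column of scores per feature
b = np.zeros(V)        # bias vector: overall likelihood of each word

def score_and_prob(active_features):
    """active_features: {feature index j: value x_j} for the nonzero entries
    of the sparse feature vector x = phi(context). Returns (s, p)."""
    # Special case for sparse x: add up the columns for active features only.
    s = b.copy()
    for j, x_j in active_features.items():
        s += W[:, j] * x_j
    # Softmax over scores, shifted by the max for numerical stability.
    exp_s = np.exp(s - s.max())
    p = exp_s / exp_s.sum()
    return s, p
```

With all-zero parameters this yields the uniform distribution over the vocabulary; learning (next section) adjusts W and b so that observed words receive higher probability.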

Learning Model Parameters

  • Loss function

    • negative log likelihood
      $$ -\sum_{E \in \mathcal{E}_{\mathrm{train}}} \log P(E; \theta) $$
      • word level
        $$ l(e_{t-n+1}^t, \theta) = -\log P(e_t \mid e_{t-n+1}^{t-1}; \theta) $$
  • Optimizer

    • SGD
      $$ \theta \leftarrow \theta - \eta \frac{d\, l(e_{t-n+1}^t, \theta)}{d\theta} $$
    • SGD with momentum (an exponentially decaying average of past gradients)
    • AdaGrad (frequently updated parameters get smaller updates, while infrequently updated parameters get larger updates)
    • Adam (roughly, momentum plus adaptive per-parameter learning rates; see the update-rule sketch below)

reference: An Overview of Gradient Descent Optimization Algorithms (Ruder, 2016)
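
As a sketch of the update rules listed above (in plain NumPy, with commonly used default hyperparameters rather than anything prescribed by the lecture):

```python
import numpy as np

def sgd(theta, grad, lr=0.1):
    # Vanilla SGD: step against the gradient of the word-level loss.
    return theta - lr * grad

def sgd_momentum(theta, grad, velocity, lr=0.1, gamma=0.9):
    # Momentum: exponentially decaying average of past gradients.
    velocity = gamma * velocity + lr * grad
    return theta - velocity, velocity

def adagrad(theta, grad, accum, lr=0.1, eps=1e-8):
    # Accumulated squared gradients shrink steps for frequently updated
    # parameters and keep them larger for rarely updated ones.
    accum = accum + grad ** 2
    return theta - lr * grad / (np.sqrt(accum) + eps), accum

def adam(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First/second moment estimates with bias correction (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```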

  • Tips for stable training (illustrated in the training-loop sketch after this list)
    • Adjusting the learning rate (learning rate decay)
    • Early stopping
    • Shuffling the training order (reduces bias from data ordering)
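
A rough sketch of how these tips might fit into a training loop; compute_grads and dev_loss_of are hypothetical helpers standing in for the gradient and held-out evaluation code:

```python
import random

def train(train_data, dev_data, params, lr=0.1, decay=0.5, patience=3, max_epochs=50):
    best_dev, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        random.shuffle(train_data)                 # shuffle training order to reduce ordering bias
        for example in train_data:
            grads = compute_grads(example, params) # hypothetical: gradients of the word-level NLL
            for name in params:
                params[name] -= lr * grads[name]   # plain SGD update
        dev_loss = dev_loss_of(dev_data, params)   # hypothetical: NLL on held-out data
        if dev_loss < best_dev:
            best_dev, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
            lr *= decay                            # learning rate decay when dev loss stops improving
            if bad_epochs >= patience:
                break                              # early stopping
    return params
```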

Derivatives for Log-linear Models

Exercise (a numerical gradient check of these results is sketched after the derivation):

$$ \frac{d\, l(e_{t-n+1}^t, W, b)}{d\, b} = \frac{dl}{dp}\frac{dp}{ds}\frac{ds}{db} = \left(-\frac{1}{p_{e_t}}\right) \times \left[p_{e_t}\left(\mathrm{onehot}(e_t) - p\right)\right] \times 1 = p - \mathrm{onehot}(e_t) $$
$$ \frac{d\, l(e_{t-n+1}^t, W, b)}{d\, W_{\cdot,j}} = \frac{dl}{dp}\frac{dp}{ds}\frac{ds}{dW_{\cdot,j}} = \left(-\frac{1}{p_{e_t}}\right) \times \left[p_{e_t}\left(\mathrm{onehot}(e_t) - p\right)\right] \times x_j = x_j \times \left[p - \mathrm{onehot}(e_t)\right] $$
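
These results can be sanity-checked with a finite-difference gradient check on b; a minimal sketch with small, arbitrary dimensions (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 4
W, b = rng.normal(size=(V, N)), rng.normal(size=V)
x = rng.normal(size=N)   # feature vector for one context
e_t = 2                  # index of the observed next word

def nll(b_vec):
    s = W @ x + b_vec
    p = np.exp(s - s.max()); p /= p.sum()
    return -np.log(p[e_t])

# Analytic gradient from the derivation: dl/db = p - onehot(e_t)
s = W @ x + b
p = np.exp(s - s.max()); p /= p.sum()
analytic = p.copy(); analytic[e_t] -= 1.0

# Numerical gradient by central differences
eps = 1e-6
numeric = np.zeros(V)
for i in range(V):
    b_plus, b_minus = b.copy(), b.copy()
    b_plus[i] += eps; b_minus[i] -= eps
    numeric[i] = (nll(b_plus) - nll(b_minus)) / (2 * eps)

print(np.abs(analytic - numeric).max())   # should be tiny (around 1e-9 or smaller)
```

The W gradient can be checked the same way, perturbing one entry of W at a time.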

Other Features for Language Modeling

  • Context word features
  • Context class
    • Brown Clustering
    • Entry for each class
  • Context suffix features
    • e.g. “…ing” or other common suffixes
  • Bag-of-words features (see the feature-extraction sketch after this list)
    • lose positional information
    • gain word co-occurrence tendencies
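
A sketch of a feature function that combines these feature types into a sparse vector, represented here as a {feature name: value} dict; the feature-name scheme and the word_to_class map are illustrative assumptions:

```python
def extract_features(context, word_to_class, suffixes=("ing", "ed", "ly", "s")):
    """context: list of the previous n-1 words, oldest first."""
    feats = {}
    for i, w in enumerate(reversed(context), start=1):   # i=1 is the closest word
        # Context word features: identity of each previous word, by position.
        feats[f"prev{i}={w}"] = 1.0
        # Context class features: Brown-cluster (or similar) class of the word.
        if w in word_to_class:
            feats[f"prev{i}_class={word_to_class[w]}"] = 1.0
        # Context suffix features, e.g. "...ing" and other common suffixes.
        for suf in suffixes:
            if w.endswith(suf):
                feats[f"prev{i}_suffix={suf}"] = 1.0
        # Bag-of-words features: word identity regardless of position
        # (loses positional information, gains co-occurrence tendencies).
        feats[f"bow={w}"] = feats.get(f"bow={w}", 0.0) + 1.0
    return feats

# Example: features of the context "the cat is chasing" (class map is made up)
print(extract_features(["the", "cat", "is", "chasing"], {"cat": "C17"}))
```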

Further Reading
