Model Formulation
- probability of the next word given the previous n-1 words
$$ P(e_t \mid e_{t-n+1}^{t-1}) $$
Calculating Features
- feature functions
$$ \phi(e_{t-n+1}^{t-1}) = x \in \mathbb{R}^N $$
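As a concrete illustration, here is a minimal sketch of such a feature function, assuming the simplest case of one-hot indicators for each context-word position (the toy vocabulary and window size are purely hypothetical):

```python
import numpy as np

# Hypothetical toy vocabulary and context window (n-1 = 2 previous words).
vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3}
V = len(vocab)
n_minus_1 = 2

def phi(context):
    """Map the n-1 previous words to a feature vector x in R^N.

    Here N = (n-1) * |V|: one one-hot block per context position.
    """
    x = np.zeros(n_minus_1 * V)
    for i, word in enumerate(context[-n_minus_1:]):
        x[i * V + vocab[word]] = 1.0
    return x

x = phi(["the", "cat"])
print(x.shape)  # (8,) for this toy setup
```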
Calculating Scores
- bias vector (how likely each word in the vocabulary is overall)
$$ b \in \mathbb{R}^{|V|} $$
- weight matrix (relationship between feature values and scores)
$$ W \in \mathbb{R}^{|V| \times N} $$
- equation
$$ s = Wx + b $$
- special case: for one-hot/sparse vectors, just look up the columns of W for the features active in this instance, and add them together
$$ s = \sum_{j: x_j \neq 0} W_{\cdot,j}\, x_j + b $$
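A minimal sketch of the score computation under the same one-hot setup, with W and b initialized randomly purely for illustration; the second path shows the sparse lookup-and-sum special case:

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 4, 8                             # toy vocabulary size and feature dimension
W = rng.normal(scale=0.1, size=(V, N))  # weight matrix in R^{|V| x N}
b = np.zeros(V)                         # bias vector in R^{|V|}

x = np.zeros(N)
x[[1, 6]] = 1.0                         # two active one-hot features

# Dense computation: s = Wx + b
s_dense = W @ x + b

# Sparse special case: only look up the columns for the active features
active = np.flatnonzero(x)
s_sparse = W[:, active] @ x[active] + b

print(np.allclose(s_dense, s_sparse))   # True
```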
Calculating Probabilities
- softmax
$$ p = \mathrm{softmax}(s) $$
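A small sketch of the softmax step; the max-subtraction is a standard numerical-stability trick, not something prescribed by the notes:

```python
import numpy as np

def softmax(s):
    """Turn a score vector s into a probability distribution p."""
    z = s - s.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1, -1.0]))
print(p, p.sum())          # non-negative entries summing to 1
```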
Learning Model Parameters
Loss function
- negative log likelihood
$$ -\sum_{E \in \mathcal{E}_{\mathrm{train}}} \log P(E; \theta) $$
- word level
$$ l(e_{t-n+1}^t, \theta) = -\log P(e_t \mid e_{t-n+1}^{t-1}; \theta) $$
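A sketch of the word-level loss, assuming `p` is the softmax output and `e_t` is the index of the observed next word; the corpus-level loss is just this summed over every position in the training data:

```python
import numpy as np

def word_nll(p, e_t):
    """l(e_{t-n+1}^t, theta) = -log P(e_t | context; theta)."""
    return -np.log(p[e_t])

p = np.array([0.7, 0.2, 0.1])   # toy predicted distribution
print(word_nll(p, 0))            # small loss: the observed word got high probability
print(word_nll(p, 2))            # larger loss: the observed word was unlikely
```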
Optimizer
- SGD
$$ \theta \leftarrow \theta - \eta \frac{d\, l(e_{t-n+1}^t, \theta)}{d\theta} $$
(a concrete update sketch follows this list)
- SGD with momentum (exponentially decaying average of past gradients)
- AdaGrad (frequently updated parameters get smaller updates; infrequently updated parameters get larger updates)
- Adam
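A minimal sketch of the vanilla SGD update applied to the bias vector, using the gradient p - onehot(e_t) derived in the next section; the learning rate here is an arbitrary illustrative value:

```python
import numpy as np

def sgd_step(theta, grad, eta=0.1):
    """theta <- theta - eta * dl/dtheta (vanilla SGD; eta is illustrative)."""
    return theta - eta * grad

b = np.zeros(3)                       # bias vector for a 3-word toy vocabulary
p = np.array([0.7, 0.2, 0.1])         # model's predicted distribution
onehot = np.array([0.0, 0.0, 1.0])    # observed next word e_t = index 2
grad_b = p - onehot                   # dl/db, derived in the next section
b = sgd_step(b, grad_b)
print(b)                              # bias for the observed word is pushed up
```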
reference: An overview of gradient descent optimization algorithms
- Tips for stable training
- Adjusting the learning rate (learning rate decay)
- Early stopping
- Shuffling training order (reduces bias from the order of the training examples)
Derivatives for Log-linear Models
Exercise:
$$ \frac{d\, l(e_{t-n+1}^t, W, b)}{db} = \frac{dl}{dp}\frac{dp}{ds}\frac{ds}{db} = \left(-\frac{1}{p}\right) \times \left[p\,(\mathrm{onehot}(e_t) - p)\right] \times 1 = p - \mathrm{onehot}(e_t) $$
$$ \frac{d\, l(e_{t-n+1}^t, W, b)}{dW_{\cdot,j}} = \frac{dl}{dp}\frac{dp}{ds}\frac{ds}{dW_{\cdot,j}} = \left(-\frac{1}{p}\right) \times \left[p\,(\mathrm{onehot}(e_t) - p)\right] \times x_j = x_j\,\left[p - \mathrm{onehot}(e_t)\right] $$
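A sketch that implements both gradients and checks one entry against a finite-difference estimate; the model sizes, feature vector, and tolerance are arbitrary illustrative choices:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def loss(W, b, x, e_t):
    """Word-level negative log likelihood for a log-linear LM."""
    p = softmax(W @ x + b)
    return -np.log(p[e_t])

rng = np.random.default_rng(0)
V, N = 5, 7
W = rng.normal(size=(V, N))
b = rng.normal(size=V)
x = np.zeros(N); x[[1, 4]] = 1.0      # sparse feature vector
e_t = 3                               # index of the observed next word

p = softmax(W @ x + b)
onehot = np.eye(V)[e_t]
grad_b = p - onehot                   # dl/db = p - onehot(e_t)
grad_W = np.outer(p - onehot, x)      # dl/dW_{.,j} = x_j * (p - onehot(e_t))

# Finite-difference check on one entry of b
eps = 1e-6
b_plus = b.copy(); b_plus[0] += eps
numeric = (loss(W, b_plus, x, e_t) - loss(W, b, x, e_t)) / eps
print(np.isclose(numeric, grad_b[0], atol=1e-4))  # True
```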
Other Features for Language Modeling
- Context word features
- Context class
- Brown Clustering
- Entry for each class
- Context suffix features
- e.g. “…ing” or other common suffixes
- Bag-of-words features
- lose positional information
- gain word co-occurrence tendency
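A minimal sketch of how these feature types could be combined into a single sparse feature map; the feature-name scheme, the suffix list, and the example class id are hypothetical, and real Brown clusters would come from a separate clustering step over the corpus:

```python
def extract_features(context, word_class=None, suffixes=("ing", "ed", "s")):
    """Return a dict of active (binary) features for the given context words.

    - context word features: identity of each previous word, with its position
    - context class features: e.g. a Brown cluster id for the previous word
    - context suffix features: common suffixes of the previous word
    - bag-of-words features: previous words without positional information
    """
    feats = {}
    for i, w in enumerate(context):
        feats[f"word_pos{i}={w}"] = 1.0       # positional context word feature
        feats[f"bow={w}"] = 1.0               # bag-of-words (position dropped)
    prev = context[-1]
    if word_class is not None:
        feats[f"class={word_class}"] = 1.0    # e.g. Brown cluster of the previous word
    for suf in suffixes:
        if prev.endswith(suf):
            feats[f"suffix={suf}"] = 1.0      # e.g. "...ing"
    return feats

print(extract_features(["the", "running"], word_class="C17"))
```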