
CMU 11-731 (MT & Seq2Seq) Algorithms for MT 2: Parameter Optimization Methods

Error Functions and Error Minimization

  • error function
    $\mathrm{error}(\mathcal{E}, \hat{\mathcal{E}})$, comparing the reference translations $\mathcal{E}$ with the hypothesized translations $\hat{\mathcal{E}}$

  • difficulty in directly optimizing the error function

    • a myriad of possible translations
    • the argmax used in decoding, and by corollary the error function, is not continuous => it is piecewise constant, so the gradient is zero almost everywhere and undefined at the jumps (see the sketch after this list)
  • how to overcome?

    • approximate the hypothesis space
    • easily calculable loss functions
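
A minimal sketch of the problem above (toy numbers, not from the notes): with a fixed set of candidate translations, the argmax hypothesis, and therefore the error, only changes at discrete values of a weight, so the error surface is a step function with zero gradient almost everywhere.

```python
# Toy illustration: error(lambda) is piecewise constant in the model weights.
import numpy as np

# hypothetical 3-hypothesis candidate list: (feature vector, sentence-level error)
nbest = [
    (np.array([1.0, 0.2]), 0.4),
    (np.array([0.5, 0.9]), 0.1),
    (np.array([0.1, 0.4]), 0.7),
]

def error_at(lam):
    """Error of the argmax hypothesis under weights lam."""
    scores = [feats @ lam for feats, _ in nbest]
    return nbest[int(np.argmax(scores))][1]

for w in np.linspace(0.0, 2.0, 9):
    # the printed error stays constant and jumps only where the argmax changes
    print(f"lambda_2={w:.2f}  error={error_at(np.array([1.0, w])):.1f}")
```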

Minimum Error Rate Training (MERT)

  • assume we are dealing with a linear model
  • work with only a subset of hypotheses (an n-best list)
  • efficient line-search method (see the sketch at the end of this section)

$S(F,E;\lambda) = \sum_i \lambda_i \phi_i(F,E)$ (a linear score used in place of $\log P(F,E)$)

  • outer loop

    • Generating hypotheses

      • using beam search: $F_i \rightarrow$ n-best list $\hat{\mathcal{E}}_i$

        where $\hat{E}_{i,j}$ is the $j$-th hypothesis in the n-best list

    • Adjusting parameters

      $\hat{\mathcal{E}}(\lambda) = \hat{E}(\lambda)_1, \hat{E}(\lambda)_2, \ldots, \hat{E}(\lambda)_{|\mathcal{E}|}$

      where $\hat{E}(\lambda)_i = \operatorname{argmax}_{\tilde{E} \in \hat{\mathcal{E}}_i} S(F_i, \tilde{E}; \lambda)$

      $\hat{\lambda} = \operatorname{argmin}_{\lambda} \mathrm{error}(\mathcal{E}, \hat{\mathcal{E}}(\lambda))$

  • inner loop (how to find the parameters $\hat{\lambda}$: line search)

    • Picking a direction(vector d)
      • one-hot vector
      • random vector
      • vector calculated based on gradient-based methods (e.g. the minimum-risk method)
    • Finding the optimal point along this direction
      $\lambda_{\alpha} = \lambda + \alpha d$

      then we get $S(F_i, \hat{E}_{i,j}) = b + c\alpha$, i.e. each hypothesis score is a linear function of $\alpha$

      $\hat{\alpha} = \operatorname{argmin}_{\alpha} \mathrm{error}(\mathcal{E}, \hat{\mathcal{E}}_{\lambda}(\alpha))$

      update $\lambda \leftarrow \lambda + \hat{\alpha} d$

      • line sweep algorithm
    • more tricks for MERT
      • Random restarts
      • Corpus-level measures
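
A minimal sketch of MERT's line search under the assumptions above (fixed n-best lists, linear score). For brevity it evaluates the corpus error on a grid of $\alpha$ values rather than sweeping the exact score-crossing points that the line sweep algorithm uses; the names (nbest_lists, line_search) and numbers are illustrative, not from the notes.

```python
# Sketch of MERT line search along direction d.
# nbest_lists[i] is a list of (feature_vector, sentence_error) pairs for sentence i.
import numpy as np

def line_search(nbest_lists, lam, d, alphas=np.linspace(-5, 5, 201)):
    """Return the alpha (from a candidate grid) minimizing corpus-level error."""
    best_alpha, best_err = 0.0, float("inf")
    for alpha in alphas:
        weights = lam + alpha * d
        corpus_err = 0.0
        for nbest in nbest_lists:
            scores = [feats @ weights for feats, _ in nbest]   # linear in alpha
            corpus_err += nbest[int(np.argmax(scores))][1]
        if corpus_err < best_err:
            best_alpha, best_err = float(alpha), corpus_err
    return best_alpha, best_err

# toy usage: two sentences, two hypotheses each (made-up features and errors)
nbest_lists = [
    [(np.array([1.0, 0.0]), 0.3), (np.array([0.0, 1.0]), 0.1)],
    [(np.array([0.8, 0.2]), 0.5), (np.array([0.2, 0.9]), 0.2)],
]
lam, d = np.array([0.5, 0.5]), np.array([0.0, 1.0])   # search along the 2nd weight
alpha_hat, err = line_search(nbest_lists, lam, d)
lam = lam + alpha_hat * d                              # update the weights
print(alpha_hat, err)
```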

Minimum Risk Training

  • define a risk (expected error) objective that is differentiable and conducive to optimization through gradient-based methods

$\mathrm{risk}(F,E,\theta) = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta) \, \mathrm{error}(E, \tilde{E})$

  • Two things to be careful about

    • summing over the entire hypothesis space is intractable, so we sum over a chosen subset of hypotheses
    • we are then not actually optimizing the error of the final 1-best output

      • introduce a temperature parameter τ
        $\mathrm{risk}(F,E,\theta) = \sum_{\tilde{E}} \frac{P(\tilde{E} \mid F; \theta)^{1/\tau}}{Z} \mathrm{error}(E, \tilde{E})$

        where $Z = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)^{1/\tau}$

        • τ=1, regular probability distribution
        • τ>1, distribution becomes “smoother”
        • τ<1, distribution becomes “sharper”
      • how to choose τ -> annealing (gradually decrease τ from a high value toward zero, so the smoothed distribution sharpens toward the 1-best output)
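
A minimal numerical sketch of the smoothed risk over a sampled subset of hypotheses. It assumes the model log-probabilities and sentence-level errors for each hypothesis in the subset are already available (the names log_probs and errors and all numbers are illustrative); in practice this quantity is differentiated with respect to θ using an autodiff framework.

```python
# Sketch: smoothed risk over a subset of hypotheses with temperature tau.
# log_probs[j] = log P(E_j | F; theta), errors[j] = error(E, E_j).
import numpy as np

def smoothed_risk(log_probs, errors, tau=1.0):
    """risk = sum_j (P_j^(1/tau) / Z) * error_j, with Z = sum_j P_j^(1/tau)."""
    scaled = np.asarray(log_probs, dtype=float) / tau   # log P^(1/tau)
    scaled -= scaled.max()                              # stabilize the exponentials
    q = np.exp(scaled)
    q /= q.sum()                                        # renormalize over the subset (Z)
    return float(q @ np.asarray(errors, dtype=float))

log_probs = [-1.2, -2.0, -3.5]    # hypothetical subset of 3 hypotheses
errors    = [0.10, 0.40, 0.70]

for tau in (2.0, 1.0, 0.25):      # annealing: large tau (smooth) -> small tau (sharp)
    print(f"tau={tau:<4}  risk={smoothed_risk(log_probs, errors, tau):.3f}")
```

As τ decreases, the renormalized distribution concentrates on the highest-probability hypothesis, so the smoothed risk approaches the error of the 1-best output.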

Optimization Through Search

  • structured perceptron

    • with a linearized model, this is just stochastic gradient descent on a perceptron-style loss
    • variants

      • early stopping (stop when the output becomes inconsistent with the reference and update on the prefixes up to that point; see the sketch after this list)

        $\ell_{\text{early-percep}} = S(F, \hat{e}_1^{t}) - S(F, e_1^{t})$

  • Search-aware tuning and beam-search optimization (adjust the score of hypotheses in the intermediate search steps)

    • Search-aware tuning -> giving a bonus to hypotheses at each time step that get lower error
    • Beam-search optimization -> applying a perceptron-style penalty at each time step where the best hypothesis falls off the beam
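
A minimal sketch of a perceptron-style update for a linear model, assuming feature vectors for the reference and a set of candidate outputs are already available (all names and numbers are illustrative). The search-aware and beam-search variants above apply the same style of update to partial hypotheses at intermediate time steps.

```python
# Sketch: structured-perceptron update. If the model's highest-scoring candidate
# differs from the reference, move the weights toward the reference features and
# away from the offending candidate's features.
import numpy as np

def perceptron_update(weights, phi_ref, candidate_feats, lr=1.0):
    scores = [phi @ weights for phi in candidate_feats]
    phi_hat = candidate_feats[int(np.argmax(scores))]    # model's current best output
    if not np.allclose(phi_hat, phi_ref):                # a wrong output is preferred
        weights = weights + lr * (phi_ref - phi_hat)     # standard perceptron step
    return weights

weights = np.zeros(3)
phi_ref = np.array([1.0, 0.0, 1.0])                      # hypothetical reference features
candidates = [np.array([0.0, 1.0, 1.0]), phi_ref]        # hypothetical candidates
weights = perceptron_update(weights, phi_ref, candidates)
print(weights)   # moved toward the reference, away from the wrong candidate
```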

Margin-based Loss-augmented Training

  • require the score of the correct output $E$ to exceed the score of the best competing hypothesis $\hat{E}$ by a margin $M$

    $S(F,E) > S(F,\hat{E}) \quad\rightarrow\quad S(F,E) > S(F,\hat{E}) + M$

    • explanation: the model gets some breathing room in its predictions
  • loss-augmented training: scale the required margin by the error of the hypothesis, so worse hypotheses must be beaten by a larger margin (see the sketch below)

    $S(F,E) > S(F,\hat{E}) + M \cdot \mathrm{err}(E,\hat{E})$
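
A minimal sketch of the corresponding hinge losses, assuming a single competing hypothesis and made-up scores; the function names and numbers are illustrative only.

```python
# Sketch: margin and loss-augmented hinge losses for one training example.
# s_ref = S(F, E), s_hyp = S(F, E_hat).
def margin_loss(s_ref, s_hyp, M=1.0):
    """Penalize unless the reference beats the hypothesis by at least M."""
    return max(0.0, M + s_hyp - s_ref)

def loss_augmented(s_ref, s_hyp, err, M=1.0):
    """Require a larger margin for hypotheses with larger error err(E, E_hat)."""
    return max(0.0, M * err + s_hyp - s_ref)

print(margin_loss(2.0, 1.5))                 # 0.5: the margin of 1.0 is not yet met
print(loss_augmented(2.0, 1.5, err=0.8))     # 0.3: required margin is 1.0 * 0.8
```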

Optimization as Reinforcement Learning

  • The story

    • view each word selection as an action
    • the final evaluation score (e.g. BLEU) as the reward
  • policy gradient methods
    keyword: the REINFORCE objective

    • self-training: minimize the negative log-likelihood of the model's own output $\hat{E}$
      $\ell_{\mathrm{nll}}(\hat{E}) = -\sum_{t=1}^{|\hat{E}|} \log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
    • weighting the objective with the value of the evaluation function
      $\ell_{\mathrm{reinforce}}(\hat{E},E) = -\,\mathrm{eval}(E,\hat{E}) \sum_{t=1}^{|\hat{E}|} \log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
    • addition of a baseline function: subtract the expected reward, so outputs that turn out better than expected are reinforced and outputs that turn out worse are discouraged (see the sketch after this list)
      $\ell_{\mathrm{reinforce+base}}(\hat{E},E) = -\sum_{t=1}^{|\hat{E}|} \big(\mathrm{eval}(E,\hat{E}) - \mathrm{base}(F, \hat{e}_1^{t-1})\big) \log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
  • value-based reinforcement learning

    • learn a value/Q function $Q(H, a)$
      where $H = \langle F, \hat{e}_1^{t-1} \rangle$ and $a = e_t$, the next word to output
    • actor-critic methods
  • recommended reading
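
A minimal sketch of the REINFORCE-with-baseline loss for one sampled translation, assuming per-token log-probabilities are already available; for brevity the per-timestep baseline $\mathrm{base}(F, \hat{e}_1^{t-1})$ is simplified to a single scalar, and all numbers are made up. In practice the loss is backpropagated through the log-probabilities with an autodiff framework.

```python
# Sketch: REINFORCE-with-baseline loss for one sampled output.
# token_log_probs[t] = log P(e_hat_t | F, e_hat_1..t-1); reward = eval(E, E_hat),
# e.g. sentence-level BLEU; baseline approximates the expected reward.
def reinforce_with_baseline_loss(token_log_probs, reward, baseline=0.0):
    advantage = reward - baseline            # positive: did better than expected
    return -advantage * sum(token_log_probs)

token_log_probs = [-0.7, -1.1, -0.4]         # hypothetical 3-token output
print(reinforce_with_baseline_loss(token_log_probs, reward=0.6, baseline=0.4))  # reinforces the output
print(reinforce_with_baseline_loss(token_log_probs, reward=0.2, baseline=0.4))  # discourages the output
```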

Further Reading

  • Evaluation measures for optimization
  • Efficient data structures and algorithms for optimization

Training Tricks

  • start with MLE
  • learning rate/optimizer
  • large batch size
  • self-critic / average baseline