Error Functions and Error Minimization
error function
- $\text{error}(\mathcal{E}, \hat{\mathcal{E}})$: compares the hypotheses $\hat{\mathcal{E}}$ against the reference translations $\mathcal{E}$
difficulty in directly optimizing the error function
- a myriad of possible translations
- the argmax function, and by corollary the error function, is not continuous => piecewise constant (gradient is zero/undefined)
how to overcome?
- approximate the hypothesis space
- easily calculable loss functions
Minimum Error Rate Training (MERT)
- assumes we are dealing with a linear model
- works on only a subset of hypotheses
- efficient line-search method
$\log P(F, E) \propto S(F, E) = \sum_i \lambda_i \phi_i(F, E)$
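As a toy illustration of the linear model above (all feature values and weights below are made-up numbers, not from the source), the score is just a dot product between a feature vector and the weights that MERT tunes:

```python
import numpy as np

# Made-up feature values phi_i(F, E) for one hypothesis, e.g. translation model
# log-prob, language model log-prob, word penalty.
phi = np.array([-2.3, -4.1, 7.0])
# Feature weights lambda_i; these are exactly the parameters MERT adjusts.
lam = np.array([1.0, 0.6, -0.2])

S = float(lam @ phi)  # S(F, E) = sum_i lambda_i * phi_i(F, E), proportional to log P(F, E)
print(S)
```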
outer loop
Generating hypotheses
using beam search
decode $F_i$ into an $n$-best list $\hat{\mathcal{E}}_i$, where $\hat{E}_{i,j}$ is the $j$-th hypothesis in the $n$-best list
Adjusting parameters
$\hat{\mathcal{E}}(\lambda) = \hat{E}(\lambda)_1, \hat{E}(\lambda)_2, \ldots, \hat{E}(\lambda)_{|\mathcal{E}|}$,
where $\hat{E}(\lambda)_i = \operatorname{argmax}_{\tilde{E} \in \hat{\mathcal{E}}_i} S(F_i, \tilde{E}; \lambda)$
$\hat{\lambda} = \operatorname{argmin}_{\lambda} \text{error}(\mathcal{E}, \hat{\mathcal{E}}(\lambda))$
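A minimal sketch of evaluating this outer-loop objective for a fixed $\lambda$, assuming each sentence's $n$-best hypotheses are stored as feature vectors with precomputed sentence-level errors (the helper name `corpus_error` is mine, and summing sentence-level errors stands in for a true corpus-level measure such as BLEU):

```python
import numpy as np

def corpus_error(nbest_feats, nbest_errors, lam):
    """error(E, E_hat(lambda)): rescore each n-best list with weights lam,
    take the argmax hypothesis, and accumulate its error.

    nbest_feats[i]  : (n_i, d) array of phi(F_i, E_hat_{i,j})
    nbest_errors[i] : (n_i,)   array of error(E_i, E_hat_{i,j})
    """
    total = 0.0
    for feats, errs in zip(nbest_feats, nbest_errors):
        scores = feats @ lam              # S(F_i, E~; lambda) for every hypothesis
        total += errs[np.argmax(scores)]  # error of the chosen (argmax) hypothesis
    return total
```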
inner loop (how to find the parameters $\hat{\lambda}$ - line search)
- Picking a direction (vector $d$)
- one-hot vector
- random vector
- vector calculated based on gradient-based methods (e.g. minimum-risk method)
- Finding the optimal point along this direction
$\lambda_\alpha = \lambda + \alpha d$; then we get $S(F_i, \hat{E}_{i,j}; \lambda_\alpha) = b + c\alpha$
$\hat{\lambda}_\alpha = \operatorname{argmin}_{\alpha} \text{error}(\mathcal{E}, \hat{\mathcal{E}}(\lambda_\alpha))$
update $\lambda \leftarrow \hat{\lambda}_\alpha$
- line sweep algorithm
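A sketch of the inner loop, assuming the same n-best data structures as above and reusing the `corpus_error` helper from the previous sketch. It enumerates all pairwise intersection points of the lines $b + c\alpha$ instead of building their upper envelope, so it is a quadratic stand-in for the real line sweep, but it shows the key point: the corpus error is piecewise constant in $\alpha$, so testing one $\alpha$ per interval is enough.

```python
import numpy as np

def best_alpha(nbest_feats, nbest_errors, lam, d):
    """Line search along direction d: return the alpha minimizing corpus error."""
    breakpoints = [0.0]
    for feats in nbest_feats:
        b, c = feats @ lam, feats @ d         # score of hypothesis j is b[j] + c[j] * alpha
        for j in range(len(b)):
            for k in range(j + 1, len(b)):
                if c[j] != c[k]:              # alpha where hypotheses j and k swap ranks
                    breakpoints.append((b[k] - b[j]) / (c[j] - c[k]))
    breakpoints = np.unique(breakpoints)      # sorted, duplicates removed
    # one test point inside every interval, plus one beyond each end
    test_alphas = np.concatenate([[breakpoints[0] - 1.0],
                                  (breakpoints[:-1] + breakpoints[1:]) / 2.0,
                                  [breakpoints[-1] + 1.0]])
    errors = [corpus_error(nbest_feats, nbest_errors, lam + a * d) for a in test_alphas]
    return test_alphas[int(np.argmin(errors))]
```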
- more tricks for MERT
- Random restarts
- Corpus-level measures
Minimum Risk Training
- differentiable and conducive to optimization through gradient-based methods
$\text{risk}(F, E, \theta) = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)\, \text{error}(E, \tilde{E})$
Two things to be careful about
- summing over the entire hypothesis space is intractable, so we choose a subset of hypotheses
- we are then not actually optimizing the true error
introduce a temperature parameter $\tau$ to control the shape of the distribution over the subset
$\text{risk}(F, E, \theta) = \sum_{\tilde{E}} \frac{P(\tilde{E} \mid F; \theta)^{1/\tau}}{Z} \text{error}(E, \tilde{E})$, where $Z = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)^{1/\tau}$ (see the sketch after the notes on $\tau$ below)
- τ=1, regular probability distribution
- τ>1, distribution becomes “smoother”
- τ<1, distribution becomes “sharper”
- how to choose τ -> annealing (decrease from a high value to zero)
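A minimal sketch of the temperature-scaled risk over a sampled subset of hypotheses (the function name and inputs are illustrative; in practice the log-probabilities would come from the model and this quantity would be differentiated with respect to $\theta$):

```python
import numpy as np

def sampled_risk(log_probs, errors, tau=1.0):
    """risk(F, E, theta) over a subset: sum_j P(E~_j|F)^(1/tau) / Z * error(E, E~_j).

    log_probs : (n,) log P(E~_j | F; theta) for the n sampled hypotheses
    errors    : (n,) error(E, E~_j) for the same hypotheses
    tau       : temperature (tau > 1 smooths, tau < 1 sharpens the distribution)
    """
    scaled = np.asarray(log_probs, dtype=float) / tau
    scaled -= scaled.max()          # subtract max for numerical stability
    q = np.exp(scaled)
    q /= q.sum()                    # renormalize over the subset (the Z term)
    return float(q @ np.asarray(errors, dtype=float))
```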
Optimization Through Search
structured perceptron
- linearized model -> just stochastic gradient descent
variants
- early stopping (stop the search as soon as the output becomes inconsistent with the reference, and update on the prefixes)
$l_{\text{early-percep}} = S(F, \hat{e}_1^t) - S(F, e_1^t)$
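A rough sketch of this early-stopping perceptron loss for one sentence; `score_prefix` is a hypothetical callback returning the model score $S(F, e_1^t)$ of a target-side prefix (the source $F$ is assumed to be captured inside it):

```python
def early_perceptron_loss(ref_words, hyp_words, score_prefix):
    """Stop at the first position where the hypothesis diverges from the
    reference and compare the scores of the two prefixes there."""
    if hyp_words == ref_words:
        return 0.0                 # outputs agree: no update needed
    t = 0
    while (t < min(len(ref_words), len(hyp_words))
           and ref_words[t] == hyp_words[t]):
        t += 1
    # prefixes up to and including the first point of divergence
    return score_prefix(hyp_words[:t + 1]) - score_prefix(ref_words[:t + 1])
```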
Search-aware tuning and beam-search optimization (adjust the score of hypotheses in the intermediate search steps)
- Search-aware tuning -> giving a bonus to hypotheses at each time step that get lower error
- Beam-search optimization -> applying a perceptron-style penalty at each time step where the gold hypothesis falls off the beam (see the sketch below)
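A very rough sketch of that idea, assuming we already have, at every time step, the scores of the hypotheses kept in the beam and the score of the gold prefix (all names are illustrative; a full implementation also has to decide how search continues after such a violation):

```python
def beam_violation_penalty(beam_scores_per_step, gold_prefix_scores, margin=1.0):
    """Accumulate a perceptron/margin-style penalty at every time step where
    the gold prefix (would) fall off the beam.

    beam_scores_per_step[t] : scores of the hypotheses kept in the beam at step t
    gold_prefix_scores[t]   : score of the gold prefix e_1^t at step t
    """
    total = 0.0
    for beam_scores, gold in zip(beam_scores_per_step, gold_prefix_scores):
        worst_kept = min(beam_scores)
        if gold < worst_kept + margin:            # gold is pruned (or barely survives)
            total += worst_kept + margin - gold   # push the gold prefix back above the cutoff
    return total
```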
Margin-based Loss-augmented Training
- require the score of the reference to exceed the score of the best hypothesis by a margin $M$
$S(F, E) > S(F, \hat{E}) \Rightarrow S(F, E) > S(F, \hat{E}) + M$
- explanation: the model has some breathing room in its predictions
loss-augmented training
$S(F, E) > S(F, \hat{E}) + M \cdot \text{err}(E, \hat{E})$
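These constraints are usually turned into a hinge-style loss; a minimal sketch (the function name is mine):

```python
def loss_augmented_hinge(score_ref, score_hyp, err, margin=1.0):
    """Zero when S(F, E) > S(F, E_hat) + M * err(E, E_hat), positive otherwise.
    Setting err = 1 recovers the plain fixed-margin constraint."""
    return max(0.0, score_hyp + margin * err - score_ref)
```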
Optimization as Reinforcement Learning
The story
- view each word selection as an action
- the final evaluation score (e.g. BLEU) as the reward
policy gradient methods
key word: the REINFORCE objective
- self-training: $l_{\text{nll}}(\hat{E}) = \sum_{t=1}^{|\hat{E}|} -\log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
- weighting the objective with the value of the evaluation function:
$l_{\text{reinforce}}(\hat{E}, E) = \text{eval}(E, \hat{E}) \sum_{t=1}^{|\hat{E}|} -\log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
- adding a baseline function (outputs that turn out better than expected are reinforced, those that turn out worse than expected are discouraged):
$l_{\text{reinforce+base}}(\hat{E}, E) = \sum_{t=1}^{|\hat{E}|} -\big(\text{eval}(E, \hat{E}) - \text{base}(F, \hat{e}_1^{t-1})\big) \log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
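A sketch of the REINFORCE-with-baseline loss for one sampled output, assuming per-step log-probabilities and baseline values are already available (the names are illustrative):

```python
import numpy as np

def reinforce_with_baseline_loss(step_log_probs, step_baselines, reward):
    """step_log_probs[t] = log P(e_hat_t | F, e_hat_1^{t-1}) for the sampled output
    step_baselines[t]   = base(F, e_hat_1^{t-1}), an estimate of the expected reward
    reward              = eval(E, E_hat), e.g. sentence-level BLEU

    Each step's negative log-likelihood is weighted by (reward - baseline):
    better-than-expected outputs are reinforced, worse ones discouraged."""
    advantages = reward - np.asarray(step_baselines, dtype=float)
    return float(np.sum(-advantages * np.asarray(step_log_probs, dtype=float)))
```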
value-based reinforcement learning
- learn a value/Q function, $Q(H, a)$, where $H = \langle F, \hat{e}_1^{t-1} \rangle$ and $a = e_t$
actor-critic methods
- also learn a value/Q function $Q(H, a)$, and use its estimates to guide training of the policy
Further Reading
- Evaluation measures for optimization
- Efficient data structures and algorithms for optimization
training tricks
- start with MLE
- learning rate/optimizer
- large batch size
- self-critic / average baseline
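A tiny sketch of an average-reward baseline (a cheap alternative to a learned $\text{base}(F, \hat{e}_1^{t-1})$; the class name is mine):

```python
class AverageBaseline:
    """Running mean of the rewards seen so far, usable as the baseline term above."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def value(self):
        return self.total / self.count if self.count else 0.0

    def update(self, reward):
        self.total += reward
        self.count += 1
```

Subtracting `baseline.value()` from the sentence-level reward before the REINFORCE update plays the same role as the $\text{base}$ term in the loss above.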