
CMU 11-731 (MT & Seq2Seq) Algorithms for MT 2: Parameter Optimization Methods

Error Functions and Error Minimization

  • error function
    $\mathrm{error}(\mathcal{E}, \hat{\mathcal{E}})$, comparing the reference translations $\mathcal{E}$ with the hypothesized translations $\hat{\mathcal{E}}$

  • difficulty in directly optimizing the error function

    • a myriad of possible translations
    • the argmax used in decoding, and by corollary the error function, is not continuous => it is piecewise constant, so the gradient is zero almost everywhere and undefined at the jumps (see the sketch after this list)
  • how to overcome?

    • approximate the hypothesis space
    • easily calculable loss functions
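
A minimal sketch of the problem above (toy numbers, not from the notes): with a fixed set of candidate translations, the argmax hypothesis, and therefore the error, only changes at discrete values of a weight, so the error surface is a step function with zero gradient almost everywhere.

```python
# Toy illustration: error(lambda) is piecewise constant in the model weights.
import numpy as np

# hypothetical 3-hypothesis candidate list: (feature vector, sentence-level error)
nbest = [
    (np.array([1.0, 0.2]), 0.4),
    (np.array([0.5, 0.9]), 0.1),
    (np.array([0.1, 0.4]), 0.7),
]

def error_at(lam):
    """Error of the argmax hypothesis under weights lam."""
    scores = [feats @ lam for feats, _ in nbest]
    return nbest[int(np.argmax(scores))][1]

for w in np.linspace(0.0, 2.0, 9):
    # the printed error stays constant and jumps only where the argmax changes
    print(f"lambda_2={w:.2f}  error={error_at(np.array([1.0, w])):.1f}")
```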

Minimum Error Rate Training (MERT)

  • assume we are dealing with a linear model
  • work with only a subset of hypotheses (an n-best list)
  • efficient line-search method (see the sketch at the end of this section)

$S(F,E;\lambda) = \sum_i \lambda_i \phi_i(F,E)$ (a linear score used in place of $\log P(F,E)$)

  • outer loop

    • Generating hypotheses

      • using beam search: $F_i \rightarrow$ n-best list $\hat{\mathcal{E}}_i$

        where $\hat{E}_{i,j}$ is the $j$-th hypothesis in the n-best list

    • Adjusting parameters

      $\hat{\mathcal{E}}(\lambda) = \hat{E}(\lambda)_1, \hat{E}(\lambda)_2, \ldots, \hat{E}(\lambda)_{|\mathcal{E}|}$

      where $\hat{E}(\lambda)_i = \operatorname{argmax}_{\tilde{E} \in \hat{\mathcal{E}}_i} S(F_i, \tilde{E}; \lambda)$

      $\hat{\lambda} = \operatorname{argmin}_{\lambda} \mathrm{error}(\mathcal{E}, \hat{\mathcal{E}}(\lambda))$

  • inner loop (how to find the parameters $\hat{\lambda}$: line search)

    • Picking a direction(vector d)
      • one-hot vector
      • random vector
      • vector calculated based on gradient-based methods (e.g. the minimum-risk method)
    • Finding the optimal point along this direction
      $\lambda_{\alpha} = \lambda + \alpha d$

      then we get $S(F_i, \hat{E}_{i,j}) = b + c\alpha$, i.e. each hypothesis score is a linear function of $\alpha$

      $\hat{\alpha} = \operatorname{argmin}_{\alpha} \mathrm{error}(\mathcal{E}, \hat{\mathcal{E}}_{\lambda}(\alpha))$

      update $\lambda \leftarrow \lambda + \hat{\alpha} d$

      • line sweep algorithm
    • more tricks for MERT
      • Random restarts
      • Corpus-level measures
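
A minimal sketch of MERT's line search under the assumptions above (fixed n-best lists, linear score). For brevity it evaluates the corpus error on a grid of $\alpha$ values rather than sweeping the exact score-crossing points that the line sweep algorithm uses; the names (nbest_lists, line_search) and numbers are illustrative, not from the notes.

```python
# Sketch of MERT line search along direction d.
# nbest_lists[i] is a list of (feature_vector, sentence_error) pairs for sentence i.
import numpy as np

def line_search(nbest_lists, lam, d, alphas=np.linspace(-5, 5, 201)):
    """Return the alpha (from a candidate grid) minimizing corpus-level error."""
    best_alpha, best_err = 0.0, float("inf")
    for alpha in alphas:
        weights = lam + alpha * d
        corpus_err = 0.0
        for nbest in nbest_lists:
            scores = [feats @ weights for feats, _ in nbest]   # linear in alpha
            corpus_err += nbest[int(np.argmax(scores))][1]
        if corpus_err < best_err:
            best_alpha, best_err = float(alpha), corpus_err
    return best_alpha, best_err

# toy usage: two sentences, two hypotheses each (made-up features and errors)
nbest_lists = [
    [(np.array([1.0, 0.0]), 0.3), (np.array([0.0, 1.0]), 0.1)],
    [(np.array([0.8, 0.2]), 0.5), (np.array([0.2, 0.9]), 0.2)],
]
lam, d = np.array([0.5, 0.5]), np.array([0.0, 1.0])   # search along the 2nd weight
alpha_hat, err = line_search(nbest_lists, lam, d)
lam = lam + alpha_hat * d                              # update the weights
print(alpha_hat, err)
```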

Minimum Risk Training

  • define a risk (expected error) objective that is differentiable and conducive to optimization through gradient-based methods

$\mathrm{risk}(F,E,\theta) = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta) \, \mathrm{error}(E, \tilde{E})$

  • Two things to be careful about

    • summing over the entire hypothesis space is intractable, so we sum over a chosen subset of hypotheses
    • we are then not actually optimizing the error of the final 1-best output

      • introduce a temperature parameter τ
        $\mathrm{risk}(F,E,\theta) = \sum_{\tilde{E}} \frac{P(\tilde{E} \mid F; \theta)^{1/\tau}}{Z} \mathrm{error}(E, \tilde{E})$

        where $Z = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)^{1/\tau}$

        • τ=1, regular probability distribution
        • τ>1, distribution becomes “smoother”
        • τ<1, distribution becomes “sharper”
      • how to choose τ -> annealing (gradually decrease τ from a high value toward zero, so the smoothed distribution sharpens toward the 1-best output)
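
A minimal numerical sketch of the smoothed risk over a sampled subset of hypotheses. It assumes the model log-probabilities and sentence-level errors for each hypothesis in the subset are already available (the names log_probs and errors and all numbers are illustrative); in practice this quantity is differentiated with respect to θ using an autodiff framework.

```python
# Sketch: smoothed risk over a subset of hypotheses with temperature tau.
# log_probs[j] = log P(E_j | F; theta), errors[j] = error(E, E_j).
import numpy as np

def smoothed_risk(log_probs, errors, tau=1.0):
    """risk = sum_j (P_j^(1/tau) / Z) * error_j, with Z = sum_j P_j^(1/tau)."""
    scaled = np.asarray(log_probs, dtype=float) / tau   # log P^(1/tau)
    scaled -= scaled.max()                              # stabilize the exponentials
    q = np.exp(scaled)
    q /= q.sum()                                        # renormalize over the subset (Z)
    return float(q @ np.asarray(errors, dtype=float))

log_probs = [-1.2, -2.0, -3.5]    # hypothetical subset of 3 hypotheses
errors    = [0.10, 0.40, 0.70]

for tau in (2.0, 1.0, 0.25):      # annealing: large tau (smooth) -> small tau (sharp)
    print(f"tau={tau:<4}  risk={smoothed_risk(log_probs, errors, tau):.3f}")
```

As τ decreases, the renormalized distribution concentrates on the highest-probability hypothesis, so the smoothed risk approaches the error of the 1-best output.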

Optimization Through Search

  • structured perceptron

    • with a linearized model, this is just stochastic gradient descent on a perceptron-style loss
    • variants

      • early stopping (stop when the output becomes inconsistent with the reference and update on the prefixes up to that point; see the sketch after this list)

        $\ell_{\text{early-percep}} = S(F, \hat{e}_1^{t}) - S(F, e_1^{t})$

  • Search-aware tuning and beam-search optimization (adjust the score of hypotheses in the intermediate search steps)

    • Search-aware tuning -> giving a bonus to hypotheses at each time step that get lower error
    • Beam-search optimization -> applying a perceptron-style penalty at each time step where the best hypothesis falls off the beam
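
A minimal sketch of a perceptron-style update for a linear model, assuming feature vectors for the reference and a set of candidate outputs are already available (all names and numbers are illustrative). The search-aware and beam-search variants above apply the same style of update to partial hypotheses at intermediate time steps.

```python
# Sketch: structured-perceptron update. If the model's highest-scoring candidate
# differs from the reference, move the weights toward the reference features and
# away from the offending candidate's features.
import numpy as np

def perceptron_update(weights, phi_ref, candidate_feats, lr=1.0):
    scores = [phi @ weights for phi in candidate_feats]
    phi_hat = candidate_feats[int(np.argmax(scores))]    # model's current best output
    if not np.allclose(phi_hat, phi_ref):                # a wrong output is preferred
        weights = weights + lr * (phi_ref - phi_hat)     # standard perceptron step
    return weights

weights = np.zeros(3)
phi_ref = np.array([1.0, 0.0, 1.0])                      # hypothetical reference features
candidates = [np.array([0.0, 1.0, 1.0]), phi_ref]        # hypothetical candidates
weights = perceptron_update(weights, phi_ref, candidates)
print(weights)   # moved toward the reference, away from the wrong candidate
```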

Margin-based Loss-augmented Training

  • require the score of the correct output $E$ to exceed the score of the best competing hypothesis $\hat{E}$ by a margin $M$

    $S(F,E) > S(F,\hat{E}) \quad\rightarrow\quad S(F,E) > S(F,\hat{E}) + M$

    • explanation: the model gets some breathing room in its predictions
  • loss-augmented training: scale the required margin by the error of the hypothesis, so worse hypotheses must be beaten by a larger margin (see the sketch below)

    $S(F,E) > S(F,\hat{E}) + M \cdot \mathrm{err}(E,\hat{E})$
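
A minimal sketch of the corresponding hinge losses, assuming a single competing hypothesis and made-up scores; the function names and numbers are illustrative only.

```python
# Sketch: margin and loss-augmented hinge losses for one training example.
# s_ref = S(F, E), s_hyp = S(F, E_hat).
def margin_loss(s_ref, s_hyp, M=1.0):
    """Penalize unless the reference beats the hypothesis by at least M."""
    return max(0.0, M + s_hyp - s_ref)

def loss_augmented(s_ref, s_hyp, err, M=1.0):
    """Require a larger margin for hypotheses with larger error err(E, E_hat)."""
    return max(0.0, M * err + s_hyp - s_ref)

print(margin_loss(2.0, 1.5))                 # 0.5: the margin of 1.0 is not yet met
print(loss_augmented(2.0, 1.5, err=0.8))     # 0.3: required margin is 1.0 * 0.8
```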

Optimization as Reinforcement Learning

  • The story

    • view each word selection as an action
    • the final evaluation score (e.g. BLEU) as the reward
  • policy gradient methods
    keyword: the REINFORCE objective

    • self-training: minimize the negative log-likelihood of the model's own output $\hat{E}$
      $\ell_{\mathrm{nll}}(\hat{E}) = -\sum_{t=1}^{|\hat{E}|} \log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
    • weighting the objective with the value of the evaluation function
      $\ell_{\mathrm{reinforce}}(\hat{E},E) = -\,\mathrm{eval}(E,\hat{E}) \sum_{t=1}^{|\hat{E}|} \log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
    • addition of a baseline function: subtract the expected reward, so outputs that turn out better than expected are reinforced and outputs that turn out worse are discouraged (see the sketch after this list)
      $\ell_{\mathrm{reinforce+base}}(\hat{E},E) = -\sum_{t=1}^{|\hat{E}|} \big(\mathrm{eval}(E,\hat{E}) - \mathrm{base}(F, \hat{e}_1^{t-1})\big) \log P(\hat{e}_t \mid F, \hat{e}_1^{t-1})$
  • value-based reinforcement learning

    • learn a value/Q function $Q(H, a)$
      where $H = \langle F, \hat{e}_1^{t-1} \rangle$ and $a = e_t$, the next word to output
    • actor-critic methods
  • recommended reading
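
A minimal sketch of the REINFORCE-with-baseline loss for one sampled translation, assuming per-token log-probabilities are already available; for brevity the per-timestep baseline $\mathrm{base}(F, \hat{e}_1^{t-1})$ is simplified to a single scalar, and all numbers are made up. In practice the loss is backpropagated through the log-probabilities with an autodiff framework.

```python
# Sketch: REINFORCE-with-baseline loss for one sampled output.
# token_log_probs[t] = log P(e_hat_t | F, e_hat_1..t-1); reward = eval(E, E_hat),
# e.g. sentence-level BLEU; baseline approximates the expected reward.
def reinforce_with_baseline_loss(token_log_probs, reward, baseline=0.0):
    advantage = reward - baseline            # positive: did better than expected
    return -advantage * sum(token_log_probs)

token_log_probs = [-0.7, -1.1, -0.4]         # hypothetical 3-token output
print(reinforce_with_baseline_loss(token_log_probs, reward=0.6, baseline=0.4))  # reinforces the output
print(reinforce_with_baseline_loss(token_log_probs, reward=0.2, baseline=0.4))  # discourages the output
```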

Further Reading

  • Evaluation measures for optimization
  • Efficient data structures and algorithms for optimization

Training Tricks

  • start with MLE
  • learning rate/optimizer
  • large batch size
  • self-critic / average baseline