| argument | meaning/value | comments |
|---|---|---|
| --train-data | training data specification | required |
| --valid-data | validation data specification | required |
| --nnet-spec | --nnet-spec="d:h(1):h(2):...:h(n):s", e.g. 250:1024:1024:1024:1024:1920 | required. d: input dimension; h(i): size of the i-th hidden layer; s: number of targets |
| --wdir | working directory | required |
| --param-output-file | path to save model parameters in the PDNN format | by default "": doesn't output the PDNN-formatted model |
| --cfg-output-file | path to save the model config | by default "": doesn't output the model config |
| --kaldi-output-file | path to save the Kaldi-formatted model | by default "": doesn't output the Kaldi-formatted model |
| --model-save-step | number of epochs between model saving | by default 1: save the tmp model after each epoch |
| --ptr-file | pre-trained model file | by default "": no pre-training |
| --ptr-layer-number | how many layers to initialize with the pre-trained model | required if --ptr-file is provided |
| --lrate | learning rate | by default D:0.08:0.5:0.05,0.05:15 |
| --batch-size | mini-batch size for SGD | by default 256 |
| --momentum | the momentum | by default 0.5 |
| --activation | 1. sigmoid 2. tanh 3. rectifier 4. maxout:${group_size} | by default sigmoid. When using maxout, you need to specify the group size, i.e., the number of units in each max-pooling group. More details can be found at the bottom of this page and also in this paper |
| --input-dropout-factor | dropout factor for the input layer (features) | by default 0: no dropout is applied to the input features |
| --dropout-factor | comma-delimited dropout factors for *hidden layers*; they must match the network structure (nnet-spec), e.g. --dropout-factor 0.2,0.2,0.2,0.2 | by default "": no dropout is applied. This is equivalent to setting all dropout factors to 0, but the latter case is slower; thus "--dropout-factor 0,0,0,0" is NOT recommended |
| --l1-reg | L1-norm regularization weight: train_objective = cross_entropy + l1_reg * [L1 norm of all weight matrices] | by default 0 |
| --l2-reg | L2-norm regularization weight: train_objective = cross_entropy + l2_reg * [L2 norm of all weight matrices] | by default 0 |
| --max-col-norm | the max value of the norm of gradients; usually used with dropout and maxout | by default none: not applied |
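Putting the arguments above together, a training invocation might look like the following. This is an illustrative sketch only: the script name (`run_DNN.py`) and the data/output paths are assumptions, so substitute the actual training script and your own files.

```shell
# Hypothetical PDNN training command; script name and file paths are
# placeholders -- only the flags themselves come from the table above.
python run_DNN.py --train-data "train.pfile" \
                  --valid-data "valid.pfile" \
                  --nnet-spec "250:1024:1024:1024:1024:1920" \
                  --wdir ./work \
                  --lrate "D:0.08:0.5:0.05,0.05:15" \
                  --batch-size 256 \
                  --momentum 0.5 \
                  --activation sigmoid \
                  --input-dropout-factor 0.2 \
                  --dropout-factor 0.2,0.2,0.2,0.2 \
                  --kaldi-output-file ./work/dnn.nnet
```

Note that the four comma-delimited dropout factors match the four hidden layers declared in --nnet-spec, as the table requires.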