Thank you very much for sending this test code, Aravind. I spent several hours figuring out what's going on, and it's pretty interesting!

During training we use a technique called batch normalization to normalize the means and variances of the hidden units' activations, which not only speeds up learning but also improves the network's performance. During testing we are not doing batch normalization the same way. Your test code was using a batch size of 1, which makes batch normalization essentially a no-op. I wrote my own test code that used a batch size of 100 or 200, but it called model.eval(), which is what you're supposed to do when testing the network (gradient computation itself is turned off separately, with torch.no_grad()). The side effect of eval mode is that the batch-normalization layers stop normalizing each batch with its own statistics and instead use the running statistics stored during training, so in effect the per-batch normalization is turned off. Both your test code and mine showed lower performance because we were running the network without that normalization step.

Another difference is that the % correct during training is computed by adding up the correct responses for each batch, but since the model is also adjusting its weights with each batch, its performance is a moving target. That effect is small, though, once the model is close to asymptote.

In my last experiment, I trained for 15 epochs and got 97.7% correct during training, and 97.8% during testing when I had batch normalization turned on and tested in batches of 200. But if I turned off batch normalization, the test result was only 88.8% correct. When we're running on the robot we're using a batch size of 1, so the error rates we're seeing there are going to be higher than the training error.

Here's my revised code if you want to look at it. Call test() first to test without batch normalization, then test(True) to test with batch normalization. Once you do test(True), a subsequent test() produces a very similar result; something is being cached somewhere. My guess is that running the network in training mode updates the batch-norm layers' running statistics from the test batches, and the later eval-mode run then uses those updated statistics.

Regards,
-- Dave
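
P.S. To show the shape of what I mean, here's a minimal sketch of a test() routine along these lines (not the actual revised file). It assumes model is the trained network and test_loader is a DataLoader over the test set with batch_size=200; the real code differs in the details.

import torch

# Sketch only: `model` (the trained network) and `test_loader` (a DataLoader
# over the test set, batch_size=200) are assumed to be defined elsewhere.
def test(use_batchnorm=False):
    if use_batchnorm:
        # Train mode: BatchNorm normalizes each batch with its own
        # mean/variance. It also updates the layers' running statistics
        # as a side effect, even under no_grad (the weights don't change).
        model.train()
    else:
        # Eval mode: BatchNorm uses the running statistics stored
        # during training instead of per-batch statistics.
        model.eval()

    correct = 0
    total = 0
    with torch.no_grad():  # this is what actually turns off gradient computation
        for images, labels in test_loader:
            outputs = model(images)
            predictions = outputs.argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    print(f"{100.0 * correct / total:.1f}% correct on the test set")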