Thank you very much for sending this test code, Aravind. I spent several hours figuring out what's going on, and it's pretty interesting!

During training we use a technique called batch normalization to normalize the means and variances of the hidden units' activations, which not only speeds up learning but also improves the network's performance. During testing we are not doing batch normalization the same way. Your test code was using a batch size of 1, which makes batch normalization essentially a no-op. I wrote my own test code that used a batch size of 100 or 200, but it called model.eval(), which is what you're supposed to do when testing the network (gradient computation itself is turned off separately, with torch.no_grad()). The side effect of eval mode is that the batch-normalization layers stop normalizing each batch with its own statistics and instead use the running statistics stored during training, so in effect the per-batch normalization is turned off. Both your test code and mine showed lower performance because we were running the network without that normalization step.

Another difference is that the % correct during training is computed by adding up the correct responses for each batch, but since the model is also adjusting its weights with each batch, its performance is a moving target. That effect is small, though, once the model is close to asymptote.

In my last experiment, I trained for 15 epochs and got 97.7% correct during training, and 97.8% during testing when I had batch normalization turned on and tested in batches of 200. But if I turned off batch normalization, the test result was only 88.8% correct. When we're running on the robot we're using a batch size of 1, so the error rates we're seeing there are going to be higher than the training error.

Here's my revised code if you want to look at it. Call test() first to test without batch normalization, then test(True) to test with batch normalization. Once you do test(True), a subsequent test() produces a very similar result; something is being cached somewhere. My guess is that running the network in training mode updates the batch-norm layers' running statistics from the test batches, and the later eval-mode run then uses those updated statistics.

Regards,
-- Dave
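
P.S. To show the shape of what I mean, here's a minimal sketch of a test() routine along these lines (not the actual revised file). It assumes model is the trained network and test_loader is a DataLoader over the test set with batch_size=200; the real code differs in the details.

import torch

# Sketch only: `model` (the trained network) and `test_loader` (a DataLoader
# over the test set, batch_size=200) are assumed to be defined elsewhere.
def test(use_batchnorm=False):
    if use_batchnorm:
        # Train mode: BatchNorm normalizes each batch with its own
        # mean/variance. It also updates the layers' running statistics
        # as a side effect, even under no_grad (the weights don't change).
        model.train()
    else:
        # Eval mode: BatchNorm uses the running statistics stored
        # during training instead of per-batch statistics.
        model.eval()

    correct = 0
    total = 0
    with torch.no_grad():  # this is what actually turns off gradient computation
        for images, labels in test_loader:
            outputs = model(images)
            predictions = outputs.argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    print(f"{100.0 * correct / total:.1f}% correct on the test set")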