
contractive penalty to f(x) rather than to g(f(x)). A contractive penalty on f(x) also has close connections to score matching, as discussed in section 14.5.1.
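For concreteness, here is a minimal sketch of this penalty in JAX, assuming a simple one-layer sigmoidal encoder; the names encoder, contractive_penalty, and lam are illustrative rather than taken from the text. The point to note is that the Jacobian is taken of f(x) alone:

```python
import jax
import jax.numpy as jnp

def encoder(params, x):
    # Hypothetical one-layer encoder f(x) = sigmoid(Wx + b).
    W, b = params
    return jax.nn.sigmoid(W @ x + b)

def contractive_penalty(params, x, lam=0.1):
    # Omega(h) = lam * ||df(x)/dx||_F^2: the penalty is computed from the
    # Jacobian of the encoder f, not of the reconstruction g(f(x)).
    J = jax.jacfwd(lambda x_: encoder(params, x_))(x)  # shape (code_dim, input_dim)
    return lam * jnp.sum(J ** 2)                       # squared Frobenius norm
```

In practice this penalty is averaged over training points and added to the reconstruction error, as sketched later in this section.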
The name contractive arises from the way that the CAE warps space. Specifically, because the CAE is trained to resist perturbations of its input, it is encouraged to map a neighborhood of input points to a smaller neighborhood of output points. We can think of this as contracting the input neighborhood to a smaller output neighborhood.
To clarify, the CAE is contractive only locally: all perturbations of a training point x are mapped near to f(x). Globally, two different points x and x′ may be mapped to points f(x) and f(x′) that are farther apart than the original points.
It is plausible that f could be expanding in between or far from the data manifolds (see, for example, what happens in the 1-D toy example of figure 14.7). When the Ω(h) penalty is applied to sigmoidal units, one easy way to shrink the Jacobian is to make the sigmoid units saturate to 0 or 1. This encourages the CAE to encode input points with extreme values of the sigmoid, which may be interpreted as a binary code. It also ensures that the CAE will spread its code values throughout most of the hypercube that its sigmoidal hidden units can span.
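To spell out why saturation shrinks the Jacobian: for a one-layer sigmoidal encoder of the illustrative form sketched above, the Jacobian factors through the sigmoid derivative σ′(z) = σ(z)(1 − σ(z)), giving

∂f(x)/∂x = diag(h ⊙ (1 − h)) W,  where h = σ(Wx + b).

Each factor hᵢ(1 − hᵢ) approaches 0 as hᵢ saturates to 0 or 1, so a saturated unit zeroes out its entire row of the Jacobian regardless of W.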
We can think of the Jacobian matrix J at a point x as approximating the nonlinear encoder f(x) by a linear operator. This allows us to use the word "contractive" more formally. In the theory of linear operators, a linear operator is said to be contractive if the norm of Jx remains less than or equal to 1 for all unit-norm x. In other words, J is contractive if it shrinks the unit sphere. We can think of the CAE as penalizing the Frobenius norm of the local linear approximation of f(x) at every training point x in order to encourage each of these local linear operators to become a contraction.
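Continuing the illustrative JAX sketch from earlier, this condition can be tested at a point through the singular values of J, since ‖Jx‖ ≤ 1 holds for all unit-norm x exactly when the largest singular value of J is at most 1:

```python
def is_locally_contractive(params, x):
    # Reuses the illustrative encoder defined earlier. J shrinks the unit
    # sphere iff its largest singular value (the spectral norm) is <= 1.
    J = jax.jacfwd(lambda x_: encoder(params, x_))(x)
    sigma_max = jnp.linalg.svd(J, compute_uv=False)[0]  # singular values, descending
    return sigma_max <= 1.0
```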
As described in section 14.6, regularized autoencoders learn manifolds by balancing two opposing forces. In the case of the CAE, these two forces are reconstruction error and the contractive penalty Ω(h). Reconstruction error alone would encourage the CAE to learn the identity function. The contractive penalty alone would encourage the CAE to learn features that are constant with respect to x. The compromise between these two forces yields an autoencoder whose derivatives ∂f(x)/∂x are mostly tiny. Only a small number of hidden units, corresponding to a small number of directions in the input, may have significant derivatives.
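A hedged sketch of the resulting objective, again in the illustrative one-layer setting (the linear decoder with tied weights shown here is one common choice, not prescribed by the text): the first term pulls toward the identity function, the second toward constant features, and training balances the two.

```python
def cae_loss(params, x, lam=0.1):
    # Illustrative CAE objective: reconstruction error plus Omega(h).
    W, b, b_rec = params
    h = jax.nn.sigmoid(W @ x + b)         # encoder f(x)
    x_rec = W.T @ h + b_rec               # hypothetical decoder g(h), tied weights
    recon = jnp.sum((x - x_rec) ** 2)     # pulls toward the identity function
    J = jax.jacfwd(lambda x_: jax.nn.sigmoid(W @ x_ + b))(x)
    return recon + lam * jnp.sum(J ** 2)  # pulls toward constant features
```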
The goal of the CAE is to learn the manifold structure of the data. Directions x with large Jx rapidly change h, so these are likely to be directions that approximate the tangent planes of the manifold.
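Under the same illustrative setup as the earlier sketches, such directions can be read off from a singular value decomposition of the Jacobian:

```python
def tangent_directions(params, x, k=2):
    # Right singular vectors of J with the largest singular values are the
    # input directions along which the code h changes fastest; they serve
    # as estimates of the manifold's local tangent directions at x.
    J = jax.jacfwd(lambda x_: encoder(params, x_))(x)
    _, s, Vt = jnp.linalg.svd(J)          # J = U diag(s) V^T, s in descending order
    return Vt[:k], s[:k]                  # top-k tangent estimates and their strengths
```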
Experiments by Rifai et al. (2011a,b) show that training the CAE results in most singular values of J dropping below 1 in magnitude and therefore becoming contractive. Some singular values remain above 1, however, because the reconstruction error penalty encourages the CAE to encode the directions with the most local variance.