.. _sec_dropout:
Dropout
=======
Let’s think briefly about what we expect from a good predictive model.
We want it to perform well on unseen data. Classical generalization
theory suggests that to close the gap between train and test
performance, we should aim for a simple model. Simplicity can come in
the form of a small number of dimensions. We explored this when
discussing the monomial basis functions of linear models in
:numref:`sec_generalization_basics`. Additionally, as we saw when
discussing weight decay (:math:`\ell_2` regularization) in
:numref:`sec_weight_decay`, the (inverse) norm of the parameters also
represents a useful measure of simplicity. Another useful notion of
simplicity is smoothness, i.e., that the function should not be
sensitive to small changes to its inputs. For instance, when we classify
images, we would expect that adding some random noise to the pixels
should be mostly harmless.
:cite:t:`Bishop.1995` formalized this idea when he proved that training
with input noise is equivalent to Tikhonov regularization. This work
drew a clear mathematical connection between the requirement that a
function be smooth (and thus simple), and the requirement that it be
resilient to perturbations in the input.
Then, :cite:t:`Srivastava.Hinton.Krizhevsky.ea.2014` developed a clever
idea for how to apply Bishop’s idea to the internal layers of a network,
too. Their idea, called *dropout*, involves injecting noise while
computing each internal layer during forward propagation, and it has
become a standard technique for training neural networks. The method is
called *dropout* because we literally *drop out* some neurons during
training. Throughout training, on each iteration, standard dropout
consists of zeroing out some fraction of the nodes in each layer before
calculating the subsequent layer.
To be clear, we are imposing our own narrative with the link to Bishop.
The original paper on dropout offers intuition through a surprising
analogy to sexual reproduction. The authors argue that neural network
overfitting is characterized by a state in which each layer relies on a
specific pattern of activations in the previous layer, calling this
condition *co-adaptation*. Dropout, they claim, breaks up co-adaptation
just as sexual reproduction is argued to break up co-adapted genes.
While such a justification is certainly up for debate,
the dropout technique itself has proved enduring, and various forms of
dropout are implemented in most deep learning libraries.
The key challenge is how to inject this noise. One idea is to inject it
in an *unbiased* manner so that the expected value of each layer—while
fixing the others—equals the value it would have taken absent noise. In
Bishop’s work, he added Gaussian noise to the inputs to a linear model.
At each training iteration, he added noise sampled from a distribution
with mean zero :math:`\epsilon \sim \mathcal{N}(0,\sigma^2)` to the
input :math:`\mathbf{x}`, yielding a perturbed point
:math:`\mathbf{x}' = \mathbf{x} + \epsilon`. In expectation,
:math:`E[\mathbf{x}'] = \mathbf{x}`.
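
As a rough numerical illustration of this unbiasedness, the following sketch perturbs a fixed input many times and averages the results. It uses only the Python standard library; the input vector, the noise scale ``sigma``, the sample count, and the helper names ``perturb`` and ``average_perturbation`` are arbitrary choices made for this example and are not taken from Bishop's paper.

.. code:: python

   import random

   def perturb(x, sigma, rng):
       """Return x + eps with eps drawn elementwise from N(0, sigma^2)."""
       return [xi + rng.gauss(0.0, sigma) for xi in x]

   def average_perturbation(x, sigma, num_samples, seed=0):
       """Average many noisy copies of x to estimate E[x']."""
       rng = random.Random(seed)
       totals = [0.0] * len(x)
       for _ in range(num_samples):
           for i, v in enumerate(perturb(x, sigma, rng)):
               totals[i] += v
       return [t / num_samples for t in totals]

   x = [1.0, -2.0, 0.5]  # made-up input
   x_mean = average_perturbation(x, sigma=0.1, num_samples=20000)
   print(x_mean)  # close to x, since the noise has mean zero

Each individual sample differs from ``x``, but the average settles near it as the number of samples grows.
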
In standard dropout regularization, one zeros out some fraction of the
nodes in each layer and then *debiases* each layer by normalizing by the
fraction of nodes that were retained (not dropped out). In other words,
with *dropout probability* :math:`p`, each intermediate activation
:math:`h` is replaced by a random variable :math:`h'` as follows:
.. math::

   \begin{aligned}
   h' =
   \begin{cases}
       0 & \textrm{ with probability } p \\
       \frac{h}{1-p} & \textrm{ otherwise}
   \end{cases}
   \end{aligned}

By design, the expectation remains unchanged, i.e., :math:`E[h'] = h`.
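
For completeness, this expectation can be checked in one line, and the same calculation gives the variance that the rescaling introduces. This is a standard computation added here for illustration, treating :math:`h` as a fixed number and assuming :math:`p < 1`:

.. math::

   \begin{aligned}
   E[h'] &= p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h, \\
   \mathrm{Var}[h'] &= E[h'^2] - E[h']^2 = (1-p) \cdot \frac{h^2}{(1-p)^2} - h^2 = \frac{p}{1-p} h^2.
   \end{aligned}

The mean is preserved for every such :math:`p`, whereas the variance grows as :math:`p` approaches 1.
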
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import torch
from torch import nn
from d2l import torch as d2l
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l
npx.set_np()
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
from functools import partial
import jax
import optax
from flax import linen as nn
from jax import numpy as jnp
from d2l import jax as d2l
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
import tensorflow as tf
from d2l import tensorflow as d2l
.. raw:: html
.. raw:: html
Dropout in Practice
-------------------
Recall the MLP with a hidden layer and five hidden units from
:numref:`fig_mlp`. When we apply dropout to a hidden layer, zeroing
out each hidden unit with probability :math:`p`, the result can be
viewed as a network containing only a subset of the original neurons. In
:numref:`fig_dropout2`, :math:`h_2` and :math:`h_5` are removed.
Consequently, the calculation of the outputs no longer depends on
:math:`h_2` or :math:`h_5` and their respective gradients also vanish
when performing backpropagation. In this way, the calculation of the
output layer cannot be overly dependent on any one element of
:math:`h_1, \ldots, h_5`.
.. _fig_dropout2:

.. figure:: ../img/dropout2.svg

   MLP before and after dropout.

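
The vanishing gradients of dropped units can be checked numerically on a toy example. The sketch below is a simplification under stated assumptions: a single scalar output computed by a linear layer on top of five hidden activations, with made-up values for the activations and weights, a fixed mask that drops :math:`h_2` and :math:`h_5`, and a hypothetical helper ``grad_wrt_hidden`` that estimates the gradients by finite differences rather than backpropagation.

.. code:: python

   def output(h, w, mask, p):
       """Scalar output of a linear layer applied to dropped-out activations."""
       return sum(wi * mi * hi / (1.0 - p) for wi, mi, hi in zip(w, mask, h))

   def grad_wrt_hidden(h, w, mask, p, eps=1e-6):
       """Central finite-difference estimate of d(output)/d(h_i) for each i."""
       grads = []
       for i in range(len(h)):
           h_plus = list(h)
           h_plus[i] += eps
           h_minus = list(h)
           h_minus[i] -= eps
           diff = output(h_plus, w, mask, p) - output(h_minus, w, mask, p)
           grads.append(diff / (2 * eps))
       return grads

   h = [0.3, 1.2, -0.7, 0.9, 2.0]    # hidden activations h_1..h_5 (made up)
   w = [0.5, -1.0, 0.25, 2.0, -0.5]  # output weights (made up)
   mask = [1, 0, 1, 1, 0]            # h_2 and h_5 are dropped
   p = 0.5

   grads = grad_wrt_hidden(h, w, mask, p)
   print(grads)  # zero for the dropped units, w_i / (1 - p) for the kept ones

For the kept units the gradient is the output weight scaled by :math:`1/(1-p)`, which is the same rescaling applied in the forward pass.
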
Typically, we disable dropout at test time. Given a trained model and a
new example, we do not drop out any nodes and thus do not need to
normalize. However, there are some exceptions. Some researchers use
dropout at test time as a heuristic for estimating the *uncertainty* of
neural network predictions: if the predictions agree across many
different dropout masks, then we might say that the network is more
confident.
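
A minimal sketch of this heuristic is given below. It is a toy stand-in rather than a trained network: the hidden activations ``h`` and output weights ``w`` are made-up numbers, the model is a single linear output layer, and the function names (``dropout``, ``predict``, ``mc_dropout``) are hypothetical. The idea is simply to keep dropout switched on, repeat the forward pass with many random masks, and look at how much the predictions spread.

.. code:: python

   import random
   import statistics

   def dropout(h, p, rng):
       """Inverted dropout on a list of activations."""
       if p == 1:
           return [0.0 for _ in h]
       return [hi / (1.0 - p) if rng.random() > p else 0.0 for hi in h]

   def predict(h, w, p, rng):
       """One stochastic forward pass of a toy linear output layer."""
       return sum(wi * hi for wi, hi in zip(w, dropout(h, p, rng)))

   def mc_dropout(h, w, p, num_passes=1000, seed=0):
       """Mean and standard deviation of the prediction over many masks."""
       rng = random.Random(seed)
       preds = [predict(h, w, p, rng) for _ in range(num_passes)]
       return statistics.mean(preds), statistics.stdev(preds)

   h = [0.3, 1.2, -0.7, 0.9, 2.0]    # made-up hidden activations
   w = [0.5, -1.0, 0.25, 2.0, -0.5]  # made-up output weights

   mean_pred, spread = mc_dropout(h, w, p=0.5)
   _, spread_no_dropout = mc_dropout(h, w, p=0.0)
   print(mean_pred, spread, spread_no_dropout)

Without dropout every pass gives the same prediction, so the spread is zero. With dropout enabled, a small spread relative to the mean would be read as agreement across masks, and a large one as a sign of lower confidence.
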
Implementation from Scratch
---------------------------
To implement the dropout function for a single layer, we must draw as
many samples from a Bernoulli (binary) random variable as our layer has
dimensions, where the random variable takes value :math:`1` (keep) with
probability :math:`1-p` and :math:`0` (drop) with probability :math:`p`.
One easy way to implement this is to first draw samples from the uniform
distribution :math:`U[0, 1]`. Then we can keep those nodes for which the
corresponding sample is greater than :math:`p`, dropping the rest.
In the following code, we implement a ``dropout_layer`` function that
drops out the elements in the tensor input ``X`` with probability
``dropout``, rescaling the remainder as described above: dividing the
survivors by ``1.0-dropout``.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
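.. code:: python

   def dropout_layer(X, dropout):
       assert 0 <= dropout <= 1
       # Dropping everything: return early to avoid dividing by zero below
       if dropout == 1: return torch.zeros_like(X)
       # Keep each element with probability 1 - dropout; sample the mask
       # on the same device as X so the product below is well defined
       mask = (torch.rand(X.shape, device=X.device) > dropout).float()
       return mask * X / (1.0 - dropout)
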
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
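.. code:: python

   def dropout_layer(X, dropout):
       assert 0 <= dropout <= 1
       # Dropping everything: return early to avoid dividing by zero below
       if dropout == 1: return np.zeros_like(X)
       # Keep each element with probability 1 - dropout
       mask = np.random.uniform(0, 1, X.shape) > dropout
       return mask.astype(np.float32) * X / (1.0 - dropout)
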
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
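.. code:: python

   # Note: the default `key` is evaluated once, when the function is
   # defined; pass a fresh key explicitly to obtain a different mask
   def dropout_layer(X, dropout, key=d2l.get_key()):
       assert 0 <= dropout <= 1
       # Dropping everything: return early to avoid dividing by zero below
       if dropout == 1: return jnp.zeros_like(X)
       # Keep each element with probability 1 - dropout
       mask = jax.random.uniform(key, X.shape) > dropout
       return jnp.asarray(mask, dtype=jnp.float32) * X / (1.0 - dropout)
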
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
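.. code:: python

   def dropout_layer(X, dropout):
       assert 0 <= dropout <= 1
       # Dropping everything: return early to avoid dividing by zero below
       if dropout == 1: return tf.zeros_like(X)
       # A uniform sample falls below 1 - dropout with probability
       # 1 - dropout, so this keeps each element with that probability
       mask = tf.random.uniform(
           shape=tf.shape(X), minval=0, maxval=1) < 1 - dropout
       return tf.cast(mask, dtype=tf.float32) * X / (1.0 - dropout)
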
.. raw:: html
.. raw:: html
We can test out the ``dropout_layer`` function on a few examples. In the
following lines of code, we pass our input ``X`` through the dropout
operation, with probabilities 0, 0.5, and 1, respectively.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = torch.arange(16, dtype = torch.float32).reshape((2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
dropout_p = 0: tensor([[ 0., 1., 2., 3., 4., 5., 6., 7.],
[ 8., 9., 10., 11., 12., 13., 14., 15.]])
dropout_p = 0.5: tensor([[ 0., 2., 0., 6., 8., 0., 0., 0.],
[16., 18., 20., 22., 24., 26., 28., 30.]])
dropout_p = 1: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]])
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = np.arange(16).reshape(2, 8)
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
dropout_p = 0: [[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]]
dropout_p = 0.5: [[ 0. 0. 0. 0. 8. 10. 12. 0.]
[16. 0. 20. 22. 0. 0. 0. 30.]]
dropout_p = 1: [[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = jnp.arange(16, dtype=jnp.float32).reshape(2, 8)
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
dropout_p = 0: [[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]]
dropout_p = 0.5: [[ 0. 0. 0. 6. 0. 0. 12. 0.]
[ 0. 18. 20. 22. 0. 0. 28. 0.]]
dropout_p = 1: [[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]]
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
X = tf.reshape(tf.range(16, dtype=tf.float32), (2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))
.. raw:: latex
\diilbookstyleoutputcell
.. parsed-literal::
:class: output
dropout_p = 0: tf.Tensor(
[[ 0. 1. 2. 3. 4. 5. 6. 7.]
[ 8. 9. 10. 11. 12. 13. 14. 15.]], shape=(2, 8), dtype=float32)
dropout_p = 0.5: tf.Tensor(
[[ 0. 0. 0. 6. 0. 0. 0. 14.]
[ 0. 18. 20. 22. 0. 0. 28. 30.]], shape=(2, 8), dtype=float32)
dropout_p = 1: tf.Tensor(
[[0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0.]], shape=(2, 8), dtype=float32)
.. raw:: html
.. raw:: html
Defining the Model
~~~~~~~~~~~~~~~~~~
The model below applies dropout to the output of each hidden layer
(following the activation function). We can set dropout probabilities
for each layer separately. A common choice is to set a lower dropout
probability closer to the input layer. We ensure that dropout is only
active during training.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLPScratch(d2l.Classifier):
       def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                    dropout_1, dropout_2, lr):
           super().__init__()
           self.save_hyperparameters()
           self.lin1 = nn.LazyLinear(num_hiddens_1)
           self.lin2 = nn.LazyLinear(num_hiddens_2)
           self.lin3 = nn.LazyLinear(num_outputs)
           self.relu = nn.ReLU()

       def forward(self, X):
           H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
           if self.training:
               H1 = dropout_layer(H1, self.dropout_1)
           H2 = self.relu(self.lin2(H1))
           if self.training:
               H2 = dropout_layer(H2, self.dropout_2)
           return self.lin3(H2)

.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLPScratch(d2l.Classifier):
       def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                    dropout_1, dropout_2, lr):
           super().__init__()
           self.save_hyperparameters()
           self.lin1 = nn.Dense(num_hiddens_1, activation='relu')
           self.lin2 = nn.Dense(num_hiddens_2, activation='relu')
           self.lin3 = nn.Dense(num_outputs)
           self.initialize()

       def forward(self, X):
           H1 = self.lin1(X)
           if autograd.is_training():
               H1 = dropout_layer(H1, self.dropout_1)
           H2 = self.lin2(H1)
           if autograd.is_training():
               H2 = dropout_layer(H2, self.dropout_2)
           return self.lin3(H2)

.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLPScratch(d2l.Classifier):
       num_hiddens_1: int
       num_hiddens_2: int
       num_outputs: int
       dropout_1: float
       dropout_2: float
       lr: float
       training: bool = True

       def setup(self):
           self.lin1 = nn.Dense(self.num_hiddens_1)
           self.lin2 = nn.Dense(self.num_hiddens_2)
           self.lin3 = nn.Dense(self.num_outputs)
           self.relu = nn.relu

       def forward(self, X):
           H1 = self.relu(self.lin1(X.reshape(X.shape[0], -1)))
           if self.training:
               H1 = dropout_layer(H1, self.dropout_1)
           H2 = self.relu(self.lin2(H1))
           if self.training:
               H2 = dropout_layer(H2, self.dropout_2)
           return self.lin3(H2)

.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLPScratch(d2l.Classifier):
       def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                    dropout_1, dropout_2, lr):
           super().__init__()
           self.save_hyperparameters()
           self.lin1 = tf.keras.layers.Dense(num_hiddens_1, activation='relu')
           self.lin2 = tf.keras.layers.Dense(num_hiddens_2, activation='relu')
           self.lin3 = tf.keras.layers.Dense(num_outputs)

       def forward(self, X):
           H1 = self.lin1(tf.reshape(X, (X.shape[0], -1)))
           if self.training:
               H1 = dropout_layer(H1, self.dropout_1)
           H2 = self.lin2(H1)
           if self.training:
               H2 = dropout_layer(H2, self.dropout_2)
           return self.lin3(H2)

.. raw:: html
.. raw:: html
Training
~~~~~~~~
The following is similar to the training of MLPs described previously.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_63_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_66_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_69_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)
data = d2l.FashionMNIST(batch_size=256)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_72_0.svg
.. raw:: html
.. raw:: html
Concise Implementation
----------------------
With high-level APIs, all we need to do is add a ``Dropout`` layer after
each hidden fully connected layer (following its activation), passing
the dropout probability to its constructor. During training, the
``Dropout`` layer will randomly drop out outputs of the previous layer
(or equivalently, the inputs to the subsequent layer) according to the
specified dropout probability. When not in training mode, the
``Dropout`` layer simply passes the data through unchanged.
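
Before turning to the framework code, the switch between training and evaluation behavior can be mimicked in a few lines. The class below is a framework-free sketch, not the implementation used by any library: ``ToyDropout``, its ``training`` flag, and the seeded random generator are all hypothetical names and choices made for this illustration.

.. code:: python

   import random

   class ToyDropout:
       """Framework-free stand-in for a dropout layer with a training flag."""

       def __init__(self, p, seed=0):
           assert 0 <= p <= 1
           self.p = p
           self.training = True
           self.rng = random.Random(seed)

       def __call__(self, xs):
           if not self.training or self.p == 0:
               return list(xs)  # identity outside of training
           if self.p == 1:
               return [0.0 for _ in xs]
           scale = 1.0 / (1.0 - self.p)
           return [x * scale if self.rng.random() > self.p else 0.0
                   for x in xs]

   layer = ToyDropout(0.5)
   xs = [1.0, 2.0, 3.0, 4.0]
   train_out = layer(xs)   # each entry is either 0.0 or twice the input
   layer.training = False
   eval_out = layer(xs)    # unchanged copy of the input
   print(train_out, eval_out)

The library layers used below expose the same idea through their own mechanisms for distinguishing training from evaluation.
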
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLP(d2l.Classifier):
       def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                    dropout_1, dropout_2, lr):
           super().__init__()
           self.save_hyperparameters()
           self.net = nn.Sequential(
               nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(),
               nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(),
               nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))

.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLP(d2l.Classifier):
       def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                    dropout_1, dropout_2, lr):
           super().__init__()
           self.save_hyperparameters()
           self.net = nn.Sequential()
           self.net.add(nn.Dense(num_hiddens_1, activation="relu"),
                        nn.Dropout(dropout_1),
                        nn.Dense(num_hiddens_2, activation="relu"),
                        nn.Dropout(dropout_2),
                        nn.Dense(num_outputs))
           self.net.initialize()

.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLP(d2l.Classifier):
       num_hiddens_1: int
       num_hiddens_2: int
       num_outputs: int
       dropout_1: float
       dropout_2: float
       lr: float
       training: bool = True

       @nn.compact
       def __call__(self, X):
           x = nn.relu(nn.Dense(self.num_hiddens_1)(X.reshape((X.shape[0], -1))))
           x = nn.Dropout(self.dropout_1, deterministic=not self.training)(x)
           x = nn.relu(nn.Dense(self.num_hiddens_2)(x))
           x = nn.Dropout(self.dropout_2, deterministic=not self.training)(x)
           return nn.Dense(self.num_outputs)(x)

Note that we need to redefine the loss function, since a network with a
dropout layer needs a PRNG key when using ``Module.apply()``, and this
RNG stream must be explicitly named ``dropout``. Flax's ``Dropout``
layer uses this key internally to generate the random dropout mask. It
is important to use a fresh ``dropout_rng`` key with every epoch in the
training loop; otherwise the same dropout masks would be regenerated in
each epoch instead of being sampled anew. This ``dropout_rng`` can be
stored as an attribute of the ``TrainState`` object (in the
``d2l.Trainer`` class defined in :numref:`oo-design-training`), and with
every epoch it is replaced with a new ``dropout_rng``. We already
handled this with the ``fit_epoch`` method defined in
:numref:`sec_linear_scratch`.
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   @d2l.add_to_class(d2l.Classifier)  #@save
   @partial(jax.jit, static_argnums=(0, 5))
   def loss(self, params, X, Y, state, averaged=True):
       Y_hat = state.apply_fn({'params': params}, *X,
                              mutable=False,  # To be used later (e.g., batch norm)
                              rngs={'dropout': state.dropout_rng})
       Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
       Y = Y.reshape((-1,))
       fn = optax.softmax_cross_entropy_with_integer_labels
       # The returned empty dictionary is a placeholder for auxiliary data,
       # which will be used later (e.g., for batch norm)
       return (fn(Y_hat, Y).mean(), {}) if averaged else (fn(Y_hat, Y), {})

.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python

   class DropoutMLP(d2l.Classifier):
       def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                    dropout_1, dropout_2, lr):
           super().__init__()
           self.save_hyperparameters()
           self.net = tf.keras.models.Sequential([
               tf.keras.layers.Flatten(),
               tf.keras.layers.Dense(num_hiddens_1, activation=tf.nn.relu),
               tf.keras.layers.Dropout(dropout_1),
               tf.keras.layers.Dense(num_hiddens_2, activation=tf.nn.relu),
               tf.keras.layers.Dropout(dropout_2),
               tf.keras.layers.Dense(num_outputs)])

.. raw:: html
.. raw:: html
Next, we train the model.
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = DropoutMLP(**hparams)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_95_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = DropoutMLP(**hparams)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_98_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = DropoutMLP(**hparams)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_101_0.svg
.. raw:: html
.. raw:: html
.. raw:: latex
\diilbookstyleinputcell
.. code:: python
model = DropoutMLP(**hparams)
trainer.fit(model, data)
.. figure:: output_dropout_1110bf_104_0.svg
.. raw:: html
.. raw:: html
Summary
-------
Beyond controlling the number of dimensions and the size of the weight
vector, dropout is yet another tool for avoiding overfitting. Often
these tools are used jointly. Note that dropout is used only during training:
it replaces an activation :math:`h` with a random variable with expected
value :math:`h`.
Exercises
---------
1. What happens if you change the dropout probabilities for the first
and second layers? In particular, what happens if you switch the ones
for both layers? Design an experiment to answer these questions,
describe your results quantitatively, and summarize the qualitative
takeaways.
2. Increase the number of epochs and compare the results obtained when
using dropout with those when not using it.
3. What is the variance of the activations in each hidden layer when
dropout is and is not applied? Draw a plot to show how this quantity
evolves over time for both models.
4. Why is dropout not typically used at test time?
5. Using the model in this section as an example, compare the effects of
using dropout and weight decay. What happens when dropout and weight
decay are used at the same time? Are the results additive? Are there
diminished returns (or worse)? Do they cancel each other out?
6. What happens if we apply dropout to the individual weights of the
weight matrix rather than the activations?
7. Invent another technique for injecting random noise at each layer
that is different from the standard dropout technique. Can you
develop a method that outperforms dropout on the Fashion-MNIST
dataset (for a fixed architecture)?
.. raw:: html
.. raw:: html
`Discussions `__
.. raw:: html
.. raw:: html