in a deep neural network, backpropagation multiplies many derivatives of activation functions together - for some activation functions these derivatives are small, so the gradient reaching the early layers becomes ≈ 0
eg Sigmoid or Logistic activation - its derivative σ'(x) = σ(x)(1 − σ(x)) ≤ 0.25, so each layer in the chain multiplies the gradient by at most 0.25 (see the sketch below)
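a quick numerical sketch of this (the grid of x values and the depths chosen are illustrative, not from the notes): the sigmoid derivative never exceeds 0.25, and a product of factors that are each at most 0.25 collapses geometrically with depth

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-10, 10, 10001)
print(sigmoid_derivative(x).max())   # 0.25, attained at x = 0

# even in the best case (every factor exactly 0.25), the chain shrinks fast:
for depth in (5, 10, 20, 50):
    print(depth, 0.25 ** depth)      # ~1e-3, ~1e-6, ~1e-12, ~1e-30
```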
opposite of Exploding Gradient Problem
can be mitigated by the ReLU activation, whose derivative is exactly 1 for positive inputs, so the chain of factors does not shrink with depth (see the sketch below)
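a minimal sketch comparing the two, assuming a toy stack of 50 layers with positive (active) pre-activations drawn at random - the variable names and numbers are illustrative only; note that for negative inputs ReLU's derivative is 0, which is a separate issue (dead units), not shown here

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 50
# hypothetical positive pre-activations, one per layer (active ReLU units)
z = rng.uniform(0.5, 2.0, size=depth)

# product of per-layer local derivatives picked up during backprop
sigmoid_chain = np.prod(sigmoid(z) * (1.0 - sigmoid(z)))  # each factor <= 0.25
relu_chain = np.prod(np.where(z > 0, 1.0, 0.0))           # ReLU'(z) = 1 for z > 0

print("sigmoid chain:", sigmoid_chain)  # vanishes toward 0 as depth grows
print("relu chain:   ", relu_chain)     # exactly 1.0 -> gradient magnitude preserved
```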