Initialization matters because it determines whether the network starts optimization in a saturated state.


A neuron is saturated when its output lies in the 'flat' part of the activation function, e.g. where σ′(z) ≈ 0 for Sigmoid or Logistic Activation.

When a neural network is saturated, gradient steps may make little progress because the gradient is very small, even though the loss is large.
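
A minimal sketch of what saturation looks like for the logistic function, using only the standard library. The helper names `sigmoid` and `sigmoid_grad` and the sample values of z are chosen here for illustration; they are not from the original notes.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical pre-activations: one at 0, the others further into the flat region.
for z in (0.0, 2.0, 10.0):
    print(f"z = {z:5.1f}  sigma'(z) = {sigmoid_grad(z):.6f}")
```

The derivative peaks at 0.25 for z = 0 and drops below 1e-4 by z = 10, so any gradient passed back through such a unit is scaled down by that factor.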

It is important to stress that the randomness in the initializations plays an important role in symmetry breaking. All the units in a given layer typically are symmetric to begin with. If all the weights are initialized identically, then there will be nothing to differentiate them during training.
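
The symmetry argument can be checked on a toy network. The sketch below assumes a hypothetical setup that is not in the notes: two inputs, two sigmoid hidden units, a linear output, and a squared-error loss. The function name `hidden_grads` and all numeric values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_grads(W, v, x, y):
    """Gradient of 0.5 * (y_hat - y)^2 w.r.t. each hidden unit's weight row.

    W: list of weight rows (one per hidden unit), v: output weights,
    x: input vector, y: target. Output is linear: y_hat = sum_j v[j] * h[j].
    """
    h = [sigmoid(sum(w_i * x_i for w_i, x_i in zip(row, x))) for row in W]
    y_hat = sum(v_j * h_j for v_j, h_j in zip(v, h))
    err = y_hat - y
    # dL/dW[j][i] = err * v[j] * h[j] * (1 - h[j]) * x[i]
    return [[err * v_j * h_j * (1.0 - h_j) * x_i for x_i in x]
            for v_j, h_j in zip(v, h)]

x, y = [1.0, -2.0], 1.0

# Identical initialization: both hidden units start with the same weights.
same = hidden_grads([[0.5, 0.5], [0.5, 0.5]], [0.3, 0.3], x, y)
# Distinct initialization: hypothetical small, different values.
diff = hidden_grads([[0.5, -0.1], [-0.2, 0.4]], [0.3, 0.3], x, y)

print(same[0] == same[1])  # True: identical gradients, so the units stay identical
print(diff[0] == diff[1])  # False: gradients differ, so the units can diverge
```

With identical weights, every gradient step updates both units by the same amount, so they compute the same function throughout training.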

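For Sigmoid or Logistic Activation, weights can be initialized to 0 or to random values around 0, which keeps z out of the flat regions; random values are needed when a layer has several units, so that symmetry is broken. For a neuron with d weights (w1 … wd), draw each weight from a Gaussian distribution with mean 0 and variance 1/d, assuming that the inputs xi satisfy E[xi²] ≈ 1. The pre-activation z = Σ wi xi then has variance of about 1. Bias terms can be set to 0 or to some small value around 0.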
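
A rough empirical check of the variance-1/d rule, as a sketch only. It assumes independent standard-normal inputs (so E[xi²] = 1) and independent weights, in which case Var(z) = d · (1/d) · 1 = 1. The function names, the seed, and the trial counts are arbitrary choices made here.

```python
import math
import random

def init_weights(d, rng):
    """Draw d weights from a Gaussian with mean 0 and variance 1/d."""
    std = math.sqrt(1.0 / d)
    return [rng.gauss(0.0, std) for _ in range(d)]

def preactivation_variance(d, trials, rng):
    """Empirical variance of z = sum_i w_i * x_i over fresh weights and inputs."""
    zs = []
    for _ in range(trials):
        w = init_weights(d, rng)
        # Assumption: inputs are standard normal, so E[x_i^2] = 1.
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        zs.append(sum(w_i * x_i for w_i, x_i in zip(w, x)))
    mean = sum(zs) / trials
    return sum((z - mean) ** 2 for z in zs) / trials

rng = random.Random(0)  # fixed seed, chosen arbitrarily
for d in (10, 100, 500):
    print(d, round(preactivation_variance(d, 1000, rng), 3))
```

The printed variances should stay close to 1 regardless of d, which keeps z in the range where the sigmoid is not flat. With a fixed weight variance independent of d, Var(z) would instead grow linearly with d.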

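For ReLU activation, a small positive constant (e.g. 0.1) is commonly used for the bias terms, so that most units start in the active region where the gradient is nonzero.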
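
One way to read the ReLU note, sketched under the assumption that the small positive constant is applied to the bias while the weights are small random values. The function name `relu_active_fraction`, the weight scale 0.01, the bias 0.1, and the layer sizes are all hypothetical.

```python
import random

def relu_active_fraction(bias, d, n_units, rng):
    """Fraction of ReLU units whose pre-activation z = w . x + bias is positive."""
    std = 0.01  # hypothetical small weight scale, for illustration only
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # one random input vector
    active = 0
    for _ in range(n_units):
        w = [rng.gauss(0.0, std) for _ in range(d)]
        z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + bias
        if z > 0.0:
            active += 1
    return active / n_units

rng = random.Random(0)  # fixed seed, chosen arbitrarily
print(relu_active_fraction(0.0, 20, 1000, rng))  # zero bias
print(relu_active_fraction(0.1, 20, 1000, rng))  # small positive bias
```

With zero bias about half of the units have z ≤ 0 and therefore pass back no gradient for this input; with the positive bias most units start active.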