a.k.a. GMM, a probabilistic clustering model

  • Clusters are the hills (modes) in the normalized density function of the feature space.
    • The distribution over the feature space has varying density → regions with more points are denser
    • normalized density function → returns the density at a point and integrates to one
  • peak or mode of the hill is the cluster center.
  • look at the density function as a probability distribution function
  • a mixture of gaussians is fit to model this probability distribution
  • can handle overlapping clusters → assigns points to clusters with some probability (see the sketch after this list)
  • Generative, as it gives a probability model of x
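A quick sketch of these soft assignments with scikit-learn's GaussianMixture; the toy 2-D data and parameter choices are placeholders, not from these notes:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two overlapping 2-D blobs
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gmm.predict_proba(X)   # soft assignments: each row sums to 1
labels = gmm.predict(X)       # hard labels = argmax of the soft assignments
print(resp[:3], labels[:3])
```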

See Gaussian Distribution

  • fitting 1 gaussian to 1-dimensional data: p(x) = ω η(x ; μ, σ), where η(x ; μ, σ) = (1/√(2πσ²)) exp(−(x − μ)² / 2σ²)
    • ω is the scale or evidence
    • μ is the mean and σ is the std dev of the gaussian η
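A tiny numeric check of this weighted 1-D density, with scipy.stats.norm.pdf standing in for η(x ; μ, σ); the numbers are illustrative assumptions:

```python
from scipy.stats import norm

omega, mu, sigma = 1.0, 0.0, 2.0                      # scale/evidence, mean, std dev
x = 1.5
density = omega * norm.pdf(x, loc=mu, scale=sigma)    # ω · η(x ; μ, σ)
print(density)                                        # ≈ 0.1506
```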

High dimensional GMM

  • Assume the probability distribution p(x) in some D dimensions is made of K different gaussians
    • weighted sum of K gaussians: p(x) = Σ_{k=1..K} ωk η(x ; μk, σk)
    • such that Σk ωk = 1 and ωk ≥ 0
    • where x ∈ ℝ^D
  • X belongs to the gaussian k for which ||X − μk|| is minimal and ||X − μk|| < 2.5 σk
  • Equivalent Latent variable form
    • p(z=k) = ωk
      • select a mixture component with probability ωk
    • p(x | z = k) = η(x ; μk, σk)
      • sample from that component’s gaussian (this generative process is sketched after the list)
    • then p(x) is the marginal over z: p(x) = Σk p(z = k) p(x | z = k) = Σk ωk η(x ; μk, σk)
    • z is the latent variable
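A minimal sketch of this latent-variable view as a generative sampler: draw z from the mixture weights, then draw x from that component's gaussian. The weights, means, and covariances below are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
omega = np.array([0.5, 0.3, 0.2])                     # mixture weights, sum to 1
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # component means
covs = np.array([np.eye(2)] * 3)                      # one 2x2 covariance per component

def sample(n):
    z = rng.choice(len(omega), size=n, p=omega)       # p(z = k) = ωk
    x = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])  # p(x | z = k)
    return x, z

X, Z = sample(500)
print(X.shape, np.bincount(Z) / len(Z))   # empirical component frequencies ≈ ω
```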

Expectation-Maximization (EM) Algorithm

  • start with k clusters. Select some μc, σc and ωc (a.k.a. πc) for all the clusters.
  • Expectation step:
    • for all N datapoints, compute soft responsibilities ric ∝ πc η(xi ; μc, σc), the probability that xi belongs to cluster c → an N × K matrix
    • normalize each row to sum to one
    • if xi likely belongs to cluster c, ric for that cluster will be high
  • Maximization step:
    • fixing the ric, recompute the cluster parameters (see the sketch after this list)
    • for each cluster c: ωc = (1/N) Σi ric, μc = (Σi ric xi) / (Σi ric), σc² = (Σi ric (xi − μc)²) / (Σi ric)
  • Repeat E-M steps until convergence
    • each step increases (never decreases) the log-likelihood log P(x)
    • convergence when log-likelihood doesn’t change much
    • convergence is guaranteed, but only to a local optimum
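A from-scratch sketch of these E and M steps on toy 1-D data with NumPy; variable names mirror the notes (omega, mu, sigma, resp = ric), and the data and initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 1-D data drawn from two gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
N, K = len(x), 2

# initialization: pick some mu_c, sigma_c, omega_c (a.k.a. pi_c)
mu = rng.choice(x, size=K)
sigma = np.full(K, x.std())
omega = np.full(K, 1.0 / K)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prev_ll = -np.inf
for it in range(200):
    # E-step: responsibilities r_ic, an N x K matrix normalized over c
    dens = omega * gaussian_pdf(x[:, None], mu, sigma)         # N x K
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate omega_c, mu_c, sigma_c with r_ic held fixed
    Nc = resp.sum(axis=0)                                      # effective counts per cluster
    omega = Nc / N
    mu = (resp * x[:, None]).sum(axis=0) / Nc
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nc)

    # monitor the log-likelihood; it never decreases across EM iterations
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(mu, sigma, omega)
```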

How to choose k?

  • penalize the log-likelihood for model complexity (e.g. AIC/BIC), since adding components always increases the raw likelihood (sketch below)
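A hedged sketch of penalized likelihood via BIC with scikit-learn's GaussianMixture (lower BIC is better); the data and candidate range of k are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(150, 2)) for m in (0.0, 5.0, 10.0)])

scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in range(1, 7)}
best_k = min(scores, key=scores.get)   # lowest BIC = best penalized fit
print(scores, best_k)
```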

Pros

  • can handle overlapping clusters

Cons

  • sensitive to parameter initialization → can converge to local optima
    • reinitialize multiple times
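One way to reinitialize multiple times is scikit-learn's n_init, which reruns EM from several random starts and keeps the best fit; the data here is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])

gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)
print(gmm.lower_bound_)   # per-sample log-likelihood bound of the best of the 10 runs
```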

Alternatives

  • Stochastic EM
  • Hard EM
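Hard EM replaces the soft responsibilities with one-hot assignments (argmax per point) before the M-step; a minimal sketch of that conversion, with made-up responsibilities:

```python
import numpy as np

# hypothetical soft responsibilities (N = 3 points, K = 2 clusters)
resp = np.array([[0.9, 0.1],
                 [0.4, 0.6],
                 [0.2, 0.8]])

# Hard EM: keep only each row's argmax as a one-hot assignment
hard = np.zeros_like(resp)
hard[np.arange(len(resp)), resp.argmax(axis=1)] = 1.0
print(hard)
```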