a.k.a. GMM, a probabilistic clustering model

  • Clusters are the hills (modes) in the normalized density function of the feature space.
    • The distribution over the feature space has varying density → regions with more points are denser
    • normalized density function → returns the density at a point and integrates to one
  • peak or mode of the hill is the cluster center.
  • look at the density function as a probability distribution function
  • a mixture of gaussians is fit to model this probability distribution
  • can handle overlapping clusters → assigns points to clusters with some probability (see the sketch after this list)
  • Generative, as it gives a probability model of x
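A quick sketch of these soft assignments with scikit-learn's GaussianMixture; the toy 2-D data and parameter choices are placeholders, not from these notes:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two overlapping 2-D blobs
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gmm.predict_proba(X)   # soft assignments: each row sums to 1
labels = gmm.predict(X)       # hard labels = argmax of the soft assignments
print(resp[:3], labels[:3])
```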

See Gaussian Distribution

  • fitting 1 gaussian to 1-dimensional data: p(x) = ω η(x ; μ, σ), where η(x ; μ, σ) = (1/√(2πσ²)) exp(−(x − μ)² / 2σ²)
    • ω is the scale or evidence
    • μ is the mean and σ is the std dev of the gaussian η
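A tiny numeric check of this weighted 1-D density, with scipy.stats.norm.pdf standing in for η(x ; μ, σ); the numbers are illustrative assumptions:

```python
from scipy.stats import norm

omega, mu, sigma = 1.0, 0.0, 2.0                      # scale/evidence, mean, std dev
x = 1.5
density = omega * norm.pdf(x, loc=mu, scale=sigma)    # ω · η(x ; μ, σ)
print(density)                                        # ≈ 0.1506
```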

High dimensional GMM

  • Assume the probability distribution p(x) in some D dimensions is made of K different gaussians
    • weighted sum of K gaussians: p(x) = Σ_{k=1..K} ωk η(x ; μk, σk)
    • such that Σk ωk = 1 and ωk ≥ 0
    • where x ∈ ℝ^D
  • X belongs to the gaussian k for which ||X − μk|| is minimal and ||X − μk|| < 2.5 σk
  • Equivalent Latent variable form
    • p(z=k) = ωk
      • select a mixture component with probability ωk
    • p(x | z = k) = η(x ; μk, σk)
      • sample from that component’s gaussian (this generative process is sketched after the list)
    • then p(x) is the marginal over z: p(x) = Σk p(z = k) p(x | z = k) = Σk ωk η(x ; μk, σk)
    • z is the latent variable
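A minimal sketch of this latent-variable view as a generative sampler: draw z from the mixture weights, then draw x from that component's gaussian. The weights, means, and covariances below are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
omega = np.array([0.5, 0.3, 0.2])                     # mixture weights, sum to 1
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # component means
covs = np.array([np.eye(2)] * 3)                      # one 2x2 covariance per component

def sample(n):
    z = rng.choice(len(omega), size=n, p=omega)       # p(z = k) = ωk
    x = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])  # p(x | z = k)
    return x, z

X, Z = sample(500)
print(X.shape, np.bincount(Z) / len(Z))   # empirical component frequencies ≈ ω
```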

Expectation-Maximization (EM) Algorithm

  • start with k clusters. Select some μc, σc and ωc (a.k.a. πc) for all the clusters.
  • Expectation step:
    • for all N datapoints, compute soft responsibilities ric ∝ πc η(xi ; μc, σc), the probability that xi belongs to cluster c → an N × K matrix
    • normalize each row to sum to one
    • if xi likely belongs to cluster c, ric for that cluster will be high
  • Maximization step:
    • fixing the ric, recompute the cluster parameters (see the sketch after this list)
    • for each cluster c: ωc = (1/N) Σi ric, μc = (Σi ric xi) / (Σi ric), σc² = (Σi ric (xi − μc)²) / (Σi ric)
  • Repeat E-M steps until convergence
    • each step increases (never decreases) the log-likelihood log P(x)
    • convergence when log-likelihood doesn’t change much
    • convergence is guaranteed, but only to a local optimum
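A from-scratch sketch of these E and M steps on toy 1-D data with NumPy; variable names mirror the notes (omega, mu, sigma, resp = ric), and the data and initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy 1-D data drawn from two gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
N, K = len(x), 2

# initialization: pick some mu_c, sigma_c, omega_c (a.k.a. pi_c)
mu = rng.choice(x, size=K)
sigma = np.full(K, x.std())
omega = np.full(K, 1.0 / K)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prev_ll = -np.inf
for it in range(200):
    # E-step: responsibilities r_ic, an N x K matrix normalized over c
    dens = omega * gaussian_pdf(x[:, None], mu, sigma)         # N x K
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate omega_c, mu_c, sigma_c with r_ic held fixed
    Nc = resp.sum(axis=0)                                      # effective counts per cluster
    omega = Nc / N
    mu = (resp * x[:, None]).sum(axis=0) / Nc
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nc)

    # monitor the log-likelihood; it never decreases across EM iterations
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(mu, sigma, omega)
```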

How to choose k?

  • penalize the log-likelihood for model complexity (e.g. AIC/BIC), since adding components always increases the raw likelihood (sketch below)
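A hedged sketch of penalized likelihood via BIC with scikit-learn's GaussianMixture (lower BIC is better); the data and candidate range of k are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(150, 2)) for m in (0.0, 5.0, 10.0)])

scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in range(1, 7)}
best_k = min(scores, key=scores.get)   # lowest BIC = best penalized fit
print(scores, best_k)
```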

Pros

  • can handle overlapping clusters

Cons

  • sensitive to parameter initialization → can converge to local optima
    • reinitialize multiple times
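One way to reinitialize multiple times is scikit-learn's n_init, which reruns EM from several random starts and keeps the best fit; the data here is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(5.0, 1.0, (200, 2))])

gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)
print(gmm.lower_bound_)   # per-sample log-likelihood bound of the best of the 10 runs
```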

Alternatives

  • Stochastic EM
  • Hard EM
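Hard EM replaces the soft responsibilities with one-hot assignments (argmax per point) before the M-step; a minimal sketch of that conversion, with made-up responsibilities:

```python
import numpy as np

# hypothetical soft responsibilities (N = 3 points, K = 2 clusters)
resp = np.array([[0.9, 0.1],
                 [0.4, 0.6],
                 [0.2, 0.8]])

# Hard EM: keep only each row's argmax as a one-hot assignment
hard = np.zeros_like(resp)
hard[np.arange(len(resp)), resp.argmax(axis=1)] = 1.0
print(hard)
```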