• predicting the context given a word w_t

  • Let w_{t-1}, …, w_{t-m}, w_{t+1}, …, w_{t+m} be the context (a window of m words on each side of w_t)

  • By Bayes' rule: Pr(w_t | context) * Pr(context) = Pr(context | w_t) * Pr(w_t)

  • Pr(context) and Pr(w_t) are assumed to be uniform, so they are constants and Pr(w_t | context) is proportional to Pr(context | w_t)

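  • Pr(context | w_t) = Product { Pr(w_j | w_t) } over all context positions j (t-m ≤ j ≤ t+m, j ≠ t), assuming the context words are conditionally independent given w_t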

Word2Vec can be trained as a skip-gram model
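
The factorization of Pr(context | w_t) into a product of per-word terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how Pr(w_j | w_t) is parameterized, so the sketch assumes a softmax over dot products of two small, randomly initialized vector tables. The names `center_vecs`, `context_vecs`, and `context_log_prob` are hypothetical, and no training is performed.

```python
import math
import random


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_log_prob(tokens, t, m, vocab, center_vecs, context_vecs):
    """log Pr(context | w_t) = sum over context positions j of log Pr(w_j | w_t)."""
    center = center_vecs[vocab[tokens[t]]]
    # Assumption: Pr(w | w_t) is a softmax over dot-product scores.
    probs = softmax([dot(center, c) for c in context_vecs])
    log_prob = 0.0
    for j in range(max(0, t - m), min(len(tokens), t + m + 1)):
        if j == t:
            continue  # the center word is not part of its own context
        log_prob += math.log(probs[vocab[tokens[j]]])
    return log_prob


random.seed(0)
tokens = "the quick brown fox jumps".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
dim = 4  # illustrative embedding size
center_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
context_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

# Center word "brown" (t=2) with window m=1: the context is "quick" and "fox".
log_p = context_log_prob(tokens, t=2, m=1, vocab=vocab,
                         center_vecs=center_vecs, context_vecs=context_vecs)
p = math.exp(log_p)
print(f"Pr(context | w_t) = {p:.6f}")
```

Summing log-probabilities rather than multiplying raw probabilities gives the same product while avoiding underflow when the window or vocabulary grows.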