a.k.a. Bag-of-words
If word u appears in document d, d is a context of u
Process
- Acquire large volume of documents
- count the number of times a word u appears in a document d
- meaning of a word u is the (row-wise) count vector of documents that the word u appears in
- meaning(u) = [count(u,d1), count(u,d2), … ]
- vector dimension = |D| → D is the set of documents
- meaning of a document d is the (column-wise) count vector of words in the document
- meaning(d) = [count(u1,d), count(u2,d), … ]
- vector dimension = |V| → V is the vocabulary
We get a matrix X of size |D| × |V| (or its transpose, |V| × |D|)
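As a concrete illustration, the sketch below builds such a count matrix with scikit-learn's CountVectorizer; the library choice and the toy documents are assumptions for illustration, not part of the notes.

```python
# Minimal sketch of building the count matrix X, assuming scikit-learn is
# available; the toy documents are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse |D| x |V| matrix of counts
vocab = vectorizer.get_feature_names_out()  # the vocabulary V

# meaning(d): one row of X, a |V|-dimensional count vector for document d
meaning_d0 = X[0].toarray().ravel()

# meaning(u): one column of X, a |D|-dimensional count vector for word u
u = list(vocab).index("sat")
meaning_sat = X[:, u].toarray().ravel()

print(vocab)         # words indexing the columns
print(meaning_d0)    # how often each word occurs in document 0
print(meaning_sat)   # how often "sat" occurs in each document
```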
Pros
- find similar documents
- find documents close to a query (by considering the query as a document); see the sketch after this list
- compare and visualize words
- dimensions are meaningful → explainable AI (each dimension corresponds to an actual document or word)
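For example, ranking documents against a query reduces to cosine similarity between the query's count vector and each row of X. The sketch below is a minimal illustration assuming scikit-learn; the corpus and query are made up.

```python
# Minimal sketch: treat the query as a document and rank documents by cosine
# similarity in the same |V|-dimensional count space. Assumes scikit-learn;
# the corpus and query are toy examples.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)    # |D| x |V| count matrix

query = "dog on a log"
q = vectorizer.transform([query])     # query embedded as one more "document"

scores = cosine_similarity(q, X).ravel()
for i in np.argsort(-scores):         # most similar documents first
    print(f"{scores[i]:.3f}  {docs[i]}")
```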
Cons
- vectors are sparse, high dimensional → |V| and |D| are both large
- mitigate with dimensionality reduction techniques such as Latent Semantic Analysis (LSA); see the sketch after this list
- Distributional Semantics may not capture all aspects of meaning (e.g., word order and negation are lost in raw counts)
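LSA amounts to a truncated SVD of the count matrix, projecting the sparse vectors into a small dense latent space. The sketch below uses scikit-learn's TruncatedSVD; the corpus and the number of components are chosen only for illustration.

```python
# Minimal sketch of LSA: truncated SVD of the sparse |D| x |V| count matrix,
# projecting documents into a small dense latent space. Assumes scikit-learn;
# the corpus and n_components are assumptions for illustration.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "a log lay on the mat",
]
X = CountVectorizer().fit_transform(docs)   # sparse, high-dimensional counts

lsa = TruncatedSVD(n_components=2)          # keep 2 latent dimensions
X_lsa = lsa.fit_transform(X)                # dense |D| x 2 document vectors

print(X_lsa)                                # low-dimensional document representations
print(lsa.explained_variance_ratio_)        # variance captured per component
```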