a.k.a. Bag-of-words
If word u appears in document d, d is a context of u
Process
- Acquire large volume of documents
- count the number of times a word u appears in a document d
- meaning of a word u is the (row-wise) count vector of documents that the word u appears in
- meaning(u) = [count(u,d1), count(u,d2), … ]
- vector dimension = |D| → D is the set of documents
- meaning of a document d is the (column-wise) count vector of words in the document
- meaning(d) = [count(u1,d), count(u2,d), … ]
- vector dimension = |V| → V is the vocabulary
We get a matrix X of size |D| × |V| (or its transpose, |V| × |D|)
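As a concrete illustration, the sketch below builds such a count matrix with scikit-learn's CountVectorizer; the library choice and the toy documents are assumptions for illustration, not part of the notes.

```python
# Minimal sketch of building the count matrix X, assuming scikit-learn is
# available; the toy documents are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse |D| x |V| matrix of counts
vocab = vectorizer.get_feature_names_out()  # the vocabulary V

# meaning(d): one row of X, a |V|-dimensional count vector for document d
meaning_d0 = X[0].toarray().ravel()

# meaning(u): one column of X, a |D|-dimensional count vector for word u
u = list(vocab).index("sat")
meaning_sat = X[:, u].toarray().ravel()

print(vocab)         # words indexing the columns
print(meaning_d0)    # how often each word occurs in document 0
print(meaning_sat)   # how often "sat" occurs in each document
```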
Pros
- find similar documents
- find documents close to a query (by considering the query as a document); see the sketch after this list
- compare and visualize words
- dimensions are meaningful → explainable AI (each dimension corresponds to an actual document or word)
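For example, ranking documents against a query reduces to cosine similarity between the query's count vector and each row of X. The sketch below is a minimal illustration assuming scikit-learn; the corpus and query are made up.

```python
# Minimal sketch: treat the query as a document and rank documents by cosine
# similarity in the same |V|-dimensional count space. Assumes scikit-learn;
# the corpus and query are toy examples.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)    # |D| x |V| count matrix

query = "dog on a log"
q = vectorizer.transform([query])     # query embedded as one more "document"

scores = cosine_similarity(q, X).ravel()
for i in np.argsort(-scores):         # most similar documents first
    print(f"{scores[i]:.3f}  {docs[i]}")
```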
Cons
- vectors are sparse, high dimensional → |V| and |D| are both large
- mitigate with dimensionality reduction techniques such as Latent Semantic Analysis (LSA); see the sketch after this list
- Distributional Semantics may not capture all aspects of meaning (e.g., word order and negation are lost in raw counts)
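LSA amounts to a truncated SVD of the count matrix, projecting the sparse vectors into a small dense latent space. The sketch below uses scikit-learn's TruncatedSVD; the corpus and the number of components are chosen only for illustration.

```python
# Minimal sketch of LSA: truncated SVD of the sparse |D| x |V| count matrix,
# projecting documents into a small dense latent space. Assumes scikit-learn;
# the corpus and n_components are assumptions for illustration.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "a log lay on the mat",
]
X = CountVectorizer().fit_transform(docs)   # sparse, high-dimensional counts

lsa = TruncatedSVD(n_components=2)          # keep 2 latent dimensions
X_lsa = lsa.fit_transform(X)                # dense |D| x 2 document vectors

print(X_lsa)                                # low-dimensional document representations
print(lsa.explained_variance_ratio_)        # variance captured per component
```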