Identify and assign part-of-speech (POS) tags to tokens: classify words according to their meaning and grammatical role, e.g. nouns, pronouns, verbs, adjectives, adverbs, determiners, conjunctions, prepositions.
Tokenization is usually performed before or alongside this step. Each token receives exactly one POS tag per run.
Tagsets
Challenges
Syntactic ambiguity
- accidental homophones and homonyms
- e.g. duck (the bird vs to duck one's head)
- homophones and homonyms are words that are pronounced or spelled the same but have different meanings
- different syntactic roles
- e.g. to walk (verb) vs to go for a walk (noun)
- e.g. old (adjective) people vs the old (noun)
- To disambiguate:
- ambiguity usually disappears when token is not in isolation
- e.g. The garbage can (auxiliary verb) smell vs The garbage can (noun) smells
- context might not be enough, e.g. They can fish (can: auxiliary verb or main verb) and canning (verb or noun)
- come up with rules
- e.g. A token is very unlikely to be a verb if its preceding word is a determiner
- I want a go
- e.g. A token is unlikely to be a noun if the immediately preceding word is to
- I want to go
- A token is more likely to be a possessive pronoun when followed by a common noun
- He stroked her (possessive pronoun) cat
- He gave her money (personal pronoun)
- The rule does not work here
- look at sentence structure in such cases (direct and indirect objects)
- consider what the prepositional phrase attaches to, or the overall sentence structure
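The two context rules above (no verb right after a determiner, no noun right after "to") can be sketched as a tiny rule-based disambiguator. This is only an illustration: the lexicon, the tag names (DET, NOUN, VERB, PRON, TO) and the function names are assumptions made for the example, not a real tagset or library.

```python
# Toy lexicon: the tags each word can take in isolation.
# Entries and tag names are illustrative assumptions, not a real tagset.
LEXICON = {
    "i": {"PRON"},
    "want": {"VERB"},
    "a": {"DET"},
    "the": {"DET"},
    "to": {"TO"},
    "go": {"NOUN", "VERB"},
    "walk": {"NOUN", "VERB"},
}


def discard_if_safe(candidates, unwanted):
    """Drop a candidate tag, but never leave the token without any tag."""
    if unwanted in candidates and len(candidates) > 1:
        candidates.discard(unwanted)


def tag_sentence(sentence):
    """Assign exactly one tag per token using the two context rules."""
    tokens = sentence.lower().split()
    tagged = []
    for i, token in enumerate(tokens):
        candidates = set(LEXICON.get(token, {"UNK"}))
        if i > 0:
            prev_token, prev_tag = tagged[i - 1]
            # Rule 1: a token is very unlikely to be a verb after a determiner.
            if prev_tag == "DET":
                discard_if_safe(candidates, "VERB")
            # Rule 2: a token is unlikely to be a noun right after "to".
            if prev_token == "to":
                discard_if_safe(candidates, "NOUN")
        # If rules leave several candidates, pick one deterministically.
        tagged.append((token, sorted(candidates)[0]))
    return tagged


print(tag_sentence("I want a go"))   # "go" resolved as NOUN
print(tag_sentence("I want to go"))  # "go" resolved as VERB
```

Hand-written rules like these only cover the cases they were written for; the "her cat" vs "her money" example shows how quickly exceptions appear.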
Approaches
- Rule based
- Statistical
- model trained on a POS corpus
- in the corpus, each token is labelled with its correct POS tag
- formats
- one token-POS tag pair per line
- one sentence per line, with the POS tag appended to each token
- Corpora
- e.g. Penn Treebank corpus
- e.g. GENIA corpus
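A minimal sketch of reading the two corpus formats and training the simplest possible statistical tagger (most frequent tag per word). The whitespace separator, the "/" separator, the function names and the three toy sentences are assumptions for illustration; real corpora such as the Penn Treebank have their own file layouts. The tags used (DT, NN, MD, VB, VBZ, PRP) follow Penn Treebank conventions.

```python
from collections import Counter, defaultdict


def parse_token_per_line(text):
    """Format 1: one 'token tag' pair per line; a blank line ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos = line.split()
        current.append((token, pos))
    if current:
        sentences.append(current)
    return sentences


def parse_inline(text, sep="/"):
    """Format 2: one sentence per line, each token written as token<sep>tag."""
    sentences = []
    for line in text.splitlines():
        if line.strip():
            # rsplit keeps tokens that themselves contain the separator intact.
            sentences.append([tuple(item.rsplit(sep, 1)) for item in line.split()])
    return sentences


def train_most_frequent_tag(sentences):
    """Baseline model: remember the most frequent tag seen for each word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token, pos in sentence:
            counts[token.lower()][pos] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


# Toy training data (made up for this example).
inline_corpus = (
    "The/DT garbage/NN can/NN smells/VBZ\n"
    "They/PRP can/MD fish/VB\n"
    "We/PRP can/MD go/VB"
)
model = train_most_frequent_tag(parse_inline(inline_corpus))
print(model["can"])  # MD, seen twice as MD and once as NN
```

This baseline ignores context entirely, so it would tag "can" as MD even in "The garbage can smells"; models that also look at neighbouring tokens address exactly the ambiguity described under Challenges.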