Identify and assign part-of-speech (POS) tags to tokens, i.e. classify words according to their meaning and grammatical role, e.g. nouns, pronouns, verbs, adjectives, adverbs, determiners, conjunctions, prepositions, etc.

Tokenization is usually performed before or alongside this step. Each token receives exactly one POS tag per run.

Tagsets

Challenges

Syntactic ambiguity

  • accidental homophones and homonyms
    • e.g. duck (noun: the bird; verb: to lower the head quickly)

Homophones and homonyms are words that are pronounced or spelled the same but have different meanings.

  • different syntactic roles
    • e.g. to walk (verb) vs to go for a walk (noun)
    • e.g. old (adjective) people vs the old (noun)
    • To disambiguate:
      • ambiguity usually disappears when token is not in isolation
        • e.g. The garbage can (auxiliary verb) smell vs The garbage can (noun) smells
        • context might not be enough - e.g. They can fish (can - auxiliary verb or main verb); canning (verb or noun)
      • come up with rules
        • e.g. A token is very unlikely to be a verb if its preceding word is a determiner
          • I want a go
        • e.g. A token is unlikely to be a noun if the immediately preceding word is to
          • I want to go
        • e.g. A token is more likely to be a possessive pronoun when followed by a common noun
          • He stroked her (possessive pronoun) cat
          • He gave her (personal pronoun) money
            • The rule does not work here
            • look at sentence structure in such cases (direct and indirect objects)
  • what a prepositional phrase is attached to, or how the sentence is structured
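
The first two contextual rules above (determiner before the token, "to" before the token) can be sketched as a tiny rule-based disambiguator. This is only a minimal illustration, not a real tagger: the candidate lexicon, the tag names (Universal Dependencies-style labels) and the fallback to NOUN for unknown words are all assumptions made for the example.

```python
# Hypothetical candidate-tag lexicon; a real tagger would use a full dictionary.
# The first candidate is treated as the default when no rule applies.
CANDIDATES = {
    "i": ["PRON"],
    "want": ["VERB"],
    "a": ["DET"],
    "the": ["DET"],
    "to": ["PART"],
    "go": ["VERB", "NOUN"],
    "walk": ["VERB", "NOUN"],
}
DETERMINERS = {"a", "an", "the"}


def tag(tokens):
    """Assign one tag per token, using the preceding word to prune candidates."""
    tagged = []
    for i, token in enumerate(tokens):
        options = list(CANDIDATES.get(token.lower(), ["NOUN"]))  # assumed fallback
        prev = tokens[i - 1].lower() if i > 0 else None
        if len(options) > 1:
            if prev in DETERMINERS and "VERB" in options:
                # Rule: a verb is very unlikely right after a determiner.
                options.remove("VERB")
            elif prev == "to" and "NOUN" in options:
                # Rule: a noun is unlikely right after "to".
                options.remove("NOUN")
        tagged.append((token, options[0]))
    return tagged


print(tag(["I", "want", "a", "go"]))
print(tag(["I", "want", "to", "go"]))
```

Such rules are heuristics: as the "He gave her money" case shows, some sentences need structural information that a look at the neighbouring word cannot provide.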

Approaches

  • Rule-based
  • Statistical
    • model trained on a POS corpus
      • each token in the corpus is labelled with the correct POS tag
      • formats
        • one token-POS tag pair per line
        • whole sentences, with the POS tag appended to each token
      • Corpora
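
To make the two corpus formats and the statistical approach above concrete, here is a minimal sketch that reads both layouts and trains a most-frequent-tag baseline. The details are assumptions chosen for illustration: a tab between token and tag and a blank line between sentences in the one-pair-per-line format, a `/` separator in the inline format, Universal Dependencies-style tag names, and a three-sentence toy corpus. Real corpora differ in these conventions, and real statistical taggers use context rather than word counts alone.

```python
from collections import Counter, defaultdict


def parse_token_per_line(text):
    """One 'token<TAB>tag' pair per line; a blank line ends a sentence (assumed)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos = line.split("\t")
        current.append((token, pos))
    if current:
        sentences.append(current)
    return sentences


def parse_inline(text, sep="/"):
    """One sentence per line, each token written as 'token/tag' (assumed separator)."""
    sentences = []
    for line in text.splitlines():
        pairs = [tuple(item.rsplit(sep, 1)) for item in line.split()]
        if pairs:
            sentences.append(pairs)
    return sentences


def train_most_frequent_tag(sentences):
    """Baseline model: map each lowercased word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token, pos in sentence:
            counts[token.lower()][pos] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Toy corpus (made up for this example), written in both formats.
INLINE = "\n".join([
    "The/DET can/NOUN smells/VERB",
    "They/PRON can/AUX fish/VERB",
    "I/PRON can/AUX go/VERB",
])
VERTICAL = "\n".join([
    "The\tDET", "can\tNOUN", "smells\tVERB", "",
    "They\tPRON", "can\tAUX", "fish\tVERB", "",
    "I\tPRON", "can\tAUX", "go\tVERB",
])

model = train_most_frequent_tag(parse_inline(INLINE))
print(model["can"])
```

Both parsers yield the same list of sentences, each a list of (token, tag) pairs. The baseline ignores context entirely, so it tags every "can" as AUX, including the noun in "The can smells", which is exactly the kind of ambiguity described under Challenges.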