Documents are typically unstructured, or structured in complicated ways, because they are written with human readability in mind.

Annotations are mark-up added to the document (after conversion to raw text) to make it machine readable. They can describe presentation, metadata or the meaning of the content.

Types

  • Boundary Notation Annotation
  • Inline markup language elements
  • Stand-off annotations
    • delimiter separated values (DSV)
    • JSON

Boundary Notation Annotation

  • done at level of individual tokens
  • in the text itself
  • for units of interest spanning multiple tokens, as in Named Entity Recognition, we can use BIO
    • BIO: B = begin, I = inside, O = outside
    • create a table of tokens with their annotations
    • every token that is not part of an entity is tagged O
    • the first token of every entity is tagged B (optionally with the entity type, e.g. B-Person); when the entity spans multiple tokens, each following token is tagged I
    • e.g. The British Prime Minister Tony Blair and … →
      • The → O
      • British → B-Person
      • Prime → I-Person
      • Minister → I-Person
      • Tony → B-Person
      • Blair → I-Person
      • and → O
    • cannot handle hierarchical or structured annotations
      • nested entities, entity relations, events, etc.
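
The BIO scheme above can be decoded back into entity spans with a short loop. The following is a minimal Python sketch, not a reference implementation: the function name bio_to_spans is hypothetical, and it assumes tags of the form B-Type, I-Type and O, as in the example above.

```python
from typing import List, Optional, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def close_entity() -> None:
        # Emit the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B tag always starts a new entity, even right after another one.
            close_entity()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I tag continues the entity that is currently open.
            current_tokens.append(token)
        else:
            # An O tag (or a stray I tag) ends any open entity.
            close_entity()
            current_tokens, current_type = [], None
    close_entity()
    return spans

tokens = ["The", "British", "Prime", "Minister", "Tony", "Blair", "and"]
tags = ["O", "B-Person", "I-Person", "I-Person", "B-Person", "I-Person", "O"]
print(bio_to_spans(tokens, tags))
# → [('British Prime Minister', 'Person'), ('Tony Blair', 'Person')]
```

The B tag on "Tony" is what separates the two adjacent entities. Because each token carries exactly one tag, there is no place in this table to record an entity nested inside another one.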

An event is a type of entity relation in which multiple entities participate, e.g. the attendees of a meeting, or the attacker and defender in a war.

Inline markup language elements

  • addition of markup tags within text (in-text)
  • HTML, XML
  • can deal with hierarchical & structured annotations - nested entities, entity relations and events
    • relations can be done using some attributes in the tags and linking via IDs
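  • cannot directly encode overlapping/intersecting annotations, because tags must nest properly
    • e.g. in "Iraqi city of Basra", the spans "Iraqi city" and "city of Basra" share the token "city", so their tags would cross
  • parsing is required to convert to and from the plain text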
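
As a rough illustration of the points above, the sketch below parses a small inline-annotated sentence with Python's standard xml.etree.ElementTree module. The tag name ne and the attributes id, type and same_as are assumptions made for this example, not part of any particular annotation standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline markup: "ne" elements wrap entities inside the text,
# and the illustrative "same_as" attribute links one entity to another by ID.
marked_up = (
    '<s>The <ne id="e1" type="Person" same_as="e2">British Prime Minister</ne> '
    '<ne id="e2" type="Person">Tony Blair</ne> and</s>'
)

root = ET.fromstring(marked_up)

# Parsing is needed to get back to the plain text ...
plain_text = "".join(root.itertext())

# ... and to collect the annotations themselves.
entities = [(ne.get("id"), ne.get("type"), "".join(ne.itertext()))
            for ne in root.iter("ne")]
relations = [(ne.get("id"), ne.get("same_as"))
             for ne in root.iter("ne") if ne.get("same_as")]

# Overlapping spans would need crossing tags, which is not well-formed XML.
crossing = "<s><a>Iraqi <b>city</a> of Basra</b></s>"
try:
    ET.fromstring(crossing)
    crossing_ok = True
except ET.ParseError:
    crossing_ok = False

print(plain_text)   # → The British Prime Minister Tony Blair and
print(entities)     # → [('e1', 'Person', 'British Prime Minister'), ('e2', 'Person', 'Tony Blair')]
print(relations)    # → [('e1', 'e2')]
print(crossing_ok)  # → False
```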

Stand-off annotations

  • stored separately from text
  • links annotations to text using indexing based on character offsets
  • can be stored as delimiter-separated values (DSV), such as CSV, or as JSON
    • e.g. (DSV) Tony Blair → Person 144 154 (entity type, begin offset, end offset)
    • e.g. (JSON) { "id": "199", "ne_type": "Person", "begin": "144", "end": "154", "surface_form": "Tony Blair" }
  • can handle hierarchical, structured and overlapping annotations
  • not readily human readable
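
To make the offset-based linking concrete, here is a minimal Python sketch that resolves stand-off records against a text. It reuses the "Iraqi city of Basra" example to show two overlapping annotations. The assumptions are labelled in the code: offsets are taken to be 0-based with an exclusive end, they are stored as integers rather than strings, and the Location label is purely illustrative.

```python
import json

text = "Iraqi city of Basra"

# Hypothetical stand-off records, kept separately from the text.
# Assumption: offsets are 0-based character positions with an exclusive end.
# The "Location" label is illustrative only.
annotations_json = """
[
  {"id": "1", "ne_type": "Location", "begin": 0, "end": 10},
  {"id": "2", "ne_type": "Location", "begin": 6, "end": 19}
]
"""

def resolve(text, annotations):
    """Look up the surface form of each annotation via its character offsets."""
    return [(a["id"], a["ne_type"], text[a["begin"]:a["end"]])
            for a in annotations]

def overlaps(a, b):
    """True if the two annotated spans share at least one character."""
    return a["begin"] < b["end"] and b["begin"] < a["end"]

annotations = json.loads(annotations_json)
resolved = resolve(text, annotations)

print(resolved)
# → [('1', 'Location', 'Iraqi city'), ('2', 'Location', 'city of Basra')]
print(overlaps(annotations[0], annotations[1]))
# → True
```

Because the text itself is never modified, the two spans can overlap on "city" without conflict. The cost is that the records mean nothing without the text and the slicing step.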