Chapter 1. Introduction

This book describes AQL, a language for building annotators that extract structured information from unstructured or semistructured text. AQL is the primary method of creating new annotators in System Text for Information Extraction. This reference manual covers the AQL language as of release 0.2 of System Text.

The syntax of AQL is similar to that of SQL, but with several important differences:

  • AQL is case sensitive.
  • AQL allows regular expressions to be written in Perl syntax, e.g. /regex/ instead of 'regex'.
  • AQL currently does not support advanced SQL features like correlated subqueries and recursive queries.
  • AQL has a new statement type, extract, which is not present in SQL.

Data Model

AQL's data model is similar to the standard relational model used by SQL databases such as DB2. All data in AQL is stored in tuples: data records made up of one or more columns, also called fields. A collection of tuples forms a relation. All tuples in a relation must have the same schema, that is, the same field names and field types.

The fields of an AQL tuple must belong to one of the language's built-in scalar types:

  • Integer: A 32-bit signed integer.

  • Text: A Unicode string, with additional metadata to indicate which tuple the string belongs to.

  • Span: A contiguous region of characters in a Text object.
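
The data model above can be illustrated outside of AQL with a small Python sketch. This is not System Text code: the class names, the half-open [begin, end) offset convention, and the schema check are assumptions made for illustration, and the metadata that ties a Text object to its tuple is left out.

```python
from dataclasses import dataclass
from typing import Dict, List

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # range of a 32-bit signed integer


@dataclass(frozen=True)
class Text:
    """A Unicode string. The metadata linking it to its tuple is omitted here."""
    value: str


@dataclass(frozen=True)
class Span:
    """A contiguous region of characters in a Text object.

    Assumption: offsets are half-open, [begin, end).
    """
    text: Text
    begin: int
    end: int

    def __post_init__(self) -> None:
        if not 0 <= self.begin <= self.end <= len(self.text.value):
            raise ValueError("span offsets fall outside the text")

    def covered_text(self) -> str:
        """Return the characters that the span covers."""
        return self.text.value[self.begin:self.end]


class Relation:
    """A collection of tuples that all share one schema (field names and types)."""

    def __init__(self, schema: Dict[str, type]) -> None:
        self.schema = dict(schema)
        self.tuples: List[dict] = []

    def insert(self, **fields) -> None:
        """Add one tuple, rejecting it if it does not match the schema."""
        if fields.keys() != self.schema.keys():
            raise TypeError("field names do not match the schema")
        for name, value in fields.items():
            expected = self.schema[name]
            if type(value) is not expected:
                raise TypeError(f"field {name!r} must be of type {expected.__name__}")
            if expected is int and not INT32_MIN <= value <= INT32_MAX:
                raise OverflowError(f"field {name!r} does not fit in 32 bits")
        self.tuples.append(dict(fields))


# Hypothetical example data: one span over a short document text.
document_text = Text("Call 555-1234 now")
phone = Span(document_text, 5, 13)

numbers = Relation({"number": Span, "digits": int})
numbers.insert(number=phone, digits=7)
```

In this sketch a Span stores only offsets plus a reference to its Text, so the covered characters are recovered on demand rather than copied into every tuple.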

Execution Model

An AQL annotator consists of a collection of views, each of which defines a relation. Some of these views are designated as "output views", while others are "non-output views". In addition, there is a special view called Document that represents the document being annotated.

Figure 1.1. Compiling and Executing AQL with System Text

Figure 1.1, “Compiling and Executing AQL with System Text”, shows how System Text for Information Extraction compiles and executes annotators written in AQL. First, the AQL source is passed to the System Text Optimizer, which compiles an execution plan for the views in the annotator. This execution plan is then passed to the Runtime component of the system.

The System Text Runtime has a document-at-a-time execution model: it receives a stream of documents, annotates each in turn, and produces the relevant annotations as output tuples. For each document, the Runtime first populates the Document relation with a single tuple representing the fields of the document. For example, the tuple that represents a web page might have two fields, URL and text. The Runtime then evaluates all views that are necessary to produce the output views. The contents of the output views become the outputs of the annotator for the current document.
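
The document-at-a-time model can be sketched in a few lines of Python. This is an illustration only, not the System Text Optimizer or Runtime: the View class, plan_views, run, and the three example views are hypothetical names, the "plan" is simply a dependency ordering, and view definitions are assumed to be non-recursive.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List

Relation = List[dict]  # a relation is modelled as a list of tuples (dicts)


class View:
    """A named view: the views it reads from, a body, and an output flag."""

    def __init__(self, inputs: List[str], body: Callable[..., Relation],
                 output: bool = False) -> None:
        self.inputs = inputs
        self.body = body
        self.output = output


def plan_views(views: Dict[str, View]) -> List[str]:
    """Stand-in for compilation: list the views the output views need,
    ordered so that every view comes after the views it reads from."""
    order: List[str] = []

    def visit(name: str) -> None:
        if name == "Document" or name in order:
            return
        for dependency in views[name].inputs:
            visit(dependency)
        order.append(name)

    for name, view in views.items():
        if view.output:
            visit(name)
    return order


def run(views: Dict[str, View], documents: Iterable[dict]) -> Iterator[Dict[str, Relation]]:
    """Annotate a stream of documents one at a time."""
    plan = plan_views(views)
    for document in documents:
        # The Document relation holds a single tuple for the current document.
        relations: Dict[str, Relation] = {"Document": [document]}
        for name in plan:
            view = views[name]
            relations[name] = view.body(*(relations[i] for i in view.inputs))
        # Only the output views are returned for this document.
        yield {name: relations[name] for name in plan if views[name].output}


evaluated: List[str] = []  # records which view bodies actually ran


def numbers(document: Relation) -> Relation:
    evaluated.append("Numbers")
    return [{"match": m.group()}
            for row in document for m in re.finditer(r"\d+", row["text"])]


def long_numbers(numbers_rel: Relation) -> Relation:
    evaluated.append("LongNumbers")
    return [row for row in numbers_rel if len(row["match"]) >= 4]


def capitalized(document: Relation) -> Relation:
    evaluated.append("Capitalized")
    return [{"match": m.group()}
            for row in document for m in re.finditer(r"[A-Z]\w*", row["text"])]


VIEWS = {
    "Numbers": View(["Document"], numbers),                       # non-output view
    "LongNumbers": View(["Numbers"], long_numbers, output=True),  # output view
    "Capitalized": View(["Document"], capitalized),               # not needed by any output
}

pages = [
    {"URL": "http://example.com/a", "text": "Room 12, extension 5550"},
    {"URL": "http://example.com/b", "text": "No digits here"},
]
results = list(run(VIEWS, pages))
```

Here Numbers is evaluated for every document because the output view LongNumbers reads from it, while Capitalized is never evaluated because no output view depends on it.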