This book describes AQL, a language for building annotators that extract structured information from unstructured or semistructured text. AQL is the primary method of creating new annotators in System Text for Information Extraction. This reference manual covers the AQL language as of release 0.2 of System Text.
The syntax of AQL is similar to that of SQL, but with several important differences:
Regular expressions are written as /regex/ instead of 'regex'.
AQL adds an extract construct, which is not present in SQL.
AQL's data model is similar to the standard relational model used by SQL databases like DB2. All data in AQL is stored in tuples, data records of one or more columns, or fields. A collection of tuples forms a relation. All tuples in a relation must have the same schema — the names and types of their fields.
The fields of an AQL tuple must belong to one of the language's built-in scalar types:
Integer: A 32-bit signed integer.
Text: A Unicode string, with additional metadata to indicate which tuple the string belongs to.
Span: A contiguous region of characters in a Text object.
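The data model above can be sketched in a few lines of Python. This is only an illustrative model, not part of System Text: the class names `Text`, `Span`, and `Relation`, the `covered` helper, and the half-open `[begin, end)` offset convention are all assumptions made for the example, and the tuple-ownership metadata carried by Text is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """A Unicode string (tuple-ownership metadata is omitted in this sketch)."""
    value: str


@dataclass(frozen=True)
class Span:
    """A contiguous region of characters in a Text object.

    Assumption: offsets are half-open, [begin, end).
    """
    text: Text
    begin: int
    end: int

    def covered(self) -> str:
        # Hypothetical helper: the characters the span covers.
        return self.text.value[self.begin:self.end]


class Relation:
    """A collection of tuples that all share one schema."""

    def __init__(self, schema: dict):
        self.schema = schema  # field name -> field type (int, Text, or Span)
        self.tuples = []

    def add(self, **fields) -> None:
        # Every tuple must have exactly the field names the schema declares.
        if fields.keys() != self.schema.keys():
            raise ValueError("field names do not match the schema")
        for name, value in fields.items():
            expected = self.schema[name]
            if not isinstance(value, expected):
                raise TypeError(f"field {name!r} must be of type {expected.__name__}")
            # Integer is a 32-bit signed integer, so reject out-of-range values.
            if expected is int and not -2**31 <= value < 2**31:
                raise ValueError(f"field {name!r} does not fit in 32 bits")
        self.tuples.append(fields)


doc_text = Text("Call 555-1234 now")
phone = Relation({"number": Span, "length": int})
phone.add(number=Span(doc_text, 5, 13), length=8)
```

Here `phone` holds one tuple whose `number` field is a Span over the characters `555-1234` of `doc_text`. Adding a tuple with a missing field or a field of the wrong type raises an error, which mirrors the rule that all tuples in a relation share the same schema.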
An AQL annotator consists of a collection of views, each of which defines a relation. Some of these views are designated as "output views", while others are "non-output views". In addition, there is a special view called Document that represents the document being annotated.
Figure 1.1, “Compiling Executing AQL with System Text” shows how System Text for Information Extraction compiles and executes annotators written in AQL. First, the AQL is fed into the System Text Optimizer, which compiles an execution plan for the views in the annotator. This execution plan is then fed into the Runtime component of the system. The System Text Runtime has a document-at-a-time execution model. The Runtime receives a stream of documents, annotating each in turn and producing the relevant annotations as output tuples. For each document, the Runtime populates the Document relation with a single tuple representing the fields of the document. For example, the tuple that represents a web page might have two fields, URL and text. Then the Runtime evaluates all views that are necessary to produce the output views. The contents of the output views become the outputs of the annotator for the current document.
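The document-at-a-time model described above can be illustrated with a toy Python sketch. It does not reproduce the actual System Text Optimizer or Runtime: the `View` class, `needed_views`, `run`, and the sample views are hypothetical names, views are modeled as plain functions over lists of dictionaries, and the "execution plan" is simply a dependency-ordered list of the views that the output views require. View dependencies are assumed to be acyclic.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List

Relation = List[dict]  # a relation modeled as a list of field-name -> value dicts


class View:
    def __init__(self, name: str, inputs: List[str],
                 body: Callable[..., Relation], output: bool = False):
        self.name = name
        self.inputs = inputs  # names of the views this view reads
        self.body = body      # computes this view's relation from its inputs
        self.output = output  # True for output views


def needed_views(views: Dict[str, View]) -> List[str]:
    """Return, in dependency order, the views required by the output views."""
    order: List[str] = []

    def visit(name: str) -> None:
        if name == "Document" or name in order:
            return
        for dep in views[name].inputs:
            visit(dep)
        order.append(name)

    for view in views.values():
        if view.output:
            visit(view.name)
    return order


def run(views: Dict[str, View],
        documents: Iterable[dict]) -> Iterator[Dict[str, Relation]]:
    plan = needed_views(views)           # stand-in for the compiled execution plan
    for doc in documents:                # one document at a time
        relations = {"Document": [doc]}  # Document holds a single tuple
        for name in plan:
            view = views[name]
            relations[name] = view.body(*(relations[i] for i in view.inputs))
        # Only the output views become the annotator's outputs for this document.
        yield {name: relations[name] for name in plan if views[name].output}


def find_numbers(document: Relation) -> Relation:
    # One tuple per run of digits in the document text.
    return [{"begin": m.start(), "end": m.end()}
            for d in document for m in re.finditer(r"[0-9]+", d["text"])]


views = {
    "Number": View("Number", ["Document"], find_numbers),
    "NumberCount": View("NumberCount", ["Number"],
                        lambda numbers: [{"count": len(numbers)}], output=True),
    "Unused": View("Unused", ["Document"], lambda document: []),
}
docs = [{"URL": "http://example.com/a", "text": "room 12, floor 3"}]
results = list(run(views, docs))
```

In this sketch `Number` is a non-output view that is still evaluated because the output view `NumberCount` depends on it, while `Unused` is never evaluated because no output view needs it.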