A Slightly More Complex Annotator

The example AQL rules we used in the previous section produce only a single output type, PhoneNum, with a single column, number. In this section, we'll augment that AQL with some additional rules that add a second, two-column output type. Go back to the main screen of the Development Environment, and replace the existing AQL with the following set of statements:

create view PhoneNum as
extract 
    regex /[0-9]{3}-[0-9]{4}/
        on D.text as number
from Document D;

output view PhoneNum;

create view AreaCode as
extract 
    regex /[0-9]{3}/
        on 1 token in D.text as code
from Document D;

create view FullPhoneNum as
select A.code as areacode, P.number as number
from PhoneNum P, AreaCode A
where Follows(A.code, P.number, 0, 1);
    
output view FullPhoneNum;

The first of these statements creates the same PhoneNum view that we used before. The second create view statement uses a similar extract statement to find sequences of three digits and puts the results into a view called AreaCode.

The AreaCode view specifies that its regular expression must match

on 1 token

of the document text. If the document text were 1234 567-890, this rule would match the strings "567" and "890", but not "123" or "234", because those matches would begin or end in the middle of the token 1234. The current version of AQL uses a simple whitespace-based tokenization to determine token boundaries. A token is defined as a sequence of word characters or a single punctuation character. For example, for the string 1234 567-890 AQL would identify token boundaries at the following locations:

    [1234] [567] [-] [890]
    

We'll cover the topic of token boundaries in more detail in Chapter 2, Extracting Basic Features with AQL.
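To make the token-boundary rule concrete, here is a minimal Python sketch that emulates it. This is not how System Text is implemented; the tokenizer pattern and the helper names tokenize and regex_on_one_token are assumptions made for illustration, based only on the token definition given above.

```python
import re

# Assumed tokenizer: a run of word characters, or one punctuation character.
# Whitespace is skipped and never becomes a token.
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return the (begin, end) character offsets of each token in text."""
    return [m.span() for m in TOKEN.finditer(text)]

def regex_on_one_token(pattern, text):
    """Return the tokens that the pattern matches in full.

    This emulates "on 1 token": a match has to start and end
    exactly on the boundaries of a single token.
    """
    return [text[b:e] for b, e in tokenize(text)
            if re.fullmatch(pattern, text[b:e])]

text = "1234 567-890"
print([text[b:e] for b, e in tokenize(text)])  # ['1234', '567', '-', '890']
print(regex_on_one_token(r"[0-9]{3}", text))   # ['567', '890']
```

The token 1234 is rejected because three digits cannot cover all four of its characters, which is why neither "123" nor "234" shows up in the result.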

The third view, FullPhoneNum, uses a select statement to combine phone numbers with their respective area codes, creating a composite annotation. We'll cover the select statement in detail in Chapter 3, Extracting Composite Entities with AQL. For now, here's an English translation of the FullPhoneNum view:

Within each document, find all pairs of PhoneNum and AreaCode tuples such that the code field of AreaCode is followed by the number field of PhoneNum within 0 to 1 characters. For each matching pair of tuples, create a new output tuple with two fields, areacode and number.
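The pairing logic can also be sketched in a few lines of Python. The helpers spans, follows, and full_phone_nums are hypothetical names chosen for this illustration. The sketch only emulates the idea of joining two sets of spans on a character-distance test, and it approximates the "on 1 token" restriction with lookarounds; it does not reflect how System Text actually evaluates the view.

```python
import re

def spans(pattern, text):
    """Return the (begin, end) offsets of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

def follows(first, second, min_chars, max_chars):
    """True if `second` begins min_chars..max_chars characters after `first` ends."""
    gap = second[0] - first[1]
    return min_chars <= gap <= max_chars

def full_phone_nums(text):
    """Pair each area-code span with a number span that starts right after it."""
    numbers = spans(r"[0-9]{3}-[0-9]{4}", text)
    # Three digits not embedded in a longer word: a rough stand-in for "on 1 token".
    codes = spans(r"(?<!\w)[0-9]{3}(?!\w)", text)
    return [(text[c[0]:c[1]], text[n[0]:n[1]])
            for n in numbers for c in codes
            if follows(c, n, 0, 1)]

print(full_phone_nums("Call 512-691-6127 today"))  # [('512', '691-6127')]
```

In this example the span for 512 ends one character before the span for 691-6127 begins, so the pair passes the distance test. The span for 691 overlaps the number itself, so its gap is negative and it is filtered out.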

The final line of AQL,

output view FullPhoneNum;

tells System Text to add the view FullPhoneNum to the outputs of the rule set. So this rule set has two outputs, PhoneNum and FullPhoneNum. We've already seen what PhoneNum looks like, so let's tell the Development Environment to show us the FullPhoneNum view. Click on the "Refresh Views" button at the bottom of the window. Then you should be able to select "FullPhoneNum" in the drop-down menu labeled "Output View":

Next, click on the "Execute" button. You should see a screen that looks like this:

This output screen looks similar to the one that PhoneNum produced, except that the tuples of the FullPhoneNum view have two columns, whereas the PhoneNum view defines only one. The tuple list shows both fields:

For document 10539, the FullPhoneNum view produces a single output tuple. This tuple annotates the area code and number fields of the phone number 512-691-6127.

Note that the "Text Snippets" display only shows the number column of the output tuples. By convention, this display always shows the rightmost column of type Span.

Summary

Congratulations! You've installed the System Text Development Environment, run some example AQL annotators, and tried these annotators out on the Enron email collection. In the chapters that follow, we'll show you how to build your own AQL annotators, harnessing the full power of the language.