The example AQL rules we used in the previous section produce only a
single output type, PhoneNum, with a single column, number. In this
section, we'll augment the AQL we just used with some additional rules
that add a second, two-column output type. Go back to the main screen of
the Development Environment, and replace the existing AQL with the
following set of statements:
create view PhoneNum as
extract regex /[0-9]{3}-[0-9]{4}/
    on D.text as number
from Document D;

output view PhoneNum;

create view AreaCode as
extract regex /[0-9]{3}/
    on 1 token in D.text as code
from Document D;

create view FullPhoneNum as
select A.code as areacode, P.number as number
from PhoneNum P, AreaCode A
where Follows(P.number, A.code, 0, 1);

output view FullPhoneNum;
The first of these statements creates the same PhoneNum view that we
used before. The second create view statement uses a similar extract
statement to find sequences of three digits and puts the results into a
view called AreaCode.
The AreaCode view specifies that its regular expression must match
on 1 token of the document text. If the document text were 1234 567-890,
this rule would match the strings "567" and "890", but not "123" or "234",
because 1234 is a single four-digit token.
The current version of AQL uses a simple tokenization, based on whitespace
and punctuation, to determine token boundaries. A token is defined as a
sequence of word characters or a single punctuation character. For example,
for the string 1234 567-890, AQL would identify token boundaries at the
following locations:
[1234] [567] [-] [890]
We'll cover the topic of token boundaries in more detail in Chapter 2, Extracting Basic Features with AQL.
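To make the token rule concrete, here is a rough Python sketch of the behavior described above. It is an illustration only, not System Text code: the names tokenize and match_on_one_token are hypothetical, and the tokenizer is an approximation built from the definition given in this section (a run of word characters or a single punctuation character). The real tokenizer may differ in details.

```python
import re

# Approximation of the token definition in this section: a token is a run of
# word characters or a single punctuation character. (Assumption: the real
# System Text tokenizer may treat some characters differently.)
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return (begin, end, token_text) for each token in text."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN.finditer(text)]

def match_on_one_token(pattern, text):
    """Return the tokens that the pattern matches in full.

    This stands in for the 'on 1 token' clause: a match must begin and end
    on the boundaries of a single token.
    """
    regex = re.compile(pattern)
    return [tok for _, _, tok in tokenize(text) if regex.fullmatch(tok)]

text = "1234 567-890"
tokens = [tok for _, _, tok in tokenize(text)]
matches = match_on_one_token(r"[0-9]{3}", text)

print(tokens)   # ['1234', '567', '-', '890']
print(matches)  # ['567', '890']
```

The four-digit token 1234 never matches, because no three-digit substring of it starts and ends on a token boundary.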
The third view, FullPhoneNum, uses a select statement to combine phone
numbers with their respective area codes, creating a composite annotation.
We'll cover the select statement in detail in Chapter 3, Extracting
Composite Entities with AQL. For now, here's an English translation of the
FullPhoneNum view:
Within each document, find all pairs of PhoneNum and AreaCode tuples, such that the number field of PhoneNum is followed by the code field of AreaCode within 0 to 1 characters. For each matching pair of tuples, create a new output tuple with two fields, areacode and number.
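The distance test at the heart of this join can be sketched in Python over character offsets. This is an illustration, not System Text code: Span, follows, and the hard-coded offsets are hypothetical. The helper takes the earlier span as its first argument, and that choice is an assumption of the sketch rather than a statement about the argument order of AQL's Follows predicate. The sample text is the phone number 512-691-6127 that appears later in this section.

```python
from typing import NamedTuple

class Span(NamedTuple):
    begin: int  # offset of the first character
    end: int    # offset one past the last character

def follows(first, second, min_chars, max_chars):
    """True if `second` begins min_chars..max_chars characters after `first` ends."""
    gap = second.begin - first.end
    return min_chars <= gap <= max_chars

text = "512-691-6127"

# Hypothetical contents of the two views for this text.
area_codes = [Span(0, 3), Span(4, 7)]   # "512" and "691"
phone_nums = [Span(4, 12)]              # "691-6127"

# Nested-loop join: keep each pair that passes the distance test, and build
# a two-field output tuple (areacode, number) for it.
pairs = [
    (text[c.begin:c.end], text[n.begin:n.end])
    for n in phone_nums
    for c in area_codes
    if follows(c, n, 0, 1)
]

print(pairs)  # [('512', '691-6127')]
```

Only one pair survives: the span for 512 ends one character (the hyphen) before 691-6127 begins, while the span for 691 overlaps the number and so fails the test.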
The final line of AQL,
output view FullPhoneNum;
tells System Text to add the view FullPhoneNum to the outputs of the rule
set. So this rule set has two outputs, PhoneNum and FullPhoneNum.
We've already seen what PhoneNum looks like, so let's tell the
Development Environment to show us the FullPhoneNum view.
Click on the "Refresh Views" button at the bottom of the window.
Then you should be able to select "FullPhoneNum" in the drop-down menu
labeled "Output View":
Next, click on the "Execute" button. You should see a screen that looks like this:
This output screen looks similar to the one that PhoneNum produced,
except that the tuples of the FullPhoneNum view have two columns, whereas
the PhoneNum view defines only one column. The tuple list shows both fields:
For document 10539, the FullPhoneNum view produces a single output tuple.
This tuple annotates the area code and number fields of the phone number
512-691-6127.
Note that the "Text Snippets" display only shows the number
column of the output tuples. By convention, this display always shows the
rightmost column of type Span.
Congratulations! You've installed the System Text Development Environment, run some example AQL annotators, and tried these annotators out on the Enron email collection. In the chapters that follow, we'll show you how to build your own AQL annotators, harnessing the full power of the language.