The extract statement provides a variety of functionality for extracting basic features directly from text. The basic form of this statement is:
extract <colname 1> as <alias 1>, ... , <colname n> as <alias n>, <extraction specification> from <input relation> <alias> [having <having clause>];
The semantics of an extract statement are as follows:
Evaluate the extraction specification over each tuple of the input relation. For each result that the extraction produces, produce an output tuple containing the extracted values, along with any columns of the original tuple that were specified at the top of the extract statement. Rename the columns of the tuple according to the aliases at the beginning of the extract statement. Finally, apply any predicates in the (optional) having clause to the resulting tuple. If the tuple passes the predicates, add it to the output.
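The evaluation order described above can be modeled outside AQL. The following Python sketch is only an illustration, not how the runtime is implemented: the function name extract, its parameters, and the sample emails are all hypothetical, and it returns plain strings where AQL would return spans.

```python
import re

def extract(relation, passthrough, column, pattern, out_name, having=None):
    """Toy model of: extract <col> as <alias>, regex <pattern> on <column>
    as <out_name> from <relation> [having <predicate>]."""
    compiled = re.compile(pattern)
    output = []
    for row in relation:
        # Step 1: evaluate the extraction specification over the input tuple.
        for match in compiled.finditer(row[column]):
            # Step 2: carry over the requested columns under their aliases.
            result = {alias: row[col] for col, alias in passthrough.items()}
            result[out_name] = match.group(0)
            # Step 3: apply the optional having predicate to the renamed tuple.
            if having is None or having(result):
                output.append(result)
    return output

# Hypothetical input relation with the columns used in the text.
emails = [
    {"sender": "kay@enron.com", "body": "Call 555-123-4567 or 555-765-4321."},
    {"sender": "bob@example.org", "body": "My number is 555-000-1111."},
]

rows = extract(
    emails,
    passthrough={"sender": "emailsender"},
    column="body",
    pattern=r"\d{3}-\d{3}-\d{4}",
    out_name="num",
    # The predicate sees the alias (emailsender), not the original column name.
    having=lambda t: re.fullmatch(r".*@enron\.com", t["emailsender"]) is not None,
)
```

Each regular expression match yields its own output tuple, so one input tuple can contribute several rows or none.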
For example, the following extract statement evaluates a regular expression for U.S. phone numbers over the body of the Email relation, while passing through the sender column and filtering out emails that are not from the enron.com domain:
extract E.sender as emailsender, regex /\d{3}-\d{3}-\d{4}/ on E.body as num from Email E having MatchesRegex(/.*@enron.com/, emailsender);
Note that field names in the having clause refer to the aliases at the beginning of the extract statement. In the above example, the MatchesRegex predicate is applied to values from the sender field of the input relation, but the predicate refers to that field by the alias emailsender.
The input relation for an extract statement can be either a view name, as in the previous example, or a nested extract or select statement, as in the following example:
extract regex /foo/ on E.foobar as foo from ( extract regex /foobar/ on D.text as foobar from Document D ) E;
The extract statement supports a variety of different basic extraction operations. This section describes each of them in detail.
A regular expression extraction specification has the following structure:
regex[es] /<regex 1>/ [and /<regex 2>/ and ... and /<regex n>/] [with flags '<flags string>'] on [<token spec>] <name>.<column> <grouping spec>
The first part of a regular expression extraction specification lists one or more regular expressions. By default, AQL uses Perl syntax for regular expressions: regular expression literals are enclosed in two forward slash characters, and regular expression escape sequences take precedence over other escape characters. AQL also allows regular expressions in SQL string syntax, so a regular expression for U.S. phone numbers could be expressed as either:
/\d{3}-\d{3}-\d{4}/
or:
'\\d{3}-\\d{3}-\\d{4}'
In general, AQL supports the same features as the Java™ 5 regular expression implementation, as described at http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html. The System Text runtime contains several regular expression engine implementations, including Java™'s built-in implementation. During compilation, the Optimizer examines each regular expression and chooses the fastest engine that can execute the expression.
Note that the alternative execution engines may have slightly different semantics for certain corner cases. In particular, AQL does not provide any guarantee about the order in which alternations will be evaluated. For example, if an extract statement runs the regular expression:
/fish|fisherman/
over the text 'fisherman', the statement may match either 'fish' or 'fisherman', depending on which regular expression engine is used internally.
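The two possible outcomes correspond to two common matching policies. The Python sketch below is only an illustration of those policies and says nothing about which engines the runtime actually contains: Python's re module tries alternatives from left to right, and the leftmost-longest result is emulated by hand.

```python
import re

text = "fisherman"

# A backtracking engine tries the alternatives in order and stops at the
# first one that succeeds, so the shorter alternative wins here.
first_alternative = re.search(r"fish|fisherman", text).group(0)

# Emulate a leftmost-longest engine: try every alternative at the same
# starting position and keep the longest successful match.
alternatives = ["fish", "fisherman"]
candidates = [re.match(alt, text) for alt in alternatives]
longest = max((m.group(0) for m in candidates if m), key=len)
```

If the result must not depend on the engine, one option is to order the alternatives so the longer one comes first and to make them unambiguous, for example /fisherman|fish\b/.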
The second part of a regular expression extraction specification is an optional flags string:
[with flags '<flags string>']
This string specifies a combination of flags to control regular expression matching. These flags correspond to a subset of those defined in the Java 5 implementation. To specify multiple flags, separate them with the '|' character; for example, to specify both multiline matching and Unicode case folding, use the flags string 'MULTILINE|UNICODE'.
Note: If the flags string is not provided, AQL will default to using only the 'DOTALL' flag.
The following table lists the regular expression flags supported in AQL, along with their Java equivalents:
Table 3.1. Flags that control regular expression matching
AQL Flag String | Java™ Flag | Meaning |
---|---|---|
CANON_EQ | CANON_EQ | Canonical equivalence: Different Unicode encodings of the same character are considered equivalent. |
CASE_INSENSITIVE | CASE_INSENSITIVE | Perform case-insensitive matching. |
UNICODE | UNICODE_CASE | If case-insensitive matching is specified, use Unicode case folding to determine whether two characters are equivalent in a case-insensitive comparison. Note: The behavior of this flag is not defined when it is used without the CASE_INSENSITIVE flag. |
DOTALL | DOTALL | Make the dot character '.' match all characters, including newlines. |
LITERAL | LITERAL | Treat the expression as a sequence of literal characters, ignoring the normal regular expression escapes. |
MULTILINE | MULTILINE | Make the characters ^ and $ match the beginning and end of any line, as opposed to the beginning and end of the entire input text. |
UNIX_LINES | UNIX_LINES | Treat only the UNIX newline character '\n' as a line break, ignoring the carriage return character '\r'. |
The third part of a regular expression extraction specification specifies whether to match the regular expression only on token boundaries:
... [[between <number> and] <number> token[s] in] ...
This specification is optional; if it is omitted, then AQL will return the longest non-overlapping match at each character position in the input text.
If token constraints are present, the extract statement will return all matches that start and end on a token boundary and are within the specified range of tokens in length. If there are multiple overlapping matches, the extract statement will return all of them.
The current version of AQL uses a simple whitespace-based tokenization to determine token boundaries. A token is defined as a sequence of word characters or a single punctuation character. For example, in the string:
"The fish are pretty," said the boy.
AQL would identify token boundaries at the following locations:
["][The] [fish] [are] [pretty][,]["] [said] [the] [boy][.]
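The token definition above (a run of word characters or a single punctuation character) can be approximated with one regular expression. This Python sketch is an approximation for illustration, not the tokenizer that AQL ships; the name tokenize and the exact character classes are assumptions.

```python
import re

# One token is either a run of word characters or a single character that
# is neither a word character nor whitespace (that is, punctuation).
TOKEN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return (begin, end, text) triples, one per token."""
    return [(m.start(), m.end(), m.group(0)) for m in TOKEN.finditer(text)]

sentence = '"The fish are pretty," said the boy.'
tokens = [tok for _, _, tok in tokenize(sentence)]
```

The begin and end offsets are the token boundaries that a token-constrained match must start and end on.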
The final part of a regular expression extraction specification tells the system how to handle capturing groups in the regular expression. Capturing groups are regions of the regular expression match, identified by parentheses in the original expression. For example, the expression
(fish)(cakes)
has 3 capturing groups:
Group 0 is the entire match, fishcakes.
Group 1 is fish.
Group 2 is cakes.
The format of the grouping specification is as follows:
return group <number> as <name> [and group <number> as <name>]*
To return only group 0 (the entire match), you can use a shorter, alternative format:
as <name>
which is equivalent to
return group 0 as <name>
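Group numbering follows the usual regular expression convention, which Python's re module shares with Java. The sketch below is an illustration with made-up input text; the variable names mirror the column names a grouping specification might assign.

```python
import re

# Group 1 is the area code, group 2 the rest of the number,
# and group 0 is always the entire match.
match = re.search(r"(\d{3})-(\d{3}-\d{4})", "Call 555-123-4567 today")

full_number = match.group(0)
area_code = match.group(1)
rest_of_number = match.group(2)
```

In AQL terms, each returned group becomes one column of the output tuple, so this match would produce a single tuple with three columns.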
create view NotFirstName as extract regex /[A-Z][a-z]*/ with flags 'CANON_EQ' on 1 token in D.text as word from Document D having Not(ContainsDict('first.dict', word));
create view Phone as extract regex /(\d{3})-(\d{3}-\d{4})/ on between 4 and 5 tokens in D.text return group 1 as areaCode and group 2 as restOfNumber and group 0 as fullNumber from Document D;
The following example evaluates multiple regular expressions in a single extract statement:
create view PhoneNum as extract regexes /(\d{3})-(\d{3}-\d{4})/ and /[Xx]\d{3,5}/ on between 1 and 5 tokens in D.text as num from Document D;
You can help to make your annotators run faster and be easier to maintain by following a few guidelines.
having clause of your extract statement.
The extract statement can evaluate an exhaustive dictionary of strings. To find matches of a dictionary, use a dictionary extraction specification, which has the following structure:
dictionar[y|ies] '<dictionary>' [and '<dictionary>' and ... and '<dictionary>'] [with flags '<flags string>']
Each <dictionary> can be either an on-disk external dictionary file or an inline dictionary (see the section called “Inline Dictionaries”).
External dictionary files are carriage-return-delimited text files with one
dictionary entry per line. Lines with the "#" character at position 0 are
treated as comments.
The flags string controls how dictionary matching is performed; currently there are two options: "Exact" provides exact, case-sensitive matching, and "IgnoreCase" provides case-insensitive matching. If no flags string is provided, "IgnoreCase" is the default.
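The file format and the two matching modes are simple enough to model directly. The following Python sketch is an illustration only: load_dictionary and in_dictionary are hypothetical helpers, skipping blank lines is an assumption the text does not state, and real dictionary matching works on tokens within a document rather than on whole strings.

```python
def load_dictionary(lines):
    """Parse dictionary file lines: one entry per line, '#' at position 0 is a comment."""
    entries = []
    for line in lines:
        entry = line.rstrip("\r\n")
        # Assumption: blank lines carry no entry and are skipped.
        if not entry or entry.startswith("#"):
            continue
        entries.append(entry)
    return entries

def in_dictionary(entries, text, flags="IgnoreCase"):
    """'Exact' is case-sensitive; 'IgnoreCase' (the default) is not."""
    if flags == "Exact":
        return text in entries
    return text.lower() in (e.lower() for e in entries)

# Hypothetical file contents. The last line is an entry, not a comment,
# because its '#' is not at position 0.
sample_lines = ["# first names\n", "Alice\n", "Bob\n", " #hashtag\n"]
entries = load_dictionary(sample_lines)
```

With these entries, in_dictionary(entries, "alice") succeeds under the default flags but fails with flags="Exact".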
The create dictionary statement allows users to define dictionaries of words or phrases without needing to create an external dictionary file. These inline dictionaries can be used in extract statements and in the ContainsDict function (see the section called “ContainsDict”). The syntax of the create dictionary statement is as follows:
create dictionary <dictionary name> as ( '<entry 1>', '<entry 2>', ... , '<entry n>' );
Entries can consist of multiple tokens and can even start with the dictionary file comment character "#".
create view Name as extract dictionaries 'first.dict' and 'last.dict' with flags 'Exact' on D.text as name from Document D;
create dictionary ConjunctionDict as ( 'and', 'or', 'but', 'yet' );
create view Conjunction as extract dictionary 'ConjunctionDict' on D.text as name from Document D;
The extract statement can also be used to split a large span into several smaller spans. The split extraction specification takes two arguments: a column containing longer target spans of text, and a second column containing split points.
The splitting algorithm works in two passes over the input relation. The first pass groups all of the input tuples by the target column. The second pass goes through the tuples in each group, splitting the target column with each value of the splitting column.
A split extraction specification has the following structure:
split using <name>.<split point column> [retain [right|left|both] split point[s]] on <name>.<column to split> as <output name>
The optional retain... arguments allow the user to specify how to treat the left and right endpoints of each result. If retain left split point is specified, then each output span will also contain the split point to its left, if such a split point exists. Likewise, retain right split point tells the system to make each output span contain the split point to its right.
For example, if the split points were all the instances of the word "fish" in the phrase "fish are swimming in the fish pond", then the various versions of the retain clause would have the following effects:
retain right split point: " are swimming in the fish", " pond"
retain left split point: "fish are swimming in the ", "fish pond"
retain both split points: "fish are swimming in the fish", "fish pond"
create view Sentences as
extract
    split using B.boundary
        retain right split point
        on B.text as sentence
from (
    extract
        D.text as text,
        regex /(([\.\?!]+\s)|(\n\s*\n))/ on D.text as boundary
    from Document D
    -- Filter the candidate boundaries.
    having Not(ContainsDict('abbreviations.dict',
        CombineSpans(LeftContextTok(boundary, 1), boundary)))
) B;
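The effect of the retain variants on the "fish" phrase discussed earlier can be checked with a small model. This Python sketch is not the two-pass algorithm the runtime uses; split_span is a hypothetical helper that works on one target string with character offsets, and dropping empty segments (such as the one before a split point at offset 0) is an assumption.

```python
import re

def split_span(text, split_points, retain="none"):
    """Split text at (begin, end) split points; retain is none, left, right, or both."""
    results = []
    prev = None  # split point to the left of the current segment, if any
    for nxt in sorted(split_points) + [None]:
        seg_begin = prev[1] if prev else 0
        seg_end = nxt[0] if nxt else len(text)
        # Assumption: a segment with no text of its own produces no output.
        if seg_end > seg_begin:
            begin = prev[0] if prev and retain in ("left", "both") else seg_begin
            end = nxt[1] if nxt and retain in ("right", "both") else seg_end
            results.append(text[begin:end])
        prev = nxt
    return results

phrase = "fish are swimming in the fish pond"
points = [m.span() for m in re.finditer("fish", phrase)]

right = split_span(phrase, points, retain="right")
left = split_span(phrase, points, retain="left")
both = split_span(phrase, points, retain="both")
```

The last segment has no split point to its right, so retain right leaves it unchanged, which mirrors the "if such a split point exists" wording in the description of the retain arguments.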