Chapter 3. The extract Statement

Table of Contents

Extraction Specifications
Regular Expressions
Dictionaries
Splits

The extract statement provides several ways to extract basic features directly from text. The basic form of the statement is:

extract 
    <colname 1> as <alias 1>, 
    ... , 
    <colname n> as <alias n>,
    <extraction specification>
from <input relation> <alias>
[having <having clause>];

The semantics of an extract statement are as follows:

Evaluate the extraction specification over each tuple of the input relation. For each result that the extraction produces, produce an output tuple containing the extracted values, along with any columns of the original tuple that were specified at the top of the extract statement. Rename the columns of the tuple according to the aliases at the beginning of the extract statement. Finally, apply any predicates in the (optional) having clause to the resulting tuple. If the tuple passes the predicates, add it to the output.

For example, the following extract statement evaluates a regular expression for U.S. phone numbers over the body of the Email relation, while passing through the sender column and filtering out emails that are not from the enron.com domain:

extract
    E.sender as emailsender,
    regex /\d{3}-\d{3}-\d{4}/ on E.body as num
from Email E
having MatchesRegex(/.*@enron.com/, emailsender);

Note that field names in the having clause refer to the aliases at the beginning of the extract statement. In the above example, the MatchesRegex predicate is applied to values from the sender field of the input relation, but the predicate refers to that field by the alias emailsender.

The input relation for an extract statement can be either a view name, as in the previous example, or a nested extract or select statement, as in the following example:

extract
    regex /foo/ on E.foobar as foo
from 
(
    extract regex /foobar/ on D.text as foobar
    from Document D
) E;

Extraction Specifications

AQL's extract statement supports a variety of basic extraction operations. This section describes each of them in detail.

Regular Expressions

A regular expression extraction specification has the following structure:

    regex[es] /<regex1>/ 
        [and /<regex2>/ and ... and /<regex n>/]
        [with flags '<flags string>']
    on [<token spec>] <name>.<column>
    <grouping spec>

The first part of a regular expression extraction specification lists one or more regular expressions. By default, AQL uses Perl syntax for regular expressions: regular expression literals are enclosed in two forward slash characters, and regular expression escape sequences take precedence over other escape characters. AQL also allows regular expressions in SQL string syntax, so a regular expression for U.S. phone numbers could be expressed as either:

/\d{3}-\d{3}-\d{4}/

or:

'\\d{3}-\\d{3}-\\d{4}'
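
For example, here is a minimal sketch of the same phone number extraction written with each syntax (the view names PhoneSlash and PhoneString are illustrative):

create view PhoneSlash as
extract regex /\d{3}-\d{3}-\d{4}/ on D.text as num
from Document D;

create view PhoneString as
extract regex '\\d{3}-\\d{3}-\\d{4}' on D.text as num
from Document D;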

In general, AQL supports the same features as the Java™ 5 regular expression implementation, as described at http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html. The SystemT runtime contains several regular expression engine implementations, including Java™'s built-in implementation. During compilation, the Optimizer examines each regular expression and chooses the fastest engine that can execute the expression.

Note that the alternative execution engines may have slightly different semantics for certain corner cases. In particular, AQL does not provide any guarantee about the order in which alternations will be evaluated. For example, if an extract statement runs the regular expression:

/fish|fisherman/

over the text 'fisherman', the statement may match either 'fish' or 'fisherman', depending on which regular expression engine is used internally.
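
To make the ambiguity concrete, consider the following sketch (the view name and alias are illustrative); over a document containing 'fisherman', its output may hold either string:

create view FishOrFisherman as
extract regex /fish|fisherman/ on D.text as fishMention
from Document D;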

Flags

The second part of a regular expression extraction specification is an optional flags string:

        [with flags '<flags string>']

This string specifies a combination of flags to control regular expression matching. These flags correspond to a subset of those defined in the Java 5 implementation. To specify multiple flags, separate them with the '|' character; for example, to specify both multiline matching and Unicode case folding, use the flags string 'MULTILINE|UNICODE'.

Note: If the flags string is not provided, AQL will default to using only the 'DOTALL' flag.

The following table lists the regular expression flags supported in AQL, along with their Java equivalents:

Table 3.1. Flags that control regular expression matching

AQL Flag String     Java™ Flag          Meaning
CANON_EQ            CANON_EQ            Canonical equivalence: different Unicode
                                        encodings of the same character are
                                        considered equivalent.
CASE_INSENSITIVE    CASE_INSENSITIVE    Perform case-insensitive matching.
UNICODE             UNICODE_CASE        If case-insensitive matching is specified,
                                        use Unicode case folding to determine
                                        whether two characters are equivalent in a
                                        case-insensitive comparison. Note: the
                                        behavior of this flag is not defined when
                                        it is used without the CASE_INSENSITIVE
                                        flag.
DOTALL              DOTALL              Make the dot character '.' match all
                                        characters, including newlines.
LITERAL             LITERAL             Treat the expression as a sequence of
                                        literal characters, ignoring the normal
                                        regular expression escapes.
MULTILINE           MULTILINE           Make the characters ^ and $ match the
                                        beginning and end of any line, as opposed
                                        to the beginning and end of the entire
                                        input text.
UNIX_LINES          UNIX_LINES          Treat only the UNIX newline character '\n'
                                        as a line break, ignoring the carriage
                                        return character '\r'.
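
As a minimal sketch of the flags clause in context (the view name is illustrative), the with flags clause appears between the regular expression and the on clause:

create view FishInsensitive as
extract
    regex /fish/ with flags 'CASE_INSENSITIVE|UNICODE'
        on D.text
        as fish
from Document D;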

Token Constraints

The third part of a regular expression extraction specification controls whether to match the regular expression only on token boundaries:

    ... [[between <number> and] <number> token[s] in] ...

This specification is optional; if it is omitted, then AQL will return the longest non-overlapping match at each character position in the input text.

If token constraints are present, the extract statement will return all matches that start and end on a token boundary and are within the specified range of tokens in length. If there are multiple overlapping matches, the extract statement will return all of them.

The current version of AQL uses a simple whitespace-based tokenization to determine token boundaries. A token is defined as a sequence of word characters or a single punctuation character. For example, in the string:

"The fish are pretty," said the boy.

AQL would identify token boundaries at the following locations:

["][The] [fish] [are] [pretty][,]["] [said] [the] [boy][.]

Grouping Specification

The final part of a regular expression extraction specification tells the system how to handle capturing groups in the regular expression. Capturing groups are regions of the regular expression match, identified by parentheses in the original expression. For example, the expression

(fish)(cakes)

has 3 capturing groups:

  • Group 0 is the entire match, fishcakes.
  • Group 1 is fish.
  • Group 2 is cakes.

The format of the grouping specification is as follows:

return 
    group <number> as <name>
    [and group <number> as <name>]*

To return only group 0 (the entire match), you can use a shorter, alternative format:

as <name>

which is equivalent to

return group 0 as <name>
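
For instance, the following two sketches are equivalent (view names are illustrative):

create view Year1 as
extract regex /\d{4}/ on D.text as year
from Document D;

create view Year2 as
extract regex /\d{4}/ on D.text
    return group 0 as year
from Document D;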

Regular Expression Examples

  • Find capitalized words that aren't first names, using canonical Unicode character equivalence to determine matches:
    create view NotFirstName as
    extract 
        regex /[A-Z][a-z]*/ with flags 'CANON_EQ'
            on 1 token in D.text 
            as word 
    from Document D
    having Not(ContainsDict('first.dict', word));
    
  • Extract the fields of a U.S. phone number, using capturing groups:
    create view Phone as
    extract regex /(\d{3})-(\d{3}-\d{4})/ 
        on between 4 and 5 tokens in D.text 
        return 
            group 1 as areaCode 
            and group 2 as restOfNumber
            and group 0 as fullNumber
    from Document D;
    
  • Run multiple regular expressions with a single extract statement:
    create view PhoneNum as
    extract regexes 
        /(\d{3})-(\d{3}-\d{4})/ and /[Xx]\d{3,5}/
        on between 1 and 5 tokens in D.text as num
    from Document D;
    

Regular Expression Tips

Following a few guidelines will help make your annotators faster and easier to maintain.

  • Avoid long, complex regular expressions. Instead, use several smaller regular expressions and combine their results with AQL statements.
  • Avoid unnecessary lookahead and lookbehind in regular expressions. You can usually achieve the same effect by adding predicates to the having clause of your extract statement, as in the sketch after this list.
  • Use token constraints in your regular expression extraction specifications whenever possible.
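
For example, here is a sketch of the second guideline. The view name is illustrative, and the RightContext scalar function (which returns a span covering up to the given number of characters to the right of its argument) and the five-character window are assumptions not described in this chapter. Instead of a lookahead such as /\d{3}(?=-\d{4})/, match the simple pattern and filter its right context in the having clause:

-- Match any standalone three-digit token, keeping it only if it is
-- immediately followed by '-' and four more digits.
create view ExchangePrefix as
extract regex /\d{3}/ on 1 token in D.text as prefix
from Document D
having MatchesRegex(/-\d{4}.*/, RightContext(prefix, 5));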

Dictionaries

The extract statement can evaluate a dictionary of strings, finding every occurrence of the dictionary's entries in the text. To find matches of a dictionary, use a dictionary extraction specification, which has the following structure:

    dictionar[y|ies]
        '<dictionary>'
        [and '<dictionary>' and ... and '<dictionary>']
        [with flags '<flags string>']

Each <dictionary> can be either an on-disk external dictionary file or an inline dictionary (see the section called “Inline Dictionaries”). External dictionary files are carriage-return-delimited text files with one dictionary entry per line. Lines with the "#" character at position 0 are treated as comments.
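
For example, an external dictionary file such as 'first.dict' might look like this sketch (the entries are illustrative):

# Dictionary of common first names; one entry per line.
# Lines with '#' at position 0, like these, are comments.
Aaron
Alice
Bob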

The flags string controls how dictionary matching is performed; currently there are two options: "Exact" provides exact, case-sensitive matching, and "IgnoreCase" provides case-insensitive matching. If no flags string is provided, "IgnoreCase" is the default.

Inline Dictionaries

The create dictionary statement allows users to define dictionaries of words or phrases without needing to create an external dictionary file. These inline dictionaries can be used in extract statements and with the ContainsDict function (see the section called “ContainsDict”). The syntax of the create dictionary statement is as follows:

create dictionary <dictionary name> as
(
    "<entry 1>", "<entry 2>", ... , "<entry n>"
);

Entries can consist of multiple tokens and can even start with the dictionary file comment character "#".

Dictionary Examples

  • Find person names, using on-disk dictionaries of common first and last names, and case-sensitive matching.
    create view Name as
    extract
        dictionaries
            'first.dict' 
            and 'last.dict'
        with flags 'Exact'
            on D.text   
            as name
    from Document D;
    
  • Find conjunctions, using an inline dictionary and the default case-insensitive matching.
    create dictionary ConjunctionDict as
    (
        'and', 'or', 'but', 'yet'
    );
    
    create view Conjunction as
    extract
        dictionary 'ConjunctionDict'
            on D.text   
            as name
    from Document D;
    

Splits

The extract statement can also be used to split a large span into several smaller spans. The split extraction specification takes two arguments: a column containing the target spans of text to be split, and a second column containing split points.

The splitting algorithm works in two passes over the input relation. The first pass groups the input tuples by the target column. The second pass goes through the tuples in each group, splitting the group's target span at each value of the split point column.

A split extraction specification has the following structure:

split using <name>.<split point column>
    [retain [right|left|both] split point[s]]
    on <name>.<column to split>
    as <output name>

The optional retain... arguments allow the user to specify how to treat the left and right endpoints of each result. If retain left split point is specified, then each output span will also contain the split point to its left, if such a split point exists. Likewise, retain right split point tells the system to make each output span contain the split point to its right.

For example, if the split points were all the instances of the word "fish" in the phrase "fish are swimming in the fish pond", then the various versions of the retain clause would have the following effects:

  • Clause omitted: " are swimming in the ", " pond"
  • retain right split point: " are swimming in the fish", " pond"
  • retain left split point: "fish are swimming in the ", "fish pond"
  • retain both split points: "fish are swimming in the fish", "fish pond"
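
The retain left split point variant above could be produced by a statement like the following sketch, assuming the document text is the phrase above (view and column names are illustrative):

create view FishSplits as
extract
    split using P.fish
        retain left split point
        on P.text
        as part
from (
    extract
        D.text as text,
        regex /fish/ on D.text as fish
    from Document D
) P;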

Split Examples

  • Split the document into sentences, using a regular expression for sentence boundaries.
    create view Sentences as
    extract 
        split using B.boundary 
            retain right split point
            on B.text
            as sentence
    from (
        extract 
            D.text as text,
            regex /(([\.\?!]+\s)|(\n\s*\n))/ on D.text as boundary
        from Document D
        -- Filter the candidate boundaries.
        having Not(ContainsDict('abbreviations.dict', 
            CombineSpans(LeftContextTok(boundary, 1), boundary)))
    ) B;