Chapter 2. Extracting Basic Features with AQL

In this section, we illustrate how the extract statement in AQL can be used to identify basic features in text. Continuing with the phone number example, notice that the FullPhoneNum AQL rule identifies phone numbers such as 713-410-0642 and 713-853-9780 but fails to identify numbers written in other formats, such as (713) 853-5536. To account for these patterns, we write the following extract statement, which comprises multiple regular expressions.

    create view PhoneNumber  as
    extract
      regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
        on D.text as num
    from Document D;

    output view PhoneNumber;
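
As a rough cross-check outside AQL, the two patterns can be tried with Python's re module. This is only a sketch: the function name extract_phone_numbers and the sample sentence are made up for illustration, and it returns plain strings rather than the spans that an AQL view holds. It runs each pattern separately and pools the full matches.

```python
import re

# The two patterns from the PhoneNumber view, copied unchanged.
PATTERNS = [
    re.compile(r"(\d{3})-(\d{3}-\d{4})"),
    re.compile(r"\(\d{3}\)\s*(\d{3}-\d{4})"),
]

def extract_phone_numbers(text):
    """Run each pattern on its own and pool the full matches (group 0)."""
    found = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.group(0)))
    # Sort by start offset so the pooled result reads in document order.
    return [match_text for _, match_text in sorted(found)]

# Made-up sentence built from the phone numbers quoted in the text.
sample = "Call 713-410-0642 or 713-853-9780, or try (713) 853-5536."
print(extract_phone_numbers(sample))
```

Neither pattern alone covers all three formats in the sample; the parenthesized area code is picked up only by the second pattern.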
    

When the above extract statement is evaluated, each of the constituent regular expressions is evaluated individually, and the results are combined to create the PhoneNumber view. Executing the above AQL query produces output like this:

While the above AQL statement identifies complete phone numbers, it does not capture mentions of extension numbers, such as the one in the following snippet:

Please give me call if you have any questions. (Ext. 7637) Thanks. Lynn.

Notice how a phrase such as Ext followed by a sequence of digits is a good pattern to identify extension numbers. We write the following AQL statement to identify extension numbers using this pattern.

    create view ExtensionNumbers as
    extract
      regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
        on D.text
        return group 1 as num and group 0 as completenum
    from Document D;

    output view ExtensionNumbers;
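
To see what the two return group clauses select, the same pattern can be run on the snippet above with Python's re module, whose group numbering agrees with Java's for a pattern like this one. This is a sketch under that assumption: the function name extract_extensions is hypothetical, and the values are plain strings rather than AQL spans.

```python
import re

# The pattern from the ExtensionNumbers view, copied unchanged.
EXT = re.compile(r"[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})")

def extract_extensions(text):
    """Return one record per match, mirroring the two AQL attributes."""
    return [
        {
            "num": m.group(1),          # the digits only (group 1)
            "completenum": m.group(0),  # the whole match (group 0)
        }
        for m in EXT.finditer(text)
    ]

snippet = "Please give me call if you have any questions. (Ext. 7637) Thanks. Lynn."
print(extract_extensions(snippet))
```

Group 0 always denotes the complete match, while group 1 is the text captured by the first pair of parentheses in the pattern.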
    

In the above statement, notice how two attributes are created for each match to the regular expression. The num attribute contains the actual extension number, which corresponds to group 1 in the regular expression pattern. The notion of groups in AQL regular expressions is identical to what is supported in Java™ 5 regular expressions as described at http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html. The complete match for the regular expression is returned as the completenum attribute. Executing the above AQL statement returns matches as shown below.

We next consider the task of identifying person names from text. One way to start is by collecting a list of first names and last names and then identifying occurrences of each of these entries in the input documents. We refer to such lists as dictionaries, where each entry in the dictionary is on a separate line of the dictionary file as shown below.

We next upload the dictionary using the Add Dictionary menu option. A sample dictionary file containing first and last names is provided under data\sampledictionaries\names.dict and can be uploaded as shown below.

The dictionary extraction operation now allows us to extract occurrences of first and last names using the dictionary file, as illustrated by the following AQL query.

    create view PersonFirstOrLastName as
    extract
      dictionary 'names.dict' on D.text as name
    from Document D;
      
    output view PersonFirstOrLastName;
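
The idea behind dictionary extraction can be approximated in a few lines of Python. This is a simplified sketch, not how the AQL runtime works: the dictionary contents, the sample sentence, and the helper names load_entries and dictionary_matches are all made up for illustration, and the sketch assumes whole-word, case-insensitive matching.

```python
import re

# Hypothetical dictionary file contents: one entry per line.
DICT_FILE_CONTENTS = """Jeff
Keeler
Lisa
Jacobson
"""

def load_entries(contents):
    """Split the file contents into entries, skipping blank lines."""
    return [line.strip() for line in contents.splitlines() if line.strip()]

def dictionary_matches(text, entries):
    """Return every whole-word occurrence of an entry, ignoring case."""
    alternation = "|".join(re.escape(entry) for entry in entries)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(text)]

# Made-up sentence for illustration.
sample = "Jeff Keeler forwarded the note to Lisa Jacobson."
print(dictionary_matches(sample, load_entries(DICT_FILE_CONTENTS)))
```

Each occurrence is reported separately, so a name that appears twice in the text yields two matches.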
    

Every occurrence of an entry in the dictionary file is identified by this extraction rule and the results look like this:

Notice how the AQL rule identifies all occurrences of first or last names in the document, such as Jeff, Keeler, Lisa and Jacobson. However, the rule also makes some mistakes, such as identifying long as a possible person name. While Long is a popular last name in the United States (according to the 1990 US Census data, Long was the 86th most popular last name in the country), the word also has other common meanings. To partially account for this ambiguity, we modify the rule to ensure that each dictionary match is capitalized, i.e., that the match begins with an upper-case letter.

     create view PersonFirstOrLastName as
     extract
       dictionary 'names.dict' on D.text as name
     from Document D
     having MatchesRegex(/[A-Z].+/, name);
     
     output view PersonFirstOrLastName;
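
The effect of the capitalization filter can be sketched in Python as follows. The assumptions are that dictionary matching ignores case and that MatchesRegex must match the entire matched text; the latter is modeled here with re.fullmatch. The name list, the sample sentence, and the helper names are hypothetical.

```python
import re

# Hypothetical dictionary entries, including the ambiguous name Long.
NAMES = ["Jeff", "Keeler", "Lisa", "Jacobson", "Long"]

# The pattern from the having clause, copied unchanged.
CAPITALIZED = re.compile(r"[A-Z].+")

def dictionary_matches(text, entries):
    """Return every whole-word occurrence of an entry, ignoring case."""
    alternation = "|".join(re.escape(entry) for entry in entries)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(text)]

def capitalized_only(matches):
    """Keep only matches that the pattern covers from start to end."""
    return [match for match in matches if CAPITALIZED.fullmatch(match)]

# Made-up sentence containing the common word "long".
sample = "Jeff Keeler waited a long time for Lisa Jacobson."
all_matches = dictionary_matches(sample, NAMES)
print(all_matches)
print(capitalized_only(all_matches))
```

Note that the pattern requires at least two characters, so a single capital letter on its own would also be filtered out. A capitalized Long at the start of a sentence would still pass the filter, which is why the text describes this fix as only partial.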
    

The having clause in the above query applies an additional predicate on every dictionary match. In this particular example, the MatchesRegex predicate checks whether the name is capitalized. Executing the above AQL statement removes the spurious match (long) as shown below.