Chapter 3. Extracting Composite Entities with AQL

In this section, we describe how the features extracted with extract operations can be combined to identify complete entities. Recall that in the previous section we identified phone numbers using two separate AQL statements: one for 10-digit phone numbers and another for extension numbers. Sometimes a phone number consists of both a 10-digit number and an extension (e.g., 713-463-9595 ext. 302). To identify such phone numbers, we write the following AQL statements, the last of which identifies every 10-digit phone number that is immediately followed by an extension number.

    -- 10-digit phone numbers, e.g. 713-463-9595 or (713) 463-9595
    create view PhoneNumber as
    extract
      regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
        on D.text as num
    from Document D;

    -- Extensions, e.g. "ext. 302": group 1 is the digits, group 0 is the whole match
    create view ExtensionNumbers as
    extract
      regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
        on D.text
      return group 1 as num and group 0 as completenum
    from Document D;

    -- A phone number followed by an extension, with at most one token in between
    create view PhoneNumberWithExtension as
    select CombineSpans(P.num, E.completenum) as num
    from   PhoneNumber P, ExtensionNumbers E
    where  FollowsTok(P.num, E.completenum, 0, 1);
    
    output view PhoneNumberWithExtension;
    

The final AQL statement defines the view PhoneNumberWithExtension. The from clause names the input views, here PhoneNumber and ExtensionNumbers. Every pair (P, E) formed by choosing one tuple from each input view is a candidate for output, and a pair is output only if it satisfies the predicate in the where clause. In this example, the predicate FollowsTok(P.num, E.completenum, 0, 1) requires that the extension appear after the phone number, with at least 0 and at most 1 tokens between them. The select clause describes how the output tuple is constructed from a qualifying pair. Here, the CombineSpans function takes the two spans as input and returns the smallest region of text that completely covers both. The output for PhoneNumberWithExtension looks like this.
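Independently of the tool output, the join semantics described above can be sketched in plain Python. This is only an illustrative model, not the AQL runtime: the `Span` type, the helper names, the sample sentence, and in particular the simplified tokenizer (runs of word characters or single punctuation marks) are assumptions made for this example, and the real tokenizer may split text differently.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def combine_spans(a: Span, b: Span) -> Span:
    """Smallest span that covers both inputs (models CombineSpans)."""
    return Span(min(a.begin, b.begin), max(a.end, b.end))


def tokens_between(text: str, a: Span, b: Span) -> int:
    """Count tokens between the end of a and the start of b.

    Assumption: a token is a run of word characters or a single
    punctuation character. This is a stand-in for the real tokenizer.
    """
    return len(re.findall(r"\w+|[^\w\s]", text[a.end:b.begin]))


def follows_tok(text: str, a: Span, b: Span, lo: int, hi: int) -> bool:
    """True when b starts after a ends, with lo..hi tokens in between."""
    return a.end <= b.begin and lo <= tokens_between(text, a, b) <= hi


# Hypothetical sample document.
text = "Call 713-463-9595 ext. 302 or 713-853-6366."

phones = [Span(*m.span()) for m in re.finditer(r"\d{3}-\d{3}-\d{4}", text)]
exts = [Span(*m.span())
        for m in re.finditer(r"[Ee]xt\s*[.\-:]?\s*\d{3,5}", text)]

# from PhoneNumber P, ExtensionNumbers E: consider every (P, E) pair;
# where FollowsTok(...): keep qualifying pairs;
# select CombineSpans(...): build the output span.
combined = [combine_spans(p, e)
            for p in phones
            for e in exts
            if follows_tok(text, p, e, 0, 1)]

for s in combined:
    print(text[s.begin:s.end])
```

Only the first phone number qualifies here, because the second one appears after the extension rather than before it.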

At this point we have identified 10-digit phone numbers, extension numbers and 10-digit phone numbers with extensions. We now combine all these results together using the union all operation.

    create view PhoneNumberAll as
      (select P.num as num from PhoneNumber P)
      union all
      (select E.completenum as num from ExtensionNumbers E)
      union all
      (select P.num as num from PhoneNumberWithExtension P);

    output view PhoneNumberAll;
 	 

We add the above AQL statements to the existing AQL statements in the main screen of the Development Environment and select PhoneNumberAll as the output view. Executing this query, we see that phone numbers of all three targeted forms are identified. However, we also observe some duplicate results, as shown below.

Notice how, for every PhoneNumberWithExtension (e.g., 713-463-9595 ext. 302), the corresponding PhoneNumber and ExtensionNumber are also output separately. To remove such duplicate results, AQL supports a consolidate operation. The consolidate operation takes a set of spans as input and removes some of them according to a consolidation policy. In this example, we use the ContainedWithin policy, which removes every span that is contained within some other span in the input. Applying this policy removes all PhoneNumber and ExtensionNumber tuples that appear within a PhoneNumberWithExtension.

    create view PhoneNumberAllConsolidated as
    select P.num as num
    from PhoneNumberAll P
    consolidate on P.num
    using 'ContainedWithin';

    output view PhoneNumberAllConsolidated;
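
The effect of union all followed by ContainedWithin consolidation can be illustrated with a small Python sketch. It is a simplified model rather than the AQL implementation: the `Span` type, the function names, and the hard-coded offsets (taken from a hypothetical sample sentence) are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def contains(outer: Span, inner: Span) -> bool:
    """True when inner lies entirely inside outer."""
    return outer.begin <= inner.begin and inner.end <= outer.end


def consolidate_contained_within(spans: List[Span]) -> List[Span]:
    """Drop each span that is contained within a different span.

    Identical spans are collapsed to a single copy first.
    """
    unique = sorted(set(spans))
    return [s for s in unique
            if not any(o != s and contains(o, s) for o in unique)]


# Hypothetical sample document and the spans each view would produce.
text = "Call 713-463-9595 ext. 302 or 713-853-6366."
phone_number = [Span(5, 17), Span(30, 42)]
extension_numbers = [Span(18, 26)]
phone_number_with_extension = [Span(5, 26)]

# union all: concatenate the inputs, keeping every tuple.
phone_number_all = (phone_number
                    + extension_numbers
                    + phone_number_with_extension)

consolidated = consolidate_contained_within(phone_number_all)
for s in consolidated:
    print(text[s.begin:s.end])
```

In this sketch the standalone phone number 713-463-9595 and the extension ext. 302 are dropped because both lie inside the combined span, while 713-853-6366 survives because no other span contains it.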
 	 

After refreshing the output views and selecting PhoneNumberAllConsolidated as the output view, we execute the AQL annotator. The results after consolidation look like this.

We next present an example to illustrate how more complex tasks can be handled using AQL. Consider the task of finding people's phone numbers. Since we have already identified PhoneNumber and Person entities, we combine these to find occurrences of the relationship between the two entities.

 
    
        -- Candidate person names: dictionary matches that start with a capital letter
        create view PersonFirstOrLastName as
        extract
          dictionary 'names.dict' on D.text as name
        from Document D
        having MatchesRegex(/[A-Z].+/, name);

        create view PhoneNumber as
        extract
          regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
            on D.text as num
        from Document D;

        create view ExtensionNumbers as
        extract
          regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
            on D.text
          return group 1 as num and group 0 as completenum
        from Document D;

        create view PhoneNumberWithExtension as
        select CombineSpans(P.num, E.completenum) as num
        from   PhoneNumber P, ExtensionNumbers E
        where  FollowsTok(P.num, E.completenum, 0, 1);

        create view PhoneNumberAll as
          (select P.num as num from PhoneNumber P)
          union all
          (select E.completenum as num from ExtensionNumbers E)
          union all
          (select P.num as num from PhoneNumberWithExtension P);

        create view PhoneNumberAllConsolidated as
        select P.num as num
        from PhoneNumberAll P
        consolidate on P.num
        using 'ContainedWithin';

        -- Pair each name with every phone number that starts 0 to 30 characters after it
        create view PersonsPhone as
        select person.name as person, phone.num as phone,
               CombineSpans(person.name, phone.num) as personphone
        from PersonFirstOrLastName person, PhoneNumberAllConsolidated phone
        where Follows(person.name, phone.num, 0, 30);

        output view PersonsPhone;
    

The results obtained by executing the above query look like this:

Notice how the AQL rule correctly identifies people's phone numbers, such as Pete Castrejana at 713-410-0642 and Scott Shishido 713-853-9780. The rule also correctly handles a person with multiple phone numbers. For example, from the text Robert Humlicek 713-853-6366 713-406-8293, two output tuples are generated: one associating Robert Humlicek with the phone number 713-853-6366 and another associating Robert Humlicek with the phone number 713-406-8293.
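
To see why a single name yields two tuples, the character-distance join can be modeled with a short Python sketch. The `Span` type and the `follows` helper are illustrative stand-ins, not the AQL runtime. The person span is hard-coded to cover the full name for brevity, which is an assumption: a dictionary of first or last names may instead match "Robert" and "Humlicek" as separate spans.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # exclusive character offset


def follows(a: Span, b: Span, min_chars: int, max_chars: int) -> bool:
    """True when b begins min_chars..max_chars characters after a ends."""
    return min_chars <= b.begin - a.end <= max_chars


text = "Robert Humlicek 713-853-6366 713-406-8293"

# Assumption: one person span covering the full name.
person = Span(0, len("Robert Humlicek"))
phones = [Span(*m.span()) for m in re.finditer(r"\d{3}-\d{3}-\d{4}", text)]

# Every (person, phone) pair within 0..30 characters is output, so a
# name followed by two nearby numbers produces two tuples.
pairs = [(text[person.begin:person.end], text[p.begin:p.end])
         for p in phones
         if follows(person, p, 0, 30)]

for name, phone in pairs:
    print(name, "->", phone)
```

The gaps between the end of the name and the two numbers are 1 and 14 characters, both inside the 0 to 30 window, so both pairs qualify. A tighter window would keep only the nearest number, at the cost of missing names separated from their numbers by a few words.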