Extract operations can be combined to identify complete entities. Recall that in the previous section we identified phone numbers using two different AQL statements: one for 10-digit phone numbers and another for extension numbers. Sometimes a phone number consists of both a 10-digit number and an extension (e.g., 713-463-9595 ext. 302). To identify such phone numbers, we write the following AQL statements, which identify all 10-digit phone numbers and extension numbers that are adjacent to each other.
-- 10-digit phone numbers, with or without parentheses around the area code
create view PhoneNumber as
  extract regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
    on D.text as num
  from Document D;

-- Extensions: group 1 is the digits only, group 0 is the whole match
create view ExtensionNumbers as
  extract regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
    on D.text
    return group 1 as num and group 0 as completenum
  from Document D;

-- A phone number followed by an extension, at most one token apart
create view PhoneNumberWithExtension as
  select CombineSpans(P.num, E.completenum) as num
  from PhoneNumber P, ExtensionNumbers E
  where FollowsTok(P.num, E.completenum, 0, 1);

output view PhoneNumberWithExtension;
The final AQL statement defines the view PhoneNumberWithExtension. The from clause in this statement defines the set of input views, PhoneNumber and ExtensionNumbers in this case. Each pair (PhoneNumber, ExtensionNumbers) constructed by choosing one tuple from each of the two input views is a candidate for output. A pair is output only if it satisfies the predicate specified in the where clause. In this example, the predicate FollowsTok(P.num, E.completenum, 0, 1) requires that the PhoneNumber span appear before the ExtensionNumbers span and that the number of tokens between them be at least 0 and at most 1. The select clause describes how the output tuple is constructed. Here, the CombineSpans function takes two spans as input and outputs the smallest region of text that completely covers both. The output for PhoneNumberWithExtension looks like this.
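Separately from the tool output, the pairing logic described above can be emulated in a few lines of Python. This is only a rough sketch and not AQL: the sample text is invented, spans are plain (begin, end) character offsets, and tokens are approximated by whitespace splitting, which is an assumption and may differ from the tokenizer the AQL runtime uses. The helper names follows_tok and combine_spans are hypothetical stand-ins for FollowsTok and CombineSpans.

```python
import re
from itertools import product

# Invented sample text for illustration only.
text = "Call 713-463-9595 ext. 302 or 713-410-0642 for details."

# Spans are (begin, end) character offsets, end exclusive.
phone_spans = [m.span() for m in re.finditer(r"\d{3}-\d{3}-\d{4}", text)]
ext_spans = [m.span() for m in re.finditer(r"[Ee]xt\s*[.\-:]?\s*\d{3,5}", text)]


def follows_tok(text, left, right, min_tok, max_tok):
    """True if `right` starts after `left` ends with min_tok..max_tok tokens between.

    Tokens are approximated by whitespace splitting (an assumption).
    """
    if right[0] < left[1]:
        return False
    gap_tokens = len(text[left[1]:right[0]].split())
    return min_tok <= gap_tokens <= max_tok


def combine_spans(left, right):
    """Smallest span that covers both input spans."""
    return (min(left[0], right[0]), max(left[1], right[1]))


# "from" builds every (phone, extension) pair, "where" filters the pairs,
# and "select" builds the output span from each surviving pair.
combined = [
    combine_spans(p, e)
    for p, e in product(phone_spans, ext_spans)
    if follows_tok(text, p, e, 0, 1)
]
results = [text[b:e] for b, e in combined]
print(results)
```

Of the two candidate pairs in this sample, only the first phone number is followed by the extension, so a single combined span covering 713-463-9595 ext. 302 survives the filter.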
To collect phone numbers of all three forms in a single view, we combine the three views using the union all operation.
-- Plain numbers, bare extensions, and numbers with extensions in one view
create view PhoneNumberAll as
  (select P.num as num from PhoneNumber P)
  union all
  (select E.completenum as num from ExtensionNumbers E)
  union all
  (select P.num as num from PhoneNumberWithExtension P);

output view PhoneNumberAll;
We add the above AQL statements to the existing AQL statements in the main screen of the Development Environment and select PhoneNumberAll as the output view. Executing this query, we see that phone numbers of all three targeted forms are identified. However, there are also some duplicate results, as shown below.
For each PhoneNumberWithExtension (e.g., 713-463-9595 ext. 302), the corresponding PhoneNumber and ExtensionNumbers results are also output separately.
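Why the overlapping results appear can be seen in a small Python emulation (not AQL). It assumes that union all behaves as a bag union that simply concatenates its inputs; the sample text is invented, and the combined span is built by hand rather than through a join.

```python
import re

# Invented sample text for illustration only.
text = "Reach me at 713-463-9595 ext. 302."

phone = [m.span() for m in re.finditer(r"\d{3}-\d{3}-\d{4}", text)]
ext = [m.span() for m in re.finditer(r"[Ee]xt\s*[.\-:]?\s*\d{3,5}", text)]

# Combined span built by hand here, since the sample has exactly one
# phone number and one extension.
with_ext = [(phone[0][0], ext[0][1])]

# Bag union: the three inputs are concatenated and nothing is dropped,
# so the shorter spans remain next to the span that contains them.
phone_all = phone + ext + with_ext
union_texts = [text[b:e] for b, e in phone_all]
print(union_texts)
```

The result holds three entries for what is really one phone number: the bare number, the bare extension, and the span that covers both.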
To remove such duplicate results, AQL supports a consolidate operation. The consolidate operation takes a set of spans as input and removes some of them according to a consolidation policy. In this example, we use the ContainedWithin policy, which removes every span that is contained within some other span in the input. Applying this policy removes all PhoneNumber and ExtensionNumbers tuples that appear within a PhoneNumberWithExtension span.
-- Drop every span that lies inside another span of PhoneNumberAll
create view PhoneNumberAllConsolidated as
  select P.num as num
  from PhoneNumberAll P
  consolidate on P.num using 'ContainedWithin';

output view PhoneNumberAllConsolidated;
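The effect of the ContainedWithin policy can be sketched in Python as follows. This is an illustrative emulation rather than the AQL implementation: spans are (begin, end) character offsets chosen by hand, the function names are hypothetical, and collapsing identical spans into a single span is an assumption made here for simplicity.

```python
def contained_within(inner, outer):
    """True if span `inner` lies entirely inside span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def consolidate_contained_within(spans):
    """Keep only the spans that are not contained in a different span."""
    # Identical spans collapse to one entry (an assumption of this sketch).
    unique = sorted(set(spans))
    return [
        s for s in unique
        if not any(s != other and contained_within(s, other) for other in unique)
    ]


# (12, 24) is a phone number, (25, 33) its extension, (12, 33) the combined
# span, and (40, 52) an unrelated phone number without an extension.
spans = [(12, 24), (25, 33), (12, 33), (40, 52)]
kept = consolidate_contained_within(spans)
print(kept)
```

Only the combined span and the standalone phone number remain, which mirrors what the consolidated view is meant to return.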
Refreshing the output views and selecting PhoneNumberAllConsolidated as the output view, we execute the AQL annotator. The results after consolidation look like this.
Having identified PhoneNumber and Person entities, we combine them to find occurrences of the relationship between the two entities.
-- Capitalized dictionary matches for first or last names
create view PersonFirstOrLastName as
  extract dictionary 'names.dict'
    on D.text as name
  from Document D
  having MatchesRegex(/[A-Z].+/, name);

create view PhoneNumber as
  extract regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
    on D.text as num
  from Document D;

create view ExtensionNumbers as
  extract regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
    on D.text
    return group 1 as num and group 0 as completenum
  from Document D;

create view PhoneNumberWithExtension as
  select CombineSpans(P.num, E.completenum) as num
  from PhoneNumber P, ExtensionNumbers E
  where FollowsTok(P.num, E.completenum, 0, 1);

create view PhoneNumberAll as
  (select P.num as num from PhoneNumber P)
  union all
  (select E.completenum as num from ExtensionNumbers E)
  union all
  (select P.num as num from PhoneNumberWithExtension P);

create view PhoneNumberAllConsolidated as
  select P.num as num
  from PhoneNumberAll P
  consolidate on P.num using 'ContainedWithin';

-- A name followed by a phone number within 0 to 30 characters
create view PersonsPhone as
  select person.name as person,
         phone.num as phone,
         CombineSpans(person.name, phone.num) as personphone
  from PersonFirstOrLastName person, PhoneNumberAllConsolidated phone
  where Follows(person.name, phone.num, 0, 30);

output view PersonsPhone;
The results obtained by executing the above query look like this:
The rule correctly identifies associations in text such as Pete Castrejana at 713-410-0642 and Scott Shishido 713-853-9780. The rule also correctly identifies the association when a person has multiple phone numbers. For example, from the text Robert Humlicek 713-853-6366 713-406-8293, two output tuples are generated: one associating Robert Humlicek with the phone number 713-853-6366 and another associating Robert Humlicek with the phone number 713-406-8293.
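The multiple-phone-number case can be reproduced with a short Python emulation of the Follows predicate (not AQL). The sketch assumes that Follows measures the distance in characters from the end of the first span to the beginning of the second. The person span is located by hand as a stand-in for the dictionary-based view, and the function name follows is a hypothetical counterpart to the AQL predicate.

```python
import re

text = "Robert Humlicek 713-853-6366 713-406-8293"

# In AQL the person spans come from a dictionary match; here a single span
# is located by hand (hypothetical stand-in for that view).
name = "Robert Humlicek"
start = text.find(name)
person_spans = [(start, start + len(name))]
phone_spans = [m.span() for m in re.finditer(r"\d{3}-\d{3}-\d{4}", text)]


def follows(left, right, min_chars, max_chars):
    """True if `right` begins min_chars..max_chars characters after `left` ends."""
    gap = right[0] - left[1]
    return min_chars <= gap <= max_chars


# Every (person, phone) pair is a candidate; the predicate keeps the pairs
# whose phone number starts within 30 characters after the name.
pairs = [
    (text[p[0]:p[1]], text[n[0]:n[1]])
    for p in person_spans
    for n in phone_spans
    if follows(p, n, 0, 30)
]
print(pairs)
```

Both phone numbers begin within 30 characters of the end of the name (1 and 14 characters away), so the same person is paired with each of them, giving two tuples.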