In this section, we illustrate how the extract statement in AQL can be used to identify basic features from text. Continuing with the phone number example, notice how the FullPhoneNum AQL rule identifies phone numbers such as 713-410-0642 and 713-853-9780, but fails to identify other phone numbers such as (713) 853-5536. To account for these patterns, we write the following extract statement, which comprises multiple regular expressions.
create view PhoneNumber as
  extract regexes /(\d{3})-(\d{3}-\d{4})/ and /\(\d{3}\)\s*(\d{3}-\d{4})/
    on D.text as num
  from Document D;

output view PhoneNumber;
When the above extract statement is evaluated, each of the constituent regular expressions is evaluated individually, and the results are combined to create the PhoneNumber view.
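The evaluation described above can be emulated outside AQL. The following is a minimal Python sketch, not the AQL runtime: it runs each of the two patterns on its own and pools the matches. The function name extract_phone_numbers, the sample sentence, and the choice to sort matches by offset are assumptions made for illustration.

```python
import re

# The two patterns from the PhoneNumber view, written as Python regexes.
PATTERNS = [
    re.compile(r"(\d{3})-(\d{3}-\d{4})"),
    re.compile(r"\(\d{3}\)\s*(\d{3}-\d{4})"),
]


def extract_phone_numbers(text):
    """Run each pattern separately and pool the matches.

    Sorting by start offset is a convenience of this sketch, not a
    statement about how AQL orders its results.
    """
    spans = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), m.group(0)))
    return sorted(spans)


# Hypothetical sample sentence built from phone numbers quoted in the text.
sample = "Call 713-410-0642 or (713) 853-5536."
matches = extract_phone_numbers(sample)
for start, end, num in matches:
    print(start, end, num)
```

Neither pattern alone covers both formats in the sample; only the pooled result contains both numbers.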
Executing the above AQL query returns the phone numbers matched by either regular expression.
Now consider text such as the following:
Please give me call if you have any questions. (Ext. 7637) Thanks. Lynn
Here, Ext followed by a sequence of digits is a good pattern to identify extension numbers. We write the following AQL statement to identify extension numbers using this pattern.
create view ExtensionNumbers as
  extract regex /[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})/
    on D.text
    return group 1 as num and group 0 as completenum
  from Document D;

output view ExtensionNumbers;
In the above statement, notice how two attributes are created for each match of the regular expression. The num attribute contains the actual extension number, which corresponds to group 1 in the regular expression pattern. The notion of groups in AQL regular expressions is identical to that supported in Java™ 5 regular expressions, as described at http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html. The complete match for the regular expression is returned as the completenum attribute.
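To make the distinction between group 0 and group 1 concrete, here is a small Python sketch that applies the same pattern to the sample sentence containing the extension. It only emulates the grouping behavior with Python's re module (group numbering works the same way for this pattern); the dictionary keys mirror the attribute names num and completenum, and the rows variable is an assumption of the sketch.

```python
import re

# Same pattern as in the ExtensionNumbers view.
EXTENSION = re.compile(r"[Ee]xt\s*[\.\-\:]?\s*(\d{3,5})")

text = "Please give me call if you have any questions. (Ext. 7637) Thanks. Lynn"

# group(1) is the parenthesized digit sequence; group(0) is the whole match.
rows = [
    {"num": m.group(1), "completenum": m.group(0)}
    for m in EXTENSION.finditer(text)
]

for row in rows:
    print(row)
```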
Executing the above AQL statement returns matches as shown below.
Dictionary extraction relies on dictionaries, where each entry in the dictionary is on a separate line of the dictionary file. A dictionary file is added using the Add Dictionary menu option. A sample dictionary file containing first and last names is provided under data\sampledictionaries\names.dict and can be uploaded using this option.
The dictionary extraction operation now allows us to extract occurrences of first and last names using the dictionary file, as illustrated by the following AQL query.
create view PersonFirstOrLastName as
  extract dictionary 'names.dict'
    on D.text as name
  from Document D;

output view PersonFirstOrLastName;
Every occurrence of an entry in the dictionary file is identified by this extraction rule. The results include person names such as Jeff, Keeler, Lisa and Jacobson. However, the rule also makes some mistakes, such as identifying long as a possible person name.
While Long is a popular last name in the United States (according to the 1990 US Census data, Long was the 86th most popular last name in the country), the word also has other common meanings. To partially account for this ambiguity, we modify the rule to ensure that each dictionary match is capitalized, i.e., the match begins with an upper-case letter.
create view PersonFirstOrLastName as
  extract dictionary 'names.dict'
    on D.text as name
  from Document D
  having MatchesRegex(/[A-Z].+/, name);

output view PersonFirstOrLastName;
The having clause in the above query applies an additional predicate to every dictionary match. In this particular example, the MatchesRegex predicate checks whether the name is capitalized. Executing the above AQL statement removes the spurious match (long).
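The effect of the capitalization filter can be reproduced with a rough Python emulation. This is a sketch under stated assumptions, not AQL semantics: the NAMES list is a hypothetical stand-in for a few entries of names.dict, dictionary matching is assumed to be case-insensitive and on whole words, the predicate is assumed to test the entire matched span, and the sample sentence is invented.

```python
import re

# Hypothetical stand-in for a few entries of names.dict.
NAMES = ["jeff", "keeler", "lisa", "jacobson", "long"]

# Assumption: entries match case-insensitively and only on whole words.
NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(n) for n in NAMES) + r")\b",
    re.IGNORECASE,
)

# Same pattern as in the having clause. fullmatch is used on the
# assumption that the predicate tests the whole matched span.
CAPITALIZED = re.compile(r"[A-Z].+")


def dictionary_matches(text):
    """Emulates the unfiltered dictionary extraction."""
    return [m.group(0) for m in NAME_PATTERN.finditer(text)]


def capitalized_matches(text):
    """Emulates the same extraction with the capitalization predicate."""
    return [n for n in dictionary_matches(text) if CAPITALIZED.fullmatch(n)]


# Invented sample sentence containing both real names and the word "long".
sample = "Jeff Keeler said it would not take long."
print(dictionary_matches(sample))
print(capitalized_matches(sample))
```

The filter is only a partial fix: a sentence that begins with the word Long would still pass the capitalization test.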