Chapter 4. Managing Document Collections
The System Text Development Environment supports the development of AQL annotators over a collection of documents.
We next show how a custom document collection can be used in this environment.
- Every document in the collection consists of a document id and the actual text value over which the rules are executed. We create one text file per document: the name of the file serves as the document id, and the contents of the file are the text to be processed.
- All documents in a single collection must use the same encoding. The encodings currently supported in System Text are UTF-8, UTF-16, Windows-1252, ISO-8859-1, US-ASCII, UTF-16BE and UTF-16LE.
- The document collection is created as a zip file containing these individual text files.
- The document collection is uploaded using the Add Collection option under the Collection menu.
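The packaging steps in the list above can be scripted before the upload. The following is a minimal sketch in Python, not part of System Text itself: the function name build_collection, the example file names, and the choice to place every file at the top level of the archive are assumptions made for illustration. It writes one file per document, named after the document id, and encodes every document with the same encoding.

```python
import codecs
import zipfile

# Encodings listed in this chapter. Python's codec registry accepts these
# names as aliases, so they can be passed to str.encode() unchanged.
SUPPORTED_ENCODINGS = {
    "UTF-8", "UTF-16", "Windows-1252", "ISO-8859-1",
    "US-ASCII", "UTF-16BE", "UTF-16LE",
}


def build_collection(documents, zip_path, encoding="UTF-8"):
    """Write a zip archive with one text file per document.

    documents: dict mapping a document id (used as the file name)
               to the document text (a str).
    zip_path:  where to write the archive.
    encoding:  the single encoding applied to every document.
    """
    if encoding.upper() not in {name.upper() for name in SUPPORTED_ENCODINGS}:
        raise ValueError(f"Encoding not in the supported list: {encoding}")
    codecs.lookup(encoding)  # fails early if Python does not know the name

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for doc_id, text in documents.items():
            # Raises UnicodeEncodeError if the text cannot be represented
            # in the chosen encoding, instead of writing a damaged file.
            archive.writestr(doc_id, text.encode(encoding))
    return zip_path
```

For example, build_collection({"report1.txt": "First document.", "report2.txt": "Second document."}, "collection.zip") produces an archive that can then be uploaded through the Add Collection option. Encoding all documents in one place keeps the collection consistent, which is the requirement stated in the list.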
When writing AQL annotators for the custom document collection, we select the appropriate collection before executing the rules, as shown below.