public abstract class AbstractCharStreamTagger extends AbstractTextRestrictiveHandler implements IDocumentTagger
Base class for taggers dealing with the body of text documents only. Subclasses can safely be used as either pre-parse or post-parse handlers.
For pre-parsing, non-text documents are simply ignored and no
tagging occurs. To determine whether a document is text, the value of the
Importer.DOC_CONTENT_TYPE metadata field is used. By default,
any content type starting with "text/" is considered text. This default
behavior can be changed with the AbstractTextRestrictiveHandler.setContentTypeRegex(String)
method.
Make sure the expression matches text documents only, to avoid parsing exceptions.
For post-parsing, all documents are assumed to be text.
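The pre-parse and post-parse rules above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class name `TextTypeCheck`, the method `isText`, the regex `^text/.*$` (assumed to be equivalent to "starts with text/"), and the handling of a missing content type are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Norconex API.
final class TextTypeCheck {

    // Assumed regex equivalent of "any content type starting with text/".
    static final String DEFAULT_REGEX = "^text/.*$";

    // Returns true when a document would be treated as text under the
    // rules described above.
    static boolean isText(String contentType, boolean parsed, String regex) {
        if (parsed) {
            // Post-parse: all documents are assumed to be text.
            return true;
        }
        if (contentType == null) {
            // Assumption: a missing content type is treated as non-text.
            return false;
        }
        // Pre-parse: the whole content type must match the regex.
        return Pattern.matches(regex, contentType);
    }

    public static void main(String[] args) {
        System.out.println(isText("text/html", false, DEFAULT_REGEX));
        System.out.println(isText("application/pdf", false, DEFAULT_REGEX));
        System.out.println(isText("application/pdf", true, DEFAULT_REGEX));
    }
}
```

Passing a broader expression such as `^(text/.*|application/xml)$` as the third argument mirrors what a call to setContentTypeRegex(String) is described as doing: it widens the set of pre-parse documents treated as text.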
Subclasses can restrict which documents this tagger applies to,
based on document metadata (see AbstractRestrictiveHandler).
Subclasses implementing IXMLConfigurable should allow this inner
configuration:
<contentTypeRegex>
    (regex to identify text content-types, overriding default)
</contentTypeRegex>
<restrictTo caseSensitive="[false|true]"
        property="(name of header/metadata field to match)">
    (regular expression of value to match)
</restrictTo>
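To make the restrictTo semantics concrete, here is a hedged sketch of how a property/regex restriction with a caseSensitive flag could be evaluated. It is only an illustration: `RestrictToSketch` and `accepts` are hypothetical names, the metadata is modeled as a plain `Map<String, List<String>>` rather than the library's Properties class, and the choice to accept a document when any value of the property fully matches is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration, not the library's restriction logic.
final class RestrictToSketch {

    // Returns true when at least one value of the given metadata
    // property fully matches the regular expression.
    static boolean accepts(Map<String, List<String>> metadata,
            String property, String regex, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        Pattern pattern = Pattern.compile(regex, flags);
        List<String> values = metadata.get(property);
        if (values == null) {
            // Assumption: a missing property never matches.
            return false;
        }
        for (String value : values) {
            if (pattern.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        metadata.put("document.language", Arrays.asList("EN"));
        System.out.println(accepts(metadata, "document.language", "en|fr", false));
        System.out.println(accepts(metadata, "document.language", "en|fr", true));
    }
}
```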
Constructor and Description |
---|
AbstractCharStreamTagger() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
int | hashCode() |
void | tagDocument(String reference, InputStream document, Properties metadata, boolean parsed) Tags a document with extra metadata information. |
protected abstract void | tagTextDocument(String reference, Reader input, Properties metadata, boolean parsed) |
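Methods inherited from class AbstractTextRestrictiveHandler: documentAccepted, getContentTypeRegex, loadFromXML, saveToXML, setContentTypeRegex, toString
Methods inherited from class AbstractRestrictiveHandler: setRestriction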
public void tagDocument(String reference, InputStream document, Properties metadata, boolean parsed) throws IOException
Tags a document with extra metadata information (description copied from interface IDocumentTagger).
Specified by: tagDocument in interface IDocumentTagger
Parameters:
reference - document reference (e.g. URL)
document - document
metadata - document metadata
parsed - whether the document has been parsed already or not (a parsed document should normally be text-based)
Throws:
IOException - problem reading the document
protected abstract void tagTextDocument(String reference, Reader input, Properties metadata, boolean parsed) throws IOException
Throws:
IOException
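As an illustration of what an implementation of tagTextDocument might look like, the following stand-alone sketch reads the character stream and records a line count as metadata. It does not depend on the library: a real tagger would extend AbstractCharStreamTagger and receive the library's Properties object, which is replaced here by a plain `Map<String, List<String>>`. The class name `LineCountTaggerSketch` and the field name `document.lineCount` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch mirroring the tagTextDocument signature.
final class LineCountTaggerSketch {

    // Hypothetical metadata field name used for this example.
    static final String FIELD = "document.lineCount";

    // Counts the lines available from the reader and stores the count
    // as an extra metadata value. The reader is not closed here; the
    // caller is assumed to own it.
    static void tagTextDocument(String reference, Reader input,
            Map<String, List<String>> metadata, boolean parsed)
            throws IOException {
        BufferedReader reader = new BufferedReader(input);
        int lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        metadata.computeIfAbsent(FIELD, k -> new ArrayList<>())
                .add(Integer.toString(lines));
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> metadata = new HashMap<>();
        tagTextDocument("http://example.com/a.txt",
                new StringReader("first\nsecond\nthird"), metadata, true);
        System.out.println(metadata);
    }
}
```

Working against a Reader rather than an InputStream means the implementation deals with characters only, which fits the class's stated purpose of handling the body of text documents.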
public boolean equals(Object other)
Overrides: equals in class AbstractTextRestrictiveHandler
public int hashCode()
Overrides: hashCode in class AbstractTextRestrictiveHandler
Copyright © 2009-2014 Norconex Inc. All Rights Reserved.