public abstract class AbstractCharStreamTransformer extends AbstractTextRestrictiveHandler implements IDocumentTransformer
Base class for transformers dealing with text documents only. Subclasses can safely be used as either pre-parse or post-parse handlers.
For pre-parsing, non-text documents are simply ignored and no
transformation occurs. To determine whether a document is text, the value
of the Importer.DOC_CONTENT_TYPE metadata field is used. By default,
any content type starting with "text/" is considered text. This default
behavior can be changed with the
AbstractTextRestrictiveHandler.setContentTypeRegex(String) method.
Make sure the expression matches text documents only, to avoid parsing exceptions.
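The content-type test described above can be pictured with plain java.util.regex. This is a minimal sketch, not the library's implementation: the class TextContentTypeCheck, its method isText, and the literal default expression "^text/.*$" are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the Importer API.
final class TextContentTypeCheck {

    // Assumed stand-in for the default rule: anything starting with "text/".
    static final String DEFAULT_REGEX = "^text/.*$";

    private TextContentTypeCheck() {
    }

    // Returns true when the whole content type matches the given regex.
    static boolean isText(String contentType, String regex) {
        if (contentType == null) {
            return false; // no content type: treat the document as non-text
        }
        return Pattern.matches(regex, contentType.trim());
    }
}
```

Under these assumptions, a custom expression such as "^(text/.*|application/xml)$" would also accept XML documents, which is the kind of widening that setContentTypeRegex(String) makes possible. The wider the expression, the more care is needed that only text content is matched.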
For post-parsing, all documents are assumed to be text.
Subclasses can restrict which documents this transformation applies to,
based on document metadata (see AbstractRestrictiveHandler).
Subclasses implementing IXMLConfigurable should allow this inner
configuration:
  <contentTypeRegex>
      (regex to identify text content-types, overriding default)
  </contentTypeRegex>
  <restrictTo caseSensitive="[false|true]"
          property="(name of header/metadata name to match)">
      (regular expression of value to match)
  </restrictTo>
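To make the restrictTo options concrete, the following sketch shows one way a rule with a property name, a regular expression, and a caseSensitive flag could be evaluated. It is an illustration under stated assumptions, not the handler's actual logic: RestrictToSketch and accepts are hypothetical names, java.util.Properties stands in for the library's own metadata type, and the sketch assumes the expression must match the entire value.

```java
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the Importer API.
final class RestrictToSketch {

    private RestrictToSketch() {
    }

    // Returns true when the named metadata value fully matches the regex.
    // Assumption: a missing property never matches.
    static boolean accepts(Properties metadata, String property,
            String regex, boolean caseSensitive) {
        String value = metadata.getProperty(property);
        if (value == null) {
            return false;
        }
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        return Pattern.compile(regex, flags).matcher(value).matches();
    }
}
```

With caseSensitive set to false, a value of "TEXT/HTML" would satisfy the expression "text/.*", while the same value would be rejected when caseSensitive is true.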
Constructor and Description |
---|
AbstractCharStreamTransformer() |
Modifier and Type | Method and Description |
---|---|
boolean | equals(Object other) |
int | hashCode() |
void | transformDocument(String reference, InputStream input, OutputStream output, Properties metadata, boolean parsed) Transforms document content and metadata. |
protected abstract void | transformTextDocument(String reference, Reader input, Writer output, Properties metadata, boolean parsed) |
Inherited methods: documentAccepted, getContentTypeRegex, loadFromXML, saveToXML, setContentTypeRegex, toString, setRestriction
public final void transformDocument(String reference, InputStream input, OutputStream output, Properties metadata, boolean parsed) throws IOException
Description copied from interface: IDocumentTransformer
Transforms document content and metadata.
Specified by: transformDocument in interface IDocumentTransformer
Parameters:
reference - document reference (e.g. URL)
input - document to transform
output - transformed document
metadata - document metadata
parsed - whether the document has been parsed already or not (a parsed document should normally be text-based)
Throws:
IOException - could not transform the document

protected abstract void transformTextDocument(String reference, Reader input, Writer output, Properties metadata, boolean parsed) throws IOException
Throws:
IOException
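The relationship between the two methods above, a final stream-based entry point and an abstract character-based hook, can be sketched as follows. This is a simplified stand-in written for illustration and is not the library's source: CharStreamTransformerSketch and UpperCaseTransformerSketch are hypothetical classes, java.util.Properties replaces the library's metadata type, UTF-8 is an assumed encoding, and the content-type and restriction checks are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Properties;

// Hypothetical stand-in for the base class; NOT the library's code.
abstract class CharStreamTransformerSketch {

    // Wraps the byte streams in character streams, then delegates.
    public final void transformDocument(String reference, InputStream input,
            OutputStream output, Properties metadata, boolean parsed)
            throws IOException {
        // Assumption: UTF-8 encoding on both sides.
        Reader reader = new InputStreamReader(input, StandardCharsets.UTF_8);
        Writer writer = new OutputStreamWriter(output, StandardCharsets.UTF_8);
        transformTextDocument(reference, reader, writer, metadata, parsed);
        writer.flush(); // push buffered characters to the output stream
    }

    protected abstract void transformTextDocument(String reference,
            Reader input, Writer output, Properties metadata, boolean parsed)
            throws IOException;
}

// Example subclass: writes the document text in upper case.
final class UpperCaseTransformerSketch extends CharStreamTransformerSketch {

    @Override
    protected void transformTextDocument(String reference, Reader input,
            Writer output, Properties metadata, boolean parsed)
            throws IOException {
        char[] buffer = new char[4096];
        int read;
        // Copy the text chunk by chunk, upper-casing each chunk.
        while ((read = input.read(buffer)) != -1) {
            output.write(
                    new String(buffer, 0, read).toUpperCase(Locale.ROOT));
        }
    }
}
```

Because the entry point is final in this sketch, a subclass only has to supply the text transformation itself and never handles raw bytes, which mirrors the division of labor suggested by the two signatures.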
public boolean equals(Object other)
Overrides: equals in class AbstractTextRestrictiveHandler
public int hashCode()
Overrides: hashCode in class AbstractTextRestrictiveHandler
Copyright © 2009-2014 Norconex Inc. All Rights Reserved.