public class TextBetweenTagger extends AbstractStringTagger implements IXMLConfigurable
Extracts values found between matching start and end strings and adds them to a document metadata field. The start and end strings are defined in pairs, and multiple pairs can be specified at once. The field specified for a pair of end-points is treated as a multi-value field.
This class can be used as either a pre-parsing or a post-parsing handler.
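As a rough illustration of the behaviour described above, the following self-contained sketch collects the text found between a start and an end regular expression. It is not the Norconex implementation: the class TextBetweenSketch and its extract method are hypothetical names, and the sketch assumes that "inclusive" means the matched start and end text are kept in the extracted value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in used for illustration only; not part of Norconex Importer.
class TextBetweenSketch {

    /**
     * Returns every value found between a match of startRegex and the
     * next match of endRegex that follows it.
     */
    static List<String> extract(String content, String startRegex,
            String endRegex, boolean inclusive, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        Matcher start = Pattern.compile(startRegex, flags).matcher(content);
        Matcher end = Pattern.compile(endRegex, flags).matcher(content);

        List<String> values = new ArrayList<>();
        int from = 0;
        while (from <= content.length() && start.find(from)) {
            // The end marker is only searched for after the start marker.
            if (!end.find(start.end())) {
                break;
            }
            // Assumption: inclusive keeps the markers, otherwise only
            // the text strictly between them is returned.
            values.add(inclusive
                    ? content.substring(start.start(), end.end())
                    : content.substring(start.end(), end.start()));
            // Resume after the end marker; always advance at least one
            // character so empty matches cannot loop forever.
            from = Math.max(end.end(), from + 1);
        }
        return values;
    }
}
```

For example, calling extract on the text "<b>one</b> and <B>two</B>" with start "<b>", end "</b>", inclusive false and caseSensitive false should yield the two values "one" and "two". In the tagger, each such value would be added to the same target field, which is why that field is multi-valued.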
XML configuration usage:
    <tagger class="com.norconex.importer.transformer.impl.TextBetweenTagger"
            inclusive="[false|true]"
            caseSensitive="[false|true]">
        <contentTypeRegex>
            (regex to identify text content-types for pre-import, overriding default)
        </contentTypeRegex>
        <restrictTo caseSensitive="[false|true]"
                property="(name of header/metadata name to match)">
            (regular expression of value to match)
        </restrictTo>
        <textBetween name="targetFieldName">
            <start>(regex)</start>
            <end>(regex)</end>
        </textBetween>
        <!-- multiple textBetween tags allowed -->
    </tagger>
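To make the shape of this configuration more concrete, the sketch below reads the textBetween elements of such an XML fragment into (name, start, end) triples using only the JDK's DOM parser. It is an illustration under stated assumptions, not how loadFromXML is actually implemented: TaggerConfigSketch and readEndpoints are hypothetical names, and each textBetween element is assumed to contain exactly one start and one end child.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper used for illustration only; not part of Norconex Importer.
class TaggerConfigSketch {

    /**
     * Returns one {name, start, end} triple per textBetween element
     * found under the root tagger element.
     */
    static List<String[]> readEndpoints(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getDocumentElement()
                    .getElementsByTagName("textBetween");

            List<String[]> endpoints = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element pair = (Element) nodes.item(i);
                // Assumption: one <start> and one <end> child per pair.
                endpoints.add(new String[] {
                    pair.getAttribute("name"),
                    pair.getElementsByTagName("start").item(0).getTextContent(),
                    pair.getElementsByTagName("end").item(0).getTextContent()
                });
            }
            return endpoints;
        } catch (Exception e) {
            // Wrapped so callers do not have to handle checked parser exceptions.
            throw new IllegalArgumentException("Invalid tagger XML", e);
        }
    }
}
```

Each triple corresponds to the three arguments of addTextEndpoints(name, fromText, toText), so a configuration with several textBetween elements is comparable to calling that method once per element.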
Constructor and Description |
---|
TextBetweenTagger() |
Modifier and Type | Method and Description |
---|---|
void | addTextEndpoints(String name, String fromText, String toText) Adds a new pair of end points to match. |
boolean | isCaseSensitive() |
boolean | isInclusive() |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setCaseSensitive(boolean caseSensitive) Sets whether to ignore case when matching start and end text. |
void | setInclusive(boolean inclusive) Sets whether start and end text pairs should themselves be stripped or not. |
protected void | tagStringDocument(String reference, StringBuilder content, Properties metadata, boolean parsed, boolean partialContent) |
Inherited methods:
equals, hashCode, tagTextDocument, toString
tagDocument
documentAccepted, getContentTypeRegex, loadFromXML, saveToXML, setContentTypeRegex
setRestriction
protected void tagStringDocument(String reference, StringBuilder content, Properties metadata, boolean parsed, boolean partialContent)
Specified by: tagStringDocument in class AbstractStringTagger

public boolean isInclusive()

public void setInclusive(boolean inclusive)
Parameters:
inclusive - true to strip start and end text

public boolean isCaseSensitive()

public void setCaseSensitive(boolean caseSensitive)
Parameters:
caseSensitive - true to consider character case

public void addTextEndpoints(String name, String fromText, String toText)
Parameters:
name - target metadata field name where to store the extracted values
fromText - the left string to match
toText - the right string to match

public void loadFromXML(Reader in) throws IOException
Specified by: loadFromXML in interface IXMLConfigurable
Throws: IOException

public void saveToXML(Writer out) throws IOException
Specified by: saveToXML in interface IXMLConfigurable
Throws: IOException
Copyright © 2009-2014 Norconex Inc. All Rights Reserved.