public class TextExtractor
extends java.lang.Object
TextExtractor
is used to analyze a PDF page and extract words
and logical structures that are visible within a given region. The
resulting list of lines and words can be traversed element by element or
accessed as a string buffer. The class also includes utility methods to
extract PDF text as HTML or XML.
Possible use case scenarios for TextExtractor
include:
The main task of TextExtractor
is to interpret PDF pages and offer a
simple to use API to:
Note: TextExtractor
analyzes only the textual content of the page.
This means that rasterized text (e.g. in scanned pages) or vectorized
text (where glyphs are converted to path outlines) will not be recognized
as text. Please note that it is still possible to extract this content
using the ElementReader
interface.
In some cases TextExtractor
may extract text that does not appear to
be on the visible page (e.g. when text is obscured by an image or a
rectangle). In these situations it is possible to use processing flags
such as 'e_remove_hidden_text'
and 'e_no_invisible_text'
to remove hidden text.
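Because begin() takes its flags as a single int, the two options named above are presumably combined with a bitwise OR, in the same way the XML output flags are combined later on this page. The sketch below only illustrates that pattern: the class FlagSketch and the numeric values are stand-ins assumed for this example, and real code would use the TextExtractor fields directly.

```java
// Illustration of combining int flags with bitwise OR.
// FlagSketch is hypothetical; the values below are assumed, not taken from TextExtractor.
final class FlagSketch {
    static final int E_PUNCT_BREAK = 1 << 2;        // assumed value
    static final int E_REMOVE_HIDDEN_TEXT = 1 << 3; // assumed value
    static final int E_NO_INVISIBLE_TEXT = 1 << 4;  // assumed value

    /** Union of the two flags that remove hidden or invisible text. */
    static int hiddenTextFlags() {
        return E_REMOVE_HIDDEN_TEXT | E_NO_INVISIBLE_TEXT;
    }

    /** True if the given flag bit is present in the combined value. */
    static boolean isSet(int flags, int flag) {
        return (flags & flag) != 0;
    }
}
```

With the real class, the equivalent value would be written as TextExtractor.e_remove_hidden_text | TextExtractor.e_no_invisible_text and passed as the third argument of begin().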
A sample use case:
... Initialize PDFNet ...
PDFDoc doc = new PDFDoc(filein);
doc.initSecurityHandler();
Page page = doc.getPage(1); // first page of the document
TextExtractor txt = new TextExtractor();
// null clip rectangle: read the whole page (assumes null is accepted, since clip_ptr is optional)
txt.begin(page, null, TextExtractor.e_remove_hidden_text);
String text = txt.getAsText();
// or traverse words one by one...
TextExtractor.Word word;
for (TextExtractor.Line line = txt.getFirstLine(); line.isValid(); line = line.getNextLine()) {
    for (word = line.getFirstWord(); word.isValid(); word = word.getNextWord()) {
        String w = word.getString();
    }
}
For full sample code, please take a look at the TextExtract sample project.
Modifier and Type | Class and Description |
---|---|
class | TextExtractor.Line |
class | TextExtractor.Style - A class representing predominant text style associated with a given Line, a Word, or a Glyph. |
class | TextExtractor.Word |
Modifier and Type | Field and Description |
---|---|
static int | e_no_dup_remove - Disables removing duplicated text that is frequently used to achieve visual effects of drop shadow and fake bold |
static int | e_no_invisible_text - Enables removing text that uses rendering mode 3 (i.e. |
static int | e_no_ligature_exp - Disables expanding of ligatures using a predefined mapping. |
static int | e_output_bbox - The Constant e_output_bbox. |
static int | e_output_style_info - The Constant e_output_style_info. |
static int | e_punct_break - Treat punctuation (e.g. |
static int | e_remove_hidden_text - Enables removal of text that is obscured by images or rectangles. |
static int | e_words_as_elements - The Constant e_words_as_elements. |
Constructor and Description |
---|
TextExtractor() - Constructor. |
Modifier and Type | Method and Description |
---|---|
void | begin(Page page) - Start reading the page. |
void | begin(Page page, Rect clip_ptr) - Start reading the page. |
void | begin(Page page, Rect clip_ptr, int flags) - Start reading the page. |
void | destroy() - Frees the native memory of the object. |
java.lang.String | getAsText() - Get all words in the current selection as a single string. |
java.lang.String | getAsText(boolean dehyphen) - Get all words in the current selection as a single string. |
java.lang.String | getAsXML() - Get text content in the form of an XML string. |
java.lang.String | getAsXML(int xml_output_flags) - Get text content in the form of an XML string. |
TextExtractor.Line | getFirstLine() - Get the first line. |
int | getNumLines() - Get the number of lines. |
boolean | getRightToLeftLanguage() |
java.lang.String | getTextUnderAnnot(Annot annot) - Get all the characters that intersect an annotation. |
int | getWordCount() - Get the word count. |
void | setRightToLeftLanguage(boolean rtl) - Sets the directionality of text extractor. |
public static final int e_no_ligature_exp
public static final int e_no_dup_remove
public static final int e_punct_break
public static final int e_remove_hidden_text
public static final int e_no_invisible_text
public static final int e_words_as_elements
public static final int e_output_bbox
public static final int e_output_style_info
public void destroy()
public void begin(Page page)
page - Page to read.
public void begin(Page page, Rect clip_ptr)
page - Page to read.
clip_ptr - A pointer to the optional clipping rectangle. This
parameter can be used to selectively read text from a given rectangle.
public void begin(Page page, Rect clip_ptr, int flags)
page - Page to read.
clip_ptr - A pointer to the optional clipping rectangle. This
parameter can be used to selectively read text from a given rectangle.
flags - A list of ProcessingFlags used to control the text extraction
algorithm.
public int getWordCount()
public java.lang.String getAsText()
public java.lang.String getAsText(boolean dehyphen)
dehyphen - If true, finds and removes hyphens that split words
across two lines. Hyphens are often used at the end of lines as an
indicator that a word spans two lines. Hyphen detection enables removal
of the hyphen character and merging of text runs to form a single word.
This option has no effect on Tagged PDF files.
public java.lang.String getTextUnderAnnot(Annot annot)
annot - The annotation to intersect with.
public java.lang.String getAsXML()
Note: This method returns the same result as calling getAsXML(0).
Please see getAsXML(int) for more information.
public java.lang.String getAsXML(int xml_output_flags)
Note: XML output will be encoded in UTF-8 and will have the following structure:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
<Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
<Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
<Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
...
</Line>
</Para>
</Flow>
</Page>
The above XML output was generated by passing the following union of
flags in the call to getAsXML():
e_words_as_elements | e_output_bbox | e_output_style_info.
If 'xml_output_flags' is not specified, the default XML output
looks as follows:
<Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
<Flow id="1">
<Para id="1">
<Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
<Line>levels. Using the PDFNet PDF library, ...</Line>
...
</Para>
</Flow>
</Page>
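Since the output is ordinary XML, it can be read back with the standard Java DOM parser. The following is a minimal sketch that assumes the word-level structure shown earlier (Word elements carrying a box attribute of comma-separated numbers). The names WordXmlReader and WordBox are hypothetical helpers, not part of the library, and the sketch does not interpret what the four box numbers mean.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text and box numbers of every <Word> element.
final class WordXmlReader {
    /** Text of one <Word> element plus the numbers parsed from its box attribute. */
    static final class WordBox {
        final String text;
        final double[] box; // empty when the element has no box attribute

        WordBox(String text, double[] box) {
            this.text = text;
            this.box = box;
        }
    }

    static List<WordBox> readWords(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = dom.getElementsByTagName("Word");
            List<WordBox> words = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element word = (Element) nodes.item(i);
                words.add(new WordBox(word.getTextContent(), parseBox(word.getAttribute("box"))));
            }
            return words;
        } catch (Exception e) {
            // Surface parser and I/O failures as an unchecked exception.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Splits "a, b, c, d" into doubles; returns an empty array for a missing attribute. */
    private static double[] parseBox(String attr) {
        if (attr.trim().isEmpty()) {
            return new double[0];
        }
        String[] parts = attr.split(",");
        double[] box = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            box[i] = Double.parseDouble(parts[i].trim());
        }
        return box;
    }
}
```

In practice the input string would be the value returned by getAsXML(e_words_as_elements | e_output_bbox). With the default flags there are no Word elements, so readWords would return an empty list.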
xml_output_flags - Flags controlling XML output. For more
information, please see TextExtract::XMLOutputFlags.
public int getNumLines()
public TextExtractor.Line getFirstLine()
Note: To traverse the list of all text lines on the page, use line.getNextLine().
To traverse the list of all words on a given line, use line.getFirstWord().
public boolean getRightToLeftLanguage()
public void setRightToLeftLanguage(boolean rtl)
rtl - The mode; reverses the directionality of the TextExtractor algorithm.