it.unipi.di.tokenizer
Class TermTokenizer
java.lang.Object
it.unipi.di.tokenizer.TermTokenizer
- All Implemented Interfaces:
- Tokenizer
- Direct Known Subclasses:
- FixedTokenizer, URLTokenizer
public class TermTokenizer
- extends Object
- implements Tokenizer
Splits a text into a list of tokens; '\n' is added as well.
A token is a maximal sequence of alphanumeric
characters or of other symbols. Whitespace is considered part of
the preceding token.
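The splitting rule described above can be illustrated with a small, self-contained sketch. This is not the library's implementation: the class TermSplitSketch, its split method, and the use of Character.isLetterOrDigit and Character.isWhitespace to classify characters are all assumptions made for illustration. It also does not model the added '\n', since the description does not say how it is added.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in, NOT the library's TermTokenizer.
final class TermSplitSketch {

    // Character classes assumed by this sketch.
    private static int kind(char c) {
        if (Character.isWhitespace(c)) return 0;     // whitespace
        if (Character.isLetterOrDigit(c)) return 1;  // alphanumeric
        return 2;                                    // any other symbol
    }

    // Splits a line into maximal runs of alphanumerics or of other symbols,
    // attaching trailing whitespace to the token that precedes it.
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = line.length();
        while (i < n) {
            int start = i;
            int k = kind(line.charAt(i));
            // Consume the maximal run of characters of the same class.
            while (i < n && kind(line.charAt(i)) == k) {
                i++;
            }
            // Whitespace following a token becomes part of that token.
            // (Leading whitespace has no preceding token, so this sketch
            // emits it as a token of its own.)
            if (k != 0) {
                while (i < n && kind(line.charAt(i)) == 0) {
                    i++;
                }
            }
            tokens.add(line.substring(start, i));
        }
        return tokens;
    }
}
```

Under these assumptions, TermSplitSketch.split("Hello, world!") yields the four tokens "Hello", ", ", "world" and "!". Concatenating the tokens always reproduces the input line, because no character is dropped.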
- Author:
- Claudio Corsi, Paolo Ferragina
Field Summary
protected static char[] | term
Method Summary
void | close()
it.unimi.dsi.mg4j.util.MutableString | next()
void | reset()
protected it.unimi.dsi.mg4j.util.MutableString[] | split(it.unimi.dsi.mg4j.util.MutableString line)
String | toString()
term
protected static char[] term
TermTokenizer
public TermTokenizer(String file)
throws IOException
- Create a new TermTokenizer over the given file.
- Parameters:
file
- the file to split into tokens
- Throws:
IOException
TermTokenizer
public TermTokenizer(String file,
char separator)
throws IOException
- Create a new TermTokenizer over the given file. The given separator character,
if it occurs in the text, is detected as a distinct token.
- Parameters:
file
- the file to split into tokens
separator
- the character to detect as a distinct token
- Throws:
IOException
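One possible reading of the separator behaviour is sketched below. It is an assumption, not the documented algorithm: the class SeparatorSplitSketch is hypothetical, and it treats every occurrence of the separator as a one-character token of its own that never merges with neighbouring symbols or whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in, NOT the library's TermTokenizer.
final class SeparatorSplitSketch {

    // Character classes assumed by this sketch.
    private static int kind(char c) {
        if (Character.isWhitespace(c)) return 0;     // whitespace
        if (Character.isLetterOrDigit(c)) return 1;  // alphanumeric
        return 2;                                    // any other symbol
    }

    // Splits like the plain rule, but each separator is its own token.
    static List<String> split(String line, char separator) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = line.length();
        while (i < n) {
            int start = i;
            char c = line.charAt(i);
            if (c == separator) {
                // Assumption: the separator always stands alone.
                i++;
            } else {
                int k = kind(c);
                // Maximal run of same-class characters, stopping at a separator.
                while (i < n && line.charAt(i) != separator
                        && kind(line.charAt(i)) == k) {
                    i++;
                }
                // Trailing whitespace joins the preceding token,
                // unless the whitespace character is itself the separator.
                if (k != 0) {
                    while (i < n && line.charAt(i) != separator
                            && kind(line.charAt(i)) == 0) {
                        i++;
                    }
                }
            }
            tokens.add(line.substring(start, i));
        }
        return tokens;
    }
}
```

With this reading, SeparatorSplitSketch.split("a||b c", '|') yields "a", "|", "|", "b " and "c", whereas a rule without a separator would keep "||" together as a single run of symbols.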
toString
public String toString()
- Overrides:
toString
in class Object
split
protected it.unimi.dsi.mg4j.util.MutableString[] split(it.unimi.dsi.mg4j.util.MutableString line)
next
public it.unimi.dsi.mg4j.util.MutableString next()
throws IOException
- Specified by:
next
in interface Tokenizer
- Throws:
IOException
reset
public void reset()
throws IOException
- Specified by:
reset
in interface Tokenizer
- Throws:
IOException
close
public void close()
throws IOException
- Specified by:
close
in interface Tokenizer
- Throws:
IOException
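The three Tokenizer methods suggest a read, rewind, release lifecycle. The sketch below shows that pattern against a hypothetical in-memory stand-in so that it runs without the library or an input file. Several points are assumptions and not stated by this page: that next() returns null once the tokens are exhausted, that reset() rewinds to the first token, and the names TokenSourceSketch, ListTokenSource and TokenizerLifecycleSketch. The stand-in returns String where the real next() returns a MutableString.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the three Tokenizer methods.
interface TokenSourceSketch {
    String next() throws IOException;  // assumed: null when no tokens remain
    void reset() throws IOException;   // assumed: rewinds to the first token
    void close() throws IOException;
}

// In-memory implementation used only to make the sketch runnable.
final class ListTokenSource implements TokenSourceSketch {
    private final List<String> tokens;
    private int pos = 0;
    private boolean closed = false;

    ListTokenSource(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    @Override
    public String next() throws IOException {
        if (closed) throw new IOException("tokenizer is closed");
        return pos < tokens.size() ? tokens.get(pos++) : null;
    }

    @Override
    public void reset() throws IOException {
        if (closed) throw new IOException("tokenizer is closed");
        pos = 0;
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class TokenizerLifecycleSketch {

    // Reads every token, rewinds, reads them again, then closes the source.
    static List<String> readTwice(List<String> tokens) {
        TokenSourceSketch source = new ListTokenSource(tokens);
        List<String> seen = new ArrayList<>();
        try {
            try {
                drain(source, seen);
                source.reset();
                drain(source, seen);
            } finally {
                source.close();  // release the source even if reading fails
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    // Pulls tokens until the assumed end-of-input marker (null) is returned.
    private static void drain(TokenSourceSketch source, List<String> out)
            throws IOException {
        String token;
        while ((token = source.next()) != null) {
            out.add(token);
        }
    }
}
```

Closing in a finally block matters because each of the three methods is declared to throw IOException, so a failed read should not leave the underlying file open.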