it.unipi.di.tokenizer
Class FixedTokenizer
java.lang.Object
it.unipi.di.tokenizer.TermTokenizer
it.unipi.di.tokenizer.FixedTokenizer
- All Implemented Interfaces:
- Tokenizer
public class FixedTokenizer
- extends TermTokenizer
Splits a text into a list of fixed-size tokens. A "\n" token is also added to the list.
- Author:
- Claudio Corsi, Paolo Ferragina
Constructor Summary
FixedTokenizer(String file)
- Create a new FixedTokenizer object over the given file using the default length value of 4.
FixedTokenizer(String file, int length)
- Create a new FixedTokenizer over the given file using a custom token length.
Method Summary
protected it.unimi.dsi.mg4j.util.MutableString[] split(it.unimi.dsi.mg4j.util.MutableString line)
String toString()
DEFAULT_LENGTH
public static final int DEFAULT_LENGTH
- See Also:
- Constant Field Values
FixedTokenizer
public FixedTokenizer(String file)
throws IOException
- Creates a new FixedTokenizer over the given file using the default token length of 4.
- Parameters:
file
- the file to split into tokens
- Throws:
IOException
FixedTokenizer
public FixedTokenizer(String file,
int length)
throws IOException
- Creates a new FixedTokenizer over the given file using a custom token length.
- Parameters:
file
- the file to split into tokens
length
- the (maximum) length of each token, in characters
- Throws:
IOException
toString
public String toString()
- Overrides:
toString
in class TermTokenizer
split
protected it.unimi.dsi.mg4j.util.MutableString[] split(it.unimi.dsi.mg4j.util.MutableString line)
- Overrides:
split
in class TermTokenizer
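The page gives no description for split, so the following is only a minimal sketch of what fixed-size splitting could look like, not the library's implementation. It uses plain String instead of MutableString so that it runs without MG4J. The class name FixedSplitSketch is hypothetical, the constant merely mirrors the documented default of 4, and placing the "\n" token at the end of each line's token list is an assumption: the class description only says that the token is added.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of it.unipi.di.tokenizer.
final class FixedSplitSketch {

    // Mirrors the documented default token length of 4 (assumed to match DEFAULT_LENGTH).
    static final int DEFAULT_LENGTH = 4;

    /**
     * Cuts a line into consecutive chunks of at most `length` characters.
     * The last chunk may be shorter, which matches the "(maximum) length"
     * wording of the constructor parameter.
     */
    static List<String> split(String line, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < line.length(); start += length) {
            int end = Math.min(start + length, line.length());
            tokens.add(line.substring(start, end));
        }
        // Assumption: the "\n" token is appended once per line, after the chunks.
        tokens.add("\n");
        return tokens;
    }

    public static void main(String[] args) {
        // Print each token in brackets so the trailing "\n" token stays visible.
        for (String token : split("tokenizer", DEFAULT_LENGTH)) {
            System.out.println("[" + token.replace("\n", "\\n") + "]");
        }
    }
}
```

With a length of 4, a 9-character line yields two full tokens, one 1-character token, and then the assumed "\n" token. The real class may treat the final partial chunk or the newline token differently.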