it.unipi.di.tokenizer
Class FixedTokenizer

java.lang.Object
  extended by it.unipi.di.tokenizer.TermTokenizer
      extended by it.unipi.di.tokenizer.FixedTokenizer
All Implemented Interfaces:
Tokenizer

public class FixedTokenizer
extends TermTokenizer

Splits a text into a list of fixed-size tokens. A "\n" token is also added.
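
The splitting idea can be illustrated with a minimal, self-contained sketch. This is not the FixedTokenizer source: the class name FixedSplitSketch and its split helper are hypothetical, it works on plain String rather than MutableString, and it assumes that the final chunk of a line may be shorter than the requested length and that the "\n" token is appended at the end of each line.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for the fixed-size splitting idea,
// not the actual FixedTokenizer implementation.
class FixedSplitSketch {

    // Hypothetical helper: cuts a line into chunks of at most `length` chars,
    // then appends a "\n" token (assumed placement: end of each line).
    static List<String> split(String line, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < line.length(); start += length) {
            // The last chunk is shorter when the line length is not a multiple of `length`.
            int end = Math.min(start + length, line.length());
            tokens.add(line.substring(start, end));
        }
        tokens.add("\n");
        return tokens;
    }

    public static void main(String[] args) {
        // With length 4, "tokenizer" yields the tokens "toke", "nize", "r" and "\n".
        System.out.println(split("tokenizer", 4));
    }
}
```

Under these assumptions, a length of 4 (the documented default) turns a 9-character line into two full tokens, one shorter trailing token, and the newline token.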

Author:
Claudio Corsi, Paolo Ferragina

Field Summary
static int DEFAULT_LENGTH
           
 
Fields inherited from class it.unipi.di.tokenizer.TermTokenizer
term
 
Constructor Summary
FixedTokenizer(String file)
          Creates a new FixedTokenizer over the given file, using the default token length of 4.
FixedTokenizer(String file, int length)
          Creates a new FixedTokenizer over the given file, using a custom token length.
 
Method Summary
protected  it.unimi.dsi.mg4j.util.MutableString[] split(it.unimi.dsi.mg4j.util.MutableString line)
           
 String toString()
           
 
Methods inherited from class it.unipi.di.tokenizer.TermTokenizer
close, next, reset
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_LENGTH

public static final int DEFAULT_LENGTH
See Also:
Constant Field Values
Constructor Detail

FixedTokenizer

public FixedTokenizer(String file)
               throws IOException
Creates a new FixedTokenizer over the given file, using the default token length of 4.

Parameters:
file - the file to split into tokens
Throws:
IOException

FixedTokenizer

public FixedTokenizer(String file,
                      int length)
               throws IOException
Creates a new FixedTokenizer over the given file, using a custom token length.

Parameters:
file - the file to split into tokens
length - the (maximum) length of each token, in characters
Throws:
IOException
Method Detail

toString

public String toString()
Overrides:
toString in class TermTokenizer

split

protected it.unimi.dsi.mg4j.util.MutableString[] split(it.unimi.dsi.mg4j.util.MutableString line)
Overrides:
split in class TermTokenizer