it.unipi.di.tokenizer
Class TermTokenizer

java.lang.Object
  extended by it.unipi.di.tokenizer.TermTokenizer
All Implemented Interfaces:
Tokenizer
Direct Known Subclasses:
FixedTokenizer, URLTokenizer

public class TermTokenizer
extends Object
implements Tokenizer

Splits a text into a list of tokens; '\n' is added as well. A token is a maximal sequence of alphanumeric characters or a maximal sequence of other symbols. Whitespace is considered part of the preceding token.
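
The tokenization rule described above can be illustrated with a small stand-alone sketch. This is not the library's code: the class name TokenizeSketch, the method tokenize, and the use of plain String instead of MutableString are all assumptions made for illustration. The sketch also assumes that leading whitespace forms a token of its own, and it treats '\n' as ordinary whitespace rather than modelling how TermTokenizer adds it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the rule "maximal run of alphanumerics or of
// other symbols, with trailing whitespace attached to the preceding token".
public class TokenizeSketch {

    // Character classes used by this sketch (an assumption, not taken from the library).
    private static final int ALNUM = 0, SYMBOL = 1, SPACE = 2;

    private static int kind(char c) {
        if (Character.isWhitespace(c)) return SPACE;
        return Character.isLetterOrDigit(c) ? ALNUM : SYMBOL;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            int start = i;
            int k = kind(text.charAt(i));
            // Consume the maximal run of characters of the same kind.
            while (i < n && kind(text.charAt(i)) == k) i++;
            // Attach any whitespace that follows to the same token.
            while (i < n && kind(text.charAt(i)) == SPACE) i++;
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print each token between brackets so trailing whitespace is visible.
        for (String token : tokenize("Hello, world!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Under these assumptions, the input "Hello, world!" is split into the four tokens "Hello", ", ", "world" and "!"; the space after the comma stays attached to the comma token.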

Author:
Claudio Corsi, Paolo Ferragina

Field Summary
protected static char[] term
           
 
Constructor Summary
TermTokenizer(String file)
          Creates a new TermTokenizer over the given file.
TermTokenizer(String file, char separator)
          Creates a new TermTokenizer over the given file, detecting the given separator character as a distinct token.
 
Method Summary
 void close()
           
 it.unimi.dsi.mg4j.util.MutableString next()
           
 void reset()
           
protected  it.unimi.dsi.mg4j.util.MutableString[] split(it.unimi.dsi.mg4j.util.MutableString line)
           
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

term

protected static char[] term

Constructor Detail

TermTokenizer

public TermTokenizer(String file)
              throws IOException
Creates a new TermTokenizer over the given file.

Parameters:
file - the file to split into tokens
Throws:
IOException

TermTokenizer

public TermTokenizer(String file,
                     char separator)
              throws IOException
Creates a new TermTokenizer over the given file. The given separator character, wherever it occurs, is detected as a distinct token.

Parameters:
file - the file to split into tokens
separator - the character to detect as a distinct token
Throws:
IOException
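
The effect of the separator argument can be sketched in the same stand-alone style. Again, this is only one plausible reading of "detected as distinct token", not the library's implementation: the class SeparatorSketch and its tokenize method are hypothetical, and the sketch assumes that every occurrence of the separator becomes a one-character token, even when it sits inside a run of other symbols.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: like the plain rule, but the separator character
// always ends the current token and is emitted as a token by itself.
public class SeparatorSketch {

    private static final int ALNUM = 0, SYMBOL = 1, SPACE = 2;

    private static int kind(char c) {
        if (Character.isWhitespace(c)) return SPACE;
        return Character.isLetterOrDigit(c) ? ALNUM : SYMBOL;
    }

    static List<String> tokenize(String text, char separator) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            int start = i;
            char first = text.charAt(i);
            if (first == separator) {
                // The separator stands alone (assumption of this sketch).
                i++;
            } else {
                int k = kind(first);
                // Maximal run of the same kind, stopping at the separator.
                while (i < n && text.charAt(i) != separator
                        && kind(text.charAt(i)) == k) i++;
                // Trailing whitespace joins the token, unless it is the separator.
                while (i < n && text.charAt(i) != separator
                        && kind(text.charAt(i)) == SPACE) i++;
            }
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a|b c", '|')) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With '|' as the separator, the sketch splits "x||y" into "x", "|", "|" and "y", whereas without a separator the two bars would form a single symbol token.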

Method Detail

toString

public String toString()
Overrides:
toString in class Object

split

protected it.unimi.dsi.mg4j.util.MutableString[] split(it.unimi.dsi.mg4j.util.MutableString line)

next

public it.unimi.dsi.mg4j.util.MutableString next()
                                          throws IOException
Specified by:
next in interface Tokenizer
Throws:
IOException

reset

public void reset()
           throws IOException
Specified by:
reset in interface Tokenizer
Throws:
IOException

close

public void close()
           throws IOException
Specified by:
close in interface Tokenizer
Throws:
IOException