it.unipi.di.tokenizer
Class URLTokenizer

java.lang.Object
  extended by it.unipi.di.tokenizer.TermTokenizer
      extended by it.unipi.di.tokenizer.URLTokenizer
All Implemented Interfaces:
Tokenizer

public class URLTokenizer
extends TermTokenizer

A Tokenizer for a list of URLs. Each URL is split into the following tokens: protocol, host, port, the directories of the path, and the query part. The token '\n' is also added.
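
The decomposition described above can be illustrated with the standard java.net.URI class. This is only a minimal sketch, not the implementation used by URLTokenizer. The class UrlParts and its tokenize method are hypothetical names introduced for this example. The sketch also assumes that every non-empty path segment (including a trailing file name) becomes a token and that the '\n' token comes last; the description above does not confirm either point.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of it.unipi.di.tokenizer.
final class UrlParts {

    // Splits one URL into: protocol, host, port, path segments, query part,
    // followed by a "\n" token (the position of "\n" is an assumption).
    static List<String> tokenize(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        URI uri = URI.create(url);
        List<String> tokens = new ArrayList<>();

        if (uri.getScheme() != null) {
            tokens.add(uri.getScheme());                 // protocol, e.g. "http"
        }
        if (uri.getHost() != null) {
            tokens.add(uri.getHost());                   // host, e.g. "example.com"
        }
        if (uri.getPort() != -1) {
            tokens.add(Integer.toString(uri.getPort())); // port, only if given explicitly
        }

        String path = uri.getPath();
        if (path != null) {
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    tokens.add(segment);                 // one token per path segment
                }
            }
        }

        if (uri.getQuery() != null) {
            tokens.add(uri.getQuery());                  // query part, without the '?'
        }

        tokens.add("\n");                                // assumed end-of-URL marker
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("http://example.com:8080/a/b/page.html?x=1"));
    }
}
```

A URL without an explicit port or query simply yields fewer tokens, since absent components are skipped rather than emitted as empty strings.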

Author:
Claudio Corsi, Paolo Ferragina

Field Summary
 
Fields inherited from class it.unipi.di.tokenizer.TermTokenizer
term
 
Constructor Summary
URLTokenizer(String file)
          Create a new URLTokenizer over the given file.
 
Method Summary
protected  String[] split(String line)
          Splits the given line (one URL) into its tokens.
 
Methods inherited from class it.unipi.di.tokenizer.TermTokenizer
close, next, reset, split, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

URLTokenizer

public URLTokenizer(String file)
             throws IOException
Create a new URLTokenizer over the given file.

Parameters:
file - the file containing the list of URLs to split into tokens, one URL per line (URLs are separated by '\n')
Throws:
IOException
Method Detail

split

protected String[] split(String line)
Splits the given line (one URL) into its tokens.