it.unipi.di.textdb
Class RSHuffword

java.lang.Object
  extended by it.unipi.di.textdb.TextDB
      extended by it.unipi.di.textdb.RSHuffword

public class RSHuffword
extends TextDB

A TextDB that compresses the source file with the Huffword technique and accesses it through Rank and Select operations. The compressed file is annotated with bit arrays that delimit the fields and the records. At query time, Rank and Select operations over these bit vectors locate the requested fields or records, which are then decompressed and returned.
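
As an illustration of how Rank and Select over a delimiter bit vector can locate a record, here is a minimal sketch. It is not the RSHuffword implementation: the class name RankSelectSketch, its methods, and the linear-scan strategy over java.util.BitSet are assumptions made for the example, and a real succinct structure would answer both queries far faster.

```java
import java.util.BitSet;

/** Hypothetical sketch: naive rank/select over a bit vector marking record starts. */
final class RankSelectSketch {
    private final BitSet bits;   // bit p is set when a record starts at position p
    private final int length;    // total number of positions covered by the vector

    RankSelectSketch(BitSet bits, int length) {
        this.bits = bits;
        this.length = length;
    }

    /** Number of set bits in positions [0, pos). */
    int rank(int pos) {
        return bits.get(0, pos).cardinality();
    }

    /** Position of the k-th set bit (k counted from 0), or -1 if there is none. */
    int select(int k) {
        int p = bits.nextSetBit(0);
        while (p >= 0 && k > 0) {
            p = bits.nextSetBit(p + 1);
            k--;
        }
        return p;
    }

    /** Half-open range [start, end) of the positions occupied by a record. */
    int[] recordBounds(int record) {
        int start = select(record);
        if (start < 0) {
            throw new IndexOutOfBoundsException("no record " + record);
        }
        int next = select(record + 1);
        return new int[] { start, next < 0 ? length : next };
    }

    public static void main(String[] args) {
        // Three records over 10 positions, starting at positions 0, 4 and 7.
        BitSet marks = new BitSet(10);
        marks.set(0);
        marks.set(4);
        marks.set(7);
        RankSelectSketch rs = new RankSelectSketch(marks, 10);
        int[] b = rs.recordBounds(1);
        System.out.println(b[0] + ".." + b[1]); // prints 4..7
        System.out.println(rs.rank(5));         // prints 2
    }
}
```

In this sketch, select(r) gives the start of record r and select(r + 1) gives its end, so only that slice would need to be decompressed.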
The Huffword dictionary is built using a Tokenizer instance to split the source text into terms. Each term is assigned a codeword depending on its frequency in the text. A Huffman prefix-free code is then used to encode those codewords.
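
The frequency-based codeword assignment can be sketched as follows. This is an illustration only, under stated assumptions: the class HuffwordSketch is hypothetical, whitespace splitting stands in for the Tokenizer, and codewords are shown as bit strings; whether RSHuffword uses bit-oriented or byte-oriented codewords is not stated in this page.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical sketch: word-based Huffman codes from term frequencies. */
final class HuffwordSketch {
    /** Tree node: a leaf holds a term, an internal node holds two children. */
    private static final class Node {
        final String term;
        final long freq;
        final int order;       // insertion order, used only to break frequency ties
        final Node left;
        final Node right;

        Node(String term, long freq, int order, Node left, Node right) {
            this.term = term;
            this.freq = freq;
            this.order = order;
            this.left = left;
            this.right = right;
        }
    }

    /** Maps each whitespace-separated term of the text to a prefix-free bit string. */
    static Map<String, String> buildCodes(String text) {
        Map<String, Long> freq = new TreeMap<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                freq.merge(t, 1L, Long::sum);
            }
        }
        PriorityQueue<Node> pq = new PriorityQueue<>(
                Comparator.comparingLong((Node n) -> n.freq).thenComparingInt(n -> n.order));
        int order = 0;
        for (Map.Entry<String, Long> e : freq.entrySet()) {
            pq.add(new Node(e.getKey(), e.getValue(), order++, null, null));
        }
        // Repeatedly merge the two least frequent subtrees.
        while (pq.size() > 1) {
            Node a = pq.poll();
            Node b = pq.poll();
            pq.add(new Node(null, a.freq + b.freq, order++, a, b));
        }
        Map<String, String> codes = new HashMap<>();
        if (!pq.isEmpty()) {
            assign(pq.poll(), "", codes);
        }
        return codes;
    }

    private static void assign(Node n, String prefix, Map<String, String> codes) {
        if (n.left == null) {
            // A single-term vocabulary still needs a non-empty codeword.
            codes.put(n.term, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        // More frequent terms receive shorter codewords.
        System.out.println(new TreeMap<>(buildCodes("a b a c a b a d")));
    }
}
```

Because no codeword is a prefix of another, a decoder can start at a position found through Select and read term by term without extra delimiters.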

Author:
Claudio Corsi, Paolo Ferragina, Alessandro Barilari

Field Summary
static int DEFAULT_BUCKET_LENGTH
           
 
Fields inherited from class it.unipi.di.textdb.TextDB
DEFAULT_FIELD_SEPARATOR, fieldSeparator, filename
 
Constructor Summary
RSHuffword(String filename)
          Create a new RSHuffword object loading the needed data structures from the provided file.
 
Method Summary
 TextDB build(String outfile, PrintStream log)
          Compress with a standard bucketed huffword technique.
static TextDB build(Tokenizer tokenizer, String inputfile, String outfile, PrintStream log, boolean withFields, char separator)
          Compress the input file with the bucketed huffword technique using customized parameters.
 void close()
          Closes the TextDB and releases all of its resources.
 String get(int record)
          Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.
 String get(int record, int field)
          Returns the field of a record, given their ordinal positions, or null if one is not present.
 String[] getRange(int i, int j)
          Returns the records having positions from i to j in the TextDB.
 void getRange(int i, int j, int field, BufferedWriter out)
          Prints the specified field for the records in the range [i,j] to the passed BufferedWriter.
 String[] getSequential(int[] records)
          Given a sorted array of record positions, this method returns all of them.
 void getSequential(int[] records, int field, BufferedWriter out)
          Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
 String[] getSequential(int[] records, int pos, int length)
          Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.
static void main(String[] args)
           
 void open()
          Opens the TextDB.
 int size()
          Returns the number of records contained in this TextDB.
 
Methods inherited from class it.unipi.di.textdb.TextDB
build, fromTDBFile, getField, getFieldValues, getName, getRange, getRecordFields, getSequential, setFieldSeparator
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_BUCKET_LENGTH

public static final int DEFAULT_BUCKET_LENGTH
See Also:
Constant Field Values
Constructor Detail

RSHuffword

public RSHuffword(String filename)
Create a new RSHuffword object loading the needed data structures from the provided file.

Parameters:
filename - the TDB file containing the needed data structures and the compressed content
Method Detail

build

public TextDB build(String outfile,
                    PrintStream log)
             throws IOException
Compress with a standard bucketed huffword technique. This method will use a TermTokenizer to build the huffword dictionary.
The TermTokenizer uses the default separator (the blank space character) to recognize fields.

See the static method build(Tokenizer, String, String, PrintStream, boolean, char) to compress with customized parameters.

Specified by:
build in class TextDB
Parameters:
log - a PrintStream where to print log messages. A null value will suppress any output message
outfile - The output file name.
Returns:
A TextDB instance to access the built database.
Throws:
IOException

build

public static TextDB build(Tokenizer tokenizer,
                           String inputfile,
                           String outfile,
                           PrintStream log,
                           boolean withFields,
                           char separator)
                    throws IOException
Compress the input file with the bucketed huffword technique using customized parameters.

Parameters:
tokenizer - the tokenizer used to parse the input file
inputfile - the input file name
log - a PrintStream where to print log messages. A null value will suppress any output message
Returns:
the TextDB to access the built database (to be opened)
Throws:
IOException

open

public void open()
          throws IOException
Description copied from class: TextDB
Opens the TextDB.
This method has to be called before any other operation on the TextDB.

Overrides:
open in class TextDB
Throws:
IOException

close

public void close()
           throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.

Overrides:
close in class TextDB
Throws:
IOException

get

public String get(int record)
           throws IOException
Description copied from class: TextDB
Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.

Specified by:
get in class TextDB
Parameters:
record - a position in the range [0, N-1]
Returns:
the requested record
Throws:
IOException

get

public String get(int record,
                  int field)
           throws IOException
Returns the field of a record, given their ordinal positions, or null if one is not present.

Overrides:
get in class TextDB
Parameters:
record - the position of a record
field - the position of the field to be retrieved
Returns:
the requested field for that record
Throws:
IOException

getRange

public String[] getRange(int i,
                         int j)
                  throws IOException
Description copied from class: TextDB
Returns the records having positions from i to j in the TextDB.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Returns:
the records in the defined range
Throws:
IOException

getRange

public void getRange(int i,
                     int j,
                     int field,
                     BufferedWriter out)
              throws IOException
Description copied from class: TextDB
Prints the specified field for the records in the range [i,j] to the passed BufferedWriter. If a record does not contain the field, an empty line is written instead.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position (counting from 0) of the field to return for all the records in range, or -1 to retrieve the entire record
out - the output BufferedWriter
Throws:
IOException

getSequential

public String[] getSequential(int[] records)
                       throws IOException
Description copied from class: TextDB
Given a sorted array of record positions, this method returns all of them.

If some of the requested records are not available, the behavior is unspecified and depends on the underlying implementation.

Overrides:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getSequential

public String[] getSequential(int[] records,
                              int pos,
                              int length)
                       throws IOException
Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.

The fetched positions are the ones in the range records[pos] (included) to records[pos+length] (excluded).
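
The half-open subrange can be made concrete with a small standalone sketch. The class SubrangeSketch and its subrange method are hypothetical helpers written for this example; they only show which positions are selected and do not touch a TextDB.

```java
import java.util.Arrays;

/** Hypothetical sketch: which positions a (records, pos, length) triple selects. */
final class SubrangeSketch {
    /** Returns records[pos] .. records[pos + length - 1], checking that the slice is sorted. */
    static int[] subrange(int[] records, int pos, int length) {
        if (pos < 0 || length < 0 || pos + length > records.length) {
            throw new IndexOutOfBoundsException("pos=" + pos + ", length=" + length);
        }
        int[] slice = Arrays.copyOfRange(records, pos, pos + length);
        for (int k = 1; k < slice.length; k++) {
            if (slice[k - 1] > slice[k]) {
                throw new IllegalArgumentException("subrange is not sorted at offset " + k);
            }
        }
        return slice;
    }

    public static void main(String[] args) {
        // Only the slice starting at index 1 with length 3 has to be sorted.
        int[] records = {9, 2, 5, 8, 1};
        System.out.println(Arrays.toString(subrange(records, 1, 3))); // prints [2, 5, 8]
    }
}
```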

Specified by:
getSequential in class TextDB
Parameters:
records - array with a sorted subrange of record positions
pos - the starting position of the subrange
length - the length of the subrange
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getSequential

public void getSequential(int[] records,
                          int field,
                          BufferedWriter out)
                   throws IOException
Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. If a record doesn't contain the requested field, the behavior of the method depends on its implementation (implementing classes are encouraged to dump a new line in this case, i.e. empty string).
To dump all fields of the specified records, pass -1 as the field position.

The retrieved records are not kept in memory but immediately written to the provided BufferedWriter, so no further memory is used.

NOTE: implementations can use the method TextDB.getField(String, int) provided by this abstract class that selects a field of a record through a sequential access to the record itself. The use of a more efficient implementation of this function is encouraged.
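
The field-selection behavior described above, including the -1 convention and the empty line for a missing field, can be sketched with plain strings. The class FieldSketch and its methods are hypothetical and are not the code of TextDB.getField(String, int); the sketch only assumes that fields are delimited by a single separator character.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

/** Hypothetical sketch: sequential field selection inside a record. */
final class FieldSketch {
    /** Returns the requested field, the whole record for -1, or "" if the field is absent. */
    static String fieldOf(String record, int field, char separator) {
        if (field < -1) {
            throw new IllegalArgumentException("field must be >= -1");
        }
        if (field == -1) {
            return record;
        }
        int start = 0;
        // Skip one separator per preceding field.
        for (int k = 0; k < field; k++) {
            int sep = record.indexOf(separator, start);
            if (sep < 0) {
                return "";
            }
            start = sep + 1;
        }
        int end = record.indexOf(separator, start);
        return end < 0 ? record.substring(start) : record.substring(start, end);
    }

    /** Writes one line per record, without collecting the results in memory. */
    static void dump(String[] records, int field, char separator, BufferedWriter out)
            throws IOException {
        for (String r : records) {
            out.write(fieldOf(r, field, separator));
            out.newLine();
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        dump(new String[] {"alice bob", "carol"}, 1, ' ', new BufferedWriter(sink));
        System.out.print(sink); // "bob", then an empty line for the record lacking field 1
    }
}
```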

Specified by:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output BufferedWriter
Throws:
IOException

size

public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB. If N is the returned value then records of this database are numbered from 0 to N-1.

Specified by:
size in class TextDB
Returns:
the size of this TextDB as the number of the contained records

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception