it.unipi.di.textdb
Class BucketedHuffword

java.lang.Object
  extended by it.unipi.di.textdb.TextDB
      extended by it.unipi.di.textdb.BucketedHuffword

public class BucketedHuffword
extends TextDB

TextDB implementation that uses a combination of the Huffword data compressor and a bucketing scheme. Each bucket consists of a fixed-number of contiguous records, and has variable length. The whole file is compressed with Huffword, and pointers to compressed buckets (also called jumpers) are kept in a file on disk.

At query time the bucket containing the requested record is identified, using its corresponding jumper, loaded in memory and read sequentially until the requested record is met. Given the properties of Huffword, only the requested portion of a bucket must be uncompressed.
The tokens constituting the Huffword's alphabet are computed using a Tokenizer. Codewords are assigned to each token by considering their frequencies in the input file. Optionally, the ZetaCompressor can be used to assign codewords to tokens: sort tokens by decreasing frequency and Zeta-encode their ranks (in this order). This avoids the need to explicitly store the dictionary of codewords (but the output is sub-optimal).

The Huffman and ZetaCode implementations are provided by the library MG4J (http://mg4j.dsi.unimi.it/).

Author:
Claudio Corsi, Paolo Ferragina

Field Summary
static int DEFAULT_BUCKET_LENGTH
           
static int DEFAULT_ZETA_K
           
static int HUFFMAN_CODEC
           
static int RECORD_BUFFER_LENGTH
          The default size of the in memory buffer where to store the loaded record
static int ZETA_CODEC
           
 
Fields inherited from class it.unipi.di.textdb.TextDB
DEFAULT_FIELD_SEPARATOR, fieldSeparator, filename
 
Constructor Summary
BucketedHuffword(String filename)
          Create a new BucketedHuffword object loading the needed data structures from the provided file.
 
Method Summary
 TextDB build(String outfile, PrintStream log)
          Compresses the input file with BucketedHuffword.
static TextDB build(Tokenizer tokenizer, String inputfile, String outfile, int bucketLen, PrintStream log)
          Compresses the input file with Bucketed Huffword using a set of custom parameters.
 void close()
          Closes the TextDB and releases all of its resources.
 String get(int record)
          Returns the record for a given position in the compressed file, null if this position is out of range.
 String[] getRange(int i, int j)
          Returns the records having positions from i to j in the TextDB.
 String[] getRange(int i, int j, int field)
          Returns the specified field for the records in the range [i,j].
 void getRange(int i, int j, int field, BufferedWriter out)
          Print on the passed PrintStream the specified field for the records in the range [i,j].
 String[] getSequential(int[] records)
          Given a sorted array of record positions, this method returns all of them.
 void getSequential(int[] records, int field, BufferedWriter out)
          Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
 String[] getSequential(int[] records, int pos, int length)
          Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.
static void main(String[] args)
           
 void open()
          Opens the TextDB.
static void setCodec(int codec)
           
static void setZetaKParameter(int k)
           
 int size()
          Returns the number of records contained in this TextDB.
 
Methods inherited from class it.unipi.di.textdb.TextDB
build, fromTDBFile, get, getField, getFieldValues, getName, getRecordFields, getSequential, setFieldSeparator
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ZETA_CODEC

public static final int ZETA_CODEC
See Also:
Constant Field Values

HUFFMAN_CODEC

public static final int HUFFMAN_CODEC
See Also:
Constant Field Values

DEFAULT_BUCKET_LENGTH

public static final int DEFAULT_BUCKET_LENGTH
See Also:
Constant Field Values

DEFAULT_ZETA_K

public static final int DEFAULT_ZETA_K
See Also:
Constant Field Values

RECORD_BUFFER_LENGTH

public static final int RECORD_BUFFER_LENGTH
The default size of the in memory buffer where to store the loaded record

See Also:
Constant Field Values
Constructor Detail

BucketedHuffword

public BucketedHuffword(String filename)
Create a new BucketedHuffword object loading the needed data structures from the provided file.

Parameters:
filename - the TDB file containing the needed data structures and the compressed content
Method Detail

setCodec

public static void setCodec(int codec)

setZetaKParameter

public static void setZetaKParameter(int k)

close

public void close()
           throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.

Overrides:
close in class TextDB
Throws:
IOException

build

public TextDB build(String outfile,
                    PrintStream log)
             throws IOException
Compresses the input file with BucketedHuffword. The default bucket length is 100, and the tokens are identified via a TermTokenizer.

See the static method build(Tokenizer, String, String, int, PrintStream) to compress with a set of customized parameters.

Specified by:
build in class TextDB
Parameters:
log - a PrintStream where log messages are print out. A null value will suppress any output message
outfile - The output file name.
Returns:
A TextDB instance to access the built database.
Throws:
IOException

build

public static TextDB build(Tokenizer tokenizer,
                           String inputfile,
                           String outfile,
                           int bucketLen,
                           PrintStream log)
                    throws IOException
Compresses the input file with Bucketed Huffword using a set of custom parameters.

Parameters:
tokenizer - the tokenizer used to define the Huffoword tokens from the input file
inputfile - the input file name
bucketLen - the number of records per bucket
log - a PrintStream where log messages are print out. A null value will suppress any output message
Returns:
A TextDB instance to access the built database.
Throws:
IOException

open

public void open()
          throws IOException
Description copied from class: TextDB
Opens the TextDB.
This method has to be called before any other operation on the TextDB.

Overrides:
open in class TextDB
Throws:
IOException

size

public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB. If N is the returned value then records of this database are numbered from 0 to N-1.

Specified by:
size in class TextDB
Returns:
the size of this TextDB as the number of the contained records

get

public String get(int record)
           throws IOException
Returns the record for a given position in the compressed file, null if this position is out of range.

Specified by:
get in class TextDB
Parameters:
record - The record number in the range [0, N-1]
Returns:
the record or null
Throws:
IOException

getSequential

public void getSequential(int[] records,
                          int field,
                          BufferedWriter out)
                   throws IOException
Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. If a record doesn't contain the requested field, the behavior of the method depends on its implementation (implementing classes are encouraged to dump a new line in this case, i.e. empty string).
In order to dump all fields of the specified records, you have to input the integer -1 as field position.

The retrieved records are not kept in memory but immediately dumped on the provided PrintStream without wasting further memory.

NOTE: implementations can use the method TextDB.getField(String, int) provided by this abstract class that selects a field of a record through a sequential access to the record itself. The use of a more efficient implementation of this function is encouraged.

Specified by:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output BufferedWriter
Throws:
IOException

getSequential

public String[] getSequential(int[] records)
                       throws IOException
Description copied from class: TextDB
Given a sorted array of record positions, this method returns all of them.

If some of the requested records are not available, the behavior is unspecified and depend on the underlying implementation.

Overrides:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getSequential

public String[] getSequential(int[] records,
                              int pos,
                              int length)
                       throws IOException
Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.

The fetched positions are the ones in the range records[pos] (included) to records[pos+length] (exluded).

Specified by:
getSequential in class TextDB
Parameters:
records - array with a sorted subrange of records positions
pos - the starting position of the subrange
length - the length of the subrange
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getRange

public String[] getRange(int i,
                         int j)
                  throws IOException
Description copied from class: TextDB
Returns the records having positions from i to j in the TextDB.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Returns:
the records in the defined range
Throws:
IOException

getRange

public String[] getRange(int i,
                         int j,
                         int field)
                  throws IOException
Description copied from class: TextDB
Returns the specified field for the records in the range [i,j]. If not present, a null value is stored in the corresponding position of the returned array.
The default implementation is based on a sequential scan of the fetched records.

Overrides:
getRange in class TextDB
Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position of the field to return for all those records
Returns:
the field of the records in the range [i,j]
Throws:
IOException

getRange

public void getRange(int i,
                     int j,
                     int field,
                     BufferedWriter out)
              throws IOException
Description copied from class: TextDB
Print on the passed PrintStream the specified field for the records in the range [i,j]. If not present, an empty line will be dumped out.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position (counting from 0) of the field to return for all the records in range, or -1 to retrieve the entire record
out - the output BufferedWriter
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception