it.unipi.di.textdb
Class FrontCoding

java.lang.Object
  extended by it.unipi.di.textdb.TextDB
      extended by it.unipi.di.textdb.FrontCoding
All Implemented Interfaces:
SearchableDB

public class FrontCoding
extends TextDB
implements SearchableDB

A TextDB that uses the Front-Coding technique to compress its records and a bucketing scheme to provide efficient access to them. A bucket is defined as a fixed-number of contiguous records, and has variable length. In each bucket the first record is stored explicitly, whereas any other record is front-encoded by squeezing out the longest prefix (LP) it shares with the preceding record in its bucket. The squeezing consists of substituting LP with its length (encoded using 1 or 3 bytes). Optionally a front-coded bucket can be further compressed using GZip. The pointer to the beginning of each bucket (also called jumper) is stored in a separate file on disk.

At query time the bucket containing the requested record is identified, using its jumper, loaded in memory and accessed sequentially until the record is met.


To achieve higher compression, this scheme should be applied over a TextDB whose records are sorted lexicographically (see ExternalSort).

Author:
Claudio Corsi, Paolo Ferragina

Field Summary
static char BIGGER_CHAR
           
static int DEFAULT_BUCKET_SIZE
           
static int DEFAULT_COMPRESSION_LEVEL
           
static byte ESCAPE_VALUE_BYTE
           
static char SMALLER_CHAR
           
 
Fields inherited from class it.unipi.di.textdb.TextDB
DEFAULT_FIELD_SEPARATOR, fieldSeparator, filename
 
Constructor Summary
FrontCoding(String filename)
          Created a new FrontCoding object from the provided file.
FrontCoding(String filename, boolean searchSupport)
          Created a new FrontCoding object from the provided file.
 
Method Summary
 long bucketSize()
           
 TextDB build(String outfile, PrintStream log)
          Compresses the TextDB via (plain) front-coding with buckets of 100 records.
static TextDB build(String inputfile, String outfile, int bucketSize, boolean compress, boolean frontCoding, boolean searchSupport, int level, PrintStream log)
          Compresses the input file with a customized compression scheme, which combines frontcoding and Zip.
 void close()
          Closes the TextDB and releases all of its resources.
 float compressRatio()
           
 String dictName()
           
 long dictSize()
           
 long fcBeginSize()
           
 long fcDictSize()
           
 long fcJumpSize()
           
 long fcTotalSize()
           
 String get(int record)
          Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.
 String[] getRange(int i, int j)
          Returns the records having positions from i to j in the TextDB.
 void getRange(int i, int j, int field, BufferedWriter out)
          Print on the passed PrintStream the specified field for the records in the range [i,j].
 String[] getSequential(int[] records)
          Given a sorted array of record positions, this method returns all of them.
 void getSequential(int[] records, int field, BufferedWriter out)
          Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
 String[] getSequential(int[] records, int pos, int length)
          Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.
protected  int locate(String p)
          Returns the position of the first record having prefix p (here records are assumed to be sorted), or the alphabetical position of p among the records in the TextDB.
static void main(String[] args)
           
 long numBuckets()
           
 long numStrings()
           
 void open()
          It opens and loads from disk all data structures needed to access the compressed TextDB.
 Range prefix(String p)
          Returns the range [i, j) of positions identifying the records in the (ordered) TextDB which are prefixed by p.
 int rank(String s)
          It returns the rank r of the smallest record in the current TextDB which is larger than or equal to s.
 int size()
          Returns the number of records contained in this TextDB.
 
Methods inherited from class it.unipi.di.textdb.TextDB
build, fromTDBFile, get, getField, getFieldValues, getName, getRange, getRecordFields, getSequential, setFieldSeparator
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_BUCKET_SIZE

public static final int DEFAULT_BUCKET_SIZE
See Also:
Constant Field Values

DEFAULT_COMPRESSION_LEVEL

public static final int DEFAULT_COMPRESSION_LEVEL
See Also:
Constant Field Values

SMALLER_CHAR

public static final char SMALLER_CHAR
See Also:
Constant Field Values

BIGGER_CHAR

public static final char BIGGER_CHAR
See Also:
Constant Field Values

ESCAPE_VALUE_BYTE

public static final byte ESCAPE_VALUE_BYTE
See Also:
Constant Field Values
Constructor Detail

FrontCoding

public FrontCoding(String filename)
Created a new FrontCoding object from the provided file.

Parameters:
filename - the TDB file containing the front coded records and the data structures needed to access them

FrontCoding

public FrontCoding(String filename,
                   boolean searchSupport)
Created a new FrontCoding object from the provided file.

Parameters:
filename - the TDB file containing the front coded records and the data structures needed to access them
searchSupport - if false don't load the data structures to support the searches defined in the SearchableDB interface
Method Detail

dictName

public String dictName()

dictSize

public long dictSize()

numBuckets

public long numBuckets()

bucketSize

public long bucketSize()

numStrings

public long numStrings()

fcBeginSize

public long fcBeginSize()

fcDictSize

public long fcDictSize()

fcJumpSize

public long fcJumpSize()

fcTotalSize

public long fcTotalSize()

compressRatio

public float compressRatio()

build

public TextDB build(String outfile,
                    PrintStream log)
             throws IOException
Compresses the TextDB via (plain) front-coding with buckets of 100 records. There is no support for the interface SearchableDB.

Specified by:
build in class TextDB
Parameters:
log - a PrintStream where to print log messages. A null value will suppress any output message
outfile - the output file name
Returns:
A TextDB instance to access the built database.
Throws:
IOException

build

public static TextDB build(String inputfile,
                           String outfile,
                           int bucketSize,
                           boolean compress,
                           boolean frontCoding,
                           boolean searchSupport,
                           int level,
                           PrintStream log)
                    throws IOException
Compresses the input file with a customized compression scheme, which combines frontcoding and Zip.

Parameters:
inputfile - the file to compress
outfile - the output file name
bucketSize - the number of contiguous records forming a bucket
compress - if true, ZIP is used onto the bucket (possibly front-compressed).
frontCoding - if true, front-coding is used over the buckets (before applying Gzip, if any)
searchSupport - build the support for the SearchableDB interface
level - compression level if compress=true. An integer between 0 and 9 (inclusive)
log - a PrintStream where to print log messages. A null value will suppress any output message
Returns:
A TextDB instance to access the built database.
Throws:
IOException

open

public void open()
          throws IOException
It opens and loads from disk all data structures needed to access the compressed TextDB.

Overrides:
open in class TextDB
Throws:
IOException

close

public void close()
           throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.

Overrides:
close in class TextDB
Throws:
IOException

locate

protected int locate(String p)
              throws IOException
Returns the position of the first record having prefix p (here records are assumed to be sorted), or the alphabetical position of p among the records in the TextDB.

Parameters:
p - The prefix to search.
Returns:
The position of (the first record having prefix) p.
Throws:
IOException

prefix

public Range prefix(String p)
             throws IOException
Returns the range [i, j) of positions identifying the records in the (ordered) TextDB which are prefixed by p. If no such record does exist, an empty range [i, i) is returned, where i is the alphabetic position of p within the ordered TextDB.

Specified by:
prefix in interface SearchableDB
Parameters:
p - the prefix to search
Returns:
the range [i, j) represented in a Range object
Throws:
IOException

rank

public int rank(String s)
         throws IOException
It returns the rank r of the smallest record in the current TextDB which is larger than or equal to s. Note that the rank is counted from 1, thus guaranteeing that if s is not found, the absolute value of the returned (negative) result is the alphabetic position of s among the records in the current TextDB.

Specified by:
rank in interface SearchableDB
Parameters:
s - The string to be ranked.
Returns:
the position pos of s, if s occurs in the TextDB, or the value -pos
Throws:
IOException

size

public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB. If N is the returned value then records of this database are numbered from 0 to N-1.

Specified by:
size in class TextDB
Returns:
the size of this TextDB as the number of the contained records

get

public String get(int record)
           throws IOException
Description copied from class: TextDB
Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.

Specified by:
get in class TextDB
Parameters:
record - a position in the range [0, N-1]
Returns:
the requested record
Throws:
IOException

getRange

public String[] getRange(int i,
                         int j)
                  throws IOException
Description copied from class: TextDB
Returns the records having positions from i to j in the TextDB.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Returns:
the records in the defined range
Throws:
IOException

getRange

public void getRange(int i,
                     int j,
                     int field,
                     BufferedWriter out)
              throws IOException
Description copied from class: TextDB
Print on the passed PrintStream the specified field for the records in the range [i,j]. If not present, an empty line will be dumped out.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position (counting from 0) of the field to return for all the records in range, or -1 to retrieve the entire record
out - the output BufferedWriter
Throws:
IOException

getSequential

public String[] getSequential(int[] records)
                       throws IOException
Description copied from class: TextDB
Given a sorted array of record positions, this method returns all of them.

If some of the requested records are not available, the behavior is unspecified and depend on the underlying implementation.

Overrides:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getSequential

public String[] getSequential(int[] records,
                              int pos,
                              int length)
                       throws IOException
Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.

The fetched positions are the ones in the range records[pos] (included) to records[pos+length] (exluded).

Specified by:
getSequential in class TextDB
Parameters:
records - array with a sorted subrange of records positions
pos - the starting position of the subrange
length - the length of the subrange
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getSequential

public void getSequential(int[] records,
                          int field,
                          BufferedWriter out)
                   throws IOException
Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. If a record doesn't contain the requested field, the behavior of the method depends on its implementation (implementing classes are encouraged to dump a new line in this case, i.e. empty string).
In order to dump all fields of the specified records, you have to input the integer -1 as field position.

The retrieved records are not kept in memory but immediately dumped on the provided PrintStream without wasting further memory.

NOTE: implementations can use the method TextDB.getField(String, int) provided by this abstract class that selects a field of a record through a sequential access to the record itself. The use of a more efficient implementation of this function is encouraged.

Specified by:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output BufferedWriter
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception