it.unipi.di.textdb
Class BucketedZip

java.lang.Object
  extended by it.unipi.di.textdb.TextDB
      extended by it.unipi.di.textdb.BucketedZip

public class BucketedZip
extends TextDB

This is a TextDB that combines a bucketing scheme with the Zip data compression technique. A bucket is a fixed number of contiguous records. Each bucket is compressed with Zip (so buckets have variable length) and is accessed via a pointer (also called a jumper) kept in a file on disk.

At query time, the bucket containing the requested record is located through its jumper, loaded into memory, and decompressed (possibly in full) until the requested record is reached.
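The scheme described above can be sketched with the standard java.util.zip classes. This is a minimal in-memory illustration, not the BucketedZip implementation: the class name BucketedStoreSketch, the newline used to join records inside a bucket, and the int[] jumper table held in memory (rather than in a file on disk) are all assumptions made for the example. Unlike the description above, the sketch always decompresses the whole bucket instead of stopping at the requested record.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Illustrative only: a hypothetical in-memory bucketed store, not BucketedZip itself. */
final class BucketedStoreSketch {
    private final byte[] data;     // all compressed buckets, concatenated
    private final int[] jumpers;   // jumpers[b] = byte offset of bucket b; last entry = data.length
    private final int bucketSize;  // number of records per bucket (the last bucket may be shorter)
    private final int size;        // total number of records

    BucketedStoreSketch(List<String> records, int bucketSize, int level) {
        this.bucketSize = bucketSize;
        this.size = records.size();
        int buckets = (size + bucketSize - 1) / bucketSize;
        this.jumpers = new int[buckets + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b = 0; b < buckets; b++) {
            jumpers[b] = out.size();
            int from = b * bucketSize;
            int to = Math.min(size, from + bucketSize);
            // Assumption: records never contain '\n', so it can delimit records in a bucket.
            String joined = String.join("\n", records.subList(from, to));
            byte[] compressed = deflate(joined.getBytes(StandardCharsets.UTF_8), level);
            out.write(compressed, 0, compressed.length);
        }
        jumpers[buckets] = out.size();
        this.data = out.toByteArray();
    }

    int size() {
        return size;
    }

    /** Finds the bucket through its jumper, inflates it, and picks the record inside it. */
    String get(int record) {
        if (record < 0 || record >= size) {
            throw new IndexOutOfBoundsException("record " + record);
        }
        int b = record / bucketSize;
        Inflater inflater = new Inflater();
        inflater.setInput(data, jumpers[b], jumpers[b + 1] - jumpers[b]);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break; // truncated bucket: stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupted bucket " + b, e);
        } finally {
            inflater.end();
        }
        String[] lines = new String(out.toByteArray(), StandardCharsets.UTF_8).split("\n", -1);
        return lines[record % bucketSize];
    }

    private static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("alpha", "beta", "gamma", "delta", "epsilon");
        BucketedStoreSketch store = new BucketedStoreSketch(records, 2, Deflater.BEST_COMPRESSION);
        System.out.println(store.get(3)); // record 3 is the second record of bucket 1
    }
}
```

The trade-off visible in the sketch is the usual one for bucketing: larger buckets give the compressor more context but make each lookup decompress more data.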

Author:
Claudio Corsi, Paolo Ferragina
See Also:
ExternalSort

Field Summary
static int DEFAULT_BUCKET_SIZE
           
static int DEFAULT_COMPRESSION_LEVEL
           
 
Fields inherited from class it.unipi.di.textdb.TextDB
DEFAULT_FIELD_SEPARATOR, fieldSeparator, filename
 
Constructor Summary
BucketedZip(String filename)
          Creates a new BucketedZip object, loading the needed data structures from the provided file.
 
Method Summary
 TextDB build(String outfile, PrintStream log)
          Builds the TextDB over the textual file identified by the filename string used in the constructor (see TextDB.TextDB(String)).
static TextDB build(String inputfile, String outfile, int bucketSize, int level, PrintStream log)
          Builds a BucketedZip over an input file.
 void close()
          Closes the TextDB and releases all of its resources.
 String get(int record)
          Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.
 String[] getRange(int i, int j)
          Returns the records having positions from i to j in the TextDB.
 void getRange(int i, int j, int field, BufferedWriter out)
          Writes to the passed BufferedWriter the specified field for the records in the range [i,j].
 String[] getSequential(int[] records)
          Given a sorted array of record positions, this method returns the records at those positions.
 void getSequential(int[] records, int field, BufferedWriter out)
          Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
 String[] getSequential(int[] records, int pos, int length)
          Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.
static void main(String[] args)
           
 void open()
          Opens the TextDB.
 int size()
          Returns the number of records contained in this TextDB.
 
Methods inherited from class it.unipi.di.textdb.TextDB
build, fromTDBFile, get, getField, getFieldValues, getName, getRange, getRecordFields, getSequential, setFieldSeparator
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_BUCKET_SIZE

public static final int DEFAULT_BUCKET_SIZE
See Also:
Constant Field Values

DEFAULT_COMPRESSION_LEVEL

public static final int DEFAULT_COMPRESSION_LEVEL
See Also:
Constant Field Values
Constructor Detail

BucketedZip

public BucketedZip(String filename)
Creates a new BucketedZip object, loading the needed data structures from the provided file.

Parameters:
filename - the file containing the content and the data structures to load, stored in TDB format
Method Detail

close

public void close()
           throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.

Overrides:
close in class TextDB
Throws:
IOException

size

public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB. If N is the returned value then records of this database are numbered from 0 to N-1.

Specified by:
size in class TextDB
Returns:
the size of this TextDB as the number of the contained records

get

public String get(int record)
           throws IOException
Description copied from class: TextDB
Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.

Specified by:
get in class TextDB
Parameters:
record - a position in the range [0, N-1]
Returns:
the requested record
Throws:
IOException

getRange

public String[] getRange(int i,
                         int j)
                  throws IOException
Description copied from class: TextDB
Returns the records having positions from i to j in the TextDB.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Returns:
the records in the defined range
Throws:
IOException

getRange

public void getRange(int i,
                     int j,
                     int field,
                     BufferedWriter out)
              throws IOException
Description copied from class: TextDB
Writes to the passed BufferedWriter the specified field for the records in the range [i,j]. If a record does not contain the requested field, an empty line is written in its place.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position (counting from 0) of the field to return for all the records in range, or -1 to retrieve the entire record
out - the output BufferedWriter
Throws:
IOException
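The field semantics described above (0-based field position, -1 for the entire record, an empty line when the field is missing) can be illustrated with a small standalone sketch. It does not use TextDB: the class FieldSketch, its method names, and the tab separator used in the example are assumptions, and the records come from a plain in-memory list.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical helpers mirroring the documented field semantics. */
final class FieldSketch {
    /** Returns the 0-based field, the whole record if field == -1, or "" if the field is absent. */
    static String field(String record, int field, char separator) {
        if (field < -1) {
            throw new IllegalArgumentException("field must be >= -1: " + field);
        }
        if (field == -1) {
            return record;
        }
        int start = 0;
        for (int k = 0; k < field; k++) {
            int next = record.indexOf(separator, start);
            if (next < 0) {
                return ""; // the record has fewer fields than requested
            }
            start = next + 1;
        }
        int end = record.indexOf(separator, start);
        return end < 0 ? record.substring(start) : record.substring(start, end);
    }

    /** Writes one line per record in the inclusive range [i, j]. */
    static void writeRange(List<String> records, int i, int j, int field,
                           char separator, Writer out) throws IOException {
        for (int r = i; r <= j; r++) {
            out.write(field(records.get(r), field, separator));
            out.write('\n');
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> records = Arrays.asList("id1\tAlice", "id2", "id3\tCarol");
        StringWriter out = new StringWriter();
        writeRange(records, 0, 2, 1, '\t', out);
        System.out.print(out); // the middle line is empty because "id2" has no field 1
    }
}
```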

getSequential

public String[] getSequential(int[] records)
                       throws IOException
Description copied from class: TextDB
Given a sorted array of record positions, this method returns the records at those positions.

If some of the requested records are not available, the behavior is unspecified and depends on the underlying implementation.

Overrides:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
Returns:
the records having these positions (order is preserved)
Throws:
IOException
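The documentation does not say why the positions must be sorted. One plausible reason, given the bucketed layout of this class, is that a single forward pass over sorted positions touches each bucket at most once. The sketch below only counts bucket loads under that assumption; SequentialSketch and bucketLoads are hypothetical names and nothing here reflects the actual implementation.

```java
/** Illustrative only: counts bucket switches in one forward pass over record positions. */
final class SequentialSketch {
    /**
     * Returns how many times a new bucket would have to be loaded if the positions
     * were served in the given order, keeping only the current bucket in memory.
     */
    static int bucketLoads(int[] records, int bucketSize) {
        int loads = 0;
        int current = -1; // no bucket loaded yet
        for (int r : records) {
            int b = r / bucketSize;
            if (b != current) {
                current = b;
                loads++;
            }
        }
        return loads;
    }

    public static void main(String[] args) {
        int[] sorted = {0, 1, 5, 6, 7, 20};
        int[] unsorted = {0, 5, 1, 6};
        System.out.println(bucketLoads(sorted, 4));   // buckets 0,0,1,1,1,5 -> 3 loads
        System.out.println(bucketLoads(unsorted, 4)); // buckets 0,1,0,1     -> 4 loads
    }
}
```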

getSequential

public String[] getSequential(int[] records,
                              int pos,
                              int length)
                       throws IOException
Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.

The fetched positions are the ones in the range records[pos] (included) to records[pos+length] (excluded).

Specified by:
getSequential in class TextDB
Parameters:
records - an array containing a sorted subrange of record positions
pos - the starting position of the subrange
length - the length of the subrange
Returns:
the records having these positions (order is preserved)
Throws:
IOException
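The half-open subrange convention (records[pos] included, records[pos+length] excluded) can be made concrete with a standalone sketch over an in-memory list. SubrangeSketch is a hypothetical helper written for this example and does not call TextDB.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mirrors the documented (pos, length) subrange convention. */
final class SubrangeSketch {
    /** Returns the records at positions records[pos], ..., records[pos + length - 1], in order. */
    static String[] getSequential(List<String> db, int[] records, int pos, int length) {
        if (pos < 0 || length < 0 || length > records.length - pos) {
            throw new IndexOutOfBoundsException("pos=" + pos + ", length=" + length);
        }
        String[] result = new String[length];
        for (int k = 0; k < length; k++) {
            result[k] = db.get(records[pos + k]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> db = Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5");
        // Only the subrange starting at index 1 with length 3 (positions 1, 3, 4) is sorted and used.
        int[] records = {5, 1, 3, 4, 0};
        System.out.println(Arrays.toString(getSequential(db, records, 1, 3))); // [r1, r3, r4]
    }
}
```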

getSequential

public void getSequential(int[] records,
                          int field,
                          BufferedWriter out)
                   throws IOException
Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. If a record doesn't contain the requested field, the behavior of the method depends on its implementation (implementing classes are encouraged to write an empty line in this case).
To dump all fields of the specified records, pass -1 as the field position.

The retrieved records are not kept in memory but are written immediately to the provided BufferedWriter.

NOTE: implementations can use the method TextDB.getField(String, int) provided by this abstract class that selects a field of a record through a sequential access to the record itself. The use of a more efficient implementation of this function is encouraged.

Specified by:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output BufferedWriter
Throws:
IOException

open

public void open()
          throws IOException
Description copied from class: TextDB
Opens the TextDB.
This method has to be called before any other operation on the TextDB.

Overrides:
open in class TextDB
Throws:
IOException

build

public static TextDB build(String inputfile,
                           String outfile,
                           int bucketSize,
                           int level,
                           PrintStream log)
                    throws IOException
Builds a BucketedZip over an input file. The custom parameters control the bucket size and the search support; the latter requires a sorted input file.

Parameters:
inputfile - the file to compress
outfile - the output file name
bucketSize - the maximum size of each bucket, in number of records
level - the compression level (from 0 = FASTEST to 9 = BEST COMPRESSION)
log - a PrintStream to which log messages are sent. If null, log messages are suppressed
Returns:
the TextDB to access the built database
Throws:
IOException
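To give a feel for the level parameter, the sketch below measures compressed size with java.util.zip.Deflater, whose levels also run from 0 to 9. Whether BucketedZip passes level straight to Deflater is an assumption; note that in Deflater itself level 0 means no compression (stored data), 1 is the fastest compressing level, and 9 is best compression. LevelSketch and compressedSize are hypothetical names.

```java
import java.util.Arrays;
import java.util.zip.Deflater;

/** Illustrative only: compares Deflater output sizes at different compression levels. */
final class LevelSketch {
    /** Returns the number of bytes Deflater produces for the input at the given level (0..9). */
    static int compressedSize(byte[] input, int level) {
        if (level < 0 || level > 9) {
            throw new IllegalArgumentException("level must be in [0, 9]: " + level);
        }
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] input = new byte[20000];
        Arrays.fill(input, (byte) 'a'); // highly repetitive input compresses very well
        System.out.println("level 0: " + compressedSize(input, 0) + " bytes");
        System.out.println("level 9: " + compressedSize(input, 9) + " bytes");
    }
}
```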

build

public TextDB build(String outfile,
                    PrintStream log)
             throws IOException
Description copied from class: TextDB
Builds the TextDB over the textual file identified by the filename string used in the constructor (see TextDB.TextDB(String)). This method runs a build process with default values for all input parameters.

Log messages will be dumped into the passed PrintStream, or suppressed if the passed reference is null.

Specified by:
build in class TextDB
Parameters:
outfile - The output file name.
log - a PrintStream for log messages. A null value will suppress any output message
Returns:
A TextDB instance to access the built database.
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception