it.unipi.di.textdb
Class BucketedGZip

java.lang.Object
  extended by it.unipi.di.textdb.TextDB
      extended by it.unipi.di.textdb.BucketedGZip
All Implemented Interfaces:
SearchableDB

public class BucketedGZip
extends TextDB
implements SearchableDB

This is a TextDB that combines a bucketing scheme with GZip compression. A bucket is a fixed number of contiguous records. Each bucket is compressed with GZip (and therefore has variable length) and is accessed through a pointer (also called a jumper) kept in a file on disk.

At query time, the bucket containing the requested record is located through its jumper, loaded into memory, and decompressed (in the worst case, entirely) until the requested record is reached.
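
The scheme described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the library's implementation: the class name BucketSketch and its fields are hypothetical, buckets and jumpers are kept in memory rather than on disk, and records are assumed to contain no newline characters.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical in-memory illustration of bucketing plus GZip. */
class BucketSketch {
    final byte[] data;     // all compressed buckets, concatenated
    final int[] jumpers;   // jumpers[b] = byte offset of bucket b; last entry = data.length
    final int bucketSize;  // number of records per bucket

    BucketSketch(List<String> records, int bucketSize) {
        this.bucketSize = bucketSize;
        int buckets = (records.size() + bucketSize - 1) / bucketSize;
        this.jumpers = new int[buckets + 1];
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        try {
            for (int b = 0; b < buckets; b++) {
                jumpers[b] = all.size();
                int end = Math.min(records.size(), (b + 1) * bucketSize);
                ByteArrayOutputStream bucket = new ByteArrayOutputStream();
                // Each bucket is an independent GZip stream, so its length varies.
                try (GZIPOutputStream gz = new GZIPOutputStream(bucket)) {
                    for (int r = b * bucketSize; r < end; r++) {
                        gz.write((records.get(r) + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                }
                bucket.writeTo(all);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        jumpers[buckets] = all.size();
        this.data = all.toByteArray();
    }

    /** Locates the bucket via its jumper, then decompresses until the record is reached. */
    String get(int record) {
        int b = record / bucketSize;
        int start = jumpers[b];
        int length = jumpers[b + 1] - start;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data, start, length)),
                StandardCharsets.UTF_8))) {
            for (int skip = record % bucketSize; skip > 0; skip--) {
                in.readLine(); // discard records that precede the requested one
            }
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BucketSketch db = new BucketSketch(
                Arrays.asList("alpha", "beta", "gamma", "delta", "epsilon"), 2);
        System.out.println(db.get(3)); // prints: delta
    }
}
```

In this sketch a larger bucket size would typically improve compression, at the cost of more decompression work per lookup.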

Search support

If the input file is sorted, search support can be provided: the methods defined in the interface SearchableDB will then work over this TextDB. Search support must be requested at build time, using either the searchSupport parameter of build(String, int, boolean, PrintStream) or the --search-support command line option. Note that the library does not check whether the input file is actually sorted; the user must guarantee it.
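
Because the library performs no such check, a caller may want to verify the input before requesting search support. The sketch below is one way to do it; the class SortedCheck is hypothetical, and the comparison uses String.compareTo, which is an assumption since this page does not state the ordering that the library expects.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Iterator;

/** Hypothetical helper: checks that lines appear in non-decreasing order. */
class SortedCheck {
    static boolean isSorted(BufferedReader in) {
        String previous = null;
        Iterator<String> lines = in.lines().iterator();
        while (lines.hasNext()) {
            String current = lines.next();
            // Assumed ordering: String.compareTo (UTF-16 code unit order).
            if (previous != null && previous.compareTo(current) > 0) {
                return false;
            }
            previous = current;
        }
        return true;
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new StringReader("ant\nbee\ncat\n"));
        System.out.println(isSorted(in)); // prints: true
    }
}
```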

Author:
Claudio Corsi, Paolo Ferragina
See Also:
ExternalSort

Field Summary
 
Fields inherited from class it.unipi.di.textdb.TextDB
DEFAULT_FIELD_SEPARATOR
 
Constructor Summary
BucketedGZip(String filename)
          Creates a new BucketedGZip object, loading the required data structures from the given file.
 
Method Summary
 TextDB build(PrintStream log)
          Builds the TextDB over the textual file identified by the filename string used in the constructor (see TextDB.TextDB(String)).
static TextDB build(String inputFile, int bucketSize, boolean searchSupport, PrintStream log)
          Builds a BucketedGZip over an input file.
 void close()
          Closes the TextDB and releases all of its resources.
 String get(int record)
          Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.
 String[] getRange(int i, int j)
          Returns the records having positions from i to j in the TextDB.
 String[] getSequential(int[] records)
          Given a sorted array of record positions, this method returns all of them.
 String[] getSequential(int[] records, int pos, int length)
          Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.
 void getSequential(int[] records, int field, PrintStream out)
          Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
static void main(String[] args)
           
 void open()
          Opens the TextDB.
 Range prefix(String p)
          Returns the range [i, j) of consecutive records in the TextDB that are prefixed by string p.
 int rank(String s)
          Returns the position in this TextDB of the input string.
 int size()
          Returns the number of records contained in this TextDB.
 
Methods inherited from class it.unipi.di.textdb.TextDB
build, fromTDBFile, get, getFieldValues, getName, getRange, getRecordFields, getSequential, setFieldSeparator
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BucketedGZip

public BucketedGZip(String filename)
Creates a new BucketedGZip object, loading the required data structures from the given file.

Parameters:
filename - the file containing the content and the data structures to load, stored in TDB format

Method Detail

close

public void close()
           throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.

Overrides:
close in class TextDB
Throws:
IOException

get

public String get(int record)
           throws IOException
Description copied from class: TextDB
Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.

Specified by:
get in class TextDB
Parameters:
record - a position in the range [0, N-1]
Returns:
the requested record
Throws:
IOException

size

public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB. If N is the returned value then records of this database are numbered from 0 to N-1.

Specified by:
size in class TextDB
Returns:
the size of this TextDB as the number of the contained records

getRange

public String[] getRange(int i,
                         int j)
                  throws IOException
Description copied from class: TextDB
Returns the records having positions from i to j in the TextDB.

Specified by:
getRange in class TextDB
Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Returns:
the records in the defined range
Throws:
IOException

getSequential

public String[] getSequential(int[] records)
                       throws IOException
Description copied from class: TextDB
Given a sorted array of record positions, this method returns all of them.

If some of the requested records are not available, the behavior is unspecified and depends on the underlying implementation.

Overrides:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getSequential

public String[] getSequential(int[] records,
                              int pos,
                              int length)
                       throws IOException
Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.

The fetched positions are those from records[pos] (included) to records[pos+length] (excluded).

Specified by:
getSequential in class TextDB
Parameters:
records - array with a sorted subrange of records positions
pos - the starting position of the subrange
length - the length of the subrange
Returns:
the records having these positions (order is preserved)
Throws:
IOException
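
The subrange convention above can be illustrated with a small stand-in. This is a sketch only: the class SubrangeSketch is hypothetical and reads from an in-memory list instead of a TextDB. It also shows why a sorted subrange is useful, since all requested positions can be served by a single forward pass.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical illustration of the (records, pos, length) subrange convention. */
class SubrangeSketch {
    /** Returns the entries at positions records[pos] .. records[pos + length - 1]. */
    static String[] getSequential(List<String> db, int[] records, int pos, int length) {
        String[] out = new String[length];
        Iterator<String> it = db.iterator();
        String current = null;
        int currentPos = -1; // position of the entry held in 'current'
        for (int k = 0; k < length; k++) {
            int target = records[pos + k];
            if (target < currentPos) {
                throw new IllegalArgumentException("subrange is not sorted");
            }
            // Advance the forward-only cursor until the target position is reached.
            while (currentPos < target) {
                current = it.next();
                currentPos++;
            }
            out[k] = current;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> db = Arrays.asList("a", "b", "c", "d", "e", "f");
        int[] records = {5, 0, 2, 2, 4, 1}; // only the slice starting at index 1, length 4, is sorted
        System.out.println(Arrays.toString(getSequential(db, records, 1, 4))); // prints: [a, c, c, e]
    }
}
```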

getSequential

public void getSequential(int[] records,
                          int field,
                          PrintStream out)
                   throws IOException
Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. If a record doesn't contain the requested field, the behavior of the method depends on its implementation (implementing classes are encouraged to emit an empty line, i.e. an empty string, in this case).
To dump all fields of the specified records, pass the integer -1 as the field position.

The retrieved records are not kept in memory but immediately dumped on the provided PrintStream without wasting further memory.

NOTE: implementations can use the method TextDB.getField(String, int) provided by this abstract class that selects a field of a record through a sequential access to the record itself. The use of a more efficient implementation of this function is encouraged.

Specified by:
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output PrintStream
Throws:
IOException
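
The field-selection behavior described above might look like the following sketch. It is not the library's TextDB.getField(String, int): the class FieldSketch is hypothetical, the separator is passed explicitly because the value of DEFAULT_FIELD_SEPARATOR is not given on this page, and field positions are assumed to start at 0.

```java
/** Hypothetical illustration of selecting one field from a record. */
class FieldSketch {
    /**
     * Returns the field at the given position (assumed 0-based), the whole
     * record if field is -1, or an empty string if the field is missing.
     */
    static String getField(String record, int field, char separator) {
        if (field == -1) {
            return record;
        }
        int start = 0;
        // Sequentially skip the separators that precede the requested field.
        for (int f = 0; f < field; f++) {
            int next = record.indexOf(separator, start);
            if (next < 0) {
                return ""; // the record has fewer fields than requested
            }
            start = next + 1;
        }
        int end = record.indexOf(separator, start);
        return end < 0 ? record.substring(start) : record.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(getField("id42\tAda\tLondon", 1, '\t')); // prints: Ada
    }
}
```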

open

public void open()
          throws IOException
Description copied from class: TextDB
Opens the TextDB.
This method has to be called before any other operation on the TextDB.

Overrides:
open in class TextDB
Throws:
IOException

prefix

public Range prefix(String p)
             throws IOException
Description copied from interface: SearchableDB
Returns the range [i, j) of consecutive records in the TextDB that are prefixed by string p. Positions are counted from 1.

Specified by:
prefix in interface SearchableDB
Parameters:
p - the prefix to search
Returns:
the range [i, j) of records sharing the common prefix p
Throws:
IOException
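
To make the half-open range concrete, the sketch below computes it over a sorted in-memory array with two binary searches. It is an illustration, not the library's algorithm: the class PrefixSketch is hypothetical, the result is returned as an int pair rather than a Range, and sorting by String.compareTo is assumed.

```java
import java.util.Arrays;

/** Hypothetical illustration of a prefix search returning a half-open range. */
class PrefixSketch {
    /** Returns {i, j}: the range [i, j), counted from 1, of entries starting with p. */
    static int[] prefix(String[] sorted, String p) {
        // First binary search: first entry that is >= p.
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(p) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        int first = lo;
        // Second binary search: from 'first' on, entries starting with p come
        // before all other entries, so find the first one that does not.
        hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].startsWith(p)) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return new int[] {first + 1, lo + 1}; // convert 0-based indexes to 1-based positions
    }

    public static void main(String[] args) {
        String[] db = {"apple", "banana", "band", "bandit", "cat"};
        System.out.println(Arrays.toString(prefix(db, "ban"))); // prints: [2, 5]
    }
}
```

An empty range has i equal to j, so the number of matching records is always j - i.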

rank

public int rank(String s)
         throws IOException
Description copied from interface: SearchableDB
Returns the position of the input string in this TextDB. Because the underlying TextDB is a sorted list of records (strings), if s is not found the returned value is the negated position at which the input string would have to be placed in the TextDB.

Specified by:
rank in interface SearchableDB
Parameters:
s - the string to be searched
Returns:
the position pos of s, if s occurs in the TextDB; otherwise the value -pos, where pos is the position at which s would have to be placed
Throws:
IOException
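
The return convention can be sketched over a sorted in-memory array. This is an illustration under assumptions, not the library's code: the class RankSketch is hypothetical, and positions are assumed to be counted from 1 so that a negated position is never ambiguous with a found position.

```java
/** Hypothetical illustration of the rank return convention. */
class RankSketch {
    /** Returns pos if s occurs at (1-based) position pos, otherwise -pos of its insertion point. */
    static int rank(String[] sorted, String s) {
        // Binary search for the first entry that is >= s.
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(s) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        int pos = lo + 1; // assumed 1-based position
        boolean found = lo < sorted.length && sorted[lo].equals(s);
        return found ? pos : -pos;
    }

    public static void main(String[] args) {
        String[] db = {"apple", "banana", "cat"};
        System.out.println(rank(db, "banana")); // prints: 2
        System.out.println(rank(db, "bat"));    // prints: -3
    }
}
```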

build

public static TextDB build(String inputFile,
                           int bucketSize,
                           boolean searchSupport,
                           PrintStream log)
                    throws IOException
Builds a BucketedGZip over an input file. The custom parameters control the bucket size and the search support. Search support requires a sorted input file.

Parameters:
inputFile - the file to compress
bucketSize - the maximum size of each bucket, expressed as a number of records
searchSupport - if true build the data structures needed to support the search methods defined in the interface SearchableDB
log - a PrintStream to which log messages are sent. If null, log messages are suppressed
Returns:
the TextDB to access the build database
Throws:
IOException

build

public TextDB build(PrintStream log)
             throws IOException
Description copied from class: TextDB
Builds the TextDB over the textual file identified by the filename string used in the constructor (see TextDB.TextDB(String)). This method runs a build process with default values for all input parameters.

Log messages will be dumped into the passed PrintStream, or suppressed if the passed reference is null.

Specified by:
build in class TextDB
Parameters:
log - a PrintStream for log messages. A null value will suppress any output message
Returns:
A TextDB instance to access the built database.
Throws:
IOException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception