it.unipi.di.textdb
Class TextDB

java.lang.Object
  extended by it.unipi.di.textdb.TextDB
Direct Known Subclasses:
BucketedHuffword, BucketedZip, FrontCoding, RSHuffword, ZipCursor

public abstract class TextDB
extends Object

Consider a text file consisting of records (i.e. lines terminated by newlines). Each record is composed of a variable number of fields (i.e. strings separated by '\t'). Each field may in turn be composed of multiple values separated by a user-specified character sequence. The number of values and the number of fields may differ among records, thus generalizing the classical relational approach.
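
The record/field/value model described above can be illustrated with plain string splitting. This is only a sketch of the data model, not of how any TextDB implementation stores or parses its input; the class RecordModelSketch, the method parse and the sample data are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the data model; not part of the TextDB API.
final class RecordModelSketch {

    // Parses text into records -> fields -> values.
    // Records are separated by '\n', fields by '\t', values by valueSep.
    static String[][][] parse(String content, String valueSep) {
        String[] records = content.split("\n");
        String[][][] parsed = new String[records.length][][];
        for (int r = 0; r < records.length; r++) {
            // Limit -1 keeps empty trailing fields.
            String[] fields = records[r].split("\t", -1);
            parsed[r] = new String[fields.length][];
            for (int f = 0; f < fields.length; f++) {
                // Quote the separator so it is matched literally.
                parsed[r][f] = fields[f].split(Pattern.quote(valueSep), -1);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        String[][][] db = parse("rome\tit;en\t2800000\nparis\tfr", ";");
        System.out.println("records: " + db.length);
        System.out.println("fields in record 0: " + db[0].length);
        System.out.println("fields in record 1: " + db[1].length);
        System.out.println("values in record 0, field 1: " + db[0][1].length);
    }
}
```

Note that the two sample records have a different number of fields, and that only one field is multi-valued.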

A TextDB stores this input file on disk in a compressed form and offers efficient access to individual records and fields. A record and a field can be identified by means of an ordinal position, counted from 0 and starting from the beginning of the file (for the records) or the beginning of the record (for the fields).

Implementations of this class provide different compression techniques and different accessing methods, thus offering time/space trade-offs.

A note on the build interface

This class specifies a standard build(String, PrintStream) method that each implementation must provide. This method builds the data structures using default values for any custom parameters. To build a TextDB with customized parameters, an implementation should define a static "build" method that accepts those parameters and does all the work needed to write the permanent data structures to disk. This is only a suggested convention: nothing enforces it, although following it is good practice.

The output file format (TDB)

During the build process, the provided TextDB implementations write the compressed file content into a single file on disk in a format called TDB. The output file name is the file name passed to the constructor followed by the suffix ".tdb". That file contains all the data structures needed at run time to access the content of the built TextDB. Being a single file, the generated TDB file is easy to share among different users. From the programmer's point of view, a TDB file can be loaded through the static method fromTDBFile(String), which loads the file and returns the correct instance of the stored TextDB without requiring any knowledge of the implementation that built it.

Author:
Claudio Corsi, Paolo Ferragina

Field Summary
static String DEFAULT_FIELD_SEPARATOR
          The default separator '\t' for fields.
protected  it.unimi.dsi.mg4j.util.MutableString fieldSeparator
          The char sequence used to separate the fields within a record.
protected  String filename
           
 
Constructor Summary
TextDB(String filename)
          Creates a new TextDB from an input textual file.
 
Method Summary
 TextDB build(String outfile)
          Builds a TextDB over the textual file identified by the filename string used in the constructor (see TextDB(String)).
abstract  TextDB build(String outfile, PrintStream log)
          Builds the TextDB over the textual file identified by the filename string used in the constructor (see TextDB(String)).
 void close()
          Closes the TextDB and releases all of its resources.
static TextDB fromTDBFile(String tdbFile)
          Returns a TextDB from a TDB file.
abstract  String get(int record)
          Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.
 String get(int record, int field)
          Returns the field of a record, given their ordinal positions, or null if one is not present.
protected  String getField(String record, int field)
          Splits the input record into fields, using the separator specified with setFieldSeparator(String), and returns the field at the specified position, or null if that position is out of bounds.
 String[] getFieldValues(String field, String sep)
          Returns the values of a multi-valued field, where values are separated by a user-defined separator.
 String getName()
          Returns the name of this TextDB.
abstract  String[] getRange(int i, int j)
          Returns the records having positions from i to j in the TextDB.
 String[] getRange(int i, int j, int field)
          Returns the specified field for the records in the range [i,j].
abstract  void getRange(int i, int j, int field, BufferedWriter out)
          Writes to the passed BufferedWriter the specified field for the records in the range [i,j].
 String[] getRecordFields(String record)
          Returns all fields forming the input record.
 String[] getSequential(int[] records)
          Given a sorted array of record positions, this method returns all of them.
 String[] getSequential(int[] records, int field)
          Given a sorted array of record positions and the position of a field, this method returns that field of those records.
abstract  void getSequential(int[] records, int field, BufferedWriter out)
          Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
abstract  String[] getSequential(int[] records, int pos, int length)
          Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.
 void open()
          Opens the TextDB.
 void setFieldSeparator(String sep)
          Sets the sequence of chars used to separate fields.
abstract  int size()
          Returns the number of records contained in this TextDB.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

filename

protected String filename

DEFAULT_FIELD_SEPARATOR

public static final String DEFAULT_FIELD_SEPARATOR
The default separator '\t' for fields.

See Also:
Constant Field Values

fieldSeparator

protected it.unimi.dsi.mg4j.util.MutableString fieldSeparator
The char sequence used to separate the fields within a record.

Constructor Detail

TextDB

public TextDB(String filename)
Creates a new TextDB from an input textual file.
The input file name is used by the implementing class to load all data structures from disk. If the TextDB has yet to be built, the file name will be used to create the necessary file(s).

Implementations of this class are encouraged to store their data in a single file using a Directory object and to respect the TDB format.
NOTE: more information about the TDB format should be added here.

Parameters:
filename - the file containing this TextDB
Method Detail

fromTDBFile

public static TextDB fromTDBFile(String tdbFile)
                          throws IOException
Returns a TextDB from a TDB file.

Parameters:
tdbFile - the TDB file
Returns:
an instance of a TextDB corresponding to that file
Throws:
IOException - if the input file is not a valid TDB file or some I/O errors occur

getName

public String getName()
Returns the name of this TextDB.

This is a string, not necessarily unique, that identifies this TextDB instance. It is derived from the name of the file storing this database: the filename without the file system path and without the canonical ".tdb" suffix (if any). Because ".tdb" is only a naming convention for files containing TextDBs, if the suffix is not present this method returns the full filename without the file system path. Be aware that different TextDBs get the same name if they are stored in files with the same filename but in different file system locations. To avoid such clashes we encourage the use of meaningful filenames.

As an example, consider the file "/home/user/example.tdb". It stores a TextDB named "example".

Returns:
the name of this TextDB instance
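
The naming rule described above can be sketched as follows. This is an assumption about the behaviour based only on the description, not the actual implementation; the class NameSketch and the method nameOf are hypothetical.

```java
// Hypothetical sketch of the naming rule; not the actual getName() code.
final class NameSketch {

    // Strips the directory part and a trailing ".tdb" suffix, if present.
    static String nameOf(String path) {
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String base = path.substring(slash + 1);
        return base.endsWith(".tdb")
                ? base.substring(0, base.length() - ".tdb".length())
                : base;
    }

    public static void main(String[] args) {
        System.out.println(nameOf("/home/user/example.tdb"));
        System.out.println(nameOf("/home/user/archive.dat"));
        // Same filename in different directories yields the same name.
        System.out.println(nameOf("/a/db.tdb").equals(nameOf("/b/db.tdb")));
    }
}
```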

setFieldSeparator

public void setFieldSeparator(String sep)
Sets the sequence of chars used to separate fields.

Parameters:
sep - the new separator for fields

getField

protected String getField(String record,
                          int field)
Splits the input record into fields, using the separator specified with setFieldSeparator(String), and returns the field at the specified position, or null if that position is out of bounds.

The entire record is returned if -1 is specified as field position.

NOTE: Subclasses are encouraged to override this method with a faster implementation.

Parameters:
record - the record to parse
field - the ordinal position of the field to return, or -1 to return the entire record
Returns:
the requested field or the content of the entire record
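
A minimal sketch of the contract described above, assuming a non-empty separator passed explicitly instead of the fieldSeparator member. The class GetFieldSketch is hypothetical, and returning null for positions below -1 is an assumption the description does not cover.

```java
// Hypothetical stand-in for getField(String, int); not the actual code.
final class GetFieldSketch {

    // Returns the field at the given position, the whole record for -1,
    // or null when the position is out of bounds. sep must be non-empty.
    static String getField(String record, int field, String sep) {
        if (field == -1) {
            return record;
        }
        if (field < 0) {
            return null; // assumption: other negative positions are invalid
        }
        int start = 0;
        // Skip the first 'field' separators with a linear scan.
        for (int k = 0; k < field; k++) {
            int next = record.indexOf(sep, start);
            if (next < 0) {
                return null; // fewer fields than requested
            }
            start = next + sep.length();
        }
        int end = record.indexOf(sep, start);
        return end < 0 ? record.substring(start) : record.substring(start, end);
    }

    public static void main(String[] args) {
        String record = "a\tb\tc";
        System.out.println(getField(record, 0, "\t"));
        System.out.println(getField(record, 2, "\t"));
        System.out.println(getField(record, 3, "\t"));
    }
}
```

The scan stops as soon as the requested field is found, so the fields that follow it are never split.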

getFieldValues

public String[] getFieldValues(String field,
                               String sep)
Returns the values of a multi-valued field, where values are separated by a user-defined separator.

This implementation splits the field using a regular expression built over the separator string sep. Subclasses are encouraged to override this method with a more efficient one.

Parameters:
field - the field content
sep - the separator used to separate the values composing this field
Returns:
an array containing the values of this field
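
One way to split on a separator through a regular expression is sketched below. It quotes the separator so that regex metacharacters such as '|' or '.' are matched literally; whether the actual implementation quotes the separator is not stated here, so treat this as an assumption. The class FieldValuesSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical stand-in for getFieldValues(String, String).
final class FieldValuesSketch {

    // Splits a multi-valued field on a literal separator.
    static String[] getFieldValues(String field, String sep) {
        // Pattern.quote makes the separator match literally; an unquoted
        // "|" would be read as regex alternation instead.
        return Pattern.compile(Pattern.quote(sep)).split(field, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(getFieldValues("red|green|blue", "|")));
        System.out.println(Arrays.toString(getFieldValues("solo", ";")));
    }
}
```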

getRecordFields

public String[] getRecordFields(String record)
Returns all fields forming the input record. The fields are extracted through a linear scan of the record itself.

NOTE: Subclasses are encouraged to override this method with a faster implementation.

Parameters:
record - the record to parse
Returns:
an array of strings, one per field (order is preserved).

get

public abstract String get(int record)
                    throws IOException
Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.

Parameters:
record - a position in the range [0, N-1]
Returns:
the requested record
Throws:
IOException

get

public String get(int record,
                  int field)
           throws IOException
Returns the field of a record, given their ordinal positions, or null if one is not present.

Parameters:
record - the position of a record
field - the position of the field to be retrieved
Returns:
the requested field for that record
Throws:
IOException

getRange

public abstract String[] getRange(int i,
                                  int j)
                           throws IOException
Returns the records having positions from i to j in the TextDB.

Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Returns:
the records in the defined range
Throws:
IOException

getRange

public String[] getRange(int i,
                         int j,
                         int field)
                  throws IOException
Returns the specified field for the records in the range [i,j]. If the field is not present in a record, a null value is stored in the corresponding position of the returned array.
The default implementation is based on a sequential scan of the fetched records.

Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position of the field to return for all those records
Returns:
the field of the records in the range [i,j]
Throws:
IOException

getRange

public abstract void getRange(int i,
                              int j,
                              int field,
                              BufferedWriter out)
                       throws IOException
Writes to the passed BufferedWriter the specified field for the records in the range [i,j]. If the field is not present in a record, an empty line is written.

Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position (counting from 0) of the field to return for all the records in range, or -1 to retrieve the entire record
out - the output BufferedWriter
Throws:
IOException
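
The output convention described above (one line per record, an empty line when the field is missing, the whole record for -1) can be sketched over an in-memory array of records. The class RangeDumpSketch, the store array and the use of '\n' as line terminator are assumptions made for illustration; a real implementation reads from its compressed structures.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical sketch of the range-dump convention; not the actual code.
final class RangeDumpSketch {

    // Writes the given field of records i..j (both inclusive), one per line.
    static void dumpRange(String[] store, int i, int j, int field,
                          BufferedWriter out) throws IOException {
        for (int r = i; r <= j; r++) {
            String value;
            if (field == -1) {
                value = store[r]; // -1 selects the entire record
            } else {
                String[] fields = store[r].split("\t", -1);
                // A missing field produces an empty line.
                value = (field >= 0 && field < fields.length) ? fields[field] : "";
            }
            out.write(value);
            out.write('\n');
        }
        out.flush();
    }

    // Convenience wrapper that collects the output in a string.
    static String dumpToString(String[] store, int i, int j, int field) {
        StringWriter sink = new StringWriter();
        try (BufferedWriter out = new BufferedWriter(sink)) {
            dumpRange(store, i, j, field, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        String[] store = {"a\tb", "x", "c\td"};
        System.out.print(dumpToString(store, 0, 2, 1));
    }
}
```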

getSequential

public String[] getSequential(int[] records)
                       throws IOException
Given a sorted array of record positions, this method returns all of them.

If some of the requested records are not available, the behavior is unspecified and depends on the underlying implementation.

Parameters:
records - a sorted array of record positions
Returns:
the records having these positions (order is preserved)
Throws:
IOException

getSequential

public abstract String[] getSequential(int[] records,
                                       int pos,
                                       int length)
                                throws IOException
Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for such positions.

The fetched positions are the ones in the range records[pos] (included) to records[pos+length] (excluded).

Parameters:
records - array with a sorted subrange of records positions
pos - the starting position of the subrange
length - the length of the subrange
Returns:
the records having these positions (order is preserved)
Throws:
IOException
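
The subrange semantics can be sketched over an in-memory array of records: only the entries records[pos] to records[pos+length-1] are read, so only that slice needs to be sorted. The class SubrangeSketch and the store array are hypothetical; a real implementation would exploit the sorted order to scan its compressed data sequentially.

```java
import java.util.Arrays;

// Hypothetical sketch of the (records, pos, length) contract.
final class SubrangeSketch {

    // Returns store[records[pos]], ..., store[records[pos + length - 1]].
    static String[] getSequential(String[] store, int[] records, int pos, int length) {
        String[] out = new String[length];
        for (int k = 0; k < length; k++) {
            out[k] = store[records[pos + k]];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] store = {"r0", "r1", "r2", "r3", "r4"};
        // The whole array is unsorted, but the slice {0, 2, 3} is sorted.
        int[] records = {4, 0, 2, 3, 1};
        System.out.println(Arrays.toString(getSequential(store, records, 1, 3)));
    }
}
```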

getSequential

public String[] getSequential(int[] records,
                              int field)
                       throws IOException
Given a sorted array of record positions and the position of a field, this method returns that field of those records. A null value is stored in an array entry whenever the field is not present in the corresponding record. The input order is respected in the output array.

The default implementation is based on a sequential scan of the fetched records.

Parameters:
records - a sorted array of record positions
field - the field to select into each of these records
Returns:
the requested fields
Throws:
IOException

getSequential

public abstract void getSequential(int[] records,
                                   int field,
                                   BufferedWriter out)
                            throws IOException
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. If a record doesn't contain the requested field, the behavior of the method depends on its implementation (implementing classes are encouraged to write an empty line in this case).
To dump all fields of the specified records, pass -1 as the field position.

The retrieved records are not kept in memory but are written immediately to the provided BufferedWriter, without using further memory.

NOTE: implementations can use the method getField(String, int) provided by this abstract class that selects a field of a record through a sequential access to the record itself. The use of a more efficient implementation of this function is encouraged.

Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output BufferedWriter
Throws:
IOException

open

public void open()
          throws IOException
Opens the TextDB.
This method has to be called before any other operation on the TextDB.

Throws:
IOException

close

public void close()
           throws IOException
Closes the TextDB and releases all of its resources.

Throws:
IOException

size

public abstract int size()
Returns the number of records contained in this TextDB. If N is the returned value then records of this database are numbered from 0 to N-1.

Returns:
the size of this TextDB as the number of the contained records

build

public abstract TextDB build(String outfile,
                             PrintStream log)
                      throws IOException
Builds the TextDB over the textual file identified by the filename string used in the constructor (see TextDB(String)). This method runs a build process with default values for all input parameters.

Log messages will be dumped into the passed PrintStream, or suppressed if the passed reference is null.

Parameters:
log - a PrintStream for log messages. A null value will suppress any output message
outfile - The output file name.
Returns:
A TextDB instance to access the built database.
Throws:
IOException

build

public TextDB build(String outfile)
             throws IOException
Builds a TextDB over the textual file identified by the filename string used in the constructor (see TextDB(String)). This method runs a build process with default values for all input parameters.

Parameters:
outfile - The output file name
Returns:
A TextDB instance to access the built database.
Throws:
IOException