java.lang.Object
  it.unipi.di.textdb.TextDB
    it.unipi.di.textdb.BucketedHuffword
public class BucketedHuffword extends TextDB
A TextDB implementation that combines the Huffword data compressor with a bucketing scheme. Each bucket consists of a fixed number of contiguous records and therefore has variable length. The whole file is compressed with Huffword, and pointers to the compressed buckets (also called jumpers) are kept in a file on disk.
At query time, the bucket containing the requested record is identified through its jumper, loaded into memory, and read sequentially until the requested record is reached. Given the properties of Huffword, only the requested portion of a bucket must be uncompressed.
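The lookup described above can be illustrated with a minimal sketch. It is not the library's code: the class name BucketLookupSketch, the in-memory byte array standing in for the compressed file, and the newline record terminator are all assumptions made for illustration, and no Huffword compression is performed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch of bucket + jumper lookup; records are stored uncompressed. */
final class BucketLookupSketch {
    private final int bucketLen;   // fixed number of records per bucket
    private final int size;        // total number of records
    private final int[] jumpers;   // byte offset at which each bucket starts
    private final byte[] data;     // stand-in for the compressed file

    BucketLookupSketch(List<String> records, int bucketLen) {
        this.bucketLen = bucketLen;
        this.size = records.size();
        this.jumpers = new int[(size + bucketLen - 1) / bucketLen];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int r = 0; r < size; r++) {
            if (r % bucketLen == 0) {
                jumpers[r / bucketLen] = out.size(); // a new bucket starts here
            }
            byte[] bytes = records.get(r).getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
            out.write('\n'); // assumption: records never contain a newline
        }
        this.data = out.toByteArray();
    }

    /** Returns the record at the given position, or null if out of range. */
    String get(int record) {
        if (record < 0 || record >= size) {
            return null;
        }
        int pos = jumpers[record / bucketLen];       // jump to the bucket start
        for (int skip = record % bucketLen; skip > 0; skip--) {
            while (data[pos] != '\n') {
                pos++;                               // scan past one record
            }
            pos++;
        }
        int end = pos;
        while (data[end] != '\n') {
            end++;
        }
        return new String(data, pos, end - pos, StandardCharsets.UTF_8);
    }
}
```

With a bucket length of 2, fetching record 3 jumps to the start of bucket 1 and skips a single record before reading the requested one. Because buckets have variable byte length, the jumper table is what makes this direct jump possible.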
The tokens constituting Huffword's alphabet are computed using a Tokenizer. Codewords are assigned to tokens according to their frequencies in the input file. Optionally, the ZetaCompressor can be used to assign codewords to tokens: tokens are sorted by decreasing frequency and their ranks (in this order) are Zeta-encoded. This avoids the need to store the dictionary of codewords explicitly, at the cost of sub-optimal output.
The Huffman and ZetaCode implementations are provided by the MG4J library (http://mg4j.dsi.unimi.it/).
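The rank assignment used by the Zeta option can be sketched as follows. This is an illustration only: the class name FrequencyRankSketch, the whitespace tokenization, and the lexicographic tie-break are assumptions, and the Zeta encoding of the resulting ranks (left to MG4J in the real implementation) is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: order tokens by decreasing frequency; a token's rank is its list index. */
final class FrequencyRankSketch {
    static List<String> rankTokens(String text) {
        // Count token frequencies (assumption: tokens are whitespace-separated).
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(freq.keySet());
        ranked.sort((a, b) -> {
            int byFreq = Integer.compare(freq.get(b), freq.get(a)); // decreasing frequency
            return byFreq != 0 ? byFreq : a.compareTo(b);           // assumed tie-break
        });
        return ranked;
    }
}
```

Since a codeword is fully determined by a token's rank, only the ordered token list is needed to decode, which is consistent with the remark that no explicit codeword dictionary has to be stored.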
Field Summary | |
---|---|
static int | DEFAULT_BUCKET_LENGTH |
static int | DEFAULT_ZETA_K |
static int | HUFFMAN_CODEC |
static int | RECORD_BUFFER_LENGTH: The default size of the in-memory buffer used to store the loaded record. |
static int | ZETA_CODEC |
Fields inherited from class it.unipi.di.textdb.TextDB |
---|
DEFAULT_FIELD_SEPARATOR, fieldSeparator, filename |
Constructor Summary | |
---|---|
BucketedHuffword(String filename): Creates a new BucketedHuffword object, loading the needed data structures from the provided file. |
Method Summary | |
---|---|
TextDB | build(String outfile, PrintStream log): Compresses the input file with BucketedHuffword. |
static TextDB | build(Tokenizer tokenizer, String inputfile, String outfile, int bucketLen, PrintStream log): Compresses the input file with BucketedHuffword using a set of custom parameters. |
void | close(): Closes the TextDB and releases all of its resources. |
String | get(int record): Returns the record at the given position in the compressed file, or null if the position is out of range. |
String[] | getRange(int i, int j): Returns the records at positions i to j in the TextDB. |
String[] | getRange(int i, int j, int field): Returns the specified field for the records in the range [i,j]. |
void | getRange(int i, int j, int field, BufferedWriter out): Writes to the passed BufferedWriter the specified field for the records in the range [i,j]. |
String[] | getSequential(int[] records): Given a sorted array of record positions, returns the corresponding records. |
void | getSequential(int[] records, int field, BufferedWriter out): Given a sorted array of record positions and the position of a field, retrieves the specified field from those records. |
String[] | getSequential(int[] records, int pos, int length): Given an array of record positions containing a sorted subrange defined by the parameters pos and length, returns the records for those positions. |
static void | main(String[] args) |
void | open(): Opens the TextDB. |
static void | setCodec(int codec) |
static void | setZetaKParameter(int k) |
int | size(): Returns the number of records contained in this TextDB. |
Methods inherited from class it.unipi.di.textdb.TextDB |
---|
build, fromTDBFile, get, getField, getFieldValues, getName, getRecordFields, getSequential, setFieldSeparator |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int ZETA_CODEC
public static final int HUFFMAN_CODEC
public static final int DEFAULT_BUCKET_LENGTH
public static final int DEFAULT_ZETA_K
public static final int RECORD_BUFFER_LENGTH
Constructor Detail |
---|
public BucketedHuffword(String filename)
Parameters:
filename - the TDB file containing the needed data structures and the compressed content
Method Detail |
---|
public static void setCodec(int codec)
public static void setZetaKParameter(int k)
public void close() throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.
close in class TextDB
Throws:
IOException
public TextDB build(String outfile, PrintStream log) throws IOException
Compresses the input file with BucketedHuffword, using TermTokenizer as the tokenizer. Use build(Tokenizer, String, String, int, PrintStream) to compress with a set of customized parameters.
build in class TextDB
Parameters:
log - a PrintStream where log messages are printed. A null value suppresses all output messages.
outfile - the output file name
Throws:
IOException
public static TextDB build(Tokenizer tokenizer, String inputfile, String outfile, int bucketLen, PrintStream log) throws IOException
Compresses the input file with BucketedHuffword using a set of custom parameters.
Parameters:
tokenizer - the tokenizer used to define the Huffword tokens from the input file
inputfile - the input file name
bucketLen - the number of records per bucket
log - a PrintStream where log messages are printed. A null value suppresses all output messages.
Throws:
IOException
public void open() throws IOException
Description copied from class: TextDB
Opens the TextDB.
open in class TextDB
Throws:
IOException
public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB.
size in class TextDB
public String get(int record) throws IOException
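Returns the record at the given position in the compressed file, or null if the position is out of range.
get in class TextDB
Parameters:
record - the record number, in the range [0, N-1]
Throws:
IOException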
public void getSequential(int[] records, int field, BufferedWriter out) throws IOException
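Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. The default implementation relies on TextDB.getField(String, int), provided by this abstract class, which selects a field of a record through sequential access to the record itself. The use of a more efficient implementation of this function is encouraged.
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output BufferedWriter
Throws:
IOException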
public String[] getSequential(int[] records) throws IOException
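Description copied from class: TextDB
Given a sorted array of record positions, this method returns the corresponding records.
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
Throws:
IOException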
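The page does not say why the positions must be sorted. One plausible reason, given the bucketing scheme, is that consecutive positions falling in the same bucket can share a single bucket load. The sketch below only counts bucket loads under that assumption; SequentialAccessSketch and bucketLoads are hypothetical names, not part of the library.

```java
/** Hypothetical sketch: how many bucket loads a sequence of record positions would need. */
final class SequentialAccessSketch {
    static int bucketLoads(int[] records, int bucketLen) {
        int loads = 0;
        int current = -1; // index of the bucket currently held in memory
        for (int record : records) {
            int bucket = record / bucketLen;
            if (bucket != current) {
                loads++;          // a different bucket has to be fetched
                current = bucket;
            }
        }
        return loads;
    }
}
```

For a bucket length of 4, the sorted positions 0, 1, 5, 6, 7, 12 touch buckets 0, 1 and 3 once each, while the unsorted sequence 0, 5, 1, 6 would switch bucket on every access.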
public String[] getSequential(int[] records, int pos, int length) throws IOException
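Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length, this method returns the records for those positions. The subrange goes from records[pos] (included) to records[pos+length] (excluded).
getSequential in class TextDB
Parameters:
records - an array containing a sorted subrange of record positions
pos - the starting position of the subrange
length - the length of the subrange
Throws:
IOException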
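The subrange convention (records[pos] included, records[pos+length] excluded) can be made concrete with a small sketch. The class SubrangeSketch and the lookup function standing in for the actual record retrieval are hypothetical.

```java
import java.util.function.IntFunction;

/** Hypothetical sketch of the pos/length subrange convention. */
final class SubrangeSketch {
    static String[] getSequential(int[] records, int pos, int length,
                                  IntFunction<String> lookup) {
        String[] result = new String[length];
        for (int k = 0; k < length; k++) {
            // Visits records[pos] up to records[pos + length - 1]; entries outside are ignored.
            result[k] = lookup.apply(records[pos + k]);
        }
        return result;
    }
}
```

With records = {9, 9, 2, 4, 7, 1}, pos = 2 and length = 3, only positions 2, 4 and 7 are fetched, so only that slice of the array needs to be sorted.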
public String[] getRange(int i, int j) throws IOException
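Description copied from class: TextDB
Returns the records at positions i to j in the TextDB.
getRange in class TextDB
Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Throws:
IOException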
public String[] getRange(int i, int j, int field) throws IOException
Description copied from class: TextDB
Returns the specified field for the records in the range [i,j].
getRange in class TextDB
Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position of the field to return for all those records
Throws:
IOException
public void getRange(int i, int j, int field, BufferedWriter out) throws IOException
Description copied from class: TextDB
Writes to the passed BufferedWriter the specified field for the records in the range [i,j]. If the field is not present in a record, an empty line is written instead.
getRange in class TextDB
Parameters:
i - the starting position of the records to be fetched (included)
j - the ending position of the records to be fetched (included)
field - the position (counting from 0) of the field to return for all the records in range, or -1 to retrieve the entire record
out - the output BufferedWriter
Throws:
IOException
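The field-selection rules described here (0-based field position, -1 for the whole record, an empty line when the field is missing) can be sketched as below. The class FieldDumpSketch and its methods are hypothetical, and the tab separator used in the usage note is an assumption: the page lists DEFAULT_FIELD_SEPARATOR but does not give its value.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.regex.Pattern;

/** Hypothetical sketch of per-record field selection and output. */
final class FieldDumpSketch {
    /** Returns the requested field, the whole record for -1, or "" if the field is absent. */
    static String selectField(String record, int field, char separator) {
        if (field == -1) {
            return record;
        }
        // Quote the separator so it is matched literally; -1 keeps trailing empty fields.
        String[] fields = record.split(Pattern.quote(String.valueOf(separator)), -1);
        return (field >= 0 && field < fields.length) ? fields[field] : "";
    }

    /** Writes one line per record: the selected field, or an empty line if absent. */
    static void writeField(String record, int field, char separator,
                           BufferedWriter out) throws IOException {
        out.write(selectField(record, field, separator));
        out.newLine();
    }
}
```

For the record "a\tb\tc" with a tab separator, field 1 selects "b", field 5 yields an empty line, and field -1 writes the record unchanged.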
public static void main(String[] args) throws Exception
Throws:
Exception