java.lang.Object
  it.unipi.di.textdb.TextDB
      it.unipi.di.textdb.FrontCoding
public class FrontCoding
A TextDB that uses the front-coding technique to compress its records
and a bucketing scheme to provide efficient access to them.
A bucket is defined as a fixed number of contiguous records, and
therefore has variable length in bytes.
In each bucket the first record is stored explicitly, whereas every other record is
front-encoded by squeezing out the longest prefix (LP) it shares with the
preceding record in its bucket. The squeezing consists of replacing LP
with its length (encoded using 1 or 3 bytes). Optionally, a front-coded
bucket can be further compressed using GZip.
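The front-encoding step described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the library's on-disk format: the class `FrontCodingSketch`, its `Entry` type, and the `encode`/`decode` helpers are hypothetical names introduced here, and the prefix length is kept as a plain `int` rather than in the 1- or 3-byte encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and representation are assumptions.
class FrontCodingSketch {

    /** One front-coded entry: length of the shared prefix plus the remaining suffix. */
    static final class Entry {
        final int lcp;
        final String suffix;
        Entry(int lcp, String suffix) { this.lcp = lcp; this.suffix = suffix; }
    }

    /** Length of the longest prefix shared by a and b. */
    static int lcp(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    /** Encodes one bucket; the first record is stored explicitly (lcp = 0). */
    static List<Entry> encode(List<String> bucket) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (int k = 0; k < bucket.size(); k++) {
            String cur = bucket.get(k);
            int shared = (k == 0) ? 0 : lcp(prev, cur);
            out.add(new Entry(shared, cur.substring(shared)));
            prev = cur;
        }
        return out;
    }

    /** Rebuilds the records by reusing the prefix of the previously decoded record. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String cur = prev.substring(0, e.lcp) + e.suffix;
            out.add(cur);
            prev = cur;
        }
        return out;
    }
}
```

For the bucket `car, card, care, cat`, `encode` yields the entries `(0, "car")`, `(3, "d")`, `(3, "e")`, `(2, "t")`, and `decode` restores the original four records.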
The pointer to the beginning of each bucket (also called jumper)
is stored in a separate file on disk.
At query time the bucket containing the requested record is identified
using its jumper, loaded into memory, and scanned sequentially until the
record is reached.
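The query path described above (find the bucket through its jumper, then scan forward) might look like the sketch below. The byte layout is an assumption made for illustration: each entry is written as a 4-byte prefix length, a 4-byte suffix length, and the suffix bytes, and a byte array stands in for the data file. `JumperLookupSketch` and its members are hypothetical names, not part of this class.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; the entry layout is hypothetical.
class JumperLookupSketch {
    final byte[] data;     // all buckets back to back (stands in for the data file)
    final int[] jumpers;   // jumpers[b] = offset in data where bucket b begins
    final int bucketSize;  // number of records per bucket

    JumperLookupSketch(String[] sorted, int bucketSize) {
        this.bucketSize = bucketSize;
        this.jumpers = new int[(sorted.length + bucketSize - 1) / bucketSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prev = new byte[0];
        for (int r = 0; r < sorted.length; r++) {
            byte[] cur = sorted[r].getBytes(StandardCharsets.UTF_8);
            int shared = 0;
            if (r % bucketSize == 0) {
                jumpers[r / bucketSize] = out.size();   // a new bucket begins here
            } else {
                int n = Math.min(prev.length, cur.length);
                while (shared < n && prev[shared] == cur[shared]) shared++;
            }
            byte[] header = ByteBuffer.allocate(8)
                    .putInt(shared).putInt(cur.length - shared).array();
            out.write(header, 0, header.length);
            out.write(cur, shared, cur.length - shared);
            prev = cur;
        }
        this.data = out.toByteArray();
    }

    /** Jumps to the bucket holding the record, then decodes entries until it is reached. */
    String get(int record) {
        ByteBuffer in = ByteBuffer.wrap(data);
        in.position(jumpers[record / bucketSize]);
        byte[] cur = new byte[0];
        for (int k = 0; k <= record % bucketSize; k++) {
            int shared = in.getInt();
            int suffixLen = in.getInt();
            byte[] next = Arrays.copyOf(cur, shared + suffixLen); // keeps the shared prefix
            in.get(next, shared, suffixLen);                      // appends the suffix
            cur = next;
        }
        return new String(cur, StandardCharsets.UTF_8);
    }
}
```

In this sketch a lookup decodes at most `bucketSize` entries, so a smaller bucket gives faster access at the cost of more explicitly stored first records and more jumpers.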
To achieve higher compression, this scheme should be applied to a
TextDB whose records
are sorted lexicographically (see ExternalSort).
Field Summary | |
---|---|
static char | BIGGER_CHAR |
static int | DEFAULT_BUCKET_SIZE |
static int | DEFAULT_COMPRESSION_LEVEL |
static byte | ESCAPE_VALUE_BYTE |
static char | SMALLER_CHAR |
Fields inherited from class it.unipi.di.textdb.TextDB |
---|
DEFAULT_FIELD_SEPARATOR, fieldSeparator, filename |
Constructor Summary | |
---|---|
FrontCoding(String filename) | Creates a new FrontCoding object from the provided file. |
FrontCoding(String filename, boolean searchSupport) | Creates a new FrontCoding object from the provided file. |
Method Summary | |
---|---|
long | bucketSize() |
TextDB | build(String outfile, PrintStream log) Compresses the TextDB via (plain) front-coding with buckets of 100 records. |
static TextDB | build(String inputfile, String outfile, int bucketSize, boolean compress, boolean frontCoding, boolean searchSupport, int level, PrintStream log) Compresses the input file with a customized compression scheme that combines front-coding and Zip. |
void | close() Closes the TextDB and releases all of its resources. |
float | compressRatio() |
String | dictName() |
long | dictSize() |
long | fcBeginSize() |
long | fcDictSize() |
long | fcJumpSize() |
long | fcTotalSize() |
String | get(int record) Returns the record at a given position in the range [0, N-1], where N is the number of records in the TextDB. |
String[] | getRange(int i, int j) Returns the records at positions i through j in the TextDB. |
void | getRange(int i, int j, int field, BufferedWriter out) Writes to the given BufferedWriter the specified field of the records in the range [i,j]. |
String[] | getSequential(int[] records) Given a sorted array of record positions, returns the records at those positions. |
void | getSequential(int[] records, int field, BufferedWriter out) Given a sorted array of record positions and the position of a field, retrieves the specified field from those records. |
String[] | getSequential(int[] records, int pos, int length) Given an array of record positions containing a sorted subrange defined by the parameters pos and length, returns the records at those positions. |
protected int | locate(String p) Returns the position of the first record having prefix p (records are assumed to be sorted), or the alphabetical position of p among the records in the TextDB. |
static void | main(String[] args) |
long | numBuckets() |
long | numStrings() |
void | open() Opens and loads from disk all data structures needed to access the compressed TextDB. |
Range | prefix(String p) Returns the range [i, j) of positions identifying the records in the (ordered) TextDB that are prefixed by p. |
int | rank(String s) Returns the rank r of the smallest record in the current TextDB that is larger than or equal to s. |
int | size() Returns the number of records contained in this TextDB. |
Methods inherited from class it.unipi.di.textdb.TextDB |
---|
build, fromTDBFile, get, getField, getFieldValues, getName, getRange, getRecordFields, getSequential, setFieldSeparator |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_BUCKET_SIZE
public static final int DEFAULT_COMPRESSION_LEVEL
public static final char SMALLER_CHAR
public static final char BIGGER_CHAR
public static final byte ESCAPE_VALUE_BYTE
Constructor Detail |
---|
public FrontCoding(String filename)
Creates a new FrontCoding object from the provided file.
Parameters:
filename - the TDB file containing the front-coded records and the data structures needed to access them
public FrontCoding(String filename, boolean searchSupport)
Creates a new FrontCoding object from the provided file.
Parameters:
filename - the TDB file containing the front-coded records and the data structures needed to access them
searchSupport - if false, the data structures that support the searches defined in the SearchableDB interface are not loaded
Method Detail |
---|
public String dictName()
public long dictSize()
public long numBuckets()
public long bucketSize()
public long numStrings()
public long fcBeginSize()
public long fcDictSize()
public long fcJumpSize()
public long fcTotalSize()
public float compressRatio()
public TextDB build(String outfile, PrintStream log) throws IOException
Compresses the TextDB via (plain) front-coding with buckets of 100 records.
SearchableDB.
build in class TextDB
Parameters:
log - a PrintStream on which to print log messages. A null value suppresses all output messages
outfile - the output file name
Throws:
IOException
public static TextDB build(String inputfile, String outfile, int bucketSize, boolean compress, boolean frontCoding, boolean searchSupport, int level, PrintStream log) throws IOException
Compresses the input file with a customized compression scheme that combines front-coding and Zip.
Parameters:
inputfile - the file to compress
outfile - the output file name
bucketSize - the number of contiguous records forming a bucket
compress - if true, ZIP compression is applied to each bucket (possibly already front-coded)
frontCoding - if true, front-coding is applied to the buckets (before applying Gzip, if any)
searchSupport - if true, builds the support for the SearchableDB interface
level - the compression level used when compress=true; an integer between 0 and 9 (inclusive)
log - a PrintStream on which to print log messages. A null value suppresses all output messages
Throws:
IOException
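To illustrate what the `compress` and `level` parameters control, the sketch below compresses the bytes of one bucket with the JDK's `java.util.zip.Deflater` at a chosen level and restores them with `Inflater`. It assumes DEFLATE in the zlib wrapper; which container format this class actually writes is not stated here, and `BucketCompressionSketch` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative sketch only; not the library's compression code.
class BucketCompressionSketch {

    /** Compresses one bucket at the given level (0 = no compression, 9 = best). */
    static byte[] compress(byte[] bucket, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(bucket);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original bucket bytes. */
    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) break; // truncated input
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt bucket", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

Higher levels trade build time for smaller buckets; decompression needs no level, so the choice affects only `build`.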
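public void open() throws IOException
Opens and loads from disk all data structures needed to access the compressed TextDB.
open in class TextDB
Throws:
IOException
public void close() throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.
close in class TextDB
Throws:
IOException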
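protected int locate(String p) throws IOException
Returns the position of the first record having prefix p (records are assumed to be sorted), or the alphabetical position of p among the records in the TextDB.
Parameters:
p - the prefix to search
Throws:
IOException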
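public Range prefix(String p) throws IOException
Returns the range [i, j) of positions identifying the records in the (ordered) TextDB that are prefixed by p.
prefix in interface SearchableDB
Parameters:
p - the prefix to search
Throws:
IOException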
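The half-open range returned by a prefix search, and its relation to rank, can be illustrated over a plain sorted array. This is a sketch of the semantics only, assuming records compared with `String.compareTo`; it says nothing about how this class searches its compressed buckets. `PrefixRangeSketch` is a hypothetical name, and the range is returned as an `int[]` pair rather than the library's `Range` type.

```java
// Illustrative sketch only; operates on an uncompressed sorted array.
class PrefixRangeSketch {

    /** Position of the smallest record that is >= s (sorted.length if none). */
    static int rank(String[] sorted, String s) {
        int lo = 0, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].compareTo(s) < 0) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Half-open range {i, j} such that exactly sorted[i..j-1] start with p. */
    static int[] prefix(String[] sorted, String p) {
        int i = rank(sorted, p);
        // From position i onward, the records prefixed by p form a contiguous run.
        int lo = i, hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid].startsWith(p)) lo = mid + 1; else hi = mid;
        }
        return new int[] {i, lo};
    }
}
```

With the records `car, card, care, cat, dog, door`, the prefix `car` gives the range [0, 3) and the prefix `cb` gives the empty range [4, 4).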
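public int rank(String s) throws IOException
Returns the rank r of the smallest record in the current TextDB that is larger than or equal to s.
rank in interface SearchableDB
Parameters:
s - the string to be ranked
Throws:
IOException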
public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB.
size in class TextDB
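public String get(int record) throws IOException
Description copied from class: TextDB
Returns the record at a given position in the range [0, N-1], where N is the number of records in the TextDB.
get in class TextDB
Parameters:
record - a position in the range [0, N-1]
Throws:
IOException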
public String[] getRange(int i, int j) throws IOException
Description copied from class: TextDB
Returns the records at positions i through j in the TextDB.
getRange in class TextDB
Parameters:
i - the starting position of the records to retrieve (inclusive)
j - the ending position of the records to retrieve (inclusive)
Throws:
IOException
public void getRange(int i, int j, int field, BufferedWriter out) throws IOException
Description copied from class: TextDB
Writes to the given BufferedWriter the specified field of the records in the range [i,j].
If the field is not present, an empty line is written.
getRange in class TextDB
Parameters:
i - the starting position of the records to fetch (inclusive)
j - the ending position of the records to fetch (inclusive)
field - the position (counting from 0) of the field to return for all the records in the range, or -1 to retrieve the entire record
out - the output BufferedWriter
Throws:
IOException
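public String[] getSequential(int[] records) throws IOException
Description copied from class: TextDB
Given a sorted array of record positions, returns the records at those positions.
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
Throws:
IOException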
public String[] getSequential(int[] records, int pos, int length) throws IOException
Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length,
this method returns the records at those positions.
The subrange runs from records[pos] (included) to records[pos+length] (excluded).
getSequential in class TextDB
Parameters:
records - an array containing a sorted subrange of record positions
pos - the starting position of the subrange
length - the length of the subrange
Throws:
IOException
public void getSequential(int[] records, int field, BufferedWriter out) throws IOException
Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
It relies on TextDB.getField(String, int), provided by
this abstract class, which selects a field of a record through sequential access
to the record itself. The use of a more efficient implementation of this function
is encouraged.
getSequential in class TextDB
Parameters:
records - a sorted array of record positions
field - the position of the field to extract, or -1 to dump all fields
out - the output BufferedWriter
Throws:
IOException
public static void main(String[] args) throws Exception
Throws:
Exception