java.lang.Object
  it.unipi.di.textdb.TextDB
    it.unipi.di.textdb.RSHuffword
public class RSHuffword
extends TextDB
A TextDB
that compresses the source file with the Huffword
technique and accesses it through Rank and Select operations.
The compressed file is accompanied by bit arrays that mark the boundaries of the fields and the records.
At query time, Rank and Select operations over these bit vectors locate the
positions of the requested fields/records, which are then uncompressed and returned.
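The boundary lookup described above can be sketched as follows. This is only an illustration, not the class's actual implementation: the class name `RankSelectSketch` and its methods are hypothetical, rank and select are computed naively over a `java.util.BitSet`, and a plain uncompressed string stands in for the compressed stream.

```java
import java.util.BitSet;

/** Naive rank/select over a bit vector marking record starts (illustrative only). */
final class RankSelectSketch {
    private final BitSet marks; // bit p is set when a record starts at position p

    RankSelectSketch(BitSet marks) {
        this.marks = marks;
    }

    /** Hypothetical helper: builds the bit vector from a list of start positions. */
    static RankSelectSketch fromStarts(int... starts) {
        BitSet bits = new BitSet();
        for (int s : starts) {
            bits.set(s);
        }
        return new RankSelectSketch(bits);
    }

    /** rank(p): number of set bits in positions [0, p). */
    int rank(int p) {
        return marks.get(0, p).cardinality();
    }

    /** select(k): position of the k-th set bit, counting from 0, or -1 if absent. */
    int select(int k) {
        int pos = marks.nextSetBit(0);
        while (pos >= 0 && k > 0) {
            pos = marks.nextSetBit(pos + 1);
            k--;
        }
        return pos;
    }

    /** Index of the record that contains position p (assumes a record starts at 0). */
    int recordIndexAt(int p) {
        return rank(p + 1) - 1;
    }

    /** Extracts record r from data; assumes 0 <= r < number of records. */
    String record(String data, int r) {
        int start = select(r);
        int next = select(r + 1);
        int end = next < 0 ? data.length() : next;
        return data.substring(start, end);
    }

    public static void main(String[] args) {
        RankSelectSketch rs = fromStarts(0, 5, 9);
        System.out.println(rs.record("alphabetagamma", 1));
    }
}
```

A production structure would answer rank and select in constant or near-constant time with precomputed directories; the linear scans here only show what the two operations return.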
The Huffword dictionary is built using a Tokenizer
instance to split
the source text into terms.
Each term is assigned a codeword that depends on its frequency in the text.
A Huffman prefix-free code is then used to encode those codewords.
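A minimal sketch of the dictionary construction described above, under stated assumptions: whitespace splitting stands in for the Tokenizer, the code is a plain binary Huffman code over terms, and the class name `HuffwordSketch` is hypothetical. The actual codeword layout used by RSHuffword is not documented here and may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Assigns a binary Huffman code to each whitespace-separated term (illustrative only). */
final class HuffwordSketch {
    private static final class Node {
        final String term; // null for internal nodes
        final int freq;
        final Node left;
        final Node right;

        Node(String term, int freq, Node left, Node right) {
            this.term = term;
            this.freq = freq;
            this.left = left;
            this.right = right;
        }
    }

    /** Returns a term-to-codeword map; assumes text holds at least one term. */
    static Map<String, String> codes(String text) {
        // Count term frequencies (whitespace split is an assumption, not the real Tokenizer).
        Map<String, Integer> freq = new HashMap<>();
        for (String term : text.trim().split("\\s+")) {
            freq.merge(term, 1, Integer::sum);
        }
        // Repeatedly merge the two least frequent nodes into one subtree.
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.freq, b.freq));
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            queue.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(null, a.freq + b.freq, a, b));
        }
        Map<String, String> out = new HashMap<>();
        assign(queue.poll(), "", out);
        return out;
    }

    /** Walks the tree, appending 0 for a left branch and 1 for a right branch. */
    private static void assign(Node node, String prefix, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.term != null) {
            out.put(node.term, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(node.left, prefix + "0", out);
        assign(node.right, prefix + "1", out);
    }

    public static void main(String[] args) {
        System.out.println(codes("to be or not to be to"));
    }
}
```

More frequent terms end up closer to the root and so receive shorter codewords, and no codeword is a prefix of another, which is what lets a decoder read the stream without separators.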
Field Summary | |
---|---|
static int |
DEFAULT_BUCKET_LENGTH
|
Fields inherited from class it.unipi.di.textdb.TextDB |
---|
DEFAULT_FIELD_SEPARATOR, fieldSeparator, filename |
Constructor Summary | |
---|---|
RSHuffword(String filename)
Create a new RSHuffword object loading the needed data structures from the provided file. |
Method Summary | |
---|---|
TextDB |
build(String outfile,
PrintStream log)
Compress with a standard bucketed huffword technique. |
static TextDB |
build(Tokenizer tokenizer,
String inputfile,
String outfile,
PrintStream log,
boolean withFields,
char separator)
Compress the input file with the bucketed huffword technique using customized parameters. |
void |
close()
Closes the TextDB and releases all of its resources. |
String |
get(int record)
Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB. |
String |
get(int record,
int field)
Returns the field of a record, given their ordinal positions, or null if one is not present. |
String[] |
getRange(int i,
int j)
Returns the records having positions from i to j in the TextDB. |
void |
getRange(int i,
int j,
int field,
BufferedWriter out)
Writes to the passed BufferedWriter the specified field for the records in the range [i,j]. |
String[] |
getSequential(int[] records)
Given a sorted array of record positions, this method returns all of them. |
void |
getSequential(int[] records,
int field,
BufferedWriter out)
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records. |
String[] |
getSequential(int[] records,
int pos,
int length)
Given an array of record positions containing a sorted subrange defined by the parameters pos and length,
this method returns the records for such positions. |
static void |
main(String[] args)
|
void |
open()
Opens the TextDB. |
int |
size()
Returns the number of records contained in this TextDB. |
Methods inherited from class it.unipi.di.textdb.TextDB |
---|
build, fromTDBFile, getField, getFieldValues, getName, getRange, getRecordFields, getSequential, setFieldSeparator |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_BUCKET_LENGTH
Constructor Detail |
---|
public RSHuffword(String filename)
filename
- the TDB file containing the needed data structures and the compressed content
Method Detail |
---|
public TextDB build(String outfile, PrintStream log) throws IOException
Compresses with a standard bucketed huffword technique, using a TermTokenizer
to build the huffword dictionary.
Use build(Tokenizer, String, String, PrintStream, boolean, char)
to compress with customized parameters.
build
in class TextDB
log
- a PrintStream on which to print log messages. A null value suppresses all output messages
outfile
- the output file name
IOException
public static TextDB build(Tokenizer tokenizer, String inputfile, String outfile, PrintStream log, boolean withFields, char separator) throws IOException
tokenizer
- the tokenizer used to parse the input file
inputfile
- the input file name
log
- a PrintStream on which to print log messages. A null value suppresses all output messages
IOException
public void open() throws IOException
Description copied from class: TextDB
Opens the TextDB.
open
in class TextDB
IOException
public void close() throws IOException
Description copied from class: TextDB
Closes the TextDB and releases all of its resources.
close
in class TextDB
IOException
public String get(int record) throws IOException
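Description copied from class: TextDB
Returns the record for a given position in the range [0, N-1], where N is the number of records present in the TextDB.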
get
in class TextDB
record
- a position in the range [0, N-1]
IOException
public String get(int record, int field) throws IOException
get
in class TextDB
record
- the position of a record
field
- the position of the field to be retrieved
IOException
public String[] getRange(int i, int j) throws IOException
Description copied from class: TextDB
Returns the records having positions from i to j in the TextDB.
getRange
in class TextDB
i
- the starting position of the records to retrieve (inclusive)
j
- the ending position of the records to retrieve (inclusive)
IOException
public void getRange(int i, int j, int field, BufferedWriter out) throws IOException
Description copied from class: TextDB
Writes to the passed BufferedWriter
the specified field for the records in the range [i,j].
If the field is not present, an empty line is written.
getRange
in class TextDB
i
- the starting position of the records to be fetched (inclusive)
j
- the ending position of the records to be fetched (inclusive)
field
- the position (counting from 0) of the field to return for all the records in range, or -1 to retrieve the entire record
out
- the output BufferedWriter
IOException
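The documented behaviour of this method (inclusive range, field position counted from 0, -1 for the whole record, an empty line for a missing field) can be illustrated with a small in-memory sketch. The class `FieldRangeSketch`, the list-backed store and the explicit `separator` argument are assumptions made for the example and are not part of the TextDB API.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** In-memory stand-in showing the documented getRange(i, j, field, out) semantics. */
final class FieldRangeSketch {
    static void writeRange(List<String> records, int i, int j, int field,
                           char separator, BufferedWriter out) throws IOException {
        for (int r = i; r <= j; r++) { // both ends of the range are inclusive
            String record = records.get(r);
            String value;
            if (field == -1) {
                value = record; // -1 requests the entire record
            } else {
                // Limit -1 keeps trailing empty fields instead of dropping them.
                String[] fields = record.split(Pattern.quote(String.valueOf(separator)), -1);
                // A missing field yields an empty line.
                value = field < fields.length ? fields[field] : "";
            }
            out.write(value);
            out.newLine();
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        List<String> records = Arrays.asList("a\tb\tc", "d", "e\tf");
        writeRange(records, 0, 2, 1, '\t', new BufferedWriter(sink));
        System.out.print(sink);
    }
}
```

The real class resolves field boundaries on the compressed data rather than by splitting an uncompressed string.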
public String[] getSequential(int[] records) throws IOException
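Description copied from class: TextDB
Given a sorted array of record positions, this method returns all of them.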
getSequential
in class TextDB
records
- a sorted array of record positions
IOException
public String[] getSequential(int[] records, int pos, int length) throws IOException
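Description copied from class: TextDB
Given an array of record positions containing a sorted subrange defined by the parameters pos and length,
this method returns the records for such positions.
The subrange runs from records[pos]
(included) to records[pos+length]
(excluded).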
getSequential
in class TextDB
records
- array with a sorted subrange of record positions
pos
- the starting position of the subrange
length
- the length of the subrange
IOException
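The half-open subrange convention, records[pos] included and records[pos+length] excluded, can be illustrated as follows. The class `SequentialSketch` and the list-backed store are hypothetical stand-ins; the sketch mirrors only the indexing, not how RSHuffword reads its compressed file.

```java
import java.util.Arrays;
import java.util.List;

/** In-memory stand-in showing the subrange convention of getSequential(records, pos, length). */
final class SequentialSketch {
    static String[] getSequential(List<String> store, int[] records, int pos, int length) {
        String[] result = new String[length];
        int previous = -1;
        for (int k = 0; k < length; k++) {
            int position = records[pos + k]; // covers records[pos] .. records[pos + length - 1]
            if (position < previous) {
                throw new IllegalArgumentException("subrange is not sorted");
            }
            result[k] = store.get(position);
            previous = position;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> store = Arrays.asList("a", "b", "c", "d", "e");
        int[] records = {9, 1, 3, 4, 0}; // only indexes 1..3 form the sorted subrange
        System.out.println(Arrays.toString(getSequential(store, records, 1, 3)));
    }
}
```

Only the subrange has to be sorted; entries outside it are never read, so they may hold any value.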
public void getSequential(int[] records, int field, BufferedWriter out) throws IOException
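Description copied from class: TextDB
Given a sorted array of record positions and the position of a field, this method retrieves the specified field from those records.
The implementation in TextDB relies on TextDB.getField(String, int),
provided by
this abstract class, which selects a field of a record through sequential access
to the record itself. The use of a more efficient implementation of this function
is encouraged.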
getSequential
in class TextDB
records
- a sorted array of record positions
field
- the position of the field to extract, or -1 to dump all fields
out
- the output BufferedWriter
IOException
public int size()
Description copied from class: TextDB
Returns the number of records contained in this TextDB.
size
in class TextDB
public static void main(String[] args) throws Exception
Exception