it.unipi.di.util
Class ExternalSort

java.lang.Object
  extended by it.unipi.di.util.ExternalSort

public class ExternalSort
extends Object

A multi-way external merge sort that use a TreeMap (Java Red-Black Tree) as internal sorting data structure. This implementation is 100% pure Java and is able to scale over GBs of data better than the unix sort command.

This class has a main method so that it can be used via command line in a easy way. Being a standard Java class it can be instantiated and configured by means of its "set" methods. The sort process can be started by the run() method. By default this class reads from stdin and write in stdout. This behavior can be changed using the setInFile(String) and the setOutFile(String) method.

Author:
Claudio Corsi, Paolo Ferragina

Nested Class Summary
 class ExternalSort.ReverseComparator
          A reverse comparator.
protected  class ExternalSort.SortingKey
          Used to compare two strings by their sorting columns (aka sorted keys).
protected  class ExternalSort.Tuple
          Used to store all the Strings associated to a sorting key with their run's ids.
 
Field Summary
protected  it.unimi.dsi.mg4j.util.MutableString buff
           
protected  int[] columns
           
protected  String currKey
           
static int DEFAULT_PAGE_SIZE
           
static long DEFAULT_RUN_SIZE
           
protected  boolean dist
           
protected  long elapsedSecs
           
protected  boolean EOF
           
protected  boolean extract
           
protected  InputStream in
           
protected  String infile
           
protected  long numberOfDumpedRows
           
protected  long numberOfInputRows
           
protected  boolean numeric
           
protected  PrintStream out
           
protected  String outfile
           
protected  int pageSize
           
protected  String prevKey
           
protected  boolean reverse
           
protected  long rowsCount
           
protected  long runSize
           
protected  HashMap<Long,Reader> runsMap
           
protected  char sep
           
protected  boolean uniq
           
protected  boolean verbose
           
static String VERSION
           
 
Constructor Summary
ExternalSort()
          Create a new ExternalSort.
 
Method Summary
protected  File createSortedRun(List<ExternalSort.SortingKey> chunk)
           
protected  void dumpSortedRows()
           
protected  void initDataStructure()
           
protected  void loadNextPage(long runNumber)
           
static void main(String[] args)
           
protected static void printUsage()
           
 void run()
          Start the sorting process.
 void setColumns(int[] columns)
          Set the columns to sort.
 void setExtract(boolean extract)
          If true, dump out only the sorting column(s) omitting the other ones.
 void setInFile(String infile)
          Set the input file to sort.
 void setKeysDistribution(boolean dist)
          Instead to sort, dump the frequencies of the sorting keys in their sorted order (not in the frequency values order).
 void setNumeric(boolean numeric)
          Compare the sorting values (rows or columns) as numerical values.
 void setOutFile(String outfile)
          Set the output file.
 void setPageSize(int pageSize)
          Set the page size to use in the second stage of the algorithm (pagination of the sorted runs).
 void setReverse(boolean reverse)
          Set true to sort in reverse order (ascendent instead of descendent).
 void setRunSize(long runSize)
          Set the size of the chunk (run) of text to sort in memory at the first stage of the algorithm.
 void setSeparator(char sep)
          Set the character to use to split the rows in columns.
 void setUniq(boolean uniq)
          Remove duplicates from the result.
 void setVerbose(boolean verbose)
          Set true to have log messages on stdout during the sorting process.
protected  void updateProgressInfos(long start)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_RUN_SIZE

public static final long DEFAULT_RUN_SIZE

DEFAULT_PAGE_SIZE

public static final int DEFAULT_PAGE_SIZE
See Also:
Constant Field Values

VERSION

public static final String VERSION
See Also:
Constant Field Values

verbose

protected boolean verbose

runsMap

protected HashMap<Long,Reader> runsMap

in

protected InputStream in

out

protected PrintStream out

outfile

protected String outfile

infile

protected String infile

columns

protected int[] columns

sep

protected char sep

runSize

protected long runSize

pageSize

protected int pageSize

elapsedSecs

protected long elapsedSecs

numberOfDumpedRows

protected long numberOfDumpedRows

numberOfInputRows

protected long numberOfInputRows

reverse

protected boolean reverse

numeric

protected boolean numeric

uniq

protected boolean uniq

currKey

protected String currKey

dist

protected boolean dist

prevKey

protected String prevKey

rowsCount

protected long rowsCount

EOF

protected boolean EOF

buff

protected it.unimi.dsi.mg4j.util.MutableString buff

extract

protected boolean extract
Constructor Detail

ExternalSort

public ExternalSort()
Create a new ExternalSort.

Method Detail

setVerbose

public void setVerbose(boolean verbose)
Set true to have log messages on stdout during the sorting process.

Parameters:
verbose -

setReverse

public void setReverse(boolean reverse)
Set true to sort in reverse order (ascendent instead of descendent).

Parameters:
reverse -

setNumeric

public void setNumeric(boolean numeric)
Compare the sorting values (rows or columns) as numerical values. In other words this flag will cause the right padding of the sorting values.

Parameters:
numeric -

setKeysDistribution

public void setKeysDistribution(boolean dist)
Instead to sort, dump the frequencies of the sorting keys in their sorted order (not in the frequency values order).

Parameters:
dist -

setUniq

public void setUniq(boolean uniq)
Remove duplicates from the result. In case of equals sorting values (rows or columns) only one of these is kept.

Parameters:
uniq -

setRunSize

public void setRunSize(long runSize)
Set the size of the chunk (run) of text to sort in memory at the first stage of the algorithm. This is the maximum size of memory available to sort. If not set the default value is 50MB.

Parameters:
runSize - the size of memory available expressed in bytes

setPageSize

public void setPageSize(int pageSize)
Set the page size to use in the second stage of the algorithm (pagination of the sorted runs). This value should be chosen depending on the disk page size. The default value is 100KB.

Parameters:
pageSize - the page size expressed in bytes.

setColumns

public void setColumns(int[] columns)
Set the columns to sort. If not specified then the input file will be sorted considering the entire content of the rows (divided by new lines). If specified each row will be divided into columns and sorted accordingly to the value of the specified rows. The order of the columns matter. For example if the columns list is [1, 0, 5] then for first the column 1 of each row is compared. In case of the same value the column 0 is compared and, at the end, the column 5. Two rows are considered equals if all the specified columns contain the same value

Parameters:
columns - the list of columns to sort
See Also:
setUniq(boolean), setSeparator(char)

setSeparator

public void setSeparator(char sep)
Set the character to use to split the rows in columns. Default value is tab ('\t').

Parameters:
sep -

setExtract

public void setExtract(boolean extract)
If true, dump out only the sorting column(s) omitting the other ones. The dumped column(s) will be sorted respecting the sorting parameters. If no columns are selected (sort by the entire rows) this option doesn't have effect.

Parameters:
extract - true to dump out only the sorting column(s)

setOutFile

public void setOutFile(String outfile)
                throws FileNotFoundException
Set the output file. By default the result is written in stdout.

Parameters:
outfile -
Throws:
FileNotFoundException

setInFile

public void setInFile(String infile)
               throws FileNotFoundException
Set the input file to sort. By default is the stdin.

Parameters:
infile -
Throws:
FileNotFoundException

updateProgressInfos

protected void updateProgressInfos(long start)

run

public void run()
         throws IOException
Start the sorting process. This method can take much time to complete.

Throws:
IOException

createSortedRun

protected File createSortedRun(List<ExternalSort.SortingKey> chunk)
                        throws IOException
Throws:
IOException

initDataStructure

protected void initDataStructure()

loadNextPage

protected void loadNextPage(long runNumber)
                     throws IOException
Throws:
IOException

dumpSortedRows

protected void dumpSortedRows()
                       throws IOException
Throws:
IOException

printUsage

protected static void printUsage()

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception