A command-line tool for transforming datasets. It contains import/export functionality, unsupervised algorithms, and other useful transforms that you may wish to perform on a dataset. Here's the usage information:
Full Usage Information

[Square brackets] are used to indicate required arguments.
<Angled brackets> are used to indicate optional arguments.

waffles_transform [command]
  Transform data, shuffle rows, swap columns, matrix operations, etc.
  add [dataset1] [dataset2]
    Adds two matrices together element-wise. Results are printed to stdout.
    [dataset1]
      The filename of the first matrix.
    [dataset2]
      The filename of the second matrix.
  addindexcolumn [dataset] <options>
    Add a column that specifies the index of each row. This column will be inserted as column 0. (For example, suppose you would like to plot the values in each column of your data against the row index. Most plotting tools expect one of the columns to supply the position on the horizontal axis. This feature will create such a column for you.)
    [dataset]
      The filename of a dataset.
    <options>
      -start [value]
        Specify the initial index. (The default is 0.0.)
      -increment [value]
        Specify the increment amount. (The default is 1.0.)
  addnoise [dataset] [dev] <options>
    Add Gaussian noise with the specified deviation to all the elements in the dataset. (Assumes that the values are all continuous.)
    [dataset]
      The filename of a dataset.
    [dev]
      The deviation of the Gaussian noise.
    <options>
      -seed [value]
        Specify a seed for the random number generator.
      -excludelast [n]
        Do not add noise to the last [n] columns.
  aggregatecols [n]
    Make a matrix by aggregating each column [n] from the .arff files in the current directory. The resulting matrix is printed to stdout.
  aggregaterows [n]
    Make a matrix by aggregating each row [n] from the .arff files in the current directory. The resulting matrix is printed to stdout.
  align [a] [b]
    Translates and rotates dataset [b] to minimize mean squared difference with dataset [a]. (Uses the Kabsch algorithm.)
    [a]
      The filename of a dataset.
    [b]
      The filename of a dataset.
  autocorrelation [dataset]
    Compute the autocorrelation of the specified time-series data.
  cholesky [dataset]
    Compute the Cholesky decomposition of the specified matrix.
  correlation [dataset] [attr1] [attr2] <options>
    Compute the linear correlation coefficient of the two specified attributes.
    [dataset]
      The filename of a dataset.
    [attr1]
      A zero-indexed attribute number.
    [attr2]
      A zero-indexed attribute number.
    <options>
      -aboutorigin
        Compute the correlation about the origin. (The default is to compute it about the mean.)
  cumulativecolumns [dataset] [column-list]
    Accumulates the values in the specified columns. For example, a column that contains the values 2,1,3,2 would be changed to 2,3,6,8. This might be useful for converting a histogram of some distribution into a histogram of the cumulative distribution.
    [dataset]
      The filename of a dataset.
    [column-list]
      A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7
  determinant [dataset]
    Compute the determinant of the specified matrix.
  discretize [dataset] <options>
    Discretizes the continuous attributes in the specified dataset.
    [dataset]
      The filename of a dataset.
    <options>
      -buckets [count]
        Specify the number of buckets to use. If not specified, the default is to use the square root of the number of rows in the dataset.
      -colrange [first] [last]
        Specify a range of columns. Only continuous columns in the specified range will be modified. (Columns are zero-indexed.)
  dropcolumns [dataset] [column-list]
    Remove one or more columns from a dataset and print the results to stdout. (The input file is not modified.)
    [dataset]
      The filename of a dataset.
    [column-list]
      A comma-separated list of zero-indexed columns to drop. A hyphen may be used to specify a range of columns. A '*' preceding a value means to index from the right instead of the left. For example, "0,2-5" refers to columns 0, 2, 3, 4, and 5. "*0" refers to the last column. "0-*1" refers to all but the last column.
  dropmissingvalues [dataset]
    Remove all rows that contain missing values.
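The [column-list] syntax described above can be illustrated with a short Python sketch. This is not the tool's own parser. The function names parse_column_list and cumulative_columns are hypothetical, and the sketch assumes that "*k" resolves to the column k positions from the right, which is how the dropcolumns examples read. Only dropcolumns documents the '*' prefix; whether the other commands accept it is not stated.

```python
from itertools import accumulate


def parse_column_list(spec, num_columns):
    """Expand a column-list such as "0,2-5" or "0-*1" into explicit indexes.

    Hypothetical helper: assumes "*k" means "k columns from the right".
    """

    def resolve(token):
        token = token.strip()
        if token.startswith("*"):
            index = num_columns - 1 - int(token[1:])
        else:
            index = int(token)
        if not 0 <= index < num_columns:
            raise ValueError("column %r is out of range" % token)
        return index

    columns = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = resolve(first), resolve(last)
            if start > end:
                raise ValueError("range %r is reversed" % part)
            columns.extend(range(start, end + 1))
        else:
            columns.append(resolve(part))
    return columns


def cumulative_columns(rows, columns):
    """Return a copy of rows with running totals in the given columns."""
    result = [list(row) for row in rows]
    for c in columns:
        for r, total in enumerate(accumulate(row[c] for row in rows)):
            result[r][c] = total
    return result


print(parse_column_list("0,2-5", 8))   # [0, 2, 3, 4, 5]
print(parse_column_list("*0", 8))      # [7]
print(parse_column_list("0-*1", 4))    # [0, 1, 2]
print(cumulative_columns([[2], [1], [3], [2]], [0]))  # [[2], [3], [6], [8]]
```

The last call reproduces the cumulativecolumns example from the usage text: the column 2,1,3,2 becomes 2,3,6,8.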
  droprandomvalues [dataset] [portion] <options>
    Drop random values from the specified dataset. The resulting dataset with missing values is printed to stdout.
    [dataset]
      The filename of a dataset.
    [portion]
      The portion of the data to drop. For example, if [portion] is 0.1, then 10% of the values will be replaced with unknown values.
    <options>
      -seed [value]
        Specify a seed for the random number generator.
  export [dataset] <options>
    Print the data as a list of comma-separated values without any meta-data.
    [dataset]
      The filename of a dataset.
    <options>
      -tab
        Separate with tabs instead of commas.
      -space
        Separate with spaces instead of commas.
  droprows [dataset] [after-size]
    Removes all rows except for the first [after-size] rows.
    [dataset]
      The filename of a dataset.
    [after-size]
      The number of rows to keep.
  fillmissingvalues [dataset] <options>
    Replace all missing values in the dataset. (Note that the fillmissingvalues command in the waffles_recommend tool performs a similar task, but it can intelligently predict the missing values instead of just using the baseline value.)
    [dataset]
      The filename of a dataset.
    <options>
      -seed [value]
        Specify a seed for the random number generator.
      -random
        Replace each missing value with a randomly chosen non-missing value from the same attribute. (The default is to use the baseline value. That is, the mean for continuous attributes, and the most-common value for nominal attributes.)
  import [dataset] <options>
    Convert a text file of comma-separated (or otherwise separated) values to a .arff file. The meta-data is automatically determined. The .arff file is printed to stdout. This makes it easy to operate on structured data from a spreadsheet, database, or pretty much any other source.
    [dataset]
      The filename of a dataset.
    <options>
      -tab
        Data elements are separated with a tab character instead of a comma.
      -space
        Data elements are separated with a single space instead of a comma.
      -whitespace
        Data elements are separated with an arbitrary amount of whitespace.
      -semicolon
        Data elements are separated with semicolons.
      -separator [char]
        Data elements are separated with the specified character.
      -columnnames
        Use the first row of data for column names.
  enumeratevalues [dataset] [col]
    Enumerates all of the unique values in the specified column, and replaces each value with its enumeration. (For example, if you have a column that contains the social-security-number of each user, this will change them to numbers from 0 to n-1, where n is the number of unique users.)
    [dataset]
      The filename of a dataset.
    [col]
      The column index (starting with 0) to enumerate.
  measuremeansquarederror [dataset1] [dataset2] <options>
    Print the mean squared error between two datasets. (Both datasets must be the same size.)
    [dataset1]
      The filename of a dataset.
    [dataset2]
      The filename of a dataset.
    <options>
      -fit
        Use a hill-climber to find an affine transformation to make dataset2 fit as closely as possible to dataset1. Report results after each iteration.
      -sum
        Sum the mean-squared error over each attribute and only report this sum. (The default is to report the mean-squared error in each attribute.)
  mergehoriz [dataset1] [dataset2]
    Merge two (or more) datasets horizontally. All datasets must already have the same number of rows. The resulting dataset will have all the columns of both datasets.
    [dataset1]
      The filename of a dataset.
    [dataset2]
      The filename of a dataset.
  mergevert [dataset1] [dataset2]
    Merge two datasets vertically. Both datasets must already have the same number of columns. The resulting dataset will have all the rows of both datasets.
    [dataset1]
      The filename of a dataset.
    [dataset2]
      The filename of a dataset.
  multiply [a] [b] <options>
    Matrix multiply [a] x [b]. Both arguments are the filenames of .arff files. Results are printed to stdout.
    [a]
      The filename of a dataset.
    [b]
      The filename of a dataset.
    <options>
      -transposea
        Transpose [a] before multiplying.
      -transposeb
        Transpose [b] before multiplying.
  multiplyscalar [dataset] [scalar]
    Multiply all elements in [dataset] by the specified scalar. Results are printed to stdout.
    [dataset]
      The filename of a dataset.
    [scalar]
      A scalar to multiply each element by.
  normalize [dataset] <options>
    Normalize all continuous attributes to fall within the specified range. (Nominal columns are left unchanged.)
    [dataset]
      The filename of a dataset.
    <options>
      -range [min] [max]
        Specify the output min and max values. (The default is 0 1.)
  nominaltocat [dataset] <options>
    Convert all nominal attributes in the data to vectors of real values by representing them as a categorical distribution. Columns with only two nominal values are converted to 0 or 1. If there are three or more possible values, a column is created for each value. The column corresponding to the value is set to 1, and the others are set to 0. (This is similar to Weka's NominalToBinaryFilter.)
    [dataset]
      The filename of a dataset.
    <options>
      -maxvalues [cap]
        Specify the maximum number of nominal values for which to create new columns. If not specified, the default is 12.
  powercolumns [dataset] [column-list] [exponent]
    Raises the values in the specified columns to some power (or exponent).
    [dataset]
      The filename of a dataset.
    [column-list]
      A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7
    [exponent]
      An exponent value, such as 0.5, 2, etc.
  pseudoinverse [dataset]
    Compute the Moore-Penrose pseudo-inverse of the specified matrix of real values.
  reducedrowechelonform [dataset]
    Convert a matrix to reduced row echelon form. Results are printed to stdout.
  rotate [dataset] [col_x] [col_y] [angle_degrees]
    Rotate [angle_degrees] degrees around the origin in the col_x,col_y plane. Only affects the values in col_x and col_y.
    [dataset]
      The filename of a dataset.
    [col_x]
      The zero-based index of an attribute to serve as the x coordinate in the plane of rotation.
      Rotation from x to y will be 90 degrees. col_x must be a real-valued attribute.
    [col_y]
      The zero-based index of an attribute to serve as the y coordinate in the plane of rotation. Rotation from y to x will be 270 degrees. col_y must be a real-valued attribute.
    [angle_degrees]
      The angle in degrees to rotate around the origin in the col_x,col_y plane.
  samplerows [dataset] [portion]
    Samples from the rows in the specified dataset and prints them to stdout. This tool reads each row one-at-a-time, so it is well-suited for reducing the size of datasets that are too big to fit into memory. (Note that unlike most other tools, this one does not convert CSV to ARFF format internally. If the input is CSV, the output will be CSV too.)
    [dataset]
      The filename of a dataset. ARFF, CSV, and a few other formats are supported.
    [portion]
      A value between 0 and 1 that specifies the likelihood that each row will be printed to stdout.
    <options>
      -seed [value]
        Specify a seed for the random number generator.
  scalecolumns [dataset] [column-list] [scalar]
    Multiply the values in the specified columns by a scalar.
    [dataset]
      The filename of a dataset.
    [column-list]
      A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7
    [scalar]
      A scalar value.
  shiftcolumns [dataset] [column-list] [offset]
    Add [offset] to all of the values in the specified columns.
    [dataset]
      The filename of a dataset.
    [column-list]
      A comma-separated list of zero-indexed columns to transform. A hyphen may be used to specify a range of columns. Example: 0,2-5,7
    [offset]
      A positive or negative value to add to the values in the specified columns.
  shuffle [dataset] <options>
    Shuffle the row order.
    [dataset]
      The filename of a dataset.
    <options>
      -seed [value]
        Specify a seed for the random number generator.
  significance [dataset] [attr1] [attr2] <options>
    Compute statistical significance values for the two specified attributes.
    [dataset]
      The filename of a .arff file.
    [attr1]
      A zero-indexed column number.
    [attr2]
      A zero-indexed column number.
    <options>
      -tol [value]
        Sets the tolerance value for the Wilcoxon Signed Ranks test. The default value is 0.001.
  sortcolumn [dataset] [col] <options>
    Sort the rows in [dataset] such that the values in the specified column are in ascending order and print the results to stdout. (The input file is not modified.)
    [dataset]
      The filename of a dataset.
    [col]
      The zero-indexed column number by which to sort.
    <options>
      -descending
        Sort in descending order instead of ascending order.
  split [dataset] [rows] [filename1] [filename2] <options>
    Split a dataset into two datasets. (Nothing is printed to stdout.)
    [dataset]
      The filename of a dataset.
    [rows]
      The number of rows to go into the first file. The rest go in the second file.
    <options>
      -seed [value]
        Specify a seed for the random number generator.
      -shuffle
        Shuffle the input data before splitting it.
    [filename1]
      The filename for one half of the data.
    [filename2]
      The filename for the other half of the data.
  splitclass [data] [attr] <options>
    Splits a dataset by a class attribute, such that a separate file is created for each unique class label. The generated filenames will be "[data]_[value]", where [value] is the unique class label value.
    [data]
      The filename of a dataset.
    [attr]
      The zero-indexed column number of the class attribute.
    <options>
      -dropclass
        Drop the class attribute after splitting the data. (The default is to include the class attribute in each of the output datasets, which is rather redundant since every row in the file will have the same class label.)
  splitfold [dataset] [i] [n] <options>
    Divides a dataset into [n] parts of approximately equal size, then puts part [i] into one file, and the other [n]-1 parts in another file. (This tool may be useful, for example, to implement n-fold cross validation.)
    [dataset]
      The filename of a dataset.
    [i]
      The (zero-based) index of the fold, or the part to put into the training set.
      [i] must be less than [n].
    [n]
      The number of folds.
    <options>
      -out [train_filename] [test_filename]
        Specify the filenames for the training and test portions of the data. The default values are train.arff and test.arff.
  squareddistance [a] [b]
    Computes the sum and mean squared distance between dataset [a] and [b]. ([a] and [b] are each the names of files in .arff format. They must have the same dimensions.)
    [a]
      The filename of a dataset.
    [b]
      The filename of a dataset.
  svd [matrix] <options>
    Compute the singular value decomposition of a matrix.
    [matrix]
      The filename of the matrix.
    <options>
      -ufilename [filename]
        Set the filename to which U will be saved. U is the matrix in which the columns are the eigenvectors of [matrix] times its transpose. The default is u.arff.
      -sigmafilename [filename]
        Set the filename to which Sigma will be saved. Sigma is the matrix that contains the singular values on its diagonal. All values in Sigma except the diagonal will be zero. If this option is not specified, the default is to only print the diagonal values (not the whole matrix) to stdout. If this option is specified, nothing is printed to stdout.
      -vfilename [filename]
        Set the filename to which V will be saved. V is the matrix in which the rows are the eigenvectors of the transpose of [matrix] times [matrix]. The default is v.arff.
      -maxiters [n]
        Specify the number of times to iterate before giving up. The default is 100, which should be sufficient for most problems.
  swapcolumns [dataset] [col1] [col2]
    Swap two columns in the specified dataset and print the results to stdout. (Columns are zero-indexed.)
    [dataset]
      The filename of a dataset.
    [col1]
      A zero-indexed column number.
    [col2]
      A zero-indexed column number.
  transition [action-sequence] [state-sequence] <options>
    Given a sequence of actions and a sequence of states (each in separate datasets), this generates a single dataset to map from action-state pairs to the next state.
    This would be useful for generating the data to train a transition function.
    <options>
      -delta
        Predict the delta of the state transition instead of the new state.
  threshold [dataset] [column] [threshold]
    Outputs a copy of dataset such that any value v in the given column becomes 0 if v <= threshold and 1 otherwise. Only works on continuous attributes.
    [dataset]
      The filename of a dataset.
    [column]
      The zero-indexed column number to threshold.
    [threshold]
      The threshold value.
  transpose [dataset]
    Transpose the data such that columns become rows and rows become columns.
  zeromean [dataset]
    Subtracts the mean from all values of all continuous attributes, so that their means in the result are zero. Leaves nominal attributes untouched.
  usage
    Print usage information.
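
To make the threshold and zeromean descriptions above concrete, here is a minimal Python sketch of the arithmetic they describe. It is not the tool's implementation. The function names threshold_column and zero_mean are hypothetical, the data is a plain list of rows rather than an .arff file, and every column is assumed to be continuous.

```python
def threshold_column(rows, column, threshold):
    """Return a copy of rows where the given column holds 0 if v <= threshold, else 1."""
    result = []
    for row in rows:
        new_row = list(row)
        new_row[column] = 0.0 if row[column] <= threshold else 1.0
        result.append(new_row)
    return result


def zero_mean(rows):
    """Return a copy of rows with each column's mean subtracted.

    Assumption: all columns are continuous (the tool leaves nominal ones untouched).
    """
    if not rows:
        return []
    num_cols = len(rows[0])
    means = [sum(row[c] for row in rows) / len(rows) for c in range(num_cols)]
    return [[row[c] - means[c] for c in range(num_cols)] for row in rows]


data = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
print(threshold_column(data, 0, 3.0))  # [[0.0, 10.0], [0.0, 20.0], [1.0, 30.0]]
print(zero_mean(data))                 # [[-2.0, -10.0], [0.0, 0.0], [2.0, 10.0]]
```

Note that a value exactly equal to the threshold maps to 0, matching the "v <= threshold" wording in the usage text.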