GClasses
|
Represents a matrix or a database table. Elements can be discrete or continuous. References a GRelation object, which stores the meta-information about each column.
#include <GMatrix.h>
Public Member Functions | |
GMatrix (size_t rows, size_t cols, GHeap *pHeap=NULL) | |
Construct a rows x cols matrix. All elements of the matrix are assumed to be continuous. (It is okay to initially set rows to 0 and later call newRow to add each row. Adding columns later, however, is not very computationally efficient.) | |
GMatrix (std::vector< size_t > &attrValues, GHeap *pHeap=NULL) | |
Construct a matrix with a mixed relation. That is, one with some continuous attributes (columns), and some nominal attributes (columns). attrValues specifies the number of nominal values supported in each attribute (column), or 0 for a continuous attribute. Initially, this matrix will have 0 rows, but you can add more rows by calling newRow or newRows. | |
GMatrix (sp_relation &pRelation, GHeap *pHeap=NULL) | |
pRelation is a smart-pointer to a relation, which specifies the type of each attribute (column) in the data set. Initially, this matrix will have 0 rows, but you can add more rows by calling newRow or newRows. | |
GMatrix (GDomNode *pNode, GHeap *pHeap=NULL) | |
Load from a DOM. | |
~GMatrix () | |
double * | newRow () |
Adds a new row to the dataset. (The values in the row are not initialized) | |
void | newRows (size_t nRows) |
Adds "nRows" uninitialized rows to the data set. | |
void | add (GMatrix *pThat, bool transpose) |
Matrix add. Adds the values in pThat to this. (If transpose is true, adds the transpose of pThat to this.) Both datasets must have the same dimensions. Behavior is undefined for nominal columns. | |
GMatrix * | attrSubset (size_t firstAttr, size_t attrCount) |
Returns a new dataset that contains a subset of the attributes in this dataset. | |
GMatrix * | cholesky () |
This computes the square root of this matrix. (If you take the matrix that this returns and multiply it by its transpose, you should get the original dataset again.) Behavior is undefined if there are nominal attributes. If this matrix is not positive definite, it will throw an exception. | |
GMatrix * | clone () |
Makes a deep copy of this dataset. | |
GMatrix * | cloneSub (size_t rowStart, size_t colStart, size_t rowCount, size_t colCount) |
Makes a deep copy of the specified rectangular region of this matrix. | |
void | col (size_t index, double *pOutVector) |
Copies the specified column into pOutVector. | |
size_t | cols () const |
Returns the number of columns in the dataset. | |
void | copy (GMatrix *pThat) |
Copies all the data from pThat. (Just references the same relation) | |
void | copyColumns (size_t nDestStartColumn, GMatrix *pSource, size_t nSourceStartColumn, size_t nColumnCount) |
Copies the specified block of columns from pSource to this dataset. pSource must have the same number of rows as this dataset. | |
void | copyRow (const double *pRow) |
Adds a copy of the row to the data set. | |
double | determinant () |
Computes the determinant of this matrix. | |
double | eigenValue (const double *pEigenVector) |
Computes the eigenvalue that corresponds to the specified eigenvector of this matrix. | |
void | eigenVector (double eigenvalue, double *pOutVector) |
Computes the eigenvector that corresponds to the specified eigenvalue of this matrix. Note that this method trashes this matrix, so make a copy first if you care. | |
bool | gaussianElimination (double *pVector) |
Computes y in the equation M*y=x (or y=M^(-1)x), where M is this dataset, which must be a square matrix, and x is pVector as passed in, and y is pVector after the call. If there are multiple solutions, it finds the one for which all the variables in the null-space have a value of 1. If there are no solutions, it returns false. Note that this method trashes this dataset (so make a copy first if you care). | |
GHeap * | heap () |
Returns the heap used to allocate rows for this dataset. | |
void | LUDecomposition () |
Performs an in-place LU-decomposition, such that the lower triangle of this matrix (including the diagonal) specifies L, and the upper triangle of this matrix (not including the diagonal) specifies U, and all values of U along the diagonal are ones. (The upper triangle of L and the lower triangle of U are all zeros.) | |
void | makeIdentity () |
Sets this dataset to an identity matrix. (It doesn't change the number of columns or rows. It just stomps over existing values.) | |
void | mirrorTriangle (bool upperToLower) |
If upperToLower is true, copies the upper triangle of this matrix over the lower triangle. If upperToLower is false, copies the lower triangle of this matrix over the upper triangle. | |
void | mergeVert (GMatrix *pData) |
Steals all the rows from pData and adds them to this set. (You still have to delete pData.) Both datasets must have the same number of columns. | |
GMatrix * | eigs (size_t nCount, double *pEigenVals, GRand *pRand, bool mostSignificant) |
Computes nCount eigenvectors and the corresponding eigenvalues using the power method. (This method is only accurate if a small number of eigenvalues/vectors are needed.) If mostSignificant is true, the largest eigenvalues are found. If mostSignificant is false, the smallest eigenvalues are found. | |
void | multiply (double scalar) |
Multiplies every element in the dataset by scalar. Behavior is undefined for nominal columns. | |
void | multiply (const double *pVectorIn, double *pVectorOut, bool transpose=false) |
Multiplies this matrix by the column vector pVectorIn to get pVectorOut. (If transpose is true, then it multiplies the transpose of this matrix by pVectorIn to get pVectorOut.) pVectorIn should have the same number of elements as columns (or rows if transpose is true) and pVectorOut should have the same number of elements as rows (or cols, if transpose is true.) Note that if transpose is true, it is the same as if pVectorIn is a row vector and you multiply it by this matrix to get pVectorOut. | |
GMatrix * | pseudoInverse () |
Computes the Moore-Penrose pseudoinverse of this matrix (using the SVD method). You are responsible to delete the matrix this returns. | |
sp_relation & | relation () |
Returns a relation object, which holds meta-data about the attributes (columns) | |
void | reserve (size_t n) |
Allocates space for the specified number of patterns (rows), to avoid superfluous resizing. | |
size_t | rows () const |
Returns the number of rows in the dataset. | |
void | saveArff (const char *szFilename) |
Saves the dataset to a file in ARFF format. | |
void | setRelation (sp_relation &pRelation) |
Sets the relation for this dataset. | |
void | singularValueDecomposition (GMatrix **ppU, double **ppDiag, GMatrix **ppV, bool throwIfNoConverge=false, size_t maxIters=80) |
Performs SVD on A, where A is this m-by-n matrix. *ppU will be set to an m-by-m matrix where the columns are the eigenvectors of A(A^T). *ppDiag will be set to an array of n doubles holding the square roots of the corresponding eigenvalues. *ppV will be set to an n-by-n matrix where the rows are the eigenvectors of (A^T)A. You are responsible to delete(*ppU), delete(*ppV), and delete[] *ppDiag. | |
void | subtract (GMatrix *pThat, bool transpose) |
Matrix subtract. Subtracts the values in pThat from this. (If transpose is true, subtracts the transpose of pThat from this.) Both datasets must have the same dimensions. Behavior is undefined for nominal columns. | |
double | sumSquaredDiffWithIdentity () |
Returns the sum squared difference between this matrix and an identity matrix. | |
void | takeRow (double *pRow) |
Adds an already-allocated row to this dataset. The row must be allocated in the same heap that this dataset uses. (There is no way for this method to verify that, so be careful.) | |
size_t | toReducedRowEchelonForm () |
Converts the matrix to reduced row echelon form. | |
void | toVector (double *pVector) |
Copies all the data from this dataset into pVector. pVector must be big enough to hold rows() x cols() doubles. | |
GDomNode * | serialize (GDom *pDoc) |
Marshals this object to a DOM, which may be saved to a variety of serial formats. | |
double | trace () |
Returns the sum of the diagonal elements. | |
GMatrix * | transpose () |
Returns a dataset that is this dataset transposed. (All columns in the returned dataset will be continuous.) | |
void | fromVector (const double *pVector, size_t nRows) |
Copies the data from pVector over this dataset. nRows specifies the number of rows of data in pVector. | |
double * | row (size_t index) |
Returns a pointer to the specified row. | |
double * | operator[] (size_t index) |
Returns a pointer to the specified row. | |
const double * | row (size_t index) const |
Returns a const pointer to the specified row. | |
const double * | operator[] (size_t index) const |
Returns a const pointer to the specified row. | |
void | setAll (double val) |
Sets all elements in this dataset to the specified value. | |
void | setCol (size_t index, const double *pVector) |
Copies pVector over the specified column. | |
void | swapRows (size_t a, size_t b) |
Swaps the two specified rows. | |
void | swapColumns (size_t nAttr1, size_t nAttr2) |
Swaps two columns. | |
void | deleteColumn (size_t index) |
Deletes a column. | |
double * | releaseRow (size_t index) |
Swaps the specified row with the last row, and then releases it from the dataset. If this dataset does not have its own heap, then you must delete the row this returns. | |
void | deleteRow (size_t index) |
Swaps the specified row with the last row, and then deletes it. | |
double * | releaseRowPreserveOrder (size_t index) |
Releases the specified row from the dataset and shifts everything after it up one slot. If this dataset does not have its own heap, then you must delete the row this returns. | |
void | deleteRowPreserveOrder (size_t index) |
Deletes the specified row and shifts everything after it up one slot. | |
void | fixNans () |
Replaces any occurrences of NAN in the matrix with the corresponding values from an identity matrix. | |
void | flush () |
Deletes all the data. | |
void | releaseAllRows () |
Abandons (leaks) all the rows of data. | |
void | shuffle (GRand &rand, GMatrix *pExtension=NULL) |
Randomizes the order of the rows. If pExtension is non-NULL, then it will also be shuffled such that corresponding rows are preserved. | |
void | shuffle2 (GRand &rand, GMatrix &other) |
Shuffles the order of the rows. Also shuffles the rows in "other" in the same way, such that corresponding rows are preserved. | |
void | shuffleLikeCards () |
This is an inferior way to shuffle the data. | |
void | sort (size_t nDimension) |
Sorts the data from smallest to largest in the specified dimension. | |
void | sortPartial (size_t row, size_t col) |
This partially sorts the specified column, such that the specified row will contain the same row as if it were fully sorted, and previous rows will contain a value <= to it in that column, and later rows will contain a value >= to it in that column. Unlike sort, which has O(m*log(m)) complexity, this method has O(m) complexity. This might be useful, for example, for efficiently finding the row with a median value in some attribute, or for separating data by a threshold in some value. | |
void | reverseRows () |
Reverses the row order. | |
template<typename CompareFunc > | |
void | sort (CompareFunc &pComparator) |
Sorts rows according to the specified compare function. (Return true to indicate that the first row comes before the second row.) | |
void | splitByPivot (GMatrix *pGreaterOrEqual, size_t nAttribute, double dPivot, GMatrix *pExtensionA=NULL, GMatrix *pExtensionB=NULL) |
Splits this set of data into two sets. Values greater-than-or-equal-to dPivot stay in this data set. Values less than dPivot go into pLessThanPivot. If pExtensionA is non-NULL, then it will also split pExtensionA such that corresponding rows are preserved. | |
void | splitByNominalValue (GMatrix *pSingleClass, size_t nAttr, int nValue, GMatrix *pExtensionA=NULL, GMatrix *pExtensionB=NULL) |
Moves all rows with the specified value in the specified attribute into pSingleClass. If pExtensionA is non-NULL, then it will also split pExtensionA such that corresponding rows are preserved. | |
void | splitBySize (GMatrix *pOtherData, size_t nOtherRows) |
Removes the last nOtherRows rows from this data set and puts them in pOtherData. | |
double | entropy (size_t nColumn) |
Measures the entropy of the specified attribute. | |
void | minAndRange (size_t nAttribute, double *pMin, double *pRange) |
Finds the min and the range of the values of the specified attribute. | |
void | minAndRangeUnbiased (size_t nAttribute, double *pMin, double *pRange) |
Estimates the actual min and range based on a random sample. | |
void | centerMeanAtOrigin () |
Shifts the data such that the mean occurs at the origin. Only continuous values are affected. Nominal values are left unchanged. | |
double | mean (size_t nAttribute) |
Computes the arithmetic mean of the values in the specified column. | |
double | median (size_t nAttribute) |
Computes the median of the values in the specified column. | |
void | centroid (double *pOutCentroid) |
Computes the arithmetic means of all attributes. | |
double | variance (size_t nAttr, double mean) |
Computes the average variance of a single attribute. | |
void | normalize (size_t nAttribute, double dInputMin, double dInputRange, double dOutputMin, double dOutputRange) |
Normalizes the specified attribute values. | |
double | baselineValue (size_t nAttribute) |
Returns the mean if the specified attribute is continuous, otherwise returns the most common nominal value in the attribute. | |
bool | isAttrHomogenous (size_t col) |
Returns true iff the specified attribute contains homogeneous values. (Unknowns are counted as homogeneous with anything.) | |
bool | isHomogenous () |
Returns true iff each of the last labelDims columns in the data are homogenous. | |
void | replaceMissingValuesWithBaseline (size_t nAttr) |
If the specified attribute is continuous, replaces all missing values in that attribute with the mean. If the specified attribute is nominal, replaces all missing values in that attribute with the most common value. | |
void | replaceMissingValuesRandomly (size_t nAttr, GRand *pRand) |
Replaces all missing values by copying a randomly selected non-missing value in the same attribute. | |
void | principalComponent (double *pOutVector, size_t dims, const double *pMean, GRand *pRand) |
This is an efficient algorithm for iteratively computing the principal component vector (the eigenvector of the covariance matrix) of the data. See "EM Algorithms for PCA and SPCA" by Sam Roweis, 1998 NIPS. nIterations should be a small constant. 20 seems to work well for most applications. (To compute the next principal component, call removeComponent, then call this again.) | |
void | principalComponentAboutOrigin (double *pOutVector, size_t dims, GRand *pRand) |
Computes the first principal component assuming the mean is already subtracted out of the data. | |
void | principalComponentIgnoreUnknowns (double *pOutVector, size_t dims, const double *pMean, GRand *pRand) |
Computes principal components, while ignoring missing values. | |
void | weightedPrincipalComponent (double *pOutVector, size_t dims, const double *pMean, const double *pWeights, GRand *pRand) |
Computes the first principal component of the data with each row weighted according to the vector pWeights. (pWeights must have an element for each row.) | |
double | eigenValue (const double *pMean, const double *pEigenVector, size_t dims, GRand *pRand) |
After you compute the principal component, you can call this to obtain the eigenvalue that corresponds to that principal component vector (eigenvector). | |
void | removeComponent (const double *pMean, const double *pComponent, size_t dims) |
Removes the component specified by pComponent from the data. (pComponent should already be normalized.) This might be useful, for example, to remove the first principal component from the data so you can then proceed to compute the second principal component, and so forth. | |
void | removeComponentAboutOrigin (const double *pComponent, size_t dims) |
Removes the specified component assuming the mean is zero. | |
size_t | countPrincipalComponents (double d, GRand *pRand) |
Computes the minimum number of principal components necessary so that less than the specified portion of the deviation in the data is unaccounted for. (For example, if the data projected onto the first 3 principal components contains 90 percent of the deviation that the original data contains, then if you pass the value 0.1 to this method, it will return 3.) | |
double | sumSquaredDistance (const double *pPoint) |
Computes the sum-squared distance between pPoint and all of the points in the dataset. (If pPoint is NULL, it computes the sum-squared distance with the origin.) (Note that this is equal to the sum of all the eigenvalues times the number of dimensions, so you can efficiently compute eigenvalues as the difference in sumSquaredDistance with the mean after removing the corresponding component, and then dividing by the number of dimensions. This is more efficient than calling eigenValue.) | |
double | columnSumSquaredDifference (GMatrix &that, size_t col) |
Computes the sum-squared distance between the specified column of this and that. If the column is a nominal attribute, then Hamming distance is used. | |
double | sumSquaredDifference (GMatrix &that, bool transpose=false) |
Computes the squared distance between this and that. (If transpose is true, computes the difference between this and the transpose of that.) | |
double | linearCorrelationCoefficient (size_t attr1, double attr1Origin, size_t attr2, double attr2Origin) |
Computes the linear correlation coefficient between the two specified attributes. Usually you will want to pass the mean values for attr1Origin and attr2Origin. | |
double | covariance (size_t nAttr1, double dMean1, size_t nAttr2, double dMean2) |
Computes the covariance between two attributes. | |
GMatrix * | covarianceMatrix () |
Computes the covariance matrix of the data. | |
void | pairedTTest (size_t *pOutV, double *pOutT, size_t attr1, size_t attr2, bool normalize) |
Performs a paired T-Test with data from the two specified attributes. pOutV will hold the degrees of freedom. pOutT will hold the T-value. You can use GMath::tTestAlphaValue to convert these to a P-value. | |
void | wilcoxonSignedRanksTest (size_t attr1, size_t attr2, double tolerance, int *pNum, double *pWMinus, double *pWPlus) |
Performs the Wilcoxon signed ranks test from the two specified attributes. If two values are closer than tolerance, they are considered to be equal. | |
void | print (std::ostream &stream) |
Prints the data to the specified stream. | |
size_t | countValue (size_t attribute, double value) |
Returns the number of occurrences of the specified value in the specified attribute. | |
bool | doesHaveAnyMissingValues () |
Returns true iff this matrix is missing any values. | |
void | ensureDataHasNoMissingReals () |
Throws an exception if this data contains any missing values in a continuous attribute. | |
void | ensureDataHasNoMissingNominals () |
Throws an exception if this data contains any missing values in a nominal attribute. | |
double | measureInfo () |
Computes the sum entropy of the data (or the sum variance for continuous attributes) | |
bool | leastCorrelatedVector (double *pOut, GMatrix *pThat, GRand *pRand) |
Computes the vector in this subspace that has the greatest distance from its projection into pThat subspace. Returns true if the results are computed. Returns false if the subspaces are so nearly parallel that pOut cannot be computed with accuracy. | |
double | dihedralCorrelation (GMatrix *pThat, GRand *pRand) |
Computes the cosine of the dihedral angle between this subspace and pThat subspace. | |
void | project (double *pDest, const double *pPoint) |
Projects pPoint onto this hyperplane (where each row defines one of the orthonormal basis vectors of this hyperplane). This computes (A^T)Ap, where A is this matrix, and p is pPoint. | |
void | project (double *pDest, const double *pPoint, const double *pOrigin) |
Projects pPoint onto this hyperplane (where each row defines one of the orthonormal basis vectors of this hyperplane) | |
Static Public Member Functions | |
static GMatrix * | kabsch (GMatrix *pA, GMatrix *pB) |
This computes K=kabsch(A,B), such that K is an n-by-n matrix, where n is pA->cols(). K is the optimal orthonormal rotation matrix to align A and B, such that A(K^T) minimizes sum-squared error with B, and BK minimizes sum-squared error with A. (This rotates around the origin, so typically you will want to subtract the centroid from both pA and pB before calling this.) | |
static GMatrix * | align (GMatrix *pA, GMatrix *pB) |
This uses the Kabsch algorithm to rotate and translate pB in order to minimize RMS with pA. (pA and pB must have the same number of rows and columns.) | |
static GMatrix * | loadArff (const char *szFilename) |
Loads an ARFF file and returns the data. This will throw an exception if there's an error. | |
static GMatrix * | loadCsv (const char *szFilename, char separator, bool columnNamesInFirstRow, bool tolerant) |
Loads a file in CSV format. | |
static GMatrix * | mergeHoriz (GMatrix *pSetA, GMatrix *pSetB) |
Merges two datasets side-by-side. The resulting dataset will contain the attributes of both datasets. Both pSetA and pSetB (and the resulting dataset) must have the same number of rows. | |
static GMatrix * | multiply (GMatrix &a, GMatrix &b, bool transposeA, bool transposeB) |
Matrix multiply. For convenience, you can also specify that neither, one, or both of the inputs are virtually transposed prior to the multiplication. (If you want the results to come out transposed, you can use the equality AB=((B^T)(A^T))^T to figure out how to specify the parameters.) | |
static GMatrix * | parseArff (const char *szFile, size_t nLen) |
Parses an ARFF file and returns the data. This will throw an exception if there's an error. | |
static GMatrix * | parseCsv (const char *pFile, size_t len, char separator, bool columnNamesInFirstRow, bool tolerant=false) |
Imports data from a text file. Determines the meta-data automatically. Note: This method does not support Mac line-endings. You should first replace all '\r' with '\n' if your data comes from a Mac. As a special case, if separator is '\0', then it assumes data elements are separated by any number of whitespace characters, that element values themselves contain no whitespace, and that there are no missing elements. (This is the case when you save a Matlab matrix to an ascii file.) | |
static double | normalize (double dVal, double dInputMin, double dInputRange, double dOutputMin, double dOutputRange) |
Normalize a value from the input min and range to the output min and range. | |
static size_t * | bipartiteMatching (GMatrix &a, GMatrix &b, GDistanceMetric &metric, size_t k=0) |
Performs bipartite matching of the rows in the specified matrices. 'a' and 'b' must have the same number of columns. 'b' must have at least as many rows as 'a'. Returns an array of indexes, i[], where i[j] is the row in b that corresponds with row j of a. "metric" is the distance metric that will be minimized. For example, if metric computes the squared distance between two vectors, then this method will find the pairings that minimize sum squared distance. k specifies the number of nearest-neighbors of each row to consider as candidates for pairing. If k is equal to the number of rows in a, then optimal pairings are guaranteed. If k is smaller, then results will be obtained faster, but optimal results are not guaranteed. (An efficient neighbor-finder that assumes metric conforms to the triangle inequality is used to find neighbors.) If the number of columns is not too big, then small values for k will usually return optimal or near-optimal results anyway. sqrt(rows) might be a good general value to use for k. As a special value, if k is 0, then all pairs are considered, and optimal results are guaranteed. | |
static void | test () |
Performs unit tests for this class. Throws an exception if there is a failure. | |
Protected Member Functions | |
double | determinantHelper (size_t nEndRow, size_t *pColumnList) |
void | inPlaceSquareTranspose () |
void | singularValueDecompositionHelper (GMatrix **ppU, double **ppDiag, GMatrix **ppV, bool throwIfNoConverge, size_t maxIters) |
Protected Attributes | |
sp_relation | m_pRelation |
GHeap * | m_pHeap |
std::vector< double * > | m_rows |
Represents a matrix or a database table. Elements can be discrete or continuous. References a GRelation object, which stores the meta-information about each column.
GClasses::GMatrix::GMatrix | ( | size_t | rows, |
size_t | cols, | ||
GHeap * | pHeap = NULL |
||
) |
Construct a rows x cols matrix. All elements of the matrix are assumed to be continuous. (It is okay to initially set rows to 0 and later call newRow to add each row. Adding columns later, however, is not very computationally efficient.)
GClasses::GMatrix::GMatrix | ( | std::vector< size_t > & | attrValues, |
GHeap * | pHeap = NULL |
||
) |
Construct a matrix with a mixed relation. That is, one with some continuous attributes (columns), and some nominal attributes (columns). attrValues specifies the number of nominal values supported in each attribute (column), or 0 for a continuous attribute. Initially, this matrix will have 0 rows, but you can add more rows by calling newRow or newRows.
GClasses::GMatrix::GMatrix | ( | sp_relation & | pRelation, |
GHeap * | pHeap = NULL |
||
) |
pRelation is a smart-pointer to a relation, which specifies the type of each attribute (column) in the data set. Initially, this matrix will have 0 rows, but you can add more rows by calling newRow or newRows.
GClasses::GMatrix::~GMatrix | ( | ) |
void GClasses::GMatrix::add | ( | GMatrix * | pThat, |
bool | transpose | ||
) |
Matrix add. Adds the values in pThat to this. (If transpose is true, adds the transpose of pThat to this.) Both datasets must have the same dimensions. Behavior is undefined for nominal columns.
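The effect of the transpose flag can be illustrated with a small standalone sketch. It uses a plain `std::vector` matrix rather than GMatrix, and `addInPlace` is a hypothetical helper name, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper (not the GMatrix implementation): adds `that`, or its
// transpose when `transpose` is true, to `self` element by element.
void addInPlace(Matrix& self, const Matrix& that, bool transpose)
{
    const std::size_t rows = self.size();
    const std::size_t cols = rows ? self[0].size() : 0;
    // `that` must be rows x cols, or cols x rows when it is read transposed.
    assert(that.size() == (transpose ? cols : rows));
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            self[i][j] += transpose ? that[j][i] : that[i][j];
}
```

With transpose set, element (i, j) of the receiver is paired with element (j, i) of the argument, so no transposed copy needs to be built.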
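static GMatrix* GClasses::GMatrix::align | ( | GMatrix * | pA, |
GMatrix * | pB | ||
) | [static] |
This uses the Kabsch algorithm to rotate and translate pB in order to minimize RMS with pA. (pA and pB must have the same number of rows and columns.)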
GMatrix* GClasses::GMatrix::attrSubset | ( | size_t | firstAttr, |
size_t | attrCount | ||
) |
Returns a new dataset that contains a subset of the attributes in this dataset.
double GClasses::GMatrix::baselineValue | ( | size_t | nAttribute | ) |
Returns the mean if the specified attribute is continuous, otherwise returns the most common nominal value in the attribute.
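A minimal sketch of the mean-or-mode rule follows. It works on a single column held in a `std::vector`, and the names `columnMean`, `columnMode`, and `baseline` are illustrative assumptions. The tie-breaking rule (smallest value wins) is also an assumption, since the documentation does not state one.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Arithmetic mean of a continuous column.
double columnMean(const std::vector<double>& col)
{
    assert(!col.empty());
    return std::accumulate(col.begin(), col.end(), 0.0) / static_cast<double>(col.size());
}

// Most common value of a nominal column. Ties go to the smallest value
// (an assumption made for this sketch).
double columnMode(const std::vector<double>& col)
{
    assert(!col.empty());
    std::map<double, std::size_t> counts;
    for (double v : col) ++counts[v];
    double best = 0.0;
    std::size_t bestCount = 0;
    for (const auto& kv : counts)
        if (kv.second > bestCount) { best = kv.first; bestCount = kv.second; }
    return best;
}

// Mean for continuous columns, most common value for nominal ones.
double baseline(const std::vector<double>& col, bool isContinuous)
{
    return isContinuous ? columnMean(col) : columnMode(col);
}
```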
static size_t* GClasses::GMatrix::bipartiteMatching | ( | GMatrix & | a, |
GMatrix & | b, | ||
GDistanceMetric & | metric, | ||
size_t | k = 0 |
||
) | [static] |
Performs bipartite matching of the rows in the specified matrices. 'a' and 'b' must have the same number of columns. 'b' must have at least as many rows as 'a'. Returns an array of indexes, i[], where i[j] is the row in b that corresponds with row j of a. "metric" is the distance metric that will be minimized. For example, if metric computes the squared distance between two vectors, then this method will find the pairings that minimize sum squared distance. k specifies the number of nearest-neighbors of each row to consider as candidates for pairing. If k is equal to the number of rows in a, then optimal pairings are guaranteed. If k is smaller, then results will be obtained faster, but optimal results are not guaranteed. (An efficient neighbor-finder that assumes metric conforms to the triangle inequality is used to find neighbors.) If the number of columns is not too big, then small values for k will usually return optimal or near-optimal results anyway. sqrt(rows) might be a good general value to use for k. As a special value, if k is 0, then all pairs are considered, and optimal results are guaranteed.
void GClasses::GMatrix::centerMeanAtOrigin | ( | ) |
Shifts the data such that the mean occurs at the origin. Only continuous values are affected. Nominal values are left unchanged.
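As a rough illustration, centering can be sketched on a plain `std::vector` matrix. The `isContinuous` flags stand in for the relation's column meta-data, and `centerContinuousColumns` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Subtracts the column mean from every continuous column; nominal columns
// are left untouched. `isContinuous` is an assumed stand-in for the relation.
void centerContinuousColumns(Matrix& m, const std::vector<bool>& isContinuous)
{
    if (m.empty()) return;
    const std::size_t cols = m[0].size();
    assert(isContinuous.size() == cols);
    for (std::size_t j = 0; j < cols; ++j) {
        if (!isContinuous[j]) continue;
        double mean = 0.0;
        for (const auto& row : m) mean += row[j];
        mean /= static_cast<double>(m.size());
        for (auto& row : m) row[j] -= mean;
    }
}
```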
void GClasses::GMatrix::centroid | ( | double * | pOutCentroid | ) |
Computes the arithmetic means of all attributes.
GMatrix* GClasses::GMatrix::cholesky | ( | ) |
This computes the square root of this matrix. (If you take the matrix that this returns and multiply it by its transpose, you should get the original dataset again.) Behavior is undefined if there are nominal attributes. If this matrix is not positive definite, it will throw an exception.
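The "square root" property can be sketched with a textbook Cholesky factorization on a plain `std::vector` matrix. This is not the library's code. The helper name `choleskyLower`, the choice of a lower-triangular result, and the exception type are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Returns a lower-triangular L such that L * L^T equals `a`, where `a` is
// assumed to be symmetric. Throws if `a` is not positive definite.
Matrix choleskyLower(const Matrix& a)
{
    const std::size_t n = a.size();
    Matrix l(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        assert(a[i].size() == n);  // the input must be square
        for (std::size_t j = 0; j <= i; ++j) {
            double sum = a[i][j];
            for (std::size_t k = 0; k < j; ++k)
                sum -= l[i][k] * l[j][k];
            if (i == j) {
                if (sum <= 0.0)
                    throw std::runtime_error("matrix is not positive definite");
                l[i][i] = std::sqrt(sum);
            } else {
                l[i][j] = sum / l[j][j];
            }
        }
    }
    return l;
}
```

Multiplying the returned factor by its own transpose reproduces the input, which is the property the description refers to.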
GMatrix* GClasses::GMatrix::clone | ( | ) |
Makes a deep copy of this dataset.
GMatrix* GClasses::GMatrix::cloneSub | ( | size_t | rowStart, |
size_t | colStart, | ||
size_t | rowCount, | ||
size_t | colCount | ||
) |
Makes a deep copy of the specified rectangular region of this matrix.
void GClasses::GMatrix::col | ( | size_t | index, |
double * | pOutVector | ||
) |
Copies the specified column into pOutVector.
size_t GClasses::GMatrix::cols | ( | ) | const [inline] |
Returns the number of columns in the dataset.
double GClasses::GMatrix::columnSumSquaredDifference | ( | GMatrix & | that, |
size_t | col | ||
) |
Computes the sum-squared distance between the specified column of this and that. If the column is a nominal attribute, then Hamming distance is used.
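A small sketch of the two cases: squared difference for a continuous column, and a mismatch count (Hamming distance) for a nominal one. The `nominal` flag replaces the relation lookup and `columnSumSquaredDiff` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Sums the per-row distance in one column of two equally sized matrices.
// Continuous columns contribute (a - b)^2; nominal columns contribute 1 per
// mismatch. The `nominal` flag is an assumed stand-in for the relation.
double columnSumSquaredDiff(const Matrix& a, const Matrix& b, std::size_t col, bool nominal)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (nominal) {
            if (a[i][col] != b[i][col]) sum += 1.0;
        } else {
            const double d = a[i][col] - b[i][col];
            sum += d * d;
        }
    }
    return sum;
}
```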
void GClasses::GMatrix::copy | ( | GMatrix * | pThat | ) |
Copies all the data from pThat. (Just references the same relation)
void GClasses::GMatrix::copyColumns | ( | size_t | nDestStartColumn, |
GMatrix * | pSource, | ||
size_t | nSourceStartColumn, | ||
size_t | nColumnCount | ||
) |
Copies the specified block of columns from pSource to this dataset. pSource must have the same number of rows as this dataset.
void GClasses::GMatrix::copyRow | ( | const double * | pRow | ) |
Adds a copy of the row to the data set.
size_t GClasses::GMatrix::countPrincipalComponents | ( | double | d, |
GRand * | pRand | ||
) |
Computes the minimum number of principal components necessary so that less than the specified portion of the deviation in the data is unaccounted for. (For example, if the data projected onto the first 3 principal components contains 90 percent of the deviation that the original data contains, then if you pass the value 0.1 to this method, it will return 3.)
size_t GClasses::GMatrix::countValue | ( | size_t | attribute, |
double | value | ||
) |
Returns the number of occurrences of the specified value in the specified attribute.
double GClasses::GMatrix::covariance | ( | size_t | nAttr1, |
double | dMean1, | ||
size_t | nAttr2, | ||
double | dMean2 | ||
) |
Computes the covariance between two attributes.
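The signature takes the two means as arguments, which suggests a single pass over the rows. The sketch below assumes the sample covariance with an n - 1 denominator. The documentation does not say which denominator the library uses, and `covarianceOf` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sample covariance of two equally long columns given their precomputed
// means. The (n - 1) denominator is an assumption of this sketch.
double covarianceOf(const std::vector<double>& x, double meanX,
                    const std::vector<double>& y, double meanY)
{
    assert(x.size() == y.size() && x.size() > 1);
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += (x[i] - meanX) * (y[i] - meanY);
    return sum / static_cast<double>(x.size() - 1);
}
```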
GMatrix* GClasses::GMatrix::covarianceMatrix | ( | ) |
Computes the covariance matrix of the data.
void GClasses::GMatrix::deleteColumn | ( | size_t | index | ) |
Deletes a column.
void GClasses::GMatrix::deleteRow | ( | size_t | index | ) |
Swaps the specified row with the last row, and then deletes it.
void GClasses::GMatrix::deleteRowPreserveOrder | ( | size_t | index | ) |
Deletes the specified row and shifts everything after it up one slot.
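The difference between the swap-with-last and order-preserving deletions can be sketched on a vector of rows. Both helper names are hypothetical. The first does a constant number of row moves but reorders the data, while the second keeps the order at the cost of shifting every later row.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Rows = std::vector<std::vector<double>>;

// Swap the doomed row with the last row, then drop the last row.
// Cheap, but the former last row now sits at `index`.
void removeSwapLast(Rows& rows, std::size_t index)
{
    assert(index < rows.size());
    std::swap(rows[index], rows.back());
    rows.pop_back();
}

// Erase the row and shift everything after it up one slot.
void removePreserveOrder(Rows& rows, std::size_t index)
{
    assert(index < rows.size());
    rows.erase(rows.begin() + static_cast<std::ptrdiff_t>(index));
}
```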
double GClasses::GMatrix::determinant | ( | ) |
Computes the determinant of this matrix.
double GClasses::GMatrix::determinantHelper | ( | size_t | nEndRow, |
size_t * | pColumnList | ||
) | [protected] |
double GClasses::GMatrix::dihedralCorrelation | ( | GMatrix * | pThat, |
GRand * | pRand | ||
) |
Computes the cosine of the dihedral angle between this subspace and pThat subspace.
bool GClasses::GMatrix::doesHaveAnyMissingValues | ( | ) |
Returns true iff this matrix is missing any values.
double GClasses::GMatrix::eigenValue | ( | const double * | pMean, |
const double * | pEigenVector, | ||
size_t | dims, | ||
GRand * | pRand | ||
) |
After you compute the principal component, you can call this to obtain the eigenvalue that corresponds to that principal component vector (eigenvector).
double GClasses::GMatrix::eigenValue | ( | const double * | pEigenVector | ) |
Computes the eigenvalue that corresponds to the specified eigenvector of this matrix.
void GClasses::GMatrix::eigenVector | ( | double | eigenvalue, |
double * | pOutVector | ||
) |
Computes the eigenvector that corresponds to the specified eigenvalue of this matrix. Note that this method trashes this matrix, so make a copy first if you care.
GMatrix* GClasses::GMatrix::eigs | ( | size_t | nCount, |
double * | pEigenVals, | ||
GRand * | pRand, | ||
bool | mostSignificant | ||
) |
Computes nCount eigenvectors and the corresponding eigenvalues using the power method. (This method is only accurate if a small number of eigenvalues/vectors are needed.) If mostSignificant is true, the largest eigenvalues are found. If mostSignificant is false, the smallest eigenvalues are found.
void GClasses::GMatrix::ensureDataHasNoMissingNominals | ( | ) |
Throws an exception if this data contains any missing values in a nominal attribute.
void GClasses::GMatrix::ensureDataHasNoMissingReals | ( | ) |
Throws an exception if this data contains any missing values in a continuous attribute.
double GClasses::GMatrix::entropy | ( | size_t | nColumn | ) |
Measures the entropy of the specified attribute.
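For a nominal attribute, entropy is usually taken to be the Shannon entropy of the value frequencies. The following standalone sketch assumes that definition and a base-2 logarithm; neither is confirmed by this page, and `columnEntropy` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Shannon entropy (in bits) of a column of nominal values.
// Assumption: base-2 logarithm; values are category indices.
double columnEntropy(const std::vector<int>& values)
{
    assert(!values.empty());
    std::map<int, std::size_t> counts;
    for (int v : values)
        ++counts[v];
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) /
                         static_cast<double>(values.size());
        h -= p * std::log2(p);
    }
    return h;
}
```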
void GClasses::GMatrix::fixNans | ( | ) |
Replaces any occurrences of NAN in the matrix with the corresponding values from an identity matrix.
void GClasses::GMatrix::flush | ( | ) |
Deletes all the data.
void GClasses::GMatrix::fromVector | ( | const double * | pVector, |
size_t | nRows | ||
) |
Copies the data from pVector over this dataset. nRows specifies the number of rows of data in pVector.
bool GClasses::GMatrix::gaussianElimination | ( | double * | pVector | ) |
Computes y in the equation M*y=x (or y=M^(-1)x), where M is this dataset, which must be a square matrix, and x is pVector as passed in, and y is pVector after the call. If there are multiple solutions, it finds the one for which all the variables in the null-space have a value of 1. If there are no solutions, it returns false. Note that this method trashes this dataset (so make a copy first if you care).
GHeap* GClasses::GMatrix::heap | ( | ) | [inline] |
Returns the heap used to allocate rows for this dataset.
void GClasses::GMatrix::inPlaceSquareTranspose | ( | ) | [protected] |
bool GClasses::GMatrix::isAttrHomogenous | ( | size_t | col | ) |
Returns true iff the specified attribute contains homogeneous values. (Unknowns are counted as homogeneous with anything.)
bool GClasses::GMatrix::isHomogenous | ( | ) |
Returns true iff each of the last labelDims columns in the data is homogeneous.
This computes K=kabsch(A,B), such that K is an n-by-n matrix, where n is pA->cols(). K is the optimal orthonormal rotation matrix to align A and B, such that A(K^T) minimizes sum-squared error with B, and BK minimizes sum-squared error with A. (This rotates around the origin, so typically you will want to subtract the centroid from both pA and pB before calling this.)
Computes the vector in this subspace that has the greatest distance from its projection into pThat subspace. Returns true if the results are computed. Returns false if the subspaces are so nearly parallel that pOut cannot be computed with accuracy.
double GClasses::GMatrix::linearCorrelationCoefficient | ( | size_t | attr1, |
double | attr1Origin, | ||
size_t | attr2, | ||
double | attr2Origin | ||
) |
Computes the linear correlation coefficient between the two specified attributes. Usually you will want to pass the mean values for attr1Origin and attr2Origin.
static GMatrix* GClasses::GMatrix::loadArff | ( | const char * | szFilename | ) | [static] |
Loads an ARFF file and returns the data. This will throw an exception if there's an error.
static GMatrix* GClasses::GMatrix::loadCsv | ( | const char * | szFilename, |
char | separator, | ||
bool | columnNamesInFirstRow, | ||
bool | tolerant | ||
) | [static] |
Loads a file in CSV format.
void GClasses::GMatrix::LUDecomposition | ( | ) |
Performs an in-place LU-decomposition, such that the lower triangle of this matrix (including the diagonal) specifies L, and the upper triangle of this matrix (not including the diagonal) specifies U, and all values of U along the diagonal are ones. (The upper triangle of L and the lower triangle of U are all zeros.)
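The storage layout described above (L including the diagonal, U with an implicit unit diagonal) matches what Crout's method produces. The sketch below is a standalone illustration of that layout, not the library's code: `croutInPlace` is a hypothetical name, and it assumes a square matrix and performs no pivoting, so it fails on a zero pivot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place Crout decomposition of a square matrix, without pivoting.
// Afterwards a[i][j] holds L for i >= j and U for i < j (U has a unit diagonal).
void croutInPlace(std::vector<std::vector<double>>& a)
{
    const std::size_t n = a.size();
    for (std::size_t j = 0; j < n; ++j) {
        // Column j of L.
        for (std::size_t i = j; i < n; ++i) {
            double sum = 0.0;
            for (std::size_t k = 0; k < j; ++k)
                sum += a[i][k] * a[k][j];
            a[i][j] -= sum;
        }
        assert(a[j][j] != 0.0); // no pivoting in this sketch
        // Row j of U (entries right of the diagonal).
        for (std::size_t i = j + 1; i < n; ++i) {
            double sum = 0.0;
            for (std::size_t k = 0; k < j; ++k)
                sum += a[j][k] * a[k][i];
            a[j][i] = (a[j][i] - sum) / a[j][j];
        }
    }
}
```

For the matrix with rows (4, 3) and (6, 3), this leaves L = [[4, 0], [6, -1.5]] and U = [[1, 0.75], [0, 1]] packed into the same storage.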
void GClasses::GMatrix::makeIdentity | ( | ) |
Sets this dataset to an identity matrix. (It doesn't change the number of columns or rows. It just stomps over existing values.)
double GClasses::GMatrix::mean | ( | size_t | nAttribute | ) |
Computes the arithmetic mean of the values in the specified column.
double GClasses::GMatrix::measureInfo | ( | ) |
Computes the sum entropy of the data (or the sum variance for continuous attributes)
double GClasses::GMatrix::median | ( | size_t | nAttribute | ) |
Computes the median of the values in the specified column.
Merges two datasets side-by-side. The resulting dataset will contain the attributes of both datasets. Both pSetA and pSetB (and the resulting dataset) must have the same number of rows.
void GClasses::GMatrix::mergeVert | ( | GMatrix * | pData | ) |
Steals all the rows from pData and adds them to this set. (You still have to delete pData.) Both datasets must have the same number of columns.
void GClasses::GMatrix::minAndRange | ( | size_t | nAttribute, |
double * | pMin, | ||
double * | pRange | ||
) |
Finds the min and the range of the values of the specified attribute.
void GClasses::GMatrix::minAndRangeUnbiased | ( | size_t | nAttribute, |
double * | pMin, | ||
double * | pRange | ||
) |
Estimates the actual min and range based on a random sample.
void GClasses::GMatrix::mirrorTriangle | ( | bool | upperToLower | ) |
If upperToLower is true, copies the upper triangle of this matrix over the lower triangle. If upperToLower is false, copies the lower triangle of this matrix over the upper triangle.
void GClasses::GMatrix::multiply | ( | const double * | pVectorIn, |
double * | pVectorOut, | ||
bool | transpose = false | ||
) |
Multiplies this matrix by the column vector pVectorIn to get pVectorOut. (If transpose is true, then it multiplies the transpose of this matrix by pVectorIn to get pVectorOut.) pVectorIn should have the same number of elements as columns (or rows if transpose is true) and pVectorOut should have the same number of elements as rows (or cols, if transpose is true.) Note that if transpose is true, it is the same as if pVectorIn is a row vector and you multiply it by this matrix to get pVectorOut.
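The size rules for the two modes can be seen in a small standalone sketch. `multiplyVec` and the `Matrix` alias are hypothetical names, and the sketch returns the result by value instead of writing into a caller-supplied buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Computes M * in, or (M^T) * in when transpose is true.
std::vector<double> multiplyVec(const Matrix& m, const std::vector<double>& in,
                                bool transpose = false)
{
    const std::size_t rows = m.size();
    const std::size_t cols = rows ? m[0].size() : 0;
    assert(in.size() == (transpose ? rows : cols));
    std::vector<double> out(transpose ? cols : rows, 0.0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            if (transpose)
                out[c] += m[r][c] * in[r]; // same as the row vector in times M
            else
                out[r] += m[r][c] * in[c];
        }
    }
    return out;
}
```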
void GClasses::GMatrix::multiply | ( | double | scalar | ) |
Multiplies every element in the dataset by scalar. Behavior is undefined for nominal columns.
static GMatrix* GClasses::GMatrix::multiply | ( | GMatrix & | a, |
GMatrix & | b, | ||
bool | transposeA, | ||
bool | transposeB | ||
) | [static] |
Matrix multiply. For convenience, you can also specify that neither, one, or both of the inputs are virtually transposed prior to the multiplication. (If you want the results to come out transposed, you can use the equality AB=((B^T)(A^T))^T to figure out how to specify the parameters.)
double* GClasses::GMatrix::newRow | ( | ) |
Adds a new row to the dataset. (The values in the row are not initialized)
void GClasses::GMatrix::newRows | ( | size_t | nRows | ) |
Adds "nRows" uninitialized rows to the data set.
void GClasses::GMatrix::normalize | ( | size_t | nAttribute, |
double | dInputMin, | ||
double | dInputRange, | ||
double | dOutputMin, | ||
double | dOutputRange | ||
) |
Normalizes the specified attribute values.
static double GClasses::GMatrix::normalize | ( | double | dVal, |
double | dInputMin, | ||
double | dInputRange, | ||
double | dOutputMin, | ||
double | dOutputRange | ||
) | [static] |
Normalize a value from the input min and range to the output min and range.
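A plausible reading of the static overload is a linear map from the interval starting at dInputMin with width dInputRange onto the interval starting at dOutputMin with width dOutputRange. The sketch below assumes exactly that formula; `rescale` is a hypothetical name, and the library's handling of a zero input range is not covered.

```cpp
#include <cassert>

// Linearly maps val from [inMin, inMin + inRange] to [outMin, outMin + outRange].
// Assumption: this is the formula behind normalize; inRange must be nonzero.
double rescale(double val, double inMin, double inRange,
               double outMin, double outRange)
{
    assert(inRange != 0.0);
    return (val - inMin) / inRange * outRange + outMin;
}
```

For example, mapping 5 from the range [0, 10] to [-1, 1] gives 0.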
const double* GClasses::GMatrix::operator[] | ( | size_t | index | ) | const [inline] |
Returns a const pointer to the specified row.
double* GClasses::GMatrix::operator[] | ( | size_t | index | ) | [inline] |
Returns a pointer to the specified row.
void GClasses::GMatrix::pairedTTest | ( | size_t * | pOutV, |
double * | pOutT, | ||
size_t | attr1, | ||
size_t | attr2, | ||
bool | normalize | ||
) |
Performs a paired T-Test with data from the two specified attributes. pOutV will hold the degrees of freedom. pOutT will hold the T-value. You can use GMath::tTestAlphaValue to convert these to a P-value.
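The two outputs correspond to the textbook paired t-test, sketched below on two plain vectors. This is a standalone illustration with hypothetical names (`PairedTResult`, `pairedT`); it ignores the normalize flag, whose effect is not described here, and it does not call GMath.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct PairedTResult {
    std::size_t v; // degrees of freedom (n - 1)
    double t;      // t statistic
};

// Textbook paired t-test on the per-row differences a[i] - b[i].
PairedTResult pairedT(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size() && a.size() > 1);
    const double n = static_cast<double>(a.size());
    double mean = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        mean += (a[i] - b[i]) / n;
    double var = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = (a[i] - b[i]) - mean;
        var += d * d;
    }
    var /= (n - 1.0); // sample variance of the differences
    assert(var > 0.0);
    return PairedTResult{a.size() - 1, mean / std::sqrt(var / n)};
}
```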
static GMatrix* GClasses::GMatrix::parseArff | ( | const char * | szFile, |
size_t | nLen | ||
) | [static] |
Parses an ARFF file and returns the data. This will throw an exception if there's an error.
static GMatrix* GClasses::GMatrix::parseCsv | ( | const char * | pFile, |
size_t | len, | ||
char | separator, | ||
bool | columnNamesInFirstRow, | ||
bool | tolerant = false | ||
) | [static] |
Imports data from a text file. Determines the meta-data automatically. Note: This method does not support Mac line-endings. You should first replace all '\r' with '\n' if your data comes from a Mac. As a special case, if separator is '\0', then it assumes data elements are separated by any number of whitespace characters, that element values themselves contain no whitespace, and that there are no missing elements. (This is the case when you save a Matlab matrix to an ascii file.)
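One way to prepare a buffer that uses classic Mac line endings (a lone carriage return) before handing it to the parser is a single pass with `std::replace`. The helper name `macToUnixLineEndings` is hypothetical. Note that on Windows-style input this simple replacement would turn each "\r\n" pair into two newlines.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Replaces every carriage return with a line feed.
std::string macToUnixLineEndings(std::string text)
{
    std::replace(text.begin(), text.end(), '\r', '\n');
    assert(text.find('\r') == std::string::npos);
    return text;
}
```

The converted text and its length could then be passed as the pFile and len arguments.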
void GClasses::GMatrix::principalComponent | ( | double * | pOutVector, |
size_t | dims, | ||
const double * | pMean, | ||
GRand * | pRand | ||
) |
This is an efficient algorithm for iteratively computing the principal component vector (the eigenvector of the covariance matrix) of the data. See "EM Algorithms for PCA and SPCA" by Sam Roweis, 1998 NIPS. nIterations should be a small constant; 20 seems to work well for most applications. (To compute the next principal component, call removeComponent, then call this again.)
void GClasses::GMatrix::principalComponentAboutOrigin | ( | double * | pOutVector, |
size_t | dims, | ||
GRand * | pRand | ||
) |
Computes the first principal component assuming the mean is already subtracted out of the data.
void GClasses::GMatrix::principalComponentIgnoreUnknowns | ( | double * | pOutVector, |
size_t | dims, | ||
const double * | pMean, | ||
GRand * | pRand | ||
) |
Computes principal components, while ignoring missing values.
void GClasses::GMatrix::print | ( | std::ostream & | stream | ) |
Prints the data to the specified stream.
void GClasses::GMatrix::project | ( | double * | pDest, |
const double * | pPoint | ||
) |
Projects pPoint onto this hyperplane (where each row defines one of the orthonormal basis vectors of this hyperplane) This computes (A^T)Ap, where A is this matrix, and p is pPoint.
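The expression (A^T)Ap can be evaluated one basis row at a time: take the dot product of the point with each row, then accumulate that multiple of the row. The sketch below shows this on plain vectors; `projectOntoRows` is a hypothetical name and the rows are assumed to be orthonormal, as stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes (A^T) A p, where each row of basis is one orthonormal basis vector.
std::vector<double> projectOntoRows(const std::vector<std::vector<double>>& basis,
                                    const std::vector<double>& p)
{
    std::vector<double> out(p.size(), 0.0);
    for (const std::vector<double>& b : basis) {
        assert(b.size() == p.size());
        double dot = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i)
            dot += b[i] * p[i];
        for (std::size_t i = 0; i < p.size(); ++i)
            out[i] += dot * b[i];
    }
    return out;
}
```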
void GClasses::GMatrix::project | ( | double * | pDest, |
const double * | pPoint, | ||
const double * | pOrigin | ||
) |
Projects pPoint onto this hyperplane (where each row defines one of the orthonormal basis vectors of this hyperplane)
GMatrix* GClasses::GMatrix::pseudoInverse | ( | ) |
Computes the Moore-Penrose pseudoinverse of this matrix (using the SVD method). You are responsible to delete the matrix this returns.
sp_relation& GClasses::GMatrix::relation | ( | ) | [inline] |
Returns a relation object, which holds meta-data about the attributes (columns)
void GClasses::GMatrix::releaseAllRows | ( | ) |
Abandons (leaks) all the rows of data.
double* GClasses::GMatrix::releaseRow | ( | size_t | index | ) |
Swaps the specified row with the last row, and then releases it from the dataset. If this dataset does not have its own heap, then you must delete the row this returns.
double* GClasses::GMatrix::releaseRowPreserveOrder | ( | size_t | index | ) |
Releases the specified row from the dataset and shifts everything after it up one slot. If this dataset does not have its own heap, then you must delete the row this returns.
void GClasses::GMatrix::removeComponent | ( | const double * | pMean, |
const double * | pComponent, | ||
size_t | dims | ||
) |
Removes the component specified by pComponent from the data. (pComponent should already be normalized.) This might be useful, for example, to remove the first principal component from the data so you can then proceed to compute the second principal component, and so forth.
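Removing a component is commonly done by subtracting, from each row, its projection onto the component measured relative to the mean. The sketch below assumes that interpretation; `removeComponentFromRows` is a hypothetical name and the code is a standalone illustration rather than the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each row x: x -= ((x - mean) . c) * c, where c is a unit-length component.
void removeComponentFromRows(std::vector<std::vector<double>>& rows,
                             const std::vector<double>& mean,
                             const std::vector<double>& component)
{
    assert(mean.size() == component.size());
    for (std::vector<double>& r : rows) {
        assert(r.size() == component.size());
        double dot = 0.0;
        for (std::size_t i = 0; i < r.size(); ++i)
            dot += (r[i] - mean[i]) * component[i];
        for (std::size_t i = 0; i < r.size(); ++i)
            r[i] -= dot * component[i];
    }
}
```

After this step the rows have no remaining spread along the removed direction, so a subsequent principal component computation finds the next one.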
void GClasses::GMatrix::removeComponentAboutOrigin | ( | const double * | pComponent, |
size_t | dims | ||
) |
Removes the specified component assuming the mean is zero.
void GClasses::GMatrix::replaceMissingValuesRandomly | ( | size_t | nAttr, |
GRand * | pRand | ||
) |
Replaces all missing values by copying a randomly selected non-missing value in the same attribute.
void GClasses::GMatrix::replaceMissingValuesWithBaseline | ( | size_t | nAttr | ) |
If the specified attribute is continuous, replaces all missing values in that attribute with the mean. If the specified attribute is nominal, replaces all missing values in that attribute with the most common value.
void GClasses::GMatrix::reserve | ( | size_t | n | ) | [inline] |
Allocates space for the specified number of patterns (to avoid superfluous resizing).
void GClasses::GMatrix::reverseRows | ( | ) |
Reverses the row order.
double* GClasses::GMatrix::row | ( | size_t | index | ) | [inline] |
Returns a pointer to the specified row.
const double* GClasses::GMatrix::row | ( | size_t | index | ) | const [inline] |
Returns a const pointer to the specified row.
size_t GClasses::GMatrix::rows | ( | ) | const [inline] |
Returns the number of rows in the dataset.
void GClasses::GMatrix::saveArff | ( | const char * | szFilename | ) |
Saves the dataset to a file in ARFF format.
Marshalls this object to a DOM, which may be saved to a variety of serial formats.
void GClasses::GMatrix::setAll | ( | double | val | ) |
Sets all elements in this dataset to the specified value.
void GClasses::GMatrix::setCol | ( | size_t | index, |
const double * | pVector | ||
) |
Copies pVector over the specified column.
void GClasses::GMatrix::setRelation | ( | sp_relation & | pRelation | ) | [inline] |
Sets the relation for this dataset.
Randomizes the order of the rows. If pExtension is non-NULL, then it will also be shuffled such that corresponding rows are preserved.
Shuffles the order of the rows. Also shuffles the rows in "other" in the same way, such that corresponding rows are preserved.
void GClasses::GMatrix::shuffleLikeCards | ( | ) |
This is an inferior way to shuffle the data.
void GClasses::GMatrix::singularValueDecomposition | ( | GMatrix ** | ppU, |
double ** | ppDiag, | ||
GMatrix ** | ppV, | ||
bool | throwIfNoConverge = false, | ||
size_t | maxIters = 80 | ||
) |
Performs SVD on A, where A is this m-by-n matrix. *ppU will be set to an m-by-m matrix where the columns are the eigenvectors of A(A^T). *ppDiag will be set to an array of n doubles holding the square roots of the corresponding eigenvalues. *ppV will be set to an n-by-n matrix where the rows are the eigenvectors of (A^T)A. You are responsible to delete(*ppU), delete(*ppV), and delete[] *ppDiag.
void GClasses::GMatrix::singularValueDecompositionHelper | ( | GMatrix ** | ppU, |
double ** | ppDiag, | ||
GMatrix ** | ppV, | ||
bool | throwIfNoConverge, | ||
size_t | maxIters | ||
) | [protected] |
void GClasses::GMatrix::sort | ( | size_t | nDimension | ) |
Sorts the data from smallest to largest in the specified dimension.
void GClasses::GMatrix::sort | ( | CompareFunc & | pComparator | ) | [inline] |
Sorts rows according to the specified compare function. (Return true to indicate that the first row comes before the second row.)
void GClasses::GMatrix::sortPartial | ( | size_t | row, |
size_t | col | ||
) |
This partially sorts the specified column, such that the specified row will contain the same row as if it were fully sorted, and previous rows will contain a value <= to it in that column, and later rows will contain a value >= to it in that column. Unlike sort, which has O(m*log(m)) complexity, this method has O(m) complexity. This might be useful, for example, for efficiently finding the row with a median value in some attribute, or for separating data by a threshold in some value.
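The same guarantee is offered by `std::nth_element` in the standard library, which runs in linear time on average. The sketch below uses it on a plain vector of rows; `sortPartialRows` is a hypothetical name and this is not necessarily how the library implements the method.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;

// After the call, rows[row] is the row a full sort on col would place there,
// earlier rows are <= it in col, and later rows are >= it in col.
void sortPartialRows(std::vector<Row>& rows, std::size_t row, std::size_t col)
{
    assert(row < rows.size());
    std::nth_element(rows.begin(),
                     rows.begin() + static_cast<std::ptrdiff_t>(row),
                     rows.end(),
                     [col](const Row& a, const Row& b) { return a[col] < b[col]; });
}
```

Calling it with row set to half the row count places a median row in the middle, which is the median-finding use mentioned above.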
void GClasses::GMatrix::splitByNominalValue | ( | GMatrix * | pSingleClass, |
size_t | nAttr, | ||
int | nValue, | ||
GMatrix * | pExtensionA = NULL, | ||
GMatrix * | pExtensionB = NULL | ||
) |
Moves all rows with the specified value in the specified attribute into pSingleClass. If pExtensionA is non-NULL, then it will also split pExtensionA such that corresponding rows are preserved.
void GClasses::GMatrix::splitByPivot | ( | GMatrix * | pGreaterOrEqual, |
size_t | nAttribute, | ||
double | dPivot, | ||
GMatrix * | pExtensionA = NULL, | ||
GMatrix * | pExtensionB = NULL | ||
) |
Splits this set of data into two sets. Values greater-than-or-equal-to dPivot stay in this data set. Values less than dPivot go into pLessThanPivot. If pExtensionA is non-NULL, then it will also split pExtensionA such that corresponding rows are preserved.
void GClasses::GMatrix::splitBySize | ( | GMatrix * | pOtherData, |
size_t | nOtherRows | ||
) |
Removes the last nOtherRows rows from this data set and puts them in pOtherData.
void GClasses::GMatrix::subtract | ( | GMatrix * | pThat, |
bool | transpose | ||
) |
Matrix subtract. Subtracts the values in pThat from this. (If transpose is true, subtracts the transpose of pThat from this.) Both datasets must have the same dimensions. Behavior is undefined for nominal columns.
double GClasses::GMatrix::sumSquaredDifference | ( | GMatrix & | that, |
bool | transpose = false | ||
) |
Computes the squared distance between this and that. (If transpose is true, computes the difference between this and the transpose of that.)
double GClasses::GMatrix::sumSquaredDiffWithIdentity | ( | ) |
Returns the sum squared difference between this matrix and an identity matrix.
double GClasses::GMatrix::sumSquaredDistance | ( | const double * | pPoint | ) |
Computes the sum-squared distance between pPoint and all of the points in the dataset. (If pPoint is NULL, it computes the sum-squared distance with the origin.) (Note that this is equal to the sum of all the eigenvalues times the number of dimensions, so you can efficiently compute eigenvalues as the difference in sumSquaredDistance with the mean after removing the corresponding component, and then dividing by the number of dimensions. This is more efficient than calling eigenValue.)
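The basic quantity, before any eigenvalue shortcut, is the sum over all rows of the squared Euclidean distance to the point. A minimal standalone sketch follows; `sumSquaredDistanceTo` is a hypothetical name, and a null pointer stands for the origin as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum over all rows of the squared distance to *point (or to the origin if null).
double sumSquaredDistanceTo(const std::vector<std::vector<double>>& rows,
                            const std::vector<double>* point)
{
    double sum = 0.0;
    for (const std::vector<double>& r : rows) {
        assert(point == nullptr || point->size() == r.size());
        for (std::size_t i = 0; i < r.size(); ++i) {
            const double d = r[i] - (point ? (*point)[i] : 0.0);
            sum += d * d;
        }
    }
    return sum;
}
```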
void GClasses::GMatrix::swapColumns | ( | size_t | nAttr1, |
size_t | nAttr2 | ||
) |
Swaps two columns.
void GClasses::GMatrix::swapRows | ( | size_t | a, |
size_t | b | ||
) |
Swaps the two specified rows.
void GClasses::GMatrix::takeRow | ( | double * | pRow | ) |
Adds an already-allocated row to this dataset. The row must be allocated in the same heap that this dataset uses. (There is no way for this method to verify that, so be careful.)
static void GClasses::GMatrix::test | ( | ) | [static] |
Performs unit tests for this class. Throws an exception if there is a failure.
size_t GClasses::GMatrix::toReducedRowEchelonForm | ( | ) |
Converts the matrix to reduced row echelon form.
void GClasses::GMatrix::toVector | ( | double * | pVector | ) |
double GClasses::GMatrix::trace | ( | ) |
Returns the sum of the diagonal elements.
GMatrix* GClasses::GMatrix::transpose | ( | ) |
Returns a dataset that is this dataset transposed. (All columns in the returned dataset will be continuous.)
double GClasses::GMatrix::variance | ( | size_t | nAttr, |
double | mean | ||
) |
Computes the average variance of a single attribute.
void GClasses::GMatrix::weightedPrincipalComponent | ( | double * | pOutVector, |
size_t | dims, | ||
const double * | pMean, | ||
const double * | pWeights, | ||
GRand * | pRand | ||
) |
Computes the first principal component of the data with each row weighted according to the vector pWeights. (pWeights must have an element for each row.)
void GClasses::GMatrix::wilcoxonSignedRanksTest | ( | size_t | attr1, |
size_t | attr2, | ||
double | tolerance, | ||
int * | pNum, | ||
double * | pWMinus, | ||
double * | pWPlus | ||
) |
Performs the Wilcoxon signed ranks test from the two specified attributes. If two values are closer than tolerance, they are considered to be equal.
GHeap* GClasses::GMatrix::m_pHeap [protected] |
sp_relation GClasses::GMatrix::m_pRelation [protected] |
std::vector<double*> GClasses::GMatrix::m_rows [protected] |