GClasses

GClasses::GSupervisedLearner Class Reference

This is the base class of algorithms that learn with supervision and have an internal hypothesis model that allows them to generalize to rows that were not available at training time. More...

#include <GLearner.h>

Inherits GClasses::GTransducer.

Inherited by GClasses::GBaselineLearner, GClasses::GBucket, GClasses::GDecisionTree, GClasses::GEnsemble, GClasses::GIdentityFunction, GClasses::GIncrementalLearner, GClasses::GLinearRegressor, GClasses::GMeanMarginsTree, GClasses::GPolynomial, GClasses::GRandomForest, and GClasses::GWag.

List of all members.

Public Member Functions

 GSupervisedLearner (GRand &rand)
 General-purpose constructor.
 GSupervisedLearner (GDomNode *pNode, GLearnerLoader &ll)
 Deserialization constructor.
virtual ~GSupervisedLearner ()
 Destructor.
virtual GDomNode * serialize (GDom *pDoc)=0
 Marshal this object into a DOM that can be converted to a variety of formats. (Implementations of this method should use baseDomNode.)
virtual bool canGeneralize ()
 Returns true because fully supervised learners have an internal model that allows them to generalize previously unseen rows.
size_t featureDims ()
 Returns the number of feature dims.
size_t labelDims ()
 Returns the number of label dims.
GTwoWayIncrementalTransform * featureFilter ()
 Returns the current feature filter (or NULL if none has been set).
GTwoWayIncrementalTransform * labelFilter ()
 Returns the current label filter (or NULL if none has been set).
void setFeatureFilter (GTwoWayIncrementalTransform *pFilter)
 Sets the filter for features. (Note that the "train" method automatically sets the filters, replacing any filters that you have set, so this method is really only useful in conjunction with incremental learning.)
void setLabelFilter (GTwoWayIncrementalTransform *pFilter)
 Sets the filter for labels. (Note that the "train" method automatically sets the filters, replacing any filters that you have set, so this method is really only useful in conjunction with incremental learning.)
void train (GMatrix &features, GMatrix &labels)
 Call this method to train the model. It automatically determines which filters are needed to convert the training features and labels into a form that the model's training algorithm can handle, and then calls trainInner to do the actual training.
void predict (const double *pIn, double *pOut)
 Evaluate pIn to compute a prediction for pOut. The model must be trained (by calling train) before the first time that this method is called. pIn should point to an array of doubles with one element per column in the feature matrix passed to train, and pOut should point to an array of doubles with one element per column in the label matrix.
void calibrate (GMatrix &features, GMatrix &labels)
 Calibrate the model to make predicted distributions reflect the training data. This method should be called after train is called, but before the first time predictDistribution is called. Typically, the same matrices passed as parameters to the train method are also passed as parameters to this method. By default, the mean of continuous labels is predicted as accurately as possible, but the variance only reflects a heuristic measure of confidence. If calibrate is called, however, then logistic regression will be used to map from the heuristic variance estimates to the actual variance as measured in the training data, such that the predicted variance becomes more reliable. Likewise with categorical labels, the mode is predicted as accurately as possible, but the distribution of probability among the categories may not be a very good prediction of the actual distribution of probability unless this method has been called to calibrate them. If you never plan to call predictDistribution, there is no reason to ever call this method.
void predictDistribution (const double *pIn, GPrediction *pOut)
 Evaluate pIn and compute a prediction for pOut. pOut is expected to point to an array of GPrediction objects which have already been allocated. There should be labelDims() elements in this array. The distributions will be more accurate if the model is calibrated before the first time that this method is called.
virtual void clear ()=0
 Discards all training for the purpose of freeing memory. If you call this method, you must train before making any predictions. No settings or options are discarded, so you should be able to train again without specifying any other parameters and still get a comparable model.
void accuracy (GMatrix &features, GMatrix &labels, double *pOutResults, std::vector< GMatrix * > *pNominalLabelStats=NULL)
 This method assumes that this learner has already been trained. It computes the predictive accuracy for nominal labels and the mean-squared error for continuous labels. pOutResults should point to an array with one element per column in labels. If pNominalLabelStats is non-NULL, it will be filled with confusion-matrix statistics about the predictions for nominal labels. pNominalLabelStats should point to an empty vector when it is passed in; it will be resized to the number of columns in labels. Elements corresponding to continuous label attributes will be set to NULL. Elements corresponding to nominal label attributes will be set to an n x n matrix, where n is the number of possible values in that label column. Each row refers to the expected value, each column refers to the predicted value, and each element contains the count of how many times that expected/predicted pair occurred over the test set. The caller is responsible for deleting each element in pNominalLabelStats.
void precisionRecall (double *pOutPrecision, size_t nPrecisionSize, GMatrix &features, GMatrix &labels, size_t label, size_t nReps)
 label specifies which output to measure. (It should be 0 if there is only one label dimension.) The measurement will be performed nReps times and the results averaged together. nPrecisionSize specifies the number of points at which the function is sampled. pOutPrecision should be an array big enough to hold nPrecisionSize elements for every possible label value. (If the attribute is continuous, it should just be big enough to hold nPrecisionSize elements.)
virtual void trainAndTest (GMatrix &trainFeatures, GMatrix &trainLabels, GMatrix &testFeatures, GMatrix &testLabels, double *pOutResults, std::vector< GMatrix * > *pNominalLabelStats=NULL)
 Trains and tests this learner.
void setAutoFilter (bool b)
 If b is true, enable automatic filter setup. If b is false, disable automatic filter setup. It is enabled by default, so you must explicitly disable it if you do not want this feature. If automatic filter setup is enabled then, when train is called, it will discard any existing filters that have been attached to this learner, and will automatically analyze the training data and create any filters that it determines are needed.
void basicTest (double minAccuracy1, double minAccuracy2, double deviation=1e-6, bool printAccuracy=false)
 This is a helper method used by the unit tests of several model learners.

Static Public Member Functions

static void test ()
 Runs some unit tests related to supervised learning. Throws an exception if any problems are found.

Protected Member Functions

virtual void trainInner (GMatrix &features, GMatrix &labels)=0
 This is the implementation of the model's training algorithm. (This method is called by train).
virtual void predictInner (const double *pIn, double *pOut)=0
 This is the implementation of the model's prediction algorithm. (This method is called by predict).
virtual void predictDistributionInner (const double *pIn, GPrediction *pOut)=0
 This is the implementation of the model's prediction algorithm. (This method is called by predictDistribution).
virtual GMatrix * transduceInner (GMatrix &features1, GMatrix &labels1, GMatrix &features2)
 See GTransducer::transduce.
void setupFilters (GMatrix &features, GMatrix &labels)
 This method determines which data filters (normalize, discretize, and/or nominal-to-cat) are needed and trains them.
size_t precisionRecallNominal (GPrediction *pOutput, double *pFunc, GMatrix &trainFeatures, GMatrix &trainLabels, GMatrix &testFeatures, GMatrix &testLabels, size_t label, int value)
 This is a helper method used by precisionRecall.
size_t precisionRecallContinuous (GPrediction *pOutput, double *pFunc, GMatrix &trainFeatures, GMatrix &trainLabels, GMatrix &testFeatures, GMatrix &testLabels, size_t label)
 This is a helper method used by precisionRecall.
GDomNode * baseDomNode (GDom *pDoc, const char *szClassName)
 Child classes should use this in their implementation of serialize.

Protected Attributes

GTwoWayIncrementalTransform * m_pFeatureFilter
GTwoWayIncrementalTransform * m_pLabelFilter
bool m_autoFilter
size_t m_featureDims
size_t m_labelDims
GNeuralNet ** m_pCalibrations

Detailed Description

This is the base class of algorithms that learn with supervision and have an internal hypothesis model that allows them to generalize to rows that were not available at training time.
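
The following sketch illustrates the basic workflow with any GSupervisedLearner subclass. It is only an illustration: the helper function name is made up, and features and labels are assumed to be GMatrix objects that the caller has already populated with training data (one row per pattern).

    #include "GLearner.h"
    #include <vector>

    using namespace GClasses;

    // Train any supervised learner and make a single prediction.
    // 'features' and 'labels' are assumed to be pre-loaded GMatrix objects.
    void trainAndPredictExample(GSupervisedLearner& model, GMatrix& features, GMatrix& labels)
    {
        // train determines which filters are needed, then calls trainInner
        model.train(features, labels);

        // predict expects arrays sized to match the feature and label columns
        std::vector<double> in(model.featureDims());
        std::vector<double> out(model.labelDims());
        // ... fill 'in' with the feature vector to evaluate ...
        model.predict(&in[0], &out[0]);
        // 'out' now holds the predicted label values
    }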


Constructor & Destructor Documentation

GClasses::GSupervisedLearner::GSupervisedLearner ( GRand &  rand)

General-purpose constructor.

GClasses::GSupervisedLearner::GSupervisedLearner ( GDomNode *  pNode,
GLearnerLoader &  ll 
)

Deserialization constructor.

virtual GClasses::GSupervisedLearner::~GSupervisedLearner ( ) [virtual]

Destructor.


Member Function Documentation

void GClasses::GSupervisedLearner::accuracy ( GMatrix &  features,
GMatrix &  labels,
double *  pOutResults,
std::vector< GMatrix * > *  pNominalLabelStats = NULL 
)

This method assumes that this learner has already been trained. It computes the predictive accuracy for nominal labels and the mean-squared error for continuous labels. pOutResults should point to an array with one element per column in labels. If pNominalLabelStats is non-NULL, it will be filled with confusion-matrix statistics about the predictions for nominal labels. pNominalLabelStats should point to an empty vector when it is passed in; it will be resized to the number of columns in labels. Elements corresponding to continuous label attributes will be set to NULL. Elements corresponding to nominal label attributes will be set to an n x n matrix, where n is the number of possible values in that label column. Each row refers to the expected value, each column refers to the predicted value, and each element contains the count of how many times that expected/predicted pair occurred over the test set. The caller is responsible for deleting each element in pNominalLabelStats.
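
The sketch below illustrates one way this method might be called. The function name is illustrative, the test matrices are assumed to be populated by the caller, and GMatrix::cols() is assumed to return the number of columns.

    #include "GLearner.h"
    #include <vector>

    using namespace GClasses;

    // Measure a trained model and inspect the per-column results.
    void reportAccuracyExample(GSupervisedLearner& model, GMatrix& testFeatures, GMatrix& testLabels)
    {
        std::vector<double> results(testLabels.cols()); // one element per label column
        std::vector<GMatrix*> stats;                    // must be empty when passed in
        model.accuracy(testFeatures, testLabels, &results[0], &stats);

        // results[i] is accuracy (nominal) or mean-squared error (continuous).
        // stats[i] is an n x n confusion matrix for nominal columns, NULL otherwise.
        for(size_t i = 0; i < stats.size(); i++)
            delete stats[i]; // the caller owns the confusion matrices
    }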

GDomNode* GClasses::GSupervisedLearner::baseDomNode ( GDom *  pDoc,
const char *  szClassName 
) [protected]

Child classes should use this in their implementation of serialize.
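
For illustration, a child class named GMyLearner (hypothetical) might use it roughly as follows; the fields it would add to the node are omitted.

    // Hypothetical child class implementation of serialize.
    GDomNode* GMyLearner::serialize(GDom* pDoc)
    {
        // baseDomNode stores the common GSupervisedLearner state under this class name
        GDomNode* pNode = baseDomNode(pDoc, "GMyLearner");
        // ... add any fields specific to GMyLearner to pNode here ...
        return pNode;
    }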

void GClasses::GSupervisedLearner::basicTest ( double  minAccuracy1,
double  minAccuracy2,
double  deviation = 1e-6,
bool  printAccuracy = false 
)

This is a helper method used by the unit tests of several model learners.

void GClasses::GSupervisedLearner::calibrate ( GMatrix &  features,
GMatrix &  labels 
)

Calibrate the model to make predicted distributions reflect the training data. This method should be called after train is called, but before the first time predictDistribution is called. Typically, the same matrices passed as parameters to the train method are also passed as parameters to this method. By default, the mean of continuous labels is predicted as accurately as possible, but the variance only reflects a heuristic measure of confidence. If calibrate is called, however, then logistic regression will be used to map from the heuristic variance estimates to the actual variance as measured in the training data, such that the predicted variance becomes more reliable. Likewise with categorical labels, the mode is predicted as accurately as possible, but the distribution of probability among the categories may not be a very good prediction of the actual distribution of probability unless this method has been called to calibrate them. If you never plan to call predictDistribution, there is no reason to ever call this method.
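
The sketch below illustrates the intended call order (train, then calibrate, then predictDistribution). It assumes the matrices and input vector are supplied by the caller and that GPrediction is default-constructible so the output array can be allocated with new[].

    #include "GLearner.h"

    using namespace GClasses;

    void calibratedPredictionExample(GSupervisedLearner& model, GMatrix& features, GMatrix& labels, const double* pIn)
    {
        model.train(features, labels);
        model.calibrate(features, labels); // after train, before the first predictDistribution

        // pOut must be a pre-allocated array of labelDims() GPrediction objects
        GPrediction* pOut = new GPrediction[model.labelDims()];
        model.predictDistribution(pIn, pOut);
        // each element of pOut now describes a predicted distribution for one label column
        delete[] pOut;
    }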

virtual bool GClasses::GSupervisedLearner::canGeneralize ( ) [inline, virtual]

Returns true because fully supervised learners have an internal model that allows them to generalize previously unseen rows.

Reimplemented from GClasses::GTransducer.

virtual void GClasses::GSupervisedLearner::clear ( ) [pure virtual]

Discards all training for the purpose of freeing memory. If you call this method, you must train before making any predictions. No settings or options are discarded, so you should be able to train again without specifying any other parameters and still get a comparable model.

Implemented in GClasses::GDecisionTree, GClasses::GMeanMarginsTree, GClasses::GRandomForest, GClasses::GBag, GClasses::GAdaBoost, GClasses::GWag, GClasses::GBucket, GClasses::GKNN, GClasses::GInstanceTable, GClasses::GBaselineLearner, GClasses::GIdentityFunction, GClasses::GLinearRegressor, GClasses::GNaiveBayes, GClasses::GNaiveInstance, GClasses::GNeuralNet, and GClasses::GPolynomial.

size_t GClasses::GSupervisedLearner::featureDims ( ) [inline]

Returns the number of feature dims.

GTwoWayIncrementalTransform* GClasses::GSupervisedLearner::featureFilter ( ) [inline]

Returns the current feature filter (or NULL if none has been set).

size_t GClasses::GSupervisedLearner::labelDims ( ) [inline]

Returns the number of label dims.

GTwoWayIncrementalTransform* GClasses::GSupervisedLearner::labelFilter ( ) [inline]

Returns the current label filter (or NULL if none has been set).

void GClasses::GSupervisedLearner::precisionRecall ( double *  pOutPrecision,
size_t  nPrecisionSize,
GMatrix &  features,
GMatrix &  labels,
size_t  label,
size_t  nReps 
)

label specifies which output to measure. (It should be 0 if there is only one label dimension.) The measurement will be performed nReps times and the results averaged together. nPrecisionSize specifies the number of points at which the function is sampled. pOutPrecision should be an array big enough to hold nPrecisionSize elements for every possible label value. (If the attribute is continuous, it should just be big enough to hold nPrecisionSize elements.)
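
The sketch below illustrates how the output buffer might be sized. The value count for the label column is supplied by the caller here so that no particular GMatrix metadata accessor is assumed, and the constants are arbitrary.

    #include "GLearner.h"
    #include <vector>

    using namespace GClasses;

    void samplePrecisionExample(GSupervisedLearner& model, GMatrix& features, GMatrix& labels,
        size_t label, size_t valueCount /* 1 for a continuous label */)
    {
        const size_t nPrecisionSize = 100; // number of sample points
        const size_t nReps = 5;            // repetitions to average
        std::vector<double> precision(nPrecisionSize * valueCount);
        model.precisionRecall(&precision[0], nPrecisionSize, features, labels, label, nReps);
        // 'precision' now holds nPrecisionSize samples for each possible label value
    }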

size_t GClasses::GSupervisedLearner::precisionRecallContinuous ( GPrediction *  pOutput,
double *  pFunc,
GMatrix &  trainFeatures,
GMatrix &  trainLabels,
GMatrix &  testFeatures,
GMatrix &  testLabels,
size_t  label 
) [protected]

This is a helper method used by precisionRecall.

size_t GClasses::GSupervisedLearner::precisionRecallNominal ( GPrediction *  pOutput,
double *  pFunc,
GMatrix &  trainFeatures,
GMatrix &  trainLabels,
GMatrix &  testFeatures,
GMatrix &  testLabels,
size_t  label,
int  value 
) [protected]

This is a helper method used by precisionRecall.

void GClasses::GSupervisedLearner::predict ( const double *  pIn,
double *  pOut 
)

Evaluate pIn to compute a prediction for pOut. The model must be trained (by calling train) before the first time that this method is called. pIn should point to an array of doubles with one element per column in the feature matrix passed to train, and pOut should point to an array of doubles with one element per column in the label matrix.

void GClasses::GSupervisedLearner::predictDistribution ( const double *  pIn,
GPrediction *  pOut 
)

Evaluate pIn and compute a prediction for pOut. pOut is expected to point to an array of GPrediction objects which have already been allocated. There should be labelDims() elements in this array. The distributions will be more accurate if the model is calibrated before the first time that this method is called.

virtual void GClasses::GSupervisedLearner::predictDistributionInner ( const double *  pIn,
GPrediction *  pOut 
) [protected, pure virtual]

This is the implementation of the model's prediction algorithm. (This method is called by predictDistribution).

virtual void GClasses::GSupervisedLearner::predictInner ( const double *  pIn,
double *  pOut 
) [protected, pure virtual]

This is the implementation of the model's prediction algorithm. (This method is called by predict).

void GClasses::GSupervisedLearner::setAutoFilter ( bool  b) [inline]

If b is true, enable automatic filter setup. If b is false, disable automatic filter setup. It is enabled by default, so you must explicitly disable it if you do not want this feature. If automatic filter setup is enabled then, when train is called, it will discard any existing filters that have been attached to this learner, and will automatically analyze the training data and create any filters that it determines are needed.
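
The sketch below illustrates how filters might be attached by hand with automatic setup disabled. The filter objects are assumed to be constructed (and trained, if necessary) by the caller, and the helper name is illustrative.

    #include "GLearner.h"

    using namespace GClasses;

    void manualFilterExample(GSupervisedLearner& model,
        GTwoWayIncrementalTransform* pFeatureFilter,
        GTwoWayIncrementalTransform* pLabelFilter,
        GMatrix& features, GMatrix& labels)
    {
        model.setAutoFilter(false);             // keep train from discarding the filters below
        model.setFeatureFilter(pFeatureFilter); // filter applied to the features
        model.setLabelFilter(pLabelFilter);     // filter applied to the labels
        model.train(features, labels);
    }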

void GClasses::GSupervisedLearner::setFeatureFilter ( GTwoWayIncrementalTransform *  pFilter)

Sets the filter for features. (Note that the "train" method automatically sets the filters, replacing any filters that you have set, so this method is really only useful in conjunction with incremental learning.)

void GClasses::GSupervisedLearner::setLabelFilter ( GTwoWayIncrementalTransform *  pFilter)

Sets the filter for labels. (Note that the "train" method automatically sets the filters, replacing any filters that you have set, so this method is really only useful in conjunction with incremental learning.)

void GClasses::GSupervisedLearner::setupFilters ( GMatrix &  features,
GMatrix &  labels 
) [protected]

This method determines which data filters (normalize, discretize, and/or nominal-to-cat) are needed and trains them.

static void GClasses::GSupervisedLearner::test ( ) [static]

Runs some unit tests related to supervised learning. Throws an exception if any problems are found.

void GClasses::GSupervisedLearner::train ( GMatrix &  features,
GMatrix &  labels 
)

Call this method to train the model. It automatically determines which filters are needed to convert the training features and labels into a form that the model's training algorithm can handle, and then calls trainInner to do the actual training.

virtual void GClasses::GSupervisedLearner::trainAndTest ( GMatrix &  trainFeatures,
GMatrix &  trainLabels,
GMatrix &  testFeatures,
GMatrix &  testLabels,
double *  pOutResults,
std::vector< GMatrix * > *  pNominalLabelStats = NULL 
) [virtual]

Trains and tests this learner.

Reimplemented from GClasses::GTransducer.
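
The sketch below illustrates a call to this method. The four matrices are assumed to be populated by the caller, and GMatrix::cols() is assumed to return the number of label columns.

    #include "GLearner.h"
    #include <vector>

    using namespace GClasses;

    void trainAndTestExample(GSupervisedLearner& model,
        GMatrix& trainFeatures, GMatrix& trainLabels,
        GMatrix& testFeatures, GMatrix& testLabels)
    {
        std::vector<double> results(testLabels.cols()); // one result per label column
        model.trainAndTest(trainFeatures, trainLabels, testFeatures, testLabels, &results[0]);
        // results[i] is accuracy (nominal) or mean-squared error (continuous) for column i
    }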

virtual void GClasses::GSupervisedLearner::trainInner ( GMatrix &  features,
GMatrix &  labels 
) [protected, pure virtual]

This is the implementation of the model's training algorithm. (This method is called by train).

virtual GMatrix* GClasses::GSupervisedLearner::transduceInner ( GMatrix &  features1,
GMatrix &  labels1,
GMatrix &  features2 
) [protected, virtual]

See GTransducer::transduce.

Member Data Documentation