
GClasses::GDecisionTree Class Reference

This is an efficient learning algorithm. It divides on the attributes that reduce entropy the most, or alternatively can make random divisions. More...

#include <GDecisionTree.h>

Inheritance diagram for GClasses::GDecisionTree:
GClasses::GDecisionTree → GClasses::GSupervisedLearner → GClasses::GTransducer

List of all members.

Public Types

enum  DivisionAlgorithm { MINIMIZE_ENTROPY, RANDOM }

Public Member Functions

 GDecisionTree (GRand &rand)
 GDecisionTree (GDomNode *pNode, GLearnerLoader &ll)
 Loads from a DOM.
virtual ~GDecisionTree ()
virtual GDomNode * serialize (GDom *pDoc)
 Marshal this object into a DOM, which can then be converted to a variety of serial formats.
void useRandomDivisions (size_t randomDraws=1)
 Specifies that this decision tree should use random divisions (instead of divisions that reduce entropy). Random divisions make the algorithm train somewhat faster and increase model variance, which makes it better suited for ensembles, but they also make the tree vulnerable to irrelevant attributes.
size_t leafThresh ()
 Returns the leaf threshold.
void setLeafThresh (size_t n)
 Sets the leaf threshold. When the number of samples is <= this value, it will no longer try to divide the data, but will create a leaf node. The default value is 1. For noisy data, a larger value may be advantageous.
void setMaxLevels (size_t n)
 Sets the max levels. When a path from the root to the current node contains n nodes (including the root), it will no longer try to divide the data, but will create a leaf node. If set to 0, then there is no maximum. 0 is the default.
virtual void clear ()
 Frees the model.
size_t treeSize ()
 Returns the number of nodes in this tree.
void print (std::ostream &stream, GArffRelation *pFeatureRel=NULL, GArffRelation *pLabelRel=NULL)
 Prints an ASCII representation of the decision tree to the specified stream. pFeatureRel and pLabelRel are optional relations that can be supplied to provide better meta-data and make the print-out richer.
void autoTune (GMatrix &features, GMatrix &labels)
 Uses cross-validation to find a set of parameters that works well with the provided data.

Static Public Member Functions

static void test ()
 Performs unit tests for this class. Throws an exception if there is a failure.

Protected Member Functions

virtual void trainInner (GMatrix &features, GMatrix &labels)
 See the comment for GSupervisedLearner::trainInner.
virtual void predictInner (const double *pIn, double *pOut)
 See the comment for GSupervisedLearner::predictInner.
virtual void predictDistributionInner (const double *pIn, GPrediction *pOut)
 See the comment for GSupervisedLearner::predictDistributionInner.
GDecisionTreeLeafNode * findLeaf (const double *pIn, size_t *pDepth)
 Finds the leaf node that corresponds with the specified feature vector.
GDecisionTreeNode * buildBranch (GMatrix &features, GMatrix &labels, std::vector< size_t > &attrPool, size_t nDepth, size_t tolerance)
 A recursive helper method used to construct the decision tree.
double measureInfoGain (GMatrix *pData, size_t nAttribute, double *pPivot)
 InfoGain is defined as the difference in entropy in the data before and after dividing it based on the specified attribute. For continuous attributes, it uses the difference between the original variance and the sum of the variances of the two parts after dividing at the point that maximizes this value.
size_t pickDivision (GMatrix &features, GMatrix &labels, double *pPivot, std::vector< size_t > &attrPool, size_t nDepth)

Protected Attributes

sp_relation m_pFeatureRel
sp_relation m_pLabelRel
GDecisionTreeNode * m_pRoot
DivisionAlgorithm m_eAlg
size_t m_leafThresh
size_t m_randomDraws
size_t m_maxLevels

Detailed Description

This is an efficient learning algorithm. It divides on the attributes that reduce entropy the most, or alternatively can make random divisions.


Member Enumeration Documentation

Enumerator:
MINIMIZE_ENTROPY 
RANDOM 

Constructor & Destructor Documentation

GClasses::GDecisionTree::GDecisionTree ( GRand &  rand)
GClasses::GDecisionTree::GDecisionTree ( GDomNode *  pNode,
GLearnerLoader &  ll 
)

Loads from a DOM.

virtual GClasses::GDecisionTree::~GDecisionTree ( ) [virtual]

Member Function Documentation

void GClasses::GDecisionTree::autoTune ( GMatrix &  features,
GMatrix &  labels 
)

Uses cross-validation to find a set of parameters that works well with the provided data.

GDecisionTreeNode* GClasses::GDecisionTree::buildBranch ( GMatrix &  features,
GMatrix &  labels,
std::vector< size_t > &  attrPool,
size_t  nDepth,
size_t  tolerance 
) [protected]

A recursive helper method used to construct the decision tree.

virtual void GClasses::GDecisionTree::clear ( ) [virtual]

Frees the model.

Implements GClasses::GSupervisedLearner.

GDecisionTreeLeafNode* GClasses::GDecisionTree::findLeaf ( const double *  pIn,
size_t *  pDepth 
) [protected]

Finds the leaf node that corresponds with the specified feature vector.

size_t GClasses::GDecisionTree::leafThresh ( ) [inline]

Returns the leaf threshold.

double GClasses::GDecisionTree::measureInfoGain ( GMatrix *  pData,
size_t  nAttribute,
double *  pPivot 
) [protected]

InfoGain is defined as the difference in entropy in the data before and after dividing it based on the specified attribute. For continuous attributes, it uses the difference between the original variance and the sum of the variances of the two parts after dividing at the point that maximizes this value.

size_t GClasses::GDecisionTree::pickDivision ( GMatrix &  features,
GMatrix &  labels,
double *  pPivot,
std::vector< size_t > &  attrPool,
size_t  nDepth 
) [protected]
virtual void GClasses::GDecisionTree::predictDistributionInner ( const double *  pIn,
GPrediction *  pOut 
) [protected, virtual]
virtual void GClasses::GDecisionTree::predictInner ( const double *  pIn,
double *  pOut 
) [protected, virtual]
void GClasses::GDecisionTree::print ( std::ostream &  stream,
GArffRelation *  pFeatureRel = NULL,
GArffRelation *  pLabelRel = NULL 
)

Prints an ASCII representation of the decision tree to the specified stream. pFeatureRel and pLabelRel are optional relations that can be supplied to provide better meta-data and make the print-out richer.

virtual GDomNode* GClasses::GDecisionTree::serialize ( GDom *  pDoc) [virtual]

Marshal this object into a DOM, which can then be converted to a variety of serial formats.

Implements GClasses::GSupervisedLearner.

void GClasses::GDecisionTree::setLeafThresh ( size_t  n) [inline]

Sets the leaf threshold. When the number of samples is <= this value, it will no longer try to divide the data, but will create a leaf node. The default value is 1. For noisy data, a larger value may be advantageous.

void GClasses::GDecisionTree::setMaxLevels ( size_t  n) [inline]

Sets the max levels. When a path from the root to the current node contains n nodes (including the root), it will no longer try to divide the data, but will create a leaf node. If set to 0, then there is no maximum. 0 is the default.

static void GClasses::GDecisionTree::test ( ) [static]

Performs unit tests for this class. Throws an exception if there is a failure.

Reimplemented from GClasses::GSupervisedLearner.

virtual void GClasses::GDecisionTree::trainInner ( GMatrix &  features,
GMatrix &  labels 
) [protected, virtual]
size_t GClasses::GDecisionTree::treeSize ( )

Returns the number of nodes in this tree.

void GClasses::GDecisionTree::useRandomDivisions ( size_t  randomDraws = 1) [inline]

Specifies that this decision tree should use random divisions (instead of divisions that reduce entropy). Random divisions make the algorithm train somewhat faster and increase model variance, which makes it better suited for ensembles, but they also make the tree vulnerable to irrelevant attributes.


Member Data Documentation

GDecisionTreeNode* GClasses::GDecisionTree::m_pRoot [protected]