GClasses::GVocabulary Class Reference

This is a helper class which is useful for text-mining. It collects words, stems them, filters them through a list of stop-words, and assigns a discrete number to each word.

#include <GText.h>

Public Member Functions

 GVocabulary (bool stemWords)
 ~GVocabulary ()
void setMinWordSize (size_t n)
 Sets the minimum word size. Smaller words will be ignored. The default is 4.
void addStopWord (const char *szWord)
 Adds a stop word (a common word that should always be ignored).
void addTypicalStopWords ()
 Adds a typical set of stop words.
size_t wordCount ()
 Returns the number of unique words in this vocabulary.
void addWord (const char *szWord, size_t nLen)
 Adds a word to the vocabulary. (If the word is too short or is in the stop-word list, it will not be added.)
void addWordsFromTextBlock (const char *text, size_t len)
 Adds all the words in the text block to the vocabulary.
size_t wordIndex (const char *szWord, size_t len)
 Returns the index of the specified word. Returns -1 if the word is not in the vocabulary (or is too short or is a stop word). Because the return type is size_t, which is unsigned, -1 here means the largest size_t value.
GHeap * heap ()
 Returns a pointer to the heap this uses to store strings.
void newDoc ()
 To track statistics about the number of docs that contain each word, and the max number of times each word occurs in any doc, call this method each time you start adding words from a new document (including the first one). If you don't want to track such stats, you need never call this method. If you call it at all, the first call must come before the first word is added; otherwise it throws an exception.
GWordStats & stats (size_t word)
 Returns the stats about a word. Throws if you weren't tracking stats (i.e., if you didn't call newDoc before each new document).
size_t docCount ()
 Returns the number of documents from which words have been added so far.
double weight (size_t word)
 Computes the weight that should be added to a document vector for each occurrence of a word in the vector-space document model. It is log(number_of_docs / docs_containing_word) / max_word_frequency, so for a given max_word_frequency, words that appear in fewer documents get larger weights.

Protected Attributes

GStemmer * m_pStemmer
size_t m_minWordSize
size_t m_vocabSize
GConstStringHashTable * m_pStopWords
GConstStringToIndexHashTable * m_pVocabulary
GHeap * m_pHeap
char wordBuf [64]
size_t m_docNumber
std::vector< GWordStats > * m_pWordStats

Detailed Description

This is a helper class which is useful for text-mining. It collects words, stems them, filters them through a list of stop-words, and assigns a discrete number to each word.
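
The pipeline described above (collect words, filter them, number them) can be sketched with standard containers. This is a rough illustration, not the GVocabulary implementation: the class name VocabularySketch, the tokenization rule (runs of letters, lower-cased), and the use of std::string are all assumptions, and stemming is left out entirely.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for illustration; not the GVocabulary implementation.
class VocabularySketch {
public:
    void addStopWord(const std::string& w) { m_stopWords.insert(w); }

    // Adds a word unless it is too short or is a stop word.
    void addWord(const std::string& w) {
        if (w.size() < m_minWordSize || m_stopWords.count(w) > 0) return;
        m_index.emplace(w, m_index.size()); // does nothing if w is already present
    }

    // Assumed tokenization: runs of letters, lower-cased. Stemming is omitted.
    void addWordsFromTextBlock(const std::string& text) {
        std::string cur;
        for (unsigned char c : text) {
            if (std::isalpha(c)) {
                cur += static_cast<char>(std::tolower(c));
            } else {
                addWord(cur);
                cur.clear();
            }
        }
        addWord(cur); // flush the last word
    }

    // Returns the word's number, or (size_t)-1 if it was never added.
    std::size_t wordIndex(const std::string& w) const {
        auto it = m_index.find(w);
        return it == m_index.end() ? static_cast<std::size_t>(-1) : it->second;
    }

    std::size_t wordCount() const { return m_index.size(); }

private:
    std::size_t m_minWordSize = 4; // same default as documented for setMinWordSize
    std::unordered_set<std::string> m_stopWords;
    std::unordered_map<std::string, std::size_t> m_index;
};
```

In this sketch each accepted word receives the next unused number the first time it is seen, so repeated words keep their original index.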


Constructor & Destructor Documentation

GClasses::GVocabulary::GVocabulary ( bool  stemWords)
GClasses::GVocabulary::~GVocabulary ( )

Member Function Documentation

void GClasses::GVocabulary::addStopWord ( const char *  szWord)

Adds a stop word (a common word that should always be ignored).

void GClasses::GVocabulary::addTypicalStopWords ( )

Adds a typical set of stop words.

void GClasses::GVocabulary::addWord ( const char *  szWord,
size_t  nLen 
)

Adds a word to the vocabulary. (If the word is too short or is in the stop-word list, it will not be added.)

void GClasses::GVocabulary::addWordsFromTextBlock ( const char *  text,
size_t  len 
)

Adds all the words in the text block to the vocabulary.

size_t GClasses::GVocabulary::docCount ( ) [inline]

Returns the number of documents from which words have been added so far.

GHeap* GClasses::GVocabulary::heap ( ) [inline]

Returns a pointer to the heap this uses to store strings.

void GClasses::GVocabulary::newDoc ( )

To track statistics about the number of docs that contain each word, and the max number of times each word occurs in any doc, call this method each time you start adding words from a new document (including the first one). If you don't want to track such stats, you need never call this method. If you call it at all, the first call must come before the first word is added; otherwise it throws an exception.
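
The bookkeeping implied by this contract can be sketched as follows. The names DocStatsSketch and WordStatsSketch and their fields are hypothetical and are not taken from GWordStats; the sketch only mirrors the documented behavior: per-word document counts, per-word maximum occurrences in a single document, and an exception when newDoc is first called after words were already added.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical per-word record; field names are illustrative only.
struct WordStatsSketch {
    std::size_t docsContaining = 0; // number of documents that contain the word
    std::size_t maxPerDoc = 0;      // most occurrences seen in any single document
    std::size_t lastDoc = 0;        // 1-based number of the last document using the word
    std::size_t countInLastDoc = 0; // occurrences within that document
};

class DocStatsSketch {
public:
    // Call before adding the words of each document, including the first.
    void newDoc() {
        if (m_untracked)
            throw std::logic_error("newDoc must be called before the first word is added");
        ++m_docNumber;
    }

    void addWord(const std::string& w) {
        if (m_docNumber == 0) { // stats are not being tracked
            m_untracked = true;
            return;
        }
        WordStatsSketch& s = m_stats[w];
        if (s.lastDoc != m_docNumber) { // first occurrence in the current document
            s.lastDoc = m_docNumber;
            s.countInLastDoc = 0;
            ++s.docsContaining;
        }
        ++s.countInLastDoc;
        if (s.countInLastDoc > s.maxPerDoc) s.maxPerDoc = s.countInLastDoc;
    }

    std::size_t docCount() const { return m_docNumber; }

    // Throws std::out_of_range if the word has no recorded stats.
    const WordStatsSketch& stats(const std::string& w) const { return m_stats.at(w); }

private:
    std::size_t m_docNumber = 0;
    bool m_untracked = false;
    std::unordered_map<std::string, WordStatsSketch> m_stats;
};
```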

void GClasses::GVocabulary::setMinWordSize ( size_t  n) [inline]

Sets the minimum word size. Smaller words will be ignored. The default is 4.

GWordStats& GClasses::GVocabulary::stats ( size_t  word)

Returns the stats about a word. Throws if you weren't tracking stats (i.e., if you didn't call newDoc before each new document).

double GClasses::GVocabulary::weight ( size_t  word)

Computes the weight that should be added to a document vector for each occurrence of a word in the vector-space document model. It is log(number_of_docs / docs_containing_word) / max_word_frequency, so for a given max_word_frequency, words that appear in fewer documents get larger weights.
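
As a worked illustration of the formula, the stand-alone function below evaluates it directly. The function name wordWeight and its parameter names are hypothetical, and the natural logarithm is an assumption, since the text above does not state the base.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: log(number_of_docs / docs_containing_word) / max_word_frequency.
// Assumes the natural logarithm; the documentation does not name the base.
double wordWeight(std::size_t numberOfDocs,
                  std::size_t docsContainingWord,
                  std::size_t maxWordFrequency) {
    // Both divisors must be non-zero for the formula to be defined.
    assert(docsContainingWord > 0 && maxWordFrequency > 0);
    const double idf = std::log(static_cast<double>(numberOfDocs) /
                                static_cast<double>(docsContainingWord));
    return idf / static_cast<double>(maxWordFrequency);
}
```

Under this formula a word that occurs in every document gets weight 0, because log(1) is 0.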

size_t GClasses::GVocabulary::wordCount ( ) [inline]

Returns the number of unique words in this vocabulary.

size_t GClasses::GVocabulary::wordIndex ( const char *  szWord,
size_t  len 
)

Returns the index of the specified word. Returns -1 if the word is not in the vocabulary (or is too short or is a stop word). Because the return type is size_t, which is unsigned, -1 here means the largest size_t value.
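
Since size_t is unsigned, a test such as "index < 0" can never succeed; the result has to be compared against the converted value instead. A minimal sketch of that check follows, where kNoWord and isKnownWord are hypothetical names that are not part of GText.h.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical constant: the value that -1 takes when converted to size_t.
constexpr std::size_t kNoWord = static_cast<std::size_t>(-1);

static_assert(kNoWord == std::numeric_limits<std::size_t>::max(),
              "-1 converts to the largest size_t value");

// True if idx is a real vocabulary index rather than the "not found" value.
constexpr bool isKnownWord(std::size_t idx) { return idx != kNoWord; }
```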


Member Data Documentation

char GClasses::GVocabulary::wordBuf[64] [protected]