Functions and Data Definition in TEXminer 1.0
TEXminer lets you analyze Texts in Unicode Format.
Save your Text in Unicode/UTF8 Format so that all characters are read correctly.
The Text Database can be saved in XML, where the original Text, the Sentence and Word Lists and
additional Parameters (e.g. Abbreviations) are stored.
The Functionality will be extended in the coming months.
The program is easiest to try out with the Text Database samples (see Installation at the end of this document):
- Just load the Text Database Sample1.xml
- Click Menu Analyze - Search Word
- Enter a Word to be found
- The Sentence List marks each Sentence containing the Word with an x in Column x
TEXminer has Abbreviations Lists for the following Languages (in the Data Directory, where you can also create your own List):
- English : AbbreviationsENG.txt
- French : AbbreviationsFRA.txt
- German : AbbreviationsGER.txt
The Analysis Functionality provides the following Topics:
- 1: Search Word
- 2: ...
- 3: ...
- 4: ...
- 5: ...
- 6: ...
- 7: ...
- 8: ...
- 9: ...
- 10: ...
- 11: ...
- 12: ...
- 13: ...
- 14: ...
A Text Database consists of a Sentence List and a Word List. You can load more than one Text and build a common Text Database.
The Sentence List and the Word List each have a Column which indicates the Text ID. The Word List counts duplicated Word Forms.
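The structure described above can be pictured with a minimal Python sketch. It is not taken from the VB.NET source: the class and field names (TextDatabase, SentenceRow, WordRow, ds, count) are assumptions chosen for illustration, and counting duplicates separately per Text is an assumption as well.

```python
from dataclasses import dataclass, field


@dataclass
class SentenceRow:
    ds: int          # Text ID (dataset number) the sentence belongs to
    text: str        # the sentence itself
    mark: str = ""   # hypothetical "x" column, filled by a word search


@dataclass
class WordRow:
    ds: int          # Text ID of the text in which the form was found
    form: str        # the word form
    count: int = 1   # how often this word form occurs


@dataclass
class TextDatabase:
    sentences: list = field(default_factory=list)
    words: list = field(default_factory=list)

    def add_word(self, ds, form):
        """Count a duplicated word form instead of storing it twice."""
        for row in self.words:
            if row.ds == ds and row.form == form:
                row.count += 1
                return row
        row = WordRow(ds, form)
        self.words.append(row)
        return row


db = TextDatabase()
db.sentences.append(SentenceRow(1, "The cat saw the other cat."))
for form in ["The", "cat", "saw", "the", "other", "cat"]:
    db.add_word(1, form)
```

In this sketch "The" and "the" stay separate Word Forms because the comparison is case-sensitive; whether TEXminer treats them the same way is not stated here.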
To build a Text Database for one Text:
- Click Menu File - Import - Unicode Text.
- The File Input Dialog lets you choose the Text File to load (default Directory is \bin\Debug\Data).
- After pressing OK the Text-Datasets Tab shows you the Dataset-Number, the File Name, the Length (Characters) and the Beginning of the Text.
- Click Menu File - Import - Abbreviations.
- The File Input Dialog lets you choose an Abbreviations List containing Abbreviations ending with a Full Stop (default Directory is \bin\Debug\Data).
- As the Abbreviations Lists are Language-specific, you may have to create one on your own (use a Foreign Language Dictionary).
- The Text Samples of TEXminer are English Texts, so choose AbbreviationsENG.txt, the English Variant.
- After loading the Abbreviations File the Text-Dataset Tab shows you the Number of Abbreviations.
- Click Menu Analyze - Build Database.
- The Text-Database Tab shows you on the left the Sentence List and on the right the Word List generated and sorted alphabetically.
- Above the Word List the Label All Forms shows how many Words were found in the Text altogether. Duplicated Word Forms are counted in the List:
click twice on the List Header count to sort the List by the Number of times each Word Form appears.
- The List Column DS indicates the number of the Dataset, which can be useful if you loaded more than one Text.
- Click Menu Export - DB as HTML.
- The File Output Dialog lets you enter a Name for the HTML File (default Directory is \bin\Debug\Result).
- After pressing OK an HTML Document is created containing the Dataset Overview and the Word List.
- View this Document with any HTML Browser.
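The Word List and HTML export steps above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the logic of TEXminer's buildTextDatabase or saveSortedWlist: splitting on whitespace, the punctuation set and the HTML layout are all hypothetical choices.

```python
import html
from collections import Counter

PUNCTUATION = ".,;:!?\"'()[]"   # assumed set of characters to trim


def build_word_list(text):
    """Split on whitespace, trim punctuation, and count each word form."""
    forms = (raw.strip(PUNCTUATION) for raw in text.split())
    return Counter(form for form in forms if form)


def word_list_to_html(counts, title="Word List"):
    """Render the counted word forms as an HTML table, sorted by word form."""
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(html.escape(form), n)
        for form, n in sorted(counts.items())
    )
    return (
        '<html><head><meta charset="utf-8"><title>{}</title></head>\n'
        "<body><table>\n"
        "<tr><th>Word Form</th><th>count</th></tr>\n"
        "{}\n"
        "</table></body></html>\n"
    ).format(html.escape(title), rows)


counts = build_word_list("The cat saw the other cat.")
all_forms = sum(counts.values())   # total number of words, duplicates included
page = word_list_to_html(counts)
```

Writing the page to disk would then be a matter of opening the target file with encoding="utf-8" and writing the string.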
The most important Additional Parameters are the Abbreviations Lists. They are needed to be able to segment the Sentences correctly.
Only Abbreviations ending with a Full Stop need to be listed: when such an Abbreviation is detected in your Texts, its Full Stop does not end the Sentence.
Before creating your own Lists or extending the existing ones, have a look at one of the provided Lists. They are plain Unicode Files with one Abbreviation per line
(the following are the Abbreviations beginning with a/A in the English Abbreviations List; alphabetical Order gives a better Overview):
- a.m.
- abt.
- abv.
- AD.
- Adm.
- admin.
- alt.
- amp.
- Apr.
- Aug.
- av.
- Av.
- Ave.
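How such a List keeps a Sentence from being cut at an Abbreviation can be shown with a small Python sketch. It is a simplified illustration, not TEXminer's segmentToSentences: the function names, the whitespace tokenization and the set of end marks are assumptions, and the separate number check (checkNumber) is not modeled.

```python
def load_abbreviations(lines):
    """One abbreviation per line; blank lines are ignored."""
    return {line.strip() for line in lines if line.strip()}


def segment_sentences(text, abbreviations):
    """Split at '.', '!' or '?' unless the token is a listed abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        if ends_sentence and token not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:                      # trailing text without an end mark
        sentences.append(" ".join(current))
    return sentences


abbrevs = load_abbreviations(["a.m.", "Ave.", "Aug."])
text = "We met at 9 a.m. on Fifth Ave. in New York. It rained."
result = segment_sentences(text, abbrevs)
naive = segment_sentences(text, set())
```

With the three Abbreviations loaded, result holds two Sentences; with an empty List, naive is cut into four pieces after "a.m." and "Ave." as well. A limitation of this sketch is that an Abbreviation standing at the real end of a Sentence is not recognized as a Sentence end.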
A Text Database can be saved in a TEXminer-specific XML Document, so after Retrieval you can just start to analyze (Analysis Results are not stored).
Storage and Retrieval of a Text Database is very easy:
- To store just click Menu File - Save.
- The File Output Dialog lets you enter a Name for your Database (default Directory is \bin\Debug\Serial).
- After pressing OK an XML Document is created containing the original Text, the Sentence and Word Lists and additional Parameters (e.g. Abbreviations).
- To retrieve just click Menu File - Open.
- The File Input Dialog lets you choose an XML Document containing a Text Database (default Directory is \bin\Debug\Serial).
- After pressing OK the first two Tabs of TEXminer should show the Text Database.
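The store and retrieve cycle above can be illustrated with a Python sketch. The real TEXminer XML layout is not documented here, so every element and attribute name below (TextDatabase, OriginalText, Sentence, Word, count, Abbreviation) is hypothetical, as are the function names.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_database(path, original, sentences, words, abbreviations):
    """Write text, sentence list, word counts and abbreviations to XML."""
    root = ET.Element("TextDatabase")          # hypothetical element names
    ET.SubElement(root, "OriginalText").text = original
    sents = ET.SubElement(root, "Sentences")
    for s in sentences:
        ET.SubElement(sents, "Sentence").text = s
    wlist = ET.SubElement(root, "Words")
    for form, count in words.items():
        ET.SubElement(wlist, "Word", count=str(count)).text = form
    abbr = ET.SubElement(root, "Abbreviations")
    for a in abbreviations:
        ET.SubElement(abbr, "Abbreviation").text = a
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def load_database(path):
    """Read the same layout back into plain Python structures."""
    root = ET.parse(path).getroot()
    return {
        "original": root.findtext("OriginalText"),
        "sentences": [e.text for e in root.iter("Sentence")],
        "words": {e.text: int(e.get("count")) for e in root.iter("Word")},
        "abbreviations": [e.text for e in root.iter("Abbreviation")],
    }


path = os.path.join(tempfile.mkdtemp(), "sample.xml")
save_database(
    path,
    "It rained. It poured.",
    ["It rained.", "It poured."],
    {"It": 2, "rained": 1, "poured": 1},
    ["a.m."],
)
restored = load_database(path)
```

As in TEXminer, only the Text Database itself is stored in this sketch; Analysis Results such as search marks are not part of the file.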
The Analysis of the Text is the main aim of TEXminer, and its Functionality will be extended in the future.
Here is a List of the Analysis Functionality:
- Search Word
- ...
- ...
- ...
- ...
Now all Functions in Detail:
Searching a Word
- Click Menu Analyze - Search Word
- An Input Box lets you enter the Word Form to search for.
- After entering the Word Form press OK.
- A Message Box says how many times the Word Form was found.
- The Sentence List of the Text Database has a Column x which is filled with an x for each Sentence containing the Word Form.
- Click two times on the List Header x to sort the List by the filled x Rows.
- Hover the Mouse a Second over a Sentence of interest to get a Tool Tip showing the whole Sentence.
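The search steps above can be sketched in Python as follows. This is an illustration, not TEXminer's searchWordInSents: whole-word, case-sensitive matching and counting every occurrence (rather than every Sentence) are assumptions, and the function name is hypothetical.

```python
import re


def search_word(sentences, word_form):
    """Return (hit count, marks) with an 'x' for each matching sentence."""
    pattern = re.compile(r"\b" + re.escape(word_form) + r"\b")
    marks, hits = [], 0
    for sentence in sentences:
        found = len(pattern.findall(sentence))
        hits += found
        marks.append("x" if found else "")
    return hits, marks


sentences = ["The cat saw the other cat.", "It rained.", "A cat slept."]
hits, marks = search_word(sentences, "cat")

# Sorting by the x column brings the marked sentences to the top
ordered = sorted(zip(marks, sentences), key=lambda row: row[0], reverse=True)
```

Here hits is 3 and the first and third Sentence are marked, so after sorting they come before "It rained.".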
...
- Click Menu Analyze - ...
- ...
- ...
- ...
- ...
So far no Adaptation or Configuration is needed.
This table gives an overview of the programmed Functions in the VB.NET Source Code:
| Name | Parameters | Return Type | Explanation |
| initTEXdatasets | var As Integer | Boolean | Initialization of Text-Dataset List |
| initTEXabbreviations | var As Integer | Boolean | Initialization of Text-Abbreviations List |
| initTEXdatabase | var As Integer | Boolean | Initialization of Text-Database Lists |
| minOfTwo | first As Long, second As Long | Long | Minimum of two Values |
| maxOfTwo | first As Long, second As Long | Long | Maximum of two Values |
| readSerialXML | fSerial As String | Boolean | Load from XML Serialisation |
| saveSerialXML | fSerial As String | Boolean | Save to XML Serialisation |
| readASCII | ASCIIfile As String | Boolean | read Unicode/UTF8 File |
| readAbbreviations | fSerial As String | Boolean | read Language-specific Abbreviations |
| saveSortedWlist | HTMLfile As String | Boolean | save HTML File |
| displayTextDatasets | var As Integer | Boolean | display all Text-Datasets in TabView |
| buildTextDatabase | var As Integer | Long | build Text-Database |
| segmentToSentences | dsIndex As Integer | Long | segment Text-Dataset dsIndex into Sentences |
| checkAbbreviation | tstSentence As String | Boolean | check if Sentence ends with an Abbreviation |
| checkNumber | tstSentence As String | Boolean | check if Sentence ends with a Number |
| segmentToWords | dsIndex As Integer | Long | segment Text-Dataset dsIndex into Words |
| refineWord | rawWord As String | String | refine raw Word (trimming) |
| testWordInDB | tstWord As String | Long | test if Word is in Database |
| displayTextDatabase | var As Integer | Boolean | display Text-Database in TabViews |
| getWordDSindex | wordIndex As Long | Long | give Dataset Index for Word Index |
| searchWordInSents | tstWord As String, sentsDSind As Integer | Long | search Word in Sentences |
| displayMarkedSentence | var As Integer | Boolean | display Marked Sentences (e.g. by Word Search) |
Requisites
- If necessary, download the .NET Framework (Redistributable Package) from Microsoft
Installation
- Extract the Project ZIP file to a new folder
Start
- Open the VB.NET Project with MS Visual Basic 2008 (Express Edition) or start the EXE file in the TEXminer/bin/Debug directory
State: Beta Version 1.0 / Nov 2012 by gearwheelsoft