Cerolobo Parser
v 8.0
By: Cerolobo (aka, Craig Williams)
Index
  1.0 - Overview
        The Cero Parser is designed to be/have:
          Portable
          No Conditional Compilation
          Simple To Use
          No Obscure Data Types
          No Macros
          Documented
          No Tabs
          Block Formatting/Alignment
          Dynamic, Fast, and Safe
          A Complete Package in one File
  2.0 - Features
    2.1 - File Handling
    2.2 - Tokens & Token Sets
          Token Parameters/Definable Logic
            Return
            switchto
            ignore
              Sample Program: New Line Counter
            fun
            params
          Token Sets
            Sample Program: Print comments and Strings
    2.3 - Text Mode Functions
    2.4 - Binary Mode Functions
    2.5 - Function Callbacks
    2.6 - String Manipulation Functions
  3.0 - Using the Parser
    3.1 - Compiling the Parser
    3.2 - Initialization
    3.3 - Setting up the Token Set
    3.4 - Parsing the file
    3.5 - Deinitialization
  4.0 - Functions
    4.1 - Public Parser Functions
      4.1.01 - AddTokenSep()
      4.1.02 - AddTokenSeparator()
      4.1.03 - AddTokenSet()
      4.1.04 - End()
      4.1.05 - ErrorCode()
      4.1.06 - GetFilePosition()
      4.1.07 - GrabBinaryFloat()
      4.1.08 - GrabBinaryInt()
      4.1.09 - GrabBytes()
      4.1.10 - GrabChar()
      4.1.11 - GrabFloat()
      4.1.12 - GrabInt()
      4.1.13 - GrabToken()
      4.1.14 - LoadFile()
      4.1.15 - LoadMemory()
      4.1.16 - LoadMemoryLen()
      4.1.17 - ParserDeInit()
      4.1.18 - ParserInit()
      4.1.19 - PeekToken()
      4.1.20 - PrintErrorCode()
      4.1.21 - Seek()
      4.1.22 - SetFilePosition()
      4.1.23 - SetTokenSet()
      4.1.24 - GetParserState()
      4.1.25 - SetParserState()
      4.1.26 - GenericDiscard()
    4.2 - Private Parser Functions
      4.2.01 - GrabLeftover()
      4.2.02 - GrabNextChunk()
      4.2.03 - ProcessToken()
    4.3 - String Manipulation Functions
      4.3.01 - RemoveWhiteSpaces()
      4.3.02 - ToUpper()
      4.3.03 - ToLower()
      4.3.04 - Dup()
      4.3.05 - DupLen()
      4.3.06 - DupRange()
      4.3.07 - DupRangeFile()
      4.3.08 - Cmp()
  5.0 - Change Log

1.0 - Overview

The name of this library is the "Cerolobo Parser"; however, I usually refer to it as the "Cero Parser", or just "Parser". The Cero Parser was originally designed to meet my own needs. Since its original creation, the parser has expanded into something simply beautiful.
A lot of time and thought has been put into the library, to ensure that it behaves as it should. The library is entirely written by me.

One quick note: this entire document was hand written in Crimson Editor. That, combined with my horrible spelling & grammar, will lead to various errors in the document.

Warning: You may find a few "Rant" blocks of text. They are annotated beforehand, and are ignorable. They may provide a better look into how my mind/logic works though.

The parser was designed with several specific concepts/features in mind.

Portable:
Portability is one of the main concepts that I spent a lot of time on. If the platform you are compiling on supports ANSI C with stdio.h, stdlib.h, and string.h, the library should compile without any issues.

~ Rant Alert - Ignorable ~
I have seen countless open source projects that fail on anything other than the original designer's machine. I've seen projects that are so important that they just have to be in the root directory or environment path of the computer. Why? Yes, it might make development easier, but it makes it a hell of a lot harder for anyone else to use it. Then you have the whole automake/configure crowd. While automake can be useful, it is not portable at all.
~ End of Rant ~

In order to ensure portability, the Cero Parser was tested on a Windows and a Linux machine. Unfortunately, I don't have access to a Mac development box, so I'm forced to make do. Since the parser does not require any GUI, I was able to rely on strict ANSI C (C89).
Multiple compilers are used and tested as follows:

  Windows:
    MS Compiler
      cl *.c /W4
    GNU
      gcc *.c -Wall -Wextra -ansi -pedantic
      gcc *.c -Wall -Wextra -ansi -pedantic -mno-cygwin
      g++ *.c -Wall -Wextra -ansi -pedantic
    Borland Compiler
      bcc -w *.c
    Digital Mars
      dmc file.c -A
    Intel Compiler
      icl *.c /W3
    DeSmet C
      c88 <file - one at a time>
    Watcom
      wcc <file - one at a time> -wx
  Linux:
    GNU
      gcc *.c -Wall -Wextra -ansi -pedantic
      g++ *.c -Wall -Wextra -ansi -pedantic

For each compiler, the maximum number of warnings is turned on. No warnings or errors are acceptable for a release version. The only exception to this policy lies with the Microsoft Compiler. /Wall is the maximum; however, /Wall generates far too many warnings, half of which come from the MS headers.

Out of all the compilers, DeSmet C is an interesting one. DeSmet C is a 16-bit DOS compiler from 1987. This compiler is used to ensure that the code will scale correctly in terms of variable type lengths. It also does not support any extensions of the C language. In fact, it does some things incorrectly (usually due to how the syntax is parsed). Making the code DeSmet C compatible has produced some weird looking code; however, I usually attach a comment as to why it's that way.

~ Rant Alert - Ignorable ~
No Conditional Compilation:
Conditional compilation (#ifdef & such) is just bad. Not only does it destroy portability, but it greatly increases the difficulty of just jumping in and using someone else's library. I have seen several open source projects that enforce the use of automake. Through their automake/configure scripts, several defines are used; however, they are rarely documented. They also make code really ugly to look at.

The only place I use an #ifdef is for header protection, which stops redefinitions of functions in horrible include hierarchies. Again, in several open source projects, I have seen some really ugly include hierarchies.
Including just one file will usually end up including nearly every other file in the project. Not only that, but they end up including the same file several times! Gah!!!! Why must every file be dependent on every other file?!?!
~ End of Rant ~

Simple To Use:
To make the library as simple to use as possible, several conventions are followed. Firstly, all the function names and such follow my coding standards (included). Other than that, all of the code is in two files, a .c and a .h. I find it far simpler to just copy two files into your project, and then include one file (#include "Parser.h", by default) to get the library to work with other code. No defines or special compilation flags are required.

In order to increase the simplicity, a file handler was built into the project. Personally, I find file I/O to be really ugly to look at, not to mention that it is fairly inefficient to read in an entire file, so all of that has been taken care of.

No Obscure Data Types:
No defined or typedef types are used, beyond removing the "struct" keyword. This should make it obvious what type of data each variable takes.

No Macros:
Yes, macros can be really helpful; however, they add another level of obscurity to your code. They are not used in the library, to make everything as straightforward as possible.

Documented:
To improve your understanding of the project, documentation is included. Every function has a function header and there is a file header in every file. Comments are added to the code; however, worthless comments are avoided as much as possible.

No Tabs:
Tabs are evil. While they may be set to 4 spaces in one program, they may be set to x spaces in another program. Tabs destroy alignment and code flow, so they are all removed.

Block Formatting/Alignment:
I tend to be an alignment whore. I find it far easier to look at and read code if it is separated into blocks.
Dynamic, Fast, and Safe:
The library was designed to be as dynamic as possible, without sacrificing a huge number of cycles. All code is benchmarked, and weighed against the usefulness of the feature. If a feature eats up a huge number of cycles while not being very useful, it will not be implemented.

To ensure that the Parser is safe, a few cycles are spent on error checking. All allocated memory is checked to ensure that it is valid (not null), and buffer under/over reads are checked. Valgrind and my own memory manager are used to check for memory related issues and memory leaks. No memory leaks are tolerated.

That being said, it is still possible to crash the program through the parser. If you pass in a pointer to a bad chunk of memory, the parser will most probably crash.

A Complete Package in one File:
The parser does not depend on any other libraries, other than the standard C library. Namely, stdio.h, stdlib.h, and string.h.
2.0 - Features
2.1 - File Handling

A built-in file handling system is implemented in the parser. The file handling system includes a file fragmentation/caching system. When ParserInit(<file>, <bufsize>) is called, a buffer size is specified. When the data is read in from the file, the specified buffer size determines how many bytes are read in at a time. Since a token separator can fall across the ends of the data read in, the parser accounts for this fragmentation. For example, take

  bufsize - 3
  data    - "0123456789ABCDEF"

With a bufsize of 3, the parser will fragment the file into

  Fragment 1 - "012"
  Fragment 2 - "345"
  Fragment 3 - "678"
  Fragment 4 - "9AB"
  Fragment 5 - "CDE"
  Fragment 6 - "F"

If a token separator was declared as "234", it would not be detected, since the entire string would never be in the input buffer. To handle this, the buffer size is expanded by the length of the longest token separator minus 1. IE, if "234" was the only separator, then the buffer size would be expanded by 2 (Length("234") - 1).

  Original Buffer:
     -- -- --
    |  |  |  |
     -- -- --

  Expanded:
     -- -- -- -- --
    |  |  |  |  |  |
     -- -- -- -- --

Note: A null terminator is attached as well, but it is not represented.

When the data is actually read in, the last (longest separator length - 1) characters of the previous read are attached to the front of the buffer. This ensures that all the characters are checked against all the possible token separators.

  bufsize     - 3
  Longest Sep - 3
  Actual buf  - 5 (plus a null terminator, so it's actually 6 bytes)
  data        - "0123456789ABCDEF"

  Fragment 1 - "012"
  Fragment 2 - "12345"
  Fragment 3 - "45678"
  Fragment 4 - "789AB"
  Fragment 5 - "ABCDE"
  Fragment 6 - "DEF"

This does force some redundant checking; however, it is far more important that the parser correctly locates the tokens. If performance is an issue, using a small number of short token separators with a larger buffer size will greatly increase performance.
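The overlap scheme described above can be sketched in a few lines of standalone C. This is only an illustration of the idea, not code taken from Parser.c: BuildFragment() and its parameters are hypothetical names, and the real parser reads from a file rather than from a string in memory.

```c
#include <string.h>

/* Hypothetical helper, not part of Parser.c.
 * Copies fragment number `index` (0-based) of `data` into `out`, the way
 * a read of `bufsize` new bytes would look once the last
 * (longest_sep - 1) bytes of the previous read are put in front of it.
 * `out` must hold at least bufsize + longest_sep bytes.
 * Returns the fragment length, or 0 if there is no such fragment. */
int BuildFragment(const char *data, int bufsize, int longest_sep,
                  int index, char *out)
{
  int overlap = longest_sep - 1;
  int len     = (int)strlen(data);
  int start   = index * bufsize;   /* first new byte of this read */
  int from;
  int count;

  if (bufsize < 1 || longest_sep < 1 || index < 0 || start >= len)
    return 0;

  from = start - overlap;          /* carry over the tail of the last read */
  if (from < 0)
    from = 0;

  count = start + bufsize - from;
  if (from + count > len)          /* last fragment may be short */
    count = len - from;

  memcpy(out, data + from, (size_t)count);
  out[count] = '\0';
  return count;
}
```

Calling BuildFragment() with index 0, 1, 2, ... until it returns 0 walks through the file one read at a time. With a separator of "234", the second fragment holds the whole separator even though the raw 3 byte reads split it in two.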
Larger buffer sizes decrease the number of reads from the hard drive; however, the memory footprint of the parser will increase. The smallest possible memory footprint can be achieved by setting the buffer size to 1; however, it is far slower. A buffer size of 1024 bytes (1 KB) is recommended for general purposes.

2.2 - Tokens & Token Sets

Token Separators:
The actual parsing syntax of the parser is defined as "Token Separators". There are several tokenizers on the market (including one in string.h) that are fast; however, a single character is not always enough to make parsing a file simple and easy. As such, full strings are used and scanned for. I refer to the "delimiters" as "Token Separators", since additional logic can be attached to them. This library is called a parser, instead of a tokenizer, due to this additional logic. Not to mention that function callbacks are supported as well.

The order in which the tokens are added does matter. A token added before another token will have a higher priority.

The Token Separators can have the following logic attached to them via

  AddTokenSeparator(<token>, <Return>, <switchto>, <ignore>, <fun>, <params>)

Return - Should the token be returned when GrabToken() is called?
  If this argument is set to true (1), the token will be returned. If it's set to false (0), the token will not be returned. This is extremely useful to filter out unneeded tokens. For example,

    data - "foo  bar"  /* Notice the two spaces between foo & bar */

  If the Separator is a space, and Return is set to 1

    GrabToken();  /* Returns "foo" */
    GrabToken();  /* Returns " "   */
    GrabToken();  /* Returns " "   */
    GrabToken();  /* Returns "bar" */
    GrabToken();  /* Returns 0 - End of the data was reached */

  If Return is set to 0

    GrabToken();  /* Returns "foo" */
                  /* Both of the spaces are not returned! */
    GrabToken();  /* Returns "bar" */
    GrabToken();  /* Returns 0 - End of the data was reached */

switchto - Should the active token set be changed when the separator is found?
  Setting this parameter to -1 disables this feature. If it's set to anything else, the token set will automatically be changed to the specified token set, if it exists. See Token Sets.

ignore - Should the token separator be ignored?
  This logic was originally designed to be used when parsing strings. For Example:

    string - "foo\"bar"  /* Start and end quotation marks are part
                          * of the string */

  Token Sep is a quotation mark

    GrabToken();  /* Returns quotation mark */
    GrabToken();  /* Returns "foo\\"        */
    GrabToken();  /* Returns quotation mark */
    GrabToken();  /* Returns "bar"          */
    GrabToken();  /* Returns quotation mark */
    GrabToken();  /* Returns 0 - End of the data was reached */

  If you wanted to preserve the string, you probably didn't want the parser to pick out the quotation mark from \". To fix this, add another token sep (\") with ignore set to 1

    AddTokenSeparator("\"",   1, -1, 0, 0, 0);  /* "  */
    AddTokenSeparator("\\\"", 0, -1, 1, 0, 0);  /* \" */
                                     ^- Ignore

    GrabToken();  /* Returns quotation mark */
    GrabToken();  /* Returns "foo\"bar"     */
    GrabToken();  /* Returns quotation mark */
    GrabToken();  /* Returns 0 - End of the data was reached */

  Ignore is also useful when combined with a function callback. For example, say you wanted to count all the new lines (\n) in a file. You could set ignore to 1, and then set the function pointer to a function that increments a global variable that counts the number of new lines. When GrabToken() is called, it'll call the callback function whenever it runs into a new line. Since ignore is set to 1, it'll continue to do this until it gets to the end of the file. From there, it'll return the entire file, but you'll have the total number of new lines in the file.

  Note: You cannot call GrabToken() or similar from a callback with ignore set to 1 (true).
  /*************************************************************
   * Full Program - Retrieves the number of new lines from a file
   *
   *  Note: You must change <any file> in ParserInit() to the
   *        name of the file you want to get the number of
   *        new lines from.
   *************************************************************/
  #include "Parser.h"
  #include <stdlib.h> /* free()   */
  #include <stdio.h>  /* printf() */

  int NewLineCounter(int *newlines);

  int main(void)
  {
    int NewLines = 1;

    ParserInit(<any file>, 1024);
    AddTokenSeparator("\n", 0, -1, 1, (PARSER_CALLBACK)NewLineCounter, &NewLines);

    free(GrabToken()); /* Scan the whole file, and free
                        * whatever is returned */

    printf("New Lines: %d\n", NewLines);

    ParserDeInit();
    return 0;
  }

  int NewLineCounter(int *newlines)
  {
    (*newlines)++;
    return 0;
  }
  /*************************************************************
   * End of the program
   *************************************************************/

fun - Function to call whenever a token is found.
  The prototype for the function is

    int <function name>(void *params);

  or

    int (*fun)(void *params);  /* function pointer form */

  The return value of the function should be

    1 - Return the token to whatever called GrabToken(). If ignore is set to 1, the return value is ignored.
    0 - Free the token, and continue to search for another token.

  In the above program, if you changed the ignore parameter to 0 (false), the same result would be generated.

params - Parameters to pass to the callback.
  Check New Line Counter for an example of how to use this.

To add a new Token Separator, you can call two different functions

  AddTokenSep();       - Basic version that attaches default behavior
  AddTokenSeparator(); - Advanced version that allows you to define the logic.

To use AddTokenSep(), you only have to pass in a pointer to a string. A 0 will be returned if an error occurred, and a 1 will be returned if the token was added.

Default behavior:
  Return   -  1 - Return the token when GrabToken() is called.
  switchto - -1 - Don't change the token set.
  ignore   -  0 - Don't ignore the token separator.
  fun      -  0 - Don't call a function.
  params   -  0 - No params to pass to the callback

AddTokenSeparator() allows you to specify the logic of the token separator.

Token Sets:
To allow the Cero Parser to be more dynamic, multiple Token Sets can be defined. A "Token Set" is just a set of tokens. Each token set is completely separate from the others. This allows the parser to switch the parsing syntax at runtime.

Once the parser is initialized (by calling ParserInit()), the initial token set will automatically be created, and set as the active token set. The initial token set has an index of 0.

To create a new token set, simply call AddTokenSet(). The function will create a new token set, set the new token set as the active token set, and then return the index of the token set. To properly handle the return value, you should create a descriptive variable to store the index. For Example:

  int tset_comments;
  int tset_strings;

  ParserInit(0, 1024);            /* Set up parser and create token set 0 */
  tset_comments = AddTokenSet();  /* Create tset that handles comments    */
  tset_strings  = AddTokenSet();  /* Create tset that handles strings     */

While this is the proper way to handle token set indexes, feel free to just hard code the value. The initial token set is 0. Each call to AddTokenSet() will increase the index by 1. IE, tset_comments will be 1, and tset_strings will be 2.

All calls to AddTokenSep(), AddTokenSeparator(), GrabToken(), PeekToken(), etc. will use the active token set. To change the active token set, simply call SetTokenSet();

  SetTokenSet(tset_comments);  /* Set the active token set to handle comments */
  SetTokenSet(tset_strings);   /* Set the active token set to handle strings  */
  SetTokenSet(0);              /* Set the active tset to the initial tset     */

SetTokenSet() will return -1 if the token set index you specified is invalid. Otherwise, SetTokenSet() will return the index you passed in.
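To make the index rules above concrete, here is a small standalone model of the bookkeeping. It is a sketch only: the TokenSetTable struct, the three Table*() functions, and the MAX_TOKEN_SETS limit are hypothetical names made up for this example and do not exist in Parser.h. It also assumes that a failed switch leaves the active token set untouched, which the text above does not state.

```c
/* Hypothetical model of the token set indexes, not code from Parser.c. */
enum { MAX_TOKEN_SETS = 8 };  /* arbitrary limit for this sketch */

struct TokenSetTable
{
  int count;   /* how many token sets exist       */
  int active;  /* index of the active token set   */
};

/* Mirrors ParserInit(): token set 0 exists and is active. */
void TableInit(struct TokenSetTable *t)
{
  t->count  = 1;
  t->active = 0;
}

/* Mirrors AddTokenSet(): the new set becomes the active one and its
 * index is returned. -1 is returned on error. */
int TableAddSet(struct TokenSetTable *t)
{
  if (t->count >= MAX_TOKEN_SETS)
    return -1;

  t->active = t->count;
  return t->count++;
}

/* Mirrors SetTokenSet(): -1 for an invalid index, otherwise the index
 * that was passed in. Assumption: a failed call changes nothing. */
int TableSetActive(struct TokenSetTable *t, int index)
{
  if (index < 0 || index >= t->count)
    return -1;

  t->active = index;
  return index;
}
```

The detail that is easy to trip over is in TableAddSet(): creating a token set also makes it the active one, so separators added right afterwards land in the new set unless you switch back first.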
To add further automation to the parser, you can specify which token set the parser will use whenever a token separator is located. To do this, specify the token set the parser should switch to as the switchto parameter. For example, let's write a program that will print only the C style comments and strings from Parser.c.

First, we initialize the Parser. That will create the initial token set (0), which will handle all the switching between token sets and call the proper functions. Once the initial token set is created, create two additional token sets. The first token set will handle all the strings. The second will handle all the comments. Once we have all the token sets, we switch back to the initial token set; otherwise, we would add the token separators to the last created token set.

Now that the initial token set is active, and all the additional token sets have been created, we can start adding the parsing syntax.

Initial Token Set:

  AddTokenSeparator("\"", 0, tset_strings, 0, PrintString, 0);
    "  - Return   - 0             - Don't return it
         switchto - tset_strings  - Switch to tset that handles strings
         ignore   - 0             - Don't ignore the token
         fun      - PrintString   - Print the string to the screen
         params   - 0             - Don't pass any parameters

  AddTokenSeparator("/*", 0, tset_comments, 0, PrintComment, 0);
    /* - Return   - 0             - Don't return it
         switchto - tset_comments - Switch to tset that handles comments
         ignore   - 0             - Don't ignore the token
         fun      - PrintComment  - Print comment to command prompt
         params   - 0             - Don't pass any parameters

With the initial token set up, we need to set up the two additional token sets.
  SetTokenSet(tset_strings);

tset_strings:

  AddTokenSeparator("\\\"", 0, -1, 1, 0, 0);
    \" - Return   - 0  - Don't return it
         switchto - -1 - Don't change the current token set
         ignore   - 1  - Ignore the token, so we don't break up strings
         fun      - 0  - Don't call any function
         params   - 0  - Don't pass any parameters

  AddTokenSeparator("\"", 0, 0, 0, 0, 0);
    "  - Return   - 0  - Don't return it
         switchto - 0  - Switch to initial token set
         ignore   - 0  - Don't ignore it
         fun      - 0  - Don't call a function
         params   - 0  - Don't pass any parameters

  SetTokenSet(tset_comments);

tset_comments:

  AddTokenSeparator("*/", 0, 0, 0, 0, 0);
    */ - Return   - 0  - Don't return it
         switchto - 0  - Switch to initial token set
         ignore   - 0  - Don't ignore it
         fun      - 0  - Don't call a function
         params   - 0  - Don't pass any parameters

With all the token sets set up, we need to switch back to the initial token set before we can start parsing the file.

  SetTokenSet(0);

With everything set up, the actual parsing is fairly automatic. When GrabToken() is called, it'll first scan the file for a separator. If it finds a /*, it'll switch to tset_comments, and then call the function PrintComment(). In PrintComment(), a printf() is called along with another GrabToken(). In PrintComment(), GrabToken() will return everything up to the */. Once */ is found, the token set will automatically be switched back to the initial token set, and the parser will continue to search for another token. The same procedure happens for strings as well.

  /************************************************************************
   * Start of Program. Prints out all comments and strings from Parser.c
   ************************************************************************/
  #include "Parser.h"
  #include <stdio.h>  /* printf()         */
  #include <stdlib.h> /* *alloc(), free() */

  int PrintString (void *);
  int PrintComment(void *);

  int main(void)
  {
    int   tset_strings;
    int   tset_comments;
    char *buffer;

    ParserInit("Parser.c", 1024);

    tset_strings  = AddTokenSet();
    tset_comments = AddTokenSet();

    SetTokenSet(0);
    AddTokenSeparator("\"", 0, tset_strings,  0, PrintString,  0);
    AddTokenSeparator("/*", 0, tset_comments, 0, PrintComment, 0);

    /* Set up the token set that will handle the strings */
    SetTokenSet(tset_strings);
    AddTokenSeparator("\\\"", 0, -1, 1, 0, 0);
    AddTokenSeparator("\"",   0,  0, 0, 0, 0);

    /* Set up the token set that will handle all comments */
    SetTokenSet(tset_comments);
    AddTokenSeparator("*/", 0, 0, 0, 0, 0);

    SetTokenSet(0);

    /* Loop through all the code. GrabToken() will return chunks of code,
     * so we need to free each one until we get a null buffer/end of file */
    for(buffer = GrabToken(); buffer; buffer = GrabToken())
      free(buffer);

    ParserDeInit();
    return 0;
  }

  int PrintString (void *a)
  {
    char *buffer = GrabToken();
    printf("String Found!\n\"%s\"\n\n", buffer);
    free(buffer);
    return 0; /* Don't return the " */
  }

  int PrintComment(void *a)
  {
    char *buffer = GrabToken();
    printf("Comment Found!\n/*%s*/\n\n", buffer);
    free(buffer);
    return 0; /* Don't return the / * */
  }
  /************************************************************************
   * End of Program
   ************************************************************************/

2.3 - Text Mode Functions

GrabToken() and PeekToken() are the main text file functions. They search through the file for a Token Separator and return a pointer to a null terminated array. These functions will work with binary files as well; however, due to the null terminator, using these functions will cause all tokens that start with a 0 byte to be ignored.
GrabInt() and GrabFloat() utilize GrabToken(), and then convert the token into the respective variable type.

NOTE: Unicode is not supported.

Text Mode Compliant Functions:
  GrabToken();
  PeekToken();
  GrabInt();
  GrabFloat();
  Seek();

2.4 - Binary Mode Functions

Seek() is the main driving force behind binary mode. It allows you to scan the file until you get to the position you want. GrabBinaryInt(), GrabBinaryFloat(), GrabChar(), and GrabBytes() allow you to retrieve the data. You can use Grab/PeekToken; however, you must not have any 0s in the file.

Note: The buffer size you specified must be larger than sizeof(int) in order for GrabBinaryInt() or GrabBinaryFloat() to work.

Binary Mode Compliant Functions:
  Seek();
  GrabBinaryInt();
  GrabBinaryFloat();
  GrabBytes();
  GrabChar();

2.5 - Function Callbacks

The parser supports callback functions. The prototype for the function is

  int <function name>(void *params);

The return value of the function should be

  1 - Return the token to whatever called GrabToken(). If ignore is set to 1, the return value is ignored.
  0 - Free the token, and continue to search for another token.

To implement a function callback, call AddTokenSeparator(), and fill in the parameter "fun" with the name of the function. params allows you to pass a pointer to data to the callback function. See section 2.2 for a more detailed explanation.

2.6 - String Manipulation Functions

  RemoveWhiteSpaces() - Removes all white spaces from a string
                        (Space, New Line, Carriage Return, and Tab)
  ToUpper()           - Converts all letters in a string to upper case
  ToLower()           - Converts all letters in a string to lower case
  Dup()               - Creates a copy of the specified string
  DupLen()            - Same as above, but takes the length
  DupRange()          - Creates a copy of a specific part of a string
  DupRangeFile()      - Creates a copy of a specific part of a file
  Cmp()               - Compares two strings. If they are equal, 1 is
                        returned. If not, 0 is returned.
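As a rough idea of what the first helper in the list does, here is a standalone sketch of an in-place white space stripper. StripWhiteSpaces() is a hypothetical name chosen for this example; the signature and behavior of the real RemoveWhiteSpaces() in Parser.h may differ (for instance, it might return a new string instead of editing in place).

```c
/* Hypothetical sketch, not the implementation from Parser.c.
 * Removes every space, new line, carriage return, and tab from `str`,
 * shifting the remaining characters down in place. */
void StripWhiteSpaces(char *str)
{
  char *src = str;  /* next character to look at        */
  char *dst = str;  /* where the next kept character goes */

  if (!str)
    return;

  for (; *src; src++)
  {
    if (*src != ' ' && *src != '\n' && *src != '\r' && *src != '\t')
      *dst++ = *src;
  }

  *dst = '\0';
}
```

Editing in place avoids an allocation, at the cost of needing a writable buffer. Passing a string literal to a function like this would be undefined behavior.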
3.0 - Using the Parser
3.1 - Compiling the Parser

To compile the parser, simply add Parser.h and Parser.c to your project. Add Parser.c to your make file, command line, or whatever. No compile time defines or special switches are required.

Note: If you are working on a C++ project, you might want to rename Parser.c to Parser.cpp.

The parser has been compiled and tested with

  Windows:
    MS Compiler
      cl *.c /W4
    GNU
      gcc *.c -Wall -Wextra -ansi -pedantic
      gcc *.c -Wall -Wextra -ansi -pedantic -mno-cygwin
      g++ *.c -Wall -Wextra -ansi -pedantic
    Borland Compiler
      bcc -w *.c
    Digital Mars
      dmc file.c -A
    Intel Compiler
      icl *.c /W3
    DeSmet C
      c88 <file - one at a time>
    Watcom
      wcc -wx
  Linux:
    GNU
      gcc *.c -Wall -Wextra -ansi -pedantic
      g++ *.c -Wall -Wextra -ansi -pedantic

3.2 - Initialization

Before you can use the Parser, you must first call

  ParserInit(<file to parse>, <buffer size>);

3.3 - Setting up the Token Set

Once the parser has been initialized, you have to set up the token sets. If you are parsing a pure binary file, you do not need to add any token separators.

To add a token separator, call

  AddTokenSep(<token>);
  AddTokenSeparator(<token>, <return>, <switchto>, <ignore>, <fun>, <params>);

To create a new Token Set, call

  AddTokenSet();

To change the current token set, call

  SetTokenSet(<token set>);

3.4 - Parsing the file

Once the token set(s) have been set up, you can begin to parse the file. GrabToken() is the main parser function. When GrabToken() is called, it'll retrieve the next token from the file. If a 0 (null) is returned, the end of the file was reached or an error occurred.

See also: Text Mode Functions (2.3) and Binary Mode Functions (2.4).

3.5 - Deinitialization

Once you are done parsing, you should call

  ParserDeInit();
4.0 - Functions
4.1 - Public Parser Functions

4.1.01 - AddTokenSep()

Prototype:
  int AddTokenSep(char *sp);

Description:
  Adds a new token separator to the current token set. This function is an adapter for AddTokenSeparator() that uses the default settings.

    Return   -  1 - The token separator will be returned
    switchto - -1 - Don't change the token set
    ignore   -  0 - Don't ignore the token
    fun      -  0 - Don't call a callback function
    params   -  0 - Don't pass any parameters

Inputs:
  *sp - Pointer to the string to use as a token separator. C style string - must be null terminated.

Output:
  0 - The specified token is not valid, or another error occurred. If a 0 is returned, call ErrorCode() to get the error status. If a 0 is returned by ErrorCode(), then the string was invalid. Otherwise, the error will be specified.
  1 - Token has been added to the parser.

Notes:
  The string passed in will not be duplicated. Freeing the string you passed in before deinitializing the parser will cause the Parser to fail. AddTokenSep() is currently designed to work with hard coded tokens, although dynamic tokens can be used. You will be responsible for the cleanup though.

  You can change this behavior by modifying a define at the top of Parser.c

    SEP_OWN_TOKEN - 0 - Parser does not own the token. The Parser will
                        not attempt to free the token when cleaning up.
                    1 - Parser owns the token. It will call free on the
                        token during cleanup.
    SEP_DUP_TOKEN - 0 - The token passed in will not be duplicated.
                    1 - The token will be duplicated, and freed on
                        cleanup. This overrides SEP_OWN_TOKEN if set to 1.

  The order in which you add tokens matters. A token passed in before the other tokens will be detected first. IE, if you passed in

    AddTokenSep("23");
    AddTokenSep("2");

  "23" will be checked for before "2" is checked for.
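The effect of the add order can be shown with a short standalone sketch. MatchSeparator() below is a hypothetical function written for this example, not something exported by Parser.h, and it only assumes that separators are tried in the order they were added. How Parser.c actually scans may differ.

```c
#include <string.h>

/* Hypothetical sketch, not code from Parser.c.
 * Tries each separator in the order it was added and returns the index
 * of the first one that matches at the start of `text`, or -1 if none
 * of them match. */
int MatchSeparator(const char *text, const char *const *seps, int count)
{
  int i;

  for (i = 0; i < count; i++)
  {
    size_t len = strlen(seps[i]);

    if (len > 0 && strncmp(text, seps[i], len) == 0)
      return i;  /* earlier separators win */
  }

  return -1;
}
```

With the separators added as "23" then "2", the text "234" matches "23". If they are added the other way around, "2" matches first and "23" can never be found, so longer separators generally want to be added before their shorter prefixes.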
4.1.02 - AddTokenSeparator()

Prototype:
  int AddTokenSeparator(char *sp, char Return, signed char switchto, char ignore, int (*fun)(), void *params);

Description:
  Adds a new token separator to the current token set, using the logic you specify.

Inputs:
  *sp      - Pointer to the string to consider as a separator
  Return   - Should the separator be returned by Grab/PeekToken()? This is useful for filtering out specific strings.
  switchto - Automatically switch to the specified token set when the token is found.
  ignore   - If this Token is found, just keep going. Originally designed to be used with strings. For example, \" should be ignored; however, " will be picked out if we don't ignore \"
  fun      - Callback function to call whenever the token is found.
  params   - Pointer to pass to fun

Output:
  0 - The specified token is not valid, or another error occurred. If a 0 is returned, call ErrorCode() to get the error status. If a 0 is returned by ErrorCode(), then the string was invalid. Otherwise, the error will be specified.
  1 - Token has been added to the parser.

Notes:
  The string passed in will not be duplicated. Freeing the string you passed in before deinitializing the parser will cause the Parser to fail. AddTokenSeparator() is currently designed to work with hard coded tokens, although dynamic tokens can be used. You will be responsible for the cleanup though.

  You can change this behavior by modifying a define at the top of Parser.c

    SEP_OWN_TOKEN - 0 - Parser does not own the token. The Parser will
                        not attempt to free the token when cleaning up.
                    1 - Parser owns the token. It will call free on the
                        token during cleanup.
    SEP_DUP_TOKEN - 0 - The token passed in will not be duplicated.
                    1 - The token will be duplicated, and freed on
                        cleanup. This overrides SEP_OWN_TOKEN if set to 1.

  The order in which you add tokens matters. A token passed in before the other tokens will be detected first. IE, if you passed in

    AddTokenSep("23");
    AddTokenSep("2");

  "23" will be checked for before "2" is checked for.
  If ignore is set to 1 and there is a function callback for a token, you will not be able to call GrabToken() or similar from within the callback.

4.1.03 - AddTokenSet()

Prototype:
  int AddTokenSet(void);

Description:
  Creates a new Token Set, and then sets it as the active one.

Inputs:
  N/A

Output:
  -1 - Error occurred. Call ErrorCode() to find out what went wrong.
  0+ - Index of the new token set. Generally, you should use a variable to store the return result, and then use that variable when SetTokenSet() is called.

Notes:
  N/A

4.1.04 - End()

Prototype:
  int End(void);

Description:
  Returns 1 if the end of the file was reached, an error occurred, or if the parser was not initialized.

Inputs:
  N/A

Output:
  1 - The end of the file was reached, an error occurred, or the parser was not initialized.
  0 - The parser can still retrieve data from the file.

Notes:
  N/A

4.1.05 - ErrorCode()

Prototype:
  int ErrorCode(void);

Description:
  Returns a 0 if no error has occurred. Otherwise, an error has occurred.

Inputs:
  N/A

Output:
  -1 - Not Initialized
   0 - No Error
   1 - Could not open the specified file.
   2 - Could not allocate the required memory
   3 - Reached the end of the file
   4 - Data passed in to Load Memory was null
   5 - GrabToken() or similar was called from a function callback with ignore set to 1. This is not supported.

Notes:
  Check Parser.h for the defines of the above error codes. You can also call PrintErrorCode() to print out a human readable error code to the command prompt.

4.1.06 - GetFilePosition()

Prototype:
  long GetFilePosition(void);

Description:
  Returns the absolute position of the parser in the file.

Inputs:
  N/A

Output:
  Absolute position in the file.

Notes:
  Long is used to add support for 16 bit compilers. In a 16 bit environment, longs are 32 bits. On 32 bit processors, longs are the same size as an int, 32 bits. For 64 bit processors, longs are usually 64 bits. These numbers are compiler and platform dependent though.
    The position returned will not be accurate if GetFilePosition() is called from within a callback for a token that has ignore set to 1.

4.1.07 - GrabBinaryFloat()
  Prototype:
    float GrabBinaryFloat(void);
  Description:
    Grabs the next four bytes in the file, and converts them to a float.
  Inputs:
    N/A
  Output:
    Next four bytes in the file as a float. If there are not four bytes left in the file, a 0.0f will be returned instead.
  Notes:
    The buffer size specified in ParserInit() must be greater than sizeof(float) for this function to work.

4.1.08 - GrabBinaryInt()
  Prototype:
    int GrabBinaryInt(void);
  Description:
    Returns the next sizeof(int) bytes in the file as an int.
  Inputs:
    N/A
  Output:
    Next sizeof(int) bytes in the file as an int. If there are fewer than sizeof(int) bytes left in the file, a 0 will be returned.
  Notes:
    The buffer size specified in ParserInit() must be greater than sizeof(int) for this function to work.

4.1.09 - GrabBytes()
  Prototype:
    char *GrabBytes(int bytes);
  Description:
    Grabs the requested number of bytes from the file, and then returns them.
  Inputs:
    bytes - How many bytes to grab from the file.
  Output:
    0  - Requested number of bytes is invalid or the end of the file was reached.
    1+ - Pointer to the memory that contains the data from the file.
  Notes:
    GrabBytes() will fail if you request more bytes than the buffer size specified when ParserInit() was called.
    You are responsible for cleaning up the data when you are done with it. IE, you must call free().

4.1.10 - GrabChar()
  Prototype:
    char GrabChar(void);
  Description:
    Grabs the next character (byte) in the file, regardless of the token separators.
  Inputs:
    N/A
  Output:
    Next character (byte) from the file.
  Notes:
    This function can be used to read in binary big endian files. Generally, you would call GrabChar(), and then just shift the bits.
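
The "call GrabChar(), and then just shift the bits" approach can be sketched roughly as follows. This is only an illustration: ReadBigEndian32() and DemoGrabChar() are hypothetical names that are not part of the Parser, and DemoGrabChar() simply stands in for GrabChar() so the sketch compiles without Parser.c. Each byte is cast to unsigned char before shifting, since a plain char may be signed and would otherwise sign extend.

```c
#include <stddef.h>

/* Stand-in byte source for this sketch only (assumption). In a real
   program the bytes would come from the Parser's GrabChar(). */
static const unsigned char demo_bytes[] = { 0x00, 0x01, 0xFF, 0x80 };
static size_t demo_pos = 0;

static char DemoGrabChar(void)
{
    if (demo_pos < sizeof(demo_bytes))
        return (char)demo_bytes[demo_pos++];
    return 0;
}

/* Hypothetical helper: builds a 32 bit big endian value from the next
   four bytes handed back by grab(). The most significant byte is
   read first. */
unsigned long ReadBigEndian32(char (*grab)(void))
{
    unsigned long value = 0;
    int i;

    for (i = 0; i < 4; i++) {
        /* Cast first, so a negative char does not sign extend. */
        unsigned char byte = (unsigned char)grab();
        value = (value << 8) | byte;
    }
    return value;
}
```

With the Parser linked in, the call would presumably look like ReadBigEndian32(GrabChar), assuming GrabChar() matches the prototype documented above.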
4.1.11 - GrabFloat()
  Prototype:
    float GrabFloat(void);
  Description:
    Grabs the next token in the file, and then attempts to convert it to a float via the atof() function declared in stdlib.h. All token separators are taken into account. Function callbacks and such will still be called.
  Inputs:
    N/A
  Output:
    Next token converted to a float.
  Notes:
    N/A

4.1.12 - GrabInt()
  Prototype:
    int GrabInt(void);
  Description:
    Grabs the next token in the file, and then attempts to convert it to an int via the atoi() function declared in stdlib.h. All token separators are taken into account. Function callbacks and such will still be called.
  Inputs:
    N/A
  Output:
    Next token converted to an int.
  Notes:
    N/A

4.1.13 - GrabToken()
  Prototype:
    char *GrabToken(void);
  Description:
    The main function of the Cero Parser. This function will scan the file for any of the token separators you specified with AddTokenSeparator(), and apply the logic attached to the separator it finds.
  Inputs:
    N/A
  Output:
    0  - End of the file was reached, or an error occurred.
    1+ - Character pointer to the next token in the file.
  Notes:
    You are responsible for the cleanup. IE, calling free().

4.1.14 - LoadFile()
  Prototype:
    int LoadFile(char *file);
  Description:
    Loads a new file into the parser for processing.
  Inputs:
    *file - C style string that contains the name/path of the file to parse.
  Output:
    1 - File was loaded and the parser was set up.
    0 - Error occurred. Most likely due to an incorrect file name.
  Notes:
    All positional data from the previous file will be lost when LoadFile() is called. You can call GetFilePosition() beforehand to save the state of the parser.
    The token sets will not be affected by this function.
    All files are read in as binary.

4.1.15 - LoadMemory()
  Prototype:
    int LoadMemory(char *memory);
  Description:
    Loads the specified chunk of memory into the parser for parsing.
    Currently, only C style strings are supported by this function.
  Inputs:
    *memory - Pointer to the chunk of memory to load into the parser.
  Output:
    0 - The specified memory is not valid or the parser was not initialized.
    1 - The memory was loaded into the parser.
  Notes:
    The specified chunk of memory will be duplicated before being loaded into the parser. In other words, you are still responsible for the clean up of the memory that *memory points to.
    You can change the above behavior by setting LOADMEM_DUP at the top of Parser.c to 0. If LOADMEM_DUP is set to 0, the Parser will own the memory that you pass in. Setting LOADMEM_DUP to 1 will cause the memory to be duplicated.

4.1.16 - LoadMemoryLen()
  Prototype:
    int LoadMemoryLen(char *memory, int len);
  Description:
    Loads the specified chunk of memory into the parser for parsing. This is the same as LoadMemory(); however, you can load in binary memory.
  Inputs:
    *memory - Pointer to the chunk of memory to load into the parser.
    len     - Size of the memory to load into the Parser.
  Output:
    0 - The specified memory is not valid or the parser was not initialized.
    1 - The memory was loaded into the parser.
  Notes:
    The specified chunk of memory will be duplicated before being loaded into the parser. In other words, you are still responsible for the clean up of the memory that *memory points to.
    You can change the above behavior by setting LOADMEM_DUP at the top of Parser.c to 0. If LOADMEM_DUP is set to 0, the Parser will own the memory that you pass in. Setting LOADMEM_DUP to 1 will cause the memory to be duplicated.

4.1.17 - ParserDeInit()
  Prototype:
    void ParserDeInit(void);
  Description:
    Frees all the memory that the parser was using.
  Inputs:
    N/A
  Output:
    N/A
  Notes:
    N/A

4.1.18 - ParserInit()
  Prototype:
    void ParserInit(char *file, int bufsize);
  Description:
    Allocates and initializes all the memory that the parser needs to function.
    Once everything has been allocated and initialized, the Parser will load in the requested number of bytes from the file.
  Inputs:
    *file   - Name/path of the file to load into the parser. A NULL pointer can be passed in if you do not wish to load in an initial file.
    bufsize - How many bytes to read in from the file at one time. If a 0 is passed in, bufsize will default to 1024 (1 KB). This value cannot be changed once it is specified.
  Output:
    N/A
  Notes:
    If this function is called more than once, the parser will automatically call ParserDeInit() in order to prevent leaking memory.

4.1.19 - PeekToken()
  Prototype:
    char *PeekToken(void);
  Description:
    Same behavior as GrabToken(), except that the parser's position in the file is not updated. Callback functions will still be called.
  Inputs:
    N/A
  Output:
    Pointer to the next token in the file.
  Notes:
    You are responsible for cleaning up the memory when you are done.

4.1.20 - PrintErrorCode()
  Prototype:
    void PrintErrorCode(void);
  Description:
    Prints out the current status of the parser to the command prompt.
    Format:
      <File Name>: <Error Message>
    The file name will be the name of the parser file (Parser.c, by default). The error message will be determined by the Error Code.
  Inputs:
    N/A
  Output:
    N/A
  Notes:
    N/A

4.1.21 - Seek()
  Prototype:
    int Seek(char *search);
  Description:
    Scans the file for the specified token. If the token is found, the position of the parser will be updated to the character directly after the token. If the token is not found, nothing in the parser will change.
  Inputs:
    *search - C style string to search for in the file.
  Output:
    0 - The token was not found. The parser was not updated.
    1 - The token was found. The parser was updated.
  Notes:
    Token sets are not factored in.

4.1.22 - SetFilePosition()
  Prototype:
    int SetFilePosition(long fpos);
  Description:
    Changes the position in the file that the parser scans for the tokens.
  Inputs:
    fpos - Where the parser should start parsing the file.
  Output:
    0 - Error occurred. Call ErrorCode() or PrintErrorCode() for more info.
    1 - Parser's position was updated.
  Notes:
    N/A

4.1.23 - SetTokenSet()
  Prototype:
    int SetTokenSet(int tokenset);
  Description:
    Changes the current token set.
  Inputs:
    tokenset - Index of the token set to change to.
  Output:
    -1 - The parser was not initialized or the requested token set was not valid.
    0+ - Index of the token set switched to.
  Notes:
    N/A

4.1.24 - GetParserState()
  Prototype:
    char *GetParserState(void);
  Description:
    Returns a pointer to the current Parser state. This is useful to preserve the Parser's current state.
  Inputs:
    N/A
  Output:
    Pointer to the current Parser state.
  Notes:
    N/A

4.1.25 - SetParserState()
  Prototype:
    void SetParserState(char *state);
  Description:
    Sets the Parser state to the specified Parser state.
  Inputs:
    *state - State the parser should use.
  Output:
    N/A
  Notes:
    Setting state to 0, followed by calling ParserInit(), will create a new Parser state.

4.1.26 - GenericDiscard()
  Prototype:
    int GenericDiscard(void *unused);
  Description:
    Generic Parser callback designed to discard the next token.
  Inputs:
    *unused - Not used by this callback.
  Output:
    0 - Don't return the token we are discarding.
  Notes:
    N/A
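
The ordering rule noted under AddTokenSeparator() (separators are checked in the order they were added) can be illustrated with a small stand-alone sketch. This is not the Parser's actual scanning code. FirstMatchingSeparator() is a hypothetical helper that only mimics a "first added, first checked" test at a single position in the text.

```c
#include <string.h>

/* Hypothetical helper (not part of the Parser): returns the index of
   the first separator in seps[] that matches at the start of text,
   or -1 if none match. Separators are tried in array order, which
   mimics "first added, first checked". */
int FirstMatchingSeparator(const char *text, const char *const seps[], int count)
{
    int i;

    for (i = 0; i < count; i++) {
        size_t len = strlen(seps[i]);

        /* Skip empty separators, which would match everywhere. */
        if (len > 0 && strncmp(text, seps[i], len) == 0)
            return i;
    }
    return -1;
}
```

With the separators in the order { "23", "2" }, the text "234" matches "23" (index 0). With the order { "2", "23" }, the same text matches "2" first, so the longer "23" is never reached. This is why longer separators that share a prefix with shorter ones would normally be added first.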
4.2 - Private Parser Functions
The following functions are only meant to be called from other functions inside the parser. Making these functions public and calling them externally will have undefined results.
4.2.01 - GrabLeftover()
  Prototype:
    char *GrabLeftover(void);
  Description:
    Returns any data that was left in the parser. This function is called once the end of the file is reached, and no more tokens have been found.
  Inputs:
    N/A
  Output:
    0  - No data is left, or there was an error.
    1+ - Pointer to the token to return.
  Notes:
    N/A

4.2.02 - GrabNextChunk()
  Prototype:
    void GrabNextChunk(void);
  Description:
    This function handles all file input. It will allocate the space for the buffer, if required, and then read in the next chunk of the file.
  Inputs:
    N/A
  Output:
    N/A
  Notes:
    N/A

4.2.03 - ProcessToken()
  Prototype:
    char *ProcessToken(Separator_str *sep, int *spos);
  Description:
    Whenever a token is found, this function is called to handle all logic attached to the token.
  Inputs:
    *sep  - Pointer to the token separator that was found.
    *spos - Temporary position where the parser is located in SP. This is a pointer, in case the token needs to be ignored.
  Output:
    0  - Continue to search for another token separator.
    1+ - Token to return.
  Notes:
    N/A
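
The "allocate the buffer if required, then read in the next chunk" behavior described for GrabNextChunk() can be sketched with the standard C library. This is only an illustration under assumptions, not the Parser's internal code. ReadNextChunk() and its parameters are hypothetical, and the real function keeps its state inside the Parser instead of taking arguments.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not the Parser's own code): reads up to
   bufsize bytes from fp into *buf, allocating *buf on first use.
   Returns the number of bytes read, which is 0 at the end of the
   file or on an error. The caller frees *buf when done. */
size_t ReadNextChunk(FILE *fp, char **buf, size_t bufsize)
{
    if (fp == NULL || buf == NULL || bufsize == 0)
        return 0;

    /* Allocate the buffer only if it does not exist yet. */
    if (*buf == NULL) {
        *buf = malloc(bufsize);
        if (*buf == NULL)
            return 0;
    }

    /* Read the next chunk. A short count means the end is near. */
    return fread(*buf, 1, bufsize, fp);
}
```

The same buffer is reused on every call, so the number of allocations stays constant no matter how large the file is.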
4.3 - String Manipulation Functions
The following functions are either not implemented in string.h or operate on different principles.
4.3.01 - RemoveWhiteSpaces()
  Prototype:
    int RemoveWhiteSpaces(char *sp);
  Description:
    Removes all spaces, new lines, carriage returns, and tabs from the specified string.
  Inputs:
    *sp - String to remove the white spaces from.
  Output:
    -1 - sp was not valid.
    0+ - New length of the string. The pointer will not be reallocated, so the original string pointer should be valid.
  Notes:
    N/A

4.3.02 - ToUpper()
  Prototype:
    void ToUpper(char *sp);
  Description:
    Converts a C style string to upper case.
  Inputs:
    *sp - Pointer to the string to convert to upper case.
  Output:
    N/A
  Notes:
    N/A

4.3.03 - ToLower()
  Prototype:
    void ToLower(char *sp);
  Description:
    Converts a C style string to lower case.
  Inputs:
    *sp - Pointer to the string to convert to lower case.
  Output:
    N/A
  Notes:
    N/A

4.3.04 - Dup()
  Prototype:
    char *Dup(const char *sp);
  Description:
    Creates a copy of the specified string.
  Inputs:
    *sp - String to make a copy of.
  Output:
    Pointer to the new chunk of memory.
  Notes:
    N/A

4.3.05 - DupLen()
  Prototype:
    char *DupLen(const char *sp, int len);
  Description:
    Creates a copy of the specified string. The NULL terminator is automatically attached. IE, you can just call strlen() for the param len.
  Inputs:
    *sp - String to make a copy of.
    len - Length of the string/position of the null terminator.
  Output:
    Pointer to the new chunk of memory.
  Notes:
    N/A

4.3.06 - DupRange()
  Prototype:
    char *DupRange(const char *sp, int start, int end);
  Description:
    Creates a copy of a specific part of a string.
  Inputs:
    *sp   - String to make a partial copy of.
    start - Index in the string to start copying data from.
    end   - Where to stop/last character to copy.
  Output:
    Pointer to the duplicated chunk of the string.
  Notes:
    N/A

4.3.07 - DupRangeFile()
  Prototype:
    char *DupRangeFile(const char *file, int start, int end);
  Description:
    Opens up the specified file, and then reads in the data range to a buffer.
  Inputs:
    *file - Name/path of the file to read.
    start - Where in the file to start reading in the data.
    end   - Where to stop reading in data.
  Output:
    0  - The file name was not valid, or the memory couldn't be allocated.
    1+ - Pointer to the new buffer containing the requested data.
  Notes:
    N/A

4.3.08 - Cmp()
  Prototype:
    char Cmp(const char *osp, const char *osp2);
  Description:
    Compares two strings. Cmp() differs from strcmp() (string.h) in two ways. First, Cmp() returns a 1 if the strings match, and a 0 if not. Second, Cmp() has a non case sensitive and non white space sensitive mode. To disable case/white space sensitivity, set CaseSensitive (towards the top of Parser.c) to 1.
  Inputs:
    *osp  - First string to compare.
    *osp2 - Second string to compare.
  Output:
    0 - Strings don't match.
    1 - Strings match.
  Notes:
    N/A

5.0 - Change Log

Parser v 8.0
  Function callback is now passed a void * - AddTokenSeparator() now takes an additional parameter
  Restricted calling GrabToken() from the callback of a Token with ignore == 1. This would cause an infinite recursion loop.
  Fixed a possible buffer overread
  Token order is now preserved correctly
  Various Optimizations
  Bulk of String Manipulation Functions now use const when possible
  Performance Delta:
    GNU: ~6% Faster
    MS : No performance difference
  Note: A few new warnings have been introduced, and need to be fixed

Parser v 7.1
  Bug fix relating to recursion caused by Parser callback function calling GrabToken().
  Added Get/SetParserState()
  Added GenericDiscard() Parser callback, since it is a fairly common function.

Parser v 7.0
  Moved most of the documentation to this html file
  A few bug/broken logic fixes

Parser v 6.0
  Dropped C++ build
  Added DeSmet C support - strict ANSI C
  Several bug/broken logic fixes
  Reduced requested frees and allocs by ~66%
  Massive Code Cleanup
  Removed a lot of redundant code
  Improved internal error handler
  Added function callbacks
  Cleaned up documentation

Parser v 5.0
  Began testing on Linux
  Several bug/broken logic fixes
  Massive performance boost to internal file handler (~60% faster!)

Parser v 4.0
  Implemented binary support
  Expanded internal File Handler
  Load Files Dynamically
  Load Memory Dynamically
  Improved Internal Error Handler

Parser v 3.0
  Implemented internal File Handler

Parser v 2.0
  Added a C++ build
  Added the bulk of the String Manipulation Functions

Parser v 1.0 - Original build with
  Multiple Token Set Support
  Token Separators with logic:
    Return
    switchto
    ignore