hashgen.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public Licencse as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later option.
The hashgen utility generates the source of a lookup function
for a given static keyset (reserved words of a programming
language, assembler mnemonics, XML-based language elements,
etc.)
Every keyset's entry contains a key (either string or binary)
and arbitrary user-defined atributes associated with that key.
The keys must be unique.
The generated function will find any keyword in constant time using exactly one comparison. A typical lookup may take as little as 100ns on an average desktop machine.
The source code can be generated in C++, ANSI C, or K&R C.
The hashgen utility does basically the same thing and uses the same
basic algorithm as the gperf utility by Douglas C. Schmidt
(see Acknowledgements).
While gperf is a great piece of software, I found it a bit
difficult to use. The major points addressed by hashgen are these:
gperf one has to select the character key positions
by hand, which may be a daunting task for large keysets with
long, highly redundant keys. Just selecting all positions,
may yield an unnecessary complex hash function (and will not necessarily
resolve collisions).
Hashgen will automatically select a minimal set of positions
that is enough to distinguish the keys of a given keyset.
gperf, the hash table of a keyset with two
or more keys that consist of the same characters and differ only in their
order will always have collisions (so gperf cannot generate
a perfect hash function for, say, Bourne shell keywords with its
if/fi and case/esac pairs.)
Hashgen will always generate a perfect hash function
for such keysets (but see Bugs).
With hashgen you don't have to care about the algorithmic
details (although there are options that allow you to finetune the
results).
Gperf's input file format is not very intuitive (especially if you are not familiar with lex/yacc) and is not self-contained.
There are two (interdependent) sources of input information--the
input file and the command-line options.
With hashgen, all options can be specified in the input
file. Its format is comprehensible and one can easily tailor one
of the samples packaged with hashgen even without reading
any documentation (Example Input). Any option still can be
specified on the command-line.
results).
hashgen is generation of the
lookup test source that both tests the generated lookup function
and measures the average lookup time.
Hashgen allways retains the original order of the keys.
So in addition to hashed lookups you can also traverse the keys in a
well defined order, or do binary searches to find the closest
key if the key are sorted. For C++ this functionality can be
easily added in a class derived from the generated class.
Synopsis:
hashgen [options]... [input-file]
options are expected in either standard short or GNU-style long format.
input-file contains the keylist, options, and C declarations. If input-file is not specified, the standard input is read. Only one input file can be handled at a time.
The generated code is written to the standard output or to the files
specified with -o, -c, and -h options.
An option can be specified in either the [options] section
of the input file or as a command-line option. The latter may
be specified in either short or GNU-style long form.
The long form of a command-line option has the same name as
in the [options] section of the input file with two
dashes prepended.
Example. A fragment of an input file like this
[options] size=3 header=keylookup.h ...
is equivalent to the following command line fragment:
./hashgen --size=3 --header=keylookup.h ...
or, in the short form,
./hashgen -N3 -h keylookup.h ...
A Boolean option foo can be turned on and off like this
foo=yes foo=no
or like this
foo=1 foo=0
or just
foo no-foo
All of the above forms can be used on the command-line with
two preceding dashes, e.g. --no-foo.
For a short form of a Boolean option, say i,
-i and -i+ turn i on,
-i- turns it off.
Short Boolean options can be combined; the last option in a cluster
of options can have non-Boolean argument, e.g.
-zi-l C++, will turn z on,
i off, and l will have the argument C++.
The blank between a short non-Boolean command-line option and its
argument is not required.
Options specified in the input file override the options specified
on the command-line. That is, if in the [options] section
of the input file you specified output=foo, but on the
command-line --output=bar, the output will still go to
foo, no warnings will be issued.
This section list all program's options. All Boolean options are
given in their non-default form. E.g., zero-term is on by
default, so it's listed here as ---no-zero-term or -z-.
--output=FILE
-o FILE
output option is specified, the generated
lookup code goes to the specified file.
In order to separate the header and the implementation, use
header and source options instead.
--header=HEADER
--source=SOURCE
-h HEADER
-c SOURCE
-h option,
and the name of the implementation file with -c option.
If you specify, say, only -h, the implementation source
will go to the file specified with -o if it's given, or
to the standard output if it's not.
--language=LANG
-l LANG
--namespace-name=NAME
-n NAME
--lookup-func-name=NAME
-f NAME
struct entry* lookup(const char* key, size_t len);
--entry-struct-name=NAME
-e NAME
--key-name=NAME
-k NAME
--lookup-struct-name=NAME
-s NAME
Even if the selected language is C, all the lookup tables are bundled
into a struct with this name.
--lookup-data-name=NAME
-d NAME
--key-table-name=NAME
--weight-table-name=NAME
--hash-table-name=NAME
--length-table-name=NAME
--collision-table-name=NAME
--test=FILE
-T FILE
If the header option was specified, the generated test source
will include the lookup header (and you will have to link the test program
with the lookup implementation). Otherwise, it will contain a copy
of the lookup source, and the test executable may be compiled
from the test source alone.
--inline-lookup-func
-I
The inline prefix will be #ifdefed, it will expand to
inline for C++, __inline__ for GNU, or,
if INLINE macro is defined (define it in the [declarations]
section), to INLINE.
--static-lookup-data
-t
--no-include-string-h
-i-
strcmp()
(or memcmp()), and NULL (for C.)
If the option is disabled, it becomes you responsibility.
--size=SIZE
-N SIZE
SIZE must be between 0 and 9, the default value is 0.
Affects the size of the generated hash table, the larger is the
value, the larger will be the hash table (this is true
only for large keysets).
Larger hash tables will decrease time necessary to decide that a
key does not belong to the set (there will be more empty slots in the
hash table; when we hit an empty slot there is no need to compare
keys). At the same time, very large hash tables may increase time of
successful searches a negligible amount because of worse locality
(more CPU cache misses.)
--min-expected-char=MINCHARCODE
--max-expected-char=MAXCHARCODE
-m MINCHARCODE
-M MAXCHARCODE
M??CHARCODE is decimal, hexadecimal (0xnn),
or octal (0nnn) character code.
The default values are 0x01 and 0xff respectively.
More likely than not the keys you are going to look up are not random byte arrays but rather the result of some kind of parsing. E.g., a C keyword is going to consist of nothing but lower case letters.
Narrowing the range of expected characters will result in smaller lookup tables. But be warned that if the character range is violated at runtime, the results may be disastrous.
If min-expected-char is 0, the zero-term option
is turned off (a zero terminated string cannot contain zeroes).
--no-zero-term
-z-
hashgen assumes that the input keys are standard
zero terminated C strings and the generated code will use
strcmp() for key comparison.
If the option is disabled, memcmp() will be used instead,
the keys at runtime don't have to be zero terminated and may
contain NUL characters.
The option is automattically disabled when a NUL character is
encountered in one of the input keys, or the min-expected-char
option is set to 0.
--no-length-switch
-S-
If you know that the target compiler cannot do that, disable the option,
in this case if/goto pairs will be generated instead.
--optimize-hits
-H
hashgen will try to minimize the time
of both successful searches and failures. When the option
is enabled, successful searches may become a couple of cycles
faster at the cost of failures, which will become a bit
slower.
This option affects only the generated lookup code, not the algorithm (the generated hash table will be the same.)
You may want to enable this option if you know that the lookup function
is going to be called much more often for valid keys (in this case it
also makes sense to try to generate a smaller hash table).
--random
-r
hashgen execution time will decrease.
--randimize
-R
--random option.
--debug
--D
--version
--help
The input file format is similar to that of MS Windows (R) .ini files.
An input file is divided into sections. Every section starts with a section
header line (e.g. [options]) and
terminates at the beginning of the next section or end of file.
There are three kinds of sections: [options],
[declarations], and [entries].
The order of sections does not matter.
Empty lines, and lines starting with "#" are ignored (except for
the [definitions] section, which contains C/C++ declarations, and
is inserted into the generated source verbatim.)
All sections except [entries] are optional. The [entries]
section header may be omitted if this section is first in the file.
Thus, a minimal valid input file contains just a list of keywords,
each starting on a new line.
[options] Section
[declarations] Section
[entries] Section
[options] SectionThe [options] section contains all the program's options.
This section is optional, and if it's omitted, all the options
will get their default values, or values specified on the command
line.
Each option starts on a new line, and has name=value format.
Empty lines and lines starting with the hash sign ("#") are
ignored.
For detailed description of the program's options see the Invoking Chapter (Invoking.)
[declarations] SectionThe [declaration] section contains arbitrary C/C++ declarations
that are copied verbatim into the generated code.
Within the [declarations] section, lines starting with "#"
and empty lines are not ignored. If you need to comment
something, use C/C++ comments instead. No line may start with the
"[" character.
One thing that must be present in the [definitions]
section is declaration of the structure of the key table entry.
It must be a struct with the same name as specified by the
entry-struct-name option (the default value is entry).
The first field of the struct must be name (or the name
specified with the key-name option), and be of type
char* (const char*) or char[].
Instead of defining the entry structure in the input file, you can do it
in some other file and just #include it in the [declarations]
section.
If the [declarations] section is empty, the following entry
declaration will be generated:
struct entry {
char* name;
};
(the names entry and name may be different, as specified
by entry-struct-name and key-name options.)
If the [declarations] section is not empty, the entry
structure will not be defined automatically.
Example:
[options]
entry-struct-name = TokenEntry
[declarations]
struct TokenEntry {
const char* name;
int token;
int flags;
};
[entries]
"for", 1, 0
"while", 2, 0
"do", 3, 0
Also see the Example Chapter (Example).
[entries] SectionEach line of the [entries] section contains a key and
attributes associated with that key separated with commas. The
first field is the key itself, the attributes are optional. Empty
lines and lines starting with "#" are ignored.
The syntax of each line must be suitable for C structure initializer
(without the enclosing braces). It will be used to initialize an
entry in the key table, the structure of the entry is defined in
the [declarations] section (see [declarations]).
The [entries] section is assumed by default when the parsing begins,
so a minimal valid input file may contain just a list of keywords,
each starting on a new line.
Example:
[declarations]
struct entry {
char* name;
int id;
};
[entries]
"foo", 1
"bar", 2
"baz", 3
Within the key field C backslash escapes are allowed. If the key
does not contain double quotes, commas, spaces, backslash escapes,
and does not start with "[", the surrounding double quotes may
be omitted:
[entries] foo, 1 bar, 2 baz, 3
The attributes (after the first comma and up to the following newline character) are not parsed at all, just included in the generated initializer as is. It's your responsibility to ensure their syntax is correct.
For instance, for entries like these
key1, 1 /* one */, 2, something key2 key3, 3 ...
the following initializers will be generated
{ "key1", 1 /* one */, 2, something },
{ "key2" },
{ "key3", 3 },
...
The following sections show an example of the input file and lookup sources generated from it.
# Hashgen input file example:
# Lookup class for C++ reserved words.
# The [entries] section: lists keys and associated data.
# The [entries] section header is implied by default.
# Double quotes around the key values are omitted.
asm, TOKEN_ASM
auto, TOKEN_AUTO
break, TOKEN_BREAK
case, TOKEN_CASE
catch, TOKEN_CATCH
char, TOKEN_CHAR
class, TOKEN_CLASS
const, TOKEN_CONST
continue, TOKEN_CONTINUE
default, TOKEN_DEFAULT
delete, TOKEN_DELETE
do, TOKEN_DO
double, TOKEN_DOUBLE
else, TOKEN_ELSE
enum, TOKEN_ENUM
extern, TOKEN_EXTERN
float, TOKEN_FLOAT
for, TOKEN_FOR
friend, TOKEN_FRIEND
goto, TOKEN_GOTO
if, TOKEN_IF
inline, TOKEN_INLINE
int, TOKEN_INT
long, TOKEN_LONG
new, TOKEN_NEW
operator, TOKEN_OPERATOR
overload, TOKEN_OVERLOAD
private, TOKEN_PRIVATE
protected, TOKEN_PROTECTED
public, TOKEN_PUBLIC
register, TOKEN_REGISTER
return, TOKEN_RETURN
short, TOKEN_SHORT
signed, TOKEN_SIGNED
sizeof, TOKEN_SIZEOF
static, TOKEN_STATIC
struct, TOKEN_STRUCT
switch, TOKEN_SWITCH
template, TOKEN_TEMPLATE
this, TOKEN_THIS
typedef, TOKEN_TYPEDEF
union, TOKEN_UNION
unsigned, TOKEN_UNSIGNED
virtual, TOKEN_VIRTUAL
void, TOKEN_VOID
volatile, TOKEN_VOLATILE
while, TOKEN_WHILE
[options]
# All the options set explicitly for demonstration purposes
# (for most the default values would be okay.)
language = C++
header = cxx_keywords.h
source = cxx_keywords.cxx
test = cxx_keywords_test.cxx
namespace-name = Scanner
entry-struct-name = Token
key-name = text
lookup-struct-name = TokenHash
lookup-data-name = tokenHash
lookup-func-name = lookup
key-table-name = tokens
weight-table-name = charWeights
length-table-name = keyLengths
hash-table-name = hashTable
collision-table-name = hashCollisions
optimize-hits = yes
length-switch = yes
#inline-lookup-func = yes
static-lookup-data = no
min-expected-char = 0x30
max-expected-char = 0x7a
# The [declarations] sections: defines the structure
# of a key table entry (`Token' in this example).
# Also may contain any additional declarations,
# everything is copied verbatim into the generated header.
[declarations]
struct Token {
const char* text;
int token;
};
enum {
TOKEN_ASM,
TOKEN_AUTO,
TOKEN_BREAK,
TOKEN_CASE,
TOKEN_CATCH,
TOKEN_CHAR,
TOKEN_CLASS,
TOKEN_CONST,
TOKEN_CONTINUE,
TOKEN_DEFAULT,
TOKEN_DELETE,
TOKEN_DO,
TOKEN_DOUBLE,
TOKEN_ELSE,
TOKEN_ENUM,
TOKEN_EXTERN,
TOKEN_FLOAT,
TOKEN_FOR,
TOKEN_FRIEND,
TOKEN_GOTO,
TOKEN_IF,
TOKEN_INLINE,
TOKEN_INT,
TOKEN_LONG,
TOKEN_NEW,
TOKEN_OPERATOR,
TOKEN_OVERLOAD,
TOKEN_PRIVATE,
TOKEN_PROTECTED,
TOKEN_PUBLIC,
TOKEN_REGISTER,
TOKEN_RETURN,
TOKEN_SHORT,
TOKEN_SIGNED,
TOKEN_SIZEOF,
TOKEN_STATIC,
TOKEN_STRUCT,
TOKEN_SWITCH,
TOKEN_TEMPLATE,
TOKEN_THIS,
TOKEN_TYPEDEF,
TOKEN_UNION,
TOKEN_UNSIGNED,
TOKEN_VIRTUAL,
TOKEN_VOID,
TOKEN_VOLATILE,
TOKEN_WHILE
};
// Generated by hashgen 1.0b (Dec 8 2002)
// language: C++
// 47 keys, hash table size 68 (145%), 0 collision(s)
// expected char range: 0x30-0x7a
// Mon Dec 9 14:50:21 2002
#ifndef TokenHash_INCLUDED
#define TokenHash_INCLUDED
#include <cstring>
namespace Scanner {
struct Token {
const char* text;
int token;
};
enum {
TOKEN_ASM,
TOKEN_AUTO,
TOKEN_BREAK,
TOKEN_CASE,
TOKEN_CATCH,
TOKEN_CHAR,
TOKEN_CLASS,
TOKEN_CONST,
TOKEN_CONTINUE,
TOKEN_DEFAULT,
TOKEN_DELETE,
TOKEN_DO,
TOKEN_DOUBLE,
TOKEN_ELSE,
TOKEN_ENUM,
TOKEN_EXTERN,
TOKEN_FLOAT,
TOKEN_FOR,
TOKEN_FRIEND,
TOKEN_GOTO,
TOKEN_IF,
TOKEN_INLINE,
TOKEN_INT,
TOKEN_LONG,
TOKEN_NEW,
TOKEN_OPERATOR,
TOKEN_OVERLOAD,
TOKEN_PRIVATE,
TOKEN_PROTECTED,
TOKEN_PUBLIC,
TOKEN_REGISTER,
TOKEN_RETURN,
TOKEN_SHORT,
TOKEN_SIGNED,
TOKEN_SIZEOF,
TOKEN_STATIC,
TOKEN_STRUCT,
TOKEN_SWITCH,
TOKEN_TEMPLATE,
TOKEN_THIS,
TOKEN_TYPEDEF,
TOKEN_UNION,
TOKEN_UNSIGNED,
TOKEN_VIRTUAL,
TOKEN_VOID,
TOKEN_VOLATILE,
TOKEN_WHILE
};
class TokenHash {
public:
static size_t size() {
return 47;
}
static const Token* at(size_t index) {
return tokens + index;
}
static const Token* lookup(const char* key) {
return lookup(key, std::strlen(key));
}
static Token* lookup(const char* key, size_t len);
protected:
static unsigned char hashTable[68];
static unsigned char charWeights[75];
static Token tokens[47];
};
extern TokenHash tokenHash;
}
#endif
#include "cxx_keywords.h"
namespace Scanner {
TokenHash tokenHash;
unsigned char TokenHash::hashTable[68] = {
// Hash table. Each element either points to a key table entry,
// or is null (47) if there is no key with this hash value.
/* 0 */ 47, 47, 47, 0, 47, 47, 35, 40, 1, 47,
/* 10 */ 36, 3, 7, 13, 34, 8, 31, 38, 19, 4,
/* 20 */ 15, 6, 30, 25, 28, 11, 17, 23, 29, 10,
/* 30 */ 9, 39, 37, 20, 24, 18, 14, 16, 32, 2,
/* 40 */ 41, 45, 22, 47, 5, 33, 47, 26, 43, 47,
/* 50 */ 47, 42, 47, 21, 47, 27, 47, 12, 47, 47,
/* 60 */ 44, 47, 47, 47, 47, 47, 47, 46
};
unsigned char TokenHash::charWeights[75] = {
// Weight table (one character hash function).
/* 40 */ 255, 255,
/* 50 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
/* 60 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
/* 70 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
/* 80 */ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
/* 90 */ 255, 255, 255, 255, 255, 255, 255, 0, 28, 7,
/* 100 */ 23, 0, 23, 14, 27, 8, 255, 255, 9, 1,
/* 110 */ 31, 0, 9, 255, 6, 0, 0, 4, 33, 26,
/* 120 */ 14, 0, 255
};
Token TokenHash::tokens[47] = {
// Key table. Holds keys and associated data (order preserved.)
/* 0 */ { "asm", TOKEN_ASM },
/* 1 */ { "auto", TOKEN_AUTO },
/* 2 */ { "break", TOKEN_BREAK },
/* 3 */ { "case", TOKEN_CASE },
/* 4 */ { "catch", TOKEN_CATCH },
/* 5 */ { "char", TOKEN_CHAR },
/* 6 */ { "class", TOKEN_CLASS },
/* 7 */ { "const", TOKEN_CONST },
/* 8 */ { "continue", TOKEN_CONTINUE },
/* 9 */ { "default", TOKEN_DEFAULT },
/* 10 */ { "delete", TOKEN_DELETE },
/* 11 */ { "do", TOKEN_DO },
/* 12 */ { "double", TOKEN_DOUBLE },
/* 13 */ { "else", TOKEN_ELSE },
/* 14 */ { "enum", TOKEN_ENUM },
/* 15 */ { "extern", TOKEN_EXTERN },
/* 16 */ { "float", TOKEN_FLOAT },
/* 17 */ { "for", TOKEN_FOR },
/* 18 */ { "friend", TOKEN_FRIEND },
/* 19 */ { "goto", TOKEN_GOTO },
/* 20 */ { "if", TOKEN_IF },
/* 21 */ { "inline", TOKEN_INLINE },
/* 22 */ { "int", TOKEN_INT },
/* 23 */ { "long", TOKEN_LONG },
/* 24 */ { "new", TOKEN_NEW },
/* 25 */ { "operator", TOKEN_OPERATOR },
/* 26 */ { "overload", TOKEN_OVERLOAD },
/* 27 */ { "private", TOKEN_PRIVATE },
/* 28 */ { "protected", TOKEN_PROTECTED },
/* 29 */ { "public", TOKEN_PUBLIC },
/* 30 */ { "register", TOKEN_REGISTER },
/* 31 */ { "return", TOKEN_RETURN },
/* 32 */ { "short", TOKEN_SHORT },
/* 33 */ { "signed", TOKEN_SIGNED },
/* 34 */ { "sizeof", TOKEN_SIZEOF },
/* 35 */ { "static", TOKEN_STATIC },
/* 36 */ { "struct", TOKEN_STRUCT },
/* 37 */ { "switch", TOKEN_SWITCH },
/* 38 */ { "template", TOKEN_TEMPLATE },
/* 39 */ { "this", TOKEN_THIS },
/* 40 */ { "typedef", TOKEN_TYPEDEF },
/* 41 */ { "union", TOKEN_UNION },
/* 42 */ { "unsigned", TOKEN_UNSIGNED },
/* 43 */ { "virtual", TOKEN_VIRTUAL },
/* 44 */ { "void", TOKEN_VOID },
/* 45 */ { "volatile", TOKEN_VOLATILE },
/* 46 */ { "while", TOKEN_WHILE }
};
// Key characters must be in 0x30-0x7a range.
Token* TokenHash::lookup(const char* key, size_t len) {
using namespace std;
register struct Token* p;
register unsigned h = len;
switch (h) {
default: return 0;
case 9:
case 8:
case 7:
case 6:
case 5:
case 4:
h += tokenHash.charWeights[(unsigned char)key[3]-48];
case 3:
case 2:
h += tokenHash.charWeights[(unsigned char)key[1]-48];
case 1:
h += tokenHash.charWeights[(unsigned char)key[0]-48];
case 0: ;
}
if (h > 67)
return 0;
h = tokenHash.hashTable[h];
if (h == 47)
return 0;
p = &tokenHash.tokens[h];
if (*key == *p->text && !strcmp(key, p->text))
return p;
return 0;
}
// Please notice that in expression `a[x-c]', where `a' is a static
// array and `c' is a constant, `a-c' is a constant sub-expression,
// so the offset should not incur additional runtime overhead.
}
// Test for hashed lookup code generated by hashgen 1.0b.
// Measures the average time of both successful and unsuccessful lookups.
#include "cxx_keywords.h"
#include <ctime>
#include <iostream>
#include <iomanip>
using namespace std;
using namespace Scanner;
#ifndef CLK_TCK
#define CLK_TCK CLOCKS_PER_SEC
#endif
int main() {
clock_t tick;
double time;
unsigned count;
unsigned i;
// Hit test. If/goto overhead is negligible, so it's okay to test for
// both correctness and speed at the same time. There is no sense in
// checking the key pointed to by the lookup function's return value
// after a successful search, since the only place where a non-null
// pointer is returned is right after successful strcmp() or memcmp().
count = 0;
tick = clock();
do {
for (i = 0; i < 21276; i++) {
if (TokenHash::lookup("asm", 3) == 0) goto err_miss;
if (TokenHash::lookup("auto", 4) == 0) goto err_miss;
if (TokenHash::lookup("break", 5) == 0) goto err_miss;
if (TokenHash::lookup("case", 4) == 0) goto err_miss;
if (TokenHash::lookup("catch", 5) == 0) goto err_miss;
if (TokenHash::lookup("char", 4) == 0) goto err_miss;
if (TokenHash::lookup("class", 5) == 0) goto err_miss;
if (TokenHash::lookup("const", 5) == 0) goto err_miss;
if (TokenHash::lookup("continue", 8) == 0) goto err_miss;
if (TokenHash::lookup("default", 7) == 0) goto err_miss;
if (TokenHash::lookup("delete", 6) == 0) goto err_miss;
if (TokenHash::lookup("do", 2) == 0) goto err_miss;
if (TokenHash::lookup("double", 6) == 0) goto err_miss;
if (TokenHash::lookup("else", 4) == 0) goto err_miss;
if (TokenHash::lookup("enum", 4) == 0) goto err_miss;
if (TokenHash::lookup("extern", 6) == 0) goto err_miss;
if (TokenHash::lookup("float", 5) == 0) goto err_miss;
if (TokenHash::lookup("for", 3) == 0) goto err_miss;
if (TokenHash::lookup("friend", 6) == 0) goto err_miss;
if (TokenHash::lookup("goto", 4) == 0) goto err_miss;
if (TokenHash::lookup("if", 2) == 0) goto err_miss;
if (TokenHash::lookup("inline", 6) == 0) goto err_miss;
if (TokenHash::lookup("int", 3) == 0) goto err_miss;
if (TokenHash::lookup("long", 4) == 0) goto err_miss;
if (TokenHash::lookup("new", 3) == 0) goto err_miss;
if (TokenHash::lookup("operator", 8) == 0) goto err_miss;
if (TokenHash::lookup("overload", 8) == 0) goto err_miss;
if (TokenHash::lookup("private", 7) == 0) goto err_miss;
if (TokenHash::lookup("protected", 9) == 0) goto err_miss;
if (TokenHash::lookup("public", 6) == 0) goto err_miss;
if (TokenHash::lookup("register", 8) == 0) goto err_miss;
if (TokenHash::lookup("return", 6) == 0) goto err_miss;
if (TokenHash::lookup("short", 5) == 0) goto err_miss;
if (TokenHash::lookup("signed", 6) == 0) goto err_miss;
if (TokenHash::lookup("sizeof", 6) == 0) goto err_miss;
if (TokenHash::lookup("static", 6) == 0) goto err_miss;
if (TokenHash::lookup("struct", 6) == 0) goto err_miss;
if (TokenHash::lookup("switch", 6) == 0) goto err_miss;
if (TokenHash::lookup("template", 8) == 0) goto err_miss;
if (TokenHash::lookup("this", 4) == 0) goto err_miss;
if (TokenHash::lookup("typedef", 7) == 0) goto err_miss;
if (TokenHash::lookup("union", 5) == 0) goto err_miss;
if (TokenHash::lookup("unsigned", 8) == 0) goto err_miss;
if (TokenHash::lookup("virtual", 7) == 0) goto err_miss;
if (TokenHash::lookup("void", 4) == 0) goto err_miss;
if (TokenHash::lookup("volatile", 8) == 0) goto err_miss;
if (TokenHash::lookup("while", 5) == 0) goto err_miss;
}
count++;
time = (clock() - tick) / (double)CLK_TCK;
} while ((time < 0.5) && (count < 1000000));
cout
<< "successful search time "
<< setw(6)
<< int(time * 1e9 / 21276 / 47 / count)
<< "ns (" << setw(0)
<< int(999972 * count / time)
<< " searches/sec)\n";
// Miss test. Characters of the test keys are random within range
// from min-expected-char to max-expected-char, their length from 0
// to a bit more than maximum length of the existing keys.
// The results should not be taken too seriously, the keys are too
// random, so strcmp() or memcmp() is not called most of the time
// because the calculated hash values are out of range.
count = 0;
tick = clock();
do {
for (i = 0; i < 21276; i++) {
if (TokenHash::lookup("<T", 2)) goto err_hit;
if (TokenHash::lookup("hPHv]", 5)) goto err_hit;
if (TokenHash::lookup("UlkW", 4)) goto err_hit;
if (TokenHash::lookup("", 0)) goto err_hit;
if (TokenHash::lookup("h16QrP@", 7)) goto err_hit;
if (TokenHash::lookup("=Wk", 3)) goto err_hit;
if (TokenHash::lookup("`JN4?", 5)) goto err_hit;
if (TokenHash::lookup("0", 1)) goto err_hit;
if (TokenHash::lookup("d1[GsilcF5", 10)) goto err_hit;
if (TokenHash::lookup("3jx>", 4)) goto err_hit;
if (TokenHash::lookup("77EDG5i", 7)) goto err_hit;
if (TokenHash::lookup("O<1GP", 5)) goto err_hit;
if (TokenHash::lookup("w9P@9H", 6)) goto err_hit;
if (TokenHash::lookup("v0Edr", 5)) goto err_hit;
if (TokenHash::lookup("", 0)) goto err_hit;
if (TokenHash::lookup("", 0)) goto err_hit;
if (TokenHash::lookup("??wFTtGZb", 9)) goto err_hit;
if (TokenHash::lookup("bWycwd_jmp", 10)) goto err_hit;
if (TokenHash::lookup(":", 1)) goto err_hit;
if (TokenHash::lookup("@:mteniZg", 9)) goto err_hit;
if (TokenHash::lookup("?fRm2e=", 7)) goto err_hit;
if (TokenHash::lookup("6", 1)) goto err_hit;
if (TokenHash::lookup("Ei", 2)) goto err_hit;
if (TokenHash::lookup("b6mT_8G<H", 9)) goto err_hit;
if (TokenHash::lookup("yB", 2)) goto err_hit;
if (TokenHash::lookup("Vd", 2)) goto err_hit;
if (TokenHash::lookup("BbGxnm0Xdz", 10)) goto err_hit;
if (TokenHash::lookup("7x5n_pJl", 8)) goto err_hit;
if (TokenHash::lookup("R8JjBHe7", 8)) goto err_hit;
if (TokenHash::lookup("O?", 2)) goto err_hit;
if (TokenHash::lookup("6?gt", 4)) goto err_hit;
if (TokenHash::lookup("", 0)) goto err_hit;
if (TokenHash::lookup(";O8R", 4)) goto err_hit;
if (TokenHash::lookup("6X2e6L@ZW", 9)) goto err_hit;
if (TokenHash::lookup("tGDve", 5)) goto err_hit;
if (TokenHash::lookup("R9CutRJ", 7)) goto err_hit;
if (TokenHash::lookup("=SaE[9TaJ@", 10)) goto err_hit;
if (TokenHash::lookup("", 0)) goto err_hit;
if (TokenHash::lookup("EEcmFFmCu", 9)) goto err_hit;
if (TokenHash::lookup("", 0)) goto err_hit;
if (TokenHash::lookup("5", 1)) goto err_hit;
if (TokenHash::lookup("Z", 1)) goto err_hit;
if (TokenHash::lookup("CfI", 3)) goto err_hit;
if (TokenHash::lookup("\\lP", 3)) goto err_hit;
if (TokenHash::lookup("MYKg\\Dl1Z6", 10)) goto err_hit;
if (TokenHash::lookup("L6crI]N", 7)) goto err_hit;
if (TokenHash::lookup("LQtKd", 5)) goto err_hit;
}
count++;
time = (clock() - tick) / (double)CLK_TCK;
} while ((time < 0.5) && (count < 1000000));
cout
<< "search failure time "
<< setw(6)
<< int(time * 1e9 / 21276 / 47 / count)
<< "ns (" << setw(0)
<< int(999972 * count / time)
<< " searches/sec)\n";
return 0;
err_hit:
cerr << "cxx_keywords.cxx: non-existing key found\n";
exit(1);
err_miss:
cerr << "cxx_keywords.cxx: key not found\n";
exit(1);
}
Bellow is a frament of the assembly code generated by GCC 2.95.3
(i386, FreeBSD 4.4) for the following line of the test source:
if (TokenHash::lookup("catch", 5) == 0) goto err_miss;
The lookup function is inlined and called with constant arguments.
GCC did not generate the code to check length of the key at all since
it's known at compile time. .LC2 contains "catch", it's
long enough to include all 3 hash positions, for shorter strings
the compliler generated access only to included positions (lines
marked with "*").
movzbl .LC2+3,%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%edx * movzbl .LC2+1,%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax * leal 5(%eax,%edx),%edx movzbl .LC2,%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax * addl %eax,%edx cmpl $67,%edx ja .L540 movzbl _Q27Scanner9TokenHash$hashTable(%edx),%edx cmpl $47,%edx je .L540 leal 0(,%edx,8),%eax leal _Q27Scanner9TokenHash$tokens(%eax),%ebx movl _Q27Scanner9TokenHash$tokens(%eax),%edx movb (%edx),%al cmpb %al,.LC2 jne .L577 addl $-8,%esp pushl %edx pushl $.LC2 call strcmp movl %eax,%edx addl $16,%esp movl %ebx,%eax testl %edx,%edx je .L567 .L577: xorl %eax,%eax .L567: testl %eax,%eax je .L540
If the lookup function is not inline and the invocation context is
not known, a jump table is generated for the switch on the key's length
(jmp *.L21(,%edx,4)):
lookup__Q27Scanner9TokenHashPCcUi: pushl %ebp movl %esp,%ebp subl $20,%esp pushl %ebx movl 8(%ebp),%ecx movl 12(%ebp),%edx cmpl $9,%edx ja .L24 jmp *.L21(,%edx,4) .p2align 2,0x90 .section .rodata .p2align 2 .p2align 2 .L21: .long .L9 .long .L19 .long .L18 .long .L18 .long .L16 .long .L16 .long .L16 .long .L16 .long .L16 .long .L16 .text .p2align 2,0x90 .L16: movzbl 3(%ecx),%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax addl %eax,%edx .L18: movzbl 1(%ecx),%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax addl %eax,%edx .L19: movzbl (%ecx),%eax movzbl _Q27Scanner9TokenHash$charWeights-48(%eax),%eax addl %eax,%edx .L9: cmpl $67,%edx ja .L24 movzbl _Q27Scanner9TokenHash$hashTable(%edx),%edx cmpl $47,%edx je .L24 leal 0(,%edx,8),%eax leal _Q27Scanner9TokenHash$tokens(%eax),%ebx movl _Q27Scanner9TokenHash$tokens(%eax),%edx movb (%edx),%al cmpb %al,(%ecx) jne .L24 addl $-8,%esp pushl %edx pushl %ecx call strcmp testl %eax,%eax jne .L24 movl %ebx,%eax jmp .L26 .p2align 2,0x90 .L24: xorl %eax,%eax .L26: movl -24(%ebp),%ebx leave ret
hashgenThe keys may be of any reasonable length, but only the first 32 characters are used for hashing, they must be unique, otherwise the generated hash function will not be perfect.
hashgen consumes lots of RAM when the keyset is large (~20Mb
for 3,000 keys, for larger keysets it switches to a slower and less
memory-hungry mode (~4Mb for 10,000 keys)).
This estimation is very inaccurate, the actual figures depend on the
average length of a key and its assotiated data, character code range,
etc.
There are many misspellings, misplaced definite articles, and other mistakes and inaccuracies in this manual. I would be grateful to anyone willing to help me to clean this mess up.
The hashgen utility was written by Vladimir Shiryaev.
The program does basically the same thing and uses the same
basic algorithm (with a few improvements) as the gperf
utility by Douglas C. Schmidt.
The differencies in functionality are briefly outlined in
the Overview Chapter (Overview).
As Doug says in the gperf manual, the general idea for the
perfect hash function generator was inspired by Keith Bostic's
algorithm, created at the University of California, Irvine.