The .html parser is a generic parsing capability built into iPAM.
It is used by several agents including getAltavista,
getExciteByKeword, getExciteNewstracker, getOnePage, and
getRootPlusReferences (see the list of agents). The parser is contained in the Java package org.mitre.pam.getter.search.parse.
This guide is divided into two sections, Tokens and Public Methods. The Tokens section lists all of the tokens generated by the parser, and the Public Methods section lists all of the available public methods. If a method takes an integer token kind as an argument, the value must be one of the tokens defined in the Tokens section, unless otherwise noted.
Tokens
The parser recognizes all of the following tokens. Which tokens can be matched depends on the current state of the parser. Each state name below (DEFAULT, TAG, ATTLIST, ATTRVAL) appears on its own line, followed by the tokens matched in that state.
DEFAULT
- These three tokens are matched when the parser is in its initial, top-level state.
Token      Value
STAGO      "<"  - when matched, go to state TAG
ETAGO      "</" - when matched, go to state TAG
ANY        matches any character except the above two tokens; the state is retained
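The DEFAULT-state matching described above can be illustrated with a small sketch. This is not the iPAM parser itself: the class name DefaultStateSketch and the method scan are hypothetical, and the sketch assumes that ANY matches one character at a time. The TAG, ATTLIST, and ATTRVAL states are replaced by a simple jump past the next ">", so a ">" inside a quoted attribute value is not handled here.

```java
import java.util.ArrayList;
import java.util.List;

class DefaultStateSketch {
    /** Returns the names of the DEFAULT-state tokens found in input, in order. */
    static List<String> scan(String input) {
        List<String> kinds = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("</", i)) {          // ETAGO is tried first: it is the longer match
                kinds.add("ETAGO");
                i = skipTag(input, i + 2);
            } else if (input.charAt(i) == '<') {      // STAGO
                kinds.add("STAGO");
                i = skipTag(input, i + 1);
            } else {                                  // ANY: one character, state retained
                kinds.add("ANY");
                i++;
            }
        }
        return kinds;
    }

    // Stand-in for the TAG, ATTLIST and ATTRVAL states: jump past the next ">".
    private static int skipTag(String input, int from) {
        int close = input.indexOf('>', from);
        return close < 0 ? input.length() : close + 1;
    }

    public static void main(String[] args) {
        System.out.println(scan("<b>hi</b>"));  // prints [STAGO, ANY, ANY, ETAGO]
    }
}
```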
TAG
- The type of HTML tag. A TAG token immediately follows STAGO or ETAGO. When any of these tokens is matched, the state changes to ATTLIST.
A          "a"           ADDRESS    "address"     APPLET     "applet"      AREA       "area"
B          "b"           BASE       "base"        BASEFONT   "basefont"    BIG        "big"
BLOCKQUOTE "blockquote"  BODY       "body"        BR         "br"          CAPTION    "caption"
CENTER     "center"      CITE       "cite"        CODE       "code"        DD         "dd"
DFN        "dfn"         DIR        "dir"         DIV        "div"         DL         "dl"
DT         "dt"          EM         "em"          FONT       "font"        FORM       "form"
H1         "h1"          H2         "h2"          H3         "h3"          H4         "h4"
H5         "h5"          H6         "h6"          HEAD       "head"        HR         "hr"
HTML       "html"        I          "i"           IMG        "img"         INPUT      "input"
ISINDEX    "isindex"     KBD        "kbd"         LI         "li"          LINK       "link"
MAP        "map"         MENU       "menu"        META       "meta"        NOBR       "nobr"
OL         "ol"          OPTION     "option"      P          "p"           PARAM      "param"
PRE        "pre"         PROMPT     "prompt"      SAMP       "samp"        SCRIPT     "script"
SELECT     "select"      SMALL      "small"       STRIKE     "strike"      STRONG     "strong"
STYLE      "style"       SUB        "sub"         SUP        "sup"         TABLE      "table"
TD         "td"          TEXTAREA   "textarea"    TH         "th"          TITLE      "title"
TR         "tr"          TT         "tt"          U          "u"           UL         "ul"
VAR        "var"
UNKNOWN    matches any word not matched above. This is for unknown tag types.
ATTLIST
- These tokens are matched after one of the above TAG tokens is matched. They handle the rest of the HTML tag.
TAGC        ">" - when matched, go to state DEFAULT
A_EQ        "=" - when matched, go to state ATTRVAL
A_NAME      #ALPHA ( #ALPHANUM )* - matches a word
WHITESPACE  matches anything not already matched by the three tokens above
The following are used by the parser to define the token A_NAME:
#ALPHA      ["a"-"z","A"-"Z","_","-","."]
#NUM        ["0"-"9"]
#ALPHANUM   #ALPHA | #NUM
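As a rough illustration of what A_NAME accepts, the definition above can be transcribed into a java.util.regex pattern. The class AttrNameSketch and the method isAttrName are hypothetical names used only for this sketch; the parser does not necessarily use java.util.regex internally.

```java
import java.util.regex.Pattern;

class AttrNameSketch {
    // #ALPHA ( #ALPHANUM )*, where #ALPHA is a letter, "_", "-" or "."
    // and #ALPHANUM additionally allows the digits 0-9.
    static final Pattern A_NAME = Pattern.compile("[A-Za-z_.-][A-Za-z0-9_.-]*");

    /** True if the whole string s has the shape of an A_NAME token. */
    static boolean isAttrName(String s) {
        return A_NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAttrName("http-equiv"));  // prints true
        System.out.println(isAttrName("2col"));        // prints false: a digit cannot come first
    }
}
```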
ATTRVAL
- This state contains only one token and it is used to match the word that follows the "=" inside an HTML tag. This token is the value of the attribute found before the "=".
CDATA      matches a word and changes to state ATTLIST. The regular expression for the word is as follows:
             "'" ( ~["'"] )* "'"
           | "\"" ( ~["\""] )* "\""
           | ( ~[">", "\"", "'", " ", "\t", "\n", "\r"] )+
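To make the three alternatives of the CDATA expression concrete, here is a transcription into java.util.regex. It is only a sketch: CdataSketch and leadingValue are hypothetical names, and the matched text includes the surrounding quotes because the expression above includes them. Whether the parser strips the quotes afterwards is not stated in this guide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CdataSketch {
    // single-quoted value | double-quoted value | unquoted run with no ">", quotes or whitespace
    static final Pattern CDATA =
            Pattern.compile("'[^']*'|\"[^\"]*\"|[^>\"' \\t\\n\\r]+");

    /** Returns the attribute value found at the very start of s, or null if none matches. */
    static String leadingValue(String s) {
        Matcher m = CDATA.matcher(s);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingValue("index.html>"));       // prints index.html
        System.out.println(leadingValue("\"a b\" class=x>"));  // prints "a b" (quotes included)
    }
}
```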
Public Methods
boolean SkipToToken(int tokenKind)
Moves the pointer to the first token found with kind = tokenKind. The tokenKind argument may be any of the constants defined above. If the token is not found, the pointer is at the end of the input stream upon return.

boolean SkipToAfterToken(int tokenKind)
Same as SkipToToken except the token specified is consumed.

boolean SkipToOpenTag(int tag)
Looks for the HTML tag specified by tag and moves the current token pointer to the first token in the HTML tag ("<"). The tag must be one of the tokens defined in the TAG state above (A through UNKNOWN). If the tag is not found, the pointer is located at the end of the input stream upon return. An open HTML tag is something like <b>. SkipToOpenTag(html.B) looks for "<b".

boolean SkipToAfterOpenTag(int tag)
Same as SkipToOpenTag above, except that the whole HTML tag is consumed.

boolean SkipToCloseTag(int tag)
Looks for a closing HTML tag and moves the token pointer to the "</" token if found. If not found, the token pointer is at the end of the HTML page. SkipToCloseTag(html.B) looks for "</b".

boolean SkipToAfterCloseTag(int tag)
Same as SkipToCloseTag except the entire HTML tag is consumed.

String GetUntilToken(int tokenKind)
Returns a String containing everything (including HTML tags) before the first occurrence of a token with kind = tokenKind. The pointer points to that token when finished. If the token is not found, null is returned and the pointer is at the end of the input stream.

Hashtable ProcessParameters()
If the pointer is in an HTML tag, returns a Hashtable containing the tag's parameters and their values. An empty Hashtable is returned if no parameters are found or the pointer is not in an HTML tag.

boolean SkipToEndOfTag()
If the pointer is inside a tag, the pointer is moved to the token immediately following the ">" of the tag.

String GetUntilNextTag()
If the pointer is not in a tag, returns all of the text from the current pointer up to the next "<" as a String. Null is returned if the pointer is not outside a tag.

boolean SkipToString(String str)
This method searches the text outside of HTML tags for the String str. If the string is found, the pointer is moved to the first token in the string. If the string is not found, the pointer is moved to the end of the input stream and false is returned.

boolean SkipToAfterString(String str)
Same as SkipToString except the string is consumed.

boolean SkipToTag(String tag)
This method looks for a full HTML tag specified by a String and moves the pointer to the first token ("<") of the tag if found. If not found, the pointer is moved to the end of the input stream. For example, SkipToTag("<b>") will look for the next <b> tag. It returns true if found, false otherwise.

boolean SkipToAfterTag(String tag)
Same as SkipToTag except the tag is consumed if found.

String GetUntilString(String str)
Looks for the specified string outside of HTML tags and returns all the text as a String, starting from the first character after the last HTML tag found.

String GetUntilTag(String tag)
Returns all of the text, including HTML tags, from the current pointer until the HTML tag specified as a String is found. It returns null if the tag is not found.

String GetUntilOpenTagOfType(int tagKind)
Returns all the text, including HTML tags, until an open HTML tag (<tag) of the specified type is found. The pointer is moved to the "<" token upon return. If the tag is not found, it returns null and the pointer is at the end of the input stream. The int tagKind must be one of the TAG constants defined above.

String GetUntilClosedTagOfType(int tagKind)
Similar to GetUntilOpenTagOfType(int tagKind) except it looks for a closed tag (</tag). The pointer is moved to "</" if successful. Again, tagKind must be one of the TAG constants defined above.
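
The difference between the SkipTo, SkipToAfter, and GetUntil families is where the pointer ends up. The following sketch models that behavior on a pre-built token list. It is a stand-alone illustration, not the iPAM parser: the class TokenCursorSketch, its lowercase method names, and the integer token kinds are hypothetical, and it assumes that the skip methods return true when the token is found.

```java
class TokenCursorSketch {
    // Hypothetical token kinds for this sketch only; the real constants are defined by the parser.
    static final int STAGO = 1, TAG_B = 2, TAGC = 3, ANY = 4, ETAGO = 5;

    private final int[] kinds;      // kind of each token, in input order
    private final String[] images;  // source text of each token
    private int pos = 0;            // the "pointer": index of the current token

    TokenCursorSketch(int[] kinds, String[] images) {
        this.kinds = kinds;
        this.images = images;
    }

    // Modeled on SkipToToken: stop ON the first token of the given kind.
    boolean skipToToken(int kind) {
        while (pos < kinds.length && kinds[pos] != kind) pos++;
        return pos < kinds.length;
    }

    // Modeled on SkipToAfterToken: as above, but the matched token is consumed.
    boolean skipToAfterToken(int kind) {
        boolean found = skipToToken(kind);
        if (found) pos++;
        return found;
    }

    // Modeled on GetUntilToken: collect the text before the token and stop on it;
    // null if the token never appears, with the pointer left at the end of the input.
    String getUntilToken(int kind) {
        StringBuilder text = new StringBuilder();
        while (pos < kinds.length && kinds[pos] != kind) text.append(images[pos++]);
        return pos < kinds.length ? text.toString() : null;
    }

    int position() { return pos; }

    public static void main(String[] args) {
        // Tokens for the input "<b>hi</b>"
        int[] kinds = {STAGO, TAG_B, TAGC, ANY, ANY, ETAGO, TAG_B, TAGC};
        String[] images = {"<", "b", ">", "h", "i", "</", "b", ">"};
        TokenCursorSketch cursor = new TokenCursorSketch(kinds, images);
        cursor.skipToAfterToken(TAGC);                    // pointer is now just past "<b>"
        System.out.println(cursor.getUntilToken(ETAGO));  // prints hi
    }
}
```

In this model, calling skipToToken(TAGC) instead of skipToAfterToken(TAGC) would leave the pointer on ">", so the text collected afterwards would begin with ">".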
Revised 12/1/98