The glrParser Quick Start Guide

Ondrej Macek

This document, like all other parts of The glrParser Project, can be distributed and/or modified under the terms of the GNU General Public License, version 2 or later.

This guide is intended for programmers who are starting to use The glrParser Template Library. It explains what the library is for and then guides you step by step through its usage, starting with the concept and continuing with an analysis of two samples.

Introduction

The goal of The glrParser Project is to create a generally usable programmer's tool for syntactic analysis of highly ambiguous grammars, built on the GLR(0) algorithm. GLR is a well-known algorithm published by Masaru Tomita in 1985. It is based on a generalization of LR analysis. This documentation assumes knowledge of the GLR algorithm and uses the terminology from Tomita's book Efficient Parsing for Natural Language. In particular, the term "vertex" is used for vertices of the "graph structured stack" and the term "node" for nodes of the "packed shared forest".

glrParser defines neither the actions to perform when making reductions nor the method of reading terminals from input. glrParser defines a format for grammar files, but you can easily make the parser use a different format (see glrParser and glrGrammar in the reference guide). The parser is oriented toward creating a packed shared forest, but this is not a requirement.

Concept

The analyser is implemented as a template library written in C++. At its center are the glrParser template and the glrNode class. glrNode is the base class for classes that represent nodes of the generated packed shared forest.

Before you begin, you should know that the parser does not internally use the text representation of the symbols that appear in the grammar. Instead it uses the type glrSymbol, defined in the glrSymbolTable.h header. The symbols object of type glrSymbolTable, declared as protected in the glrParser template, converts values of type glrSymbol to C++ strings and back.

Using glrParser usually consists of the following steps:

1. Define a class derived from the glrNode class. This class represents a node of the generated packed shared forest and should be used as the first argument to the glrParser template. It must have some mandatory constructors, which the parser uses when a reduction leads to the creation of a new node of the packed shared forest. You will almost certainly also override addSubtree, declared as virtual in the glrNode class. It is called when the parser makes a reduction that adds an alternative subtree to an existing node of the forest.

2. Define a class derived from the glrParser<class glrNodeType> template, with the class created in the first step as the template argument.

3. Define a class derived from the glrNode class to represent terminals in the packed shared forest.

4. Override the readToken function, which is called when the parser wants to read the next terminal from input. This requires some mechanism for reading terminals and associating them with the appropriate preterminals that appear in the grammar.

The basic parser

This chapter analyses the simplest parser that can be created with the glrParser library. The presented sample covers just the second and fourth steps of the concept above. It uses the glrNode class directly to represent both terminal and nonterminal nodes. Therefore no packed shared forest is created by the basic parser. The parser can be used merely to detect whether the input is syntactically correct according to the given grammar. The input is syntactically correct if and only if the glrParser::getForestRoot function returns non-NULL after parsing finishes. The source code of the sample looks like this:

    #include [...]
    #include [...]
    #include [...]
    #include [...]
    using namespace std;

The common beginning is followed by the declaration of the basicParser class. The associative container terminals is used to discover which preterminal symbols of the grammar are matched by the terminal on input.
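
The vocabulary lookup described here can be sketched in isolation from the library. The following is a minimal sketch, not the library's code: Symbol and SymbolTable are hypothetical stand-ins for glrSymbol and glrSymbolTable, and the names readVocabulary and lookup are assumptions made for illustration only.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's glrSymbol type.
typedef int Symbol;

// Hypothetical stand-in for glrSymbolTable: gives each distinct name an integer.
class SymbolTable {
    std::map<std::string, Symbol> ids;
public:
    Symbol fromString(const std::string &name) {
        std::map<std::string, Symbol>::iterator it = ids.find(name);
        if (it != ids.end()) return it->second;      // name seen before
        Symbol id = static_cast<Symbol>(ids.size()); // next free number
        ids[name] = id;
        return id;
    }
};

// One terminal may be matched by several preterminals, hence the vector.
typedef std::map<std::string, std::vector<Symbol> > Vocabulary;

// Reads whitespace-delimited "terminal preterminal" pairs.
void readVocabulary(std::istream &in, SymbolTable &table, Vocabulary &vocab) {
    std::string terminal, preterminal;
    while (in >> terminal >> preterminal)
        vocab[terminal].push_back(table.fromString(preterminal));
}

// Returns the preterminals matched by a token; empty if the token is unknown.
std::vector<Symbol> lookup(const Vocabulary &vocab, const std::string &token) {
    Vocabulary::const_iterator it = vocab.find(token);
    if (it == vocab.end()) return std::vector<Symbol>();
    return it->second;
}
```

Feeding the sketch the pairs "the DET saw N saw V" through a std::istringstream gives one symbol for "the" and two for "saw", which is the ambiguous case a GLR parser is designed to carry forward.
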
Note that the parser can handle situations in which more than one preterminal is matched by a single terminal on input.

    class basicParser : public glrParser<[...]>
    {
    private:
      map<string, vector<glrSymbol> > terminals;  // terminal -> matching preterminals
      istream *input;                             // text to parse
    protected:
      virtual void readToken();
    public:
      void readTerminals(istream &input);
      void setInput(istream *inputA);
    };

Function readTerminals reads the set of terminals that the parser will recognize and stores them in the terminals associative container. Each terminal is associated with a vector of the appropriate symbols (preterminals). The symbols field is used to convert preterminals represented by strings to values of type glrSymbol. The input should be organized into pairs, with the terminal first and the appropriate preterminal second. Everything should be delimited by white space.

    void basicParser::readTerminals(istream &input){
      string terminal,preterminal;
      while(input >> terminal >> preterminal){
        terminals[terminal].push_back(symbols.getSymbolFromString(preterminal));
      }
    }

Function setInput just assigns the input to be parsed. Function readToken is called by the parser (during parsing) when the next terminal should be read from input. It creates a new object of type glrNode for each symbol matched by the terminal read from input. It places pointers to the newly created objects into the preterminals container, which is declared as protected in the glrParser template. If the preterminals container is empty after this function returns, the parser assumes that it has reached the end of input.

    void basicParser::setInput(istream *inputA){
      input=inputA;
    }

    void basicParser::readToken(){
      string terminal;
      if(*input >> terminal){
        map<string, vector<glrSymbol> >::iterator preter=terminals.find(terminal);
        if(preter!=terminals.end()){
          // one new node per preterminal matched by this terminal
          for(vector<glrSymbol>::iterator t=preter->second.begin();
              t!=preter->second.end();++t)
            preterminals.push_back(new glrNode(*t));
        }
      }
    }

To keep the basic parser as simple as possible, the main function does not parse command line options. It takes the first argument as the name of the grammar file, the second argument as the file with terminals and the third argument as the file with the text to parse.
The fourth argument is a number that specifies how many times the input should be parsed. It was added for profiling and debugging purposes.

    [...] >> times;
      }
      struct timeval begin,end;
      gettimeofday(&begin,NULL);
      for(int i=0;i[...]
      /*
       * [...] getForestRoot() should not be null if the
       * reduction was made along the rule number 0 upon whole input.
       */
      if(parser.getForestRoot())
        cout << "input is syntactically correct according to this grammar" << endl;
      else
        cout << "no derivation tree found for input" << endl;
      /*
       * Deallocating the grammar.
       */
      parser.releaseGrammar();
    }

More complex parser

The parser in this sample uses the glrParser library to create a packed shared forest, from which it is possible to obtain the number of stored trees as well as to print the trees.

The declarations of the complex parser

    #include [...]
    #include [...]
    #include [...]
    #include [...]
    #include [...]
    #include [...]
    #include [...]
    #include [...]

The number of nodes of the packed shared forest is counted for debugging purposes.

Class stringForest is used to generate, store and print the text representation of the forest. Note that this representation of the forest is neither packed nor shared; its size can therefore grow exponentially with the number of tokens in the parsed sentence. Operator *= replaces the original forest with the Cartesian product of the original forest and the forest given in the forest argument; the forest in the argument is deallocated. Operator += with an argument of type stringForest* appends the forest given in the argument to the end of the original forest; the forest in the argument is deallocated. Operator += with an argument of type string appends the string in suffix to the end of all stored strings (trees).
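
The effect of the three operators can be sketched on a plain container of strings. This is a minimal sketch under stated assumptions, not the stringForest implementation: Forest, product, append and appendSuffix are hypothetical names, the functions do not take ownership of their arguments (unlike the operators described above, which deallocate them), and the order of the strings produced by the product is a choice of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for stringForest: one string per tree.
typedef std::vector<std::string> Forest;

// Counterpart of operator *=: each string of `left` followed by each string of `right`.
Forest product(const Forest &left, const Forest &right) {
    Forest out;
    for (std::size_t i = 0; i < left.size(); ++i)
        for (std::size_t j = 0; j < right.size(); ++j)
            out.push_back(left[i] + right[j]);
    return out;
}

// Counterpart of operator += with a forest argument: alternatives are concatenated.
void append(Forest &target, const Forest &extra) {
    target.insert(target.end(), extra.begin(), extra.end());
}

// Counterpart of operator += with a string argument: every tree gets the suffix.
void appendSuffix(Forest &target, const std::string &suffix) {
    for (std::size_t i = 0; i < target.size(); ++i)
        target[i] += suffix;
}
```

The product of a forest holding two strings with a forest holding three strings holds six strings. Repeating this for each symbol on the right-hand side of a rule is what makes the unpacked text representation grow exponentially.
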

    class stringForest : public [...]
    {
    public:
      stringForest &operator *= (stringForest *forest);
      stringForest &operator += (stringForest *forest);
      stringForest &operator += (const string &suffix);
      void print(ostream &output,int numberOfTokens);
    };

Class commonNode is the base class for the classes that represent terminals and nonterminals in the packed shared forest. It declares the virtual functions that are common to those classes. The constructors and the destructor are used to count the number of nodes in the packed shared forest. The constructors also call the appropriate constructors of the base class (this behaviour is mandatory). Function getNumberOfSubtrees counts the number of subtrees of the node. The behaviour of returning 1 is inherited by the class that represents terminal nodes. Function getForest generates the text representation of the subtrees of the node.

    [...] &succA) : glrNode(ruleA,succA) { ++numNodes; }
      virtual ~commonNode() { --numNodes; }
      virtual unsigned long long getNumberOfSubtrees() { return 1; }
      virtual stringForest *getForest(const glrSymbolTable &symbols) { return (stringForest*)NULL; }
    };

Class nonterminalNode is used to represent the nonterminal nodes of the packed shared forest. It is passed as the first argument to the glrParser template. The succ field stores the subtrees of the node. The numberOfSubtrees field stores the number of subtrees of the node. The passed flag indicates that the node has already been visited by the recursive functions getNumberOfSubtrees and getForest. The constructors are required by the glrParser template, as is calling the appropriate constructor of the base class. addSubtree overrides the function declared in the glrNode class. It is called by the parser when a new subtree should be added to the node. The getForest function returns the text representation of the subtrees of the node.

    [...] > succ;
      unsigned long long numberOfSubtrees;
      bool passed;
    public:
      nonterminalNode(const glrRule* const &ruleA);
      nonterminalNode(const glrRule* const &ruleA,const deque<[...]> &succA);
      virtual void addSubtree(const glrRule* const &ruleA,const deque<[...]> &succA);
      virtual ~nonterminalNode();
      virtual unsigned long long getNumberOfSubtrees();
      virtual stringForest *getForest(const glrSymbolTable &symbols);
    };

Class terminalNode is used to represent terminal nodes of the packed shared forest. The terminal field holds the corresponding terminal. The constructor is called by the complexParser::readToken function. The destructor does nothing at this time. Function getForest returns the text representation of the terminal (and the appropriate preterminal).

Class complexParser is the whole parser. The mechanism for storing the vocabulary of terminals is the same as in the basic parser. The readToken function is almost the same as in the basic parser. (It is called by the parser when a new terminal has to be read.) Functions getNumberOfTrees and printForest just call the appropriate functions of the root node of the generated packed shared forest (if such a node exists).

    class complexParser : public glrParser<[...]>
    {
    private:
      map<string, vector<glrSymbol> > terminals;
      istream *input;
    protected:
      virtual void readToken();
    public:
      void readTerminals(istream &input);
      void setInput(istream *inputA);
      unsigned long long getNumberOfTrees();
      void printForest(ostream &output);
    };

The implementations of the complex parser

Implementation of the operators and the print function of the stringForest class.

    [...] begin();
      i++;
      int len=size();
      while(i!=forest->end()){
        for(int j=0;j[...] begin());
      delete forest;
      return *this;
    }

    stringForest &stringForest::operator += (stringForest *forest){
      insert(end(),forest->begin(),forest->end());
      delete forest;
      return *this;
    }

    stringForest &stringForest::operator += (const string &suffix){
      for(stringForest::iterator i=begin();i!=end();++i)
        (*i)+=suffix;
      return *this;
    }

    void stringForest::print(ostream &output,int numberOfTokens){
      for(stringForest::iterator i=begin();i!=end();++i)
        output << " 0 " << numberOfTokens << " " << *i << endl;
    }

Implementation of the nonterminalNode constructors. Their signatures are mandatory, as is calling the appropriate constructor of the base class. The first constructor is called when a reduction by an epsilon rule is made. The second constructor is called when a reduction by a non-epsilon rule is made; it just adds a new subtree to the node.

    [...] &succA) : commonNode(ruleA,succA) {
      passed=false;
      addSubtree(ruleA,succA);
    }

Destructor of the nonterminalNode class. It calls the release function on all stored nodes of the forest to drive the garbage collector implemented by the glrGuard class (which is the base class of the glrNode class). release decreases the stored number of pointers to the object and possibly calls its destructor. The condition is there because of situations that can confuse the deallocation mechanism. These can arise only with certain kinds of cyclic grammars.

    [...] >::iterator iSucc=succ.begin();iSucc!=succ.end();++iSucc)
        for(deque<[...]>::iterator iNode=iSucc->begin();iNode!=iSucc->end();++iNode)
          if(*iNode!=this)(*iNode)->release();
    }

addSubtree is called by the parser when a new subtree has to be added to the existing (this) node, that is, when the same reduction was made over the same part of the input but in a different way. shackle is called to drive the garbage collector implemented by the glrGuard class (which is the base class of the glrNode class). It increases the stored number of pointers to the object.
The condition is there because of situations that may arise with some cyclic grammars.

    [...] &succA){
      succ.push_back((const deque<[...]>&)succA);
      for(deque<[...]>::iterator iNode=succ.back().begin();iNode!=succ.back().end();++iNode){
        if(*iNode!=this)(*iNode)->shackle();
      }
    }

Implementation of the algorithm for counting the number of subtrees of the node. It uses the passed flag to indicate that the node has already been visited and numberOfSubtrees to store the computed number of subtrees. There are two reasons for this. First, the time needed to count the subtrees stays polynomial even though the number of subtrees itself can grow exponentially. Second, the algorithm does not loop forever when there is a cycle in the graph representing the forest. Note that this can happen only when parsing a badly cyclic grammar.

    [...] >::iterator iSub=succ.begin();iSub!=succ.end();++iSub){
          unsigned long long one=1;
          // trees of one alternative = product over its children
          for(deque<[...]>::iterator iNode=iSub->begin();iNode!=iSub->end();++iNode){
            one*=(*iNode)->getNumberOfSubtrees();
          }
          // alternatives add up
          numberOfSubtrees+=one;
        }
        passed=false;
      }
      return numberOfSubtrees;
    }

Implementation of the getForest function, which generates the text representation of the subtrees of the node. The function is almost the same as getNumberOfSubtrees, thanks to the operators defined on the stringForest class. The difference is that there is no mechanism for storing the generated text representation of subtrees. Nothing is lost by this, because polynomial time is unattainable anyway: the size of the generated structure is exponential. Therefore the function should be used only for forests with a small number of trees.
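
The counting scheme that getNumberOfSubtrees uses, and that getForest mirrors, can be sketched independently of the library. The sketch below is an illustration under stated assumptions rather than the library's code: Node and countTrees are hypothetical names, the cached result and the in-progress marker are kept in two separate fields, and a back edge of a cycle is assumed to contribute zero trees.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical forest node; not the library's glrNode.
struct Node {
    // Each alternative is one derivation of this node: a sequence of children.
    // A node without alternatives is treated as a terminal (leaf).
    std::vector<std::vector<Node*> > alternatives;
    unsigned long long cached;  // valid only when `counted` is true
    bool counted;               // result already computed
    bool inProgress;            // node is on the current recursion path
    Node() : cached(0), counted(false), inProgress(false) {}
};

unsigned long long countTrees(Node *node) {
    if (node->counted) return node->cached;    // shared node: reuse the result
    if (node->inProgress) return 0;            // cycle: assumed to add no trees
    if (node->alternatives.empty()) return 1;  // leaf: exactly one tree
    node->inProgress = true;
    unsigned long long total = 0;
    for (std::size_t i = 0; i < node->alternatives.size(); ++i) {
        unsigned long long one = 1;
        const std::vector<Node*> &children = node->alternatives[i];
        for (std::size_t j = 0; j < children.size(); ++j)
            one *= countTrees(children[j]);    // children multiply
        total += one;                          // alternatives add up
    }
    node->inProgress = false;
    node->cached = total;
    node->counted = true;
    return total;
}
```

Because every shared node is evaluated once, the work is proportional to the size of the packed forest even when the count itself is exponential. In a cyclic forest the cached values of nodes inside the cycle depend on where the traversal entered it, so the result there should be read as a bound on acyclic derivations only.
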

    [...] push_back(string(" { ")+symbols.getStringFromSymbol(getSymbol())+" } ");
        return all;
      }else{
        passed=true;
        stringForest *all,*given;
        all=new stringForest;
        bool allSuccess=false;
        for(deque<deque<[...]> >::const_iterator iSucc=succ.begin();iSucc!=succ.end();++iSucc){
          bool success=false;
          stringForest *one=new stringForest;
          one->push_back(string(" { ")+symbols.getStringFromSymbol(getSymbol())+" ");
          for(deque<[...]>::const_iterator iNode=iSucc->begin();iNode!=iSucc->end();++iNode){
            given=(*iNode)->getForest(symbols);
            if(given!=NULL){
              (*one)*=given;   // combine with the trees of this child
              success=true;
            }
          }
          if(success){
            (*all)+=one;       // add this alternative
            allSuccess=true;
          }else delete one;
        }
        if(allSuccess){
          (*all)+="} ";
        }else{
          delete all;
          all=NULL;
        }
        passed=false;
        return all;
      }
    }

Constructor of the terminalNode class. Calling the appropriate constructor of the base class is mandatory. Function to return the text representation of the terminal and the preterminal.

    [...] push_back(string(" { ")+symbols.getStringFromSymbol(getSymbol())+"\\n"+terminal+" } ");
      return ret;
    }

The readToken, readTerminals and setInput functions are almost the same as in the basic parser.

    void complexParser::readTerminals(istream &input){
      string terminal,preterminal;
      while(input >> terminal >> preterminal){
        terminals[terminal].push_back(symbols.getSymbolFromString(preterminal));
      }
    }

    void complexParser::readToken(){
      string terminal;
      if(*input >> terminal){
        map<string, vector<glrSymbol> >::iterator preter=terminals.find(terminal);
        if(preter!=terminals.end())
          for(vector<glrSymbol>::iterator t=preter->second.begin();t!=preter->second.end();++t)
            preterminals.push_back(new terminalNode(*t,preter->first));
      }
    }

    void complexParser::setInput(istream *inputA){
      input=inputA;
    }

Implementation of the getNumberOfTrees and the printForest functions.

    [...] getNumberOfSubtrees();
    }

    void complexParser::printForest(ostream &output){
      if(commonNode *root=getForestRoot()){
        stringForest *forest=root->getForest(symbols);
        // getForest may return NULL when no text representation was produced
        if(forest!=NULL){
          forest->print(output,getNumOfTokens());
          delete forest;
        }
      }
    }

The command line help.

The main function consists mainly of parsing the command line arguments, creating the parser (a complexParser object) and calling its functions. There is also an option for repeated parsing of the input for profiling purposes.

    [...] >> repeat;
            break;
          }
          default:
            printUsage(cerr);
            exit(1);
        }
      }
      if((input=="")&&(tableOut=="")){
        printUsage(cerr);
        exit(2);
      }
      complexParser parser;
      istream *file;
      // grammar
      file=new ifstream(grammar.c_str());
      if(file->good()) parser.initGrammar(*file);
      else{
        cerr << "can't read grammar from file \'" << grammar << "\'" << endl;
        delete file;
        exit(3);
      }
      delete file;
      // glr table: read it if given, otherwise compute it
      if(table!=""){
        file=new ifstream(table.c_str());
        if(file->good()) parser.readTable(*file);
        else{
          cerr << "can't read glr table from file '" << table << "' => defaulting to glr table computation" << endl;
          parser.computeTable();
        }
        delete file;
      }else{
        parser.computeTable();
      }
      // terminals
      file=new ifstream(terminals.c_str());
      if(file->good()) parser.readTerminals(*file);
      else{
        cerr << "can't read terminals from file '" << terminals << "'" << endl;
        delete file;
        exit(4);
      }
      delete file;
      if(tableOut!=""){
        ofstream file(tableOut.c_str());
        if(file.good()) parser.printTable(file);
        else{
          cerr << "can't write glr table to file '" << tableOut << "'" << endl;
        }
      }
      if(printStatus){
        parser.printStatus(cerr);
        cerr << endl;
      }
      if(input!=""){
        struct timeval begin,end;
        gettimeofday(&begin,NULL);
        for(int i=0;(i[...] good()){
            parser.setInput(file);
            parser.parse();
          }else{
            cerr << "can't read input from file '" << input << "'" << endl;
            delete file;
            exit(5);
          }
          delete file;
        }
        gettimeofday(&end,NULL);
        if(printForest) parser.printForest(cout);
        else cout << "number of trees in resultant forest is " << parser.getNumberOfTrees() << endl;
        cerr << "number of nodes in packed shared forest is " << numNodes << endl;
        if(repeat){
          cerr << "input parsed " << (int)repeat << " times in " << (end.tv_sec-begin.tv_sec)-(end.tv_usec [...]

The source code of this example is located in the samples directory.