User Tools

Site Tools


semantic_tutorial

This is an old revision of the document!


A PCRE internal error occured. This might be caused by a faulty plugin

===== Tutorial on semantic analysis ==== ==== Step 1: Preparation === * In this tutorial we use the [[http://deva.web.elte.hu/compilers/lab/while-language.html|While language]]. It is possible to use your own lexer and parser. If you have not completed it, use [[http://deva.web.elte.hu/compilers/lab/parser.zip|this implementation]]. * If you are working not with your own parser, please take the time to understand the grammar. * Look into the ''Parser.ih'' file. * The function called ''lex'' asks the current line number from the parser and saves it into a field of the ''d_loc<nowiki>__</nowiki>'' attribute of the ''Parser'' class. Then it asks for the next token from the lexer and returns it to the parser. * The line number information is used in the ''error'' function to print location information for error messages. * Make sure you can compile the parser and it accepts all the correct [[http://deva.web.elte.hu/compilers/lab/while-tests.zip|test files]] and rejects the lexical error ones. Try out the semantic error test cases! They are accepted. The goal of the tutorial is to find those errors as well. ==== Step 2: Symbol table === The symbol table will be implemented by a simple C++ map. The keys will be variable names, mapping to the variable type and the line number of the declaration. * Create a new header file, called ''semantics.h''. * Include the standard headers ''iostream'', ''string'' and ''map''. * Create an enumeration type to represent the two types of the //While// language <code> enum type { integer, boolean }; </code> * Create a class called ''var_data'' to represent the data, that variable names will be mapped to. Its two attributes: * ''decl_row'': to store the line number of the declaration. * ''var_type'': the type of the variable. * Create two constructors for the ''var_data'' class: * With two parameters, initializing both attributes. * With no parameters and empty body. This will be needed if we want to put objects of this class into a map. * In ''Parser.h'', add the symbol table as a private attribute to the ''Parser'' class: <code> std::map<std::string,var_data> symbol_table; </code> * In the ''while.y'' file, in the ''%baseclass-preinclude'' option, change ''<iostream>'' to ''"semantics.h"'', in order to make the just created header file part of the project. * Compile the project and fix the compilation errors, if any. ==== Step 3: Passing variable names from the lexer to the parser ==== In order to be able to insert variable names and the corresponding data into the symbol table, the parser needs the names of the variables. We will ask this information from the lexer and bind it to each terminal symbol representing identifiers. In //bisonc++//, every terminal and non-terminal can have an associated //semantic value//. (Remember //semantic values// or //attributes// from the lectures!) Since every symbol can have a semantic value of different type, we need to create a //union// showing the different possibilities. Our union will first have only one option, as we want to associate the names of variables to the //identifier// terminals. //bisonc++// is having its own syntax for creating this union, and it will be turned into a real C++ union when running the //bisonc++// command. * In ''while.y'', before the first ''%%'' line, add the following: <code> %union { std::string *name; } </code> * The names of options in this union can be used to define the type of semantic value assigned to a given grammar symbol. Extend the declaration of the //identifier// non-terminal as follows (''TOKEN_ID'' is the terminal you use for variable names in your project): <code> %token <name> TOKEN_ID </code> * Now, //identifiers// can have semantic values of type ''string''. This value can be set in the ''lex'' function. Extend the ''lex'' function with the following code (before the ''return'' instruction): <code> if( ret == TOKEN_ID ) { d_val__.name = new std::string(lexer.YYText()); } </code> Some explanation: The ''YYText()'' function of the lexer returns the name of the current token. We create a C++ string from it. The ''d_val<nowiki>__</nowiki>'' attribute of the parser can be used to assign semantic values to the current token. Its type is the union that we have defined in the ''while.y'' file. We load the name of the identifier into the ''name'' option of this union. Now we are able to access the variable names in the C++ code fragments following the grammar rules. If you have a rule ''a: A b C'', then the semantic values of ''A'', ''b'' and ''C'' are ''$1'', ''$2'' and ''$3'' respectively. * Modify the grammar rules for declarations such that they write the name of the declared variable to the standard output. (Note that the actual rule might look a bit different from the following example.) <code> declaration: TOKEN_INTEGER TOKEN_ID TOKEN_SEMICOLON { std::cout << "declared variable: " << *$2 << std::endl; } </code> * Run your compiler a correct test file and check if the declared variable names appear in the output. ==== Step 4: Finding re-declared and undeclared variables ==== Instead of writing the names of declared variables to the standard output, now we insert them in the symbol table. The C++ ''map'' datatype has the ''[]'' operator, which can be used to insert or retrieve elements. * Change the variable declaration rules such that the variables and their data are inserted into the symbol table: <code> TOKEN_INTEGER TOKEN_ID TOKEN_SEMICOLON { symbol_table[*$2] = var_data( d_loc__.first_line, integer ); } </code> * Before the insertion operation shown above, check if the variable has already been declared. (The ''count'' function of the ''map'' type returns the number of the given element in the map; ''0'' or ''1'' in our case.) <code> if( symbol_table.count(*$2) > 0 ) { std::stringstream ss; ss << "Re-declared variable: " << *$2 << ".\n" << "Line of previous declaration: " << symbol_table[*$2].decl_row << std::endl; error( ss.str().c_str() ); } </code> Explanation: We use here the ''stringstream'' type to collect the parts of the error message. In order to make this work, you have to include the ''<sstream>'' standard header in ''semantics.h''. We can ask for the string, collected in the ''stringstream''-typed ''ss'', using the ''str()'' member function. Since the ''error'' function (see in ''Parser.ih'') accepts a C-style character array instead of a C++ string, we have to use ''c_str()'' to convert. * Implement the C++ code fragment of the rule for logical variable declarations in a similar manner. Do not forget to use the ''boolean'' type when inserting those variables into the symbol_table. * Find or create a test file with re-declared variables and check if your compiler can really find the error. * Find all other uses of the //identifier// terminal in the grammar (assignment, expression), and implement a check for undeclared variables in the corresponding C++ code fragments. Test your solution with test files containing undeclared variables in assignments and expressions. ==== Step 5: Type checking ==== In order to do type checking, we need semantic values for expressions. These will be of type ''type'' (the enumeration declared in ''semantics.h''). * Add a new option to the //union// in ''while.y'': <code> type *expr_type; </code> * Find out how is the non-terminal symbol for expressions is called in your grammar. (It is ''expression'' in the parser downloadable at the beginning of this tutorial, but it might have a different name in your own solution.) Since this is a non-terminal, you can define the type of its semantic value as follows. (Insert it near the token declarations at the beginning of ''while.y''.) <code> %type expr_type expression </code> * For each //expression rule//, do the necessary type checking and set the semantic value of the left-hand side of the rule. The semantic value of the left-hand side can be referenced by the ''$$'' symbol. * For example, the rule of number literals will have the following code: ''$$ = new type(integer);'' * In case of the rule of a single variable name, we have to ask the type of the variable from the symbol table. (Note, that we have earlier checked in this rule whether the variable was in the symbol table, so after this check we are safe to retrieve the variable data.) ''$$ = new type(symboltable[*$1].var_type);'' * In case of expressions build using operators, there is need for type checking the arguments. For example, the rule for addition will have this C++ code: <code> if($1->expr_type != integer || $3->expr_type != integer) { std::stringstream ss; ss << d_loc__.first_line << ": Type error in addition." << std::endl; error( ss.str().c_str() ); } $$ = new type(integer); </code> * Type check assignments: Emit an error, if the left and right hand sides are of different types. * Test your solution: It should give error messages for all semantic error test cases. ==== Step 6: Fixing the memory leaks ====

semantic_tutorial.1447101083.txt.gz · Last modified: 2015/11/09 21:31 by deva