semantic_tutorial

Tutorial on semantic analysis

Tutorial on semantic analysis

Step 1: Preparation

In this tutorial we use the While language. It is possible to use your own lexer and parser. If you have not completed it, use this implementation.
If you are working not with your own parser, please take the time to understand the grammar.
Look into the Parser.ih file.
- The function called lex asks the current line number from the parser and saves it into a field of the d_loc__ attribute of the Parser class. Then it asks for the next token from the lexer and returns it to the parser.
- The line number information is used in the error function to print location information for error messages.
Make sure you can compile the parser and it accepts all the correct test files and rejects the lexical error ones. Try out the semantic error test cases! They are accepted. The goal of the tutorial is to find those errors as well.

Step 2: Symbol table

The symbol table will be implemented by a simple C++ map. The keys will be variable names, mapping to the variable type and the line number of the declaration.

Create a new header file, called semantics.h.
Include the standard headers iostream, string and map.
Create an enumeration type to represent the two types of the While language

enum type { integer, boolean };

Create a class called var_data to represent the data, that variable names will be mapped to. Its two attributes:
- decl_row: to store the line number of the declaration.
- var_type: the type of the variable.
Create two constructors for the var_data class:
- With two parameters, initializing both attributes.
- With no parameters and empty body. This will be needed if we want to put objects of this class into a map.
In Parser.h, add the symbol table as a private attribute to the Parser class:

std::map<std::string,var_data> symbol_table;

In the while.y file, in the %baseclass-preinclude option, change <iostream> to “semantics.h”, in order to make the just created header file part of the project.
Compile the project and fix the compilation errors, if any.

Step 3: Passing variable names from the lexer to the parser

In order to be able to insert variable names and the corresponding data into the symbol table, the parser needs the names of the variables. We will ask this information from the lexer and bind it to each terminal symbol representing identifiers.

In bisonc++, every terminal and non-terminal can have an associated semantic value. (Remember semantic values or attributes from the lectures!) Since every symbol can have a semantic value of different type, we need to create a union showing the different possibilities. Our union will first have only one option, as we want to associate the names of variables to the identifier terminals. bisonc++ is having its own syntax for creating this union, and it will be turned into a real C++ union when running the bisonc++ command.

In while.y, before the first %% line, add the following:

%union
{
  std::string *name;
}

The names of options in this union can be used to define the type of semantic value assigned to a given grammar symbol. Extend the declaration of the identifier non-terminal as follows (TOKEN_ID is the terminal you use for variable names in your project):

%token <name> TOKEN_ID

Now, identifiers can have semantic values of type string. This value can be set in the lex function. Extend the lex function with the following code (before the return instruction):

if( ret == TOKEN_ID )
{
  d_val__.name = new std::string(lexer.YYText());
}

Some explanation: The YYText() function of the lexer returns the name of the current token. We create a C++ string from it. The d_val__ attribute of the parser can be used to assign semantic values to the current token. Its type is the union that we have defined in the while.y file. We load the name of the identifier into the name option of this union.

Now we are able to access the variable names in the C++ code fragments following the grammar rules. If you have a rule a: A b C, then the semantic values of A, b and C are $1, $2 and $3 respectively.

Modify the grammar rules for declarations such that they write the name of the declared variable to the standard output. (Note that the actual rule might look a bit different from the following example.)

declaration:
    TOKEN_INTEGER TOKEN_ID TOKEN_SEMICOLON
    {
        std::cout << "declared variable: " << *$2 << std::endl;
    }

Run your compiler a correct test file and check if the declared variable names appear in the output.

Step 4: Finding re-declared and undeclared variables

Instead of writing the names of declared variables to the standard output, now we insert them in the symbol table. The C++ map datatype has the [] operator, which can be used to insert or retrieve elements.

Change the variable declaration rules such that the variables and their data are inserted into the symbol table:

    TOKEN_INTEGER TOKEN_ID TOKEN_SEMICOLON
    {
        symbol_table[*$2] = var_data( d_loc__.first_line, integer );
    }

Before the insertion operation shown above, check if the variable has already been declared. (The count function of the map type returns the number of the given element in the map; 0 or 1 in our case.)

  if( symbol_table.count(*$2) > 0 )
  {
    std::stringstream ss;
    ss << "Re-declared variable: " << *$2 << ".\n"
    << "Line of previous declaration: " << symbol_table[*$2].decl_row << std::endl;
    error( ss.str().c_str() );
  }

Explanation: We use here the stringstream type to collect the parts of the error message. In order to make this work, you have to include the <sstream> standard header in semantics.h. We can ask for the string, collected in the stringstream-typed ss, using the str() member function. Since the error function (see in Parser.ih) accepts a C-style character array instead of a C++ string, we have to use c_str() to convert.

Implement the C++ code fragment of the rule for logical variable declarations in a similar manner. Do not forget to use the boolean type when inserting those variables into the symbol_table.
Find or create a test file with re-declared variables and check if your compiler can really find the error.
Find all other uses of the identifier terminal in the grammar (assignment, expression), and implement a check for undeclared variables in the corresponding C++ code fragments. Test your solution with test files containing undeclared variables in assignments and expressions.

Step 5: Type checking

In order to do type checking, we need semantic values for expressions. These will be of type type (the enumeration declared in semantics.h).

Add a new option to the union in while.y:

type *expr_type;

Find out how is the non-terminal symbol for expressions is called in your grammar. (It is expression in the parser downloadable at the beginning of this tutorial, but it might have a different name in your own solution.) Since this is a non-terminal, you can define the type of its semantic value as follows. (Insert it near the token declarations at the beginning of while.y.)

%type <expr_type> expression

For each expression rule, do the necessary type checking and set the semantic value of the left-hand side of the rule. The semantic value of the left-hand side can be referenced by the $$ symbol.
- For example, the rule of number literals will have the following code: $$ = new type(integer);
- In case of the rule of a single variable name, we have to ask the type of the variable from the symbol table. (Note, that we have earlier checked in this rule whether the variable was in the symbol table, so after this check we are safe to retrieve the variable data.) $$ = new type(symbol_table[*$1].var_type);
- In case of expressions build using operators, there is need for type checking the arguments. For example, the rule for addition will have this C++ code:

if(*$1 != integer || *$3 != integer)
{
   std::stringstream ss;
   ss << d_loc__.first_line << ": Type error in addition." << std::endl;
   error( ss.str().c_str() );
}
$$ = new type(integer);

Type check assignments: Emit an error, if the left and right hand sides are of different types.
Test your solution: It should give error messages for all semantic error test cases.

Step 6: Fixing the memory leaks

All the semantic values were created using the new keyword of C++. This means we have to delete them in order to avoid memory leakage.

Review all the rules and delete the semantic values of the symbols on the right-hand side at the end of the corresponding C++ code fragments (change i to the appropriate number):

delete $i;

Table of Contents