Lab 8: Introduction to Lex/Yacc (Flex/Bison) Solution


Aim: The goal of this lab is to introduce the use of Lex (Flex) and Yacc (Bison) for building scanners and parsers.

Let’s get started!

    1. Create a directory structure to hold your work for this course and all the subsequent labs. Suggestion: CS202/Lab8

    2. A more detailed description of Lex and Yacc is provided in LexAndYaccTutorial.pdf on Moodle.

Introduction

When properly used, Lex & Yacc allow you to parse complex languages with ease. This is a great boon when you want to read a configuration file, or when you want to write a compiler for any language you (or anyone else) might have invented. Although these programs shine when used together, they each serve a different purpose. The next two sections explain what each part does.

Lex

The program Lex generates a so-called 'lexer'. This is a function that takes a stream of characters as its input and, whenever it sees a group of characters that matches a key, takes a certain action. A very simple example (example1.l) follows.

Example 1:

%{
#include <stdio.h>
%}

%%
stop    printf("Stop command received\n");
start   printf("Start command received\n");
%%

The first section, between the %{ and %} pair, is included directly in the output program. We need this because we use printf() later on, which is declared in stdio.h.

Sections are separated using %%, so the first line of the second section starts with the stop key. Whenever the stop key is encountered in the input, the rest of the line (a printf() call) is executed.

Besides stop, we've also defined start, which does mostly the same thing. We terminate the rules section with %% again. To compile this example, run:

lex example1.l
cc lex.yy.c -o example1 -ll

NOTE: If you are using flex, instead of lex, you may have to change -ll to -lfl in the compilation scripts.

This will generate the executable example1. If you run it, it waits for you to type some input. Whenever you type something that is not matched by any of the defined keys (i.e., stop and start), it is echoed back to the output. If you enter stop, it will output 'Stop command received'. Terminate with an EOF (^D).

You may wonder how the program runs, as we didn't define a main() function. This function is defined for you in libl (liblex), which we linked in with the -ll flag.

Regular expressions in matches

The above example wasn’t very useful in itself, and our next one won’t be either. It will however show how to use regular expressions in Lex, which are massively useful later on (example2.l).

Example 2:

%{
#include <stdio.h>
%}

%%
[0123456789]+           printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]*    printf("WORD\n");
%%

This Lex file describes two kinds of matches (tokens): WORDs and NUMBERs. You should be quite comfortable with Regular Expressions by now and realize that NUMBER matches:

[0123456789]+ : A sequence of one or more characters from the group 0123456789. Same as: [0-9]+

And, the WORD matches:

[a-zA-Z][a-zA-Z0-9]* : A letter (lowercase or uppercase) followed by zero or more characters that are either a letter or a digit. Basically, this rule matches legal variable names in many languages.

Try compiling Example 2, just like Example 1, and feed it some text. Here is a sample session:

$ ./example2

foo

WORD

bar

WORD

123

NUMBER

bar123

WORD

123bar

NUMBER

WORD

You may also be wondering where all this whitespace is coming from in the output. The reason is simple: it was in the input, and we don’t match on it anywhere, so it gets output again.
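If you do not want that whitespace echoed, one option (not part of example2.l itself, just an illustration) is to add a rule that matches whitespace and does nothing:

[ \t\n]+        ;       /* swallow spaces, tabs and newlines so they are not echoed */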

A more complicated example for a C-like syntax

Let’s say we want to parse a file that looks like this:

logging {
        category lame-servers { null; };
        category cname { null; };
};

zone "." {
        type hint;
        file "/etc/bind/db.root";
};

We clearly see a number of categories (tokens) in this file:

WORDs, like ‘zone’ and ‘type’

FILENAMEs, like ‘/etc/bind/db.root’

QUOTEs, like those surrounding the filename

OBRACEs, {

EBRACEs, }

SEMICOLONs, ;

The corresponding Lex file is Example 3 (example3.l):

%{
#include <stdio.h>
%}
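The rules section of example3.l is not reproduced after this header. A plausible version, matching the token categories listed above (the exact patterns are an assumption), would be:

%%
[a-zA-Z][a-zA-Z0-9]*    printf("WORD ");
[a-zA-Z0-9\/.-]+        printf("FILENAME ");
\"                      printf("QUOTE ");
\{                      printf("OBRACE ");
\}                      printf("EBRACE ");
;                       printf("SEMICOLON ");
\n                      printf("\n");
[ \t]+                  /* ignore whitespace */;
%%

Run against the configuration file above, such a lexer simply prints the token category for every piece of input, which is a good way to check the patterns before involving YACC.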

YACC

YACC can parse input streams consisting of tokens with certain values. This clearly describes the relation YACC has with Lex: YACC has no idea what 'input streams' are; it needs preprocessed tokens. While you can write your own tokenizer, we will leave that entirely up to Lex.

A simple thermostat controller

Let’s say we have a thermostat that we want to control using a simple language. A session with the thermostat may look like this:

heat on

Heater on!

heat off

Heater off!

target temperature 22

New temperature set!

The tokens we need to recognize are: heat, on/off (STATE), target, temperature, NUMBER.

The Lex tokenizer, example4.l, simply matches each of these words and numbers and returns the corresponding token to the parser.
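A minimal sketch of what such a tokenizer might look like (the actual example4.l may differ in detail) is:

%{
#include <stdio.h>
#include "y.tab.h"
%}

%%
[0-9]+                  return NUMBER;
heat                    return TOKHEAT;
on|off                  return STATE;
target                  return TOKTARGET;
temperature             return TOKTEMPERATURE;
\n                      /* ignore end of line */;
[ \t]+                  /* ignore whitespace */;
%%

The grammar section of example4.y then looks roughly as follows; the opening 'commands' and 'command' rules are filled in here from the description that comes after the code: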

commands: /* empty */
        | commands command
        ;

command:
        heat_switch
        |
        target_set
        ;

heat_switch:
        TOKHEAT STATE
        {
                printf("\tHeat turned on or off\n");
        }
        ;

target_set:
        TOKTARGET TOKTEMPERATURE NUMBER
        {
                printf("\tTemperature set\n");
        }
        ;

The first part is what we call the ‘root’. It tells us that we have commands for the thermostat, and that these commands consist of individual command parts. As you can see this rule is very recursive, because it again contains the word commands. What this means is that the program is now capable of reducing a series of commands one by one.

The second rule defines what a command is. We support only two kinds of commands, the heat_switch and the target_set. This is what the | symbol signifies: 'a command consists of either a heat_switch or a target_set'.

A heat_switch consists of the HEAT token, which is simply the word heat, followed by a STATE (which we defined in the Lex file as on or off).

Somewhat more complicated is the target_set, which consists of the TARGET token (the word target), the TEMPERATURE token (the word temperature) and a NUMBER.

A complete YACC file

The previous section only showed the grammar part of the YACC file, but there is more. This is the header that we omitted (this part goes before the grammar shown above in the example4.y file):

%{
#include <stdio.h>
#include <string.h>

void yyerror(const char *str)
{
        fprintf(stderr, "error: %s\n", str);
}

int yywrap()
{
        return 1;
}

int main()
{
        yyparse();
        return 0;
}
%}

%token NUMBER TOKHEAT STATE TOKTARGET TOKTEMPERATURE

The yyerror() function is called by YACC if it finds an error. We simply output the message passed, but there are smarter things we can do.

The function yywrap() can be used to continue reading from another file. It is called at EOF; you can then open another file and return 0. Or you can return 1, indicating that this is truly the end.
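For example, a yywrap() that switches the scanner over to a second input file once, and only then reports end of input, might look like this (the file name is purely hypothetical):

int yywrap()
{
        static int switched = 0;                          /* have we already moved to the second file? */
        if (!switched) {
                FILE *next = fopen("extra_input.txt", "r");   /* hypothetical second input file */
                if (next != NULL) {
                        yyin = next;                      /* lex/flex reads from the global yyin */
                        switched = 1;
                        return 0;                         /* 0 means: more input, keep scanning */
                }
        }
        return 1;                                         /* 1 means: we are really done */
}

If this version lives in the YACC file rather than the Lex file, it also needs an extern FILE *yyin; declaration, since yyin is defined in the generated lex.yy.c.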

Then there is the main() function, that does nothing but set everything in motion.

The last line simply defines the tokens we will be using. These are written out to y.tab.h if YACC is invoked with the -d option.
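For this grammar, y.tab.h essentially contains one preprocessor definition per token, along these lines (the exact numbers depend on your yacc or bison version):

#define NUMBER 257
#define TOKHEAT 258
#define STATE 259
#define TOKTARGET 260
#define TOKTEMPERATURE 261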

Compiling & running the thermostat controller

lex example4.l

yacc -d example4.y

cc lex.yy.c y.tab.c -o example4

A few things have changed. We now also invoke YACC to compile our grammar, which creates y.tab.c and y.tab.h. We then call Lex as usual. When compiling, we remove the -ll flag: we now have our own main() function and don’t need the one provided by libl.

NOTE: If you get an error about your compiler not being able to find 'yylval', add this to example4.l, just beneath the #include "y.tab.h" line:

extern YYSTYPE yylval;

A sample session:

$ ./example4

heat on

Heat turned on or off

heat off

Heat turned on or off

target temperature 10

Temperature set

target humidity 20

error: parse error

$

Expanding the thermostat to handle parameters

As we've seen, we now parse the thermostat commands correctly, and even flag mistakes properly. But as you might have guessed by the weaselly wording, the program has no idea of what it should do; it is not passed any of the values you enter.

Let's start by adding the ability to read the new target temperature. In order to do so, we need to teach the NUMBER match in the lexer to convert itself into an integer value, which can then be read in YACC. Whenever Lex matches a pattern, it puts the text of the match in the character string yytext. YACC in turn expects to find a value in the variable yylval. The obvious solution, used in example5.l, is to do the conversion inside the NUMBER rule and store the result in yylval.
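A sketch of the changed NUMBER rule (assuming the usual atoi() conversion; the other rules stay as they were) is:

[0-9]+          { yylval = atoi(yytext); return NUMBER; }   /* hand the numeric value to YACC via yylval */

Since atoi() is declared in stdlib.h, that header would also be included in the lexer's %{ ... %} block.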

Returning to the configuration file parser: remember that we already wrote a lexer for that file. Now all we need to do is write the YACC grammar, and modify the lexer so it returns values in a format YACC can understand. We modify the lexer from example3.l into the following (example6.l):

%{
#include <stdio.h>
#include "y.tab.h"
%}

%%
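The rules that follow this header are not reproduced here. A plausible version (assuming yylval has been set up to carry strings, e.g. with a char * YYSTYPE or a %union, and that string.h is included for strdup()) is:

[a-zA-Z][a-zA-Z0-9]*    { yylval = strdup(yytext); return WORD; }      /* keep the matched text for the grammar */
[a-zA-Z0-9\/.-]+        { yylval = strdup(yytext); return FILENAME; }
\"                      return QUOTE;
\{                      return OBRACE;
\}                      return EBRACE;
;                       return SEMICOLON;
\n                      /* ignore end of line */;
[ \t]+                  /* ignore whitespace */;
%%

The grammar, example6.y, is then discussed rule by rule below.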

zonecontent:
        OBRACE zonestatements EBRACE

A zonecontent needs to start with an OBRACE, {. Then come the zonestatements, followed by an EBRACE, }.

quotedname:
        QUOTE FILENAME QUOTE
        {
                $$=$2;
        }

This section defines what a quotedname is: a FILENAME between QUOTEs. Then it says something special: the value of a quotedname token is the value of the FILENAME. This means that the quotedname has as its value the filename without quotes.

This is what the magic $$=$2; command does. It says: my value is the value of my second part. When the quotedname is now referenced in other rules, and you access its value with the $-construct, you see the value that we set here with $$=$2.

NOTE: this grammar chokes on filenames without either a ‘.’ or a ‘/’ in them.
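As an aside, the 'Complete zone ... found' line in the sample output near the end of this section comes from a rule that consumes a quotedname in exactly this way. That rule is not shown here; a hypothetical version (the names zone_set and ZONETOK are assumptions) could read:

zone_set:
        ZONETOK quotedname zonecontent SEMICOLON
        {
                printf("Complete zone for '%s' found\n", $2);
        }
        ;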

zonestatements:
        |
        zonestatements zonestatement SEMICOLON
        ;

zonestatement:
        statements
        |
        FILETOK quotedname
        {
                printf("A zonefile name '%s' was encountered\n", $2);
        }
        ;

This is a generic statement that catches all kinds of statements within the 'zone' block. We again see the recursion.

block:
        OBRACE zonestatements EBRACE SEMICOLON
        ;

statements:
        | statements statement
        ;

statement: WORD | block | quotedname

This defines a block, and the 'statements' that may be found within it.

When executed, the output is like this:

$ ./example6

zone "." {
        type hint;
        file "/etc/bind/db.root";
        type hint;
};

A zonefile name '/etc/bind/db.root' was encountered

Complete zone for '.' found

Exercises

    1. Write a program using LEX to count the number of characters, words, spaces and lines in a given input file. (Count.l)

    2. Write a program using LEX to count the number of comment lines in a given C program. Also eliminate them and copy the resulting program into a separate file. (Comments.l)

    3. Write a YACC program to recognize the strings 'aaab', 'abbb', 'ab' and 'a' using the grammar a^n b (n >= 0). (ab.y)

    4. Write a program using LEX to recognize a valid arithmetic expression and to recognize the identifiers and operators present. Print them separately. (calc.l)

    5. Write a program using YACC to recognize and evaluate a valid arithmetic expression that uses the operators +, -, * and /. (calc.y)

Submitting your work:

  1. All source files and class files as one tar-gzipped archive.

    • When unzipped, it should create a directory with your ID. Example: 2008CS1001 (NO OTHER FORMAT IS ACCEPTABLE!!! Case sensitive!!!)

    • Should include:

      • count.l [3 points]

      • Comments.l [3 points]

      • ab.l [3 points]

      • calc.l [3 points]

      • calc.y [3 points]

      • README file

  2. Negative marks for any problems/errors in running your programs

  3. If any aspects of tasks are confusing, make an assumption and state it clearly in your README file!

  4. README file should also have instructions on how to use/run your program!

  5. Submit/Upload it to Google Classroom
