While writing a full-blown compiler for a programming language is a difficult and frustrating task, writing a smaller and more specific parser can be surprisingly easy if you know a small trick.
On the other hand, parsing problems pops up at several places in modern-day programming. So, learning this useful trick can be rewarding.
You need to know the basics of Python, more specifically, you should know the concepts of recursion and flow of control.
After reading and understanding this post, you will be able to create simple calculators, interactive interpreters, parsers, very limited and small programming languages, etc. In general, you should be able to take input, tokenize it, perform whatever actions you want to on the tokens, and output the result of the process.
At the end of this post, you will have created a simple, Lisp-like prefix calculator. Following is a demonstration of how it's going to look:
> ( + 3 2 )
> ( / 2 0 )
> ( - -3 2 )
> ( + ( * 3 2 ) 5 )
Step 1: Writing the Grammar
The first step to writing a parser is to write a clear grammar for its syntax. The grammar determines what is and what is not right. Once you have written the grammar, translating it to Python code is a trivial chore. The grammar will serve as our first pseudocode.
For our tiny calculator, we know that the input can come in two forms: a Number (-2, .5, +8, 8.5, 9.) or a more complicated Expression begins with a (, followed by an operator, etc.).
For writing a grammar, we need to identify different elements of the syntax. So far, we have Expression, Number, and Operator. The next important thing to do is to structure the elements (known as terms) into a hierarchical form. This is shown below:
( Operator Expression Expression )
a floating-point number ([-+][0-9][*\.][0-9]*)
You will notice that Operator and Expression have no parent; they are independent terms.
A grammar is read from the bottom up and different choices appear on distinct lines. Our grammar says that:
- an Operator is one of +, -, *, /.
- a Number is a floating-point number which matches the RegEx [-+][0-9]*[\.][0-9]*
- an Expression is either a Number or a ( followed by an Operator, followed by two other Expressions, and finally ends in a ). Note that the definition of an Expression is recursive.
Step 2: Translating the Grammar into Pseudocode
Pseudocode is fake code resembling English which is supposed to be an intermediate code that can easily be converted into real code. Although writing pseudocode is optional, it is really helpful.
The trick here is to put each term from our grammar into a separate function. Whenever we need to apply the grammar of a certain term, we only have to call the function. Following is the pseudocode implementing the grammar above:
This is our rough pseudo-code that should be good enough for our purpose. In the next step, we will write the real code.
3. Writing the Code
It is said very profoundly about Python that reading and writing Python feels like doing pseudocode. The same applies here, but there is one small caveat— Python doesn't provide any function for “unreading” or putting a character back in the input buffer.
For this, I have created a small class which extends the file object to include this feature. To keep things simple, I have avoided inheritance and my class is not compatible with the file object provided by Python. Treat it like a black-box if you don't want to understand it.
Here is the buffer.py file which handles buffered input:
expression() is our top-level function and maps the Expression grammar term. We first ignore all the whitespace. After that, it takes a single non-whitespace character as input and checks it against several possibilities.
If the input string starts with +, -, ., or a digit, it is a number. We put the character back and input the entire number.
If the input string starts with (, a complete expression is to follow. We input the operator, two more expressions which will serve as the operands, and finally the closing parenthesis. We then calculate the result and return it.
The number function maps the Number grammar term and is very simple—just a wrapper around getword. We input a whole word and if it converts to a float, we return it, otherwise the function returns an error message.
The operator function inputs a single character and tests it for equality against several known operators. Like the above two functions, it also maps a grammar term, i.e., Operator. In case the given operator is not valid, an error message is returned.
The calc function is actually not necessary but makes the code substantially better. In an ideal program, each function should do only one logical task. calc removes some burden from expression.
Although Python 3 supports buffered input through stdin.buffer, Python 2 has no such facility. Plus, Python 3's stdin.buffer would still require us to create some wrapper of our own.
The UngetableInput class wraps Python's basic input to go through a buffer. We take input into the buffer and put a character back into the buffer when ungetc is called. Unless the buffer is empty, all input comes from the buffer.
This code works and leaves a lot of cleaning as homework for the reader. 🙂 Following is a list of things you can do to improve and extend the rudimentary calculator:
Improve buffer.py to handle input whitespace more accurately. Hint: You might want to use a string as the buffer.
Implement a function to get a single character while skipping all whitespace and replace the whitespace skipping loop with it.
Add the ability to create variables. Following the Lisp syntax, it should look something like the following:
( define var_name 839457.892 )
What's Coming Next?
One of the most important parts of our program is the input buffer we created. Unfortunately, it's not general purpose and can break when used in something more complicated than our tiny calculator program. In the next article, we will examine a bigger module which does this chore better.