Designing an Interpreter

The Concept #

I’m building a toy version of Python based on PunyPy from Learn More Python The Hard Way by Zed Shaw. Mine shall be named QuasiPy. It’s barely a language at all; it has no control flow!

Despite its pitiful capabilities, QuasiPy has a few components in common with some real programming languages.

The scanner scans the source code for lexemes (ordered groups of characters) that represent keywords, variables, operators, values, and other elements of syntax, including indentation. Each lexeme corresponds to a token type. As each lexeme (e.g. ‘def’) is read from the stream of characters, it is stored in a Token object along with its token type (e.g. ‘DEF’) and depth of indentation. The tokens are stored sequentially in a list.
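
To make this concrete, here is a minimal sketch of a scanner along these lines. The token names and regular expressions are illustrative, not QuasiPy’s actual definitions:

import re
from collections import namedtuple

# A token pairs a lexeme with its token type and its depth of indentation.
Token = namedtuple("Token", ["type", "lexeme", "indent"])

# Illustrative token types; order matters because the first match wins.
TOKEN_SPEC = [
    ("DEF",    r"def\b"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("COLON",  r":"),
    ("COMMA",  r","),
    ("EQUALS", r"="),
]

def scan_line(line):
    indent = len(line) - len(line.lstrip(" "))
    tokens, rest = [], line.strip()
    while rest:
        for tok_type, pattern in TOKEN_SPEC:
            match = re.match(pattern, rest)
            if match:
                tokens.append(Token(tok_type, match.group(), indent))
                rest = rest[match.end():].lstrip(" ")
                break
        else:
            raise SyntaxError(f"unrecognized input: {rest!r}")
    return tokens

Scanning ‘def foo(x):’ this way yields the DEF, NAME, LPAREN, NAME, RPAREN, and COLON tokens in order, each carrying an indentation depth of zero.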

The parser analyzes the stream of tokens to identify the keywords and delimiters that reveal the nested structure of the program. It works by examining the next token and peeking ahead to establish the token’s grammatical context. Once the context is identified, the parser applies nested grammar rules to the tokens to build the parse tree (which, in this simple design, doubles as the abstract syntax tree, or AST) representation of the code.

For example, in the grammar for a function definition, the ‘DEF’ token should be followed by ‘NAME’, ‘LPAREN’, a (possibly empty) list of parameters separated by ‘COMMA’s, ‘RPAREN’, and ‘COLON’ before an assignment statement in the function body.

If all of these tokens are successfully parsed by the grammar rule for function definitions (side note: I wrote my grammars in BNF), the parser nests a FuncDef object in the tree with the attributes ‘name’, ‘parameters’, and ‘content’. Each attribute has its own data structure: ‘name’ is just a string; ‘parameters’ is a list of objects representing parameter names; and ‘content’ holds an assignment statement, which in turn nests its own components, such as the assigned variable’s name and the expression following the assignment operator.
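
Here is a sketch of how that grammar rule might look as a recursive-descent method. The match and peek helpers are the usual ones, and parse_assignment stands in for a hypothetical rule covering the function body:

class FuncDef:
    def __init__(self, name, parameters, content):
        self.name = name
        self.parameters = parameters
        self.content = content

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens

    def peek(self):
        # Look at the next token's type without consuming it.
        return self.tokens[0].type if self.tokens else None

    def match(self, expected):
        # Consume the next token, failing loudly if it isn't what the rule expects.
        if not self.tokens:
            raise SyntaxError(f"expected {expected}, got end of input")
        token = self.tokens.pop(0)
        if token.type != expected:
            raise SyntaxError(f"expected {expected}, got {token.type}")
        return token

    def parse_funcdef(self):
        # funcdef := DEF NAME LPAREN [NAME (COMMA NAME)*] RPAREN COLON assignment
        self.match("DEF")
        name = self.match("NAME").lexeme
        self.match("LPAREN")
        parameters = []
        if self.peek() == "NAME":
            parameters.append(self.match("NAME").lexeme)
            while self.peek() == "COMMA":
                self.match("COMMA")
                parameters.append(self.match("NAME").lexeme)
        self.match("RPAREN")
        self.match("COLON")
        content = self.parse_assignment()  # hypothetical rule for the body
        return FuncDef(name, parameters, content)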

The semantic analyzer traverses the parse tree, passing a scope object (Zed calls this a World) along with each node. The scope object contains a symbol table in the form of a dictionary. For variables, the variable name is the key and no value is stored at all (this stage only verifies the semantics). For function definitions, the function name is the key and the value is a tuple containing the function’s parse tree and its own local scope object, stored as (function, scope).
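
Roughly, the scope object and the analyzer’s handling of a function definition might look like this. The analyze function here is a hypothetical dispatcher over node types:

class Scope:
    """Zed's 'World': a symbol table plus a link to the enclosing scope."""
    def __init__(self, parent=None):
        self.symbols = {}
        self.parent = parent

    def define(self, name, value=None):
        self.symbols[name] = value

    def lookup(self, name):
        if name in self.symbols:
            return self.symbols[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise NameError(f"undefined name: {name}")

def analyze_funcdef(funcdef, scope):
    local = Scope(parent=scope)      # the parent link is what makes shadowing work
    for param in funcdef.parameters:
        local.define(param)          # the name exists; no value is stored
    scope.define(funcdef.name, (funcdef, local))  # the (function, scope) tuple
    analyze(funcdef.content, local)  # hypothetical dispatcher over node types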

The interpreter makes the code run in much the same way that the analyzer verifies semantics, except that expressions are evaluated, values are returned from functions, and variables are actually assigned.
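
For instance, where the analyzer merely records that an assignment’s target exists, the interpreter might do something like the following. Here evaluate and interpret are hypothetical dispatchers mirroring analyze, and call nodes are assumed to carry a list of argument expressions:

def interpret_assignment(node, scope):
    # Unlike the analyzer, the interpreter computes and stores a real value.
    scope.define(node.name, evaluate(node.expression, scope))

def interpret_call(node, scope):
    funcdef, local = scope.lookup(node.name)       # the (function, scope) tuple
    for param, arg in zip(funcdef.parameters, node.arguments):
        local.define(param, evaluate(arg, scope))  # bind arguments to parameters
    return interpret(funcdef.content, local)       # run the body in its local scope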

Notice how my shortest description happens to be of the last phase of the interpreter… the phase on which I have spent the least amount of time!

The Learning #

I’ve written the scanner and parser a few times. It’s a pretty straightforward process to write them in Python. The toughest part is probably handling indents and newlines, which are important to Python’s syntax: together, they delimit code blocks. The analyzer and interpreter have proven more challenging because their design has a very direct effect on language semantics such as variable scope and recursion.
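
For indentation, the standard trick (CPython’s own tokenizer uses the same idea) is to keep a stack of indentation levels and emit INDENT and DEDENT tokens as the level rises and falls. A sketch, reusing the Token and scan_line helpers from the scanner sketch above:

def tokenize(lines):
    levels = [0]                        # stack of enclosing indentation levels
    for line in lines:
        if not line.strip():
            continue                    # blank lines don't affect block structure
        indent = len(line) - len(line.lstrip(" "))
        if indent > levels[-1]:         # deeper: a new block opens
            levels.append(indent)
            yield Token("INDENT", "", indent)
        while indent < levels[-1]:      # shallower: one or more blocks close
            levels.pop()
            yield Token("DEDENT", "", indent)
        yield from scan_line(line)
        yield Token("NEWLINE", "", indent)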

Although I have successfully implemented local scope in the analyzer (it can handle variable shadowing), I am now approaching my third rewrite of the semantic analyzer and my second attempt at the interpreter. This is because I conducted some experiments to see what happens when the environment of a function definition changes with each pass over an enclosed function definition. For example, this script:

#!/usr/bin/env python3
accumulator = list()

def A(x):
  if x == 0:
    print("base case, x == 0")
  else:
    def B():
      # x is a free variable here; it lives in A's scope
      print(f"B(): A's x == {x}")
    accumulator.append(B)
    A(x - 1)

A(4)
[func() for func in accumulator]  # call each saved B in creation order

output:

base case, x == 0
B(): A's x == 4
B(): A's x == 3
B(): A's x == 2
B(): A's x == 1

The environment in which B was declared is different for each level of recursion in A. We can see this because B contains a ‘free variable’, x, which is actually found in the symbol table for A.

I think each instance of function B has its own lexical scope: each instance of B refers back to a specific version of A’s environment. My analyzer could verify these semantics as far as variable shadowing goes, but function calls at the interpreter stage can’t perform recursion because there is no way to keep references to those specific environments.

My latest solution is to ‘stack’ the environment’s symbol table so that each level of recursion can contain unique values. I suppose each version of the environment in the stack could be called a frame, but that might be confusing. I’ll use it for now and see how it goes.
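
Here is a sketch of that idea; this is my tentative design, not settled code. Each call into a function pushes a fresh symbol table onto its scope’s stack, and lookups fall through to the enclosing scope rather than to older frames:

class Scope:
    def __init__(self, parent=None):
        self.frames = [{}]            # a stack of symbol tables
        self.parent = parent

    def push_frame(self):
        self.frames.append({})        # pushed on each call into the function

    def pop_frame(self):
        self.frames.pop()             # popped when the call returns

    def define(self, name, value=None):
        self.frames[-1][name] = value

    def lookup(self, name):
        if name in self.frames[-1]:   # only the current frame is visible
            return self.frames[-1][name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise NameError(f"undefined name: {name}")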
 
This summary was a lot of talk and not a lot of code! I’ll do something about that in my next post, but I need to start the semantic analyzer rewrite first.

 