Nand2Tetris: Building a Jack Language Syntax Analyzer

February 23, 2022

The Syntax Analyzer:

The Syntax Analyzer takes in high-level Jack code, tokenizes it and outputs a file with the tree. High-level languages like Java, Python, C#, Rust, etc. all utilize grammar, syntax, symbols, literals and many kinds of different traits that make them unique. Codifying and tokenizing the words of a programming language allow developers to categorize the symbols at each step. The end result is a working proof that the structure, syntax, grammar, and symbology of the language can handle all the features and operations it advertises effectively. In other words, by analyzing the syntax it can be ascertained that a language can do all that it says it can do.

To recap, all the From Nand to Tetris blog posts cover some aspects of the computer science course From Nand to Tetris also called Nand2Tetris: how to build a computer from first principles. These work posts explore and recap the high-level language to machine code journey.

The output from this tokenization and syntax analysis is an Abstract Syntax Tree. It's simply a tree of nodes that the compiler has to crawl through to convert the source code into machine language.

In the "layer cake" of building a modern computer the synax analyzer is almost an optional step. It's a way of verifying that the grammatical algorithm works - that it can make linguistical sense of the high level language:

graph TD; A[Jack High Level Language] -->|Syntax Analyzer| E[ XML Tree ] A[Jack High Level Language] -->|Jack Compiler| B[Virtual Machine Language] B[Virtual Machine Language] -->|VM Translator| C[Hack Assembly Language] C[Hack Assembly Language] -->|Hack Assembler| D[Hack Machine Code]

Try pressing one of the buttons below to pre load some jack code:

The output is the Abstract Syntax Tree. As a compiler writer, different approaches can be taken to how the tree is consumed and processed for compilation. This compiler uses recursive decent parsing, meaning that it takes a top down approach:

  1. compile program

  2. compile class

  3. compile class members

  4. compile functions

  5. compile arguments

  6. and so on...

Takaways

Writing the syntax analyzer helped me think through the large algorithm required for processing a Jack language application. The most helpful aspect was probably the initial phase of looking at the language itself and categorizing all the parts. Any programming language can be broken down into parts like this (highly simplified for this post, see course materials for a full language specification):

Lexical Elements

keywords: 'class' | 'constructor' | 'function' | 'method' | 'field' | 'static' | 'var' | 'int' | 'char' | 'boolean' | 'void'...

symbols: '{' | '}' | '(' | ')' | '[' | ']'...

integerConstant: numbers ranging from 0 to 32767

stringConstant: groups of characters between double quotes

identifier: A sequence of letters, digits, and/or underscore character and not starting with a digit

Going through the language in this way makes it digestible. When the rules are presented like this it feels much less daunting to begin the process of writing a compiler.

Wait, Is that all the syntax analyzer does? It doesn't really analyze anything. So what is it for?

Yes, it's actually true. For the purposes of nand2tetris, the syntax analyzer is more like an xml converter. In real programming languages (no offense Jack), languages which are used to build commercial applications, syntax analysis is a time where the code syntax is verified, structure is identified, symbol tables may be created, type checking could occur, and errors could be reported.

In nand2tetris the xml output was used for comparison of a correect xml output. The two output files were checked programmatically for differences, this was the test to check if the language was being correctly parsed.


Written by Michael Barakat, a front end developer living in Seattle.