Building an Assembler

I recently completed the first part of the awesome “From Nand to Tetris” course. For project 6 you build an Assembler for the HACK Computer using a high level language of your choice; naturally I did my project using Go (Golang).

The HACK Assembly language is very simple, and an Assembler could be written using string manipulation in less than 50 lines of code. However, I’ve always wanted to build a lexer/scanner and parser in Go and thought this would be a great little project to do so.

// HACK Assembly Language example
// Draws a rectangle at the top-left corner of the screen.

   @0               // this is an "A" instruction
   D=M              // this is a "C" instruction
   @INFINITE_LOOP
   D;JLE            // this is also a "C" instruction
   @counter         
   M=D
   @SCREEN          // this is also an "A" instruction
   D=A
   @address
   M=D
(LOOP)              // this is a Label
   @address
   A=M
   M=-1
   @address
   D=M
   @32
   D=D+A
   @address
   M=D
   @counter
   MD=M-1
   @LOOP
   D;JGT
(INFINITE_LOOP)
   @INFINITE_LOOP
   0;JMP

Defining the tokens

The HACK assembly language has 2 instruction types: “A” and “C”; it also has “goto” labels. Although each instruction can be broken down into parts, I decided to keep my tokens super simple:

type Token int

const ( 
    ILLEGAL Token = iota
	EOF
	COMMENT
	LABEL
	A_INSTRUCTION
	C_INSTRUCTION
)

The Scanner

My scanner is based on the standard library’s Go Scanner:

type Scanner struct {
	src      []byte
	ch       rune // current character
	offset   int  // character offset
	rdOffset int  // reading offset
}

const bom = 0xFEFF
func (s *Scanner) Init(src []byte) {
	s.src = src
	s.ch = ' '
	s.offset = 0
	s.rdOffset = 0

	s.next()
	if s.ch == bom {
		s.next()
	}
}

func (s *Scanner) next() {
	if s.rdOffset < len(s.src) {
		s.offset = s.rdOffset
		s.ch = rune(s.src[s.rdOffset])
		s.rdOffset += 1
	} else {
		s.offset = len(s.src)
		s.ch = -1 // eof
	}
}

The scanners job is to read one rune at a time and return a Token and the string literal of what it scanned.

For the Hack Assembly Language whitespace means nothing, so we can safely skip any found (the indentation in the example is only for readability):

func (s *Scanner) skipWhitespace() {
	for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' || s.ch == '\r' {
		s.next()
	}
}

The output of an assembler is 1s and 0s so we don’t really need to scan for comments, but I thought that handling comments would be a good starting point to building the scanner:

func (s *Scanner) scanComment() string {
	s.next()
	offs := s.offset
	for s.ch != '\n' && s.ch >= 0 {
		s.next()
	}
	return string(s.src[offs:s.offset])
}

Our scanComment() function keeps progressing the scanner until it finds a newline or reaches the end of file (EOF), and returns all the characters as a string. Our Scan() function can now fully scan for comments:

func (s *Scanner) Scan() (tok Token, lit string) {
	s.skipWhitespace()
	ch := s.ch
	s.next() // always make progress
	
	switch ch {	
	case -1:
		tok = EOF
	case '/':
		if s.ch == '/' { // the second '/' means we have a comment
			tok = COMMENT
			lit = s.scanComment()
        } 
    default:
        tok = ILLEGAL
	}
	return
}

Great, we now have a working Scanner that will tokenize Comments; lets add some more tokens.

The “A” instruction is the easiest to scan; the instruction begins with @, and all we need to do is scan the whole line:

func (s *Scanner) scanLine() string {
	offs := s.offset
	for s.ch != '\n' && s.ch != '\r' && s.ch >= 0 && s.ch != ' '  {
		s.next()
	}
	return string(s.src[offs:s.offset])
}

A “Label” is also is easy as it begins with a bracket (, but instead of reading to the end of the line, we make sure end when we see the closing bracket ):

func (s *Scanner) scanLabel() string {
	offs := s.offset
	for {
		ch := s.ch
		if ch == '\n' || ch == '\r' || ch < 0 {
			break
		}
		s.next()
		if ch == ')' {
			break
		}
	}
	return string(s.src[offs:s.offset-1])
}

The “C” instruction is the most complex but a simple helper function can help us to determine if it’s a C-Instruction (as described by the HACK Assembly Language spec):

func isCInstruction(ch rune) bool {
	return ch == '0' || ch == '1' || ch == '-' || ch == '!' || ch == 'A' || ch == 'D' || ch == 'M'
}

We can use the same scanLine() function for the C-Instruction, making sure we don’t throw away the first character; Our final Scan() function looks like this:

func (s *Scanner) Scan() (tok Token, lit string) {
	s.skipWhitespace()

	switch ch := s.ch; {
	case isCInstruction(ch):
		tok = C_INSTRUCTION
		lit = s.scanLine()
	default:
		s.next() // always make progress
		switch ch {	
		case -1:
			tok = EOF
		case '/':
			if s.ch == '/' {
				tok = COMMENT
				lit = s.scanComment()
			}
		case '(':
			tok = LABEL
			lit = s.scanLabel()
		case '@':
			tok = A_INSTRUCTION
			lit = s.scanLine()
		default:
			tok = ILLEGAL
		}
	}
	return
}

Syntax Tree

Before we start parsing the HACK Assembly file we want to define our abstract syntax tree (AST). All Labels get converted to Symbols (more on this later) so we’re left with a simple list of either “A”, or “C” instructions:

type Instruction interface {
	BinaryString() string
}

type HackFile struct {
	Instructions []Instruction
}

type AInstruction struct {
	lit string // the raw assembly instruction pre-parsing
	addr int
}

func (a *AInstruction) BinaryString() string {
	return fmt.Sprintf("0%015b\n", a.addr)
}

type CInstruction struct {
	lit string
	dest int // C instructions look like: `dest=comp;jump`; dest and jump are optional
	comp int
	jump int
}

func (c *CInstruction) BinaryString() string {
	return fmt.Sprintf("111%07b%03b%03b\n", c.comp, c.dest, c.jump)
}

Each Instruction interface has a single method BinaryString() which will make it really easy for us to output the machine code later.

With out AST defined and our complete Scanner, we are ready to start parsing the data.

Parsing

The HACK Assembly Language has a number of pre-defined memory allocations defined as Symbols, so we’ll initiate our parser with this in mind. Later, Labels we parse will be added to these Symbols and any variables found should also be added to these Symbols for later lookup.

type Parser struct {
	scanner Scanner
	symbols map[string]int
	instructions []Instruction
	nAddr int // next available address
}

func (p *Parser) Init(src []byte) {
	p.scanner.Init(src)	
	p.nAddr = 16 // address [0:15] are reserved
	p.symbols = map[string]int{
		"R0": 0, "R1": 1, "R2": 2, "R3": 3, "R4": 4, "R5": 5, "R6": 6, "R7": 7, "R8": 8,
		"R9": 9, "R10": 10, "R11": 11, "R12": 12, "R13": 13, "R14": 14, "R15": 15,
		"SCREEN": 16384, "KBD": 24576,
		"SP": 0, "LCL": 1, "ARG": 2, "THIS": 3, "THAT": 4,
	}
}

Because Labels could appear in an “A” instruction before they are defined, we will need to parse the file in two passes; but instead of re-setting the scanner, we will parse the instructions by storing the raw string literal and parse the instructions in a second pass:

func (p *Parser) Parse() HackFile {
loop:
	for {
		tok, lit := p.scanner.Scan()
		switch tok {
		case EOF:
			break loop // break out of the loop not just the switch
		case LABEL:
			p.symbols[lit] = len(p.instructions)
		case A_INSTRUCTION:
			p.instructions = append(p.instructions, &AInstruction{lit: lit})
		case C_INSTRUCTION:
			p.instructions = append(p.instructions, &CInstruction{lit: lit})
		}
	}

	for _, instr := range p.instructions {
		switch i := instr.(type) {
		case *AInstruction:
			p.parseAInstruction(i)
		case *CInstruction:
			p.parseCInstruction(i)
		}
	} 
	return HackFile{ Instructions: p.instructions }
}

I’m intentionally not giving you the parseAInstruction() and parseCInstruction() methods, just so you can do some of the work yourself 😄

Tying it all together

Our CLI tool will read in the *.asm file and output the machine code as a *.hack file:

func main() {
	asmFilePath := os.Args[1]
	asmData, _ := ioutil.ReadFile(asmFilePath) // ignoring errors for brevity
	
	var p Parser
	p.Init(asmData)
	hackFile := p.Parse()

	var b bytes.Buffer
	for _, i := range hackFile.Instructions {
		b.WriteString(i.BinaryString())
	}

	hackFilePath := strings.Replace(asmFilePath, ".asm", ".hack", 1)
	ioutil.WriteFile(hackFilePath, b.Bytes(), 0644)
}

This project was the pre-cursor to writing a lexer/parser for the Djinni IDL in Go, which builds on top of these concepts; Check it out of you want more detail on building a lexer/parser in Go.