
Getting started with ANTLR: building a simple expression language

This post is the first one of a series. The goal of the series is to describe how to create a useful language and all the supporting tools.

In this post we will start working on a very simple expression language. We will build it in our language sandbox and therefore we will call the language Sandy.

I think that tool support is vital for a language: for this reason we will start with an extremely simple language but we will build rich tool support for it. To benefit from a language we need a parser, interpreters and compilers, editors, and more. It seems to me that there is a lot of material on building simple parsers but very little on building the rest of the infrastructure needed to make using a language practical and effective.

I would like to focus on exactly these aspects, making a language small but fully useful. Then you will be able to grow your language organically.

The code is available on GitHub: https://github.com/ftomassetti/LangSandbox. The code presented in this article corresponds to the tag 01_lexer.

The language

The language will let us define variables and expressions. We will support:

  • integer and decimal literals
  • variable definition and assignment
  • the basic mathematical operations (addition, subtraction, multiplication, division)
  • the use of parentheses

Examples of a valid file:

var a = 10 / 3
var b = (5 + 3) * 2 
var c = a / b

The tools we will use

We will use:

  • ANTLR to generate the lexer and the parser
  • Gradle as our build system
  • Kotlin as the implementation language. It will be very basic Kotlin, given that I have just started learning it.

Setup the project

Our build.gradle file will look like this:

buildscript {
   ext.kotlin_version = '1.0.3'
 
   repositories {
     mavenCentral()
     maven {
        name 'JFrog OSS snapshot repo'
        url  'https://oss.jfrog.org/oss-snapshot-local/'
     }
     jcenter()
   }
 
   dependencies {
     classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
   }
}
 
apply plugin: 'kotlin'
apply plugin: 'java'
apply plugin: 'idea'
apply plugin: 'antlr'
 
repositories {
  mavenLocal()
  mavenCentral()
  jcenter()
}
 
dependencies {
  antlr "org.antlr:antlr4:4.5.1"
  compile "org.antlr:antlr4-runtime:4.5.1"
  compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
  compile "org.jetbrains.kotlin:kotlin-reflect:$kotlin_version"
  testCompile "org.jetbrains.kotlin:kotlin-test:$kotlin_version"
  testCompile "org.jetbrains.kotlin:kotlin-test-junit:$kotlin_version"
  testCompile 'junit:junit:4.12'
}
 
generateGrammarSource {
    maxHeapSize = "64m"
    arguments += ['-package', 'me.tomassetti.langsandbox']
    outputDirectory = new File("generated-src/antlr/main/me/tomassetti/langsandbox".toString())
}
compileJava.dependsOn generateGrammarSource
sourceSets {
    generated {
        java.srcDir 'generated-src/antlr/main/'
    }
}
compileJava.source sourceSets.generated.java, sourceSets.main.java
 
clean {
    delete "generated-src"
}
 
idea {
    module {
        sourceDirs += file("generated-src/antlr/main")
    }
}

We can run:

  • ./gradlew idea to generate the IDEA project files
  • ./gradlew generateGrammarSource to generate the ANTLR lexer and parser
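
By default the ANTLR Gradle plugin looks for grammar files under src/main/antlr, so the lexer we are about to write can be saved there (the exact path is an assumption based on the plugin defaults, not something the build file configures explicitly):

src/main/antlr/SandyLexer.g4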

Implementing the lexer

We will build the lexer and the parser in two separate files. This is the lexer:

lexer grammar SandyLexer;
 
// Whitespace
NEWLINE            : '\r\n' | '\r' | '\n' ;
WS                 : [\t ]+ ;
 
// Keywords
VAR                : 'var' ;
 
// Literals
INTLIT             : '0'|[1-9][0-9]* ;
DECLIT             : ('0'|[1-9][0-9]*) '.' [0-9]+ ;
 
// Operators
PLUS               : '+' ;
MINUS              : '-' ;
ASTERISK           : '*' ;
DIVISION           : '/' ;
ASSIGN             : '=' ;
LPAREN             : '(' ;
RPAREN             : ')' ;
 
// Identifiers
ID                 : [_]*[a-z][A-Za-z0-9_]* ;

Now we can simply run ./gradlew generateGrammarSource and the lexer will be generated for us from the previous definition.
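
Before writing proper tests we can try the generated lexer with a small, throwaway Kotlin main that feeds it a string and prints the tokens it recognizes. This is only a sketch, not part of the project: the main function is my own addition, and it uses the VOCABULARY field that ANTLR generates to obtain the symbolic name of each token.

import me.tomassetti.langsandbox.SandyLexer
import org.antlr.v4.runtime.ANTLRInputStream
 
fun main(args: Array<String>) {
    // lex a hard-coded snippet and print each token with its symbolic name
    val lexer = SandyLexer(ANTLRInputStream("var a = 1 + 2"))
    var token = lexer.nextToken()
    while (token.type != -1) { // -1 is the EOF token type
        println("${SandyLexer.VOCABULARY.getSymbolicName(token.type)} '${token.text}'")
        token = lexer.nextToken()
    }
}

Note that WS and NEWLINE tokens are printed as well, because the grammar does not skip them; the tests below filter out WS explicitly.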

Testing the lexer

Testing is always important, but when building languages it is absolutely critical: if the tools supporting your language are not correct, this could affect all the programs built with them. So let’s start testing the lexer: we will just verify that the sequence of tokens the lexer produces is the one we expect.

package me.tomassetti.sandy
 
import me.tomassetti.langsandbox.SandyLexer
import org.antlr.v4.runtime.ANTLRInputStream
import java.io.*
import java.util.*
import org.junit.Test as test
import kotlin.test.*
 
class SandyLexerTest {
 
    fun lexerForCode(code: String) = SandyLexer(ANTLRInputStream(StringReader(code)))
 
    fun lexerForResource(resourceName: String) = SandyLexer(ANTLRInputStream(this.javaClass.getResourceAsStream("/${resourceName}.sandy")))
 
    fun tokens(lexer: SandyLexer): List<String> {
        val tokens = LinkedList<String>()
        do {
            val t = lexer.nextToken()
            when (t.type) {
                -1 -> tokens.add("EOF")
                else -> if (t.type != SandyLexer.WS) tokens.add(lexer.ruleNames[t.type - 1])
            }
        } while (t.type != -1)
        return tokens
    }
 
    @test fun parseVarDeclarationAssignedAnIntegerLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "EOF"),
                tokens(lexerForCode("var a = 1")))
    }
 
    @test fun parseVarDeclarationAssignedADecimalLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "DECLIT", "EOF"),
                tokens(lexerForCode("var a = 1.23")))
    }
 
    @test fun parseVarDeclarationAssignedASum() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "PLUS", "INTLIT", "EOF"),
                tokens(lexerForCode("var a = 1 + 2")))
    }
 
    @test fun parseMathematicalExpression() {
        assertEquals(listOf("INTLIT", "PLUS", "ID", "ASTERISK", "INTLIT", "DIVISION", "INTLIT", "MINUS", "INTLIT", "EOF"),
                tokens(lexerForCode("1 + a * 3 / 4 - 5")))
    }
 
    @test fun parseMathematicalExpressionWithParenthesis() {
        assertEquals(listOf("INTLIT", "PLUS", "LPAREN", "ID", "ASTERISK", "INTLIT", "RPAREN", "MINUS", "DECLIT", "EOF"),
                tokens(lexerForCode("1 + (a * 3) - 5.12")))
    }
}
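
The tests go under src/test/kotlin (assuming the standard Kotlin source set layout) and can be run from the command line with:

./gradlew test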

Conclusions and next steps

We started with the first small step: we set up the project and built the lexer.

There is a long way to go before the language is usable in practice, but we have started. Next we will work on the parser with the same approach: building something simple that we can test and compile from the command line.

Federico Tomassetti

Federico has a PhD in Polyglot Software Development. He is fascinated by all forms of software development with a focus on Model-Driven Development and Domain Specific Languages.

Comments
Igor Ganapolsky
7 years ago

Thank you for this introduction to lexers. I am trying to find the link to the next post in your blog series. Where is it located?

Federico Tomassetti
7 years ago

Hi Igor, you are welcome. Here is the 8th post of the series; it has the links to all the previous posts at the top:
https://tomassetti.me/generating-bytecode/

Also, I have reworked, expanded and updated this series, and written a book about building languages:
https://leanpub.com/create_languages
