On the need of a generic library around ANTLR: using reflection to build a metamodel

Federico Tomassetti | May 27th, 2016 | Last Updated: May 26th, 2016


I am a Language Engineer: I use several tools to define and process languages. Among other tools I use ANTLR: it is simple, it is flexible, I can build things around it.

However I find myself rebuilding similar tools around ANTLR for different projects. I see two problems with that:

ANTLR is a very good building block, but with ANTLR alone not much can be done: the value lies in the processing we can do on the AST, and I do not see an ecosystem of libraries around ANTLR
ANTLR does not produce a metamodel of the grammar: without one it becomes very difficult to build generic tools around ANTLR

Let me explain that:

For people with experience with EMF: we basically need an Ecore-equivalent for each grammar.
For the others: read the next paragraph

Why we need a metamodel

Suppose I want to build a generic library to produce an XML file or a JSON document from an AST produced by ANTLR. How could I do that?

Well, given a ParserRuleContext I can take the rule index and look up the rule name. I have generated the parser for the Python grammar to have some examples, so let’s see how to do that with an actual class:

Python3Parser.Single_inputContext astRoot = pythonParse(...my code...);
String ruleName = Python3Parser.ruleNames[astRoot.getRuleIndex()];

Let’s look at the class Single_inputContext:

public static class Single_inputContext extends ParserRuleContext {
    public TerminalNode NEWLINE() { return getToken(Python3Parser.NEWLINE, 0); }
    public Simple_stmtContext simple_stmt() {
        return getRuleContext(Simple_stmtContext.class,0);
    }
    public Compound_stmtContext compound_stmt() {
        return getRuleContext(Compound_stmtContext.class,0);
    }
    public Single_inputContext(ParserRuleContext parent, int invokingState) {
        super(parent, invokingState);
    }
    @Override public int getRuleIndex() { return RULE_single_input; }
    @Override
    public void enterRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterSingle_input(this);
    }
    @Override
    public void exitRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitSingle_input(this);
    }
}

In this case I would like to:

use NEWLINE as an attribute
use simple_stmt and compound_stmt as children

I should obtain something like this:

<Single_input NEWLINE="...">
   <Simple_stmt>...</Simple_stmt>
   <Compound_stmt>...</Compound_stmt>
</Single_input>

Good. It is very easy for me to look at the class and recognize these elements. But how can I do that automatically?

Reflection, obviously, you will think.
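
A minimal sketch of what that reflection step could look like. To keep it compilable without the ANTLR jar, TerminalNode and ParserRuleContext below are empty stand-ins for the real runtime types, and SingleInputContext is a hypothetical class that only mimics the accessors of Single_inputContext. AccessorScanner and its method names are my own, not part of ANTLR.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-ins for ANTLR's runtime types (assumption: the real ones would be used instead).
class TerminalNode {}
class ParserRuleContext {}

// Hypothetical context classes mimicking the accessors of Single_inputContext.
class SimpleStmtContext extends ParserRuleContext {}
class CompoundStmtContext extends ParserRuleContext {}
class SingleInputContext extends ParserRuleContext {
    public TerminalNode NEWLINE() { return null; }
    public SimpleStmtContext simple_stmt() { return null; }
    public CompoundStmtContext compound_stmt() { return null; }
    public int getRuleIndex() { return 0; }
}

class AccessorScanner {
    // Zero-argument accessors returning a token: candidates for attributes.
    static List<String> tokenAccessors(Class<?> contextClass) {
        return accessorsReturning(contextClass, TerminalNode.class);
    }

    // Zero-argument accessors returning another rule context: candidates for children.
    static List<String> childAccessors(Class<?> contextClass) {
        return accessorsReturning(contextClass, ParserRuleContext.class);
    }

    private static List<String> accessorsReturning(Class<?> contextClass, Class<?> expected) {
        List<String> names = new ArrayList<>();
        // Only methods declared by the class itself, so inherited helpers are ignored.
        for (Method m : contextClass.getDeclaredMethods()) {
            if (Modifier.isPublic(m.getModifiers())
                    && m.getParameterCount() == 0
                    && expected.isAssignableFrom(m.getReturnType())) {
                names.add(m.getName());
            }
        }
        Collections.sort(names); // getDeclaredMethods() returns methods in no particular order
        return names;
    }
}
```

Calling AccessorScanner.tokenAccessors(SingleInputContext.class) picks out NEWLINE, while childAccessors picks out compound_stmt and simple_stmt. Methods such as getRuleIndex are skipped because their return type is neither a token nor a rule context.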

Yes. That would work. However, what happens when we have multiple elements? Take this class:

public static class File_inputContext extends ParserRuleContext {
    public TerminalNode EOF() { return getToken(Python3Parser.EOF, 0); }
    public List<TerminalNode> NEWLINE() { return getTokens(Python3Parser.NEWLINE); }
    public TerminalNode NEWLINE(int i) {
        return getToken(Python3Parser.NEWLINE, i);
    }
    public List<StmtContext> stmt() {
        return getRuleContexts(StmtContext.class);
    }
    public StmtContext stmt(int i) {
        return getRuleContext(StmtContext.class,i);
    }
    public File_inputContext(ParserRuleContext parent, int invokingState) {
        super(parent, invokingState);
    }
    @Override public int getRuleIndex() { return RULE_file_input; }
    @Override
    public void enterRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterFile_input(this);
    }
    @Override
    public void exitRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitFile_input(this);
    }
}

Now the methods NEWLINE and stmt return lists. You may remember that, in general, generics do not work so well with reflection in Java. In this case we are lucky because there is a solution:

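import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

Class<?> clazz = Python3Parser.File_inputContext.class;
// With no parameter types, getMethod selects the stmt() overload that returns the list
Method method = clazz.getMethod("stmt");
// getGenericReturnType() keeps the type argument that getReturnType() erases
Type listType = method.getGenericReturnType();
if (listType instanceof ParameterizedType) {
    Type elementType = ((ParameterizedType) listType).getActualTypeArguments()[0];
    System.out.println("ELEMENT TYPE " + elementType);
}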

This will print:

ELEMENT TYPE class me.tomassetti.antlrplus.python.Python3Parser$StmtContext

So we can also cover generics. OK, using reflection is not ideal, but we can extract some information from there.

I am not 100% sure it will be enough but we can get started.
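
As a rough sketch, the same check could be folded into a helper that tells us whether an accessor yields one element or many, which is the information a metamodel needs. The classes TerminalNode, StmtContext and FileInputContext below are stand-ins I made up so the example compiles without ANTLR, and ReturnTypeInspector is a hypothetical name.

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Stand-ins for ANTLR types and for the generated File_inputContext (assumptions).
class TerminalNode {}
class StmtContext {}
class FileInputContext {
    public TerminalNode EOF() { return null; }
    public List<StmtContext> stmt() { return null; }
    public StmtContext stmt(int i) { return null; }
}

class ReturnTypeInspector {
    // Describes the zero-argument accessor `name` as "one:<Type>" or "many:<ElementType>".
    static String describe(Class<?> owner, String name) {
        Method accessor;
        try {
            // No parameter types given, so the indexed overload stmt(int) is not selected.
            accessor = owner.getMethod(name);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No zero-argument accessor " + name, e);
        }
        Type returnType = accessor.getGenericReturnType();
        if (returnType instanceof ParameterizedType) {
            ParameterizedType parameterized = (ParameterizedType) returnType;
            Type element = parameterized.getActualTypeArguments()[0];
            // Only a List with a concrete class as type argument counts as "many".
            if (parameterized.getRawType() == List.class && element instanceof Class) {
                return "many:" + ((Class<?>) element).getSimpleName();
            }
        }
        return "one:" + accessor.getReturnType().getSimpleName();
    }
}
```

With these stand-ins, describe(FileInputContext.class, "stmt") gives "many:StmtContext" and describe(FileInputContext.class, "EOF") gives "one:TerminalNode". Wildcard or nested generic type arguments are deliberately not handled here and fall through to the "one" case.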

How should the metamodel look?

To define metamodels I would not try to come up with anything fancy. I would use the classical schema which is at the base of EMF and is similar to what is available in MPS.

I would add a sort of container named Package or Metamodel. The Package would list several Entities. We could also mark one of those Entities as the root Entity.

Each Entity would have:

a name
an optional parent Entity (from which it inherits properties and relations)
a list of properties
a list of relations

Each Property would have:

a name
a type chosen among the primitive types. In practice I expect to use just Strings and Integers. Possibly enums in the future
a multiplicity (1 or many)

Each Relation would have:

a name
the kind: containment or reference. For now the AST knows only about containments; however, later we could implement symbol resolution and model transformations, and at that stage we will need references
a target type: another Entity
a multiplicity (1 or many)
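
The structure described above could be sketched in plain Java roughly as follows. All class, enum and field names are my own choices for illustration, I use Metamodel rather than Package for the container to avoid clashing with java.lang.Package, and the small Python example at the end is filled in by hand rather than derived from the grammar.

```java
import java.util.ArrayList;
import java.util.List;

enum Multiplicity { ONE, MANY }
enum PrimitiveType { STRING, INTEGER }
enum RelationKind { CONTAINMENT, REFERENCE }

final class Property {
    final String name;
    final PrimitiveType type;
    final Multiplicity multiplicity;

    Property(String name, PrimitiveType type, Multiplicity multiplicity) {
        this.name = name;
        this.type = type;
        this.multiplicity = multiplicity;
    }
}

final class Relation {
    final String name;
    final RelationKind kind;
    final Entity target;
    final Multiplicity multiplicity;

    Relation(String name, RelationKind kind, Entity target, Multiplicity multiplicity) {
        this.name = name;
        this.kind = kind;
        this.target = target;
        this.multiplicity = multiplicity;
    }
}

final class Entity {
    final String name;
    final Entity parent; // null when the Entity has no parent
    final List<Property> properties = new ArrayList<>();
    final List<Relation> relations = new ArrayList<>();

    Entity(String name, Entity parent) {
        this.name = name;
        this.parent = parent;
    }

    // Own properties plus those inherited from the parent chain, ancestors first.
    List<Property> allProperties() {
        List<Property> all = parent == null ? new ArrayList<Property>() : parent.allProperties();
        all.addAll(properties);
        return all;
    }
}

final class Metamodel {
    final String name;
    final List<Entity> entities = new ArrayList<>();
    Entity root; // optional: the Entity expected at the top of the AST

    Metamodel(String name) { this.name = name; }
}

class MetamodelExample {
    // Hand-written description of File_input: many NEWLINE tokens, many contained stmt children.
    static Metamodel pythonSketch() {
        Metamodel mm = new Metamodel("Python3");
        Entity stmt = new Entity("Stmt", null);
        Entity fileInput = new Entity("File_input", null);
        fileInput.properties.add(new Property("NEWLINE", PrimitiveType.STRING, Multiplicity.MANY));
        fileInput.relations.add(new Relation("stmt", RelationKind.CONTAINMENT, stmt, Multiplicity.MANY));
        mm.entities.add(stmt);
        mm.entities.add(fileInput);
        mm.root = fileInput;
        return mm;
    }
}
```

A generic XML or JSON serializer could then walk the properties and relations of each Entity instead of inspecting the generated parser classes directly.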

Next steps

I would start by building a metamodel and later build generic tools that take advantage of it.

There are other things that I typically need:

transformations: the AST which I generally get from ANTLR is determined by how I am forced to express the grammar to obtain something parsable. Sometimes I also have to do some refactoring to improve performance. I want to transform the AST after parsing to obtain something closer to the logical structure of the language.
unmarshalling: from the AST I want to produce the text back
symbol resolution: this could be absolutely not trivial, as I have found out building a symbol solver for Java

Yes, I know that some of you are thinking: just use Xtext. While I like EMF (Xtext is built on top of it), it has a steep learning curve and I have seen many people confused by it. I also do not like how OSGi plays with the non-OSGi world. Finally, Xtext comes with a lot of dependencies.

Do not get me wrong: I think Xtext is an amazing solution in a lot of contexts. However, there are clients who prefer a leaner approach. For the cases in which it makes sense, we need an alternative. I think it can be built on top of ANTLR, but there is work to do.

By the way, years ago I built something similar for .NET and I called it NetModelingFramework.

Reference:

On the need of a generic library around ANTLR: using reflection to build a metamodel from our JCG partner Federico Tomassetti at the Federico Tomassetti blog.