On the need for a generic library around ANTLR: using reflection to build a metamodel

I am a Language Engineer: I use several tools to define and process languages. Among other tools I use ANTLR: it is simple, it is flexible, I can build things around it.

However I find myself rebuilding similar tools around ANTLR for different projects. I see two problems with that:

  • ANTLR is a very good building block but with ANTLR alone not much can be done: the value lies in the processing we can do on the AST and I do not see an ecosystem of libraries around ANTLR
  • ANTLR does not produce a metamodel of the grammar: without one, it becomes very difficult to build generic tools around ANTLR

Let me explain that:

  • For people with experience with EMF: we basically need an Ecore-equivalent for each grammar.
  • For the others: read the next section

Why we need a metamodel

Suppose I want to build a generic library to produce an XML file or a JSON document from an AST produced by ANTLR. How could I do that?

Well, given a ParserRuleContext I can take the rule index and use it to look up the rule name. I have generated the parser for the Python grammar to have some examples, so let’s see how to do that with an actual class:

Python3Parser.Single_inputContext astRoot = pythonParse( code...);
String ruleName = Python3Parser.ruleNames[astRoot.getRuleIndex()];

Let’s look at the class Single_inputContext:

public static class Single_inputContext extends ParserRuleContext {
    public TerminalNode NEWLINE() { return getToken(Python3Parser.NEWLINE, 0); }
    public Simple_stmtContext simple_stmt() {
        return getRuleContext(Simple_stmtContext.class,0);
    }
    public Compound_stmtContext compound_stmt() {
        return getRuleContext(Compound_stmtContext.class,0);
    }
    public Single_inputContext(ParserRuleContext parent, int invokingState) {
        super(parent, invokingState);
    }
    @Override public int getRuleIndex() { return RULE_single_input; }
    public void enterRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterSingle_input(this);
    }
    public void exitRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitSingle_input(this);
    }
}

I should obtain something like this:

<Single_input NEWLINES="...">

Good. It is very easy for me to look at the class and recognize these elements. But how can I do that automatically?

Reflection, obviously, you will think.
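A minimal sketch of that reflective lookup could look like the following. Because the real context classes depend on the ANTLR runtime, the sketch uses small stand-in classes (TerminalNode, ParserRuleContext and three context classes shaped like the generated ones). AccessorScanner and its method names are hypothetical, not part of ANTLR. The assumption is that zero-argument public methods returning a token would map to attributes, while those returning another context would map to child elements.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-ins for the ANTLR runtime types, so that the sketch compiles on its own.
// With the real runtime you would use org.antlr.v4.runtime.ParserRuleContext
// and org.antlr.v4.runtime.tree.TerminalNode instead.
class TerminalNode {}
class ParserRuleContext {
    public int getRuleIndex() { return -1; }
}

// Hypothetical context classes shaped like the ones ANTLR generates.
class Simple_stmtContext extends ParserRuleContext {}
class Compound_stmtContext extends ParserRuleContext {}
class Single_inputContext extends ParserRuleContext {
    public TerminalNode NEWLINE() { return null; }
    public Simple_stmtContext simple_stmt() { return null; }
    public Compound_stmtContext compound_stmt() { return null; }
    @Override public int getRuleIndex() { return 0; }
}

// Hypothetical helper: finds the accessors declared by a context class.
class AccessorScanner {
    // Zero-argument accessors returning a single token (candidate attributes).
    static List<String> tokenAccessors(Class<?> contextClass) {
        return accessorsReturning(contextClass, TerminalNode.class);
    }

    // Zero-argument accessors returning a single child context (candidate children).
    static List<String> ruleAccessors(Class<?> contextClass) {
        return accessorsReturning(contextClass, ParserRuleContext.class);
    }

    private static List<String> accessorsReturning(Class<?> contextClass, Class<?> expected) {
        List<String> names = new ArrayList<>();
        // getDeclaredMethods() skips methods inherited from ParserRuleContext.
        for (Method m : contextClass.getDeclaredMethods()) {
            if (Modifier.isPublic(m.getModifiers())
                    && !m.isSynthetic()
                    && m.getParameterCount() == 0
                    && expected.isAssignableFrom(m.getReturnType())) {
                names.add(m.getName());
            }
        }
        // The order of getDeclaredMethods() is not guaranteed, so sort for stable output.
        Collections.sort(names);
        return names;
    }
}
```

Filtering on the declared methods and on the return type is what keeps getRuleIndex, enterRule and exitRule out of the result. List-valued accessors are not matched by this filter.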

Yes. That would work. However, what if we have multiple elements? Take this class:

public static class File_inputContext extends ParserRuleContext {
    public TerminalNode EOF() { return getToken(Python3Parser.EOF, 0); }
    public List<TerminalNode> NEWLINE() { return getTokens(Python3Parser.NEWLINE); }
    public TerminalNode NEWLINE(int i) {
        return getToken(Python3Parser.NEWLINE, i);
    }
    public List<StmtContext> stmt() {
        return getRuleContexts(StmtContext.class);
    }
    public StmtContext stmt(int i) {
        return getRuleContext(StmtContext.class,i);
    }
    public File_inputContext(ParserRuleContext parent, int invokingState) {
        super(parent, invokingState);
    }
    @Override public int getRuleIndex() { return RULE_file_input; }
    public void enterRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).enterFile_input(this);
    }
    public void exitRule(ParseTreeListener listener) {
        if ( listener instanceof Python3Listener ) ((Python3Listener)listener).exitFile_input(this);
    }
}

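Now the methods NEWLINE and stmt return lists. You may remember that generics and reflection generally do not work well together in Java, because type parameters are erased at runtime. In this case we are lucky: the generic signature of a method's return type is kept in the class file, so there is a solution: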

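import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
Class<?> clazz = Python3Parser.File_inputContext.class;
Method method = clazz.getMethod("stmt");
// Unlike getReturnType(), this keeps the type argument of the list
Type listType = method.getGenericReturnType();
if (listType instanceof ParameterizedType) {
    Type elementType = ((ParameterizedType) listType).getActualTypeArguments()[0];
    System.out.println("ELEMENT TYPE "+elementType);
}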

This will print:

ELEMENT TYPE class me.tomassetti.antlrplus.python.Python3Parser$StmtContext

So we can also cover generics. OK, using reflection is not ideal, but we can extract some information from there.

I am not 100% sure it will be enough but we can get started.
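To get started, the generic return type check could be wrapped into a small helper that tells single-valued accessors from list-valued ones, which is the information a metamodel needs for multiplicities. This is only a sketch: AccessorInfo and MultiplicityInspector are hypothetical names, and the context classes are stand-ins that mimic the shape of the generated code without depending on the ANTLR runtime.

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Stand-ins shaped like the generated classes (no ANTLR dependency).
class TerminalNode {}
class StmtContext {}
class File_inputContext {
    public TerminalNode EOF() { return null; }
    public List<TerminalNode> NEWLINE() { return null; }
    public TerminalNode NEWLINE(int i) { return null; }
    public List<StmtContext> stmt() { return null; }
    public StmtContext stmt(int i) { return null; }
}

// Hypothetical result type: what we learned about one accessor.
class AccessorInfo {
    final String name;
    final Class<?> elementType;
    final boolean many;

    AccessorInfo(String name, Class<?> elementType, boolean many) {
        this.name = name;
        this.elementType = elementType;
        this.many = many;
    }
}

// Hypothetical helper built around Method.getGenericReturnType().
class MultiplicityInspector {
    static AccessorInfo inspect(Class<?> contextClass, String accessorName) {
        Method method;
        try {
            // Passing no parameter types selects the zero-argument overload,
            // not the indexed variant such as stmt(int).
            method = contextClass.getMethod(accessorName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No zero-argument accessor " + accessorName, e);
        }
        Type returnType = method.getGenericReturnType();
        if (returnType instanceof ParameterizedType) {
            ParameterizedType parameterized = (ParameterizedType) returnType;
            Type argument = parameterized.getActualTypeArguments()[0];
            if (parameterized.getRawType() == List.class && argument instanceof Class) {
                // A List<X> accessor: multiplicity "many", element type X.
                return new AccessorInfo(accessorName, (Class<?>) argument, true);
            }
        }
        // Anything else is treated as a single-valued accessor.
        return new AccessorInfo(accessorName, method.getReturnType(), false);
    }
}
```

For the stand-in class above, inspecting stmt reports a list of StmtContext, while inspecting EOF reports a single TerminalNode.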

What should the metamodel look like?

To define metamodels I would not try to come up with anything fancy. I would use the classical schema which is at the base of EMF and is similar to what is available in MPS.

I would add a sort of container named Package or Metamodel. The Package would list several Entities. We could also mark one of those entities as the root Entity.

Each Entity would have:

  • a name
  • an optional parent Entity (from which it inherits properties and relations)
  • a list of properties
  • a list of relations

Each Property would have:

  • a name
  • a type chosen among the primitive types. In practice I expect to use just String and Integer, and possibly enums in the future
  • a multiplicity (1 or many)

Each Relation would have:

  • a name
  • the kind: containment or reference. For now the AST knows only about containment; however, later we could implement symbol resolution and model transformations, and at that stage we will need references
  • a target type: another Entity
  • a multiplicity (1 or many)
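The schema described above could be sketched in plain Java as follows. All the names here are hypothetical choices made for illustration. The container is called Metamodel rather than Package to avoid a clash with java.lang.Package, and MetamodelDemo builds by hand the entity that the Single_inputContext class shown earlier would correspond to.

```java
import java.util.ArrayList;
import java.util.List;

enum Multiplicity { ONE, MANY }
enum PrimitiveType { STRING, INTEGER }
enum RelationKind { CONTAINMENT, REFERENCE }

class Property {
    final String name;
    final PrimitiveType type;
    final Multiplicity multiplicity;

    Property(String name, PrimitiveType type, Multiplicity multiplicity) {
        this.name = name;
        this.type = type;
        this.multiplicity = multiplicity;
    }
}

class Relation {
    final String name;
    final RelationKind kind;
    final Entity target;
    final Multiplicity multiplicity;

    Relation(String name, RelationKind kind, Entity target, Multiplicity multiplicity) {
        this.name = name;
        this.kind = kind;
        this.target = target;
        this.multiplicity = multiplicity;
    }
}

class Entity {
    final String name;
    final Entity parent; // null when the entity has no parent
    final List<Property> properties = new ArrayList<>();
    final List<Relation> relations = new ArrayList<>();

    Entity(String name, Entity parent) {
        this.name = name;
        this.parent = parent;
    }

    // Own properties plus the ones inherited from the parent chain.
    List<Property> allProperties() {
        List<Property> all = parent == null ? new ArrayList<>() : parent.allProperties();
        all.addAll(properties);
        return all;
    }

    // Own relations plus the ones inherited from the parent chain.
    List<Relation> allRelations() {
        List<Relation> all = parent == null ? new ArrayList<>() : parent.allRelations();
        all.addAll(relations);
        return all;
    }
}

class Metamodel {
    final String name;
    final List<Entity> entities = new ArrayList<>();
    Entity root; // optional: the entity expected at the top of the AST

    Metamodel(String name) { this.name = name; }
}

class MetamodelDemo {
    // Hand-written fragment corresponding to the single_input rule.
    static Metamodel pythonFragment() {
        Entity simpleStmt = new Entity("Simple_stmt", null);
        Entity compoundStmt = new Entity("Compound_stmt", null);
        Entity singleInput = new Entity("Single_input", null);
        singleInput.properties.add(new Property("NEWLINE", PrimitiveType.STRING, Multiplicity.ONE));
        singleInput.relations.add(new Relation("simple_stmt", RelationKind.CONTAINMENT, simpleStmt, Multiplicity.ONE));
        singleInput.relations.add(new Relation("compound_stmt", RelationKind.CONTAINMENT, compoundStmt, Multiplicity.ONE));

        Metamodel metamodel = new Metamodel("Python3");
        metamodel.entities.add(singleInput);
        metamodel.entities.add(simpleStmt);
        metamodel.entities.add(compoundStmt);
        metamodel.root = singleInput;
        return metamodel;
    }
}
```

Keeping the multiplicity to just ONE and MANY mirrors the list above. Optionality is not modelled in this sketch, although the alternatives of a rule such as single_input would probably require it.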

Next steps

I would start by building a metamodel and later build generic tools that take advantage of it.

There are other things that I typically need:

  • transformations: the AST which I generally get from ANTLR is determined by how I am forced to express the grammar to obtain something parsable. Sometimes I also have to do some refactoring to improve performance. I want to transform the AST after parsing to obtain something closer to the logical structure of the language.
  • unmarshalling: from the AST I want to produce the text back
  • symbol resolution: this can be far from trivial, as I found out while building a symbol solver for Java

Yes, I know that some of you are thinking: just use Xtext. While I like EMF (Xtext is built on top of it), it has a steep learning curve and I have seen many people confused by it. I also do not like how OSGi plays with the non-OSGi world. Finally, Xtext comes with a lot of dependencies.

Do not get me wrong: I think Xtext is an amazing solution in a lot of contexts. However, there are clients who prefer a leaner approach. For the cases in which it makes sense we need an alternative. I think it can be built on top of ANTLR, but there is work to do.

By the way, years ago I built something similar for .NET and I called it NetModelingFramework.

Federico Tomassetti

Federico has a PhD in Polyglot Software Development. He is fascinated by all forms of software development with a focus on Model-Driven Development and Domain Specific Languages.
Scott Stanchfield
8 years ago

Have you looked into xText? It generates an EMF Meta-Model that represents the grammar as well as a nice IDE (In Eclipse or IDEA), and provides a great framework for code generation.

Check out

Scott Stanchfield
8 years ago

(I see you mention xText, but it’s not clear if you’ve actually tried it. It’s much easier to integrate with non-OSGi tools now, and you can skip the generated IDE if you don’t want or need it)

Federico Tomassetti
8 years ago

Hi Scott, thank you for your comment. Yes, in the conclusions I explain why sometimes Xtext is not the right solution for me or my clients. I know Xtext and love it. I have also interviewed one of the core committers on my blog. I have used it extensively. However, some people do not like it because it has too many dependencies, EMF is intricate, and OSGi makes it difficult to include stuff from Maven. Yes, there are workarounds, but in general it seems to be quite a heavy solution. I am looking into building a very thin layer…

