This post discusses the difficulties of writing an Antlr grammar that does not remove spaces from the text being parsed, and one way to solve them. But first, a brief explanation of what Antlr really is.
Most Antlr grammars remove the whitespace in the lexer like this:
: [ \r\n\t]+ -> skip
This definitely makes parsing somewhat easier as it transforms this expression:
q1 + q2 / f2 - ( 4 + 5 )
into a much simpler form:
q1+q2/f2-(4+5)
Without these spaces it is much easier to write the rules for the Antlr grammar.
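To illustrate, here is a minimal sketch of a grammar that relies on whitespace skipping. The grammar name SkipSpaces and the rule names prog, expr, ID, INT and WS are assumptions made for this example; they do not come from the grammar discussed in this post.

```antlr
grammar SkipSpaces;   // hypothetical grammar name

// Entry point (hypothetical): one expression followed by end of input.
prog : expr EOF ;

// The parser rules never mention whitespace,
// because the lexer has already dropped it.
expr
    : expr ('*' | '/') expr     // listed first, so it binds tighter
    | expr ('+' | '-') expr
    | '(' expr ')'
    | ID
    | INT
    ;

ID  : [a-zA-Z] [a-zA-Z0-9]* ;   // identifiers such as q1, f2
INT : [0-9]+ ;
WS  : [ \r\n\t]+ -> skip ;      // whitespace never reaches the parser
```

With a lexer rule like WS in place, both forms of the expression above produce the same token stream, which is why the parser rules can stay this short.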
But in some cases it is not possible to remove all spaces, because they may be part of a variable name. In the case I was working on, variable names were enclosed in single quotes, like this: ‘Variable 1’, and spaces were part of the name. The grammar had to take into account that a space could be part of a variable name, but could also be a “useless” character that the user put in for readability, like a space after a comma. Special handling of spaces is required to parse this expression:
'Variable ABC' * 5 + 'Variable Q'*( 1 + 'Variable W')
The idea is as follows: if we cannot remove spaces from the text, we have to “pull” them into the parentheses, operators, etc.
The rules for the operators and parentheses must change to pull in spaces. Firstly, we need to declare the tokens for spaces:
SINGLE_SPACE: ' ';
ANY_SPACE, which the rules below use, stands for a run of one or more spaces. It cannot be defined as SINGLE_SPACE*, because the Antlr tool correctly reports that SINGLE_SPACE* can match an empty string.
Then incorporate these tokens into the rules for operators and separators (MUL is '*', DIV is '/', and rParameterSeparator is for the comma that separates the actual parameters of a function):
rLeftVarParenthesis: L_VAR_PAREN ANY_SPACE | L_VAR_PAREN;
rArithmeticOperatorMulDiv: ANY_SPACE (MUL | DIV) ANY_SPACE | ANY_SPACE (MUL | DIV) | (MUL | DIV) ANY_SPACE | (MUL | DIV);
rParameterSeparator: ANY_SPACE ',' ANY_SPACE | ANY_SPACE ',' | ',' ANY_SPACE | ',';
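Putting the pieces together, a minimal sketch of a complete grammar might look like the one below. It is not the grammar from the project. The names SpacedVariables, rStart, rExpression, rRightVarParenthesis, rArithmeticOperatorAddSub, R_VAR_PAREN, PLUS, MINUS, NUMBER and VARIABLE are assumptions, as are the definitions of L_VAR_PAREN as '(' and of ANY_SPACE as SINGLE_SPACE+. The sketch also writes the space handling with the ? operator, which in a parser rule should be equivalent to spelling out the alternatives with and without ANY_SPACE.

```antlr
grammar SpacedVariables;   // hypothetical grammar name

// Entry point (hypothetical): spaces may also surround the whole expression.
rStart : ANY_SPACE? rExpression ANY_SPACE? EOF ;

rExpression
    : rExpression rArithmeticOperatorMulDiv rExpression
    | rExpression rArithmeticOperatorAddSub rExpression
    | rLeftVarParenthesis rExpression rRightVarParenthesis
    | VARIABLE
    | NUMBER
    ;

// Operators pull in the spaces on both sides.
rArithmeticOperatorMulDiv : ANY_SPACE? (MUL | DIV) ANY_SPACE? ;
rArithmeticOperatorAddSub : ANY_SPACE? (PLUS | MINUS) ANY_SPACE? ;

// A left parenthesis pulls in the spaces after it,
// a right parenthesis the spaces before it.
rLeftVarParenthesis  : L_VAR_PAREN ANY_SPACE? ;
rRightVarParenthesis : ANY_SPACE? R_VAR_PAREN ;

L_VAR_PAREN : '(' ;   // assumed definition
R_VAR_PAREN : ')' ;   // assumed definition
MUL   : '*' ;
DIV   : '/' ;
PLUS  : '+' ;
MINUS : '-' ;
NUMBER : [0-9]+ ;

// A quoted variable name: the spaces between the quotes stay inside the token.
VARIABLE : '\'' ~['\r\n]+ '\'' ;

// One or more spaces outside of quotes. SINGLE_SPACE is declared as a
// fragment here (an assumption) so that a lone space still becomes ANY_SPACE.
ANY_SPACE : SINGLE_SPACE+ ;
fragment SINGLE_SPACE : ' ' ;
```

This sketch leaves out function calls, so rParameterSeparator does not appear in it, and it only treats the space character as whitespace; tabs and line breaks would need similar handling.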
This simple and elegant solution allows for parsing variable names with spaces.