
Writing a Mathematical Expression Parser

This article continues a theme started in a prior one on parsing text into structured data. The concepts introduced in that article will be assumed and not explained here. The reader unfamiliar with text parsing is encouraged to read that one first before continuing with this article.

In high school my math classes required the use of a graphing calculator, and I spent four years graphing and evaluating the likes of sin(x + 2) and 2 ^ x * (3 + y). How the calculator was able to take what I entered and correctly give answers for particular values of x and y always mystified me.

Of course now it is pretty obvious that the calculator had a built-in parser that understood well-formed mathematical expressions, but it must have had more than that as the output of a parser (an AST) does not reduce to a concrete number. It must have been paired with another program that could traverse the AST, do substitutions, execute common math functions, and combine all the pieces into the correct answer.

This article will lay bare the kernel of how graphing calculators execute mathematical expressions. We will create a grammar that represents a correct expression, write the parser to transform its textual representation into an AST, and then create an interpreter that will evaluate it for given values of variables.

Grammar

The following is a representative sample of the kinds of expressions our parser should be able to accept:

4 + 3 - 2 * 9
-3 * x ^ 2
(1 + y) * 4.1
sin PI
fn(3, x + 1, sin PI / 2)

The expressions are fairly typical of what a person would expect to type into a calculator. The usual order of operations is observed; the standard addition, subtraction, multiplication, division, and exponentiation operators are allowed; numbers can be negative, integers, or floats; variables act as placeholders; and function calls can be for common built-in functions or user-defined functions.

Implementing most of these is relatively straightforward, but by far the most complicated aspect of this project will be functions. Parentheses are optional when there is only one argument, so both sin PI and sin(PI) are allowed. More complicated functions might take multiple arguments, in which case parentheses are required. This allows writing functions like log(x, 10) to take a base-10 logarithm of some variable x. Functions can be applied to a (single) multiplication or exponentiation expression without grouping parentheses, while addition will require them: sin PI / 2 will be evaluated as sin(PI / 2), while sin PI + 2 would be (sin PI) + 2 unless explicitly written as sin(PI + 2).

A naïve starting point for the grammar might be

EXPR
  : EXPR '+' EXPR
  | EXPR '-' EXPR
  | EXPR '*' EXPR
  | EXPR '/' EXPR
  | EXPR '^' EXPR
  | '(' EXPR ')'
  | number
  | variable
  ;

It seems to account for most of the operations we want to handle (except function calls), allows for parenthesized expressions, and can recursively terminate whenever numbers or variables are encountered.

Unfortunately, this grammar has one crucial problem: ambiguity.

Given the expression 4 + x * 3 (which we know multiplies first and adds second) our grammar could produce one of two possible ASTs based on how its rules are expanded:

    +              EXPR + EXPR
   / \      ->     EXPR + (EXPR * EXPR)
  4   *     ->     number + (variable * number)
     / \
    x   3

      *            EXPR * EXPR
     / \    ->     (EXPR + EXPR) * EXPR
    +   3   ->     (number + variable) * number
   / \
  4   x

While the top expansion is correct, the grammar has no reason to decide to expand that way over the bottom one. All of the grammar rules have the same level of precedence and so are indistinguishable from one another in terms of picking the one that will preserve the correct order of operations.

To get around this some parsers (usually parser generators) allow for operator precedence to be defined in the grammar, which acts as a heuristic to the parser in choosing the grammar rule that preserves the desired order of operations. The other way (and the approach we will take) is to restructure the grammar so that the rules themselves indicate the operator precedence.

This means we will add a number of additional rules (one for each level of operator precedence) between our topmost EXPR rule and our lowermost number/variable rules. Each rule takes on a higher level of precedence the closer it gets to the bottom, so we will start with addition and proceed downward to exponentiation before hitting our terminal rules.

Our starting grammar will look like this:

EXPRESSION
  : ADDITION
  ;

ADDITION
  : ADDITION ('+' | '-') MULTIPLICATION
  | MULTIPLICATION
  ;

MULTIPLICATION
  : MULTIPLICATION ('*' | '/') EXPONENTIATION
  | EXPONENTIATION
  ;

EXPONENTIATION
  : EXPONENTIATION '^' BASIC
  | BASIC
  ;

BASIC
  : number
  | identifier
  | '"' EXPRESSION '"'
  | '(' EXPRESSION ')'
  ;

To explain some of the terminal rules under BASIC: we will define identifier to be any alphabetic string not surrounded by quotation marks (so it looks like an ordinary variable name in most programming languages). The EXPRESSION between two quotation marks will be useful for user-defined functions, which must supply an expression for that function to evaluate (e.g., a user-defined integration function would indicate the expression being integrated over as a quotation-surrounded expression).

The only thing this grammar lacks is the ability to call functions. As stated above, parentheses are only necessary when a function is applied to addition expressions (like sin(x + 1)) but optional when applied to multiplication or exponentiation ones. In other words function calls have a precedence level, too, and it happens to be between addition and multiplication.

Our final grammar looks like:

EXPRESSION
  : ADDITION
  ;

ADDITION
  : ADDITION ('+' | '-') CALL
  | CALL
  ;

CALL
  : MULTIPLICATION
  | identifier CALL
  | identifier '(' EXPRESSION [',' EXPRESSION]* ')'
  ;

MULTIPLICATION
  : MULTIPLICATION ('*' | '/') EXPONENTIATION
  | EXPONENTIATION
  ;

EXPONENTIATION
  : EXPONENTIATION '^' BASIC
  | BASIC
  ;

BASIC
  : number
  | identifier
  | string
  | '(' EXPRESSION ')'
  ;

The CALL rule is a little complex, but essentially what it does is:

  1. Falls through to MULTIPLICATION in the case that no functions are called.
  2. Allows for multiple function calls to be right associatively stacked (in the case of something like sin sin 4).
  3. Requires at least one argument (with or without parentheses) but permits multiple comma-separated arguments wrapped in parentheses.

One of the most important elements of this grammar is associativity. The left recursion of ADDITION means that a series of additions and subtractions will be left associative (preserving the left-to-right order in which we normally evaluate such expressions by hand).

Conversely, the right recursion of CALL ensures that multiple function calls associate to the right. This means that sin tan cos PI, while a little silly, will evaluate as expected (cos first, then tan, then finally sin).
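To make the associativity concrete, here is an informal hand expansion of the rules (an illustration only, not part of the grammar) for 8 - 3 - 2 and sin tan cos PI:

ADDITION
  -> ADDITION '-' CALL
  -> (ADDITION '-' CALL) '-' CALL
  -> ((8) '-' (3)) '-' (2)              i.e. (8 - 3) - 2

CALL
  -> identifier CALL
  -> identifier (identifier CALL)
  -> sin (tan (cos PI))                 i.e. cos is applied first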

Now that our grammar is done, we can move on to writing the code.

Tokenizer

The tokenizer will be the same code used in the JSON parser. We will, though, specify the following as our grammar's tokens:

const tokens = [
  [/^\s+/, null],
  [/^-?\d+(?:\.\d+)?/, 'NUMBER'],
  [/^[a-zA-Z]+/, 'IDENT'],
  [/^"[^"]+"/, 'STRING'],
  [/^\+/, '+'],
  [/^-/, '-'],
  [/^\*/, '*'],
  [/^\^/, '^'],
  [/^\//, '/'],
  [/^\(/, '('],
  [/^\)/, ')'],
  [/^,/, ','],
];

Numbers are a bit complicated since there are so many different ways of writing them. They can be negative (-4), they can be integers or decimals (4 or 5.3), and decimals must begin with a digit (0.3 or 12.5, never .3). There are still more ways to write numbers, but this handles enough of the cases to feel fairly comprehensive.

Anything that is text is an identifier. This means a function call is recognized entirely by its syntactic location in an expression since it will be tokenized as an identifier just like a variable. In fact, as we will see when we get to the evaluation stage, there is significant overlap between a function and a variable, anyway.
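Before moving on, here is a quick sketch of the token stream for a small input, assuming (as in the JSON parser article) that the tokenizer's next() skips matches whose type is null and returns objects with type and token properties (the shape the parser below relies on):

const tokenizer = new Tokenizer(tokens);
tokenizer.read('sin(x + 1)');

// Each call to next() returns the next non-whitespace token.
tokenizer.next(); // { type: 'IDENT', token: 'sin' }
tokenizer.next(); // { type: '(', token: '(' }
tokenizer.next(); // { type: 'IDENT', token: 'x' }
tokenizer.next(); // { type: '+', token: '+' }
tokenizer.next(); // { type: 'NUMBER', token: '1' }
tokenizer.next(); // { type: ')', token: ')' }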

Parser

As this is a hand-written parser, most of the code must be ad hoc to this project. Only the barest skeleton can be ported over from the JSON parser.

class Parser {
  #tokenizer;
  #lookahead;

  constructor(tokenizer) {
    this.#tokenizer = tokenizer;
  }

  read(string) {
    this.#tokenizer.read(string);
    this.#lookahead = this.#tokenizer.next();
    return this.#EXPRESSION();
  }

  #eat(...tokenTypes) {
    const token = this.#lookahead;

    if (!token) {
      throw new Error(
        `Unexpected end of input; expected ${tokenTypes.join(' or ')}`
      );
    }

    if (!tokenTypes.includes(token.type)) {
      throw new Error(
        `Expected ${tokenTypes.join(' or ')} but received ${token.type}`
      );
    }

    this.#lookahead = this.#tokenizer.next();

    return token;
  }

  #is(...tokenTypes) {
    return tokenTypes.includes(this.#lookahead?.type);
  }
}

The #is method allows us to peek at the lookahead token and check its type without consuming it or throwing errors if the type does not match those passed in as arguments. It simplifies some of the logic when we have to check if a token is one of many possible types.

As is common for recursive descent parsers, we will add methods for each of the rules in our grammar. This gives the skeleton

class Parser {
  // ... properties/methods from above

  #EXPRESSION() {}

  #ADDITION() {}

  #CALL() {}

  #MULTIPLICATION() {}

  #EXPONENTIATION() {}

  #BASIC() {}
}

Some of these methods are quite simple and very similar to each other, so we will add the contents of those first.

class Parser {
  // ... properties/methods from above

  #EXPRESSION() {
    return this.#ADDITION();
  }

  #ADDITION() {
    let left = this.#CALL();

    while (this.#is('+', '-')) {
      left = {
        type: 'binary',
        left,
        op: this.#eat('+', '-').type,
        right: this.#CALL(),
      };
    }

    return left;
  }

  #CALL() {}

  #MULTIPLICATION() {
    let left = this.#EXPONENTIATION();

    while (this.#is('*', '/')) {
      left = {
        type: 'binary',
        left,
        op: this.#eat('*', '/').type,
        right: this.#EXPONENTIATION(),
      };
    }

    return left;
  }

  #EXPONENTIATION() {
    let left = this.#BASIC();

    while (this.#is('^')) {
      left = {
        type: 'binary',
        left,
        op: this.#eat('^').type,
        right: this.#BASIC(),
      };
    }

    return left;
  }

  #BASIC() {}
}

As is also common with recursive descent parsing, the implementation of these left-recursive rules takes place in an iterative while loop, since a literal recursive translation would loop forever and end in a call-stack-size-exceeded error. These rules do, however, still maintain left associativity by transforming successively rightward pieces of the expression at that operator precedence into successively more nested nodes inside the primary AST node.
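To see what the loop produces, parsing 10 - 4 - 3 with #ADDITION yields a node for (10 - 4) after the first iteration; the second iteration wraps that node as the left side of a new one, giving the left-associative tree:

// AST for 10 - 4 - 3: the earlier subtraction is nested deepest,
// so the expression is effectively (10 - 4) - 3.
{
  type: 'binary',
  left: {
    type: 'binary',
    left: { type: 'number', value: '10' },
    op: '-',
    right: { type: 'number', value: '4' },
  },
  op: '-',
  right: { type: 'number', value: '3' },
};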

The other two methods are very different in their implementation but, due to being broken down into multiple cases, are still fairly straightforward:

class Parser {
  // ... properties/methods from above

  #CALL() {
    const maybeCallee = this.#MULTIPLICATION();

    if (this.#is('NUMBER', 'IDENT')) {
      return {
        type: 'call',
        fn: maybeCallee.value,
        args: [this.#CALL()],
      };
    }

    if (this.#is('(')) {
      this.#eat('(');
      const args = [this.#EXPRESSION()];

      while (this.#is(',')) {
        this.#eat(',');
        args.push(this.#EXPRESSION());
      }

      this.#eat(')');
      return { type: 'call', fn: maybeCallee.value, args };
    }

    return maybeCallee;
  }

  #BASIC() {
    if (this.#is('(')) {
      this.#eat('(');
      const expr = this.#EXPRESSION();
      this.#eat(')');

      return expr;
    }

    if (this.#is('NUMBER')) {
      return { type: 'number', value: this.#eat('NUMBER').token };
    }

    if (this.#is('IDENT')) {
      return { type: 'ident', value: this.#eat('IDENT').token };
    }

    if (this.#is('STRING')) {
      // Remove containing quotation marks
      const expr = this.#eat('STRING').token.slice(1, -1);

      // Test that the sub-expression is in fact a valid expression
      new Parser(new Tokenizer(tokens)).read(expr);

      return {
        type: 'expr',
        value: expr,
      };
    }

    throw new Error('Malformed expression.');
  }
}

The CALL implementation first generates an expression by falling through the MULTIPLICATION rule. Then it decides what to do with what it just generated. If the next token is a number, identifier, or opening parenthesis we will transform that expression into a call expression. Otherwise we return whatever had been parsed under the assumption that it was a plain expression of some kind.

It is worth noting at this point that the CALL implementation has a couple of holes in it. The way it is written allows certain invalid constructions like 4 4 to be successfully parsed (specifically as a function call of a function named 4 with an argument of 4). If this were to be used in production, this rule would need to be hardened against cases like these, but doing so here would add complexity to the code that is not worth the trouble.
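As an illustration only, one possible way to close that particular hole (a sketch, not the implementation this article continues with) is to build a 'call' node only when the just-parsed expression is a bare identifier:

  #CALL() {
    const maybeCallee = this.#MULTIPLICATION();
    // Only identifiers can name functions, so 4 4 no longer parses as a call.
    const isCallable = maybeCallee.type === 'ident';

    if (isCallable && this.#is('NUMBER', 'IDENT')) {
      return {
        type: 'call',
        fn: maybeCallee.value,
        args: [this.#CALL()],
      };
    }

    if (isCallable && this.#is('(')) {
      this.#eat('(');
      const args = [this.#EXPRESSION()];

      while (this.#is(',')) {
        this.#eat(',');
        args.push(this.#EXPRESSION());
      }

      this.#eat(')');
      return { type: 'call', fn: maybeCallee.value, args };
    }

    return maybeCallee;
  }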

#BASIC, being the method furthest down the grammar, deals with the expressions that have the highest precedence. Of course the parenthesized expressions belong in this list, but so do numbers and function names and variables (since these are the most fundamental units of any mathematical expression).

One thing to note about strings is that we require them to be valid expressions themselves, not just any old string. That is why we construct a new parser and attempt to .read() the string's contents. If that line executes without erroring, we know the string represents a valid expression, and we store it in a node still as a string (we will see why later on when we evaluate the AST).
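For example, given the code above, an outer expression whose quoted sub-expression is malformed fails immediately at parse time rather than later during evaluation:

// The outer call is syntactically fine, but "x +" is not a valid expression,
// so the nested read() throws before any AST is returned.
new Parser(new Tokenizer(tokens)).read('sigma("x +", 1, 5)');
// -> Error: Malformed expression.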

Lastly, the #BASIC rule is where we will abort the whole procedure if we attempt to parse a malformed expression like 1 +.

Examples

With the parser completed we will generate a couple example ASTs that we will evaluate later.

// 1 + 2
{
  type: 'binary',
  left: { type: 'number', value: '1' },
  op: '+',
  right: { type: 'number', value: '2' },
};

// 1 + 2 * x
{
  type: 'binary',
  left: { type: 'number', value: '1' },
  op: '+',
  right: {
    type: 'binary',
    left: { type: 'number', value: '2' },
    op: '*',
    right: { type: 'ident', value: 'x' },
  },
};

// sin 3 * x
{
  type: 'call',
  fn: 'sin',
  args: [{
    type: 'binary',
    left: { type: 'number', value: '3' },
    op: '*',
    right: { type: 'ident', value: 'x' },
  }],
};

// sigma("x", 1, 5)
{
  type: 'call',
  fn: 'sigma',
  args: [
    { type: 'expr', value: 'x' },
    { type: 'number', value: '1' },
    { type: 'number', value: '5' },
  ],
};

Evaluator

Now that we can generate an AST for an expression, it is time to walk the tree and reduce it to a concrete number. Each node will be handled in a way idiomatic to its type property, and the evaluator will walk down the tree to the deepest nodes, evaluate those first, and then move up the tree combining the previous results together until we have fully evaluated the expression.

In general this process is quite simple. The greatest complication is that variables can take on any value and the AST can have 'call' nodes for user-defined functions. The way we will handle this is the same way that implementations of programming languages handle variables: we will create an environment record that stores the concrete values of variables (specified at evaluation time). This record will serve double duty, too, by allowing users to specify both variables as well as the implementation of custom functions.

Specifically, we will have two environment records: a global and a local one. The global one will hold the common mathematical functions (like sin) so that users do not have to worry about specifying those themselves. The local one will store all user-supplied variables and custom functions (like sigma above).

So, how will the evaluator actually work? Given a node, it will perform the following behavior based on its type:

  1. 'number': cast the string value of the number to a numeric value
  2. 'ident': retrieve the value of the variable from the environment record
  3. 'call': retrieve the function from the environment record and apply it to the specified arguments
  4. 'binary': evaluate the binary operator against the left and right expressions
  5. 'expr': compile the expression (still represented as a string) into a function that can be evaluated later

Since half of the implementation relies on the environment record, let us get that written first:

class Environment {
  #fields;
  #parent;

  constructor(fields, parent) {
    this.#fields = fields;
    this.#parent = parent;
  }

  lookup(ident) {
    if (ident in this.#fields) {
      return this.#fields[ident];
    }

    if (this.#parent) {
      return this.#parent.lookup(ident);
    }

    throw new Error(`Unspecified identifier "${ident}"`);
  }
}

Environment records can be linked together via the #parent field and our evaluator will look up the chain for a particular identifier until the highest record is checked. Since the goal is to produce a number, unspecified values will be fatal as the evaluator will simply have no idea what possible number could result.
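A small usage sketch (with made-up fields) shows the chain in action:

const parent = new Environment({ PI: Math.PI });
const local = new Environment({ x: 2 }, parent);

local.lookup('x');  // 2, found directly in the local record
local.lookup('PI'); // 3.141592653589793, found by deferring to the parent
local.lookup('y');  // throws: Unspecified identifier "y"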

This environment record will then get passed to the evaluator:

const binaryEval = (op, left, right) => {
  if (op === '+') {
    return left + right;
  }

  if (op === '-') {
    return left - right;
  }

  if (op === '*') {
    return left * right;
  }

  if (op === '/') {
    return left / right;
  }

  if (op === '^') {
    return left ** right;
  }
};

const evaluate = (node, env) => {
  switch (node.type) {
    case 'binary': {
      const left = evaluate(node.left, env);
      const right = evaluate(node.right, env);
      return binaryEval(node.op, left, right);
    }

    case 'call': {
      const fn = env.lookup(node.fn);
      const args = node.args.map(arg => evaluate(arg, env));
      return fn.apply(null, args);
    }

    case 'ident':
      return env.lookup(node.value);

    case 'number':
      return +node.value;

    case 'expr': {
      // will implement below
    }
  }
};

Aside from being able to evaluate 'expr' nodes, this comprises the meat of all the work required to go from mathematical expression to a concrete number, but it is a little clunky to work with directly. We will add some code that will allow us to compile a string to a function which, when executed with an object representing all user-supplied values, will return the desired number.

We also need to seed our global environment record with some out-of-the-box functions so that our users do not have to supply every piece of information.

const globalEnv = new Environment({
  sin: Math.sin,
  cos: Math.cos,
  tan: Math.tan,
  log: (n, base) => Math.log(n) / Math.log(base),
});

const compile = (string) => {
  const parser = new Parser(new Tokenizer(tokens));
  const ast = parser.read(string);

  return (idents) => evaluate(ast, new Environment(idents, globalEnv));
};

The function returned by compile accepts an object whose keys are variables or custom functions and values are numbers or function implementations. When called, the function wraps that object into an environment instance (linked to the global environment) and immediately evaluates the AST.

Since 'expr' nodes are themselves expressions (albeit nested inside another one), the implementation we will use for their evaluations will build off the compile function listed above.

const evaluate = (node, env) => {
  switch (node.type) {
    // ... other code
    case 'expr':
      return compile(node.value);
  }
};

Whenever an 'expr' node is encountered it will be compiled into a function but not evaluated further. As these nodes are intended to be used inside a custom function, it is up to that function to utilize it in some way.

Given the example expressions listed above, here are their evaluations.

const fn1 = compile('1 + 2');
fn1(); // 3
fn1({ x: 4 }); // 3; specifying unused variables affects nothing

const fn2 = compile('1 + 2 * x');
fn2({ x: -1 }); // -1
fn2({ x: 3, y: 7 }); // 7

const fn3 = compile('sin 3 * x');
fn3({ x: 4 }); // -0.5365729180004349
fn3({ x: 3 }); // 0.4121184852417566

const fn4 = compile('sigma("x", 1, 5)');
fn4({
  sigma: (eq, low, hi) => {
    let sum = 0;

    for (let x = low; x <= hi; x++) {
      sum += eq({ x });
    }

    return sum;
  },
}); // 15

In the fourth example, eq represents the compiled expression "x" and so must be called (with a particular value of x supplied, since it is an identifier) in order to get a concrete number. This means the implementation of a custom function does have to know which variables its expression requires; changing the first argument of sigma to "y", for instance, would require updating the implementation to call eq({ y }) instead.
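As one more illustration of how an 'expr' argument might be used (a hypothetical integral function, not part of the code above), the custom function receives the compiled sub-expression and chooses which values of x to feed it:

// A crude left Riemann sum over x from a to b, using the compiled "x ^ 2".
const fn5 = compile('integral("x ^ 2", 0, 1)');
fn5({
  integral: (eq, a, b) => {
    const steps = 1000;
    const dx = (b - a) / steps;
    let area = 0;

    for (let i = 0; i < steps; i++) {
      area += eq({ x: a + i * dx }) * dx;
    }

    return area;
  },
}); // ≈ 0.3328, approaching the exact value of 1/3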

And with this we have successfully implemented a graphing calculator! (Minus, of course, the actual graphing part.) A very simple grammar, a lightweight parser, and a tree-walking evaluator are all it took to be able to write and execute mathematical expressions.

Conclusion

There are a lot of elements that go into making a successful mathematical expression evaluator. We covered the basics, but there is always more room to grow. Some areas to explore as an exercise:

  1. Allowing for numbers in scientific notation (a possible tokenizer tweak is sketched after this list)
  2. Allowing for shorthand multiplication like 2x instead of requiring 2 * x
  3. Hardening the grammar to not allow constructions like 4 4
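For the first item, one possible approach (a sketch, not a definitive answer) is to extend the NUMBER token's regular expression with an optional exponent; the evaluator's +node.value cast already understands this form:

// Accepts 1e5, 2.5e-3, or 3E+8 in addition to the forms handled previously.
[/^-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?/, 'NUMBER'],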

This Gist contains all the code from the article.
