Demystifying JSON.parse()

Ryan Dabler
Published in ITNEXT · Sep 8, 2022 · 14 min read



A fascinating aspect of software engineering is the transformation of a series of characters in a text file into a set of instructions which can be executed on a CPU to produce meaningful results. It is quite obvious to a person reading the code (well, maybe not quite so obvious, depending on the language) what the desired outcome is.

That is because people’s minds have developed to be able to efficiently structure a stream of characters into meaningful groups which appear in meaningful orders. This is what enables people to communicate and respond and is an activity that is often taken for granted.

It is another story entirely for computers. They have no mental processes that automatically structure content that can then be used to direct their operations; a stream of characters is just a stream of characters. For it to be anything else a computer must be taught how to understand it.

That this is no longer a difficult task (thanks to prior generations of computer scientists and engineers who trailblazed the area) is attested by the sheer number of programming languages, file formats, and data-deserialization methods that software engineers use every day to solve problems. Any engineer can add one more programming language, file format, or serialization technique to this list. The only real prerequisite is some knowledge of, and experience with, textual parsing.

Parsing

Parsing is essentially converting a sequence of characters (or more generally, of bits) into structured data. For the purposes of this article, it will be taken to deal with characters, although things like file formats deal with bits.

Simplistically put, the process of transforming a string into something structured and meaningful has two steps (sketched briefly after this list):

  1. lexical analysis: the stream of individual characters is converted into a stream of tokens (i.e., grouped characters)
  2. syntactic analysis: the stream of tokens is converted into an abstract syntax tree (a data structure that expresses the relationships between the tokens in the text and can also record important metadata about them)
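
As a rough sketch of those two steps (the token and node shapes here are hypothetical and chosen only for illustration), the arithmetic string "12 + 3" might move through the pipeline like this:

// Hypothetical shapes, for illustration only.
const input = '12 + 3';

// Lexical analysis: characters grouped into tokens
const exampleTokens = [
  { type: 'number', token: '12' },
  { type: '+', token: '+' },
  { type: 'number', token: '3' },
];

// Syntactic analysis: tokens arranged into an abstract syntax tree
const exampleAst = {
  type: 'ADDITION',
  left: { type: 'number', value: 12 },
  right: { type: 'number', value: 3 },
};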

Usually, parsing is only the first step of a larger effort. For example, compiling a Golang file into an executable file requires parsing as a first step but also has many more that further transform the AST into CPU instructions.

Regardless of what the ultimate outcome is, any successful parsing stage requires that a formal grammar be specified. This is a human-readable specification that explains what the relevant tokens are as well as in what order they should appear.

In its simplest form, a grammar can be described as a set of rules for rewriting strings. Some strings can be rewritten in terms of other strings, while others cannot be rewritten any further; they are called non-terminal and terminal symbols, respectively. The tokens recognized in the lexical stage make up the grammar's set of terminal symbols, and those terminal symbols are in turn combined to form its non-terminal ones (there being a one-to-many relationship between non-terminal and terminal symbols).

A simple example of a grammar that could be used by a calculator to parse a mathematical expression might look like

EXPRESSION
: EXPRESSION '+' EXPRESSION
| EXPRESSION '-' EXPRESSION
| EXPRESSION '*' EXPRESSION
| EXPRESSION '/' EXPRESSION
| number
;

(This is actually a bad grammar for reasons beyond the scope of this article; it is used only to illustrate the kind of forms that will be used later.)

This grammar specifies the terminal symbols +, -, *, /, and number and indicates they can be organized in five different ways, four of which allow for recursive organization. As an example, EXPRESSION '+' EXPRESSION can be expanded to EXPRESSION '+' EXPRESSION '-' EXPRESSION. This allows complex groups of tokens to be expressed in very few rules. Under this grammar a standard arithmetical expression like 12 + 41 * 3 can be matched as an expression by undergoing the following rewrites:

   EXPRESSION '+' EXPRESSION     (rewrite the rightmost EXPRESSION)
-> EXPRESSION '+' EXPRESSION '*' EXPRESSION
-> number '+' number '*' number

How the parsing happens, how a grammar is used to recognize a valid mathematical expression, is the main focus of this article. There are two main approaches to parsing: using one of the many parser generator libraries (which implement well-established parsing algorithms), or writing a parser by hand. Using a generator abstracts a lot of the work away from a developer and is great for complex situations, but when trying to understand how the process works at all, it is better to write a parser by hand.

To that end, this article will create a parser for something that JavaScript developers use every day: JSON. By the end of this article there will be a working version of the JSON.parse method that explores the sorts of paths that must be taken and decisions that must be made to convert JSON text into JavaScript values.

Parsing JSON (lite)

To keep vocabulary consistent throughout the rest of this article, “parsing” will be taken to mean the entire process of converting an unstructured string of characters into a structured piece of data. Parsing will consist of the tokenization step (lexical analysis) followed by the (syntactic) analysis step.

Grammar

To set the stage for building the tokenizer and analyzer, an exact grammar expressing the JSON to parse will be given. This grammar will not recognize every string that matches the full JSON specification; doing so would require delving into minutiae and detract from the higher-level goal of broadly understanding the parsing pipeline.

The possible values a JSON string can represent are strings, numbers, Booleans, the null value, and objects and arrays (both of which can contain any amount of JSON values). This gives the grammar its topmost rule:

JSON
: JSON_STRING
| number
| null
| JSON_BOOL
| JSON_ARRAY
| JSON_OBJECT
;

The conventions for naming the symbols in this grammar are:

  • non-terminal symbols will be all uppercase
  • terminal symbols that represent a class of characters will be all lowercase
  • terminal symbols that represent a literal character will be written as they are between single quotation marks

To be complete, this grammar’s four non-terminal symbols need to have their rules specified.

Strings in JSON will be anything between two quotation marks. Arrays will be a comma-separated list of any JSON values between two square brackets. Objects will be a comma-separated list of key/value pairs between curly braces. This expands the grammar to the following:

JSON
: JSON_STRING
| number
| null
| JSON_BOOL
| JSON_ARRAY
| JSON_OBJECT
;
JSON_STRING
: '"' string '"'
;
JSON_BOOL
: true
| false
;
JSON_ARRAY
: '[' JSON_ARRAY_ELS ']'
;
JSON_ARRAY_ELS
: JSON_ARRAY_ELS ',' JSON
| JSON
| epsilon
;
JSON_OBJECT
: '{' JSON_OBJECT_KV_PAIRS '}'
;
JSON_OBJECT_KV_PAIRS
: JSON_OBJECT_KV_PAIRS ',' JSON_OBJECT_KV_PAIR
| JSON_OBJECT_KV_PAIR
| epsilon
;
JSON_OBJECT_KV_PAIR
: JSON_STRING ':' JSON
;

There are a couple of interesting features to point out in this grammar.

A special symbol called epsilon has been introduced. This symbol represents the empty string, and in this grammar is used to indicate that an array or object can be empty. This would allow "[]" and "{}" to both be valid JSON strings.
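
For example, the empty array "[]" is derived by taking the epsilon alternative for the array's elements:

JSON_ARRAY
-> '[' JSON_ARRAY_ELS ']'
-> '[' epsilon ']'
-> '[' ']'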

The grammar is also recursive. This reflects the recursive nature of JSON values themselves, since there is no limit to what can be nested inside arrays and objects. This allows "[1,[true,[]]]" to be a valid JSON string that would be parsed into a series of nested arrays. Also, some of these recursive rules are left recursive, meaning that a production rule has itself as the start of the rule:

JSON_ARRAY_ELS
: JSON_ARRAY_ELS ',' JSON
;

For analyzers utilizing certain parsing algorithms this can cause infinite looping, since the parser would, to parse JSON_ARRAY_ELS, first have to see if it matches JSON_ARRAY_ELS. But then it would have to see if that matches JSON_ARRAY_ELS, ad infinitum. Since this article is implementing an analyzer by hand there is much flexibility with how the parser handles left recursion: it can easily be rewritten as iteration.
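
Here is a minimal, self-contained sketch of that rewrite. It uses a toy left-recursive grammar, LIST : LIST ',' number | number, over a pre-tokenized input rather than the JSON analyzer built below: instead of "to parse a LIST, first parse a LIST", the loop consumes one element per iteration.

// Toy example only: parse the left-recursive rule
//   LIST : LIST ',' number | number
// iteratively over an already-tokenized input.
function parseList(tokens) {
  let cursor = 0;
  const elements = [];

  // One element is consumed per loop iteration, so the function
  // always makes progress and terminates.
  while (cursor < tokens.length) {
    elements.push(Number(tokens[cursor]));
    cursor += 1;

    if (tokens[cursor] === ',') {
      cursor += 1; // skip the separator
    }
  }

  return elements;
}

parseList(['1', ',', '2', ',', '3']); // [1, 2, 3]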

Tokenizer

The point of the tokenizer is to chunk up the text that is to be parsed into groups of characters (even if some of those groups contain just one character). These tokens represent the terminal symbols in the grammar and are often defined by regular expressions and given a name. Certain tokens can also be recognized as meaningless and skipped entirely — this is the approach often taken for whitespace characters.

The grammar defined above gives the following tokens:

const tokens = [
  [/^\s+/, null], // skip whitespace
  [/^\[/, '['],
  [/^]/, ']'],
  [/^\{/, '{'],
  [/^}/, '}'],
  [/^:/, ':'],
  [/^,/, ','],
  [/^"/, '"'],
  [/^\d+/, 'number'],
  [/^null\b/, 'null'],   // Add word boundaries to ensure that
  [/^true\b/, 'true'],   // `null`, etc., is matched exactly.
  [/^false\b/, 'false'],
  [/^[^"]*/, 'string'],  // Match anything not a quotation mark.
];

Each tuple represents a pattern/name pair. Each pattern is anchored at the start of the string to ensure the tokenizer always generates the next token that should be fed to the analyzer. Assigning whitespace a null name indicates to the tokenizer that all whitespace should be ignored.

The logic of a tokenizer is to loop through the list of tokens and find the first one that matches the beginning of the string. This means it is important to order tokens in such a way that more specific tokens come before more general ones. For example, if a certain programming language had two keywords def and define, define should come first so that tokenizing the string define add(a, b) does not first match def and thus leave an invalid ine add(a, b) for the tokenizer to process afterwards.
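
A small sketch of that pitfall (the def/define keywords are hypothetical; only the ordering matters):

// Hypothetical keyword tokens for a language with both `def` and `define`.
const badOrder = [
  [/^def/, 'def'],       // matches first...
  [/^define/, 'define'], // ...so this entry is never reached
];

const goodOrder = [
  [/^define/, 'define'], // more specific token comes first
  [/^def/, 'def'],
];

// With badOrder, "define add(a, b)" produces a `def` token and leaves
// the unusable remainder "ine add(a, b)"; with goodOrder it produces
// a single `define` token as intended.

(Anchoring keywords with word boundaries, as the null, true, and false tokens above do, is another way to sidestep the problem.)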

This gives the tokenizer the following skeleton

class Tokenizer {
  constructor(tokens) {}
  read(string) {}
  next() {}
}

which can be fleshed out to

class Tokenizer {
  // List of tokens recognized by the tokenizer
  #tokens;

  // Position in the input string from which it will
  // read to get the next token
  #cursor;

  // String to turn into tokens
  #string;

  constructor(tokens) {
    this.#tokens = tokens;
  }

  read(string) {
    this.#cursor = 0;
    this.#string = string;
  }

  next() {
    // If at end of input, no more tokens to generate
    if (this.#cursor === this.#string.length) {
      return undefined;
    }

    // Find substring beginning at position of cursor
    const str = this.#string.slice(this.#cursor);

    for (const [pattern, type] of this.#tokens) {
      const [match] = pattern.exec(str) || [];

      if (!match) {
        continue;
      }

      this.#cursor += match.length;

      // Skip tokens with null types
      if (type === null) {
        return this.next();
      }

      return { token: match, type };
    }

    // Could not extract any tokens, so throw error
    throw new Error(`Unrecognized input: ${str[0]}`);
  }
}
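
As a quick sanity check, running a small JSON string through the tokenizer produces the following sequence (which follows directly from the token list defined earlier):

const tokenizer = new Tokenizer(tokens);
tokenizer.read('[true, 1]');

tokenizer.next(); // { token: '[', type: '[' }
tokenizer.next(); // { token: 'true', type: 'true' }
tokenizer.next(); // { token: ',', type: ',' }
tokenizer.next(); // { token: '1', type: 'number' } (whitespace skipped)
tokenizer.next(); // { token: ']', type: ']' }
tokenizer.next(); // undefined (end of input)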

Now that a stream of characters can be converted to a stream of tokens, it is time to transform those tokens into an abstract syntax tree.

Analyzer

The analyzer has a bit more work to do than the tokenizer since it has to implement the production rules of the grammar. With this analyzer being written by hand, its size and complexity are directly proportional to the number of rules it has to implement.

Since this variety of manually created analyzer (called “recursive descent”) directly implements the rules of the grammar, a good skeleton to start with would be:

class Analyzer {
  // Tokenizer class
  #tokenizer;

  // Current token being analyzed
  #lookahead;

  constructor(tokenizer) {}
  #eat(tokenType) {}
  read(string) {}
  #JSON() {}
  #JSON_STRING() {}
  #JSON_number() {}
  #JSON_null() {}
  #JSON_BOOL() {}
  #JSON_ARRAY() {}
  #JSON_OBJECT() {}
  #JSON_ARRAY_ELS() {}
  #JSON_OBJECT_KV_PAIRS() {}
  #JSON_OBJECT_KV_PAIR() {}
}

Most of these methods directly match the grammar; two are new. read is the public method that will be called on a string to convert it into an abstract syntax tree. #eat is a method commonly used by analyzers to ensure the current token is of the required type. If the token matches the type it will be returned and the analyzer can continue with its derivations. Otherwise, the analyzer should throw an error. #eat's use will become more apparent later in this article.

There is also a #lookahead property on the analyzer. This is one of the most crucial aspects of parsing: the ability to inspect the current token before consuming it. #lookahead stores the current token which allows the analyzer to determine which path in the grammar it should take. Its existence will be crucial in implementing the #JSON method. In general any number of tokens can be used as lookaheads, although one token is a very common number and is sufficient for implementing a JSON parser.

Fleshing out the non-grammar methods, the analyzer will look like

class Analyzer {
  #tokenizer;
  #lookahead;

  constructor(tokenizer) {
    this.#tokenizer = tokenizer;
  }

  read(string) {
    this.#tokenizer.read(string);

    // Set the first token as the lookahead
    this.#lookahead = this.#tokenizer.next();

    return this.#JSON();
  }

  #eat(tokenType) {
    const token = this.#lookahead;

    if (!token) {
      throw new Error(
        `Unexpected end of input; expected ${tokenType}`
      );
    }

    if (this.#lookahead.type !== tokenType) {
      throw new Error(
        `Expected a ${tokenType} token but received ${token.type}`
      );
    }

    // Set the next token as the lookahead
    this.#lookahead = this.#tokenizer.next();

    return token;
  }
}

With #eat and read in place, the methods that build the AST can be implemented.

class Analyzer {
  // ... methods from above

  #JSON() {
    switch (this.#lookahead.type) {
      case '"':
        return this.#JSON_STRING();
      case 'number':
        return this.#JSON_number();
      case 'null':
        return this.#JSON_null();
      case 'true':
      case 'false':
        return this.#JSON_BOOL();
      case '[':
        return this.#JSON_ARRAY();
      case '{':
        return this.#JSON_OBJECT();
      default:
        throw new Error(
          `Received token which cannot be valid JSON`
        );
    }
  }
}

As can be seen from this method, the lookahead token plays a crucial role in allowing the analyzer to determine which JSON value should be parsed. Without this knowledge the analyzer would be unable to continue. Since only a subset of the grammar’s set of tokens can begin a JSON value, if the lookahead is not in one of the case statements then the analyzer is attempting to read malformed JSON and throws an error.

The rest of the methods (some of which exemplify the use of #eat) are:

class Analyzer {
  // ... previous methods

  #JSON_STRING() {
    // The quotation marks are necessary for the JSON grammar,
    // but contribute nothing to the semantic content of the
    // AST, so ensure they exist but do not use them
    this.#eat('"');
    const string = this.#eat('string').token;
    this.#eat('"');

    return {
      type: 'JSON_STRING',
      value: string,
    };
  }

  #JSON_number() {
    const number = this.#eat('number').token;

    return {
      type: 'JSON_number',
      value: +number,
    };
  }

  #JSON_null() {
    this.#eat('null');

    return {
      type: 'JSON_null',
      value: null,
    };
  }

  #JSON_BOOL() {
    const bool = this.#lookahead.type === 'true'
      ? this.#eat('true')
      : this.#eat('false');

    return {
      type: 'JSON_BOOL',
      value: bool.token === 'true',
    };
  }
}

This takes care of the non-recursive rules of the grammar. Each rule returns an object which specifies the type of the node it represents in the AST as well as an interpreted value of that node (meaning that if the analyzer is reading a number, the node returned has a value which is a number instead of the textual token of that number).
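
To make that concrete, reading the text "42" already yields a node holding a real number:

const analyzer = new Analyzer(new Tokenizer(tokens));

analyzer.read('42');
// { type: 'JSON_number', value: 42 } -- the value is a number, not the string '42'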

These terminal cases of a JSON value can be recursively combined into arrays and objects, and so the next rules will in some way involve recursing via the topmost #JSON method.

class Analyzer {
  // ... previous rules

  #JSON_ARRAY() {
    this.#eat('[');
    const elements = this.#JSON_ARRAY_ELS();
    this.#eat(']');

    return {
      type: 'JSON_ARRAY',
      elements,
    };
  }

  #JSON_ARRAY_ELS() {
    const elements = [];

    while (this.#lookahead.type !== ']') {
      elements.push(this.#JSON());

      // The final element of an array will not have a comma
      // following it, so conditionally consume any comma
      // characters
      this.#lookahead.type === ',' && this.#eat(',');
    }

    return elements;
  }
}

This is the first production rule that uses a different shape for its AST node: the ones representing the terminal nodes all had value properties, but this uses elements. AST nodes can have any possible shape with no problems as long as the shapes are used consistently. In this case it makes more semantic sense to have elements instead of value for array nodes.

The grammar rule for JSON_ARRAY_ELS is left recursive which, as said above, can cause infinite recursion for certain algorithmic analyzers. However, the manual analyzer circumvents this by analyzing the array elements using a while loop. Since recursion and looping are intimately related concepts, it is often possible to make such a substitution, and in this case it prevents changes to the grammar while still giving the same outcome.

The method #JSON_ARRAY_ELS itself does not return an AST node since it is a utility function rather than a primary method for reading JSON.

The methods for reading objects are very similar to those used for reading arrays:

class Analyzer {
  // ... previous methods

  #JSON_OBJECT() {
    this.#eat('{');
    const kvPairs = this.#JSON_OBJECT_KV_PAIRS();
    this.#eat('}');

    return {
      type: 'JSON_OBJECT',
      entries: kvPairs,
    };
  }

  #JSON_OBJECT_KV_PAIRS() {
    const entries = [];

    while (this.#lookahead.type !== '}') {
      entries.push(this.#JSON_OBJECT_KV_PAIR());
      this.#lookahead.type === ',' && this.#eat(',');
    }

    return entries;
  }

  #JSON_OBJECT_KV_PAIR() {
    // Get key
    this.#eat('"');
    const key = this.#eat('string').token;
    this.#eat('"');

    this.#eat(':');

    // Get value
    const value = this.#JSON();

    return [key, value];
  }
}

Sample Output

Now that the parsing pipeline is complete, the tokenizer and analyzer can be put together to illustrate sample outputs given various JSON string inputs.

const tokenizer = new Tokenizer(tokens);
const analyzer = new Analyzer(tokenizer);
const str1 = '"abc"';
const str2 = '145';
const str3 = '[true, { "k": null }, []]';
analyzer.read(str1);
// {
//   type: 'JSON_STRING',
//   value: 'abc',
// }

analyzer.read(str2);
// {
//   type: 'JSON_number',
//   value: 145,
// }

analyzer.read(str3);
// {
//   type: 'JSON_ARRAY',
//   elements: [
//     { type: 'JSON_BOOL', value: true },
//     { type: 'JSON_OBJECT', entries: [
//       [ 'k', { type: 'JSON_null', value: null } ],
//     ] },
//     { type: 'JSON_ARRAY', elements: [] },
//   ],
// }
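
Malformed input is rejected as well. For instance, a stray closing brace where an array element or ']' is expected ends up in the default branch of #JSON:

analyzer.read('[1}');
// Throws: "Received token which cannot be valid JSON"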

Converting the String to a JavaScript Value

So far the analyzer converts a JSON string to an abstract syntax tree. The goal, however, is to convert said string to a JavaScript value (thereby imitating the actual JSON.parse method).

The trick is to capitalize on the fact that the structure of the AST already closely mirrors the structure of the JSON itself: every method that returns an AST node returns something that has all the information necessary to create the actual underlying JSON value. This means that with very minimal changes the analyzer could return values instead of ASTs.

To make this happen, a bit of refactoring is necessary. Instead of the analyzer’s methods always returning AST nodes, they will call factory functions that return something representing the “theme” of that factory. In the original case the theme was AST nodes; in the next case it will be JavaScript.

The refactored version of the analyzer would look like

const astFactory = {
  JSON_STRING(token) {
    return {
      type: 'JSON_STRING',
      value: token.token,
    };
  },
  JSON_number(token) {
    return {
      type: 'JSON_number',
      value: +token.token,
    };
  },
  JSON_null() {
    return {
      type: 'JSON_null',
      value: null,
    };
  },
  JSON_BOOL(token) {
    return {
      type: 'JSON_BOOL',
      value: token.token === 'true',
    };
  },
  JSON_ARRAY(elements) {
    return {
      type: 'JSON_ARRAY',
      elements,
    };
  },
  JSON_OBJECT(entries) {
    return {
      type: 'JSON_OBJECT',
      entries,
    };
  },
};

class Analyzer {
  // ... non-refactored methods and properties
  #factory;

  constructor(tokenizer, factory) {
    this.#tokenizer = tokenizer;
    this.#factory = factory;
  }

  #JSON_STRING() {
    this.#eat('"');
    const string = this.#eat('string');
    this.#eat('"');

    return this.#factory.JSON_STRING(string);
  }

  #JSON_number() {
    const number = this.#eat('number');
    return this.#factory.JSON_number(number);
  }

  #JSON_null() {
    this.#eat('null');
    return this.#factory.JSON_null();
  }

  #JSON_BOOL() {
    const bool = this.#lookahead.type === 'true'
      ? this.#eat('true')
      : this.#eat('false');

    return this.#factory.JSON_BOOL(bool);
  }

  #JSON_ARRAY() {
    this.#eat('[');
    const elements = this.#JSON_ARRAY_ELS();
    this.#eat(']');

    return this.#factory.JSON_ARRAY(elements);
  }

  #JSON_OBJECT() {
    this.#eat('{');
    const kvPairs = this.#JSON_OBJECT_KV_PAIRS();
    this.#eat('}');

    return this.#factory.JSON_OBJECT(kvPairs);
  }
}

Now the analyzer returns whatever the relevant #factory function returns. At this point all that is necessary is to write a JavaScript factory instead of an AST factory.

const jsFactory = {
  JSON_STRING(token) {
    return token.token;
  },
  JSON_number(token) {
    return +token.token;
  },
  JSON_null() {
    return null;
  },
  JSON_BOOL(token) {
    return token.token === 'true';
  },
  JSON_ARRAY(elements) {
    return elements;
  },
  JSON_OBJECT(entries) {
    return entries.reduce(
      (obj, [k, v]) => {
        obj[k] = v;
        return obj;
      },
      {}
    );
  },
};
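
For instance, the object factory simply folds the analyzer's key/value pairs into a plain object (Object.fromEntries would work equally well here):

jsFactory.JSON_OBJECT([['a', 1], ['b', true]]); // { a: 1, b: true }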

Using the same strings as above, the JavaScript factory gives truly parsed JavaScript values as output.

const tokenizer = new Tokenizer(tokens);
const analyzer = new Analyzer(tokenizer, jsFactory);
const str1 = '"abc"';
const str2 = '145';
const str3 = '[true, { "k": null }, []]';
analyzer.read(str1); // "abc"
analyzer.read(str2); // 145
analyzer.read(str3); // [true, { k: null }, []]
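
Since the JavaScript factory produces plain values, the output can be checked directly against the built-in JSON.parse, at least for the inputs this lite grammar accepts:

const mine = analyzer.read('[true, { "k": null }, []]');
const native = JSON.parse('[true, { "k": null }, []]');

JSON.stringify(mine) === JSON.stringify(native); // true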

Conclusion

Writing a custom tokenizer and analyzer by hand is not so challenging a task as it might initially seem and opens up many other avenues for engineering, such as building a regular expression engine, a CSS engine, or even one’s own programming language. In fact, the most challenging part is often modeling the grammar appropriately. With a well-defined grammar the parser itself flows fairly quickly.

While this parser does not handle the full JSON spec, it gets close, and anyone interested in implementing the full grammar can consult Douglas Crockford’s RFC (RFC 4627). This parser can also be extended into a “JSON+” grammar: values like Infinity and undefined, as well as comments, are not part of the spec but would be easy to add to the current grammar.
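
As a rough sketch of what that extension could look like (the token names and regular expressions here are illustrative, not a complete implementation), Infinity, undefined, and single-line comments could be handled with three new token definitions, spliced into the list ahead of the catch-all string pattern, plus matching grammar alternatives:

// Hypothetical additions to the `tokens` list. Comments get a null
// type so the tokenizer skips them, just like whitespace.
const extraTokens = [
  [/^\/\/.*/, null],           // skip single-line comments
  [/^Infinity\b/, 'Infinity'],
  [/^undefined\b/, 'undefined'],
];

// The JSON rule would gain the alternatives `Infinity` and `undefined`,
// and #JSON would gain `case 'Infinity'` and `case 'undefined'` branches
// that call new factory methods returning Infinity and undefined.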

Here is a Gist of all the parser code in one place.


I’m a fullstack web developer working with the MERN stack. I love learning new things, and have decided writing about them is the best way to remember them.