moo vs nearley vs antlr4 vs pegjs vs ohm-js vs jison
Parsing Libraries for JavaScript Comparison
What Are Parsing Libraries for JavaScript?

Parsing libraries are tools that help developers analyze and interpret structured text data, such as programming languages or data formats. These libraries provide functionalities to create parsers that can read, process, and transform input text into a more manageable structure, typically an Abstract Syntax Tree (AST). They are essential in building compilers, interpreters, and other tools that require understanding the syntax and semantics of a language or data format.
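
For example, a parser might turn the input "1 + 2" into an AST shaped roughly like the object below. The node names here are purely illustrative, not any particular library's output:

    // A minimal sketch of an AST for "1 + 2".
    // Node names ('BinaryExpression', 'NumberLiteral') are hypothetical.
    const ast = {
      type: 'BinaryExpression',
      operator: '+',
      left:  { type: 'NumberLiteral', value: 1 },
      right: { type: 'NumberLiteral', value: 2 },
    }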

Stat Detail
Package   Downloads   Stars    Size     Issues  Publish       License
moo       4,539,853   843      32.7 kB  33      -             BSD-3-Clause
nearley   3,063,940   3,656    -        197     4 years ago   MIT
antlr4    579,139     17,573   3.09 MB  991     6 months ago  BSD-3-Clause
pegjs     375,801     4,849    -        116     8 years ago   MIT
ohm-js    183,027     5,229    2.37 MB  47      2 years ago   MIT
jison     58,049      4,361    -        161     7 years ago   MIT
Feature Comparison: moo vs nearley vs antlr4 vs pegjs vs ohm-js vs jison

Grammar Definition

  • moo:

    Moo focuses on tokenization and does not define grammars directly. Instead, it allows you to specify token patterns, which can then be used in conjunction with other parsing libraries.

  • nearley:

    Nearley supports flexible grammar definitions that can handle ambiguous grammars, enabling more complex parsing scenarios and multiple parsing strategies (a minimal grammar sketch follows this list).

  • antlr4:

    ANTLR4 allows you to define grammars in a clear and concise way, supporting complex language constructs and providing features like syntax error reporting and tree generation.

  • pegjs:

    PEG.js uses a Parsing Expression Grammar (PEG) approach for defining grammars, which is intuitive and allows for clear and concise grammar specifications.

  • ohm-js:

    Ohm.js provides a clear syntax for defining grammars and allows you to associate semantic actions directly within the grammar, making it easy to implement custom processing logic during parsing.

  • jison:

    Jison uses a BNF-like syntax for grammar definitions, making it straightforward to create parsers for simple languages or data formats without much overhead.
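
To make these styles concrete, here is a minimal sketch of a nearley grammar (a .ne file) that delegates tokenization to moo. The rule and token names are illustrative, not taken from any of these projects' documentation:

    # grammar.ne -- a hedged sketch of nearley + moo (names illustrative)
    @{%
    const moo = require("moo");
    const lexer = moo.compile({
      ws:     /[ \t]+/,
      number: /[0-9]+/,
      plus:   "+",
    });
    %}
    @lexer lexer

    # sum -> sum "+" number | number
    sum -> sum _ %plus _ %number {% ([lhs, , , , rhs]) => lhs + Number(rhs.value) %}
         | %number               {% ([n]) => Number(n.value) %}
    _ -> %ws:?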

Performance

  • moo:

    Moo is designed for speed and efficiency in tokenization, making it one of the fastest lexers available, which is crucial for performance-sensitive applications.

  • nearley:

    Nearley can handle complex parsing scenarios but may have performance overhead due to its flexibility in grammar handling, which can affect speed in large inputs.

  • antlr4:

    ANTLR4 is optimized for performance with features like lazy parsing and efficient tree walking, making it suitable for large-scale applications that require fast parsing.

  • pegjs:

    PEG.js generates parsers that are generally slower than those produced by ANTLR4 but are easier to integrate and use for simpler applications.

  • ohm-js:

    Ohm.js balances performance with ease of use, providing reasonable performance for educational purposes and smaller projects, but may not be as optimized for large-scale applications.

  • jison:

    Jison is lightweight and generates parsers that are efficient for smaller grammars, but may not perform as well with very complex or large grammars.

Error Handling

  • moo:

    Moo focuses on tokenization and does not provide built-in error handling for parsing, so developers must implement their own error management when integrating with other libraries.

  • nearley:

    Nearley supports error handling but requires additional setup to manage ambiguous grammars, making it more complex for error recovery compared to others.

  • antlr4:

    ANTLR4 provides advanced error handling capabilities, allowing developers to define custom error messages and recovery strategies, which is essential for building robust parsers.

  • pegjs:

    PEG.js provides clear, informative error messages, making it easier to debug parsing issues, but may not support advanced recovery strategies (see the sketch after this list).

  • ohm-js:

    Ohm.js includes basic error reporting features, allowing developers to catch and handle syntax errors during parsing, which is useful for educational contexts.

  • jison:

    Jison offers basic error handling features, but they may require additional configuration to manage complex error scenarios effectively.
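
To illustrate the PEG.js side, here is a sketch of catching a syntax error from a generated parser; the one-rule grammar is hypothetical, just enough to trigger a failure:

    const peg = require('pegjs')

    // Hypothetical one-rule grammar, purely for illustration:
    const parser = peg.generate('start = "a"+')

    try {
      parser.parse('aab')
    } catch (e) {
      // Generated parsers throw SyntaxError objects carrying a message
      // and a location describing where the parse failed.
      console.error(e.message)
      console.error(e.location.start.line, e.location.start.column)
    }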

Community and Documentation

  • moo:

    Moo has good documentation and examples, but its community is smaller compared to more established libraries, which may limit support.

  • nearley:

    Nearley has a growing community and decent documentation, with examples that help users understand its capabilities, especially for complex parsing tasks.

  • antlr4:

    ANTLR4 has a large community and extensive documentation, including tutorials and examples, which makes it easier for developers to learn and implement.

  • pegjs:

    PEG.js has a solid documentation base and an active community, providing ample resources for developers to learn and troubleshoot.

  • ohm-js:

    Ohm.js has a friendly community and clear documentation, making it suitable for beginners and educational purposes, with many examples available.

  • jison:

    Jison has a smaller community but provides sufficient documentation for basic usage, making it accessible for quick projects.

Integration and Extensibility

  • moo:

    Moo is designed to be lightweight and can easily integrate with other parsing libraries, making it a good choice for tokenization in larger projects.

  • nearley:

    Nearley is highly extensible and integrates with various tools and libraries, making it suitable for complex projects that require flexibility (a usage sketch follows this list).

  • antlr4:

    ANTLR4 can be integrated with various programming languages and frameworks, making it highly extensible and suitable for diverse applications.

  • pegjs:

    PEG.js is designed for seamless integration into JavaScript applications and can be easily extended with custom parsing logic.

  • ohm-js:

    Ohm.js allows for easy integration with JavaScript applications and supports custom semantic actions, making it extensible for various use cases.

  • jison:

    Jison is easy to integrate with JavaScript projects and can be extended with custom actions, but it may not support other languages natively.
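
For instance, a compiled nearley grammar plugs into plain JavaScript like this (a sketch, assuming grammar.ne has already been compiled to grammar.js with nearleyc):

    const nearley = require('nearley')
    const grammar = require('./grammar.js') // produced by: nearleyc grammar.ne -o grammar.js

    const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar))
    parser.feed('1 + 2')
    // Ambiguous grammars can yield several parses; unambiguous ones yield exactly one.
    console.log(parser.results[0])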

How to Choose: moo vs nearley vs antlr4 vs pegjs vs ohm-js vs jison
  • moo:

    Opt for Moo if you require a fast and efficient lexer that can handle tokenization for your parser. It is particularly useful when you want to build a lexer that is easy to integrate with other parsing libraries and is optimized for performance.

  • nearley:

    Use Nearley if you want a powerful parser that can handle ambiguous grammars and supports a wide range of parsing techniques. It's great for projects that require flexibility in grammar definitions and can handle complex parsing scenarios.

  • antlr4:

    Choose ANTLR4 if you need a powerful parser generator that supports multiple target languages and offers advanced features like error handling and tree walking. It's ideal for complex language processing tasks and has a steep learning curve but provides robust capabilities.

  • pegjs:

    Select PEG.js if you need a parser generator that uses Parsing Expression Grammar (PEG) and allows for easy integration into JavaScript applications. It's ideal for projects that require a simple yet powerful way to define grammars and generate parsers.

  • ohm-js:

    Choose Ohm.js if you want a library that combines parsing with semantic actions, allowing you to define grammars and associated actions in a straightforward way. It's suitable for educational purposes or projects that need a clear syntax definition alongside parsing capabilities.

  • jison:

    Select Jison if you prefer a simple and lightweight parser generator that can quickly create parsers from a BNF grammar. It's suitable for smaller projects or when you need to implement a quick solution without extensive setup.

README for moo

Moo!

Moo is a highly-optimised tokenizer/lexer generator. Use it to tokenize your strings, before parsing 'em with a parser like nearley or whatever else you're into.

Is it fast?

Yup! Flying-cows-and-singed-steak fast.

Moo is the fastest JS tokenizer around. It's ~2–10x faster than most other tokenizers; it's a couple orders of magnitude faster than some of the slower ones.

Define your tokens using regular expressions. Moo will compile 'em down to a single RegExp for performance. It uses the new ES6 sticky flag where possible to make things faster; otherwise it falls back to an almost-as-efficient workaround. (For more than you ever wanted to know about this, read adventures in the land of substrings and RegExps.)

You might be able to go faster still by writing your lexer by hand rather than using RegExps, but that's icky.

Oh, and it avoids parsing RegExps by itself. Because that would be horrible.

Usage

First, you need to do the needful: $ npm install moo, or whatever will ship this code to your computer. Alternatively, grab the moo.js file by itself and slap it into your web page via a <script> tag; moo is completely standalone.

Then you can start roasting your very own lexer/tokenizer:

    const moo = require('moo')

    let lexer = moo.compile({
      WS:      /[ \t]+/,
      comment: /\/\/.*?$/,
      number:  /0|[1-9][0-9]*/,
      string:  /"(?:\\["\\]|[^\n"\\])*"/,
      lparen:  '(',
      rparen:  ')',
      keyword: ['while', 'if', 'else', 'moo', 'cows'],
      NL:      { match: /\n/, lineBreaks: true },
    })

And now throw some text at it:

    lexer.reset('while (10) cows\nmoo')
    lexer.next() // -> { type: 'keyword', value: 'while' }
    lexer.next() // -> { type: 'WS', value: ' ' }
    lexer.next() // -> { type: 'lparen', value: '(' }
    lexer.next() // -> { type: 'number', value: '10' }
    // ...

When you reach the end of Moo's internal buffer, next() will return undefined. You can always reset() it and feed it more data when that happens.
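
For example, a simple drain-and-refill loop might look like this (a sketch; what you do with each token is up to you):

    let token
    while ((token = lexer.next()) !== undefined) {
      console.log(token.type, token.value)
    }
    // Buffer exhausted; feed it more input:
    lexer.reset('if (cows) moo')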

On Regular Expressions

RegExps are nifty for making tokenizers, but they can be a bit of a pain. Here are some things to be aware of:

  • You often want to use non-greedy quantifiers: e.g. *? instead of *. Otherwise your tokens will be longer than you expect:

    let lexer = moo.compile({
      string: /".*"/,   // greedy quantifier *
      // ...
    })
    
    lexer.reset('"foo" "bar"')
    lexer.next() // -> { type: 'string', value: 'foo" "bar' }
    

    Better:

    let lexer = moo.compile({
      string: /".*?"/,   // non-greedy quantifier *?
      // ...
    })
    
    lexer.reset('"foo" "bar"')
    lexer.next() // -> { type: 'string', value: 'foo' }
    lexer.next() // -> { type: 'space', value: ' ' }
    lexer.next() // -> { type: 'string', value: 'bar' }
    
  • The order of your rules matters. Earlier ones will take precedence.

    moo.compile({
        identifier:  /[a-z0-9]+/,
        number:  /[0-9]+/,
    }).reset('42').next() // -> { type: 'identifier', value: '42' }
    
    moo.compile({
        number:  /[0-9]+/,
        identifier:  /[a-z0-9]+/,
    }).reset('42').next() // -> { type: 'number', value: '42' }
    
  • Moo uses multiline RegExps. This has a few quirks: for example, the dot /./ doesn't include newlines. Use [^] instead if you want to match newlines too.

  • Excluding character ranges like /[^ ]/ (which matches anything but a space) will include newlines, so be careful not to match them by accident! In particular, the whitespace metacharacter \s includes newlines. Both pitfalls are sketched below.
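
A minimal sketch of both pitfalls; the rule names here are arbitrary:

    moo.compile({
      // \s matches '\n' too, so this rule needs lineBreaks: true
      ws:  { match: /\s+/, lineBreaks: true },
      // /./ would never match '\n'; [^] matches any character, newlines included
      any: { match: /[^]/, lineBreaks: true },
    })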

Line Numbers

Moo tracks detailed information about the input for you.

It will track line numbers, as long as you apply the lineBreaks: true option to any rules which might contain newlines. Moo will try to warn you if you forget to do this.

Note that this is false by default, for performance reasons: counting the number of lines in a matched token has a small cost. For optimal performance, only match newlines inside a dedicated token:

    newline: {match: '\n', lineBreaks: true},

Token Info

Token objects (returned from next()) have the following attributes:

  • type: the name of the group, as passed to compile.
  • text: the string that was matched.
  • value: the string that was matched, transformed by your value function (if any).
  • offset: the number of bytes from the start of the buffer where the match starts.
  • lineBreaks: the number of line breaks found in the match. (Always zero if this rule has lineBreaks: false.)
  • line: the line number of the beginning of the match, starting from 1.
  • col: the column where the match begins, starting from 1.
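
A quick sketch of these attributes in practice (output abbreviated):

    const lexer = moo.compile({
      word: /[a-z]+/,
      NL:   { match: /\n/, lineBreaks: true },
    })
    lexer.reset('moo\ncows')
    lexer.next() // -> { type: 'word', value: 'moo', offset: 0, line: 1, col: 1, ... }
    lexer.next() // -> { type: 'NL', lineBreaks: 1, line: 1, col: 4, ... }
    lexer.next() // -> { type: 'word', value: 'cows', offset: 4, line: 2, col: 1, ... }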

Value vs. Text

The value is the same as the text, unless you provide a value transform.

    const moo = require('moo')

    const lexer = moo.compile({
      ws: /[ \t]+/,
      string: {match: /"(?:\\["\\]|[^\n"\\])*"/, value: s => s.slice(1, -1)},
    })

    lexer.reset('"test"')
    lexer.next() /* { value: 'test', text: '"test"', ... } */

Reset

Calling reset() on your lexer will empty its internal buffer, and set the line, column, and offset counts back to their initial value.

If you don't want this, you can save() the state, and later pass it as the second argument to reset() to explicitly control the internal state of the lexer.

    lexer.reset('some line\n')
    let info = lexer.save() // -> { line: 10 }
    lexer.next() // -> { line: 10 }
    lexer.next() // -> { line: 11 }
    // ...
    lexer.reset('a different line\n', info)
    lexer.next() // -> { line: 10 }

Keywords

Moo makes it convenient to define literals.

    moo.compile({
      lparen:  '(',
      rparen:  ')',
      keyword: ['while', 'if', 'else', 'moo', 'cows'],
    })

It'll automatically compile them into regular expressions, escaping them where necessary.

Keywords should be written using the keywords transform.

    moo.compile({
      IDEN: {match: /[a-zA-Z]+/, type: moo.keywords({
        KW: ['while', 'if', 'else', 'moo', 'cows'],
      })},
      SPACE: {match: /\s+/, lineBreaks: true},
    })

Why?

You need to do this to ensure the longest match principle applies, even in edge cases.

Imagine trying to parse the input className with the following rules:

    keyword: ['class'],
    identifier: /[a-zA-Z]+/,

You'll get two tokens, ['class', 'Name'], which is not what you want! If you swap the order of the rules, you'll fix this example, but now you'll lex a bare class wrong (as an identifier).

The keywords helper checks matches against the list of keywords; if any of them match, it uses the type 'keyword' instead of 'identifier' (for this example).
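
Here is that fix sketched with the keywords transform (same example, output abbreviated):

    let lexer = moo.compile({
      identifier: {match: /[a-zA-Z]+/, type: moo.keywords({keyword: ['class']})},
      space:      {match: /\s+/, lineBreaks: true},
    })
    lexer.reset('class className')
    lexer.next() // -> { type: 'keyword', value: 'class' }
    lexer.next() // space
    lexer.next() // -> { type: 'identifier', value: 'className' }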

Keyword Types

Keywords can also have individual types.

    let lexer = moo.compile({
      name: {match: /[a-zA-Z]+/, type: moo.keywords({
        'kw-class': 'class',
        'kw-def': 'def',
        'kw-if': 'if',
      })},
      // ...
    })
    lexer.reset('def foo')
    lexer.next() // -> { type: 'kw-def', value: 'def' }
    lexer.next() // space
    lexer.next() // -> { type: 'name', value: 'foo' }

You can use Object.fromEntries to easily construct keyword objects:

    Object.fromEntries(['class', 'def', 'if'].map(k => ['kw-' + k, k]))

States

Moo allows you to define multiple lexer states. Each state defines its own separate set of token rules. Your lexer will start off in the first state given to moo.states({}).

Rules can be annotated with next, push, and pop, to change the current state after that token is matched. A "stack" of past states is kept, which is used by push and pop.

  • next: 'bar' moves to the state named bar. (The stack is not changed.)
  • push: 'bar' moves to the state named bar, and pushes the old state onto the stack.
  • pop: 1 removes one state from the top of the stack, and moves to that state. (Only 1 is supported.)

Only rules from the current state can be matched. You need to copy your rule into all the states you want it to be matched in.

For example, to tokenize JS-style string interpolation such as a${{c: d}}e, you might use:

    let lexer = moo.states({
      main: {
        strstart: {match: '`', push: 'lit'},
        ident:    /\w+/,
        lbrace:   {match: '{', push: 'main'},
        rbrace:   {match: '}', pop: 1},
        colon:    ':',
        space:    {match: /\s+/, lineBreaks: true},
      },
      lit: {
        interp:   {match: '${', push: 'main'},
        escape:   /\\./,
        strend:   {match: '`', pop: 1},
        const:    {match: /(?:[^$`]|\$(?!\{))+/, lineBreaks: true},
      },
    })
    // <= `a${{c: d}}e`
    // => strstart const interp lbrace ident colon space ident rbrace rbrace const strend

The rbrace rule is annotated with pop, so it moves from the main state into either lit or main, depending on the stack.

Errors

If none of your rules match, Moo will throw an Error, since it doesn't know what else to do.

If you prefer, you can have moo return an error token instead of throwing an exception. The error token will contain the whole of the rest of the buffer.

    const lexer = moo.compile({
      // ...
      myError: moo.error,
    })

    lexer.reset('invalid')
    lexer.next() // -> { type: 'myError', value: 'invalid', text: 'invalid', offset: 0, lineBreaks: 0, line: 1, col: 1 }
    lexer.next() // -> undefined

You can have a token type that both matches tokens and contains error values.

    moo.compile({
      // ...
      myError: {match: /[\$?`]/, error: true},
    })

Formatting errors

If you want to throw an error from your parser, you might find formatError helpful. Call it with the offending token:

    throw new Error(lexer.formatError(token, "invalid syntax"))

It returns a string with a pretty error message.

    Error: invalid syntax at line 2 col 15:

      totally valid `syntax`
                    ^

Iteration

Iterators: we got 'em.

    for (let here of lexer) {
      // here = { type: 'number', value: '123', ... }
    }

Create an array of tokens.

    let tokens = Array.from(lexer);

Use itt's iteration tools with Moo.

    for (let [here, next] of itt(lexer).lookahead()) { // pass a number if you need more tokens
      // enjoy!
    }

Transform

Moo doesn't allow capturing groups, but you can supply a transform function, value(), which will be called on the value before storing it in the Token object.

    moo.compile({
      STRING: [
        {match: /"""[^]*?"""/, lineBreaks: true, value: x => x.slice(3, -3)},
        {match: /"(?:\\["\\rn]|[^"\\])*?"/, lineBreaks: true, value: x => x.slice(1, -1)},
        {match: /'(?:\\['\\rn]|[^'\\])*?'/, lineBreaks: true, value: x => x.slice(1, -1)},
      ],
      // ...
    })

Contributing

Do check the FAQ.

Before submitting an issue, remember...