htmlparser2 vs sax vs xml2js vs cheerio
HTML and XML Parsing Libraries Comparison
What Are HTML and XML Parsing Libraries?

HTML and XML parsing libraries are essential tools in web development for extracting and manipulating data from web pages and structured documents. These libraries provide developers with the ability to parse, traverse, and manipulate HTML and XML content efficiently. They are particularly useful for web scraping, data extraction, and transforming documents into usable formats. Each library has its unique strengths and use cases, making it crucial to choose the right one based on project requirements.

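Stat Detail

| Package | Downloads | Stars | Size | Issues | Last Publish | License |
| --- | --- | --- | --- | --- | --- | --- |
| htmlparser2 | 39,319,890 | 4,574 | 489 kB | 22 | 5 months ago | MIT |
| sax | 39,205,200 | 1,111 | 56 kB | 100 | a year ago | ISC |
| xml2js | 23,450,386 | 4,944 | 3.44 MB | 247 | 2 years ago | MIT |
| cheerio | 10,840,254 | 29,441 | 1.25 MB | 54 | 9 months ago | MIT |
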
Feature Comparison: htmlparser2 vs sax vs xml2js vs cheerio

Parsing Methodology

  • htmlparser2:

    htmlparser2 operates as a low-level parser that can handle both HTML and XML. It provides a streaming interface, allowing developers to process data as it is parsed, which is beneficial for handling large documents or when immediate processing is required.

  • sax:

    SAX (Simple API for XML) is an event-driven, streaming parser that reads XML documents sequentially. It does not build a tree structure, making it memory efficient and suitable for large XML files. It emits events for each element, allowing for immediate processing of data as it is encountered.

  • xml2js:

    xml2js converts XML into JavaScript objects, allowing developers to work with XML data in a more natural way. It parses the entire XML document into an object structure, making it easy to access and manipulate data, but it may consume more memory compared to streaming parsers.

  • cheerio:

    Cheerio exposes a jQuery-like API for traversing and manipulating parsed markup, which makes it intuitive for developers familiar with jQuery. It loads HTML into an in-memory tree but does not emulate a browser: it does not render the page, apply CSS, or execute scripts, which keeps it faster than full browser-style DOM implementations for most scraping tasks.

Performance

  • htmlparser2:

    htmlparser2 is designed for high performance and can handle large documents efficiently. Its streaming capabilities allow it to parse data in chunks, reducing memory overhead and improving performance for large-scale parsing tasks.

  • sax:

    SAX is highly efficient for large XML files due to its streaming nature. It processes data on-the-fly, which minimizes memory usage and allows for handling very large documents without significant performance degradation.

  • xml2js:

    xml2js is less performant for large XML documents compared to streaming parsers because it loads the entire document into memory. However, it excels in scenarios where ease of use and quick access to data are more critical than raw performance.

  • cheerio:

    Cheerio is optimized for speed and is particularly efficient for parsing and manipulating small to medium-sized HTML documents. It is not as performant as lower-level parsers for large documents, but its ease of use often outweighs this drawback for many applications.

Error Handling

  • htmlparser2:

    htmlparser2 is robust in handling malformed HTML and XML. It is designed to be forgiving, allowing developers to parse documents that do not conform to strict standards without crashing, making it suitable for web scraping.

  • sax:

    sax reports problems through its error event rather than recovering on its own. In strict mode it flags malformed XML, and the parser has to be resumed explicitly before it will continue after an error. Developers therefore need to implement their own error handling logic, which can be a drawback in some use cases.

  • xml2js:

    xml2js offers some error handling capabilities, but it may not be as forgiving as htmlparser2. It can throw errors when encountering unexpected XML structures, requiring developers to ensure their XML is well-formed.

  • cheerio:

    Cheerio does not perform extensive error handling for malformed HTML, as it is designed to be forgiving and can work with imperfect markup. However, it may not provide detailed error messages, which can make debugging more challenging in complex scenarios.

Use Cases

  • htmlparser2:

    htmlparser2 is a versatile parser that can be used for both HTML and XML parsing. It is suitable for applications that need to handle a variety of document types, especially when performance is a concern.

  • sax:

    SAX is perfect for applications that need to process large XML files or streams of XML data in a memory-efficient manner. It is commonly used in scenarios where real-time processing of XML data is required, such as in data feeds or APIs.

  • xml2js:

    xml2js is ideal for applications that frequently interact with XML data and require a straightforward way to convert XML into JavaScript objects. It is commonly used in scenarios where XML data needs to be integrated into JavaScript applications seamlessly.

  • cheerio:

    Cheerio is best suited for web scraping and server-side DOM manipulation tasks where developers want to leverage jQuery-like syntax. It is ideal for projects that require quick data extraction and manipulation from HTML documents.

Learning Curve

  • htmlparser2:

    htmlparser2 has a moderate learning curve due to its low-level API and streaming nature. Developers may need to familiarize themselves with event-driven programming to use it effectively, which can be a barrier for beginners.

  • sax:

    SAX has a steeper learning curve as it requires understanding event-driven programming and managing state across events. This can be challenging for developers who are not accustomed to this paradigm.

  • xml2js:

    xml2js is relatively easy to learn, especially for developers already familiar with JavaScript objects. Its straightforward API allows for quick integration and manipulation of XML data, making it accessible for most developers.

  • cheerio:

    Cheerio has a gentle learning curve, especially for developers familiar with jQuery. Its syntax and methods are intuitive, making it easy to pick up and use effectively for DOM manipulation tasks.

How to Choose: htmlparser2 vs sax vs xml2js vs cheerio
  • htmlparser2:

    Select htmlparser2 when you require a fast, forgiving HTML and XML parser that can handle malformed markup. It is suitable for scenarios where performance is critical and you need to parse large documents efficiently without the overhead of a full DOM.

  • sax:

    Opt for sax if you need a streaming XML parser that is lightweight and efficient. It is perfect for processing large XML files in a memory-efficient manner, as it emits events as it parses the document, allowing for real-time processing without loading the entire document into memory.

  • xml2js:

    Use xml2js when you need to convert XML data into JavaScript objects easily. It is particularly useful for applications that require seamless integration of XML data into JavaScript environments, allowing for straightforward manipulation and access to XML data.

  • cheerio:

    Choose Cheerio if you need a fast and flexible library for server-side jQuery-like manipulation of HTML documents. It is ideal for web scraping and allows you to use familiar jQuery syntax to traverse and manipulate the DOM.

README for htmlparser2

htmlparser2


The fast & forgiving HTML/XML parser.

htmlparser2 is the fastest HTML parser, and takes some shortcuts to get there. If you need strict HTML spec compliance, have a look at parse5.

Installation

npm install htmlparser2

A live demo of htmlparser2 is available on AST Explorer.

Ecosystem

| Name | Description |
| --- | --- |
| htmlparser2 | Fast & forgiving HTML/XML parser |
| domhandler | Handler for htmlparser2 that turns documents into a DOM |
| domutils | Utilities for working with domhandler's DOM |
| css-select | CSS selector engine, compatible with domhandler's DOM |
| cheerio | The jQuery API for domhandler's DOM |
| dom-serializer | Serializer for domhandler's DOM |

Usage

htmlparser2 itself provides a callback interface that allows consumption of documents with minimal allocations. For a more ergonomic experience, read Getting a DOM below.

import * as htmlparser2 from "htmlparser2";

const parser = new htmlparser2.Parser({
    onopentag(name, attributes) {
        /*
         * This fires when a new tag is opened.
         *
         * If you don't need an aggregated `attributes` object,
         * have a look at the `onopentagname` and `onattribute` events.
         */
        if (name === "script" && attributes.type === "text/javascript") {
            console.log("JS! Hooray!");
        }
    },
    ontext(text) {
        /*
         * Fires whenever a section of text was processed.
         *
         * Note that this can fire at any point within text and you might
         * have to stitch together multiple pieces.
         */
        console.log("-->", text);
    },
    onclosetag(tagname) {
        /*
         * Fires when a tag is closed.
         *
         * You can rely on this event only firing when you have received an
         * equivalent opening tag before. Closing tags without corresponding
         * opening tags will be ignored.
         */
        if (tagname === "script") {
            console.log("That's it?!");
        }
    },
});
parser.write(
    "Xyz <script type='text/javascript'>const foo = '<<bar>>';</script>",
);
parser.end();

Output (with multiple text events combined):

--> Xyz
JS! Hooray!
--> const foo = '<<bar>>';
That's it?!

This example only shows three of the possible events. Read more about the parser, its events and options in the wiki.

Usage with streams

While the Parser interface closely resembles Node.js streams, it's not a 100% match. Use the WritableStream interface to process a streaming input:

import fs from "node:fs";
import { WritableStream } from "htmlparser2/lib/WritableStream";

const parserStream = new WritableStream({
    ontext(text) {
        console.log("Streaming:", text);
    },
});

const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));

Getting a DOM

The DomHandler produces a DOM (document object model) that can be manipulated using the DomUtils helper.

import * as htmlparser2 from "htmlparser2";

const dom = htmlparser2.parseDocument(htmlString);

The DomHandler, while still bundled with this module, was moved to its own module. Have a look at that for further information.

Parsing Feeds

htmlparser2 makes it easy to parse RSS, RDF and Atom feeds, by providing a parseFeed method:

const feed = htmlparser2.parseFeed(content, options);

Performance

After relying on artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on GitHub Actions:

htmlparser2        : 2.17215 ms/file ± 3.81587
node-html-parser   : 2.35983 ms/file ± 1.54487
html5parser        : 2.43468 ms/file ± 2.81501
neutron-html5parser: 2.61356 ms/file ± 1.70324
htmlparser2-dom    : 3.09034 ms/file ± 4.77033
html-dom-parser    : 3.56804 ms/file ± 5.15621
libxmljs           : 4.07490 ms/file ± 2.99869
htmljs-parser      : 6.15812 ms/file ± 7.52497
parse5             : 9.70406 ms/file ± 6.74872
htmlparser         : 15.0596 ms/file ± 89.0826
html-parser        : 28.6282 ms/file ± 22.6652
saxes              : 45.7921 ms/file ± 128.691
html5              : 120.844 ms/file ± 153.944

How does this module differ from node-htmlparser?

In 2011, this module started as a fork of the htmlparser module. htmlparser2 was rewritten multiple times and, while it maintains an API that's mostly compatible with htmlparser, the projects don't share any code anymore.

The parser now provides a callback interface inspired by sax.js (originally targeted at readabilitySAX). As a result, old handlers won't work anymore.

The DefaultHandler was renamed to clarify its purpose (to DomHandler). The old name is still available when requiring htmlparser2 and your code should work as expected.

The RssHandler was replaced with a getFeed function that takes a DomHandler DOM and returns a feed object. There is a parseFeed helper function that can be used to parse a feed from a string.

Security contact information

To report a security vulnerability, please use the Tidelift security contact. Tidelift will coordinate the fix and disclosure.

htmlparser2 for enterprise

Available as part of the Tidelift Subscription.

The maintainers of htmlparser2 and thousands of other packages are working with Tidelift to deliver commercial support and maintenance for the open source dependencies you use to build your applications. Save time, reduce risk, and improve code health, while paying the maintainers of the exact dependencies you use.