htmlparser2 vs jsdom vs cheerio vs html
HTML Parsing and Manipulation

HTML parsing libraries in JavaScript read, traverse, and modify HTML documents outside the browser. They matter most in Node.js, where the browser's DOM API is not available, and they underpin tasks such as web scraping, server-side rendering, and content extraction. The four packages compared here sit at different points on that spectrum: a low-level streaming parser (htmlparser2), a full browser-like DOM (jsdom), a jQuery-style API (cheerio), and a lightweight package (html). Which one fits depends on the project's performance requirements, the API style you prefer, and how complex the HTML manipulation is.

Stat Detail

Package      Downloads   Stars   Size     Issues  Publish       License
htmlparser2  44,298,892  4,695   489 kB   22      a year ago    MIT
jsdom        35,536,768  21,337  3.29 MB  438     2 days ago    MIT
cheerio      12,543,522  29,891  1.27 MB  42      4 months ago  MIT
html         287,905     76      -        11      9 years ago   BSD

Feature Comparison: htmlparser2 vs jsdom vs cheerio vs html

Parsing Speed

  • htmlparser2:

    htmlparser2 is one of the fastest HTML parsers available in the Node.js ecosystem. It is designed for performance, especially when dealing with large documents, making it ideal for streaming and memory-efficient parsing.

  • jsdom:

    jsdom is slower compared to the other libraries because it emulates a full browser environment. The performance trade-off is worth it for applications that need a complete DOM implementation, but it may not be suitable for tasks that require only simple parsing.

  • cheerio:

    cheerio skips browser emulation and works on the lightweight DOM from the htmlparser2 ecosystem (domhandler). By default it parses with parse5 for spec compliance and can optionally use the faster htmlparser2, which makes it a good fit for web scraping and other tasks that need quick HTML parsing.

  • html:

    html is designed to be lightweight and efficient, providing fast parsing and serialization of HTML documents. It is suitable for applications that need quick processing of HTML without significant overhead.

DOM Manipulation

  • htmlparser2:

    htmlparser2 focuses on parsing rather than manipulation. Its core is a low-level, callback-based API for HTML and XML. It can build a DOM through parseDocument and the bundled domhandler and domutils helpers, but it has no high-level manipulation API of its own, so developers often pair it with libraries such as cheerio.

  • jsdom:

    jsdom offers a full-featured DOM API, including support for advanced features like event handling, CSSOM, and more. It is the best choice for applications that require comprehensive DOM manipulation and a browser-like environment.

  • cheerio:

    cheerio provides a jQuery-like API for manipulating the DOM, making it easy to select, modify, and traverse elements. It is particularly useful for tasks like web scraping, where you need to extract or modify content quickly.

  • html:

    html provides basic DOM manipulation capabilities, but it is not as feature-rich as cheerio or jsdom. It is suitable for simple tasks that require minimal manipulation of HTML elements.

Memory Usage

  • htmlparser2:

    htmlparser2 is designed for low memory usage, particularly when used in streaming mode. It is ideal for applications that need to parse large documents without loading them entirely into memory.

  • jsdom:

    jsdom consumes more memory than the other libraries because it creates a complete DOM tree and emulates a browser environment. This makes it less suitable for memory-constrained applications.

  • cheerio:

    cheerio is memory-efficient, especially when compared to full browser emulation libraries. However, it still loads the entire HTML document into memory, which can be a concern for very large documents.

  • html:

    html is lightweight and has a small memory footprint, making it suitable for applications that need to process HTML without consuming significant resources.

Feature Completeness

  • htmlparser2:

    htmlparser2 is focused on parsing. Beyond the event-based parser it offers DOM building (parseDocument) and feed parsing (parseFeed), while selector queries and serialization live in companion packages such as css-select and dom-serializer. It excels at parsing but relies on those tools for more complex tasks.

  • jsdom:

    jsdom is the most feature-complete library in this group, offering a full implementation of the DOM, including support for events, styles, and more. It is ideal for applications that need a complete web environment in Node.js.

  • cheerio:

    cheerio provides a comprehensive set of features for HTML manipulation, including support for CSS selectors, attribute manipulation, and content editing. It is a great all-around tool for web scraping and simple DOM tasks.

  • html:

    html offers basic features for parsing and serializing HTML, but it lacks advanced capabilities like CSS selector support or event handling. It is best used for simple tasks that do not require extensive functionality.

Ease of Use: Code Examples

  • htmlparser2:

    htmlparser2 has a more complex API due to its low-level nature. It may take some time for developers to become proficient, especially if they are not familiar with event-driven parsing.

  • jsdom:

    jsdom has a comprehensive API that mirrors the browser DOM, but its complexity can be overwhelming for beginners. It is well-documented, which helps ease the learning curve.

  • cheerio:

    cheerio is easy to use, especially for developers familiar with jQuery. Its API is intuitive and well-documented, making it quick to learn and implement.

  • html:

    html has a simple API that is easy to understand, but its lack of advanced features may require developers to implement additional functionality on their own.

How to Choose: htmlparser2 vs jsdom vs cheerio vs html
  • htmlparser2:

    Choose htmlparser2 if you need a high-performance, event-driven parser for handling large HTML or XML documents. It is ideal for applications that require streaming parsing and low memory usage, such as web crawlers and data extraction tools.

  • jsdom:

    Choose jsdom if you need a full-featured DOM implementation in Node.js that closely mimics a real browser environment. It is suitable for applications that require advanced DOM manipulation, event handling, and support for modern web APIs.

  • cheerio:

    Choose cheerio if you need a fast and lightweight solution for parsing and manipulating HTML on the server side. It is ideal for web scraping and simple DOM manipulation tasks without the overhead of a full browser environment.

  • html:

    Choose html if you need a simple and efficient way to parse and serialize HTML documents. It is suitable for projects that require basic HTML manipulation without the need for complex features or a large API.

README for htmlparser2

htmlparser2

The fast & forgiving HTML/XML parser.

htmlparser2 is the fastest HTML parser, and takes some shortcuts to get there. If you need strict HTML spec compliance, have a look at parse5.

Installation

npm install htmlparser2

A live demo of htmlparser2 is available on AST Explorer.

Ecosystem

Name            Description
htmlparser2     Fast & forgiving HTML/XML parser
domhandler      Handler for htmlparser2 that turns documents into a DOM
domutils        Utilities for working with domhandler's DOM
css-select      CSS selector engine, compatible with domhandler's DOM
cheerio         The jQuery API for domhandler's DOM
dom-serializer  Serializer for domhandler's DOM

Usage

htmlparser2 itself provides a callback interface that allows consumption of documents with minimal allocations. For a more ergonomic experience, read Getting a DOM below.

import * as htmlparser2 from "htmlparser2";

const parser = new htmlparser2.Parser({
    onopentag(name, attributes) {
        /*
         * This fires when a new tag is opened.
         *
         * If you don't need an aggregated `attributes` object,
         * have a look at the `onopentagname` and `onattribute` events.
         */
        if (name === "script" && attributes.type === "text/javascript") {
            console.log("JS! Hooray!");
        }
    },
    ontext(text) {
        /*
         * Fires whenever a section of text was processed.
         *
         * Note that this can fire at any point within text and you might
         * have to stitch together multiple pieces.
         */
        console.log("-->", text);
    },
    onclosetag(tagname) {
        /*
         * Fires when a tag is closed.
         *
         * You can rely on this event only firing when you have received an
         * equivalent opening tag before. Closing tags without corresponding
         * opening tags will be ignored.
         */
        if (tagname === "script") {
            console.log("That's it?!");
        }
    },
});
parser.write(
    "Xyz <script type='text/javascript'>const foo = '<<bar>>';</script>",
);
parser.end();

Output (with multiple text events combined):

--> Xyz
JS! Hooray!
--> const foo = '<<bar>>';
That's it?!

This example only shows three of the possible events. Read more about the parser, its events and options in the wiki.

Usage with streams

While the Parser interface closely resembles Node.js streams, it's not a 100% match. Use the WritableStream interface to process a streaming input:

import fs from "node:fs";
import { WritableStream } from "htmlparser2/lib/WritableStream";

const parserStream = new WritableStream({
    ontext(text) {
        console.log("Streaming:", text);
    },
});

const htmlStream = fs.createReadStream("./my-file.html");
htmlStream.pipe(parserStream).on("finish", () => console.log("done"));

Getting a DOM

The DomHandler produces a DOM (document object model) that can be manipulated using the DomUtils helper.

import * as htmlparser2 from "htmlparser2";

const dom = htmlparser2.parseDocument(htmlString);

The DomHandler, while still bundled with this module, was moved to its own module. Have a look at that for further information.

Parsing Feeds

htmlparser2 makes it easy to parse RSS, RDF and Atom feeds, by providing a parseFeed method:

const feed = htmlparser2.parseFeed(content, options);

Performance

After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on GitHub Actions (sourced from here):

htmlparser2        : 2.17215 ms/file ± 3.81587
node-html-parser   : 2.35983 ms/file ± 1.54487
html5parser        : 2.43468 ms/file ± 2.81501
neutron-html5parser: 2.61356 ms/file ± 1.70324
htmlparser2-dom    : 3.09034 ms/file ± 4.77033
html-dom-parser    : 3.56804 ms/file ± 5.15621
libxmljs           : 4.07490 ms/file ± 2.99869
htmljs-parser      : 6.15812 ms/file ± 7.52497
parse5             : 9.70406 ms/file ± 6.74872
htmlparser         : 15.0596 ms/file ± 89.0826
html-parser        : 28.6282 ms/file ± 22.6652
saxes              : 45.7921 ms/file ± 128.691
html5              : 120.844 ms/file ± 153.944

How does this module differ from node-htmlparser?

In 2011, this module started as a fork of the htmlparser module. htmlparser2 was rewritten multiple times and, while it maintains an API that's mostly compatible with htmlparser, the projects don't share any code anymore.

The parser now provides a callback interface inspired by sax.js (originally targeted at readabilitySAX). As a result, old handlers won't work anymore.

The DefaultHandler was renamed to clarify its purpose (to DomHandler). The old name is still available when requiring htmlparser2 and your code should work as expected.

The RssHandler was replaced with a getFeed function that takes a DomHandler DOM and returns a feed object. There is a parseFeed helper function that can be used to parse a feed from a string.

Security contact information

To report a security vulnerability, please use the Tidelift security contact. Tidelift will coordinate the fix and disclosure.

htmlparser2 for enterprise

Available as part of the Tidelift Subscription.

The maintainers of htmlparser2 and thousands of other packages are working with Tidelift to deliver commercial support and maintenance for the open source dependencies you use to build your applications. Save time, reduce risk, and improve code health, while paying the maintainers of the exact dependencies you use.