cheerio vs htmlparser2 vs jsdom vs parse5

HTML Parsing Libraries

HTML parsing libraries are essential tools in web development that allow developers to manipulate and traverse HTML documents programmatically. They provide a way to extract data from web pages, modify the structure of HTML, and facilitate tasks such as web scraping, DOM manipulation, and testing. Each library has its unique features and design principles, making them suitable for different use cases and developer preferences.

Stat Detail

Package      Downloads  Stars   Size     Issues  Publish       License
cheerio      0          30,203  1.01 MB  39      2 months ago  MIT
htmlparser2  0          4,801   198 kB   1       9 hours ago   MIT
jsdom        0          21,529  6.93 MB  419     5 days ago    MIT
parse5       0          3,885   337 kB   35      8 months ago  MIT

Feature Comparison: cheerio vs htmlparser2 vs jsdom vs parse5

API Design

  • cheerio:

    Cheerio provides a jQuery-like syntax, making it easy for developers familiar with jQuery to manipulate HTML documents. Its API is intuitive and allows for chaining methods, which simplifies the process of traversing and modifying the DOM.

  • htmlparser2:

    htmlparser2 offers a low-level API that gives developers control over the parsing process. It allows for event-driven parsing, which can be beneficial for handling large documents efficiently, but may require more boilerplate code compared to higher-level libraries.

  • jsdom:

    jsdom emulates a browser environment in Node.js, implementing a large subset of web platform APIs such as the DOM, events, and localStorage. Not every browser API is covered (fetch, for example, is not implemented by jsdom itself), but the coverage is broad enough for testing and for running scripts that rely on browser-specific functionality.

  • parse5:

    parse5 is designed to be a fast and compliant HTML parser with a straightforward API. It focuses on providing a clear separation between parsing and serialization, allowing developers to handle HTML documents in a structured way.

Performance

  • cheerio:

    Cheerio is optimized for speed and efficiency, making it a great choice for web scraping tasks where performance is crucial. It operates in a lightweight manner, parsing HTML quickly without the overhead of a browser.

  • htmlparser2:

    htmlparser2 is known for its high performance and low memory usage, especially when dealing with large or malformed HTML documents. Its streaming parser allows for efficient handling of input data, making it suitable for performance-sensitive applications.

  • jsdom:

    While jsdom provides a rich feature set, it may not be as performant as lighter libraries like Cheerio or htmlparser2 due to its comprehensive DOM simulation. It's best used when full browser capabilities are needed, rather than for raw performance.

  • parse5:

    parse5 is designed for speed and compliance with the HTML5 specification. It balances performance with adherence to standards, making it a solid choice for projects that require both.

Error Handling

  • cheerio:

    Cheerio does no error handling of its own for malformed HTML; it delegates to the underlying parser (parse5 by default, or optionally htmlparser2). Both parsers tolerate malformed markup, but they can recover from it differently, so the resulting tree depends on which parser is in use.

  • htmlparser2:

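    htmlparser2 is forgiving of malformed HTML: it keeps parsing rather than failing on broken markup, which makes it a practical choice for real-world HTML documents that do not conform to strict standards. It does not follow the HTML5 specification's recovery rules, so its output can differ from the tree a browser would build.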

  • jsdom:

    jsdom provides error handling similar to a browser, allowing developers to catch and respond to DOM-related errors effectively. This is beneficial when running scripts that may encounter unexpected HTML structures.

  • parse5:

    parse5 is built to handle HTML5 parsing errors gracefully, allowing developers to work with imperfect HTML while still adhering to the specification. It provides detailed error reporting, which can aid in debugging.

Use Cases

  • cheerio:

    Cheerio is best suited for server-side web scraping, data extraction, and simple HTML manipulation tasks where a lightweight solution is preferred. Its jQuery-like syntax makes it easy to use for those familiar with jQuery.

  • htmlparser2:

    htmlparser2 is ideal for applications that require a fast, low-level parser for HTML documents, especially when performance is critical. It's often used in scenarios where developers need to build custom parsing logic.

  • jsdom:

    jsdom is perfect for testing front-end code in a Node.js environment, allowing developers to run scripts that require a full DOM. It's also useful for server-side rendering of web applications that rely on client-side JavaScript.

  • parse5:

    parse5 is suitable for projects that require strict adherence to HTML5 standards, such as web crawlers or validators. Its focus on compliance makes it a good choice for applications that need to process complex HTML structures.

Community and Support

  • cheerio:

    Cheerio has a strong community and is widely used in the web scraping ecosystem. It has good documentation and numerous examples available, making it easy for new users to get started.

  • htmlparser2:

    htmlparser2 is well-maintained and has a solid user base, but its documentation may not be as extensive as some other libraries. However, it is backed by a strong community of contributors.

  • jsdom:

    jsdom has a large community and is actively maintained, with extensive documentation and examples. It is widely used in testing frameworks and has strong support for modern web features.

  • parse5:

    parse5 is actively maintained and has a growing community. Its documentation is clear, and it provides examples to help developers understand how to use the library effectively.

How to Choose: cheerio vs htmlparser2 vs jsdom vs parse5

  • cheerio:

    Choose Cheerio if you need a fast, lightweight library for server-side DOM manipulation that closely resembles jQuery's API. It's ideal for web scraping and quick HTML manipulation tasks without the overhead of a full browser environment.

  • htmlparser2:

    Opt for htmlparser2 if you require a highly efficient, low-level HTML parser that can handle malformed HTML and offers great flexibility. It's suitable for projects where performance is critical and you need fine-grained control over parsing.

  • jsdom:

    Select jsdom if you need a full-fledged DOM implementation that simulates a browser environment. It's particularly useful for testing front-end code in Node.js or when you need to manipulate the DOM as you would in a browser, including support for modern web APIs.

  • parse5:

    Use parse5 if you need a fast and robust HTML parser that adheres closely to the HTML5 specification. It's great for projects that require strict compliance with HTML standards and can handle complex HTML structures.

README for cheerio

cheerio

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

import * as cheerio from 'cheerio';
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

$.html();
//=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>

Installation

Install Cheerio using a package manager like npm, yarn, or bun.

npm install cheerio
# or
bun add cheerio

Features

❤ Proven syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient.

❁ Incredibly flexible: Cheerio wraps around parse5 for parsing HTML and can optionally use the forgiving htmlparser2. Cheerio can parse nearly any HTML or XML document. Cheerio works in both browser and server environments.

API

Loading

First you need to load in the HTML. This step in jQuery is implicit, since jQuery operates on the one, baked-in DOM. With Cheerio, we need to pass in the HTML document.

// ESM or TypeScript:
import * as cheerio from 'cheerio';

// In other environments:
const cheerio = require('cheerio');

const $ = cheerio.load('<ul id="fruits">...</ul>');

$.html();
//=> <html><head></head><body><ul id="fruits">...</ul></body></html>

Selectors

Once you've loaded the HTML, you can use jQuery-style selectors to find elements within the document.

$( selector, [context], [root] )

selector searches within the context scope, which in turn searches within the root scope. selector and context can each be a string expression, a DOM Element, an array of DOM elements, or a cheerio object. root, if provided, is typically the HTML document string.

This selector method is the starting point for traversing and manipulating the document. Like in jQuery, it's the primary method for selecting elements in the document.

$('.apple', '#fruits').text();
//=> Apple

$('ul .pear').attr('class');
//=> pear

$('li[class=orange]').html();
//=> Orange

Rendering

When you're ready to render the document, you can call the html method on the "root" selection:

$.root().html();
//=>  <html>
//      <head></head>
//      <body>
//        <ul id="fruits">
//          <li class="apple">Apple</li>
//          <li class="orange">Orange</li>
//          <li class="pear">Pear</li>
//        </ul>
//      </body>
//    </html>

If you want to render the outerHTML of a selection, you can use the outerHTML prop:

$('.pear').prop('outerHTML');
//=> <li class="pear">Pear</li>

You may also render the text content of a Cheerio object using the text method:

const $ = cheerio.load('This is <em>content</em>.');
$('body').text();
//=> This is content.

The "DOM Node" object

Cheerio collections are made up of objects that bear some resemblance to browser-based DOM nodes. You can expect them to define the following properties:

  • tagName
  • parentNode
  • previousSibling
  • nextSibling
  • nodeValue
  • firstChild
  • childNodes
  • lastChild

Screencasts

https://vimeo.com/31950192

This video tutorial is a follow-up to Nettut's "How to Scrape Web Pages with Node.js and jQuery", using cheerio instead of JSDOM + jQuery. This video shows how easy it is to use cheerio and how much faster cheerio is than JSDOM + jQuery.

Cheerio in the real world

Are you using cheerio in production? Add it to the wiki!

Sponsors

Does your company use Cheerio in production? Please consider sponsoring this project! Your help will allow maintainers to dedicate more time and resources to its development and support.

Headlining Sponsors

Tidelift Github AirBnB HasData

Other Sponsors

OnlineCasinosSpelen Nieuwe-Casinos.net

Backers

Become a backer to show your support for Cheerio and help us maintain and improve this open source project.

Vasy Kafidoff

License

MIT