parse5 vs htmlparser2
HTML Parsing Libraries for Node.js and Browser Environments

htmlparser2 and parse5 are both widely used HTML parsing libraries in the JavaScript ecosystem, designed to parse HTML documents into structured representations such as DOM trees or event streams. htmlparser2 provides a fast, streaming SAX-style parser with optional DOM support via additional modules, while parse5 implements the full WHATWG HTML specification and produces standards-compliant DOM trees that closely match browser behavior. Both are commonly used in tooling like linters, scrapers, static site generators, and testing utilities where accurate or performant HTML processing is required.

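| Package     | Downloads  | Stars | Size   | Issues | Last Publish | License |
|-------------|------------|-------|--------|--------|--------------|---------|
| parse5      | 65,308,723 | 3,852 | 337 kB | 32     | 5 months ago | MIT     |
| htmlparser2 | 47,812,966 | 4,740 | 489 kB | 22     | a year ago   | MIT     |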

htmlparser2 vs parse5: Parsing HTML the Right Way

When you need to read, transform, or analyze HTML in JavaScript—whether in Node.js or the browser—you’ll likely reach for a dedicated parser. Two of the most battle-tested options are htmlparser2 and parse5. They solve similar problems but with different priorities: speed and flexibility versus spec compliance and correctness. Let’s dig into how they differ in practice.

🧩 Parsing Model: Streaming Events vs Full DOM Tree

htmlparser2 uses a SAX-style (event-driven) parser by default. It emits events as it encounters tags, text, and attributes—ideal for low-memory, one-pass processing.

// htmlparser2: Event-based parsing
const { Parser } = require('htmlparser2');

const parser = new Parser({
  onopentag(name, attribs) {
    console.log(`Tag: ${name}`, attribs);
  },
  ontext(text) {
    if (text.trim()) console.log(`Text: ${text}`);
  }
});

parser.write('<div class="card">Hello</div>');
parser.end();

You can also build a DOM tree using domhandler, but it’s an extra step:

// htmlparser2 + domhandler: Build a DOM
const { Parser } = require('htmlparser2');
const { DomHandler } = require('domhandler');

const handler = new DomHandler((error, dom) => {
  if (!error) console.log(dom);
});
const parser = new Parser(handler);
parser.write('<p>Hi</p>');
parser.end();
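
To make the "extra step" less opaque, here is a minimal sketch of how a handler can assemble SAX events into a tree. It is a simplified illustration, not domhandler's actual code: `createTreeBuilder` is a hypothetical name, and the events are replayed by hand so the snippet runs without the library installed.

```javascript
// Simplified sketch of a tree-building handler (not domhandler's real code).
function createTreeBuilder() {
  const root = { type: 'root', children: [] };
  const stack = [root]; // the last entry is the element currently open

  return {
    root,
    onopentag(name, attribs) {
      const node = { type: 'tag', name, attribs, children: [] };
      stack[stack.length - 1].children.push(node);
      stack.push(node);
    },
    ontext(data) {
      stack[stack.length - 1].children.push({ type: 'text', data });
    },
    onclosetag() {
      if (stack.length > 1) stack.pop(); // never pop the root
    },
  };
}

// Replay the events a parser would emit for '<p>Hi <b>there</b></p>'.
const builder = createTreeBuilder();
builder.onopentag('p', {});
builder.ontext('Hi ');
builder.onopentag('b', {});
builder.ontext('there');
builder.onclosetag('b');
builder.onclosetag('p');

console.log(JSON.stringify(builder.root, null, 2));
```

The real domhandler does more than this (parent and sibling links, comments, processing instructions), so treat the sketch as a mental model only.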

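parse5, by contrast, always builds a full tree that follows the official HTML spec. The core package has no streaming mode and parses the entire document up front; streaming and SAX-style variants ship as separate packages in the parse5 toolset (parse5-parser-stream and parse5-sax-parser).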

// parse5: Parse to DOM tree
const parse5 = require('parse5');

const document = parse5.parse('<!DOCTYPE html><html><body><p>Hello</p></body></html>');
console.log(document); // Full tree with proper node types

// Or parse a fragment
const fragment = parse5.parseFragment('<li>Item</li>');

💡 Use htmlparser2 if you’re scanning large files and only care about certain tags (e.g., extracting all <img> sources). Use parse5 when you need the complete, structured document.
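
The <img> extraction mentioned above could be sketched as a plain handler object. `createImgCollector` is a hypothetical helper, not part of htmlparser2; the only library contract it relies on is the `onopentag(name, attribs)` callback shown earlier. The events are simulated by hand so the snippet runs on its own.

```javascript
// Hypothetical helper: builds a handler that records every <img> src it sees.
function createImgCollector() {
  const sources = [];
  return {
    sources,
    // Same signature as the onopentag callback used with htmlparser2's Parser.
    onopentag(name, attribs) {
      if (name === 'img' && attribs.src) sources.push(attribs.src);
    },
  };
}

// Simulate the events a parser would emit for:
// <p><img src="a.png"><img alt="no source"><img src="b.png"></p>
const collector = createImgCollector();
collector.onopentag('p', {});
collector.onopentag('img', { src: 'a.png' });
collector.onopentag('img', { alt: 'no source' });
collector.onopentag('img', { src: 'b.png' });

console.log(collector.sources); // [ 'a.png', 'b.png' ]
```

With the real library, the same object can be passed to new Parser(collector) and fed with parser.write(chunk) as chunks arrive, so the document never has to be held in memory as a whole.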

📜 Spec Compliance: “Good Enough” vs “Exactly Like a Browser”

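htmlparser2 is not fully compliant with the WHATWG HTML specification. It’s forgiving and flexible: it treats <br> and <br/> the same, accepts arbitrary self-closing tags like <my-component/> when the recognizeSelfClosing or xmlMode option is set, and doesn’t enforce complex nesting rules.

This makes it a good fit for non-standard HTML, JSX-like markup, or templating languages:

// htmlparser2 accepts this once self-closing tags are enabled
const html = '<Component attr="value"/>';
const options = { recognizeSelfClosing: true, lowerCaseTags: false };
// new Parser(handler, options) then reports <Component> as a self-closing tag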

parse5, however, strictly follows the HTML spec. It knows that only void elements (like <img>, <br>) and foreign elements in SVG or MathML may use the self-closing syntax; on any other tag the trailing slash is ignored. It also reconstructs the tree exactly as a browser would, even with broken markup.

// parse5 corrects this invalid nesting
const parse5 = require('parse5');

const badHtml = '<table><tr><div>Oops</div></tr></table>';
const tree = parse5.parse(badHtml);
// Result: <div> is moved out of the table and placed just before it, just like in Chrome

If your tool must behave identically to a browser (e.g., a testing utility or SSR renderer), parse5 is the only safe choice.

⚙️ Output Format: Custom Nodes vs Standard DOM

htmlparser2’s DOM (via domhandler) uses a simplified node structure:

// htmlparser2 DOM node example
{
  type: 'tag',
  name: 'div',
  attribs: { class: 'card' },
  children: [ /* ... */ ]
}

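It’s easy to traverse. domhandler nodes expose a few DOM-style aliases (nodeType, parentNode, childNodes), but they are plain data objects with no DOM methods such as element.getAttribute().

parse5's default tree adapter produces plain objects whose property names follow the DOM. They are not browser Element objects (attributes are an array of name/value pairs and there are no methods), but they include familiar properties like nodeName, childNodes, and parentNode: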

// parse5 node example
{
  nodeName: 'div',
  tagName: 'div',
  attrs: [{ name: 'class', value: 'card' }],
  childNodes: [ /* ... */ ],
  parentNode: /* ... */
}

This matters if you’re using libraries that expect spec-compliant structures (e.g., jsdom uses parse5 under the hood).
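
Because the two shapes store attributes differently, code that has to handle both usually needs a small adapter. The following is a minimal sketch; `getAttr` is a hypothetical helper, and the nodes are hand-written literals that mirror the two examples above rather than real parser output.

```javascript
// Hypothetical helper: read an attribute from either node shape.
function getAttr(node, name) {
  // htmlparser2/domhandler: attributes live in a plain object called `attribs`.
  if (node.attribs) {
    return Object.prototype.hasOwnProperty.call(node.attribs, name)
      ? node.attribs[name]
      : undefined;
  }
  // parse5: attributes are an array of { name, value } pairs called `attrs`.
  if (Array.isArray(node.attrs)) {
    const match = node.attrs.find((attr) => attr.name === name);
    return match ? match.value : undefined;
  }
  return undefined; // text nodes, comments, etc. carry no attributes
}

// Hand-written nodes mirroring the two examples above.
const htmlparser2Node = { type: 'tag', name: 'div', attribs: { class: 'card' }, children: [] };
const parse5Node = {
  nodeName: 'div',
  tagName: 'div',
  attrs: [{ name: 'class', value: 'card' }],
  childNodes: [],
};

console.log(getAttr(htmlparser2Node, 'class')); // card
console.log(getAttr(parse5Node, 'class')); // card
console.log(getAttr(parse5Node, 'id')); // undefined
```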

🛠️ Error Handling and Recovery

Both parsers handle malformed HTML gracefully, but differently.

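htmlparser2 keeps going using lightweight heuristics rather than the spec's algorithm: in HTML mode it has a built-in table of tags whose opening implies closing others, and it closes whatever is still open when the input ends. It does not restructure the tree beyond that.

parse5 applies the HTML error recovery algorithm defined by the spec. For example:

<!-- Input -->
<div>
  <p>Start
  <div>Nested</div>
  Continue?
</div>

A browser (and parse5) will auto-close the <p> before the inner <div>, so “Continue?” becomes a child of the outer <div>. htmlparser2's heuristics cover this common case as well, but not the spec's deeper rules, such as moving misplaced content out of a table or inserting implied <html>, <head>, and <body> elements.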

Use parse5 when you need predictable, standardized recovery—critical for security-sensitive tools like sanitizers.
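
The auto-closing of <p> described above can be pictured with a toy stack-based model. This is a deliberately simplified sketch, not the implementation of either library: `closesParagraph` and `buildOutline` are hypothetical names, the tag set is a small sample, and the real algorithm handles many more cases.

```javascript
// Toy model: a handful of block-level tags whose start tag closes an open <p>.
const closesParagraph = new Set(['div', 'p', 'ul', 'ol', 'table', 'h1', 'h2']);

// tokens: [{ type: 'open' | 'close' | 'text', value }]
// Returns one line per node, indented by nesting depth.
function buildOutline(tokens) {
  const stack = [];
  const lines = [];
  for (const token of tokens) {
    if (token.type === 'open') {
      // Implied end tag: drop an open <p> before a block-level element starts.
      if (closesParagraph.has(token.value) && stack[stack.length - 1] === 'p') {
        stack.pop();
      }
      lines.push('  '.repeat(stack.length) + '<' + token.value + '>');
      stack.push(token.value);
    } else if (token.type === 'close') {
      // Pop back to the matching open tag, if there is one.
      const index = stack.lastIndexOf(token.value);
      if (index !== -1) stack.length = index;
    } else {
      lines.push('  '.repeat(stack.length) + JSON.stringify(token.value));
    }
  }
  return lines;
}

// Token stream for the input above (whitespace-only text omitted).
const outline = buildOutline([
  { type: 'open', value: 'div' },
  { type: 'open', value: 'p' },
  { type: 'text', value: 'Start' },
  { type: 'open', value: 'div' },
  { type: 'text', value: 'Nested' },
  { type: 'close', value: 'div' },
  { type: 'text', value: 'Continue?' },
  { type: 'close', value: 'div' },
]);
console.log(outline.join('\n'));
```

In the printed outline, the inner <div> and “Continue?” sit at the same depth as the <p>, which is the structure a browser produces for this input.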

🌐 Real-World Use Cases

Case 1: Web Scraper That Extracts Links

You’re crawling pages and only need <a href> values.

  • Best choice: htmlparser2
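  • Why? Stream through gigabytes of HTML with minimal memory, and stop after finding 10 links.

const { Parser } = require('htmlparser2');

let linkCount = 0;
const parser = new Parser({
  onopentag(name, attribs) {
    if (linkCount >= 10) return; // already have enough links
    if (name === 'a' && attribs.href) {
      console.log(attribs.href);
      // Pause the tokenizer after the tenth link; the caller should stop feeding chunks too
      if (++linkCount >= 10) parser.pause();
    }
  }
});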

Case 2: Static Site Generator with SSR

You render React/Vue to HTML, then manipulate the output before serving.

  • Best choice: parse5
  • Why? Must match browser DOM structure exactly for hydration to work.

const parse5 = require('parse5');

// renderToString and App come from your framework's server renderer
const html = renderToString(App);
const document = parse5.parse(html);
// Inject meta tags, modify head, etc.
const finalHtml = parse5.serialize(document);
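
The "inject meta tags" step in the snippet above could look roughly like the sketch below. `findElement` and `injectMeta` are hypothetical helpers, and the tree is a hand-built stand-in using the node shape shown earlier (plus `namespaceURI`), so the snippet runs without parse5 installed. In real code, creating nodes through the library's default tree adapter is likely the more robust route.

```javascript
const HTML_NS = 'http://www.w3.org/1999/xhtml';

// Depth-first search for the first element with the given tag name.
function findElement(node, tagName) {
  if (node.tagName === tagName) return node;
  for (const child of node.childNodes || []) {
    const found = findElement(child, tagName);
    if (found) return found;
  }
  return null;
}

// Hypothetical helper: append <meta name="..." content="..."> to <head>.
function injectMeta(root, name, content) {
  const head = findElement(root, 'head');
  if (!head) return false;
  head.childNodes.push({
    nodeName: 'meta',
    tagName: 'meta',
    attrs: [
      { name: 'name', value: name },
      { name: 'content', value: content },
    ],
    namespaceURI: HTML_NS,
    childNodes: [],
    parentNode: head,
  });
  return true;
}

// Hand-built stand-in for the tree that parse5.parse() would return.
const element = (tagName, childNodes = []) => {
  const node = { nodeName: tagName, tagName, attrs: [], namespaceURI: HTML_NS, childNodes, parentNode: null };
  for (const child of childNodes) child.parentNode = node;
  return node;
};
const headNode = element('head');
const doc = { nodeName: '#document', childNodes: [element('html', [headNode, element('body')])] };

console.log(injectMeta(doc, 'description', 'Demo page')); // true
console.log(headNode.childNodes[0].attrs);
```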

Case 3: Linter for Custom Templating Language

Your templates use <MyComponent prop={x}/> syntax.

  • Best choice: htmlparser2
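  • Why? Tolerates non-HTML constructs when configured with lowerCaseTags: false and recognizeSelfClosing: true. parse5 never rejects input, but it would lowercase the tag name and ignore the trailing slash, leaving <mycomponent> open.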
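
A lint rule for such templates can be written as an ordinary event handler. The sketch below is illustrative only: `createComponentLinter` and the rule format are hypothetical, and the events are fed in by hand so it runs without htmlparser2 installed.

```javascript
// Hypothetical lint rule: listed components must carry their required attributes.
function createComponentLinter(requiredProps) {
  const problems = [];
  return {
    problems,
    // Same signature as htmlparser2's onopentag callback.
    onopentag(name, attribs) {
      const required = requiredProps[name];
      if (!required) return; // not a component we have rules for
      for (const prop of required) {
        if (!Object.prototype.hasOwnProperty.call(attribs, prop)) {
          problems.push(`<${name}> is missing "${prop}"`);
        }
      }
    },
  };
}

// Simulated events for three tags in a template.
const linter = createComponentLinter({ MyComponent: ['prop'] });
linter.onopentag('MyComponent', { prop: '{x}' });
linter.onopentag('MyComponent', { other: '1' });
linter.onopentag('div', {});

console.log(linter.problems); // [ '<MyComponent> is missing "prop"' ]
```

When wiring this into the real parser, passing { lowerCaseTags: false, recognizeSelfClosing: true } as the second Parser argument keeps the MyComponent casing intact; without it the handler would see mycomponent instead.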

🔄 Interoperability with Other Tools

  • jsdom uses parse5 internally for parsing—so if you’re already using jsdom, you’re indirectly relying on parse5.
  • cheerio (jQuery-like server-side DOM) has used parse5 by default for HTML since version 1.0 and switches to htmlparser2 when you ask for XML parsing.

// Cheerio: parse5 for HTML, htmlparser2 for XML
const cheerio = require('cheerio');
const $ = cheerio.load('<div>Test</div>'); // parsed by parse5
const $xml = cheerio.load('<feed><entry/></feed>', { xml: true }); // parsed by htmlparser2

📊 Summary: Key Differences

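| Feature           | htmlparser2                                          | parse5                                                  |
|-------------------|------------------------------------------------------|---------------------------------------------------------|
| Parsing Style     | Streaming (SAX) or DOM (with handler)                | Full tree in the core package; streaming via add-ons    |
| Spec Compliance   | Loose, forgiving                                     | Full WHATWG HTML spec                                   |
| Self-Closing Tags | Any tag (<x/>) with recognizeSelfClosing or xmlMode  | Only void (<img/>) and foreign (SVG, MathML) elements   |
| Memory Usage      | Low (streaming)                                      | Higher (full tree)                                      |
| DOM Structure     | Custom, simplified                                   | Plain objects with DOM-style property names             |
| Best For          | Scraping, linting, transforms                        | SSR, testing, sanitization                              |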

💡 Final Recommendation

  • Need speed, low memory, or custom syntax support? → Go with htmlparser2.
  • Need browser-identical parsing for correctness or compatibility? → Choose parse5.

Neither is “better”—they’re optimized for different jobs. Pick based on whether your priority is performance and flexibility or accuracy and standards compliance.

How to Choose: parse5 vs htmlparser2
  • parse5: Choose parse5 when you require full WHATWG HTML specification compliance, such as when building tools that must replicate browser parsing exactly (e.g., SSR frameworks, testing libraries, or HTML sanitizers). It produces a DOM-like tree and handles edge cases like malformed markup the same way modern browsers do. Prefer it over htmlparser2 when correctness trumps raw speed or memory efficiency.

  • htmlparser2: Choose htmlparser2 if you need a lightweight, high-performance parser for tasks like scraping, linting, or transforming HTML where strict spec compliance isn't critical. Its streaming event-based API allows low-memory processing of large documents, and its xmlMode and recognizeSelfClosing options add support for XML-like syntax (e.g., self-closing tags), which is useful for JSX-like or templating languages. Avoid it when you need exact browser-compatible parsing behavior.

README for parse5

parse5

HTML parser and serializer.

npm install --save parse5
