htmlparser2 and parse5 are both widely used HTML parsing libraries in the JavaScript ecosystem, designed to parse HTML documents into structured representations such as DOM trees or event streams. htmlparser2 provides a fast, streaming SAX-style parser with optional DOM support via additional modules, while parse5 implements the full WHATWG HTML specification and produces standards-compliant DOM trees that closely match browser behavior. Both are commonly used in tooling like linters, scrapers, static site generators, and testing utilities where accurate or performant HTML processing is required.
When you need to read, transform, or analyze HTML in JavaScript—whether in Node.js or the browser—you’ll likely reach for a dedicated parser. Two of the most battle-tested options are htmlparser2 and parse5. They solve similar problems but with different priorities: speed and flexibility versus spec compliance and correctness. Let’s dig into how they differ in practice.
htmlparser2 uses a SAX-style (event-driven) parser by default. It emits events as it encounters tags, text, and attributes—ideal for low-memory, one-pass processing.
// htmlparser2: Event-based parsing
const { Parser } = require('htmlparser2');
const parser = new Parser({
  onopentag(name, attribs) {
    console.log(`Tag: ${name}`, attribs);
  },
  ontext(text) {
    if (text.trim()) console.log(`Text: ${text}`);
  }
});
parser.write('<div class="card">Hello</div>');
parser.end();
You can also build a DOM tree by pairing the parser with a domhandler handler (htmlparser2 also exports a parseDocument helper that wraps these steps in one call):
// htmlparser2 + domhandler: Build a DOM
const { Parser } = require('htmlparser2');
const { DomHandler } = require('domhandler');
// The callback fires once parsing has finished
const handler = new DomHandler((error, dom) => {
  if (!error) console.log(dom);
});
const parser = new Parser(handler);
parser.write('<p>Hi</p>');
parser.end();
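parse5, by contrast, always builds a full tree that follows the official HTML spec. Its core parse() and parseFragment() functions take the entire input as a string; streaming is only available through separate companion packages such as parse5-sax-parser.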
// parse5: Parse to DOM tree
const parse5 = require('parse5');
const document = parse5.parse('<!DOCTYPE html><html><body><p>Hello</p></body></html>');
console.log(document); // Full tree with proper node types
// Or parse a fragment
const fragment = parse5.parseFragment('<li>Item</li>');
💡 Use htmlparser2 if you’re scanning large files and only care about certain tags (e.g., extracting all <img> sources). Use parse5 when you need the complete, structured document.
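As a rough illustration of the tip above, the handler for an "extract all image sources" pass can be a plain object. This is a minimal sketch: `createImageCollector` is a hypothetical helper name, and the events are simulated by hand so the snippet runs without the library installed. The only htmlparser2 feature it assumes is the `onopentag(name, attribs)` callback shown earlier.

```javascript
// Hypothetical helper: builds a handler that records the src of every <img>.
function createImageCollector() {
  const sources = [];
  const handler = {
    // Same callback shape as in the event-based example above.
    onopentag(name, attribs) {
      if (name === 'img' && attribs.src) sources.push(attribs.src);
    },
  };
  return { handler, sources };
}

// Simulate the events a parser would emit for:
//   <p><img src="a.png"><img alt="no source"><img src="b.png"></p>
const { handler, sources } = createImageCollector();
handler.onopentag('p', {});
handler.onopentag('img', { src: 'a.png' });
handler.onopentag('img', { alt: 'no source' });
handler.onopentag('img', { src: 'b.png' });

console.log(sources); // [ 'a.png', 'b.png' ]
```

With the library installed, the same `handler` object would be passed to `new Parser(handler)` and fed with `parser.write(chunk)` as data arrives, so the document never has to be held in memory as a tree.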
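htmlparser2 is not fully compliant with the WHATWG HTML specification. It’s forgiving and flexible: it treats <br> and <br/> the same, doesn’t enforce complex nesting rules, and, with the recognizeSelfClosing option (or xmlMode) enabled, accepts arbitrary self-closing tags like <my-component/>.
This makes it a good fit for non-standard HTML, JSX-like component syntax, or templating languages:
// htmlparser2 treats this as self-closing when recognizeSelfClosing (or xmlMode) is on
const { Parser } = require('htmlparser2');
const html = '<Component attr="value"/>';
const parser = new Parser({ /* handlers */ }, { recognizeSelfClosing: true });
parser.end(html);
// Note: tag names are lowercased unless xmlMode or lowerCaseTags: false is set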
parse5, however, strictly follows the HTML spec. Only void elements (like <img>, <br>) are effectively self-closing; on any other HTML element the trailing slash is ignored and the tag is treated as a normal start tag. It also reconstructs the tree exactly as a browser would, even with broken markup.
// parse5 corrects this invalid nesting
const badHtml = '<table><tr><div>Oops</div></tr></table>';
const tree = parse5.parse(badHtml);
// Result: <div> gets moved outside the table, just like in Chrome
If your tool must behave identically to a browser (e.g., a testing utility or SSR renderer), parse5 is the only safe choice of the two.
htmlparser2’s DOM (via domhandler) uses a simplified node structure:
// htmlparser2 DOM node example
{
  type: 'tag',
  name: 'div',
  attribs: { class: 'card' },
  children: [ /* ... */ ]
}
It’s easy to traverse, but it is not a standard DOM: recent domhandler versions expose a few DOM-style aliases such as childNodes and parentNode, yet element methods like element.getAttribute() do not exist.
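To make the shape concrete, here is a small sketch that walks a hand-written tree using only the `type`, `name`, `attribs`, and `children` fields shown above, plus text nodes modeled as `{ type: 'text', data }`. The `findByClass` helper and the sample tree are illustrative assumptions, not part of domhandler.

```javascript
// Hand-written tree in the simplified shape shown above (not produced by a parser).
const dom = [
  {
    type: 'tag',
    name: 'div',
    attribs: { class: 'card featured' },
    children: [
      { type: 'text', data: 'Hello ' },
      { type: 'tag', name: 'span', attribs: { class: 'card' }, children: [] },
    ],
  },
];

// Hypothetical helper: depth-first search for tags carrying a given class.
function findByClass(nodes, className, found = []) {
  for (const node of nodes) {
    if (node.type !== 'tag') continue; // skip text, comments, etc.
    const classes = (node.attribs.class || '').split(/\s+/);
    if (classes.includes(className)) found.push(node);
    findByClass(node.children, className, found);
  }
  return found;
}

console.log(findByClass(dom, 'card').map((n) => n.name)); // [ 'div', 'span' ]
```

In a real project, the companion domutils package provides ready-made traversal helpers, so a hand-rolled walker like this is mainly useful for understanding the structure.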
parse5 produces nodes that match the DOM spec closely. While not identical to browser Element objects, they include standard properties like nodeName, childNodes, and parentNode:
// parse5 node example
{
  nodeName: 'div',
  tagName: 'div',
  attrs: [{ name: 'class', value: 'card' }],
  childNodes: [ /* ... */ ],
  parentNode: /* ... */
}
This matters if you’re using libraries that expect spec-compliant structures (e.g., jsdom uses parse5 under the hood).
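Because `attrs` is an array of name/value pairs rather than an object, attribute lookups usually go through a small helper. The following is a sketch that assumes the node shape shown above, with text nodes modeled as `{ nodeName: '#text', value }`. `getAttr` and `walk` are hypothetical names, and the tree is written by hand so the snippet runs without parse5 installed.

```javascript
// Hand-written nodes in the shape shown above (not produced by parse5 here).
const link = {
  nodeName: 'a',
  tagName: 'a',
  attrs: [{ name: 'href', value: '/docs' }, { name: 'class', value: 'nav' }],
  childNodes: [{ nodeName: '#text', value: 'Docs' }],
};
const root = { nodeName: 'div', tagName: 'div', attrs: [], childNodes: [link] };
link.parentNode = root;
link.childNodes[0].parentNode = link;

// Hypothetical helper: look up one attribute by name, or return null.
function getAttr(node, name) {
  const attr = (node.attrs || []).find((a) => a.name === name);
  return attr ? attr.value : null;
}

// Hypothetical helper: visit every node depth-first.
function walk(node, visit) {
  visit(node);
  for (const child of node.childNodes || []) walk(child, visit);
}

const hrefs = [];
walk(root, (node) => {
  const href = getAttr(node, 'href');
  if (href !== null) hrefs.push(href);
});

console.log(hrefs); // [ '/docs' ]
```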
Both parsers handle malformed HTML gracefully, but differently.
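htmlparser2 keeps going using lightweight heuristics rather than the spec’s algorithm. It applies a small set of implied-close rules and closes any tags still open at the end of input, but it makes no attempt to restructure the tree the way a browser does.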
parse5 applies the HTML error recovery algorithm defined by the spec. For example:
<!-- Input -->
<div>
<p>Start
<div>Nested</div>
Continue?
</div>
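A browser (and parse5) will auto-close the <p> before the inner <div>, so “Continue?” becomes a direct child of the outer <div>. htmlparser2 covers common cases like this one with its implied-close rules, but because it does not implement the full recovery algorithm, its output can diverge from a browser’s on harder input such as misnested formatting tags or content misplaced inside tables.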
Use parse5 when you need predictable, standardized recovery—critical for security-sensitive tools like sanitizers.
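The auto-closing behavior described above can be illustrated with a toy tree builder. This is a deliberately simplified sketch of a single rule (an open `<p>` is closed when certain block-level start tags arrive). It is not the spec's algorithm and not how either library is implemented; the token format, `buildTree`, and the `CLOSES_P` set are assumptions made for the example.

```javascript
// Assumed subset of start tags that close an open <p>.
const CLOSES_P = new Set(['div', 'p', 'ul', 'table']);

// Toy builder: tokens in, nested { name, children } objects out.
function buildTree(tokens) {
  const root = { name: '#root', children: [] };
  const stack = [root];
  const current = () => stack[stack.length - 1];

  for (const token of tokens) {
    if (token.type === 'open') {
      // The one recovery rule modeled here: close a dangling <p> first.
      if (CLOSES_P.has(token.name) && current().name === 'p') stack.pop();
      const element = { name: token.name, children: [] };
      current().children.push(element);
      stack.push(element);
    } else if (token.type === 'close') {
      // Pop up to and including the matching element; ignore stray end tags.
      const index = stack.map((el) => el.name).lastIndexOf(token.name);
      if (index > 0) stack.length = index;
    } else {
      current().children.push(token.value);
    }
  }
  return root;
}

// Tokens for the input above: <div><p>Start<div>Nested</div>Continue?</div>
const tree = buildTree([
  { type: 'open', name: 'div' },
  { type: 'open', name: 'p' },
  { type: 'text', value: 'Start' },
  { type: 'open', name: 'div' },
  { type: 'text', value: 'Nested' },
  { type: 'close', name: 'div' },
  { type: 'text', value: 'Continue?' },
  { type: 'close', name: 'div' },
]);

const outer = tree.children[0];
console.log(outer.children.map((c) => (typeof c === 'string' ? c : c.name)));
// [ 'p', 'div', 'Continue?' ]
```

The real algorithm tracks far more state (scopes, insertion modes, active formatting elements), which is why a spec-compliant parser is the safer base for security-sensitive tools.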
You’re crawling pages and only need <a href> values.
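Recommended: htmlparser2
const { Parser } = require('htmlparser2');
let linkCount = 0;
const parser = new Parser({
  onopentag(name, attribs) {
    if (linkCount >= 10) return; // nothing to do once the limit is reached
    if (name === 'a' && attribs.href) {
      console.log(attribs.href);
      // Finish parsing after the first 10 links
      if (++linkCount >= 10) parser.end();
    }
  }
});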
You render React/Vue to HTML, then manipulate the output before serving.
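Recommended: parse5
const parse5 = require('parse5');
// renderToString and App come from your framework (e.g., React)
const html = renderToString(App);
const document = parse5.parse(html);
// Inject meta tags, modify head, etc.
const finalHtml = parse5.serialize(document);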
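The "inject meta tags" step is elided in the snippet above. One way it might look is plain object manipulation, assuming nodes follow the shape of parse5's default tree (`nodeName`, `tagName`, `attrs`, `namespaceURI`, `childNodes`, `parentNode`). `createElement`, `append`, and `findElement` are hypothetical helpers, and the document is built by hand so the sketch runs without parse5 installed.

```javascript
const HTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: create an element object in the assumed node shape.
function createElement(tagName, attrs = []) {
  return { nodeName: tagName, tagName, attrs, namespaceURI: HTML_NS, childNodes: [], parentNode: null };
}

// Hypothetical helper: append a child and keep parentNode in sync.
function append(parent, child) {
  child.parentNode = parent;
  parent.childNodes.push(child);
  return child;
}

// Hypothetical helper: first element with the given tag name, depth-first.
function findElement(node, tagName) {
  if (node.tagName === tagName) return node;
  for (const child of node.childNodes || []) {
    const match = findElement(child, tagName);
    if (match) return match;
  }
  return null;
}

// Hand-written stand-in for a parsed <html><head></head><body></body></html>.
const doc = { nodeName: '#document', childNodes: [] };
const htmlEl = append(doc, createElement('html'));
append(htmlEl, createElement('head'));
append(htmlEl, createElement('body'));

// The "inject meta tags" step.
const head = findElement(doc, 'head');
append(head, createElement('meta', [
  { name: 'name', value: 'description' },
  { name: 'content', value: 'Rendered on the server' },
]));

console.log(head.childNodes.map((n) => n.tagName)); // [ 'meta' ]
```

On a real tree returned by `parse5.parse(html)`, the same kind of mutation would happen between the parse and `parse5.serialize(document)` calls shown above.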
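Your templates use <MyComponent prop={x}/> syntax.
Recommended: htmlparser2, with xmlMode enabled so tag-name casing and self-closing tags are preserved. parse5 would not reject such input, but it would normalize it as HTML (lowercasing tag names and ignoring the self-closing slash), which mangles the structure.
jsdom uses parse5 internally for parsing, so if you’re already using jsdom, you’re indirectly relying on parse5.
cheerio (jQuery-like server-side DOM) ships with both parsers: current 1.x releases parse HTML with parse5 by default for spec compliance, and can be switched to htmlparser2 when speed or leniency matters more.
// Cheerio with htmlparser2 instead of the default parse5
const cheerio = require('cheerio');
const $ = cheerio.load('<div>Test</div>', {
  // Passing `xml` options selects htmlparser2; xmlMode: false keeps HTML parsing rules
  xml: { xmlMode: false }
});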
| Feature | htmlparser2 | parse5 |
|---|---|---|
| Parsing Style | Streaming (SAX) or DOM (with handler) | Full tree (streaming via companion packages) |
| Spec Compliance | Loose, forgiving | Full WHATWG HTML spec |
| Self-Closing Tags | Any (<x/>), with recognizeSelfClosing or xmlMode | Only void elements (<img/>); slash ignored elsewhere |
| Memory Usage | Low (streaming) | Higher (full tree) |
| DOM Structure | Custom, simplified | Standards-aligned |
| Best For | Scraping, linting, transforms | SSR, testing, sanitization |
Neither is “better”: they’re optimized for different jobs. Pick based on whether your priority is performance and flexibility or accuracy and standards compliance.
Choose parse5 when you require full WHATWG HTML specification compliance, such as when building tools that must replicate browser parsing exactly (e.g., SSR frameworks, testing libraries, or HTML sanitizers). It produces standard DOM trees compatible with the DOM spec and handles edge cases like malformed markup the same way modern browsers do. Prefer it over htmlparser2 when correctness trumps raw speed or memory efficiency.
Choose htmlparser2 if you need a lightweight, high-performance parser for tasks like scraping, linting, or transforming HTML where strict spec compliance isn't critical. Its streaming event-based API allows low-memory processing of large documents, and with xmlMode or recognizeSelfClosing enabled it supports XML-like syntax (e.g., arbitrary self-closing tags), which is useful for JSX-like or templating languages. Avoid it when you need exact browser-compatible parsing behavior.
npm install --save parse5