pdf-parse vs pdf2json vs pdfreader
PDF Parsing Libraries

PDF parsing libraries for Node.js let developers extract text, images, and metadata from PDF files programmatically. They are useful for applications that need to analyze, manipulate, or display PDF content, and they expose APIs for tasks such as text extraction, data mining, and document analysis. Popular options include pdf-parse, pdf2json, and pdfreader: pdf-parse focuses on quick text and metadata extraction, pdf2json converts the whole document to JSON, and pdfreader exposes text items together with layout information.

Stat Detail
Package     Downloads    Stars    Size       Issues    Publish         License
pdf-parse   1,795,310    103      21.3 MB    7         2 months ago    Apache-2.0
pdf2json    335,837      2,167    8.18 MB    74        2 months ago    Apache-2.0
pdfreader   54,448       693      59.6 kB    6         a month ago     MIT
Feature Comparison: pdf-parse vs pdf2json vs pdfreader

Text Extraction

  • pdf-parse:

    pdf-parse provides basic text extraction capabilities, focusing on extracting text content from PDF files. It does not preserve layout or formatting, but it is effective for extracting plain text quickly.

  • pdf2json:

    pdf2json extracts text along with its positioning information, providing a more detailed representation of the text within the PDF. This is useful for applications that need to analyze text layout and structure.

  • pdfreader:

    pdfreader offers text extraction with layout information, including support for extracting text from specific pages or regions. It allows for more controlled and structured extraction of text content.
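
To make "text with positioning information" more concrete, here is a minimal sketch of how positioned items could be regrouped into lines. It assumes each item looks like { x, y, text }; that shape, the function name groupItemsIntoLines, and the tolerance value are illustrative assumptions, not part of any library's API.

```javascript
// Sketch: rebuild text lines from positioned items.
// Assumption: each item has numeric x and y plus a text string.
function groupItemsIntoLines(items, tolerance = 0.1) {
  const rows = []; // each row: { y, items: [] }
  for (const item of items) {
    // Reuse a row whose y is within the tolerance, otherwise start a new one.
    let row = rows.find((r) => Math.abs(r.y - item.y) <= tolerance);
    if (!row) {
      row = { y: item.y, items: [] };
      rows.push(row);
    }
    row.items.push(item);
  }
  return rows
    .sort((a, b) => a.y - b.y) // top to bottom
    .map((row) =>
      row.items
        .sort((a, b) => a.x - b.x) // left to right
        .map((item) => item.text)
        .join(' ')
    );
}

const lines = groupItemsIntoLines([
  { x: 10, y: 2, text: 'world' },
  { x: 1, y: 2, text: 'Hello' },
  { x: 1, y: 5, text: 'Second line' },
]);
console.log(lines);
```

A library that returns only a plain string gives you no way to do this kind of regrouping, which is the practical difference between the three approaches described above.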

Metadata Extraction

  • pdf-parse:

    pdf-parse extracts basic metadata from PDF files, including title, author, and creation date. This information is accessible through a simple API and is useful for quick metadata retrieval.

  • pdf2json:

    pdf2json extracts detailed metadata along with the entire PDF structure, including information about fonts, images, and annotations. This makes it suitable for applications that require comprehensive metadata analysis.

  • pdfreader:

    pdfreader provides access to PDF metadata, including title, author, and custom metadata fields. It allows for easy retrieval of metadata information as part of the PDF reading process.
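
Creation and modification dates in a PDF's info dictionary are often stored as raw strings such as D:20240102030405+02'00'. If a library hands that string back unparsed, a small helper along these lines can convert it. This is a sketch; parsePdfDate is a hypothetical name and the function does not belong to any of the three packages.

```javascript
// Sketch: convert a PDF date string (D:YYYYMMDDHHmmSS+HH'mm') to a Date.
// Every field after the year is optional and defaults to its lowest value.
function parsePdfDate(raw) {
  const pattern =
    /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+-])?(\d{2})?'?(\d{2})?'?$/;
  const m = pattern.exec(raw);
  if (!m) return null;

  const [, year, month = '01', day = '01', hour = '00', minute = '00',
    second = '00', sign, offH = '00', offM = '00'] = m;

  // Interpret the fields as local time at the stated offset.
  const asUtc = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second);

  // Offset in minutes east of UTC; "Z" or a missing sign means zero.
  let offset = 0;
  if (sign === '+' || sign === '-') {
    offset = (Number(offH) * 60 + Number(offM)) * (sign === '+' ? 1 : -1);
  }
  return new Date(asUtc - offset * 60000);
}

console.log(parsePdfDate("D:20240102030405+02'00'").toISOString());
// 2024-01-02T01:04:05.000Z
```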

Image Extraction

  • pdf-parse:

    pdf-parse v1 did not support image extraction and focused solely on text and metadata. Version 2, documented in the README below, adds getImage for extracting embedded images and getScreenshot for rendering pages as PNG.

  • pdf2json:

    pdf2json supports image extraction as part of the PDF-to-JSON conversion process. It captures image data along with its positioning information, allowing for extraction and manipulation of images within the PDF.

  • pdfreader:

    pdfreader does not provide built-in support for image extraction. It focuses on text and metadata, but developers can extend its functionality to handle images if needed.

Output Format

  • pdf-parse:

    pdf-parse outputs extracted text and metadata in a simple, structured format. The text is returned as a string, while metadata is provided as an object, making it easy to work with the extracted data.

  • pdf2json:

    pdf2json converts PDF files into a JSON format, representing the entire PDF structure, including text, images, and metadata. This allows for detailed analysis and manipulation of the PDF content in a programmatic way.

  • pdfreader:

    pdfreader provides extracted text and metadata in a structured format, but it does not convert the PDF into another format. The data is accessible through the library's API, allowing for custom processing and handling.
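
As an illustration of what working with a JSON representation involves, the sketch below flattens a pdf2json-style result into one string per page. The Pages / Texts / R / T nesting and the URI-encoded text runs are assumptions about the output shape and may differ between pdf2json versions, so inspect your own output before relying on it. pagesToText and safeDecode are hypothetical helper names.

```javascript
// Decode a text run if it is URI-encoded; fall back to the raw value otherwise.
function safeDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch {
    return value;
  }
}

// Sketch: collect the text of each page from a pdf2json-style object.
// Assumption: pdfData.Pages[].Texts[].R[].T holds the text runs.
function pagesToText(pdfData) {
  return (pdfData.Pages ?? []).map((page) =>
    (page.Texts ?? [])
      .map((t) => (t.R ?? []).map((run) => safeDecode(run.T)).join(''))
      .join(' ')
  );
}

// Hand-written sample standing in for real parser output.
const sample = {
  Pages: [
    {
      Texts: [
        { x: 1, y: 1, R: [{ T: 'Hello%20PDF' }] },
        { x: 5, y: 1, R: [{ T: 'world' }] },
      ],
    },
  ],
};
console.log(pagesToText(sample));
```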

Ease of Use: Code Examples

  • pdf-parse:

    Extracting text and metadata with pdf-parse

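    const fs = require('fs');
    const { PDFParse } = require('pdf-parse'); // v2 API; v1 exported a function: pdf(buffer)

    const pdfBuffer = fs.readFileSync('example.pdf');

    async function run() {
      const parser = new PDFParse({ data: pdfBuffer });
      try {
        const { text } = await parser.getText();
        const { info } = await parser.getInfo();
        console.log('Text:', text);
        console.log('Metadata:', info);
      } finally {
        await parser.destroy(); // free memory held by the parser
      }
    }

    run().catch(console.error);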
    
  • pdf2json:

    Extracting text and metadata with pdf2json

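    const PDFParser = require('pdf2json');

    const pdfParser = new PDFParser();

    // Report parse failures instead of failing silently
    pdfParser.on('pdfParser_dataError', errData => {
      console.error('Parse error:', errData.parserError);
    });

    // Fires once with the JSON representation of the document
    pdfParser.on('pdfParser_dataReady', pdfData => {
      console.log('PDF Data:', pdfData);
    });

    pdfParser.loadPDF('example.pdf');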
    
  • pdfreader:

    Extracting text with pdfreader

    const fs = require('fs');
    const { PdfReader } = require('pdfreader');
    
    fs.readFile('example.pdf', (err, data) => {
      if (err) throw err;
    
      const reader = new PdfReader();
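      // The callback runs once per item; a missing item marks the end of the file
      reader.parseBuffer(data, (err, item) => {
        if (err) throw err;
        if (!item) return; // end of file
        if (item.text) {
          console.log('Text:', item.text);
        }
      });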
    });
    
How to Choose: pdf-parse vs pdf2json vs pdfreader
  • pdf-parse:

    Choose pdf-parse if you need a simple solution for extracting text and metadata from PDF files. It accepts a buffer or a URL as input and returns the text as a string, making it ideal for quick text extraction tasks.

  • pdf2json:

    Choose pdf2json if you require a comprehensive representation of the PDF structure, including text, images, and metadata. It converts PDF files into a JSON format, allowing for detailed analysis and manipulation of the content.

  • pdfreader:

    Choose pdfreader if you need a library that provides a high-level API for reading PDF files and extracting text. It supports text extraction with layout information and allows for custom processing of PDF content.

README for pdf-parse

pdf-parse

Pure TypeScript, cross-platform module for extracting text, images, and tables from PDFs.
Run 🤗 directly in your browser or in Node!



Getting Started with v2 (Coming from v1)

// v1
// const pdf = require('pdf-parse');
// pdf(buffer).then(result => console.log(result.text));

// v2
const { PDFParse } = require('pdf-parse');
// import { PDFParse } from 'pdf-parse';

async function run() {
	const parser = new PDFParse({ url: 'https://bitcoin.org/bitcoin.pdf' });

	const result = await parser.getText();
	console.log(result.text);
}

run();

Features demo

Installation

npm install pdf-parse
# or
pnpm add pdf-parse
# or
yarn add pdf-parse
# or
bun add pdf-parse

CLI Installation

For command-line usage, install the package globally:

npm install -g pdf-parse

Or use it directly with npx:

npx pdf-parse --help

For detailed CLI documentation and usage examples, see: CLI Documentation

Usage

getHeader — Node Utility: PDF Header Retrieval and Validation

// Important: getHeader is available from the 'pdf-parse/node' submodule
import { getHeader } from 'pdf-parse/node';

// Retrieve HTTP headers and file size without downloading the full file.
// Pass `true` to check PDF magic bytes via range request.
// Optionally validates PDFs by fetching the first 4 bytes (magic bytes).
// Useful for checking file existence, size, and type before full parsing.
// Node only, will not work in browser environments.
const result = await getHeader('https://bitcoin.org/bitcoin.pdf', true);

console.log(`Status: ${result.status}`);
console.log(`Content-Length: ${result.size}`);
console.log(`Is PDF: ${result.isPdf}`);
console.log(`Headers:`, result.headers);
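
For readers unfamiliar with the term, "magic bytes" here means the four bytes %PDF that open every PDF file. The sketch below shows what such a check amounts to on bytes you already hold. It is an illustration only, not the implementation behind getHeader, and hasPdfMagicBytes is a hypothetical name.

```javascript
// Sketch: check whether a byte sequence starts with "%PDF".
// Works with a Uint8Array, a Node Buffer, or a plain array of numbers.
function hasPdfMagicBytes(bytes) {
  const magic = [0x25, 0x50, 0x44, 0x46]; // "%", "P", "D", "F"
  return bytes.length >= 4 && magic.every((b, i) => bytes[i] === b);
}

const encoder = new TextEncoder();
console.log(hasPdfMagicBytes(encoder.encode('%PDF-1.7'))); // true
console.log(hasPdfMagicBytes(encoder.encode('<html>'))); // false
```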

getInfo — Extract Metadata and Document Information

import { readFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';

const link = 'https://mehmet-kozan.github.io/pdf-parse/pdf/climate.pdf';
// const buffer = await readFile('reports/pdf/climate.pdf');
// const parser = new PDFParse({ data: buffer });

const parser = new PDFParse({ url: link });
const result = await parser.getInfo({ parsePageInfo: true });
await parser.destroy();

console.log(`Total pages: ${result.total}`);
console.log(`Title: ${result.info?.Title}`);
console.log(`Author: ${result.info?.Author}`);
console.log(`Creator: ${result.info?.Creator}`);
console.log(`Producer: ${result.info?.Producer}`);

// Access parsed date information
const dates = result.getDateNode();
console.log(`Creation Date: ${dates.CreationDate}`);
console.log(`Modification Date: ${dates.ModDate}`);

// Links, pageLabel, width, height (when `parsePageInfo` is true)
console.log('Per-page information:');
console.log(JSON.stringify(result.pages, null, 2));

getText — Extract Text

import { PDFParse } from 'pdf-parse';

const parser = new PDFParse({ url: 'https://bitcoin.org/bitcoin.pdf' });
const result = await parser.getText();
// to extract text from page 3 only:
// const result = await parser.getText({ partial: [3] });
await parser.destroy();
console.log(result.text);

For a complete list of configuration options, see:

Usage Examples:

getScreenshot — Render Pages as PNG

import { readFile, writeFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';

const link = 'https://bitcoin.org/bitcoin.pdf';
// const buffer = await readFile('reports/pdf/bitcoin.pdf');
// const parser = new PDFParse({ data: buffer });

const parser = new PDFParse({ url: link });

// scale:1 for original page size.
// scale:1.5 50% bigger.
const result = await parser.getScreenshot({ scale: 1.5 });

await parser.destroy();
await writeFile('bitcoin.png', result.pages[0].data);

Usage Examples:

  • Limit output resolution or specific pages using ParseParameters
  • getScreenshot({scale:1.5}) — Increase rendering scale (higher DPI / larger image)
  • getScreenshot({desiredWidth:1024}) — Request a target width in pixels; height scales to keep aspect ratio
  • imageDataUrl (default: true) — include base64 data URL string in the result.
  • imageBuffer (default: true) — include a binary buffer for each image.
  • Select specific pages with partial (e.g. getScreenshot({ partial: [1,3] }))
  • partial overrides first/last.
  • Use first to render the first N pages (e.g. getScreenshot({ first: 3 })).
  • Use last to render the last N pages (e.g. getScreenshot({ last: 2 })).
  • When both first and last are provided they form an inclusive range (first..last).
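
The page-selection rules above can be summarized in a few lines of code. The following is a sketch of the documented behavior, not the library's own implementation; resolvePages is a hypothetical helper, and returning every page when no option is set is an assumption.

```javascript
// Sketch: which 1-based page numbers would be processed for given options.
function resolvePages(total, { partial, first, last } = {}) {
  const all = Array.from({ length: total }, (_, i) => i + 1);

  // partial overrides first/last.
  if (partial && partial.length > 0) {
    return all.filter((p) => partial.includes(p));
  }
  // Both set: inclusive range first..last.
  if (first && last) return all.filter((p) => p >= first && p <= last);
  // Only first: the first N pages.
  if (first) return all.slice(0, first);
  // Only last: the last N pages.
  if (last) return all.slice(-last);
  // Assumption: no option means every page.
  return all;
}

console.log(resolvePages(10, { first: 3 })); // [ 1, 2, 3 ]
console.log(resolvePages(10, { last: 2 })); // [ 9, 10 ]
console.log(resolvePages(10, { first: 2, last: 4 })); // [ 2, 3, 4 ]
console.log(resolvePages(10, { partial: [1, 3], first: 5 })); // [ 1, 3 ]
```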

getImage — Extract Embedded Images

import { readFile, writeFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';

const link = new URL('https://mehmet-kozan.github.io/pdf-parse/pdf/image-test.pdf');
// const buffer = await readFile('reports/pdf/image-test.pdf');
// const parser = new PDFParse({ data: buffer });

const parser = new PDFParse({ url: link });
const result = await parser.getImage();
await parser.destroy();

await writeFile('adobe.png', result.pages[0].images[0].data);

Usage Examples:

  • Exclude images with width or height <= 50 px: getImage({ imageThreshold: 50 })
  • Default imageThreshold is 80 (pixels)
  • Useful for excluding tiny decorative or tracking images.
  • To disable size-based filtering and include all images, set imageThreshold: 0.
  • imageDataUrl (default: true) — include base64 data URL string in the result.
  • imageBuffer (default: true) — include a binary buffer for each image.
  • Extract images from specific pages: getImage({ partial: [2,4] })
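
The size filter described above behaves roughly like the following sketch. It assumes each image record carries numeric width and height fields; those field names and the helper filterBySize are illustrative and not taken from the library.

```javascript
// Sketch: drop images whose width or height is at or below the threshold.
// A threshold of 0 disables filtering, mirroring the documented option.
function filterBySize(images, imageThreshold = 80) {
  if (imageThreshold <= 0) return images.slice();
  return images.filter(
    (img) => img.width > imageThreshold && img.height > imageThreshold
  );
}

// Hypothetical image records for demonstration.
const found = [
  { name: 'logo', width: 400, height: 120 },
  { name: 'tracking-pixel', width: 1, height: 1 },
  { name: 'icon', width: 64, height: 64 },
];

console.log(filterBySize(found).map((i) => i.name)); // [ 'logo' ]
console.log(filterBySize(found, 50).map((i) => i.name)); // [ 'logo', 'icon' ]
console.log(filterBySize(found, 0).length); // 3
```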

getTable — Extract Tabular Data

import { readFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';

const link = new URL('https://mehmet-kozan.github.io/pdf-parse/pdf/simple-table.pdf');
// const buffer = await readFile('reports/pdf/simple-table.pdf');
// const parser = new PDFParse({ data: buffer });

const parser = new PDFParse({ url: link });
const result = await parser.getTable();
await parser.destroy();

// Pretty-print each row of the first table
for (const row of result.pages[0].tables[0]) {
	console.log(JSON.stringify(row));
}
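
Once rows are available, a common next step is exporting them. The sketch below turns rows into CSV text, assuming each row is an array of cell values. That row shape is an assumption (the example above only shows rows being serialized with JSON.stringify), and rowsToCsv is a hypothetical helper.

```javascript
// Sketch: serialize table rows to CSV.
// Cells containing commas, quotes, or line breaks are quoted, and inner
// quotes are doubled.
function rowsToCsv(rows) {
  const escapeCell = (cell) => {
    const s = String(cell ?? '');
    return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  return rows.map((row) => row.map(escapeCell).join(',')).join('\r\n');
}

const csv = rowsToCsv([
  ['Name', 'Note'],
  ['Ada', 'says "hi", twice'],
]);
console.log(csv);
```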

Exception Handling & Type Usage

import type { LoadParameters, ParseParameters, TextResult } from 'pdf-parse';
import { PasswordException, PDFParse, VerbosityLevel } from 'pdf-parse';

const loadParams: LoadParameters = {
	url: 'https://mehmet-kozan.github.io/pdf-parse/pdf/password-123456.pdf',
	verbosity: VerbosityLevel.WARNINGS,
	password: 'abcdef',
};

const parseParams: ParseParameters = {
	first: 1,
};

// Initialize the parser class without executing any code yet
const parser = new PDFParse(loadParams);

function handleResult(result: TextResult) {
	console.log(result.text);
}

try {
	const result = await parser.getText(parseParams);
	handleResult(result);
} catch (error) {
	// InvalidPDFException
	// PasswordException
	// FormatError
	// ResponseException
	// AbortException
	// UnknownErrorException
	if (error instanceof PasswordException) {
		console.error('Password must be 123456\n', error);
	} else {
		throw error;
	}
} finally {
	// Always call destroy() to free memory
	await parser.destroy();
}

Web / Browser

CDN Usage

<!-- ES Module -->
<script type="module">

  import {PDFParse} from 'https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.es.js';
  //// Available Worker Files
  // pdf.worker.mjs
  // pdf.worker.min.mjs
  // If you use a custom build or host pdf.worker.mjs yourself, configure worker accordingly.
  PDFParse.setWorker('https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.mjs');

  const parser = new PDFParse({url:'https://mehmet-kozan.github.io/pdf-parse/pdf/bitcoin.pdf'});
  const result = await parser.getText();

  console.log(result.text)
</script>

CDN Options: https://www.jsdelivr.com/package/npm/pdf-parse

  • https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.es.js
  • https://cdn.jsdelivr.net/npm/pdf-parse@2.4.5/dist/pdf-parse/web/pdf-parse.es.js
  • https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.umd.js
  • https://cdn.jsdelivr.net/npm/pdf-parse@2.4.5/dist/pdf-parse/web/pdf-parse.umd.js

Worker Options:

  • https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.mjs
  • https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.min.mjs

Similar Packages

Benchmark Note: The benchmark currently runs only against pdf2json. I don't know the current state of pdf2json; the original reason for creating pdf-parse was to work around stability issues with pdf2json. I deliberately did not include pdf-parse or other pdf.js-based packages in the benchmark because their dependencies conflict. If you have recommendations for additional packages to include, please open an issue. See the benchmark results.

Supported Node.js Versions (20.x, 22.x, 23.x, 24.x)

  • Supported: Node.js 20 (>= 20.16.0), Node.js 22 (>= 22.3.0), Node.js 23 (>= 23.0.0), and Node.js 24 (>= 24.0.0).
  • Not supported: Node.js 21.x, and Node.js 19.x and earlier.

Integration tests run on Node.js 20–24; see test_integration.yml.
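
If you want to fail fast on an unsupported runtime, the version rules above can be encoded in a small startup check. This is a sketch: isSupportedNode is a hypothetical helper, and treating any major version missing from the list as unsupported is an assumption.

```javascript
// Minimum supported version per major line, taken from the list above.
const MINIMUMS = {
  20: [20, 16, 0],
  22: [22, 3, 0],
  23: [23, 0, 0],
  24: [24, 0, 0],
};

// Sketch: return true when a version string such as "v22.4.1" meets the minimum.
function isSupportedNode(version) {
  const parts = version.replace(/^v/, '').split('.').map(Number);
  const min = MINIMUMS[parts[0]];
  if (!min) return false; // assumption: unlisted majors are unsupported

  // Compare major, minor, patch in order.
  for (let i = 0; i < 3; i++) {
    const have = parts[i] ?? 0;
    if (have > min[i]) return true;
    if (have < min[i]) return false;
  }
  return true; // exactly the minimum version
}

console.log(process.version, isSupportedNode(process.version));
```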

Unsupported Node.js Versions (18.x, 19.x, 21.x)

Requires additional setup; see docs/troubleshooting.md.

Worker Configuration & Troubleshooting

See docs/troubleshooting.md for detailed troubleshooting steps and worker configuration for Node.js and serverless environments.

  • Worker setup for Node.js, Next.js, Vercel, AWS Lambda, Netlify, Cloudflare Workers.
  • Common error messages and solutions.
  • Manual worker configuration for custom builds and Electron/NW.js.
  • Node.js version compatibility.

If you encounter issues, please refer to the Troubleshooting Guide.

Contributing

When opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently. For detailed guidelines on how to contribute, report bugs, or submit pull requests, see: contributing to pdf-parse