Pure TypeScript, cross-platform module for extracting text, images, and tables from PDFs.
Runs directly in your browser or in Node!
// v1
// const pdf = require('pdf-parse');
// pdf(buffer).then(result => console.log(result.text));
// v2
const { PDFParse } = require('pdf-parse');
// import { PDFParse } from 'pdf-parse';
async function run() {
  const parser = new PDFParse({ url: 'https://bitcoin.org/bitcoin.pdf' });
  const result = await parser.getText();
  await parser.destroy(); // free memory once the text has been extracted
  console.log(result.text);
}
run();
- Works with React, Vue, Angular, or any other web framework.
- CLI Documentation
- Security Policy
- Methods: getHeader, getInfo, getText, getScreenshot, getImage, getTable
- Unit tests
- Integration tests to validate end-to-end behavior across environments.
- See the live demo, examples, tests and tests example folders.
- Supports Next.js + Vercel, Netlify, AWS Lambda, Cloudflare Workers.

npm install pdf-parse
# or
pnpm add pdf-parse
# or
yarn add pdf-parse
# or
bun add pdf-parse
For command-line usage, install the package globally:
npm install -g pdf-parse
Or use it directly with npx:
npx pdf-parse --help
For detailed CLI documentation and usage examples, see: CLI Documentation
getHeader — Node Utility: PDF Header Retrieval and Validation
// Important: getHeader is available from the 'pdf-parse/node' submodule
import { getHeader } from 'pdf-parse/node';
// Retrieve HTTP headers and file size without downloading the full file.
// Pass `true` to check PDF magic bytes via range request.
// Optionally validates PDFs by fetching the first 4 bytes (magic bytes).
// Useful for checking file existence, size, and type before full parsing.
// Node only, will not work in browser environments.
const result = await getHeader('https://bitcoin.org/bitcoin.pdf', true);
console.log(`Status: ${result.status}`);
console.log(`Content-Length: ${result.size}`);
console.log(`Is PDF: ${result.isPdf}`);
console.log(`Headers:`, result.headers);
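The fields logged above could feed a simple pre-flight check before a full download. The sketch below is only an illustration: `HeaderSummary`, `shouldParse` and `maxBytes` are hypothetical names that are not part of pdf-parse, and the object shape is modelled solely on the `status`, `size` and `isPdf` fields shown in the example.

```typescript
// Hypothetical shape, modelled only on the fields logged in the example;
// the real getHeader result may carry more or differently typed fields.
interface HeaderSummary {
  status: number;
  size: number;
  isPdf: boolean;
}

type PreflightDecision =
  | { parse: true }
  | { parse: false; reason: string };

// Decide whether a full download and parse is worthwhile.
// maxBytes is an illustrative limit, not a pdf-parse option.
function shouldParse(header: HeaderSummary, maxBytes: number): PreflightDecision {
  if (header.status < 200 || header.status >= 300) {
    return { parse: false, reason: `unexpected HTTP status ${header.status}` };
  }
  if (!header.isPdf) {
    return { parse: false, reason: 'magic bytes do not identify a PDF' };
  }
  if (header.size > maxBytes) {
    return { parse: false, reason: `file is ${header.size} bytes, limit is ${maxBytes}` };
  }
  return { parse: true };
}

// Example with made-up values: a small PDF that passes all three checks.
const preflight = shouldParse({ status: 200, size: 184292, isPdf: true }, 5 * 1024 * 1024);
console.log(preflight);
```

In real use the argument would come from the `getHeader` call, and the parser would only be constructed when `parse` is `true`.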
getInfo — Extract Metadata and Document Information
import { readFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';
const link = 'https://mehmet-kozan.github.io/pdf-parse/pdf/climate.pdf';
// const buffer = await readFile('reports/pdf/climate.pdf');
// const parser = new PDFParse({ data: buffer });
const parser = new PDFParse({ url: link });
const result = await parser.getInfo({ parsePageInfo: true });
await parser.destroy();
console.log(`Total pages: ${result.total}`);
console.log(`Title: ${result.info?.Title}`);
console.log(`Author: ${result.info?.Author}`);
console.log(`Creator: ${result.info?.Creator}`);
console.log(`Producer: ${result.info?.Producer}`);
// Access parsed date information
const dates = result.getDateNode();
console.log(`Creation Date: ${dates.CreationDate}`);
console.log(`Modification Date: ${dates.ModDate}`);
// Links, pageLabel, width, height (when `parsePageInfo` is true)
console.log('Per-page information:');
console.log(JSON.stringify(result.pages, null, 2));
getText — Extract Text
import { PDFParse } from 'pdf-parse';
const parser = new PDFParse({ url: 'https://bitcoin.org/bitcoin.pdf' });
const result = await parser.getText();
// to extract text from page 3 only:
// const result = await parser.getText({ partial: [3] });
await parser.destroy();
console.log(result.text);
For a complete list of configuration options, see:
Usage Examples:
- password.test.ts
- specific-pages.test.ts
- hyperlink.test.ts
- url.test.ts
- base64.test.ts
- large-file.test.ts

getScreenshot — Render Pages as PNG
import { readFile, writeFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';
const link = 'https://bitcoin.org/bitcoin.pdf';
// const buffer = await readFile('reports/pdf/bitcoin.pdf');
// const parser = new PDFParse({ data: buffer });
const parser = new PDFParse({ url: link });
// scale: 1 renders at the original page size.
// scale: 1.5 renders 50% larger.
const result = await parser.getScreenshot({ scale: 1.5 });
await parser.destroy();
await writeFile('bitcoin.png', result.pages[0].data);
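To make the `scale` comments concrete, here is a small arithmetic sketch of how a scale factor relates to output size, and how a target width (as with the `desiredWidth` option listed under the usage examples) implies a scale. It assumes one PDF point maps to one pixel at scale 1 and rounds to whole pixels; the library's actual rounding may differ. `PageSize`, `scaledSize` and `scaleForWidth` are illustrative names, not pdf-parse APIs.

```typescript
interface PageSize {
  width: number;
  height: number;
}

// Output size for a given scale factor (scale 1 = original size).
// Rounding to whole pixels is an assumption of this sketch.
function scaledSize(page: PageSize, scale: number): PageSize {
  return {
    width: Math.round(page.width * scale),
    height: Math.round(page.height * scale),
  };
}

// Scale factor needed to reach a target width while keeping the aspect ratio.
function scaleForWidth(page: PageSize, desiredWidth: number): number {
  return desiredWidth / page.width;
}

// US Letter page measured in PDF points (612 x 792).
const letterPage: PageSize = { width: 612, height: 792 };

console.log(scaledSize(letterPage, 1.5)); // { width: 918, height: 1188 }
console.log(scaledSize(letterPage, scaleForWidth(letterPage, 1024))); // { width: 1024, height: 1325 }
```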
Usage Examples:
- getScreenshot({scale:1.5}): increase rendering scale (higher DPI / larger image).
- getScreenshot({desiredWidth:1024}): request a target width in pixels; height scales to keep the aspect ratio.
- imageDataUrl (default: true): include a base64 data URL string in the result.
- imageBuffer (default: true): include a binary buffer for each image.
- Select specific pages with partial (e.g. getScreenshot({ partial: [1,3] })).
- partial overrides first/last.
- Use first to render the first N pages (e.g. getScreenshot({ first: 3 })).
- Use last to render the last N pages (e.g. getScreenshot({ last: 2 })).
- When both first and last are provided they form an inclusive range (first..last).

getImage — Extract Embedded Images
import { readFile, writeFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';
const link = new URL('https://mehmet-kozan.github.io/pdf-parse/pdf/image-test.pdf');
// const buffer = await readFile('reports/pdf/image-test.pdf');
// const parser = new PDFParse({ data: buffer });
const parser = new PDFParse({ url: link });
const result = await parser.getImage();
await parser.destroy();
await writeFile('adobe.png', result.pages[0].images[0].data);
Usage Examples:
- getImage({ imageThreshold: 50 })
- The default imageThreshold is 80 (pixels).
- To include images of any size, set imageThreshold: 0.
- imageDataUrl (default: true): include a base64 data URL string in the result.
- imageBuffer (default: true): include a binary buffer for each image.
- getImage({ partial: [2,4] })

getTable — Extract Tabular Data
import { readFile } from 'node:fs/promises';
import { PDFParse } from 'pdf-parse';
const link = new URL('https://mehmet-kozan.github.io/pdf-parse/pdf/simple-table.pdf');
// const buffer = await readFile('reports/pdf/simple-table.pdf');
// const parser = new PDFParse({ data: buffer });
const parser = new PDFParse({ url: link });
const result = await parser.getTable();
await parser.destroy();
// Pretty-print each row of the first table
for (const row of result.pages[0].tables[0]) {
  console.log(JSON.stringify(row));
}
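Rows extracted this way are often exported for spreadsheets. The sketch below converts one table to CSV text; it assumes a table is an array of rows and each row an array of string cells, which is inferred from the loop above rather than taken from the library's type definitions. `csvCell` and `tableToCsv` are hypothetical helpers, not part of pdf-parse.

```typescript
// Quote a cell when it contains a comma, a double quote, or a line break,
// doubling any embedded quotes (RFC 4180 style).
function csvCell(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Assumption: a table is an array of rows, each row an array of string cells.
function tableToCsv(table: string[][]): string {
  return table.map((row) => row.map(csvCell).join(',')).join('\r\n');
}

// Stand-in data; in real use this would be result.pages[0].tables[0].
const sampleTable: string[][] = [
  ['Name', 'Note'],
  ['Widget, large', 'said "ok"'],
];

console.log(tableToCsv(sampleTable));
```

The resulting string can be written to disk with `writeFile` from `node:fs/promises`, as in the other examples.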
import type { LoadParameters, ParseParameters, TextResult } from 'pdf-parse';
import { PasswordException, PDFParse, VerbosityLevel } from 'pdf-parse';
const loadParams: LoadParameters = {
  url: 'https://mehmet-kozan.github.io/pdf-parse/pdf/password-123456.pdf',
  verbosity: VerbosityLevel.WARNINGS,
  password: 'abcdef',
};
const parseParams: ParseParameters = {
  first: 1,
};
// Initialize the parser class without executing any code yet
const parser = new PDFParse(loadParams);
function handleResult(result: TextResult) {
  console.log(result.text);
}
try {
  const result = await parser.getText(parseParams);
  handleResult(result);
} catch (error) {
  // InvalidPDFException
  // PasswordException
  // FormatError
  // ResponseException
  // AbortException
  // UnknownErrorException
  if (error instanceof PasswordException) {
    console.error('Password must be 123456\n', error);
  } else {
    throw error;
  }
} finally {
  // Always call destroy() to free memory
  await parser.destroy();
}
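The `first: 1` setting in `parseParams` belongs to the same family as the `partial`, `first` and `last` options listed for getScreenshot. The following is a minimal sketch of those selection rules as the README states them (partial overrides first/last, first N pages, last N pages, both together as an inclusive range). It is not the library's implementation: `PageSelection` and `selectPages` are hypothetical names, and dropping out-of-range page numbers is an assumption.

```typescript
interface PageSelection {
  partial?: number[];
  first?: number;
  last?: number;
}

// Return the 1-based page numbers selected for a document with `total` pages.
function selectPages(total: number, opts: PageSelection = {}): number[] {
  const all = Array.from({ length: total }, (_, i) => i + 1);
  const { partial, first, last } = opts;

  if (partial && partial.length > 0) {
    // partial overrides first/last; out-of-range pages are dropped (assumption).
    return partial.filter((p) => p >= 1 && p <= total);
  }
  if (first !== undefined && last !== undefined) {
    // Both given: inclusive range first..last.
    return all.filter((p) => p >= first && p <= last);
  }
  if (first !== undefined) {
    // Only first: the first N pages.
    return all.slice(0, Math.max(first, 0));
  }
  if (last !== undefined) {
    // Only last: the last N pages.
    return last > 0 ? all.slice(-last) : [];
  }
  return all;
}

console.log(selectPages(9, { first: 3 })); // [ 1, 2, 3 ]
console.log(selectPages(9, { last: 2 })); // [ 8, 9 ]
console.log(selectPages(9, { first: 2, last: 4 })); // [ 2, 3, 4 ]
console.log(selectPages(9, { partial: [1, 3], first: 5 })); // [ 1, 3 ]
```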
- Works with React, Vue, Angular, or any other web framework.
- https://mehmet-kozan.github.io/pdf-parse/reports/demo
- ES Module: pdf-parse.es.js; UMD/Global: pdf-parse.umd.js
- Set the web worker explicitly.

<!-- ES Module -->
<script type="module">
  import { PDFParse } from 'https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.es.js';
  // Available worker files:
  //   pdf.worker.mjs
  //   pdf.worker.min.mjs
  // If you use a custom build or host pdf.worker.mjs yourself, configure the worker accordingly.
  PDFParse.setWorker('https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.mjs');
  const parser = new PDFParse({ url: 'https://mehmet-kozan.github.io/pdf-parse/pdf/bitcoin.pdf' });
  const result = await parser.getText();
  console.log(result.text);
</script>
CDN Options: https://www.jsdelivr.com/package/npm/pdf-parse
- ES module (latest): https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.es.js
- ES module (pinned to 2.4.5): https://cdn.jsdelivr.net/npm/pdf-parse@2.4.5/dist/pdf-parse/web/pdf-parse.es.js
- UMD (latest): https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.umd.js
- UMD (pinned to 2.4.5): https://cdn.jsdelivr.net/npm/pdf-parse@2.4.5/dist/pdf-parse/web/pdf-parse.umd.js

Worker Options:
- https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.mjs
- https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.min.mjs

Benchmark Note: The benchmark currently runs only against pdf2json. I don't know the current state of pdf2json; the original reason for creating pdf-parse was to work around stability issues with pdf2json. I deliberately did not include pdf-parse or other pdf.js-based packages in the benchmark because their dependencies conflict. If you have recommendations for additional packages to include, please open an issue. See benchmark results.
Integration tests run on Node.js 20–24; see test_integration.yml.
Requires additional setup; see docs/troubleshooting.md.
See docs/troubleshooting.md for detailed troubleshooting steps and worker configuration for Node.js and serverless environments.
If you encounter issues, please refer to the Troubleshooting Guide.
When opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently. For detailed guidelines on how to contribute, report bugs, or submit pull requests, see: contributing to pdf-parse