A pure TypeScript/JavaScript, cross-platform module for extracting text, images, and tabular data from PDF files.
Contributing Note: When opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently. For detailed guidelines on how to contribute, report bugs, or submit pull requests, see:
contributing to pdf-parse
security policy
getText
getImage
pageToImage
getTable
test folder and live demo
gh-pages branch
pdf-parse based
pdf-parse based
Benchmark Note: The benchmark currently runs only against pdf2json. I don't know the current state of pdf2json; the original reason for creating pdf-parse was to work around stability issues with pdf2json. I deliberately did not include pdf-parse or other pdf.js-based packages in the benchmark because their dependencies conflict. If you have recommendations for additional packages to include, please open an issue.
benchmark results
npm install pdf-parse
# or
pnpm add pdf-parse
# or
yarn add pdf-parse
# or
bun add pdf-parse
const pdf = require('pdf-parse');
// or
// const {pdf,PDFParse} = require('pdf-parse');
const fs = require('fs');
const data = fs.readFileSync('test.pdf');
pdf(data).then((result) => {
  console.log(result.text);
});
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const textResult = await parser.getText();
console.log(textResult.text);
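Extracted text usually needs some post-processing before it is useful. The sketch below assumes only that getText() resolves to an object with a string text property, as in the example above. The helper name nonEmptyLines is hypothetical and not part of pdf-parse.

```javascript
// Hypothetical helper (not part of pdf-parse): split extracted text into
// trimmed, non-empty lines. `text` is assumed to be the string from `result.text`.
function nonEmptyLines(text) {
  return text
    .split(/\r?\n/) // accept both LF and CRLF line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Stand-in string used instead of a real PDF so the snippet runs on its own.
const sampleText = 'Title\n\n  First paragraph  \r\nSecond paragraph\n';
console.log(nonEmptyLines(sampleText));
```

In real use, pass textResult.text from the example above instead of the stand-in string.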
For a complete list of configuration options, see:
DocumentInitParameters - PDF.js document initialization options
ParseParameters - pdf-parse specific options
Usage Examples
test/test-06-password
test/test-parse-parameters
test/test-hyperlinks
test/test-types
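As a rough illustration of how initialization options might be assembled, the sketch below builds the object passed to the PDFParse constructor. It assumes the constructor accepts PDF.js DocumentInitParameters fields such as data and password; the helper buildInitOptions is hypothetical, and the test folders listed above remain the authoritative examples.

```javascript
// Hypothetical helper: build the options object for `new PDFParse(...)`.
// Assumption: the constructor accepts PDF.js DocumentInitParameters, where
// `data` holds the PDF bytes and `password` opens an encrypted document.
function buildInitOptions(data, password) {
  const options = { data };
  if (password !== undefined) {
    options.password = password; // only set when a password is supplied
  }
  return options;
}

// Placeholder bytes (the "%PDF" header) standing in for a real file.
const placeholderBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]);
console.log(buildInitOptions(placeholderBytes, 'secret'));
// Intended use (not run here): new PDFParse(buildInitOptions(buffer, 'secret'))
```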
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.pageToImage();
for (const pageData of result.pages) {
  const imgFileName = `page_${pageData.pageNumber}.png`;
  await writeFile(imgFileName, pageData.data, { flag: 'w' });
}
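With file names like those above, page_10.png sorts before page_2.png in a plain alphabetical listing. One possible workaround, sketched here, is to zero-pad the page number. The helper pageImageFileName is hypothetical; it only assumes that pageNumber is a positive integer, as used in the loop above.

```javascript
// Hypothetical helper: zero-pad page numbers so output files sort in page order.
function pageImageFileName(pageNumber, totalPages) {
  // Pad to the number of digits in the highest page number.
  const width = String(totalPages).length;
  return `page_${String(pageNumber).padStart(width, '0')}.png`;
}

console.log(pageImageFileName(3, 120)); // page_003.png
console.log(pageImageFileName(7, 9));   // page_7.png
```

In the loop above, this could be called as pageImageFileName(pageData.pageNumber, result.pages.length).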
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile, writeFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.getImage();
for (const pageData of result.pages) {
  for (const pageImage of pageData.images) {
    const imgFileName = `page_${pageData.pageNumber}-${pageImage.fileName}.png`;
    await writeFile(imgFileName, pageImage.data, { flag: 'w' });
  }
}
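Before writing files, it can help to check how many embedded images each page contains. The sketch below relies only on the result shape used in the loop above (result.pages, pageData.pageNumber, pageData.images); the helper countImagesPerPage and the stand-in data are hypothetical.

```javascript
// Hypothetical helper: map each page number to its number of embedded images.
function countImagesPerPage(result) {
  const counts = new Map();
  for (const pageData of result.pages) {
    counts.set(pageData.pageNumber, pageData.images.length);
  }
  return counts;
}

// Stand-in object shaped like the getImage() result used above.
const fakeImageResult = {
  pages: [
    { pageNumber: 1, images: [{ fileName: 'img_0', data: new Uint8Array(0) }] },
    { pageNumber: 2, images: [] },
  ],
};
console.log(countImagesPerPage(fakeImageResult));
```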
// Node / ESM
import { PDFParse } from 'pdf-parse';
import { readFile } from 'node:fs/promises';
const buffer = await readFile('test/test-01/test.pdf');
const parser = new PDFParse({ data: buffer });
const result = await parser.getTable();
for (const pageData of result.pages) {
  for (const table of pageData.tables) {
    console.log(table);
  }
}
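Instead of logging a table, you may want to export it. The sketch below converts one table to CSV under a loud assumption: each table is an array of rows, and each row is an array of cell values. The shape is not stated above, so log a table first and adjust if it differs. The helpers csvCell and tableToCsv are hypothetical.

```javascript
// Hypothetical helper: quote a cell only when it contains a comma, quote, or newline.
function csvCell(value) {
  const text = String(value ?? ''); // treat null/undefined cells as empty
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical helper. ASSUMPTION: `rows` is an array of rows, each an array of cells.
function tableToCsv(rows) {
  return rows.map((row) => row.map(csvCell).join(',')).join('\n');
}

// Stand-in data shaped like the assumed table structure.
const fakeTable = [
  ['Item', 'Price'],
  ['Widget, large', '4.50'],
];
console.log(tableToCsv(fakeTable));
```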
After running npm run build, you will find both regular and minified browser bundles in dist/browser (e.g., pdf-parse.es.js and pdf-parse.es.min.js).
live demo
gh-pages branch
Use the minified versions (.min.js
) for production to reduce file size, or the regular versions for development and debugging.
You can use any of the following browser bundles depending on your module system and requirements:
pdf-parse.es.js or pdf-parse.es.min.js for ES modules
pdf-parse.umd.js or pdf-parse.umd.min.js for UMD/global usage
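The naming scheme of the four bundles listed above can be captured in a small function. This is only an illustration of the convention; the helper bundleFileName is hypothetical and not exported by pdf-parse.

```javascript
// Hypothetical helper: pick a bundle file name from the naming scheme above.
// `format` is 'es' (ES modules) or 'umd' (UMD/global); `minified` selects .min.js.
function bundleFileName(format, minified) {
  if (format !== 'es' && format !== 'umd') {
    throw new Error(`Unknown bundle format: ${format}`);
  }
  return `pdf-parse.${format}${minified ? '.min' : ''}.js`;
}

console.log(bundleFileName('es', true));   // pdf-parse.es.min.js
console.log(bundleFileName('umd', false)); // pdf-parse.umd.js
```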
You can include the browser bundle directly from a CDN. Use the latest version:
Or specify a particular version:
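As a sketch of how both URL forms can be built, the function below assumes jsDelivr as the CDN (its npm URL scheme is https://cdn.jsdelivr.net/npm/package@version/file) and assumes the published package ships its bundles under dist/browser, matching the build output described above. Neither assumption is confirmed here, so verify the path against the published package. The helper jsDelivrUrl is hypothetical, and the version string in the example is a placeholder, not a real release.

```javascript
// Hypothetical helper. ASSUMPTIONS: jsDelivr as the CDN, and bundles published
// under dist/browser as in the local build output.
function jsDelivrUrl(fileName, version = 'latest') {
  return `https://cdn.jsdelivr.net/npm/pdf-parse@${version}/dist/browser/${fileName}`;
}

// Latest version:
console.log(jsDelivrUrl('pdf-parse.es.min.js'));
// Pinned version ('1.2.3' is a placeholder, not a real release number):
console.log(jsDelivrUrl('pdf-parse.es.min.js', '1.2.3'));
```

Pinning a version avoids unexpected changes when a new release is published, while the latest form always tracks the newest one.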
Worker Note: In browser environments, the package sets pdfjs.GlobalWorkerOptions.workerSrc automatically when imported from the built browser bundle. If you use a custom build or host pdf.worker yourself, configure pdfjs accordingly.
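For the self-hosted case, configuring pdfjs comes down to assigning the worker URL to GlobalWorkerOptions.workerSrc. In the sketch below, the helper configureWorker and the worker path are hypothetical, and a stand-in object replaces the real pdfjs module so the snippet runs on its own. How you obtain the pdfjs module depends on your build.

```javascript
// Hypothetical helper: point a pdfjs module at a self-hosted worker file.
// `pdfjs` is whatever PDF.js module object your build exposes.
function configureWorker(pdfjs, workerUrl) {
  pdfjs.GlobalWorkerOptions.workerSrc = workerUrl;
  return pdfjs.GlobalWorkerOptions.workerSrc;
}

// Stand-in for the real pdfjs module; the worker path is a placeholder.
const pdfjsStandIn = { GlobalWorkerOptions: { workerSrc: '' } };
console.log(configureWorker(pdfjsStandIn, '/assets/pdf.worker.min.mjs'));
```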