pdf-parse vs pdf2json

PDF Parsing Libraries Comparison

PDF parsing libraries are essential tools in web development for extracting data from PDF documents. They enable developers to read, manipulate, and convert PDF files into usable formats, such as JSON or plain text. These libraries are particularly useful in applications that require document processing, data extraction, or integration with other systems. The choice between different PDF parsing libraries often depends on the specific requirements of the project, such as the complexity of the PDF files, the need for accuracy, and the desired output format.

Package    Downloads  Stars  Size     Issues  Publish       License
pdf-parse  1,154,582  -      -        -       7 years ago   MIT
pdf2json   206,370    2,116  14.6 MB  109     2 months ago  Apache-2.0

Output Format

pdf-parse:
pdf-parse outputs plain text extracted from PDF files, making it ideal for applications that only need the textual content without any formatting or structure.
pdf2json:
pdf2json provides a detailed JSON representation of the entire PDF document, including text, images, and layout information, allowing for more complex data manipulation and analysis.
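
To make the difference in output concrete, the following minimal sketch flattens a pdf2json-style result back into plain text. The object shape used here (Pages, Texts, R, and a URI-encoded T field) is an assumption based on pdf2json's documented output and may differ between versions. pagesToText is a hypothetical helper, not part of either library.

```javascript
// Hypothetical helper: flatten a pdf2json-style result into one string per page.
// Assumed shape: { Pages: [ { Texts: [ { x, y, R: [ { T } ] } ] } ] },
// where T holds URI-encoded text.
function pagesToText(pdfData) {
  return (pdfData.Pages || []).map(function (page) {
    return (page.Texts || [])
      .map(function (textItem) {
        // A text item may contain several runs; join their decoded text.
        return (textItem.R || [])
          .map(function (run) { return decodeURIComponent(run.T); })
          .join('');
      })
      .join(' ');
  });
}

// Hand-built sample so the sketch runs without a PDF file.
const sample = {
  Pages: [
    {
      Texts: [
        { x: 1, y: 1, R: [{ T: 'Hello' }] },
        { x: 5, y: 1, R: [{ T: 'PDF%20world' }] }
      ]
    }
  ]
};

console.log(pagesToText(sample)); // array with one string per page
```

With pdf-parse the equivalent text is already available as data.text, which is why it suits projects that do not need positions or layout.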

Complexity Handling

pdf-parse:
pdf-parse is designed for simplicity and works well with straightforward PDF documents. It may struggle with highly complex PDFs that have intricate layouts or embedded objects.
pdf2json:
pdf2json excels in handling complex PDF structures, providing detailed information about the document's layout, images, and text positioning, making it suitable for advanced use cases.

Ease of Use

pdf-parse:
pdf-parse is user-friendly and easy to implement, requiring minimal setup and configuration, making it ideal for quick projects or simple text extraction tasks.
pdf2json:
pdf2json has a steeper learning curve due to its comprehensive output and additional features, but it offers more control and flexibility for developers who need to work with complex PDF data.

Performance

pdf-parse:
pdf-parse is lightweight and performs well for basic text extraction, but may not be optimized for processing large or complex PDF files efficiently.
pdf2json:
pdf2json may have slower performance on very large PDFs due to its detailed parsing and output generation, but it provides more thorough data extraction.

Community and Support

pdf-parse:
pdf-parse has a smaller community and fewer resources available for troubleshooting, but it is sufficient for basic use cases.
pdf2json:
pdf2json has a larger community and more extensive documentation, which can be beneficial for developers needing support or examples for complex implementations.

How to Choose

pdf-parse:
Choose pdf-parse if you need a simple and straightforward solution for extracting text content from PDF files. It is particularly effective for quick text extraction and is easy to integrate into existing Node.js applications.
pdf2json:
Choose pdf2json if you require a more comprehensive analysis of PDF documents, including the ability to extract structured data, metadata, and handle complex layouts. It provides a detailed JSON representation of the PDF structure, making it suitable for applications that need to manipulate or analyze PDF content in depth.

pdf-parse

Pure JavaScript, cross-platform module for extracting text from PDFs.

Similar Packages

pdf2json: buggy, no longer supported, leaks memory, throws uncatchable fatal errors
j-pdfjson: fork of pdf2json
pdf-parser: buggy, no tests
pdfreader: uses pdf2json
pdf-extract: not cross-platform, uses xpdf

Installation

npm install pdf-parse

Basic Usage - Local Files

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {

	// number of pages
	console.log(data.numpages);
	// number of rendered pages
	console.log(data.numrender);
	// PDF info
	console.log(data.info);
	// PDF metadata
	console.log(data.metadata); 
	// PDF.js version
	// check https://mozilla.github.io/pdf.js/getting_started/
	console.log(data.version);
	// PDF text
	console.log(data.text); 
        
});
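
As a minimal sketch of consuming that result object, the hypothetical helper below condenses it into a short summary. The fields numpages, info, and text come from the example above. The info.Title field is an assumption: it is only expected when the PDF itself sets a title.

```javascript
// Hypothetical helper: condense a pdf-parse result object into a short summary.
// info.Title is assumed to exist only when the PDF declares a title.
function summarizeResult(data) {
  const text = data.text || '';
  // Split on any whitespace and drop empty fragments.
  const words = text.split(/\s+/).filter(Boolean);
  return {
    pages: data.numpages,
    title: (data.info && data.info.Title) || null,
    wordCount: words.length
  };
}

// Stand-in for a real result, so the sketch runs without a PDF file.
const fakeResult = {
  numpages: 2,
  info: { Title: 'Sample' },
  text: '\n\nFirst page text\n\nSecond page'
};

console.log(summarizeResult(fakeResult));
```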

Basic Usage - HTTP

You can use crawler-request, which uses pdf-parse.
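
If you would rather not add another dependency, something along these lines may work. It is a sketch, assuming Node.js 18 or later where fetch is available globally. fetchPdfText is a hypothetical helper, and the URL in the usage comment is a placeholder.

```javascript
// Hypothetical helper: download a PDF and hand its bytes to a parser.
// parse is expected to be the function exported by pdf-parse.
// fetchFn defaults to the global fetch (assumed: Node.js 18+).
async function fetchPdfText(url, parse, fetchFn = fetch) {
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error('Download failed with status ' + response.status);
  }
  // The parser takes a Buffer, so convert the ArrayBuffer first.
  const buffer = Buffer.from(await response.arrayBuffer());
  const data = await parse(buffer);
  return data.text;
}

// Usage (placeholder URL):
// const pdf = require('pdf-parse');
// fetchPdfText('https://example.com/file.pdf', pdf).then(console.log);
```

Passing the parser and fetch function as arguments keeps the helper easy to exercise with stand-ins before pointing it at a real server.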

Exception Handling

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {
	// use data
})
.catch(function(error){
	// handle exceptions
})
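
Note that in the snippet above, fs.readFileSync runs outside the promise chain, so an error such as a missing file is thrown synchronously and never reaches .catch. A minimal sketch of one way to cover both cases with async/await follows. extractTextSafely is a hypothetical wrapper; the read function and the parser are passed in as arguments, and in real use they would be fs.readFileSync and the function exported by pdf-parse.

```javascript
// Hypothetical wrapper: handle synchronous read errors and parser rejections alike.
async function extractTextSafely(path, readFile, parse) {
  try {
    const dataBuffer = readFile(path);     // may throw synchronously (e.g. missing file)
    const data = await parse(dataBuffer);  // may reject on a malformed PDF
    return { ok: true, text: data.text };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}

// Usage:
// const fs = require('fs');
// const pdf = require('pdf-parse');
// extractTextSafely('path to PDF file...', fs.readFileSync, pdf).then(console.log);
```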

Extend

Versions v1.0.9 and above introduce a breaking change to the pagerender callback; see the changelog.
If you need another output format, such as JSON, you can change the page render behaviour with a callback.
See https://mozilla.github.io/pdf.js/ for the underlying API.

const fs = require('fs');
const pdf = require('pdf-parse');

// default render callback
function render_page(pageData) {
    //check documents https://mozilla.github.io/pdf.js/
    let render_options = {
        //replaces all occurrences of whitespace with standard spaces (0x20). The default value is `false`.
        normalizeWhitespace: false,
        //do not attempt to combine same line TextItem's. The default value is `false`.
        disableCombineTextItems: false
    }

    return pageData.getTextContent(render_options)
	.then(function(textContent) {
		let lastY, text = '';
		for (let item of textContent.items) {
			if (lastY == item.transform[5] || !lastY){
				text += item.str;
			}  
			else{
				text += '\n' + item.str;
			}    
			lastY = item.transform[5];
		}
		return text;
	});
}

let options = {
    pagerender: render_page
}

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer,options).then(function(data) {
	//use new format
});
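
As an illustration of the JSON case mentioned above, the sketch below groups text items into lines and returns them as a JSON string. It assumes that each item has the str and transform fields used by the default callback, and that the callback's return value ends up in data.text as a string, one chunk per page. itemsToLines and renderPageAsJson are hypothetical names.

```javascript
// Hypothetical helper: group consecutive text items that share a y coordinate.
// Each item is assumed to look like { str, transform }, with transform[5] = y.
function itemsToLines(items) {
  const lines = [];
  let current = null;
  for (const item of items) {
    const y = item.transform[5];
    if (current === null || current.y !== y) {
      // A new y value starts a new line.
      current = { y: y, text: '' };
      lines.push(current);
    }
    current.text += item.str;
  }
  return lines;
}

// Custom pagerender callback: resolves to one JSON string per page.
function renderPageAsJson(pageData) {
  return pageData.getTextContent().then(function (textContent) {
    return JSON.stringify(itemsToLines(textContent.items));
  });
}

// Usage: pdf(dataBuffer, { pagerender: renderPageAsJson })
```

Comparing against null rather than testing for a falsy value means a line at y = 0 is still handled as a regular line.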

Options

const DEFAULT_OPTIONS = {
	// internal page parser callback
	// you can set this option if you need a format other than raw text
	pagerender: render_page,
	// max page number to parse
	max: 0,
	// check https://mozilla.github.io/pdf.js/getting_started/
	version: 'v1.10.100'
}

pagerender (callback)

Set this callback if you need an output format other than raw text.

max (number)

Maximum number of pages to parse. If the value is less than or equal to 0, the parser renders all pages.
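
The rule can be sketched as a small function. This is an illustration of the documented behaviour only, not the library's source, and capping the result at the document's page count is an assumption. pagesToRender is a hypothetical name.

```javascript
// Illustration only: how many pages get rendered for a given max value.
// Assumption: a max larger than the document is capped at numpages.
function pagesToRender(max, numpages) {
  if (max <= 0) {
    return numpages; // 0 or negative means "render everything"
  }
  return Math.min(max, numpages);
}

console.log(pagesToRender(0, 10));  // all pages
console.log(pagesToRender(3, 10));  // first three pages only

// Usage with the parser: pdf(dataBuffer, { max: 3 })
```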

version (string, pdf.js version)

Selects which pdf.js version to use; see the pdf.js documentation. Accepted values:

'default'
'v1.9.426'
'v1.10.100'
'v1.10.88'
'v2.0.550'

'default' uses version v1.10.100.
See mozilla.github.io/pdf.js for details.

Test

Run mocha or npm test.
See the test folder and quickstart.js for more usage examples.

Support

I use this package actively myself, so it has my top priority. You can chat with me on WhatsApp about any questions, ideas, or suggestions.

Submitting an Issue

If you find a bug or a mistake, you can help by submitting an issue to the GitLab repository.

Creating a Merge Request

GitLab calls it merge request instead of pull request.

License

MIT licensed, and all its dependencies are MIT or BSD licensed.