pdf-parse vs pdf2json
PDF Parsing Libraries Comparison
What Are PDF Parsing Libraries?

PDF parsing libraries extract data from PDF documents. They let developers read PDF files and convert their contents into usable formats such as JSON or plain text, which is useful in applications that require document processing, data extraction, or integration with other systems. The right choice depends on the project's requirements: the complexity of the PDF files, the accuracy needed, and the desired output format.

Stat Detail

Package    Downloads  Stars  Size     Issues  Publish      License
pdf-parse  573,777    -      -        -       6 years ago  MIT
pdf2json   135,951    2,046  11.9 MB  100     a month ago  Apache-2.0
Feature Comparison: pdf-parse vs pdf2json

Output Format

  • pdf-parse:

    pdf-parse outputs plain text extracted from PDF files, making it ideal for applications that only need the textual content without any formatting or structure.

  • pdf2json:

    pdf2json provides a detailed JSON representation of the entire PDF document, including text, images, and layout information, allowing for more complex data manipulation and analysis.
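
To make the difference between the two output styles concrete, the sketch below flattens a pdf2json-style object into the kind of plain string pdf-parse returns. The object shape (`Pages`, `Texts`, `R`, `T`, with `T` URI-encoded) is an assumption modeled on pdf2json's documented output and may differ between versions; `flattenPages` and `decodeRun` are hypothetical helpers, not part of either library.

```javascript
// Hypothetical helpers: flatten a pdf2json-style object into plain text.
// ASSUMPTION: the shape is { Pages: [{ Texts: [{ x, y, R: [{ T }] }] }] }
// with T URI-encoded; verify this against the pdf2json version you install.
function decodeRun(run) {
  try {
    return decodeURIComponent(run.T);
  } catch (err) {
    return run.T; // leave the text untouched if it is not valid URI encoding
  }
}

function flattenPages(pdfData) {
  return pdfData.Pages.map((page) =>
    page.Texts.map((t) => t.R.map(decodeRun).join('')).join(' ')
  ).join('\n\n');
}

// Hand-written sample data, not real parser output.
const sample = {
  Pages: [
    {
      Texts: [
        { x: 1, y: 2, R: [{ T: 'Invoice%20No.' }] },
        { x: 9, y: 2, R: [{ T: '42' }] },
      ],
    },
    { Texts: [{ x: 1, y: 2, R: [{ T: 'Total%3A%20100%25' }] }] },
  ],
};

console.log(flattenPages(sample));
```

The structured form keeps the `x` and `y` positions that the flattened string discards, which is the trade-off described above.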

Complexity Handling

  • pdf-parse:

    pdf-parse is designed for simplicity and works well with straightforward PDF documents. It may struggle with highly complex PDFs that have intricate layouts or embedded objects.

  • pdf2json:

    pdf2json excels in handling complex PDF structures, providing detailed information about the document's layout, images, and text positioning, making it suitable for advanced use cases.

Ease of Use

  • pdf-parse:

    pdf-parse is user-friendly and easy to implement, requiring minimal setup and configuration, making it ideal for quick projects or simple text extraction tasks.

  • pdf2json:

    pdf2json has a steeper learning curve due to its comprehensive output and additional features, but it offers more control and flexibility for developers who need to work with complex PDF data.

Performance

  • pdf-parse:

    pdf-parse is lightweight and performs well for basic text extraction, but may not be optimized for processing large or complex PDF files efficiently.

  • pdf2json:

    pdf2json may have slower performance on very large PDFs due to its detailed parsing and output generation, but it provides more thorough data extraction.

Community and Support

  • pdf-parse:

    pdf-parse is the more widely downloaded of the two, but its last publish was 6 years ago, so fewer current resources are available for troubleshooting. It remains sufficient for basic use cases.

  • pdf2json:

    pdf2json is actively maintained (last published a month ago) and has more extensive documentation, which helps developers who need support or examples for complex implementations.

How to Choose: pdf-parse vs pdf2json
  • pdf-parse:

    Choose pdf-parse if you need a simple and straightforward solution for extracting text content from PDF files. It is particularly effective for quick text extraction and is easy to integrate into existing Node.js applications.

  • pdf2json:

    Choose pdf2json if you require a more comprehensive analysis of PDF documents, including the ability to extract structured data, metadata, and handle complex layouts. It provides a detailed JSON representation of the PDF structure, making it suitable for applications that need to manipulate or analyze PDF content in depth.

README for pdf-parse

pdf-parse

Pure JavaScript cross-platform module to extract text from PDFs.

Installation

npm install pdf-parse

Basic Usage - Local Files

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {

	// number of pages
	console.log(data.numpages);
	// number of rendered pages
	console.log(data.numrender);
	// PDF info
	console.log(data.info);
	// PDF metadata
	console.log(data.metadata); 
	// PDF.js version
	// check https://mozilla.github.io/pdf.js/getting_started/
	console.log(data.version);
	// PDF text
	console.log(data.text); 
        
});
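
Once the promise resolves, the result is an ordinary object, so post-processing is plain JavaScript. The sketch below is a hypothetical `summarize` helper that reads only `numpages` and `text`, the two fields shown in the example above; the `fakeData` object is a stand-in, not real parser output.

```javascript
// Hypothetical helper: summarizes the object that pdf-parse resolves with.
// Only `numpages` and `text` are read; both appear in the usage example above.
function summarize(data) {
  const words = data.text.split(/\s+/).filter(Boolean);
  const firstLine = data.text.split('\n').find((line) => line.trim() !== '');
  return {
    pages: data.numpages,
    words: words.length,
    firstLine: firstLine || '',
  };
}

// Stand-in for a real result; replace it with the `data` passed to `.then(...)`.
const fakeData = {
  numpages: 2,
  text: '\n\nHello PDF world\nsecond line\n\nPage two',
};

console.log(summarize(fakeData));
```

In a real script, `summarize(data)` would be called inside the `.then` callback.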

Basic Usage - HTTP

You can use crawler-request, which uses pdf-parse internally.

Exception Handling

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {
	// use data
})
.catch(function(error){
	// handle exceptions
})
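
Some failures can be caught before the parser runs. As a minimal sketch, the hypothetical `looksLikePdf` helper below checks for the `%PDF-` marker that PDF files begin with. It is a strict check at byte 0, which is an assumption about your inputs: some viewers tolerate a few leading bytes, so treat a `false` result as a hint rather than proof.

```javascript
// Hypothetical pre-check: PDF files start with the ASCII marker "%PDF-".
// This is a strict check at byte 0; it does not validate the rest of the file.
function looksLikePdf(buffer) {
  if (!Buffer.isBuffer(buffer) || buffer.length < 5) {
    return false;
  }
  return buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

console.log(looksLikePdf(Buffer.from('%PDF-1.7\n'))); // true
console.log(looksLikePdf(Buffer.from('<html></html>'))); // false
```

Running this check on `dataBuffer` first lets you report "not a PDF" clearly, while the `.catch` handler still covers files that are PDFs but fail to parse.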

Extend

  • v1.0.9 and above break the pagerender callback (see the changelog)
  • If you need another format, such as JSON, you can change the page render behaviour with a callback
  • Check out https://mozilla.github.io/pdf.js/
const fs = require('fs');
const pdf = require('pdf-parse');

// default render callback
function render_page(pageData) {
    // see the pdf.js documentation: https://mozilla.github.io/pdf.js/
    let render_options = {
        // replaces all occurrences of whitespace with standard spaces (0x20). The default value is `false`.
        normalizeWhitespace: false,
        // do not attempt to combine same-line TextItems. The default value is `false`.
        disableCombineTextItems: false
    };

    return pageData.getTextContent(render_options)
        .then(function(textContent) {
            let lastY, text = '';
            for (let item of textContent.items) {
                // transform[5] is the item's vertical position; a change starts a new line
                if (lastY == item.transform[5] || !lastY) {
                    text += item.str;
                } else {
                    text += '\n' + item.str;
                }
                lastY = item.transform[5];
            }
            return text;
        });
}

let options = {
    pagerender: render_page
};

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer, options).then(function(data) {
    // use new format
});
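
As one possible variation on the callback above, the sketch below emits each page as a JSON string of positioned text items instead of raw text. `itemsToRecords` and `renderPageAsJson` are hypothetical names. It assumes, as the default callback suggests, that the callback resolves to one string per page, and that `transform[4]` and `transform[5]` carry the x and y offsets of each pdf.js text item.

```javascript
// Hypothetical pure helper: maps pdf.js text items to plain objects.
// transform[4] and transform[5] hold the x and y offsets of each item.
function itemsToRecords(items) {
  return items.map((item) => ({
    text: item.str,
    x: item.transform[4],
    y: item.transform[5],
  }));
}

// Custom pagerender callback. ASSUMPTION: pdf-parse expects each page
// callback to resolve to a string, so one JSON string per page is returned.
function renderPageAsJson(pageData) {
  return pageData
    .getTextContent()
    .then((textContent) => JSON.stringify(itemsToRecords(textContent.items)));
}

// Hand-written items standing in for pdf.js output.
const demoItems = [
  { str: 'Name', transform: [1, 0, 0, 1, 72, 700] },
  { str: 'Alice', transform: [1, 0, 0, 1, 150, 700] },
];

console.log(itemsToRecords(demoItems));
```

To try it, pass `{ pagerender: renderPageAsJson }` as the options object. How the per-page strings are joined in `data.text` is up to pdf-parse, so inspect the output before calling `JSON.parse` on it.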

Options

const DEFAULT_OPTIONS = {
	// internal page parser callback
	// set this option if you need a format other than raw text
	pagerender: render_page,
	// max page number to parse
	max: 0,
	// check https://mozilla.github.io/pdf.js/getting_started/
	version: 'v1.10.100'
}

pagerender (callback)

Set this callback if you need an output format other than raw text.

max (number)

Maximum number of pages to parse. If the value is less than or equal to 0, the parser renders all pages.

version (string, pdf.js version)

The pdf.js version to use (see the pdf.js documentation). Supported values:

  • 'default'
  • 'v1.9.426'
  • 'v1.10.100'
  • 'v1.10.88'
  • 'v2.0.550'

'default' uses version v1.10.100. See mozilla.github.io/pdf.js for details.
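
Because a mistyped option is easy to miss, it can help to validate options before handing them to the parser. The sketch below is a hypothetical `resolveOptions` helper, not part of pdf-parse. It merges user options over the `max` and `version` defaults listed above and rejects version strings that are not in the list. `pagerender` is deliberately left out so the library's own default applies.

```javascript
// Versions listed in the README; 'default' maps to v1.10.100.
const KNOWN_VERSIONS = ['default', 'v1.9.426', 'v1.10.100', 'v1.10.88', 'v2.0.550'];

// Hypothetical helper: validates user options before passing them to pdf-parse.
function resolveOptions(userOptions = {}) {
  const options = { max: 0, version: 'v1.10.100', ...userOptions };
  if (!Number.isInteger(options.max)) {
    throw new TypeError('max must be an integer');
  }
  if (!KNOWN_VERSIONS.includes(options.version)) {
    throw new RangeError(`unknown pdf.js version: ${options.version}`);
  }
  return options;
}

// Parse only the first three pages, keeping the default pdf.js version.
console.log(resolveOptions({ max: 3 }));
```

The result can be passed as the second argument, as in `pdf(dataBuffer, resolveOptions({ max: 3 }))`. Limiting `max` is also a simple way to keep parsing time down on large files.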

Test

Support

I use this package actively myself, so it has my top priority. You can chat on WhatsApp about any information, ideas, and suggestions.


Submitting an Issue

If you find a bug or a mistake, you can help by submitting an issue to the GitLab repository.

Creating a Merge Request

GitLab calls it a merge request instead of a pull request.

License

MIT licensed, and all its dependencies are MIT or BSD licensed.