pdf-parse vs pdf2json vs pdfreader
PDF Parsing Libraries Comparison
What Are PDF Parsing Libraries?

PDF parsing libraries extract text, images, and metadata from PDF documents programmatically, so that applications can index, analyze, or transform PDF content. They differ in features, ease of use, and the kinds of data they can extract, so the right choice depends on the project's requirements.

Stat Detail
Package     Downloads   Stars   Size      Issues   Publish        License
pdf-parse   882,769     -       -         -        6 years ago    MIT
pdf2json    177,382     2,077   11.9 MB   104      3 months ago   Apache-2.0
pdfreader   38,650      672     59.6 kB   3        3 months ago   MIT
Feature Comparison: pdf-parse vs pdf2json vs pdfreader

Text Extraction

  • pdf-parse:

    pdf-parse excels at extracting plain text from PDF files. It is designed for simplicity and efficiency, making it easy to implement in applications where the primary goal is to retrieve text without the need for additional formatting or structure.

  • pdf2json:

    pdf2json provides comprehensive text extraction alongside the preservation of the document's layout and structure. It outputs the extracted content in JSON format, allowing developers to access text in context with its positioning, which is useful for applications that require structured data.

  • pdfreader:

    pdfreader offers a balanced approach to text extraction, allowing for both simple text retrieval and the ability to navigate through the PDF structure. It can extract text while also providing access to the layout, making it versatile for various use cases.
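
To make "text in context with its positioning" concrete, the sketch below rebuilds lines of text from positioned items by grouping them on their y coordinate. The item shape (`{ text, x, y }`) is an assumption modeled on what position-aware parsers emit, and `groupItemsIntoLines` is a hypothetical helper, not part of any of the three libraries.

```javascript
// Hypothetical helper: rebuild text lines from positioned items.
// Assumed item shape: { text: string, x: number, y: number }.
function groupItemsIntoLines(items) {
  const rows = new Map(); // rounded y coordinate -> items on that row
  for (const item of items) {
    const key = item.y.toFixed(2); // tolerate tiny float differences
    if (!rows.has(key)) rows.set(key, []);
    rows.get(key).push(item);
  }
  return [...rows.entries()]
    .sort((a, b) => parseFloat(a[0]) - parseFloat(b[0])) // ascending y (assumes y grows downward)
    .map(([, row]) =>
      row
        .sort((a, b) => a.x - b.x) // left to right
        .map((item) => item.text)
        .join(' ')
    );
}

// Hand-written sample items standing in for real parser output.
const sampleItems = [
  { text: 'World', x: 5, y: 1 },
  { text: 'Hello', x: 1, y: 1 },
  { text: 'Second line', x: 1, y: 2.5 },
];
console.log(groupItemsIntoLines(sampleItems));
```

A plain-text extractor hides this step; a position-aware one leaves it to the caller, which is what makes table and column handling possible.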

Output Format

  • pdf-parse:

    The output from pdf-parse is straightforward, providing plain text without any additional formatting or structure. This makes it easy to use for applications that only require the text itself without concern for layout or positioning.

  • pdf2json:

    pdf2json outputs data in a structured JSON format, which includes not only the text but also the layout information such as font sizes and positions. This is particularly useful for applications that need to analyze or manipulate the document's content in a structured way.

  • pdfreader:

    pdfreader delivers content incrementally through a callback, emitting page markers and text items together with their coordinates. It exposes less detail than pdf2json, but strikes a balance between simplicity and the ability to interact with the PDF's content.
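
As a rough illustration of working with structured JSON output, the sketch below flattens a pdf2json-style object into a list of positioned strings. The shape used here (`Pages`, then `Texts`, then runs `R` whose `T` field holds URI-encoded text) is an assumption about the output format and may differ between versions; `flattenTexts` is a hypothetical helper and the sample data is hand-written.

```javascript
// Assumed pdf2json-style shape: { Pages: [ { Texts: [ { x, y, R: [ { T } ] } ] } ] }
// where T holds URI-encoded text. Treat this as an illustration, not a spec.
function flattenTexts(pdfData) {
  const out = [];
  (pdfData.Pages || []).forEach((page, pageIndex) => {
    for (const t of page.Texts || []) {
      // Join all runs of one text block and decode the URI encoding.
      const text = (t.R || []).map((run) => decodeURIComponent(run.T)).join('');
      out.push({ page: pageIndex + 1, x: t.x, y: t.y, text });
    }
  });
  return out;
}

// Hand-written sample standing in for real parser output.
const samplePdfData = {
  Pages: [{ Texts: [{ x: 2.1, y: 3.4, R: [{ T: 'Invoice%20%2342' }] }] }],
};
console.log(flattenTexts(samplePdfData));
```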

Complexity and Learning Curve

  • pdf-parse:

    pdf-parse is very easy to use and has a low learning curve, making it suitable for developers who need a quick solution for text extraction without delving into complex configurations or setups.

  • pdf2json:

    pdf2json has a steeper learning curve due to its more complex output and the need to understand JSON structures. However, it is powerful for those who require detailed analysis and manipulation of PDF content.

  • pdfreader:

    pdfreader offers moderate complexity, providing a balance between ease of use and functionality. It is relatively straightforward to implement but requires some understanding of how to navigate PDF structures.

Use Cases

  • pdf-parse:

    Ideal for applications that require quick and efficient text extraction from PDFs, such as search engines, document indexing, or simple data retrieval tasks.

  • pdf2json:

    Best suited for applications that need to analyze PDF content in detail, such as data mining, content management systems, or any application that benefits from structured data extraction.

  • pdfreader:

    Suitable for applications that need to read PDF content with some awareness of its layout, such as extracting values from forms or tables, or any scenario where basic structure navigation is required.

Performance

  • pdf-parse:

    pdf-parse is lightweight and performs well for text extraction, making it efficient for processing large volumes of PDF files quickly without significant overhead.

  • pdf2json:

    pdf2json may be slower than other libraries due to the complexity of converting PDF content into JSON format, especially for large or complex documents, but it provides rich detail in the output.

  • pdfreader:

    pdfreader offers decent performance for text extraction and basic structure navigation, making it a good choice for applications that require a balance between speed and functionality.

How to Choose: pdf-parse vs pdf2json vs pdfreader
  • pdf-parse:

    Choose pdf-parse if you need a simple and straightforward solution for extracting text from PDF files. It is lightweight and focuses primarily on text extraction, making it ideal for applications that do not require complex PDF structures or additional features.

  • pdf2json:

    Select pdf2json if you require detailed extraction of both text and structure from PDF files. This library converts PDF documents into a JSON format, preserving layout and structure, which is beneficial for applications that need to analyze or manipulate the content in a structured way.

  • pdfreader:

    Opt for pdfreader if you need a library that can handle both text extraction and basic PDF structure navigation. It reads PDF files in a more controlled, item-by-item manner, making it suitable for applications that require more than plain text, such as parsing tables or extracting values from known positions on a page. It reads PDFs only; it does not fill in forms or modify documents.

README for pdf-parse

pdf-parse

Pure JavaScript cross-platform module to extract text from PDFs.


Similar Packages

Installation

npm install pdf-parse

Basic Usage - Local Files

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {

	// number of pages
	console.log(data.numpages);
	// number of rendered pages
	console.log(data.numrender);
	// PDF info
	console.log(data.info);
	// PDF metadata
	console.log(data.metadata); 
	// PDF.js version
	// check https://mozilla.github.io/pdf.js/getting_started/
	console.log(data.version);
	// PDF text
	console.log(data.text); 
        
});

Basic Usage - HTTP

You can use crawler-request, which uses pdf-parse.

Exception Handling

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {
	// use data
})
.catch(function(error){
	// handle exceptions
})
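
The same error handling can also be written with async/await. In this sketch the parser is passed in as an argument (`parsePdf`) so that the example runs without the library installed; with pdf-parse it would be the function returned by `require('pdf-parse')`. `safeExtractText` and the stand-in parsers are hypothetical names used only for illustration.

```javascript
// Hypothetical wrapper: never rejects; reports failure as a value instead.
// parsePdf is any function (buffer) => Promise<{ text }>, e.g. require('pdf-parse').
async function safeExtractText(parsePdf, dataBuffer) {
  try {
    const data = await parsePdf(dataBuffer);
    return { ok: true, text: data.text };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}

// Stand-in parsers so the sketch runs without a real PDF.
const fakeOk = async () => ({ text: 'hello' });
const fakeFail = async () => { throw new Error('Invalid PDF structure'); };

safeExtractText(fakeOk, Buffer.alloc(0)).then(console.log);
safeExtractText(fakeFail, Buffer.alloc(0)).then(console.log);
```

Returning a result object keeps batch jobs running when a single file is corrupt, at the cost of forcing callers to check `ok`.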

Extend

  • v1.0.9 and above break the pagerender callback; see the changelog
  • If you need another format like json, you can change page render behaviour with a callback
  • Check out https://mozilla.github.io/pdf.js/
const fs = require('fs');
const pdf = require('pdf-parse');

// default render callback
function render_page(pageData) {
    // see the PDF.js documentation: https://mozilla.github.io/pdf.js/
    let render_options = {
        // replaces all occurrences of whitespace with standard spaces (0x20). The default value is `false`.
        normalizeWhitespace: false,
        // do not attempt to combine TextItems on the same line. The default value is `false`.
        disableCombineTextItems: false
    };

    return pageData.getTextContent(render_options)
        .then(function(textContent) {
            let lastY, text = '';
            for (let item of textContent.items) {
                // transform[5] is the item's y position: same y means same line
                if (lastY == item.transform[5] || !lastY) {
                    text += item.str;
                } else {
                    text += '\n' + item.str;
                }
                lastY = item.transform[5];
            }
            return text;
        });
}

let options = {
    pagerender: render_page
};

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer, options).then(function(data) {
    // use the new format
});
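
As an illustration of the "another format like json" case, the sketch below is a pagerender callback that returns a JSON string per page instead of plain text. It relies only on `getTextContent()` and the `item.str` / `item.transform` fields already used by the default callback above; `toPositionedItems` and `render_page_json` are hypothetical names, and the mock page exists only so the sketch runs without a PDF.

```javascript
// Pure helper: map PDF.js text items to { str, x, y } records.
function toPositionedItems(textContent) {
  return textContent.items.map((item) => ({
    str: item.str,
    x: item.transform[4], // horizontal position
    y: item.transform[5], // vertical position (the field the default callback compares)
  }));
}

// Hypothetical pagerender callback: resolves to a JSON string for the page.
function render_page_json(pageData) {
  return pageData.getTextContent()
    .then((textContent) => JSON.stringify(toPositionedItems(textContent)));
}

// Mock page so the sketch runs without a PDF; real pages are supplied by pdf-parse.
const mockPage = {
  getTextContent: () => Promise.resolve({
    items: [{ str: 'Hello', transform: [1, 0, 0, 1, 72, 700] }],
  }),
};
render_page_json(mockPage).then(console.log);

// With the library installed: pdf(dataBuffer, { pagerender: render_page_json })
```

Because pdf-parse assembles `data.text` from whatever the callback returns for each page, expect to split and parse the per-page strings yourself; how pages are delimited is worth verifying against the version in use.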

Options

const DEFAULT_OPTIONS = {
    // internal page parser callback
    // set this option if you need an output format other than raw text
    pagerender: render_page,
    // maximum number of pages to parse (0 or less parses all pages)
    max: 0,
    // PDF.js version; see https://mozilla.github.io/pdf.js/getting_started/
    version: 'v1.10.100'
};

pagerender (callback)

Set this callback if you need an output format other than raw text.

max (number)

Maximum number of pages to parse. If the value is less than or equal to 0, the parser renders all pages.
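
The rule for max can be restated as a small function. This is an illustrative restatement of the sentence above, not code taken from the library; `pagesToParse` is a hypothetical name, and capping the result at the document's page count is an assumption.

```javascript
// Hypothetical restatement of the `max` rule.
function pagesToParse(numPages, max) {
  if (max <= 0) return numPages;   // 0 or negative: parse every page
  return Math.min(max, numPages);  // assumption: never exceed the document length
}

console.log(pagesToParse(10, 0)); // 10
console.log(pagesToParse(10, 3)); // 3
console.log(pagesToParse(2, 5));  // 2
```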

version (string, pdf.js version)

PDF.js version to use (see the PDF.js project). Supported values:

  • 'default'
  • 'v1.9.426'
  • 'v1.10.100'
  • 'v1.10.88'
  • 'v2.0.550'

'default' uses version v1.10.100; see mozilla.github.io/pdf.js

Test

Support

I use this package actively myself, so it has my top priority. You can chat on WhatsApp about any information, ideas, and suggestions.


Submitting an Issue

If you find a bug or a mistake, you can help by submitting an issue to the GitLab repository.

Creating a Merge Request

GitLab calls it a merge request rather than a pull request.

License

MIT licensed; all its dependencies are MIT or BSD licensed.