pdf-parse vs pdf2json vs pdfreader
PDF Parsing Libraries Comparison
What Are PDF Parsing Libraries?

PDF parsing libraries for Node.js let developers extract text, metadata, and in some cases images from PDF files programmatically. They expose APIs for reading and processing PDF documents, which supports tasks such as text extraction, data mining, and document analysis. Popular options include pdf-parse, pdf2json, and pdfreader; they differ mainly in how much layout and structural information they return alongside the text.

Stat Detail

Package    Weekly Downloads  Stars  Size     Issues  Publish       License
pdf-parse  1,354,784         -      -        -       7 years ago   MIT
pdf2json   215,359           2,136  8.1 MB   99      2 months ago  Apache-2.0
pdfreader  60,459            691    59.6 kB  3       8 months ago  MIT

Feature Comparison: pdf-parse vs pdf2json vs pdfreader

Text Extraction

  • pdf-parse:

    pdf-parse provides basic text extraction capabilities, focusing on extracting text content from PDF files. It does not preserve layout or formatting, but it is effective for extracting plain text quickly.

  • pdf2json:

    pdf2json extracts text along with its positioning information, providing a more detailed representation of the text within the PDF. This is useful for applications that need to analyze text layout and structure.

  • pdfreader:

    pdfreader offers text extraction with layout information, including support for extracting text from specific pages or regions. It allows for more controlled and structured extraction of text content.
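
Layout-aware extraction usually means working with positioned items rather than one long string. The sketch below shows one way such items could be regrouped into lines of text. It assumes items shaped like `{ x, y, text }`, which is how pdfreader reports text items; the helper name `groupIntoRows`, the `tolerance` parameter, and the sample data are hypothetical and only serve the illustration.

```javascript
// Hypothetical helper: groups positioned text items into rows by y coordinate.
// Assumes each item looks like { x: number, y: number, text: string }.
function groupIntoRows(items, tolerance = 0.1) {
  const rows = []; // each row: { y, items: [...] }
  for (const item of items) {
    // Reuse an existing row when its y is within the tolerance.
    let row = rows.find(r => Math.abs(r.y - item.y) <= tolerance);
    if (!row) {
      row = { y: item.y, items: [] };
      rows.push(row);
    }
    row.items.push(item);
  }
  return rows
    .sort((a, b) => a.y - b.y) // ascending y
    .map(r => r.items
      .sort((a, b) => a.x - b.x) // ascending x within a row
      .map(i => i.text)
      .join(' '));
}

// Made-up sample items, deliberately out of order.
const sample = [
  { x: 10, y: 2.0, text: 'Total' },
  { x: 1, y: 1.0, text: 'Invoice' },
  { x: 20, y: 2.05, text: '42.00' },
  { x: 8, y: 1.0, text: '#1001' },
];
console.log(groupIntoRows(sample));
```

The tolerance is there because items on the same visual line do not always share an identical y value; the right threshold depends on the document and should be tuned.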

Metadata Extraction

  • pdf-parse:

    pdf-parse extracts basic metadata from PDF files, including title, author, and creation date. This information is accessible through a simple API and is useful for quick metadata retrieval.

  • pdf2json:

    pdf2json extracts detailed metadata along with the entire PDF structure, including information about fonts, images, and annotations. This makes it suitable for applications that require comprehensive metadata analysis.

  • pdfreader:

    pdfreader provides access to PDF metadata, including title, author, and custom metadata fields. It allows for easy retrieval of metadata information as part of the PDF reading process.
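
Creation dates in PDF metadata are commonly stored as PDF date strings such as `D:20240115093000+02'00'` rather than ISO timestamps, so they usually need converting before use. The following is a minimal sketch of that conversion. The helper name `parsePdfDate` is hypothetical, and which field carries the date (for example a `CreationDate` property on the info object) is an assumption to check against the library you use.

```javascript
// Hypothetical helper: converts a PDF date string such as
// "D:20240115093000+02'00'" into a JavaScript Date.
// Returns null when the string does not match the expected pattern.
function parsePdfDate(value) {
  const pattern = /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+\-])?(\d{2})?'?(\d{2})?'?$/;
  const m = pattern.exec(String(value).trim());
  if (!m) return null;

  // Missing trailing fields fall back to their lowest valid value.
  const [, year, month = '01', day = '01', hour = '00', minute = '00',
    second = '00', sign, offH = '00', offM = '00'] = m;

  // Treat the wall-clock fields as UTC, then shift by the stated offset.
  let ms = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second);
  if (sign === '+' || sign === '-') {
    const offsetMs = (+offH * 60 + +offM) * 60 * 1000;
    ms += sign === '+' ? -offsetMs : offsetMs;
  }
  return new Date(ms);
}

console.log(parsePdfDate("D:20240115093000+02'00'"));
console.log(parsePdfDate('not a date'));
```

With pdf-parse, this might be called as `parsePdfDate(data.info.CreationDate)`, assuming the document actually provides that field.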

Image Extraction

  • pdf-parse:

    pdf-parse does not support image extraction. It is focused solely on text and metadata extraction, making it unsuitable for applications that need to extract images from PDF files.

  • pdf2json:

    pdf2json supports image extraction as part of the PDF-to-JSON conversion process. It captures image data along with its positioning information, allowing for extraction and manipulation of images within the PDF.

  • pdfreader:

    pdfreader does not provide built-in support for image extraction. It focuses on text and metadata, but developers can extend its functionality to handle images if needed.

Output Format

  • pdf-parse:

    pdf-parse outputs extracted text and metadata in a simple, structured format. The text is returned as a string, while metadata is provided as an object, making it easy to work with the extracted data.

  • pdf2json:

    pdf2json converts PDF files into a JSON format, representing the entire PDF structure, including text, images, and metadata. This allows for detailed analysis and manipulation of the PDF content in a programmatic way.

  • pdfreader:

    pdfreader provides extracted text and metadata in a structured format, but it does not convert the PDF into another format. The data is accessible through the library's API, allowing for custom processing and handling.
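
To make the JSON-style output more concrete, here is a sketch that flattens a pdf2json-like structure into positioned text runs. It assumes the shape `{ Pages: [{ Texts: [{ x, y, R: [{ T }] }] }] }` with URI-encoded `T` values; the exact shape and encoding have varied between pdf2json versions, so treat both as assumptions to verify. The names `flattenTexts` and `safeDecode` and the sample data are hypothetical.

```javascript
// Decodes a URI-encoded run, falling back to the raw value if it is not valid encoding.
function safeDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    return value;
  }
}

// Hypothetical helper: flattens pdf2json-style output into positioned text runs.
// Assumed input shape: { Pages: [{ Texts: [{ x, y, R: [{ T }] }] }] }.
function flattenTexts(pdfData) {
  const out = [];
  (pdfData.Pages || []).forEach((page, pageIndex) => {
    for (const t of page.Texts || []) {
      const text = (t.R || []).map(run => safeDecode(run.T)).join('');
      out.push({ page: pageIndex + 1, x: t.x, y: t.y, text });
    }
  });
  return out;
}

// Made-up sample in the assumed shape.
const sampleData = {
  Pages: [
    { Texts: [{ x: 1.5, y: 2.25, R: [{ T: 'Hello%20World' }] }] },
    { Texts: [{ x: 3, y: 4, R: [{ T: 'Total%3A' }, { T: '%2042' }] }] },
  ],
};
console.log(flattenTexts(sampleData));
```

In real use, the argument would be the object received in the `pdfParser_dataReady` handler.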

Ease of Use: Code Examples

  • pdf-parse:

    Extracting text and metadata with pdf-parse

    const fs = require('fs');
    const pdf = require('pdf-parse');
    
    const pdfBuffer = fs.readFileSync('example.pdf');
    
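    pdf(pdfBuffer)
      .then(data => {
        console.log('Text:', data.text);
        console.log('Info:', data.info);         // document info such as title and author
        console.log('Metadata:', data.metadata); // PDF metadata
      })
      .catch(err => console.error('Failed to parse PDF:', err));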
    
  • pdf2json:

    Extracting text and metadata with pdf2json

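    const PDFParser = require('pdf2json');
    
    const pdfParser = new PDFParser();
    
    // Report parse failures instead of failing silently
    pdfParser.on('pdfParser_dataError', errData => {
      console.error('Parse error:', errData.parserError);
    });
    
    pdfParser.on('pdfParser_dataReady', pdfData => {
      console.log('PDF Data:', pdfData);
    });
    
    pdfParser.loadPDF('example.pdf');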
    
  • pdfreader:

    Extracting text with pdfreader

    const fs = require('fs');
    const { PdfReader } = require('pdfreader');
    
    fs.readFile('example.pdf', (err, data) => {
      if (err) throw err;
    
      const reader = new PdfReader();
      reader.parseBuffer(data, (err, item) => {
        if (err) throw err;
        if (!item) {
          // a missing item signals the end of the file
          console.log('End of file');
        } else if (item.text) {
          console.log('Text:', item.text);
        }
      });
    });
    
How to Choose: pdf-parse vs pdf2json vs pdfreader
  • pdf-parse:

    Choose pdf-parse if you need a simple and lightweight solution for extracting text and metadata from PDF files. It takes a Buffer and returns a promise, which makes it well suited to quick text extraction tasks.

  • pdf2json:

    Choose pdf2json if you require a comprehensive representation of the PDF structure, including text, images, and metadata. It converts PDF files into a JSON format, allowing for detailed analysis and manipulation of the content.

  • pdfreader:

    Choose pdfreader if you need a library that provides a high-level API for reading PDF files and extracting text. It supports text extraction with layout information and allows for custom processing of PDF content.

README for pdf-parse

pdf-parse

Pure JavaScript cross-platform module to extract text from PDFs.

Installation

npm install pdf-parse

Basic Usage - Local Files

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {

	// number of pages
	console.log(data.numpages);
	// number of rendered pages
	console.log(data.numrender);
	// PDF info
	console.log(data.info);
	// PDF metadata
	console.log(data.metadata); 
	// PDF.js version
	// check https://mozilla.github.io/pdf.js/getting_started/
	console.log(data.version);
	// PDF text
	console.log(data.text); 
        
});

Basic Usage - HTTP

You can use crawler-request, which uses pdf-parse internally.

Exception Handling

const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {
	// use data
})
.catch(function(error){
	// handle exceptions
})
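
The same error handling can also be written with async/await. The sketch below is one possible shape, not the package's own API: `extractText` is a hypothetical wrapper that accepts any parser function returning a promise with a `text` property, as pdf-parse does, and stand-in parsers are used here so the snippet runs without a PDF file.

```javascript
// Hypothetical wrapper: `parse` is any function that takes a Buffer and returns
// a promise resolving to an object with a `text` property (as pdf-parse does).
async function extractText(parse, buffer) {
  try {
    const data = await parse(buffer);
    return { ok: true, text: data.text };
  } catch (error) {
    // Convert the rejection into a plain result the caller can inspect.
    return { ok: false, error: error.message };
  }
}

// Stand-in parsers for illustration; in real code pass require('pdf-parse').
const fakeOk = async () => ({ text: 'hello' });
const fakeFail = async () => { throw new Error('Invalid PDF structure'); };

extractText(fakeOk, Buffer.from('')).then(result => console.log(result));
extractText(fakeFail, Buffer.from('')).then(result => console.log(result));
```

Returning a result object instead of rethrowing keeps batch jobs running when a single file is corrupt; whether that suits a given application is a design choice.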

Extend

  • v1.0.9 and above break the pagerender callback; see the changelog
  • If you need another format like JSON, you can change the page render behaviour with a callback
  • Check out https://mozilla.github.io/pdf.js/

const fs = require('fs');
const pdf = require('pdf-parse');

// default render callback
function render_page(pageData) {
    // check documents https://mozilla.github.io/pdf.js/
    let render_options = {
        // replaces all occurrences of whitespace with standard spaces (0x20). The default value is `false`.
        normalizeWhitespace: false,
        // do not attempt to combine same line TextItem's. The default value is `false`.
        disableCombineTextItems: false
    };

    return pageData.getTextContent(render_options)
        .then(function(textContent) {
            let lastY, text = '';
            for (let item of textContent.items) {
                // same y position as the previous item: stay on the current line
                if (lastY == item.transform[5] || !lastY) {
                    text += item.str;
                } else {
                    text += '\n' + item.str;
                }
                lastY = item.transform[5];
            }
            return text;
        });
}

let options = {
    pagerender: render_page
};

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer, options).then(function(data) {
    // use new format
});
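
As an illustration of the "another format like json" case, the following sketch shows what a JSON-producing callback might look like. It assumes `pageData.getTextContent()` resolves to `{ items: [{ str, transform }] }`, as the default callback above relies on. The name `render_page_json` and the stand-in page object are hypothetical, and how pdf-parse joins the per-page strings into `data.text` should be checked before parsing the combined output.

```javascript
// Hypothetical pagerender callback: returns one page's text items as a JSON string.
// Assumes getTextContent() resolves to { items: [{ str, transform }] }.
function render_page_json(pageData) {
  return pageData.getTextContent().then(textContent => {
    const items = textContent.items.map(item => ({
      str: item.str,
      x: item.transform[4], // horizontal translation
      y: item.transform[5], // vertical translation
    }));
    return JSON.stringify(items);
  });
}

// Stand-in page object so the callback can be tried without a real PDF.
const fakePage = {
  getTextContent: async () => ({
    items: [
      { str: 'Hello', transform: [1, 0, 0, 1, 72, 700] },
      { str: 'World', transform: [1, 0, 0, 1, 110, 700] },
    ],
  }),
};

render_page_json(fakePage).then(json => console.log(json));
```

It would be passed to pdf-parse the same way as the default callback, for example `pdf(dataBuffer, { pagerender: render_page_json })`.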

Options

const DEFAULT_OPTIONS = {
    // internal page parser callback
    // set this option if you need a format other than raw text
    pagerender: render_page,
    // max page number to parse
    max: 0,
    // check https://mozilla.github.io/pdf.js/getting_started/
    version: 'v1.10.100'
};

pagerender (callback)

Set this if you need an output format other than raw text.

max (number)

Maximum number of pages to parse. If the value is less than or equal to 0, the parser renders all pages.

version (string, pdf.js version)

The pdf.js version to use (check pdf.js). Supported values:

  • 'default'
  • 'v1.9.426'
  • 'v1.10.100'
  • 'v1.10.88'
  • 'v2.0.550'

'default' uses version v1.10.100.
mozilla.github.io/pdf.js

Support

I use this package actively myself, so it has my top priority. You can reach me on WhatsApp with any information, ideas, or suggestions.

Submitting an Issue

If you find a bug or a mistake, you can help by submitting an issue to the GitLab repository.

Creating a Merge Request

GitLab calls it a merge request instead of a pull request.

License

MIT licensed, and all its dependencies are MIT or BSD licensed.