rss-parser vs feedparser
RSS and Atom Feed Parsing

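RSS and Atom feed parsing libraries extract and parse data from web content sources published in the RSS (Really Simple Syndication) and Atom formats. They are typically used to build aggregators, content scrapers, or any application that needs to pull information out of these formats. They handle the different XML structures, extract fields such as title, link, description, and publication date, and convert them into JavaScript objects that are easy to work with. feedparser is a powerful streaming parser that supports multiple RSS and Atom versions and suits large feeds. rss-parser is a lightweight parser with a simple API, suited to quickly parsing and extracting feed data.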

Statistics

Package      Downloads   Stars   Size      Issues   Last published   License
rss-parser   568,161     1,503   1.87 MB   80       3 years ago      MIT
feedparser   0           1,978   -         20       6 years ago      MIT

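Feature Comparison: rss-parser vs feedparser

Parsing performance

  • rss-parser:

    rss-parser performs well on small to medium-sized feeds, but may run into memory pressure on very large ones. It reads the entire feed in one go, which suits quick parsing of smaller content.

  • feedparser:

    feedparser supports streaming parsing, so you can process data chunk by chunk. This suits large feeds because the whole document never has to sit in memory, and data can be handled as it arrives.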
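
The difference between buffering and streaming can be sketched without either library. The `ItemCounter` class below is a hypothetical helper, not part of feedparser or rss-parser, and its naive `<item>` tag matching is an assumption made purely for illustration. It shows that a chunk-by-chunk consumer only needs to remember a few trailing characters, however large the feed is.

```javascript
// Hypothetical helper: counts <item> opening tags in a feed that arrives in chunks.
class ItemCounter {
  constructor() {
    this.count = 0;
    this.tail = ''; // short leftover that may hold a tag split across two chunks
  }

  push(chunk) {
    const text = this.tail + chunk;
    const pattern = /<item[\s>]/g;
    let lastEnd = 0;
    while (pattern.exec(text) !== null) {
      this.count += 1;
      lastEnd = pattern.lastIndex;
    }
    // Keep at most 5 unmatched characters ("<item" without its delimiter).
    this.tail = text.slice(lastEnd).slice(-5);
  }
}

const counter = new ItemCounter();
// The second <item> tag is deliberately split across the two chunks.
counter.push('<rss><channel><item><title>a</title></item><it');
counter.push('em><title>b</title></item></channel></rss>');
console.log(counter.count); // 2
```

In a real program the chunks would come from a stream's `data` events rather than from hard-coded strings.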

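Supported feed formats

  • rss-parser:

    rss-parser mainly supports RSS 2.0 and Atom 1.0. It is optimized for these common formats, and support for older or unusual formats is limited, so it may fall short on feeds outside the mainstream.

  • feedparser:

    feedparser supports a range of RSS and Atom formats, including RSS 0.9, 1.0, and 2.0 as well as Atom 1.0. This broad version coverage suits applications that must handle feeds in many formats.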
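
When it is unclear which format a feed uses, a quick look at the root element is often enough to decide. The sketch below is a naive, regex-based guess and not the detection logic of either library; `detectFeedFormat` and its return labels are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: guesses the feed family from the first element of an XML string.
function detectFeedFormat(xml) {
  // The first tag whose name starts with a letter; this skips <?xml ...?> and <!-- ... -->.
  const root = xml.match(/<([A-Za-z][\w:.-]*)([^>]*)>/);
  if (!root) return 'unknown';
  const [, name, attrs] = root;
  if (name === 'rss') {
    const version = attrs.match(/version\s*=\s*["']([^"']+)["']/);
    return version ? `RSS ${version[1]}` : 'RSS (no version attribute)';
  }
  if (name === 'rdf:RDF') return 'RDF-based RSS';
  if (name === 'feed') return 'Atom';
  return 'unknown';
}

console.log(detectFeedFormat('<?xml version="1.0"?><rss version="2.0"><channel/></rss>')); // RSS 2.0
console.log(detectFeedFormat('<feed xmlns="http://www.w3.org/2005/Atom"></feed>'));        // Atom
```

A comment that itself contains a tag would fool this check, so treat the result as a hint rather than validation.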

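Error handling

  • rss-parser:

    rss-parser surfaces parse errors through the callback's error argument or a rejected Promise, but its error handling is fairly basic. That is fine for quick parsing, while more complex failure cases may need extra handling logic of your own.

  • feedparser:

    feedparser provides more detailed error handling and can cope with a variety of problems during parsing, including invalid XML and feeds that do not follow the specification. It emits error events, so developers can react to parse errors as they occur.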
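
One way to add the "extra handling logic" around a Promise-based parser is a small wrapper that never rejects. This is a minimal sketch: `safeParse` and `stubParser` are hypothetical names, and the stub stands in for a real parser so the example runs without any dependency installed. It assumes only that the parser object exposes a Promise-returning `parseString(xml)` method, as the rss-parser README describes.

```javascript
// Hypothetical wrapper: turns a rejected parse into a plain result object.
async function safeParse(parser, xml) {
  try {
    const feed = await parser.parseString(xml);
    return { ok: true, feed };
  } catch (error) {
    return { ok: false, error };
  }
}

// Stand-in parser (an assumption for this sketch, not a real library object).
const stubParser = {
  async parseString(xml) {
    if (!xml.includes('<rss')) throw new Error('Not an RSS document');
    return { title: 'stub feed', items: [] };
  },
};

safeParse(stubParser, '<html></html>').then((result) => {
  console.log(result.ok ? result.feed.title : `parse failed: ${result.error.message}`);
});
```

Callers can then branch on `result.ok` instead of wrapping every call site in its own try/catch.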

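API design

  • rss-parser:

    rss-parser offers a simple, easy-to-use API that is well suited to quick integration. Its design goal is to make feed parsing simple and keep the learning curve low.

  • feedparser:

    feedparser's API is more involved. It exposes a rich set of events and callbacks, which suits developers who need to customize the parsing process in depth, and its documentation is detailed.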
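
An event-based stream API can be bridged to a Promise-based one with async iteration. The sketch below assumes the parser instance behaves like an object-mode Node.js Readable stream; `collectItems` is a hypothetical helper, and `Readable.from` is used only as a stand-in source so the example runs on its own.

```javascript
const { Readable } = require('node:stream');

// Hypothetical helper: gathers up to `limit` objects from a readable stream.
async function collectItems(stream, limit = Infinity) {
  const items = [];
  for await (const item of stream) {
    items.push(item);
    if (items.length >= limit) break; // leaving the loop destroys the stream
  }
  return items;
}

// Stand-in for a parser stream that emits one object per feed item.
const fakeItems = Readable.from([{ title: 'first' }, { title: 'second' }, { title: 'third' }]);

collectItems(fakeItems, 2).then((items) => {
  console.log(items.map((item) => item.title)); // [ 'first', 'second' ]
});
```

Errors emitted by the stream itself reject the returned Promise. Errors on an upstream request that is piped into the parser are not forwarded by `pipe` and still need their own handler.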

Example code

  • rss-parser:

    Parsing a feed with rss-parser

    const Parser = require('rss-parser');
    const parser = new Parser();
    
    parser.parseURL('https://example.com/feed', (err, feed) => {
      if (err) throw err;
      console.log(feed.title);
      feed.items.forEach(item => {
        console.log(item.title + ':' + item.link);
      });
    });
    
  • feedparser:

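    Parsing a feed with feedparser

    const FeedParser = require('feedparser');
    // Note: the request package is deprecated; any readable HTTP response stream works here.
    const request = require('request');
    
    const req = request('https://example.com/feed');
    const feedparser = new FeedParser();
    
    // Network-level failures surface on the request stream.
    req.on('error', (error) => { console.error(error); });
    req.pipe(feedparser);
    
    // Parse failures surface on the parser stream.
    feedparser.on('error', (error) => { console.error(error); });
    feedparser.on('readable', () => {
      // Arrow functions do not bind `this`, so read from the parser directly.
      let item;
      while ((item = feedparser.read()) !== null) {
        console.log(item);
      }
    });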
    

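How to choose: rss-parser vs feedparser

  • rss-parser:

    Choose rss-parser if you need a lightweight, easy-to-use parser for quickly extracting feed data. Its API is simple, and it suits applications that want fast results and do not need to handle very complex feeds.

  • feedparser:

    Choose feedparser if you need to handle complex or large RSS/Atom feeds, especially when streaming parsing is needed to save memory. It offers broad support for multiple feed formats and suits applications that need in-depth parsing.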

rss-parser README

rss-parser

A small library for turning RSS XML feeds into JavaScript objects.

Installation

npm install --save rss-parser

Usage

You can parse RSS from a URL (parser.parseURL) or an XML string (parser.parseString).

Both callbacks and Promises are supported.

NodeJS

Here's an example in NodeJS using Promises with async/await:

let Parser = require('rss-parser');
let parser = new Parser();

(async () => {

  let feed = await parser.parseURL('https://www.reddit.com/.rss');
  console.log(feed.title);

  feed.items.forEach(item => {
    console.log(item.title + ':' + item.link)
  });

})();

TypeScript

When using TypeScript, you can set a type to control the custom fields:

import Parser from 'rss-parser';

type CustomFeed = {foo: string};
type CustomItem = {bar: number};

const parser: Parser<CustomFeed, CustomItem> = new Parser({
  customFields: {
    feed: ['foo', 'baz'],
    //            ^ will error because `baz` is not a key of CustomFeed
    item: ['bar']
  }
});

(async () => {

  const feed = await parser.parseURL('https://www.reddit.com/.rss');
  console.log(feed.title); // feed will have a `foo` property, typed as a string

  feed.items.forEach(item => {
    console.log(item.title + ':' + item.link) // item will have a `bar` property, typed as a number
  });
})();

Web

We recommend using a bundler like webpack, but we also provide pre-built browser distributions in the dist/ folder. If you use the pre-built distribution, you'll need a polyfill for Promise support.

Here's an example in the browser using callbacks:

<script src="/node_modules/rss-parser/dist/rss-parser.min.js"></script>
<script>

// Note: some RSS feeds can't be loaded in the browser due to CORS security.
// To get around this, you can use a proxy.
const CORS_PROXY = "https://cors-anywhere.herokuapp.com/"

let parser = new RSSParser();
parser.parseURL(CORS_PROXY + 'https://www.reddit.com/.rss', function(err, feed) {
  if (err) throw err;
  console.log(feed.title);
  feed.items.forEach(function(entry) {
    console.log(entry.title + ':' + entry.link);
  })
})

</script>

Upgrading from v2 to v3

A few minor breaking changes were made in v3. Here's what you need to know:

  • You need to construct a new Parser() before calling parseString or parseURL
  • parseFile is no longer available (for better browser support)
  • options are now passed to the Parser constructor
  • parsed.feed is now just feed (top-level object removed)
  • feed.entries is now feed.items (to better match RSS XML)
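
For code that still expects the v2 result shape, the last two changes can be bridged with a small adapter. This is a hypothetical helper for illustration, not something rss-parser provides; it relies only on the renames listed above (`parsed.feed` became `feed`, and `feed.entries` became `feed.items`).

```javascript
// Hypothetical compatibility helper: wraps a v3-style feed object in the v2 shape.
function toV2Shape(feed) {
  const { items, ...rest } = feed;
  return { feed: { ...rest, entries: items } };
}

// A hand-written object standing in for a v3 parse result.
const v3Feed = { title: 'Example', items: [{ title: 'Post' }] };
const legacy = toV2Shape(v3Feed);

console.log(legacy.feed.title);          // Example
console.log(legacy.feed.entries.length); // 1
```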

Output

Check out the full output format in test/output/reddit.json

feedUrl: 'https://www.reddit.com/.rss'
title: 'reddit: the front page of the internet'
description: ""
link: 'https://www.reddit.com/'
items:
    - title: 'The water is too deep, so he improvises'
      link: 'https://www.reddit.com/r/funny/comments/3skxqc/the_water_is_too_deep_so_he_improvises/'
      pubDate: 'Thu, 12 Nov 2015 21:16:39 +0000'
      creator: "John Doe"
      content: '<a href="http://example.com">this is a link</a> &amp; <b>this is bold text</b>'
      contentSnippet: 'this is a link & this is bold text'
      guid: 'https://www.reddit.com/r/funny/comments/3skxqc/the_water_is_too_deep_so_he_improvises/'
      categories:
          - funny
      isoDate: '2015-11-12T21:16:39.000Z'

Notes:
  • The contentSnippet field strips out HTML tags and unescapes HTML entities
  • The dc: prefix will be removed from all fields
  • Both dc:date and pubDate will be available in ISO 8601 format as isoDate
  • If author is specified, but not dc:creator, creator will be set to author (see article)
  • Atom's updated becomes lastBuildDate for consistency
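
To make the derived fields concrete, here is a naive approximation of how `isoDate` and `contentSnippet` relate to their source values in the sample output. These two functions are assumptions written for illustration, not rss-parser's actual code, and the entity table covers only a handful of common entities.

```javascript
// Naive approximation of isoDate: parse the date string and re-emit it as ISO 8601.
function toIsoDate(pubDate) {
  const parsed = new Date(pubDate);
  return Number.isNaN(parsed.getTime()) ? undefined : parsed.toISOString();
}

// Naive approximation of contentSnippet: drop tags, then unescape a few entities.
function toSnippet(html) {
  const entities = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'" };
  return html
    .replace(/<[^>]*>/g, '')
    .replace(/&(amp|lt|gt|quot|#39);/g, (match) => entities[match])
    .trim();
}

console.log(toIsoDate('Thu, 12 Nov 2015 21:16:39 +0000'));
// 2015-11-12T21:16:39.000Z
console.log(toSnippet('<a href="http://example.com">this is a link</a> &amp; <b>this is bold text</b>'));
// this is a link & this is bold text
```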

XML Options

Custom Fields

If your RSS feed contains fields that aren't currently returned, you can access them using the customFields option.

let parser = new Parser({
  customFields: {
    feed: ['otherTitle', 'extendedDescription'],
    item: ['coAuthor','subtitle'],
  }
});

parser.parseURL('https://www.reddit.com/.rss', function(err, feed) {
  console.log(feed.extendedDescription);

  feed.items.forEach(function(entry) {
    console.log(entry.coAuthor + ':' + entry.subtitle);
  })
})

To rename fields, you can pass in an array with two items, in the format [fromField, toField]:

let parser = new Parser({
  customFields: {
    item: [
      ['dc:coAuthor', 'coAuthor'],
    ]
  }
})

To pass additional flags, provide an object as the third array item. Currently there are two such flags:

  • keepArray (false) - set to true to return all values for fields that can have multiple entries.
  • includeSnippet (false) - set to true to add an additional field, ${toField}Snippet, with HTML stripped out

let parser = new Parser({
  customFields: {
    item: [
      ['media:content', 'media:content', {keepArray: true}],
    ]
  }
})
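
The includeSnippet flag follows the same pattern. The sketch below builds only the options object, so it runs without the library installed. The `ext:body` field is a made-up name for illustration; substitute whatever HTML-bearing element your feed actually contains.

```javascript
// Options object only; pass it to `new Parser(options)` from rss-parser.
const options = {
  customFields: {
    item: [
      // Hypothetical field: rename ext:body to body and request a tag-free companion.
      ['ext:body', 'body', { includeSnippet: true }],
      // Keep every media:content element rather than a single value.
      ['media:content', 'media:content', { keepArray: true }],
    ],
  },
};

// Per the `${toField}Snippet` rule above, the stripped text would be exposed as item.bodySnippet.
console.log(JSON.stringify(options.customFields.item[0]));
```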

Default RSS version

If your RSS Feed doesn't contain a <rss> tag with a version attribute, you can pass a defaultRSS option for the Parser to use:

let parser = new Parser({
  defaultRSS: 2.0
});

xml2js passthrough

rss-parser uses xml2js to parse XML. You can pass these options to new xml2js.Parser() by specifying options.xml2js:

let parser = new Parser({
  xml2js: {
    emptyTag: '--EMPTY--',
  }
});

HTTP Options

Timeout

You can set the amount of time (in milliseconds) to wait before the HTTP request times out (default 60 seconds):

let parser = new Parser({
  timeout: 1000,
});

Headers

You can pass headers to the HTTP request:

let parser = new Parser({
  headers: {'User-Agent': 'something different'},
});

Redirects

By default, parseURL will follow up to five redirects. You can change this with options.maxRedirects.

let parser = new Parser({maxRedirects: 100});

Request passthrough

rss-parser uses the http/https modules to make requests. You can pass options to http.get()/https.get() by specifying options.requestOptions.

For example, to allow unauthorized certificates:

let parser = new Parser({
  requestOptions: {
    rejectUnauthorized: false
  }
});

Contributing

Contributions are welcome! If you are adding a feature or fixing a bug, please be sure to add a test case.

Running Tests

The tests run the RSS parser on several sample RSS feeds in test/input and write the resulting JSON into test/output. If there are any changes to the output files, the tests will fail.

To check if your changes affect the output of any test cases, run

npm test

To update the output files with your changes, run

WRITE_GOLDEN=true npm test

Publishing Releases

npm run build
git commit -a -m "Build distribution"
npm version minor # or major/patch
npm publish
git push --follow-tags