3

Can somebody recommend a Node.js module or a JavaScript library (not based on Readability) that can be used to extract content from web pages and RSS feeds?

I found a good PHP library that can do the job - http://fivefilters.org/content-only/ - but I'm looking for a Node.js module that would do the same.

Thank you!

Aerodynamika

3 Answers

12

I wrote a Node.js module just for this purpose called 'unfluff':

https://github.com/ageitgey/node-unfluff

Hopefully that will solve your problem.

Unfluff is based on the popular "python-goose" and "goose" (Scala) page extraction libraries in case you are familiar with those.
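For anyone who wants the "just call a function" usage, here is a minimal sketch. It assumes unfluff is invoked as `extractor(html, language)` and returns an object with `title` and `text` fields, as described in its README. The `extractContent` wrapper name is hypothetical, and the extractor is passed in as an argument so the sketch itself has no hard dependency.

```javascript
// Sketch only: `extractContent` is a hypothetical helper, not part of unfluff.
// The extractor is passed in so the wrapper has no hard dependency.
function extractContent(html, extractor, language = 'en') {
  // unfluff is called as extractor(html, language) and returns a plain object.
  const data = extractor(html, language);
  return {
    title: data.title || '',
    text: data.text || '',
  };
}

// Usage, assuming `npm install unfluff` has been run and `html` holds the page source:
//   const unfluff = require('unfluff');
//   const { title, text } = extractContent(html, unfluff);
```

Fetching the page is left out on purpose; any HTTP client that gives you the HTML as a string should work.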

user3806840
2

I would recommend cheerio. There are a couple of good tutorials out there including this one:

http://maxogden.com/scraping-with-node.html
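To give an idea of what the cheerio route involves, here is a naive sketch that strips common non-content elements and joins the remaining paragraph text. The `extractParagraphs` name and the selector list are assumptions for illustration only; real pages will need tuning, and this is not a replacement for a dedicated extraction library.

```javascript
// Naive sketch: `extractParagraphs` is a hypothetical helper, and the selector
// list is an assumption that will need tuning for real sites.
// The cheerio module is passed in so the sketch has no hard dependency.
function extractParagraphs(html, cheerio) {
  const $ = cheerio.load(html);
  // Drop elements that rarely hold article text.
  $('script, style, nav, header, footer, aside').remove();
  // Collect the trimmed text of each remaining paragraph, skipping empty ones.
  return $('p')
    .map((i, el) => $(el).text().trim())
    .get()
    .filter((line) => line.length > 0)
    .join('\n\n');
}

// Usage, assuming `npm install cheerio` has been run and `html` holds the page source:
//   const cheerio = require('cheerio');
//   const text = extractParagraphs(html, cheerio);
```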

TankofVines
  • Thank you, @TankofVines, but do you know of any cheerio-based module designed specifically for scraping the content of a web page, one that can be used by simply calling a function without writing much extra code? I also found a cheerio example here - http://www.codeproject.com/Tips/702699/Web-scraping-with-Node-js - but again it seems you have to fine-tune it, and I'd like a ready-made solution because I already have a lot of work on the other parts of the code. I'd really appreciate any leads! – Aerodynamika Mar 21 '14 at 19:36
  • @deemeetree, sorry I don't know of a higher level solution in node.js. I did a quick search of npm and I found a few modules, but I think they are still at a lower level than what you are looking for. Hopefully others will chime in. Good luck. – TankofVines Mar 24 '14 at 13:57
  • @deemeetree, what kind of "simple" API are you looking for? – user949300 Jul 05 '14 at 03:49
1

extract-main-text can also extract content well from HTML. In my case, node-unfluff was not reliable for Japanese (and possibly other CJK) content.

Takuya Matsuyama