Recently I found node-osmosis, a relatively new module with powerful features such as support for both CSS and XPath selectors, fast scraping, and a clean syntax.
To compare node-osmosis with x-ray, I ran some scrapes with both libraries using CSS and XPath selectors, and ran into the following two problems.
Problem 1: unexplained undefined in node-osmosis output
node-osmosis provides a simple example on its homepage, which reads:
var osmosis = require('osmosis');
osmosis
    .get('www.craigslist.org/about/sites')
    .find('h1 + div a')
    .set('location')
    .follow('@href')
    .find('header + div + div li > a')
    .set('category')
    .follow('@href')
    .paginate('.totallink + a.button.next:first')
    .find('p > a')
    .follow('@href')
    .set({
        'title': 'section > h2',
        'description': '#postingbody',
        'subcategory': 'div.breadbox > span[4]',
        'date': 'time@datetime',
        'latitude': '#map@data-latitude',
        'longitude': '#map@data-longitude',
        'images': ['img@src']
    })
    .data(function(listing) {
        // do something with listing data
    })
    .log(console.log)
    .error(console.log)
    .debug(console.log)
If I just want the location information, I change it to
osmosis
    .get('www.craigslist.org/about/sites')
    .find('h1 + div a')
    .set('location')
    .log(console.log)
    .error(console.log)
    .debug(console.log)
However, what I get is
(get) starting
(get) loaded [get] www.craigslist.org/about/sites
(find) found 714 results for "h1 + div a"undefined
So osmosis found 714 matches for h1 + div a, but I could not figure out what the trailing undefined refers to.
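For reference, the shortened chain above drops the .data(...) callback that the homepage example uses to receive the values stored by .set. A minimal sketch of the shortened chain with that callback restored is below. It does not explain the undefined; it only shows one way to look at what .set('location') produced. The helper name printLocations and its parameters are hypothetical, and the osmosis module is passed in as an argument so the helper itself has no dependencies.

```javascript
// Hypothetical helper (not part of osmosis): runs the shortened chain and
// hands every object built by .set('location') to the onItem callback.
function printLocations(osmosis, url, onItem) {
    return osmosis
        .get(url)
        .find('h1 + div a')
        .set('location')
        .data(onItem)          // same .data(callback) call as in the homepage example
        .error(console.log);
}

// Possible usage, assuming the osmosis package is installed:
// printLocations(require('osmosis'), 'www.craigslist.org/about/sites', function (item) {
//     console.log(item.location);
// });
```

Each item passed to the callback is expected to be an object with a location key, mirroring how listing is used in the homepage example.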
Problem 2: inconsistent results between node-osmosis, x-ray, and the Chrome console
I would like to retrieve product information from RobotShop. I decided to use an XPath selector:
osmosis
    .get('http://www.robotshop.com/en/robots-to-build.html')
    .find('//div[@class="wrap-thumbnailCatTop"]')
    .set('products')
    .log(console.log)
    .debug(console.log)
but I get no results. This is the full output:
(get) starting
(get) loaded [get] http://www.robotshop.com/en/robots-to-build.html
(get) (process) stack: 3, RAM: 30.49Mb (+30.49Mb) requests: 1, heap: 9.20Mb / 16.24Mb
(get) (process) stack: 0, RAM: 30.49Mb (+0.00Mb) requests: 1, heap: 9.22Mb / 16.24Mb
I think my XPath is valid because I tested it in the Chrome console with
$x('//div[@class="wrap-thumbnailCatTop"]')
and got the product descriptions I want. I also tried the CSS selector $('.wrap-thumbnailCatTop') in the console but could not retrieve anything. Eventually I tried the same CSS selector, .wrap-thumbnailCatTop, with x-ray, which is built on cheerio, and got a nice result! The code is:
var Xray = require('x-ray');
var x = Xray();

x('http://www.robotshop.com/en/robots-to-build.html', '.wrap-thumbnailCatTop', [{
    image: 'a img@src',
    product: '.product-name a@title',
    code: 'product-code',
    ratings: '.rating .amount a',
    price: '.price-box .regular-price .price'
}])
    .write('results.json')
and results.json contains
[
    {
        "image": "http://www.robotshop.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/a/r/arduino-uno-usb-microcontroller-rev-3_2.jpg",
        "product": "Arduino Uno USB Microcontroller Rev 3",
        "price": "USD $21.89"
    },
    {
        "image": "http://www.robotshop.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/h/i/hitec-hs422-servo-motor-13.jpg",
        "product": "HS-422 Servo Motor"
    },
After all this, I have a feeling that these tools follow different standards, or more likely have different implementations, when parsing selectors. Can anyone show me the right way to do this?
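One difference between the two selector kinds that I am sure of, independent of any library, is how they treat the class attribute: the XPath test [@class="wrap-thumbnailCatTop"] compares the whole attribute string, while the CSS selector .wrap-thumbnailCatTop matches when that name is any one of the whitespace-separated class tokens. The sketch below only simulates the two rules on plain strings. The function names and the sample attribute values are made up for illustration and are not taken from the RobotShop markup, so this may or may not be related to the empty osmosis result.

```javascript
// XPath-style test, as in //div[@class="name"]: the whole attribute must be equal.
function matchesXPathClassEquals(classAttr, name) {
    return classAttr === name;
}

// CSS-style test, as in .name: the name must be one of the whitespace-separated tokens.
function matchesCssClass(classAttr, name) {
    return classAttr.split(/\s+/).filter(Boolean).indexOf(name) !== -1;
}

// Hypothetical class attribute values, for illustration only.
var samples = [
    'wrap-thumbnailCatTop',
    'wrap-thumbnailCatTop first',
    ' wrap-thumbnailCatTop'
];

samples.forEach(function (classAttr) {
    console.log(JSON.stringify(classAttr),
        'xpath:', matchesXPathClassEquals(classAttr, 'wrap-thumbnailCatTop'),
        'css:', matchesCssClass(classAttr, 'wrap-thumbnailCatTop'));
});
```

An XPath expression that behaves like the CSS class test is commonly written as //div[contains(concat(' ', normalize-space(@class), ' '), ' wrap-thumbnailCatTop ')].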