I recently found node-osmosis, a relatively new module with powerful features: it accepts both CSS and XPath selectors, scrapes quickly, and has a clean syntax.

So I compared node-osmosis and x-ray by running some scrapes with both CSS and XPath selectors, and ran into two problems.

Problem 1: unexplained output from node-osmosis

node-osmosis provides a simple example on its homepage:

var osmosis = require('osmosis');
osmosis
.get('www.craigslist.org/about/sites')
.find('h1 + div a')
.set('location')
.follow('@href')
.find('header + div + div li > a')
.set('category')
.follow('@href')
.paginate('.totallink + a.button.next:first')
.find('p > a')
.follow('@href')
.set({
    'title':        'section > h2',
    'description':  '#postingbody',
    'subcategory':  'div.breadbox > span[4]',
    'date':         'time@datetime',
    'latitude':     '#map@data-latitude',
    'longitude':    '#map@data-longitude',
    'images':       ['img@src']
})
.data(function(listing) {
    // do something with listing data
})
.log(console.log)
.error(console.log)
.debug(console.log)

If I only want the location information, I change it to

osmosis
.get('www.craigslist.org/about/sites')
.find('h1 + div a')
.set('location')
.log(console.log)
.error(console.log)
.debug(console.log)

However, what I get is

(get) starting
(get) loaded [get] www.craigslist.org/about/sites 
(find) found 714 results for "h1 + div a"undefined

So osmosis found 714 matches for h1 + div a, but I could not figure out what the trailing undefined refers to.

Problem 2: inconsistent results between node-osmosis, x-ray, and the Chrome console

I would like to retrieve product information from RobotShop. I decided to use an XPath selector:

osmosis
  .get('http://www.robotshop.com/en/robots-to-build.html')
  .find('//div[@class="wrap-thumbnailCatTop"]')
  .set('products')
  .log(console.log)
  .debug(console.log)

but I get nothing back:

(get) starting
(get) loaded [get] http://www.robotshop.com/en/robots-to-build.html 
(get) (process) stack: 3, RAM: 30.49Mb (+30.49Mb) requests: 1, heap: 9.20Mb / 16.24Mb
(get) (process) stack: 0, RAM: 30.49Mb (+0.00Mb) requests: 1, heap: 9.22Mb / 16.24Mb

I think my XPath is valid because I tested it in the Chrome console

$x('//div[@class="wrap-thumbnailCatTop"]')

and got the product descriptions I want. I also tried the CSS selector $('.wrap-thumbnailCatTop') in the console but could not retrieve anything. Eventually I tried the same CSS selector, .wrap-thumbnailCatTop, with x-ray, which is built on cheerio, and got a nice result. The code is:

x('http://www.robotshop.com/en/robots-to-build.html', '.wrap-thumbnailCatTop', [{
  image: 'a img@src',
  product: '.product-name a@title',
  code: 'product-code',
  ratings: '.rating .amount a',
  price: '.price-box .regular-price .price'
}])
  .write('results.json')

and results.json starts with

[
  {
    "image": "http://www.robotshop.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/a/r/arduino-uno-usb-microcontroller-rev-3_2.jpg",
    "product": "Arduino Uno USB Microcontroller Rev 3",
    "price": "USD $21.89"
  },
  {
    "image": "http://www.robotshop.com/media/catalog/product/cache/1/small_image/135x/9df78eab33525d08d6e5fb8d27136e95/h/i/hitec-hs422-servo-motor-13.jpg",
    "product": "HS-422 Servo Motor"
  },

So I have a feeling that there are different standards, or more likely different implementations, for parsing selectors. Can anyone show me the right way to do this?

pateheo

1 Answer

You are not seeing anything because Osmosis does not log the data it collects by default. It fetched the page and matched the elements you wanted, but you did not tell it what to do with the data. The following code prints the data as it is processed.

osmosis
  .get('www.craigslist.org/about/sites')
  .find('h1 + div a')
  .set('location')
  .data(function(data) {
    console.log(data);
  });

You could also accumulate the data in an array and then process the whole array at the end with .done().
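
A minimal sketch of that accumulate-then-finish approach might look like the following. The function name collectLocations and its onDone callback are hypothetical names chosen for illustration. The osmosis module is passed in as an argument rather than required inside the block, and the selector is the one from the question. It assumes .data() fires once per matched item and .done() fires once after the scrape has finished.

```javascript
// Sketch: push every scraped item into an array, then hand the whole
// array to a callback once osmosis reports that it is finished.
// `collectLocations` and `onDone` are hypothetical names, not part of osmosis.
function collectLocations(osmosis, url, onDone) {
  const results = [];
  return osmosis
    .get(url)
    .find('h1 + div a')
    .set('location')
    .data(function (item) {
      // Called for each matched element, e.g. { location: '...' }
      results.push(item);
    })
    .error(console.log)
    .done(function () {
      // Called after all items have been processed
      onDone(results);
    });
}
```

It could be called as collectLocations(require('osmosis'), 'www.craigslist.org/about/sites', function (all) { console.log(all.length); }), which prints the number of collected entries instead of logging each one as it arrives.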

Aner