
I'm trying to scrape a website, http://www.vehiculo-robado.com, but the request returns this:

error:       null
statusCode:  200
body:        <html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=6-31980899-0%202NNN%20RT%281508782951589%204%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29&incident_id=874000030218433631-157072954141311030&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 874000030218433631-157072954141311030</iframe></body></html>

The website does have HTML content, though.

This is my middleware to scrape the site:

const request = require('request');

function webScraped(req, res, next) {
    const url = 'http://www.vehiculo-robado.com';
    // Expose a function on req that later handlers can call to fetch the page.
    req.webParsed = function webToScrape(callback) {
        request(url, function (error, response, body) {
            console.log('error:', error);
            console.log('statusCode:', response && response.statusCode);
            console.log('body =========>', body);
            // Pass the error through instead of always reporting success.
            return callback(error, body);
        });
    };
    next();
}

module.exports = webScraped;

I tried other websites, such as Google, and they return HTML fine. I don't know what I'm doing wrong.

skirtle

1 Answer


That website (vehiculo-robado) uses a scraping-protection service called SiteLock (the response itself references Incapsula). That's why it denies your request and sends back an essentially empty HTML page, even though the status code is 200. This is the response I got:

<html style="height:100%">

<head>
  <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  <meta name="format-detection" content="telephone=no">
  <meta name="viewport" content="initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</head>

<body style="margin:0px;height:100%"><iframe
    src="/_Incapsula_Resource?SWUDNSAI=9&xinfo=3-7455753-0%200NNN%20RT%281550759526831%201%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c316%2c0%29%20U10000&incident_id=511001260010653929-37058068785072099&edet=12&cinfo=04000000"
    frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula
    incident ID:
    511001260010653929-37058068785072099</iframe></body>

</html>
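Since the block page arrives with status code 200, checking the status alone won't tell you the request was refused. Here is a minimal sketch of a check on the body instead. The helper name `looksLikeIncapsulaBlock` is hypothetical, and the two markers it looks for are simply the strings visible in the responses above, so treat them as an assumption about what the block page contains:

```javascript
// Hypothetical helper: returns true when a response body looks like the
// Incapsula block page shown above. The markers are taken from those
// sample responses and are an assumption, not a documented format.
function looksLikeIncapsulaBlock(body) {
  if (typeof body !== 'string') {
    return false;
  }
  // The block page loads an iframe from /_Incapsula_Resource and
  // contains the text "Incapsula incident ID" (whitespace may vary).
  return body.includes('_Incapsula_Resource') ||
    /Incapsula\s+incident\s+ID/i.test(body);
}

// Possible use inside the request callback from the question:
// if (looksLikeIncapsulaBlock(body)) {
//   return callback(new Error('Blocked by scraping protection'));
// }
```

This way the caller gets an error rather than the placeholder HTML.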

Bypassing it should be possible by shaping your request to look like a regular user's browser request.
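As a rough sketch of what that could look like with the `request` library from the question: send browser-like headers and keep cookies between requests. The helper names `browserLikeOptions` and `fetchLikeBrowser` are hypothetical, and the header values are only examples of what a desktop browser might send. I haven't verified that this is enough to get past the protection on this site, so treat it as a starting point:

```javascript
// Hypothetical helper: builds options for the `request` library so the
// request carries headers similar to those of a desktop browser.
// The header values below are illustrative assumptions.
function browserLikeOptions(url) {
  return {
    url: url,
    gzip: true, // accept and decode compressed responses
    jar: true,  // remember cookies across requests
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' +
        'AppleWebKit/537.36 (KHTML, like Gecko) ' +
        'Chrome/120.0.0.0 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8'
    }
  };
}

// Hypothetical wrapper around request(); the library is loaded here so
// that browserLikeOptions can be used on its own.
function fetchLikeBrowser(url, callback) {
  const request = require('request');
  request(browserLikeOptions(url), function (error, response, body) {
    callback(error, body);
  });
}
```

In the middleware, `request(url, ...)` would become `fetchLikeBrowser(url, callback)`. If the service also relies on JavaScript running in the browser, headers alone may not be sufficient.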

cbdeveloper