I am trying to extract data from https://www.marinetraffic.com/en/ais/details/ships/imo:9829069/ using the following Scrapy spider, which saves the response to file.html.

# -*- coding: utf-8 -*-
import scrapy
from fake_useragent import UserAgent

class MarinetrafficSpider(scrapy.Spider):
    name = 'marinetraffic'
    allowed_domains = ['marinetraffic.com']
    ua = UserAgent()
    ua.update()

    def start_requests(self):
        urls = [
            'https://www.marinetraffic.com/en/ais/details/ships/imo:9829069/',
        ]
        # Send a Chrome-like User-Agent instead of Scrapy's default one.
        headers = {'User-Agent': self.ua.chrome}
        for url in urls:
            yield scrapy.Request(url, callback=self.parse, headers=headers)

    def parse(self, response):
        with open('file.html', 'wb') as f:
            f.write(response.body)
        self.log('Saved file')

But I don't get the expected response. The returned response is in file.html.

Please check the debug results.

What modifications do I need to make to the above code so that the returned response is the same as the one I get in the browser?

I would appreciate your suggestions.

Alex
  • If you right-click the page and click 'Save as...', you can download it as an .html file. Open it in a text editor and you will see the markup just as your browser received it. But I assume you want to scrape multiple pages? This is just my suggestion if you don't have any other option. Let me know if you get the data you wanted. – dram95 May 03 '20 at 23:54
  • You'll need to render the page using a headless browser to get a response that's similar to what you see in your browser. You can use Splash (https://splash.readthedocs.io/en/latest/) for this. – Wim Hermans May 04 '20 at 04:33
  • @dram95 Yes I want to scrape multiple pages in which imo number is variable (https://www.marinetraffic.com/en/ais/details/ships/imo:...). – Alex May 04 '20 at 14:58
  • @WimHermans I will check Splash. Interesting. – Alex May 04 '20 at 15:03

1 Answer

The reason you do not see anything is that the website is rendered via JavaScript. In other words, the MarineTraffic server sends you a very basic HTML page along with a JavaScript file that loads the content, then builds and displays the required HTML in the browser. Scrapy does not execute JavaScript, so it only ever receives that basic page.
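If you want to confirm this for a saved response, a rough heuristic is to count how much visible text the HTML contains outside its `<script>` and `<style>` tags. The sketch below uses only the standard library; the function name `looks_js_rendered` and the 200-character threshold are my own assumptions, not anything specific to MarineTraffic.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts <script> tags and text characters outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that is not script or style source.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_js_rendered(html, min_text_chars=200):
    """Heuristic: the page has scripts but almost no text of its own.

    min_text_chars is an arbitrary threshold chosen for illustration.
    """
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.text_chars < min_text_chars
```

Reading file.html into a string and passing it to `looks_js_rendered` should return `True` if the spider only received the JavaScript shell. It is a hint, not proof, since a short static page with a script tag would also match.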

To get the full HTML, including the data you are looking for, you need to emulate a real browser. If you're using Python, have a look at Selenium together with ChromeDriver.
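A minimal sketch of that approach, assuming Selenium 4 and a local Chrome installation. The helper names `build_ship_url` and `fetch_rendered_html` are hypothetical, and the default `wait_css="body"` is only a placeholder: you would need to replace it with a selector for an element that appears once the ship data has loaded, which I have not verified for this site.

```python
def build_ship_url(imo):
    """Return the ship details URL for an IMO number (pattern from the question)."""
    return f"https://www.marinetraffic.com/en/ais/details/ships/imo:{imo}/"


def fetch_rendered_html(url, wait_css="body", timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so build_ship_url works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the placeholder element exists, or raise after timeout seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `fetch_rendered_html(build_ship_url("9829069"))` and writing the returned string to file.html would take the place of the spider's `parse` step. To scrape several ships, loop over the IMO numbers and build each URL with `build_ship_url`.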

But beware: the last time I checked (3 years ago), MarineTraffic had very strong anti-crawler protection that would block you after a couple of pages visited with the Selenium + ChromeDriver setup.

Nicolas