
Can somebody help me by telling me what the error in my code is?

I run `scrapy crawl provincia -o table_data_results.csv` in cmd, but the resulting CSV file is empty. I think it isn't scraping anything.


from scrapy import Spider
from scrapy.http import FormRequest

class ProvinciaSpider(Spider):
    name = 'provincia'
    allowed_domains = ['aduanet.gob.pe']
    start_urls = ['http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias?accion=cargaConsultaManifiesto&tipoConsulta=salidaProvincia']

    def parse(self, response):
        data ={ 'accion': 'consultaManifExpProvincia',
        'salidaPro': 'YES',
        'strMenu': '-',
        'strEmpTransTerrestre': '-',
        'CMc1_Anno': '2022',
        'CMc1_Numero': '96',
        'CG_cadu': '046',
        'viat': '1'}

        yield FormRequest('http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias', formdata=data, callback=self.parse_form_page)
    
    def parse_form_page(self, response):
        table = response.xpath('/html/body/form[1]/table[5]/tbody/tr/td/table/tbody/tr[1]/td/table')
        trs= table.xpath('.//tr')[1:]
        for tr in trs:
            puerto_llegada= tr.xpath('.//td[0]/text()').extract_first().strip()
            pais= tr.xpath('.//td[0]/text()').extract_first().strip()
            bl= tr.xpath('.//td[2]/text()').extract_first().strip()
            peso= tr.xpath('.//td[7]/text()').extract_first().strip()
            bultos= tr.xpath('.//td[8]/text()').extract_first().strip()
            consignatario= tr.xpath('.//td[11]/text()').extract_first().strip()
            embarcador= tr.xpath('.//td[12]/text()').extract_first().strip()

            yield {'puerto_llegada': puerto_llegada,
                   'pais': pais,
                   'bl': bl,
                   'peso': peso,
                   'bultos': bultos,
                   'consignatario': consignatario,
                   'embarcador': embarcador}

EDIT: I also want to put this inside my code:


links = tr.xpath('.//td[4]/text()')
yield response.follow(links.get(), callback=self.parse_categories)

def parse_categories(self, response):
    tabla_des = response.xpath('/html/body/form//td[@class="beta"]/table')
    trs3 = tabla_des.xpath('.//tr')[1:]
    for tr3 in trs3:
        descripcion = tr3.xpath('.//td[7]/text()').extract_first().strip()

and in the yield part I want it like this:
yield {'puerto_llegada': puerto_llegada,
       'pais': pais,
       'bl': bl,
       'peso': float("".join(peso.split(','))),
       'bultos': float("".join(bultos.split(','))),
       'consignatario': consignatario,
       'embarcador': embarcador,
       'descripcion': descripcion}

Where should I put it?

  • Have you printed the response you get back, to make sure it looks like you think it does? – Tim Roberts Mar 15 '22 at 04:02
  • Forgive me, but I'm new to scrapy; I only know how to print the response by running scrapy shell with the URL in cmd, but that doesn't cover the extracting part. – GONZALO EMILIO CONDOR TASAYCO Mar 15 '22 at 04:09
  • I don't know what you're saying. This is just basic debugging. If you add `print(response.text)` as the first thing in your callback, you can see what it returned. Right? – Tim Roberts Mar 15 '22 at 04:14
  • ??? What do you mean by that? Do you mean you get the string "okay"? – Tim Roberts Mar 15 '22 at 05:21
  • Servers may send different HTML to different browsers and devices, so you should check whether you really have the elements you need and whether they are in the expected places - and it's better to use `id`, `class` and other attributes to search for elements. – furas Mar 15 '22 at 06:29
  • btw: browsers show `tbody` inside `table` in DevTools, but the raw HTML usually doesn't have it - so using `tbody` in an `xpath` can cause problems. – furas Mar 15 '22 at 06:30
  • The browser also sends other values in the form - they are empty, but the server may check whether they exist and skip the results otherwise. – furas Mar 15 '22 at 06:35

1 Answer


I found two problems:

  1. The xpath can't find the table - `print(len(table))` shows 0 - so I used a different xpath:

    '/html/body/form[1]//td[@class="beta"]/table'
    
  2. xpath indexing starts at 1, but you use `td[0]` - so I used `td[1]` and shifted the indexes in the other `td`s accordingly.
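
A quick illustration of point 2 (using only the standard library's `ElementTree` rather than Scrapy's selectors, just to keep it self-contained): XPath positions start at 1, so `td[1]` is the *first* cell of a row and `td[0]` matches nothing.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for one table row; td[1] selects the FIRST cell
# because XPath positions are 1-based.
html = (
    "<table>"
    "<tr><td>header1</td><td>header2</td></tr>"
    "<tr><td>JPYOK</td><td>21,320.00</td></tr>"
    "</table>"
)
root = ET.fromstring(html)
data_row = root.findall("tr")[1]        # skip the header row

first_cell = data_row.findall("td[1]")  # position 1 = first <td>
print(first_cell[0].text)               # -> JPYOK
second_cell = data_row.findall("td[2]") # position 2 = second <td>
print(second_cell[0].text)              # -> 21,320.00
```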

from scrapy import Spider
from scrapy.http import FormRequest

class ProvinciaSpider(Spider):
    name = 'provincia'
    allowed_domains = ['aduanet.gob.pe']
    start_urls = ['http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias?accion=cargaConsultaManifiesto&tipoConsulta=salidaProvincia']

    def parse(self, response):
        data ={ 'accion': 'consultaManifExpProvincia',
        'salidaPro': 'YES',
        'strMenu': '-',
        'strEmpTransTerrestre': '-',
        'CMc1_Anno': '2022',
        'CMc1_Numero': '96',
        'CG_cadu': '046',
        'viat': '1'}

        yield FormRequest('http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias', formdata=data, callback=self.parse_form_page)
    
    def parse_form_page(self, response):
        table = response.xpath('/html/body/form[1]//td[@class="beta"]/table')
        print('table:', len(table))
        trs = table.xpath('.//tr')[1:]
        print('trs:', len(trs))
        for tr in trs:
            tds = tr.xpath('.//td')
            print('tds:', len(tds))
            if not tds:
                print('empty row')
            else:
                puerto_llegada= tr.xpath('.//td[1]/text()').extract_first().strip()
                pais= tr.xpath('.//td[1]/text()').extract_first().strip()
                bl= tr.xpath('.//td[3]/text()').extract_first().strip()
                peso= tr.xpath('.//td[8]/text()').extract_first().strip()
                bultos= tr.xpath('.//td[9]/text()').extract_first().strip()
                consignatario= tr.xpath('.//td[12]/text()').extract_first().strip()
                embarcador= tr.xpath('.//td[13]/text()').extract_first().strip()
    
                yield {'puerto_llegada': puerto_llegada,
                       'pais': pais,
                       'bl': bl,
                       'peso': peso,
                       'bultos': bultos,
                       'consignatario': consignatario,
                       'embarcador': embarcador}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(ProvinciaSpider)
c.start()

Result:

puerto_llegada,pais,bl,peso,bultos,consignatario,embarcador
JPYOK,JPYOK,MAEU1KT407500,"21,320.00",709,"HOWA SHOJI CO., LTD",GEALE AGROTRADING E.I.R.L.
BEANR,BEANR,MAEU216307459,"19,980.00",285,"Greencof B.V.,",COOPERATIVA AGRARIA RODRIGUEZ DE MENDOZA
NLRTM,NLRTM,MAEU216473104,"83,890.00",5280,AGROFAIR BENELUX BV.,TULIPAN NARANJA S.A.C.
BEANR,BEANR,MAEU216473141,"23,710.00",1080,AGROFAIR BENELUX BV.,TULIPAN NARANJA S.A.C.
BEANR,BEANR,MAEU216473186,"47,420.00",2160,AGROFAIR BENELUX BV,COOPERATIVA AGRARIA APPBOSA
NLRTM,NLRTM,MAEU216473211,"22,520.00",1080,AgroFair Benelux BV,COOPERATIVA AGRARIA DE USUARIOS RIO Y VALLE
BEANR,BEANR,MAEU216632137,"22,270.00",1080,FYFFES INTERNATIONAL,AGRO PACHA S.A.
KRPUS,KRPUS,MAEU913722041,"24,480.00",1175,TO THE ORDER,PERUPEZ S.A.C.
ITCVV,ITCVV,MAEU913779677,"66,950.00",3240,BATTAGLIO SPA,IREN PERU SOCIEDAD ANONIMA CERRADA - IREN PERU S.A
NLRTM,NLRTM,MAEU913798070,"24,700.00",5544,FRUTOS TROPICALES EUROPE B.V.,FRUTOS TROPICALES PERU EXPORT SOCIEDAD ANONIMA CER
furas
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/243030/discussion-on-answer-by-furas-scrapy-doesnt-print-anything). – Samuel Liew Mar 17 '22 at 14:06