
This is my code:


import scrapy
from scrapy import Spider
from scrapy.http import FormRequest

class ProvinciaSpider(Spider):
    name = 'provincia'
    allowed_domains = ['aduanet.gob.pe']
    start_urls = ['http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias?accion=cargaConsultaManifiesto&tipoConsulta=salidaProvincia']

    def parse(self, response):
        data = {
            'accion': 'consultaManifExpProvincia',
            'salidaPro': 'YES',
            'strMenu': '-',
            'strEmpTransTerrestre': '-',
            'CMc1_Anno': '2022',
            'CMc1_Numero': '96',
            'CG_cadu': '046',
            'viat': '1',
        }

        yield FormRequest('http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias', formdata=data, callback=self.parse_form_page)

    def parse_form_page(self, response):
        table = response.xpath('/html/body/form[1]//td[@class="beta"]/table')
        trs = table.xpath('.//tr')[1:]
        for tr in trs:
            puerto_llegada = tr.xpath('.//td[1]/text()').extract_first().strip()
            pais = tr.xpath('.//td[1]/text()').extract_first().strip()
            bl = tr.xpath('.//td[3]/text()').extract_first().strip()
            peso = tr.xpath('.//td[8]/text()').extract_first().strip()
            bultos = tr.xpath('.//td[9]/text()').extract_first().strip()
            consignatario = tr.xpath('.//td[12]/text()').extract_first().strip()
            embarcador = tr.xpath('.//td[13]/text()').extract_first().strip()
            links = tr.xpath('.//td[4]/a/@href')

            yield response.follow(links.get(),
                                 callback=self.parse_categories,
                                 meta={'puerto_llegada': puerto_llegada,
                                       'pais': pais,
                                       'bl': bl,
                                       'peso': float("".join(peso.split(','))),
                                       'bultos': float("".join(bultos.split(','))),
                                       'consignatario': consignatario,
                                       'embarcador': embarcador})
    def parse_categories(self, response):
        puerto_llegada = response.meta['puerto_llegada']
        pais = response.meta['pais']
        bl = response.meta['bl']
        peso = response.meta['peso']
        bultos = response.meta['bultos']
        consignatario = response.meta['consignatario']
        embarcador = response.meta['embarcador']


        tabla_des = response.xpath('/html/body/form//td[@class="beta"]/table')
        trs3 = tabla_des.xpath('.//tr')[1:]
        for tr3 in trs3:
            descripcion = tr3.xpath('.//td[7]/text()').extract_first().strip()

            yield {'puerto_llegada': puerto_llegada,
                   'pais': pais,
                   'bl': bl,
                   'peso': peso,
                   'bultos': bultos,
                   'consignatario': consignatario,
                   'embarcador': embarcador,
                   'descripcion': descripcion}

And I get this error:

ValueError: Missing scheme in request url: javascript:jsDetalle2('154');

Every link that I want to extract data from has that format, so my code for extracting the data inside each link doesn't work.

The link format is like javascript:jsDetalle2('154'), only the numbers change.

The problem is that the href is neither an absolute URL (http://...) nor a relative one (/manifiesto...). In the first case you only have to follow the link and that's all; in the second case you have to join the second part of the URL with the response URL. But this case is neither, so I don't know how to make it work.

How can I write this so that it works?
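For what it's worth, the numeric ID is plain text inside the href, so even without running the JavaScript it can be recovered with a regex. A small sketch (the helper name is made up):

```python
import re

def extract_detail_number(href):
    """Extract the numeric argument from a javascript:jsDetalle2('...') pseudo-link."""
    m = re.search(r"jsDetalle2\('(\d+)'\)", href)
    return m.group(1) if m else None

print(extract_detail_number("javascript:jsDetalle2('154')"))  # 154
```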

  • it is not a normal link but JavaScript code - normally the browser simply executes this code when you click the link. But Scrapy can't run JavaScript, and adding `http://` is useless. It would need a running browser with the original page, which has the function `jsDetalle2()` loaded. – furas Mar 16 '22 at 21:56
  • first you would have to check in `DevTools` in the browser what the browser does when you click a link like this. Maybe it only shows/hides an element on the page. Or maybe it loads data from a URL which you could use instead of `jsDetalle2()` (and only replace the `154` in this URL). Or maybe you will need [Selenium](https://selenium-python.readthedocs.io/) to control a real web browser which can run JavaScript. – furas Mar 16 '22 at 22:02
  • I checked in the browser, and when I click this link it sends a POST to `http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias` with many values, one of them being `CMc2_NumDet: "154"` - so you will have to do the same. – furas Mar 16 '22 at 22:13
  • The numbers are random - I will not know what numbers the page will have (inside each JavaScript link). Should I use Splash, or can I solve it just with Scrapy? – GONZALO EMILIO CONDOR TASAYCO Mar 16 '22 at 22:27
  • numbers are NOT random. The link `javascript:jsDetalle2('154')` runs a `POST` with the value `154` - and you can get this value from `javascript:jsDetalle2('154')` or from the `text()` of the `<a>` element – furas Mar 16 '22 at 22:34
  • What I meant is that the first link from the table is 154 and then it can be 1 or 500. That's what I meant. It would work with the `text()` in `<a>`. How should I write it in code? – GONZALO EMILIO CONDOR TASAYCO Mar 16 '22 at 22:39
  • Should I use Splash or can I solve it just with Scrapy? And I don't want to use Selenium - I have the exact same program with Selenium and BeautifulSoup but it has problems; I searched and found that they were bugs. So I started learning Scrapy, but in this case do I have to use Splash, or isn't it necessary? – GONZALO EMILIO CONDOR TASAYCO Mar 16 '22 at 22:40
  • you don't need Splash if you know how it sends the POST - if you know what values it sends when you check it in DevTools. It is easy to recognize that `CMc2_NumDet` needs the value from the link, `CMc2_numcon` needs the value from your variable `bl`, etc. – furas Mar 16 '22 at 22:40
  • it seems there is a bigger problem - the page with categories probably needs cookies - at the moment I get the page with search results when I try to get the details. – furas Mar 16 '22 at 23:51

1 Answer


I checked this link in the browser - when I click the link with text 154 it sends a POST with many values, one of them being 'CMc2_NumDet': '154' - so I can get this number from the link and use it in the POST.

In the browser you can see 'CMc2_Numero': "+++96", but in code you need spaces instead of + like "   96" (Scrapy will re-encode the spaces as +), or you can remove all the + like "96".
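This is just standard form encoding: in `application/x-www-form-urlencoded` data a `+` stands for a space. A quick check with Python's `urllib.parse`:

```python
from urllib.parse import quote_plus, unquote_plus

# '+' decodes to a space in form-encoded data, so "+++96" is really "   96".
print(repr(unquote_plus("+++96")))  # '   96'

# Encoding the spaced value puts the '+' signs back.
print(quote_plus("   96"))          # +++96
```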

BTW: I put all the values in `meta` as `item: {...}`, so later I can get them all in one line with `meta['item']`.

        number = tr.xpath('.//td[4]/a/text()').get()

        data = {
            'accion': "consultaManifExpProvinciaDetalle",
            'CMc2_Anno': "2022",
            'CMc2_Numero': "96",    # <--- without `+`
            'CG_cadu': "046",
            'CMc2_viatra': "1",
            'CMc2_numcon': "",
            'CMc2_NumDet': number,  # <---
            'tipo_archivo': "",
            'reporte': "ExpPro",
            'backPage': "ConsulManifExpPro",
        }

        yield FormRequest('http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias',
                          formdata=data,
                          callback=self.parse_categories,
                          meta={"item": {'puerto_llegada': puerto_llegada,
                                         'pais': pais,
                                         'bl': bl,
                                         'peso': float("".join(peso.split(','))),
                                         'bultos': float("".join(bultos.split(','))),
                                         'consignatario': consignatario,
                                         'embarcador': embarcador}})
    
    def parse_categories(self, response):
        print('[parse_categories] url:', response.url)

        item = response.meta['item']

        tabla_des = response.xpath('/html/body/form//td[@class="beta"]/table')
        trs3 = tabla_des.xpath('.//tr')[1:]
        for tr3 in trs3:   # use trs3[:1] for a single result
            item['descripcion'] = tr3.xpath('.//td[7]/text()').extract_first().strip()
            yield item

Full working code.

The page with categories may have many rows in the table (with different Peso Bruto values, which you don't use), so it may give many rows in the CSV.

If you need only one row, then use `trs3[:1]:` instead of `trs3:` in the `for` loop.

I used a different XPath to find the table with "Descripcion", because the previous version didn't check whether the table has a Descripcion header, so it could match 3 tables instead of one.
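That predicate can be checked outside Scrapy with `lxml` (the parser Scrapy's selectors use under the hood); the table `id`s below are made up for the demo:

```python
from lxml import html

doc = html.fromstring("""
<html><body>
  <table id="menu"><tr><td>navigation stuff</td></tr></table>
  <table id="detail">
    <tr><th>Item</th><th>Descripcion</th></tr>
    <tr><td>1</td><td>GREEN ORGANIC FRESH BANANAS</td></tr>
  </table>
</body></html>""")

# The predicate keeps only tables whose header row contains "Descripcion".
tables = doc.xpath('//table[./tr/th[contains(text(), "Descripcion")]]')
print(len(tables), tables[0].get('id'))  # 1 detail
```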

import scrapy
from scrapy import Spider
from scrapy.http import FormRequest

class ProvinciaSpider(Spider):
    
    name = 'provincia'
    allowed_domains = ['aduanet.gob.pe']
    start_urls = ['http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias?accion=cargaConsultaManifiesto&tipoConsulta=salidaProvincia']

    def parse(self, response):
        payload = {
            'accion': 'consultaManifExpProvincia',
            'salidaPro': 'YES',
            'strMenu': '-',
            'strEmpTransTerrestre': '-',
            'CMc1_Anno': '2022',
            'CMc1_Numero': '96',
            'CG_cadu': '046',
            'viat': '1'
        }

        yield FormRequest('http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias',
                          formdata=payload,
                          callback=self.parse_form_page)

    def parse_form_page(self, response):
        print('[parse_form_page] url:', response.url)
        
        table = response.xpath('/html/body/form[1]//td[@class="beta"]/table')
        trs = table.xpath('.//tr')[1:]
        for tr in trs:
            item = {
                'puerto_llegada': tr.xpath('.//td[1]/text()').extract_first().strip(),
                'pais': tr.xpath('.//td[1]/text()').extract_first().strip(),
                'bl': tr.xpath('.//td[3]/text()').extract_first().strip(),
                'peso': tr.xpath('.//td[8]/text()').extract_first().strip().replace(',', ''),    # <---
                'bultos': tr.xpath('.//td[9]/text()').extract_first().strip().replace(',', ''),  # <---
                'consignatario': tr.xpath('.//td[12]/text()').extract_first().strip(),
                'embarcador': tr.xpath('.//td[13]/text()').extract_first().strip(),
            }

            number = tr.xpath('.//td[4]/a/text()').get().strip()
            print('number:', number)
            
            payload = {
                'accion': "consultaManifExpProvinciaDetalle",
                'CMc2_Anno': "2022",
                'CMc2_Numero': "96",     # without `+` or use `space` instead of `+`
                'CG_cadu': "046",
                'CMc2_viatra': "1",
                'CMc2_numcon': "",
                'CMc2_NumDet': number,   # <---
                'tipo_archivo': "",
                'reporte': "ExpPro",
                'backPage': "ConsulManifExpPro",
            }

            yield FormRequest('http://www.aduanet.gob.pe/cl-ad-itconsmanifiesto/manifiestoITS01Alias',
                              formdata=payload,
                              callback=self.parse_categories,
                              meta={"item": item})
        
    def parse_categories(self, response):
        print('[parse_categories] url:', response.url)

        item = response.meta['item']

        table = response.xpath('//table[./tr/th[contains(text(), "Descripcion")]]')
        print('len(table):', len(table))

        trs = table.xpath('.//tr')[1:]
        print('len(trs):', len(trs))
        
        for tr in trs:   # trs[:1]: for single result
            item['descripcion'] = tr.xpath('.//td[7]/text()').extract_first().strip()
            yield item

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(ProvinciaSpider)
c.start() 

Result (with trs[:1])

puerto_llegada,pais,bl,peso,bultos,consignatario,embarcador,descripcion
BEANR,BEANR,MAEU216473186,47420.00,2160,AGROFAIR BENELUX BV,COOPERATIVA AGRARIA APPBOSA,YT GREEN ORGANIC FRESH BANANAS CARTON BOXES AND IN POLYETHYLENE BAGS.
NLRTM,NLRTM,MAEU216473104,83890.00,5280,AGROFAIR BENELUX BV.,TULIPAN NARANJA S.A.C.,FYT GREEN ORGANIC FRESH BANANAS CARTON BOXES AND IN POLYETHYLENE BAGS.
BEANR,BEANR,MAEU216307459,19980.00,285,"Greencof B.V.,",COOPERATIVA AGRARIA RODRIGUEZ DE MENDOZA,285 BAGS OF 69 KG NET OF PERU ORGANIC GREEN COFFEE FAIRTRADE CERTIFIED
JPYOK,JPYOK,MAEU1KT407500,21320.00,709,"HOWA SHOJI CO., LTD",GEALE AGROTRADING E.I.R.L.,GREEN ORGANIC FRESH BANANAS CARTON BOXES AND IN POLYETHYLENE BAGS. BAN
ITCVV,ITCVV,MAEU913779677,66950.00,3240,BATTAGLIO SPA,IREN PERU SOCIEDAD ANONIMA CERRADA - IREN PERU S.A,GREEN ORGANIC FRESH BANANAS CARTON BOXES AND IN POLYETHYLENE BAGS. BAN
NLRTM,NLRTM,MAEU913798070,24700.00,5544,FRUTOS TROPICALES EUROPE B.V.,FRUTOS TROPICALES PERU EXPORT SOCIEDAD ANONIMA CER,"FRESH MANGOES NET WEIGHT: 22,176.00 KG P.A.: 0804.50.20.00 TR.: JKXYA0"
BEANR,BEANR,MAEU216473141,23710.00,1080,AGROFAIR BENELUX BV.,TULIPAN NARANJA S.A.C.,FYT GREEN ORGANIC FRESH BANANAS CARTON BOXES AND IN POLYETHYLENE BAGS.
BEANR,BEANR,MAEU216632137,22270.00,1080,FYFFES INTERNATIONAL,AGRO PACHA S.A.,"GREEN FRESH ORGANIC BANANAS, PACKED IN CARTON BOXES AND POLYETHILENE B"
KRPUS,KRPUS,MAEU913722041,24480.00,1175,TO THE ORDER,PERUPEZ S.A.C.,"NET WEIGHT: 23,500 KG GROSS WEIGHT: 24,480 KG 1,175 SACKS 23,500 KG FR"
NLRTM,NLRTM,MAEU216473211,22520.00,1080,AgroFair Benelux BV,COOPERATIVA AGRARIA DE USUARIOS RIO Y VALLE,ORGANIC FAIRTRADE BANANAS GREEN FRESH CAVENDISH PACKED CARDBOARD BOXES
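Note the quoting in rows like `"Greencof B.V.,"`: the CSV feed exporter quotes any field that itself contains a comma, and Python's `csv` module round-trips such rows safely. A small sketch using sample values from the table above:

```python
import csv
import io

row = ['BEANR', 'MAEU216307459', 'Greencof B.V.,', 'PERU ORGANIC GREEN COFFEE']
buf = io.StringIO()
csv.writer(buf).writerow(row)
line = buf.getvalue().strip()
print(line)  # the comma-bearing field comes out quoted

# Reading the line back restores the original four fields.
print(next(csv.reader(io.StringIO(line))) == row)  # True
```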
furas
  • is it okay if in the first part I put conditionals inside item, for processing the information? And another question: I also want to extract the second column of the next table (the table after the "descripcion" table), but I want it in one line - so my question is whether the yield command can be used with a list? – GONZALO EMILIO CONDOR TASAYCO Mar 17 '22 at 02:01
  • Also can you see the EDIT part of my question? – GONZALO EMILIO CONDOR TASAYCO Mar 17 '22 at 02:14
  • I suppose I can do it like how you did it before, right? Writing all the elements in the meta part – GONZALO EMILIO CONDOR TASAYCO Mar 17 '22 at 02:18
  • if you have new problem then create new question on new page. – furas Mar 17 '22 at 02:21
  • as for `yield`: it may use different objects - dict, dataclass, special class `Item` in `scrapy` but all of them have `key:value` and I don't know if it works with list - but the simplest answer is: use yield with list and see what will happen. – furas Mar 17 '22 at 02:25
  • simply use it and see what will happen. it is standard method to learn. – furas Mar 17 '22 at 02:26
  • as for comparison - I keep it as a dictionary, so now it is `item["puerto_llegada"] == ...` – furas Mar 17 '22 at 02:39
  • And if you have many elements to compare then you can keep also as dictionary `rules = {"RULED" :"ST PETERSBURG", ... }` and use `for`-loop to make code shorter : `for key,val in rules.items(): if key in item["puerto_llegada"]: item["puerto_llegada"] = val` – furas Mar 17 '22 at 02:42
  • Yeah yeah I solved it. But the "descripcion" part isn't extracting well. It's happening just like my Selenium + BeautifulSoup program: it extracts correctly until some point, and then it starts extracting randomly. The 452 should be according to the table. – GONZALO EMILIO CONDOR TASAYCO Mar 17 '22 at 02:56
  • with this problem I can only suggest to `print()` all values and results (or save them in `logging`) and later compare with the results in the browser when you visit the pages manually - maybe there is some other element which was missed in the code. OR maybe the problem is not the code but the server - maybe it gives wrong results if it gets too many requests in a short time. – furas Mar 17 '22 at 03:00
  • I've already compared them with the print(), I think is the server. But how can I solve it? – GONZALO EMILIO CONDOR TASAYCO Mar 17 '22 at 03:08
  • if the problem is the server then you can't solve it. You would have to write to the admins and ask them to fix it. And if the problem is that the server can't respond to so many requests, then simply slow down and use `time.sleep()` between requests. – furas Mar 17 '22 at 03:14
  • I tried user agents and download delay, but nothing. It seems to be when I extract from the 3rd page (inside the links): it starts extracting everything just fine, but at some point every element starts malfunctioning and extracts randomly. So I commented out everything from the 3rd page and ran it, and it extracted everything very well. What do you think? – GONZALO EMILIO CONDOR TASAYCO Mar 17 '22 at 04:47
  • I have no idea what is the problem. It would need to see and observe requests between browser and server and compare with requests between code and server - maybe it would need to use local proxy server like [Charles](https://www.charlesproxy.com/) to see all requests. And all this needs time. – furas Mar 17 '22 at 06:06
  • Well man, I guess this is all then. You really know a lot. Thanks for everything really. Idk if you have facebook or something to stay in contact?. If you don't want to I understand. Anyway, good luck and thanks. – GONZALO EMILIO CONDOR TASAYCO Mar 17 '22 at 19:04