
I am trying to scrape data from this website: Website link.

I want to download all the PDF files from specific dates.

I can already download the files listed on the first page, but I cannot change the date to go back to earlier dates and get the older PDFs too.

I have tried this line in the scrapy shell:

scrapy.FormRequest.from_response(response,formxpath='//table//td//input[@type="text"]', formdata={'value': "20.05.2017"}, clickdata={'type':'submit'}, method='POST')

But view(response) always shows me the current date.

I am not sure this is correct; I am new to Scrapy and still trying to figure things out. I think POST is the right method: the URL does not change when I change the date, so the form is presumably submitted with POST rather than GET.

Any ideas on how I can get this to work?
I thought FormRequest would be the best option here, but I haven't found similar examples online and the Scrapy documentation did not help me much. The examples I studied all involved login credentials, and they all used FormRequest.from_response().

PS: I have included a screenshot of the HTML code segment that has to do with the date change.


Suraj Kumar
Stavros G

1 Answer


The input field name is "date", not "value":

    <form id="dailyFekForm" name="dailyFekForm" action="/idocs-nph/search/dailyFekForm.html" method="post">
        <br>
        <div>

        </div>  
      <div class="non-printable" style="padding-left:20px;">
            <table>
                <tr>
                    <td style="font-size:100%; color:#3399FF;" align="left" >
                        <table>
                            <tr>
                                <td valign="center" style="font-size:100%; color:#3399FF;" ><b>Ημερομηνία Κυκλοφορίας</b></td>
                                <td>
                                    <img title="Επιλέξτε ημερομηνία για ημερήσια κυκλοφορία" border="0" src="/idocs-nph/images/tooltip.gif" >
                                </td>
                            </tr>
                        </table> 
                    </td>
                    <td><input id="date" name="date" type="text" value="29.05.2017"/></td>
                    <td><img src="/idocs-nph/images/admin/calendar.gif" id="triggerDate"/></td>
                    <td><input class="save" type="submit" value="Αναζήτηση" name="search" id="search"/></td>
                </tr>
            </table>

You can also check what your browser sends using its dev tools: https://i.stack.imgur.com/p46Eq.jpg (check "Form data" at the bottom)

Hence, you can use:

scrapy.FormRequest.from_response(response,formdata={'date': "19.05.2017"})
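
Since the goal is to walk back over several earlier dates, it may help to generate the `formdata` values rather than typing them by hand. The following is a minimal sketch using only the standard library; the helper name `form_dates` and the chosen date range are illustrative assumptions, and the `dd.mm.YYYY` format is taken from the `value` attribute of the `date` input above.

```python
from datetime import date, timedelta


def form_dates(start, end):
    """Yield dd.mm.YYYY strings from end back to start (both inclusive)."""
    current = end
    while current >= start:
        # Same format as the form's default value, e.g. "29.05.2017"
        yield current.strftime("%d.%m.%Y")
        current -= timedelta(days=1)


# One formdata dict per day; each can be passed as formdata= to
# scrapy.FormRequest.from_response(response, formdata=...)
payloads = [{"date": d} for d in form_dates(date(2017, 5, 16), date(2017, 5, 19))]
```

Each dictionary in `payloads` overrides only the `date` field; the remaining form inputs are filled in from the page by `from_response`.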

Sample session using scrapy shell, showing different table rows:

$ scrapy shell http://www.et.gr/idocs-nph/search/dailyFekForm.html
>>> from pprint import pprint
>>> pprint(response.css('table#result_table tr:not(.prop) td b').xpath('normalize-space()').getall())
['ΦΕΚ A 77 - 26.05.2017',
 'ΦΕΚ B 1836 - 25.05.2017',
 'ΦΕΚ B 1837 - 25.05.2017',
 (...)
 'ΦΕΚ Α.Α.Π. 112 - 25.05.2017',
 'ΦΕΚ Α.Α.Π. 113 - 26.05.2017',
 'ΦΕΚ Α.Α.Π. 114 - 26.05.2017',
 'ΦΕΚ Α.Α.Π. 115 - 26.05.2017']
>>> fetch(scrapy.FormRequest.from_response(response,formdata={'date': "19.05.2017"}))
2017-05-29 14:42:50 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.et.gr/idocs-nph/search/dailyFekForm.html> (referer: None) ['partial']
>>> pprint(response.css('table#result_table tr:not(.prop) td b').xpath('normalize-space()').getall())
['ΦΕΚ A 72 - 19.05.2017',
 'ΦΕΚ A 73 - 19.05.2017',
 'ΦΕΚ A 74 - 19.05.2017',
 (...)
 'ΦΕΚ Υ.Ο.Δ.Δ. 234 - 18.05.2017',
 'ΦΕΚ Α.Α.Π. 105 - 16.05.2017',
 'ΦΕΚ Α.Α.Π. 108 - 16.05.2017']
>>> fetch(scrapy.FormRequest.from_response(response,formdata={'date': "16.05.2017"}))
2017-05-29 14:45:53 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.et.gr/idocs-nph/search/dailyFekForm.html> (referer: None) ['partial']
>>> pprint(response.css('table#result_table tr:not(.prop) td b').xpath('normalize-space()').getall())
['ΦΕΚ A 69 - 16.05.2017',
 'ΦΕΚ B 1638 - 15.05.2017',
 'ΦΕΚ B 1639 - 15.05.2017',
 (...)
 'ΦΕΚ Υ.Ο.Δ.Δ. 228 - 16.05.2017',
 'ΦΕΚ Υ.Ο.Δ.Δ. 229 - 16.05.2017',
 'ΦΕΚ Α.Α.Π. 102 - 15.05.2017']
>>> 
paul trmbrth
  • Thank you very much for your time. Does this work for you? I've tried in Scrapy shell and when I use the view(response) it returns me the same date. Shouldn't I be seeing the changed one? – Stavros G May 29 '17 at 12:07
  • Wasn't using fetch.. I feel like an idiot, I've been looking for this error for 3 days now. Thank you so much, really appreciate it. – Stavros G May 29 '17 at 12:52