
I am having trouble downloading the txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (scroll down to see Download: txt, xls and xml).

My goal is to create a scraper that goes to the linked page, clicks on (for example) the txt link, and saves the downloaded file.

Main problems that I am not sure how to solve:

  • The file doesn't have a real link that I can call to download it; the link is created with JS based on the filters and file type.

  • When I use the requests library for Python and call the link with all headers, it just redirects me to https://www.ceps.cz/en/all-data.

Approaches tried:

  • Using a scraper such as ParseHub to download the link didn't work as intended, although it came closest to what I wanted.

  • Using the requests library to call the link with the headers that the XHR request uses for downloading the file; it just redirects me to https://www.ceps.cz/en/all-data.

If you could propose a solution for this task, I'd be grateful. Thank you in advance. :-)

Loko

2 Answers


You can download this data with Selenium; you just need to specify the directory where the file will be saved. In the code below, I save the txt data to my desktop:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Directory where Chrome should save downloaded files
download_dir = '/Users/doug/Desktop/'

chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory': download_dir}
chrome_options.add_experimental_option('prefs', prefs)
# `options=` replaces the deprecated `chrome_options=` keyword
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.ceps.cz/en/all-data')

# Click the first <li> inside the download container (the txt link)
container = driver.find_element(By.CLASS_NAME, 'download-graph-data')
button = container.find_element(By.TAG_NAME, 'li')
button.click()
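
The click only starts the download, so a script that quits the browser right away may end up with a partial file. Below is a minimal sketch of one way to wait for the file, assuming Chrome marks in-progress downloads with a `.crdownload` suffix and renames them when finished. `list_files` and `wait_for_download` are hypothetical helper names, not part of Selenium.

```python
import time
from pathlib import Path


def list_files(directory):
    """Return the names of the regular files currently in `directory`."""
    return {p.name for p in Path(directory).iterdir() if p.is_file()}


def wait_for_download(directory, before, timeout=30.0, poll_interval=0.5):
    """Wait for a new, finished file to appear in `directory`.

    `before` is the set of file names that existed before the click.
    Returns the Path of a new file, or None if the timeout expires.
    Assumption: unfinished Chrome downloads end in ".crdownload".
    """
    deadline = time.monotonic() + timeout
    while True:
        new = list_files(directory) - set(before)
        partial = {name for name in new if name.endswith(".crdownload")}
        finished = new - partial
        # Only report success once nothing is still being written
        if finished and not partial:
            return Path(directory) / sorted(finished)[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

To use it with the code above, take `before = list_files(download_dir)` just before `button.click()`, then call `wait_for_download(download_dir, before)` before closing the driver.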
duhaime
  • Hi @duhaime, good solution. Can you tell me how to read the HTML content through Selenium? – Naga kiran Oct 01 '18 at 17:26
  • @NagaKiran Sure thing. Using the code above, we'd call `driver.page_source`, which returns the HTML for the current page. I hope that helps! – duhaime Oct 01 '18 at 17:37

You can do it like this:

import requests

txt_format = 'txt'
xls_format = 'xls'  # written in binary mode
xml_format = 'xml'  # written in binary mode

def download(file_type):
    # Build the URL from the requested format rather than always using txt
    url = f'https://www.ceps.cz/download-data/?format={file_type}'

    response = requests.get(url)

    # Compare strings by value (==), not by identity (is)
    if file_type == txt_format:
        with open(f'file.{file_type}', 'w') as file:
            file.write(response.text)
    else:
        with open(f'file.{file_type}', 'wb') as file:
            file.write(response.content)

download(txt_format)
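
Since the question reports that requests gets redirected to https://www.ceps.cz/en/all-data, it may be worth checking for that before writing anything to disk. Here is a minimal sketch, assuming the response behaves like a `requests.Response` (with its `url` and `history` attributes). `redirected_to_landing` and `download_txt` are hypothetical helper names, and the download URL is simply the one from the code above; whether the site serves it without the page's session state is not confirmed here.

```python
from urllib.parse import urlsplit

# Path of the page the question reports being redirected to
LANDING_PATH = "/en/all-data"


def redirected_to_landing(response, landing_path=LANDING_PATH):
    """Return True if `response` was redirected and ended on the landing page.

    `response` needs a `url` attribute (the final URL) and a `history`
    list (the redirect hops), as requests.Response provides.
    """
    final_path = urlsplit(response.url).path.rstrip("/")
    return bool(response.history) and final_path == landing_path


def download_txt(path="file.txt"):
    """Fetch the txt export and save it, unless the request was redirected."""
    import requests  # third-party; imported lazily so the helper above has no dependency

    response = requests.get("https://www.ceps.cz/download-data/?format=txt")
    if redirected_to_landing(response):
        print("Redirected to the landing page; nothing was saved.")
        return False
    with open(path, "w") as fh:
        fh.write(response.text)
    return True
```

Calling `download_txt()` returns `False` when the redirect described in the question occurs, which makes a silent failure (an HTML page saved as `file.txt`) easier to spot.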
Federico Rubbi