I have an Excel list of DOIs of papers I'm interested in. Based on this list, I would like to download all the papers.

I tried to do it with requests, as recommended in their documentation. But the PDF files I get are damaged: they are just a few KB in size. I changed the `chunk_size` several times, from `None` up to `1024*1024`, and I have read many posts already. Nothing helps.

Please, what are your ideas?

import pandas as pd
import os
import requests


def get_pdf(doi, file_to_save_to):
    url = 'http://api.elsevier.com/content/article/doi:'+doi+'?view=FULL'
    headers = {
        'X-ELS-APIKEY': "keykeykeykeykeykey",
        'Accept': 'application/pdf'
    }
    r = requests.get(url, stream=True, headers=headers)
    if r.status_code == 200:
        for chunk in r.iter_content(chunk_size=1024*1024):
            file_to_save_to.write(chunk)
            return True


doi_list = pd.read_excel('list.xls')
doi_list.columns = ['DOIs']
count = 0
for doi in doi_list['DOIs']:
    doi = doi.replace('DOI:','')
    pdf = doi.replace('/','%')
    if not os.path.exists(f'path/{pdf}.pdf'):
        file = open(f'path/{pdf}.pdf', 'wb') 
        get_pdf(doi, file)
        count += 1
        print(f"Downloaded: {count} of {len(doi_list['DOIs'])} articles")
denis
renrei

1 Answer

I think your problem is the `return True` inside the `for chunk in r.iter_content(...)` loop. With that line, you'll only ever write one chunk of the PDF, of size `chunk_size`.
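
A minimal illustration of the effect, with a plain list of byte strings standing in for the response chunks (the names here are only for demonstration, not from your code):

```python
def write_chunks_buggy(chunks, out):
    # Mirrors the structure of your loop: the return statement
    # runs on the very first iteration, so only one chunk is written.
    for chunk in chunks:
        out.append(chunk)
        return True

def write_chunks_fixed(chunks, out):
    # The return belongs after the loop, once every chunk is written.
    for chunk in chunks:
        out.append(chunk)
    return True

buggy, fixed = [], []
write_chunks_buggy([b'a', b'b', b'c'], buggy)
write_chunks_fixed([b'a', b'b', b'c'], fixed)
print(len(buggy), len(fixed))  # 1 3
```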

You should also open files using a `with` statement; as written, the file handles are never closed.

import pandas as pd
import os
import requests


HEADERS = {
    'X-ELS-APIKEY': "keykeykeykeykeykey",
    'Accept': 'application/pdf'
}


def get_pdf(doi, file_to_save_to):
    url = f'http://api.elsevier.com/content/article/doi:{doi}?view=FULL'
    with requests.get(url, stream=True, headers=HEADERS) as r:
        if r.status_code == 200:
            for chunk in r.iter_content(chunk_size=1024*1024):
                file_to_save_to.write(chunk)


doi_list = pd.read_excel('list.xls')
doi_list.columns = ['DOIs']
count = 0
for doi in doi_list['DOIs']:
    doi = doi.replace('DOI:','')
    pdf = doi.replace('/','%')
    if not os.path.exists(f'path/{pdf}.pdf'):
        with open(f'path/{pdf}.pdf', 'wb') as file:
            get_pdf(doi, file)
        count += 1
        print(f"Downloaded: {count} of {len(doi_list['DOIs'])} articles")
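
As a further sanity check (not part of your original code), you could compare the number of bytes written against the `Content-Length` header, to detect truncated downloads. A sketch, assuming the response is not compressed (otherwise `Content-Length` refers to the compressed size); the helper names are my own:

```python
def write_and_count(chunks, file_obj):
    """Write an iterable of byte chunks and return the total bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip keep-alive chunks
            file_obj.write(chunk)
            total += len(chunk)
    return total

def is_complete(written, expected):
    """True if the byte count matches the server-reported length.
    If the server sent no Content-Length, assume the download is complete."""
    return expected is None or written == expected

# Inside get_pdf one might then do (sketch):
#   expected = r.headers.get('Content-Length')
#   expected = int(expected) if expected is not None else None
#   written = write_and_count(r.iter_content(chunk_size=1024*1024), file_to_save_to)
#   if not is_complete(written, expected):
#       print(f'Warning: wrote {written} of {expected} bytes')
```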
Kirk
  • Thank you for your input! I changed it, but the pdfs are still some 100KB big. So, I can only see the first page and not the rest of the document. – renrei Dec 05 '19 at 19:56
  • @renrei Could you share the program with the changes somehow? – AMC Dec 05 '19 at 20:48
  • @AlexanderCécile, It looks exactly like Kirk's suggested one – renrei Dec 05 '19 at 21:25
  • @renrei The program doesn’t write the PDF if the file already exists, correct? I’m curious, can you share the content of `doi_list`? Also, you should specify the column names in `read_excel`, there’s no reason not to AFAICT. Where is the excel file coming from, I’m surprised to see it’s in the old XLS format. – AMC Dec 05 '19 at 21:28
  • @AlexanderCécile, I do specify the column name. There is just one column, called "DOIs". The excel list is probably not the issue here. The code does work; it downloads the papers. But it doesn't download the full file, just the first page (some KB). – renrei Dec 06 '19 at 07:37
  • 1
    @renrei Can you confirm that you deleted all the existing files before trying again? – Kirk Dec 06 '19 at 14:11
  • @Kirk, yes I can confirm that. I do that before every new try. – renrei Dec 06 '19 at 16:08
  • 1
    https://requests.readthedocs.io/en/master/user/advanced/#body-content-workflow perhaps you need to use a with statement, or flush the request. I'll update the code example – Kirk Dec 06 '19 at 16:24
  • @renrei I agree with (haha) Kirk, a context manager is always a good idea. – AMC Dec 08 '19 at 05:17
  • I got in contact with ScienceDirect and they told me that the code is working perfectly fine, but they limit the MB you can download via API. So, it is something internally at ScienceDirect. Thank you for your effort and your input, guys! – renrei Jan 02 '20 at 10:50