I'm trying to download a PDF from a site and then read it, all in a single Python script running on a single worker dyno on Heroku. However, my script requires the file to be stored temporarily in the ephemeral filesystem before it can be read.
From the documentation, this should be possible:
Each dyno gets its own ephemeral filesystem, with a fresh copy of the most recently deployed code. During the dyno’s lifetime its running processes can use the filesystem as a temporary scratchpad, but no files that are written are visible to processes in any other dyno and any files written will be discarded the moment the dyno is stopped or restarted.
Yet no matter what I do, the script throws an error similar to the one I get on my local machine when the file does not exist (the script otherwise runs fine locally).
The relevant part of my code is below; I am using Tabula to convert the PDF into a CSV.
Note also that checking the file size on Heroku returns the correct value, so the file has been downloaded and is in the filesystem, yet the Tabula wrapper cannot read it for some reason.
import os
import sys
import urllib

import tabula

# urllib.urlretrieve(url[, filename[, reporthook[, data]]])
# `url` is set elsewhere in the script
urllib.urlretrieve(url, 'downloaded.pdf')

# check that the PDF downloaded by checking its file size
filesize = os.path.getsize('downloaded.pdf')
print filesize  # this returns the correct value

# if the PDF was downloaded correctly, convert it to CSV
if filesize > 30000:
    tabula.convert_into("downloaded.pdf",  # error at this line
                        "downloaded.csv",
                        pages="all",
                        output_format="csv")
else:
    print ('404 error')
    sys.exit()  # must be called; a bare `sys.exit` does nothing
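One check that might help narrow this down, as a rough sketch rather than a confirmed diagnosis: assuming the wrapper is tabula-py, it runs the tabula-java JAR under the hood, so it needs a `java` executable on the dyno in addition to the PDF itself. If Java is missing, the failure can look much like a missing-file error. The helper name `diagnose` below is hypothetical (my own, not part of Tabula); it just reports whether the PDF and a Java runtime are both visible to the process.

```python
import os

try:
    from shutil import which  # Python 3
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2


def diagnose(pdf_path):
    """Collect facts that separate a missing file from a missing Java runtime.

    Hypothetical helper for debugging; not part of tabula-py.
    """
    abs_path = os.path.abspath(pdf_path)
    exists = os.path.isfile(abs_path)
    return {
        "cwd": os.getcwd(),
        "pdf_path": abs_path,
        "pdf_exists": exists,
        "pdf_size": os.path.getsize(abs_path) if exists else None,
        "java_path": which("java"),  # None when no `java` is on PATH
    }


if __name__ == "__main__":
    # Print one "key: value" line per fact so it shows up in the dyno logs.
    for key, value in sorted(diagnose("downloaded.pdf").items()):
        print("{0}: {1}".format(key, value))
```

If `pdf_exists` is true but `java_path` is `None`, the file is fine and the missing piece would be the Java runtime rather than the ephemeral filesystem.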
My question is similar to this question, except I am running the script on a single dyno, which should make it possible.