TODO: Convert a TIFF file from URL into PDF

I have a TIFF file that I need to download from a URL. When I save it to the current working directory, the download succeeds, but I cannot open the resulting file.

1. Command used to download the file into cwd:

import urllib.request
sample_tiff_url = "https://www.gati.com/viewPOD2.jsp?dktno=322012982"
urllib.request.urlretrieve(sample_tiff_url, "check.tiff")

My plan was to download the file locally and then convert it to PDF using this thread.

2. Since the downloaded file would not open, I tried a different approach: converting the bytes of the response directly into a PDF.

import requests
sample_tiff_url = "https://www.gati.com/viewPOD2.jsp?dktno=322012982"
resp = requests.get(sample_tiff_url,stream=True)
print(resp.content)

print(type(resp.content))
>>>bytes

3. Another thing I tried:

import img2pdf
import base64
img_content = base64.b64decode(resp.content)
content = img2pdf.convert(img_content)

which gives the following error:

ImageOpenError: cannot read input image (not jpeg2000). PIL: error reading image: cannot identify image file <_io.BytesIO object at 0x7ff46368c410>

I also tried this:

from PIL import Image
import io
pil_bytes = io.BytesIO(resp.content)
pil_image = Image.open(pil_bytes)

Error:

UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7ff4642e2650>

4. Lastly:

import requests
from PyPDF2 import PdfFileMerger, PdfFileReader
sample_tiff_url = "https://www.gati.com/viewPOD2.jsp?dktno=322012982"
resp = requests.get(sample_tiff_url,stream=True)
PdfFileReader(resp.content)

which also gives an error (posted as a screenshot).

To be honest, I have not worked with image libraries or image files before, so I do not understand the errors I am getting.

TL;DR: How can I either download the .tiff file locally, or read the bytes returned by that URL and convert/write them to a PDF?

user10089194

1 Answer

import requests
import io
from PIL import Image

url = 'https://www.gati.com/viewPOD2.jsp?dktno=322012982'
r = requests.get(url)

pil_bytes = io.BytesIO(r.content)
pil_image = Image.open(pil_bytes)

# Needed to get around ValueError: cannot save mode RGBA
rgb = Image.new('RGB', pil_image.size)
rgb.paste(pil_image)
rgb.save('downloaded_image.pdf', 'PDF')
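
As the comments note, this server sometimes returns an HTML error page instead of the image, which is what makes `Image.open` raise `UnidentifiedImageError`. A minimal sketch of a check you could run on `r.content` before opening it is shown here. The helper names `looks_like_tiff` and `looks_like_html` are made up for illustration, and the HTML test is only a rough heuristic.

```python
# Byte signatures that open a TIFF file (classic and BigTIFF, both byte orders).
TIFF_SIGNATURES = (
    b"II*\x00",  # little-endian classic TIFF
    b"MM\x00*",  # big-endian classic TIFF
    b"II+\x00",  # little-endian BigTIFF
    b"MM\x00+",  # big-endian BigTIFF
)


def looks_like_tiff(data: bytes) -> bool:
    """Return True if the payload starts with a known TIFF signature."""
    return data[:4] in TIFF_SIGNATURES


def looks_like_html(data: bytes) -> bool:
    """Rough heuristic: True if an <html> or <!doctype> tag appears early on."""
    head = data[:512].lower()
    return b"<html" in head or b"<!doctype" in head
```

With this in place, something like `if not looks_like_tiff(r.content): raise ValueError('server did not return a TIFF')` before `Image.open` would give a clearer error than the one from Pillow.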
GordonAitchJay
  • Here is a tiff file, I was getting the same error as above, though the response was different. "https://www.gati.com/viewPOD2.jsp?dktno=322012982". Used a sample file since this link was temporarily down. I'll update the link in question too. – user10089194 Jul 28 '22 at 12:13
  • The following code seems to be working for me now: `from PIL import Image; import io; pil_bytes = io.BytesIO(resp.content); pil_image = Image.open(pil_bytes)` – user10089194 Jul 28 '22 at 12:25
  • Yeah I just tried my code with the new url (minus the `url.replace` call), and it converted to a pdf file just fine. – GordonAitchJay Jul 28 '22 at 12:26
  • Can you please update your answer for others who don't read the comments? – user10089194 Jul 28 '22 at 12:30
  • No worries. Will do. Cheers – GordonAitchJay Jul 28 '22 at 12:31
  • One more request for help. The links sometimes return a TIFF file and sometimes HTML. In your previous answer, you posted the logic for the HTML case. I tried that without the `url.replace` line, because I don't know JS and could not figure out where the request is redirected to, and ran into the following error: UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fa209755470>. Could you help me understand how to get the redirected link in Python? I am writing a generic image converter function, so I'll be passing different links all the time and cannot hardcode it. – user10089194 Jul 31 '22 at 01:40
  • Here is the link for the file: "https://www.gati.com/showPOD.jsp?dktNo=151642237". I am talking about how to get the values in this part from your previous answer ---> url = url.replace('file-examples.com/wp-content/uploads/', 'file-examples.com/storage/fe52cb0c4862dc676a1b341/') – user10089194 Jul 31 '22 at 05:30
  • That link doesn't work for me: 404 not found. Use the Network Monitor in your browser's Developer Tools (press Ctrl + Shift + I), then enter the URL. The Network Monitor will display all the HTTP requests. There are a few ways it could work. The first HTTP request might return a 30x status code, and your browser will automatically be redirected to another location, which is the file. Alternatively, the first HTTP request could return 200 OK, and then JavaScript executed in your browser makes another HTTP request for the file. The stack trace in the Network Monitor can help you. – GordonAitchJay Jul 31 '22 at 06:50
  • URL---->>> https: // www.gati.com/showPOD.jsp?dktNo=151642237 – user10089194 Jul 31 '22 at 08:57
  • https: // www.gati.com/showPOD.jsp?dktNo=151642237. No spaces in the URL. The URL does download a file, but it looks like it is corrupted. Even in the Inspect -> Network tab, I can see the request, headers, etc., but nothing that tells me whether it is redirecting. I tried `response = requests.get(url)` and then inspected attributes like `response.history` and `response.url`, but no luck. – user10089194 Jul 31 '22 at 09:11
  • Yeah same. I suspect the server has issues. I was able to get the right tiff file twice, but it mostly returned 200 OK, with the contents being HTML with `java.io.EOFException` in the body. I was thinking maybe it was discriminating against the Python-Requests `User-Agent`, but I don't think that's the case. – GordonAitchJay Jul 31 '22 at 11:00
  • The file the server sends you isn't corrupt, it's just HTML with a .tiff extension (except for when it sends you the actual image file, on occasion). – GordonAitchJay Jul 31 '22 at 11:01
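
Since the comments above report that the server only occasionally returns the real TIFF, one possible workaround is to retry the request until the payload starts with a TIFF signature. The sketch below is only an illustration under that assumption, not something the answer itself proposes. The names `fetch_until_valid` and `starts_with_tiff_signature`, the attempt count, and the delay are all hypothetical choices, and the fetcher is passed in as a function so the retry logic does not depend on any particular HTTP library.

```python
import time
from typing import Callable


def starts_with_tiff_signature(data: bytes) -> bool:
    """True for classic TIFF payloads in either byte order."""
    return data[:4] in (b"II*\x00", b"MM\x00*")


def fetch_until_valid(
    fetch: Callable[[], bytes],
    is_valid: Callable[[bytes], bool],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> bytes:
    """Call fetch() until is_valid(payload) is true or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        payload = fetch()
        if is_valid(payload):
            return payload
        if attempt < attempts:
            time.sleep(delay_seconds)  # brief pause before asking the server again
    raise RuntimeError(f"no valid payload after {attempts} attempts")


# Example wiring with requests (needs network access, so it is left commented out):
#   import requests
#   data = fetch_until_valid(
#       lambda: requests.get(url, timeout=30).content,
#       starts_with_tiff_signature,
#   )
```

The returned bytes can then be wrapped in `io.BytesIO` and passed to `Image.open` as in the answer. If every attempt returns HTML, the `RuntimeError` makes it clear that the problem is on the server side rather than in the conversion code.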