1

I have a table in SQL Server with a varbinary(max) column that contains file blobs. I am exporting these with Python and pyodbc like this:

import pyodbc
conn = pyodbc.connect('DSN=SQL Server;UID=username;PWD=password')
cursor = conn.cursor()

with open("output.pdf", "wb") as output_file:
    cursor.execute("SELECT top 1 filedata from schm.table_name")
    blob = cursor.fetchone()
    output_file.write(blob[0])

This works for text files, but all other file types (e.g. pdf, xlsx, etc.) are corrupted. Opening the exported file in notepad shows the same characters as casting the column as varchar in SQL Server.

How do I remedy this?

OverflowingTheGlass
  • 2,324
  • 1
  • 27
  • 75
  • Try selecting from a row that contains a PDF file, use `blob = cursor.fetchval()` followed by `print(repr(blob))`. Does the value it displays start with `b'%PDF` ? – Gord Thompson Jun 21 '19 at 12:30
  • No - it starts with `b';\x98s`. I'm guessing that means it somehow needs more metadata? I have a `file_name` field that I could strip the file extension from - not sure how to incorporate that info though. – OverflowingTheGlass Jun 21 '19 at 12:42
  • If you do `blob.find(b'%PDF')` does it return a positive value? – Gord Thompson Jun 21 '19 at 14:06
  • no - it returns `-1` – OverflowingTheGlass Jun 21 '19 at 15:09
  • Is there a corresponding front-end application you can use to extract the PDF file intact so you can compare it with the blob data to investigate how it might be encoded? – Gord Thompson Jun 21 '19 at 16:50
  • I tried a newer document (the original was from a decade ago). The newer one does start with `b'%PDF'`, but it looks like a stream. Exporting it results in a corrupted file. Other recent documents also don't have '%PDF', so it seems to be hit or miss. – OverflowingTheGlass Jun 21 '19 at 17:05
  • Early versions of the PDF file format were proprietary so I can't be 100% sure, but I don't *ever* recall seeing a PDF file that didn't have `%PDF` as its [magic number](https://en.wikipedia.org/wiki/File_format#Magic_number). Do you have *any* information on the process(es) that inserted those blobs? – Gord Thompson Jun 21 '19 at 19:15
  • Unfortunately, no. We are pulling blobs over from Exadata as varbinary fields, and this is what they look like. I can try to get my hands on one of the actual files from the front-end, but I'm not hopeful. – OverflowingTheGlass Jun 21 '19 at 19:31

0 Answers0