
I assumed in-memory file-like objects would behave like regular files, but I am not able to get Textract to "read" a

<StringIO.StringIO instance at 0x05039EB8>

although the program runs fine if I save the JPEG file to disk and read in the normal course.

The JPEG is extracted from a PDF, following Ned Batchelder's excellent blog post "Extracting JPGs from PDFs". Relevant code below:

type(jpg)    # str (on Python 2.7)
buff = StringIO.StringIO()
buff.write(jpg)
buff.seek(0)
type(buff)   # instance
print buff   # <StringIO.StringIO instance at 0x05039EB8>
dt = Image.open(buff)
print dt     # <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=2630x597 at 0x58C2A90>
text = textract.process(dt)

The last line fails: Textract cannot read the JpegImageFile. If I instead do

text=textract.process(buff.getvalue())

I get an error: must be encoded string without NULL bytes, not str

How do I get Textract to read from the in-memory file or streams?

Pradeep

1 Answer

I have found a solution; in-memory files are not the way to deal with legacy code. Writing the extracted JPEG to a temporary file on disk worked, using

tempfile.NamedTemporaryFile

It is a bit tedious to write the data stream to a temp file and then run textract.process on it, but I couldn't figure out a BytesIO/StringIO way of passing the byte stream to textract. According to the Textract docs, it expects a file. Updated workaround code snippet:

import textract

# Read the whole PDF as raw bytes (Python 2.7).
pdf = open('file name', "rb").read()

startmark = "\xff\xd8"
startfix = 0
endmark = "\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find("stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream+20)
    if istart < 0:
        i = istream+20
        continue
    iend = pdf.find("endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend-20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print "JPG %d from %d to %d" % (njpg, istart, iend)
    jpg = pdf[istart:iend]

    njpg += 1
    i = iend

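# Note: this runs after the loop, so only the last JPEG found is processed.
import tempfile
# delete=False keeps the file on disk after close() so textract can open it by name.
temp = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg')
temp.write(jpg)
temp.close()
text = textract.process(temp.name)
print text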
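
The temp-file step above can be wrapped in a small helper that also deletes the file afterwards. This is only a sketch: `process_bytes_via_tempfile` and its `process` parameter are names made up for illustration, and the callable is passed in so the helper does not depend on textract itself.

```python
import os
import tempfile


def process_bytes_via_tempfile(data, process, suffix=".jpg"):
    """Write `data` to a temp file, call `process(path)`, then delete the file.

    `process` is any callable that takes a file path (hypothetical parameter;
    textract.process is the intended candidate).
    """
    # delete=False keeps the file on disk after close(), so it can be
    # reopened by name; we remove it ourselves in the finally block.
    temp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        temp.write(data)
        temp.close()
        return process(temp.name)
    finally:
        temp.close()  # harmless if already closed
        os.remove(temp.name)
```

With textract installed, the call would presumably look like `text = process_bytes_via_tempfile(jpg, textract.process)`.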

Info: Python 2.7 on Win7; forced UTF-8 encoding

import sys
reload(sys)
sys.setdefaultencoding('UTF8')

Hope this helps someone, because textract is actually a great piece of code. The PDF-to-JPEG extraction code is courtesy of Ned Batchelder's "Extracting JPGs from PDFs" (2007).
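
For reuse, the same scanning loop can be put in a function that returns every JPEG it finds rather than just the last one. This is a sketch of that idea, not Ned Batchelder's original script: the name `extract_jpgs` is made up here, and it keeps the same heuristic (look for the JPEG start marker within 20 bytes of each `stream` keyword). It uses bytes literals so it should run on Python 2.7 and Python 3 alike.

```python
def extract_jpgs(pdf):
    """Return a list of JPEG byte strings found in raw PDF bytes (heuristic scan)."""
    startmark = b"\xff\xd8"  # JPEG start-of-image marker
    endmark = b"\xff\xd9"    # JPEG end-of-image marker
    jpgs = []
    i = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        # Only accept a JPEG that starts right after the "stream" keyword.
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise ValueError("Didn't find end of stream!")
        # The end marker is expected just before "endstream".
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise ValueError("Didn't find end of JPG!")
        iend += len(endmark)
        jpgs.append(pdf[istart:iend])
        i = iend
    return jpgs
```

Each returned item can then be written to its own temp file and passed to textract.process, one at a time.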

Pradeep