I have a Base64-encoded image string that I need to convert so that I can read it as an image and analyze it with pytesseract:

import base64
import io
from PIL import Image
import pytesseract
import sys


base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFh....."

img_data = base64.b64decode(base64_string)

img = Image.open(io.BytesIO(img_data)) # <== ERROR LINE

text = pytesseract.image_to_string(img, config='--psm 6')

print(text)

This gives the following error:

Traceback (most recent call last):
  File "D:\aa\xampp\htdocs\xbanca\aa.py", line 14, in <module>
    img = Image.open(io.BytesIO(img_data))
  File "D:\python3.10.10\lib\site-packages\PIL\Image.py", line 3283, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x000001A076F673D0>

I tried using the numpy and requests libraries, but they all give the same result, and the example Base64 image works fine in every other converter.

MrPimiBurn

1 Answer

That's a very common misunderstanding. The string

base64_string = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFh....."

is not a Base64 string but a data URL:

Data URLs, URLs prefixed with the data: scheme, allow content creators to embed small files inline in documents.

The data URL contains a Base64 string, which starts directly after 'base64,'. Therefore you need to cut off the 'data:image/jpeg;base64,' part.

e.g.:

b64 = base64_string.split(",")[1]

After that, you can decode the data:

img_data = base64.b64decode(b64)
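
The two steps above can be combined in a small helper. This is only a minimal sketch: the name decode_data_url and the demo payload are hypothetical and not taken from the question. It also passes validate=True, because by default base64.b64decode discards characters outside the Base64 alphabet, which is most likely why the original call did not fail on the 'data:image/jpeg;base64,' prefix and the error only surfaced later in Image.open.

```python
import base64
import binascii


def decode_data_url(value: str) -> bytes:
    """Return the raw bytes from a data URL or from a bare Base64 string."""
    # A Base64 data URL looks like "data:<mime>;base64,<payload>"; keep only the payload.
    if value.startswith("data:"):
        header, sep, payload = value.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("not a Base64-encoded data URL")
    else:
        payload = value
    # validate=True raises binascii.Error on characters outside the Base64
    # alphabet instead of silently skipping them (the default behaviour).
    return base64.b64decode(payload, validate=True)


# Hypothetical demo payload: JPEG start-of-image bytes plus filler, not a real image.
raw = b"\xff\xd8\xff\xe0" + b"demo"
data_url = "data:image/jpeg;base64," + base64.b64encode(raw).decode("ascii")

# The helper recovers the original bytes.
assert decode_data_url(data_url) == raw

# Decoding the whole data URL with the defaults does not raise; it just yields other bytes.
assert base64.b64decode(data_url) != raw
```

The returned bytes can then be wrapped in io.BytesIO and passed to Image.open exactly as in the question.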

I modified the code from the question and used the Base64 of a small JPEG image, which I encoded on https://www.base64encode.org/,

and got the expected text output:

1 Answer

jps
  • I tried that but got "in load raise OSError(msg) OSError: image file is truncated (6 bytes not processed)". I also tried adding "Image.LOAD_TRUNCATED_IMAGES = True" but got the same error. – MrPimiBurn Feb 19 '23 at 17:17
  • Did you try with a complete string? The data URL string that you show in your question (which I copied to my answer) is truncated (the ..... at the end). – jps Feb 19 '23 at 17:26
  • Yes, of course, I'm trying with real, complete images converted to Base64. – MrPimiBurn Feb 19 '23 at 17:53
  • And on which line does the error occur? – jps Feb 19 '23 at 19:29
  • Share your JPEG and your base64 string via Google Drive or Dropbox and you'll get the answer very quickly... – Mark Setchell Feb 19 '23 at 19:32
  • I installed everything necessary, created a small image, converted it to base64 and used it with your code and the modification according to my answer and got the expected result. So there might be something wrong with your image. But the main problem was the wrong handling of the DataURL, which should be solved now. – jps Feb 19 '23 at 21:23
  • @MarkSetchell Here are some images to test: [link](https://www.dropbox.com/s/x1w75fl1097wqrb/img-to-test.zip?dl=0). They are all mobile screen captures. The weird thing is that it works with some of the images and not with others; check for yourself. – MrPimiBurn Feb 19 '23 at 23:10
  • I downloaded the pictures, converted all of them to base64 with the encoder mentioned in the answer, inserted the base64 strings one by one into the dataURL in the code, and ran it. It worked totally fine for all 5 pictures. – jps Feb 20 '23 at 08:16
  • @jps Likewise! It would be more useful if OP indicated *which* image failed and shared his corresponding base64 string, per my request. – Mark Setchell Feb 20 '23 at 08:19
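
The "image file is truncated" error from the comments can be narrowed down before Pillow is involved. The sketch below is a rough diagnostic, assuming the decoded data is a JPEG: such files normally start with the bytes FF D8 and end with FF D9, so a missing end marker hints that the Base64 string was cut off somewhere. The function name jpeg_looks_complete and the demo bytes are hypothetical, and some valid files carry extra bytes after the end marker, so a False result is a hint rather than proof. As a side note, Pillow documents the flag as ImageFile.LOAD_TRUNCATED_IMAGES (from PIL import ImageFile), so setting it on Image as in the first comment probably has no effect.

```python
import base64

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def jpeg_looks_complete(img_data: bytes) -> bool:
    """Rough check: the data starts with SOI and ends with EOI."""
    return img_data.startswith(JPEG_SOI) and img_data.endswith(JPEG_EOI)


# Hypothetical demo bytes with JPEG markers around filler, not a real image.
fake_jpeg = JPEG_SOI + b"filler" + JPEG_EOI
complete_b64 = base64.b64encode(fake_jpeg).decode("ascii")

# Simulate a string that lost its last four Base64 characters in transit.
truncated_b64 = complete_b64[:-4]

print(jpeg_looks_complete(base64.b64decode(complete_b64)))   # True
print(jpeg_looks_complete(base64.b64decode(truncated_b64)))  # False
```

Running the same check on the img_data of a failing image would show whether the bytes were already incomplete before Image.open was called.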