0

I am trying to convert pdf files to Image and then use pytesseract to ocr the files. I was able to do it successfully on the files which are present in the linux local path but not with hdfs path.

    from wand.image import Image as wi

       >>> wi(filename = 'hdfs://boboda02.boobo.com:8020/bda/clamsops/raw/personal_brella_test/09_29_2015_090902.pdf',resolution = 300)

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "/home/sam/my_env_1/lib/python2.7/site-packages/Wand-0.4.2-py2.7.egg/wand/image.py", line 2534, in __init__

File "/home/sam/my_env_1/lib/python2.7/site-packages/Wand-0.4.2-py2.7.egg/wand/image.py", line 2601, in read

File "/home/sam/my_env_1/lib/python2.7/site-packages/Wand-0.4.2-py2.7.egg/wand/resource.py", line 222, in raise_exception

wand.exceptions.MissingDelegateError: no decode delegate for this image format `//boboDA02.boobo.COM' @ error/constitute.c/ReadImage/501
SVK
  • 1,004
  • 11
  • 25

1 Answers1

1

You'll need to create instructions for the hdfs delegation.

I'm not familiar with hadoop, but the documentation to copy a file locally seems straightforward.

fs get --from <source> --to <local>

Create a simple XML file entitled delegates.xml with the following content...

<delegatemap>
  <delegate decode="hdfs" command="&quot;fs&quot; get --from &quot;hdfs:%M&quot; --to &quot;%o&quot;"/>
</delegatemap>

See the Resources documentation to learn about how ImageMagick loads delegation files, and which option works for your environment. You can also ask ImageMagick's identify utility where system paths are located.

identify -list configure | grep SHARE_PATH

If there's no delegates.xml file located at the SHARE_PATH location, then copy the newly created XML file to that location. Else, if the file exists, you should edit the file to include the <delegate line inside the existing <delegatemap>.

If you do not have admin access, or the system is managed via a package-manager, explore other options that work for your applications. Like $HOME/.config/ImageMagick/, or application directory. Refer to docs linked above.

You can verify that the delegation for HDFS has been mapped correctly by running the following.

identify -list delegate | grep hdfs

Then test it with the convert utility.

convert hdfs://boboda02.boobo.com:8020/bda/clamsops/raw/personal_brella_test/09_29_2015_090902.pdf output_%04d.jpg

Wand should now understand the hdfs protocol.

emcconville
  • 23,800
  • 4
  • 50
  • 66