Extracting text from a PDF using PDF Clown's `TextExtractor`

Question

I am getting an error while using the `TextExtractor` class of the PDF Clown library. The code I used is:

TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
  System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");

  //  Extract the page text!
  Map textStrings = textExtractor.extract(page);

Part of the error I got is:

Exception in thread "main" java.lang.ExceptionInInitializerError
at org.pdfclown.document.contents.fonts.encoding.put
at ......
at ......
<about 30 such lines>
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader
<about 30 lines more>

I also found that this happens when my PDF contains bullets, for example:

item 1
item 2
item 3

Please help me extract the text from such PDFs.

@mkl I am facing the same problem in many other pdfs. One such pdf is [this one](https://docs.google.com/file/d/0B9xa_HtrD7kcUjM4cjAyX2JGVkk/edit?usp=sharing) — utkarsh, May 18 '13 at 19:53
I just tested your PDF with your source fragment (obviously with a closing `}` added), and the PDF was extracted all right, at least no exception was thrown and all the text (except the title) extracted all right. I used the current trunk version of PDF Clown in a java 6 environment. Thus, you may want to check the version you use and, if that didn't help, provide more complete source code and stack traces. — mkl, May 18 '13 at 22:59
@mkl This time I tried with a Java 6 environment, but I got the same error. I am sharing my code together with the library I am using. Please check it out and help me fix this problem. Get the code [here](https://drive.google.com/folderview?id=0B9xa_HtrD7kcNGhWQXV3dGpPbDA&usp=sharing). Thank you. — utkarsh, May 19 '13 at 08:36
I just used your `highlighter.java` together with the current PDF Clown trunk version as a jar, and the PDF was processed without incident, in particular without a `NullPointerException` (some of the highlights were not at the right position, though). Looking at your shared Google Drive contents, I assume you do not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder. The PDF Clown jar files contain additional resources, though, which your setup does not include. Thus, please use your `highlighter.java` with `pdfclown.jar` in the classpath. — mkl, May 19 '13 at 21:20

score 0 · Accepted Answer · answered May 20 '13 at 09:19

(The following comment turned out to be the solution:)

Using your `highlighter.java` class (provided on your Google Drive in a comment) together with the current PDF Clown trunk version as a jar, the PDF was processed without incident, in particular without a `NullPointerException` (some of the highlights were not at the right position, though).

After looking at your shared Google Drive contents, I assumed you did not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder and used them.

The PDF Clown jar files contain additional resources, though, which your setup consequently did not include. Thus:

Your `highlighter.java` has to be used with `pdfclown.jar` in the classpath.
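The stack trace in the question is consistent with this explanation: in standard Java, a classpath lookup for a resource that is not there returns `null`, and wrapping that `null` stream in an `InputStreamReader` fails with a `NullPointerException` inside the `Reader` constructor. The following is only a minimal diagnostic sketch of that mechanism, not code from PDF Clown. The class name `ClasspathCheck`, its helper methods, and the resource name used in `main` are hypothetical and exist purely for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.security.CodeSource;

// Hypothetical diagnostic helper; not part of PDF Clown.
class ClasspathCheck {

    /** Returns true if the named resource is visible on the classpath. */
    static boolean resourceExists(String name) {
        return ClasspathCheck.class.getClassLoader().getResource(name) != null;
    }

    /**
     * Shows the failure mode: a missing resource yields a null stream,
     * and the Reader constructor rejects null with a NullPointerException.
     */
    static boolean missingResourceCausesNpe(String name) {
        InputStream in = ClasspathCheck.class.getClassLoader().getResourceAsStream(name);
        try {
            new InputStreamReader(in).close();
            return false; // resource was found, no exception
        } catch (NullPointerException e) {
            return true;  // same kind of failure as in the question's trace
        } catch (IOException e) {
            return false;
        }
    }

    /** Reports whether a class was loaded from a jar or from a plain folder. */
    static String loadedFrom(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(no code source)" : source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Placeholder name; pass the real resource path as an argument.
        String name = args.length > 0 ? args[0] : "some/missing/resource.txt";
        System.out.println("resource found: " + resourceExists(name));
        System.out.println("null stream causes NPE: " + missingResourceCausesNpe(name));
        System.out.println("class loaded from: " + loadedFrom(ClasspathCheck.class));
    }
}
```

Calling `loadedFrom` with one of the PDF Clown classes in your own project should show whether it comes from `pdfclown.jar` or from a folder of compiled classes. If it is a folder, the resources bundled in the jar are probably missing from your classpath.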
