
I have the same problem extracting Arabic text from a PDF file. Can anyone help if they have found a solution? I have tried many times with PDFBox, but with no result.

  • *"I have the same problem.."* What problem exactly? – Andrew Thompson Dec 05 '11 at 10:28
  • I have had quite good results extracting text with PDFBox, often better than with other libraries. However, many PDFs don't store the text in a sensible linear order, which can make it impossible to extract readable text from them automatically. (I don't have experience with Arabic, though.) Are you sure the text you have is actually text and not an image embedded inside the PDF? – RoToRa Dec 05 '11 at 11:09

1 Answer


There are several things that can go wrong while extracting text from a PDF:

  1. The PDF is encrypted. In this case you need the password to extract data.
  2. PDF as a format is not really designed for text extraction, so PDFBox usually tries to identify characters that are placed close to each other and combine them into words. As you can imagine, this can easily go wrong.
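
To make the second point concrete, here is a minimal sketch of that kind of grouping heuristic. It is a toy illustration and not PDFBox's actual algorithm: the `GlyphGrouper` class, the `Glyph` model, and the `maxGap` threshold are all hypothetical names invented for this example, and the glyphs are assumed to sit on a single line, already sorted from left to right.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: this is NOT PDFBox's actual algorithm.
final class GlyphGrouper {

    // A positioned glyph: its text, left edge, and width (hypothetical model).
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs on one line (assumed sorted left to right) into words,
    // starting a new word whenever the horizontal gap exceeds maxGap.
    static List<String> groupIntoWords(List<Glyph> glyphs, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = 0.0;

        for (Glyph g : glyphs) {
            boolean gapTooWide = current.length() > 0 && (g.x - previousRight) > maxGap;
            if (gapTooWide) {
                // The gap looks like a space: close the current word.
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previousRight = g.x + g.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

The result depends entirely on the threshold: if `maxGap` is too small, a single word is split into pieces, and if it is too large, separate words are merged. That sensitivity is one way this kind of extraction can go wrong.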

Check out this question for more information.

nfechner
  • Does your program need to extract text directly from the PDF? If not, you could use OCR to convert the PDF to text and read it from a .txt file. – Mr1159pm Dec 05 '11 at 10:28