I am trying to read a PDF file using file streams and write its contents to a writer in cp1252 encoding. The code is as follows:

byte[] buf = new byte[8192];
InputStream is = new FileInputStream(f);
ByteArrayOutputStream oos = new ByteArrayOutputStream();
int c = 0;
// Copy the whole file into memory as raw bytes.
while ((c = is.read(buf)) != -1) {
    oos.write(buf, 0, c);
}
byte[] out = oos.toByteArray();
// Decode the raw bytes as UTF-8 text.
String str = new String(out, "UTF-8");
char[] ch = str.toCharArray();
writer.write(ch);
is.close();
oos.close();

But the output is garbled: the text is not readable (it is not converted properly). How do I fix this?

mzy
  • What does "f" contain? Is this an actual PDF file? – David van Driessche Mar 21 '16 at 15:46
  • *UTF-8 formatted pdf file* - what is that? PDF is a binary format. Full stop. – mkl Mar 21 '16 at 20:51
  • It's a PDF file: File f = new File("C:\Users\myfile.pdf"); I checked the properties of the file, and Eclipse says its encoding is UTF-8 by default – Ria Katoch Mar 22 '16 at 02:51
  • Also, my PDF file contains tables and graphs. Do I need to use a special library to read this kind of PDF file? – Ria Katoch Mar 22 '16 at 07:11
  • *I checked out the properties of the file and in eclipse it says its encoding is by default UTF-8* - then Eclipse falsely assumes that the file is in some text/* format. But PDF definitely is a binary format. – mkl Mar 31 '16 at 08:37

1 Answer

The garbled output most likely comes from treating the PDF's raw bytes as text. PDF is a binary format, so decoding those bytes as UTF-8 does not yield the document's text. Try using PDFBox to extract the text from the PDF file; it is probably one of the best ways to do so. Once you have the required text, you can save it using cp1252 encoding.
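To illustrate why decoding raw PDF bytes as text goes wrong, here is a minimal sketch using only the JDK. The class name, the method name, and the sample bytes are made up for illustration; the sample only mimics the binary bytes that commonly follow a PDF header. It checks whether bytes survive a UTF-8 decode and re-encode unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BinaryDecodeDemo {
    // Decodes the bytes as UTF-8, encodes the result again, and reports
    // whether the original bytes came back unchanged.
    static boolean survivesUtf8RoundTrip(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        byte[] encodedAgain = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, encodedAgain);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "%PDF-" followed by bytes that are not valid UTF-8.
        byte[] binary = {'%', 'P', 'D', 'F', '-',
                (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);

        // Invalid sequences are replaced with U+FFFD during decoding,
        // so the binary sample does not round-trip.
        System.out.println(survivesUtf8RoundTrip(binary)); // false
        System.out.println(survivesUtf8RoundTrip(plain));  // true
    }
}
```

Once bytes have been replaced during decoding, no later re-encoding can recover them, which is why a parser that understands the PDF structure is needed.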

You can check out examples of text extraction using PDFBox from here

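Regarding conversion to cp1252: on a Windows machine the JVM's default encoding is often cp1252, although this depends on the system locale and the Java version (Java 18 and later default to UTF-8). So simply saving the text will hopefully produce cp1252 output, but specifying the charset explicitly when creating the writer is more reliable.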
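Rather than relying on the platform default, the charset can be named explicitly when the writer is created. The following is a minimal sketch, not a definitive implementation: the class name, the method name `writeAsCp1252`, and the sample string are hypothetical, and the `text` argument stands in for whatever text was extracted from the PDF. It assumes the JDK in use provides the `windows-1252` charset, which common JDK builds do.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class Cp1252WriterDemo {
    // "windows-1252" is the standard name for the cp1252 code page.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Writes already-extracted text to the target file, encoded as cp1252.
    static void writeAsCp1252(String text, Path target) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target, CP1252)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("extracted", ".txt");
        // Hypothetical sample text: "café €5", written with Unicode escapes.
        writeAsCp1252("caf\u00e9 \u20ac5", target);
        // In cp1252 every character here is a single byte.
        System.out.println(Files.readAllBytes(target).length); // 7
    }
}
```

Note that a writer created this way may throw an `IOException` if the text contains characters that cp1252 cannot represent, so extracted text with unusual symbols may need to be filtered or replaced first.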

Abhilash Panigrahi