How to read raw text from a PDF file using Java

Question

I am using the PDFBox parser to read data from a PDF file in Java. It reads all of the content from the PDF file.

Below is my sample code, which reads the data from a PDF file and stores it in a text file. Sample code:

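import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.x package

public class PDFTextParser {
    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;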
    // PDFTextParser Constructor 
    public PDFTextParser() {
    }

    // Extract text from PDF Document
    String pdftoText(String fileName) {

        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File(fileName);

        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }

        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }

        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release the document on success and on failure alike
            try {
                if (pdDoc != null) pdDoc.close();
                else if (cosDoc != null) cosDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }

    // Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {

        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        // try-with-resources closes the writer even if printing fails
        try (PrintWriter pw = new PrintWriter(fileName)) {
            pw.print(pdfText);
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
            return;
        }
        System.out.println("Done.");
    }

    //Extracts text from a PDF Document and writes it to a text file
    public static void test() {


        String args[]={"C://Sample_Voice.pdf","C://CNP/Sample.txt"};
        if (args.length != 2) {
            System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
            System.exit(1);
        }

        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(args[0]);
        if (pdfToText == null) {
            System.out.println("PDF to Text Conversion failed.");
        }
        else {
            // Strip the registered-trademark sign only after the null check
            pdfToText = pdfToText.replaceAll("®", "");
            System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
            pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
        }
    }  

    public static void main(String args[]) throws IOException
    {
        test();
    }
}

My requirement is to get only the raw text, without:

1) headers
2) footers
3) hyperlinks

How can I do this? Please advise.

Thanks

Another link: http://thottingal.in/blog/2009/06/24/pdfbox-extract-text-from-pdf/ — Stephen C, Aug 16 '13 at 10:22
Using the example from the link above, I get the entire content in the text file. But I want to remove the header and footer from the PDF file. — user2664353, Aug 16 '13 at 10:32
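
One possible direction for the header, footer, and hyperlink requirement, offered as a rough sketch rather than a PDFBox feature: as far as I know, PDFTextStripper returns all page text without marking headers or footers, so they have to be filtered afterwards. The heuristic below assumes the text has already been extracted one page at a time (for example by calling the stripper once per page with setStartPage and setEndPage) and treats a first or last line as a header or footer when it repeats, ignoring digits such as page numbers, on more than half of the pages. The class PageTextCleaner, its method names, the repetition threshold, and the URL pattern are all hypothetical choices for illustration. Note that it only removes visible URL text; link annotations in the PDF are not part of the extracted text anyway.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: post-processes per-page text that was extracted elsewhere.
final class PageTextCleaner {

    // Assumed pattern for visible URLs: anything starting with http://, https:// or www.
    private static final Pattern URL = Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");

    // Replaces digit runs so "Page 3 of 10" and "Page 4 of 10" compare as equal.
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    // Returns the trimmed, non-blank lines of one page.
    static List<String> nonBlankLines(String pageText) {
        List<String> lines = new ArrayList<>();
        for (String line : pageText.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                lines.add(line.trim());
            }
        }
        return lines;
    }

    // Drops repeated first/last lines and URL text; returns the remaining lines.
    static String clean(List<String> pages) {
        Map<String, Integer> firstCounts = new HashMap<>();
        Map<String, Integer> lastCounts = new HashMap<>();
        List<List<String>> perPage = new ArrayList<>();

        // Pass 1: count how often each normalized first and last line occurs.
        for (String page : pages) {
            List<String> lines = nonBlankLines(page);
            perPage.add(lines);
            if (lines.isEmpty()) {
                continue;
            }
            firstCounts.merge(normalize(lines.get(0)), 1, Integer::sum);
            lastCounts.merge(normalize(lines.get(lines.size() - 1)), 1, Integer::sum);
        }

        // A line must repeat on at least two pages and on more than half of them.
        int threshold = Math.max(1, pages.size() / 2);

        // Pass 2: skip repeated first/last lines, then strip URLs from the rest.
        StringBuilder out = new StringBuilder();
        for (List<String> lines : perPage) {
            int from = 0;
            int to = lines.size();
            if (to > 0 && firstCounts.get(normalize(lines.get(0))) > threshold) {
                from = 1;
            }
            if (to > from && lastCounts.get(normalize(lines.get(to - 1))) > threshold) {
                to--;
            }
            for (String line : lines.subList(from, to)) {
                String noUrls = URL.matcher(line).replaceAll("")
                        .replaceAll(" {2,}", " ").trim();
                if (!noUrls.isEmpty()) {
                    out.append(noUrls).append('\n');
                }
            }
        }
        return out.toString();
    }
}
```

This only works when the header and footer are the first and last text lines on a page and repeat across pages; a single-page document is left untouched. If the header and footer always sit in a fixed band of the page, extracting only the body rectangle with PDFTextStripperByArea may be the more reliable route, but the coordinates would have to be worked out for the specific document.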
