I am trying to read the contents of a PDF file using Java-Selenium. Below is my code. getWebDriver is a custom method in the framework. It returns the webdriver.

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
parser.parse();

String output = new PDFTextStripper().getText(parser.getPDDocument());

The line that creates the PDFParser gives a compile-time error if I don't cast the stream to the RandomAccessRead type.

And when I do cast it, I get this runtime error:

java.lang.ClassCastException: java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead

I need help with getting rid of these errors.

mkl
Suchandra
  • Is there a specific reason why you explicitly use the `PDFParser` class instead of letting PDFBox take care of the document loading details itself? – mkl Jul 04 '18 at 10:33
  • Not really! I just found this snippet and tried to see if this works for me. – Suchandra Jul 04 '18 at 10:52
  • [Related question](https://stackoverflow.com/q/39233547/1729265), but the accepted answer there (downgrading to 1.8.x) in general is not the appropriate one, better to use the API correctly as proposed in the other answer. – mkl Jul 04 '18 at 12:02

1 Answer

First off, unless you want to interfere in the PDF loading process, there is no need to use the PDFParser class explicitly. You can instead use a static PDDocument.load method:

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

PDDocument document = PDDocument.load(fileToParse);

String output = new PDFTextStripper().getText(document);

Otherwise, if you do want to interfere in the loading process, you have to create a RandomAccessRead instance for your BufferedInputStream. You cannot simply cast it, because BufferedInputStream does not implement RandomAccessRead; the two types are unrelated.

You can do that like this:

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
ScratchFile scratchFile = new ScratchFile(memUsageSetting);
PDFParser parser;
try
{
    RandomAccessRead source = scratchFile.createBuffer(fileToParse);
    parser = new PDFParser(source);
    parser.parse();
}
catch (IOException ioe)
{
    IOUtils.closeQuietly(scratchFile);
    throw ioe;
}

String output = new PDFTextStripper().getText(parser.getPDDocument());

(This essentially is copied and pasted from the source of PDDocument.load.)
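
As for why the cast in the question compiles but then fails at runtime: Java allows a cast from a non-final class to an interface, because some subclass could implement that interface, and only checks the actual object when the code runs. The following is a minimal sketch of that behaviour using only the JDK. `RandomAccessLike` and `CastDemo` are hypothetical names invented for this illustration; `RandomAccessLike` merely stands in for `org.apache.pdfbox.io.RandomAccessRead` so the example runs without PDFBox on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;

// Hypothetical stand-in for org.apache.pdfbox.io.RandomAccessRead (assumption,
// not the real PDFBox interface), so the sketch needs no external library.
interface RandomAccessLike {
    int read();
}

class CastDemo {
    // True only if the object's class really implements the interface.
    static boolean canCast(BufferedInputStream in) {
        return in instanceof RandomAccessLike;
    }

    // Performs the same kind of cast as in the question and reports
    // whether it threw a ClassCastException.
    static boolean castFails(BufferedInputStream in) {
        try {
            // Compiles: BufferedInputStream is not final and the target is an interface.
            RandomAccessLike source = (RandomAccessLike) in;
            return false;
        } catch (ClassCastException e) {
            // Thrown at runtime: BufferedInputStream does not implement the interface.
            return true;
        }
    }

    public static void main(String[] args) {
        BufferedInputStream in =
                new BufferedInputStream(new ByteArrayInputStream(new byte[0]));
        System.out.println("instanceof RandomAccessLike: " + canCast(in));
        System.out.println("cast throws ClassCastException: " + castFails(in));
    }
}
```

An `instanceof` check like the one in `canCast` is a cheap way to find out beforehand whether such a cast can succeed.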

mkl
  • Thanks @mkl for the alternative. The previous issue is gone. But what I am not getting is - java.io.IOException: Error: End-of-File, expected line java.io.IOException: Error: End-of-File, expected line at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119) at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2570) at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2541) at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:213) – Suchandra Jul 12 '18 at 12:52
  • *"The previous issue is gone."* - Great. If the solution was along the lines of my answer, please accept it. *"But what I am not getting is"* - please ask follow-up questions as separate questions here on Stack Overflow, not in a comment. That being said, though... `End-of-File` in `readLine` called by `parseHeader` sounds like the file in question is empty. – mkl Jul 12 '18 at 13:00
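
One way to test the "empty file" suspicion from the last comment is to read the downloaded bytes into memory and inspect them before handing them to PDFBox. The sketch below uses only the JDK; `PdfHeaderCheck`, `readFully` and `looksLikePdf` are hypothetical helper names made up for this illustration, and the check is deliberately strict (it expects `%PDF-` at the very first byte, although some readers tolerate leading bytes before the header).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of PDFBox or Selenium.
class PdfHeaderCheck {
    // Drains the stream into a byte array so it can be inspected and reused.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Strict check: the data must start with the "%PDF-" header marker.
    static boolean looksLikePdf(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < marker.length) {
            return false; // covers the empty-download case
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] empty = readFully(new ByteArrayInputStream(new byte[0]));
        byte[] pdfLike = readFully(new ByteArrayInputStream(
                "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII)));
        System.out.println("empty looks like PDF: " + looksLikePdf(empty));
        System.out.println("header looks like PDF: " + looksLikePdf(pdfLike));
    }
}
```

In the code from the question, `readFully(urlOfPdf.openStream())` would yield the bytes to check. If `looksLikePdf` returns false, printing the length and the first few bytes usually shows whether the server sent nothing or sent something other than a PDF. Otherwise the bytes can be wrapped in a `ByteArrayInputStream` and passed to `PDDocument.load` as in the answer.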