How to read from PDF using Selenium webdriver and Java

Question

I am trying to read the contents of a PDF file using Java and Selenium. Below is my code. getWebDriver is a custom method in our framework that returns the WebDriver instance.

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

PDFParser parser = new PDFParser((RandomAccessRead) fileToParse);
parser.parse();

String output = new PDFTextStripper().getText(parser.getPDDocument());

The third statement (the PDFParser constructor call) gives a compile-time error if I don't cast the stream to RandomAccessRead.

And when I do cast it, I get this run-time error:

java.lang.ClassCastException: java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead

I need help with getting rid of these errors.

Is there a specific reason why you explicitly use the `PDFParser` class instead of letting PDFBox take care of the document loading details itself? — mkl, Jul 04 '18 at 10:33
Not really! I just found this snippet and tried to see if this works for me. — Suchandra, Jul 04 '18 at 10:52
[Related question](https://stackoverflow.com/q/39233547/1729265), but the accepted answer there (downgrading to 1.8.x) is generally not the appropriate one; it is better to use the API correctly, as proposed in the other answer. — mkl, Jul 04 '18 at 12:02

score 2 · Answer 1 · answered Jul 04 '18 at 11:03

First off, unless you want to interfere in the PDF loading process, there is no need to explicitly use the PDFParser class. You can instead use a static PDDocument.load method:

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

PDDocument document = PDDocument.load(fileToParse);

String output = new PDFTextStripper().getText(document);

Otherwise, if you do want to interfere in the loading process, you have to create a RandomAccessRead instance for your BufferedInputStream. You cannot simply cast it, because BufferedInputStream neither implements nor extends RandomAccessRead.

You can do that like this:

URL urlOfPdf = new URL(this.getWebDriver().getCurrentUrl());

BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());

MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMainMemoryOnly();
ScratchFile scratchFile = new ScratchFile(memUsageSetting);
PDFParser parser;
try
{
    RandomAccessRead source = scratchFile.createBuffer(fileToParse);
    parser = new PDFParser(source);
    parser.parse();
}
catch (IOException ioe)
{
    IOUtils.closeQuietly(scratchFile);
    throw ioe;
}

String output = new PDFTextStripper().getText(parser.getPDDocument());

(This is essentially copied from the source of PDDocument.load.)
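As a side note on why the cast in the question compiles but fails at run time: Java allows a cast from a non-final class to any interface, because some subclass might implement that interface, so the check is deferred to run time. The sketch below shows this using only the JDK. `SeekableSource` and `StreamBackedSource` are hypothetical stand-ins invented for illustration; they are not PDFBox types and only mimic the relationship between an input stream and `RandomAccessRead`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CastDemo {
    // Hypothetical stand-in for an interface such as RandomAccessRead (not the real one).
    interface SeekableSource {
        int read() throws IOException;
    }

    // Adapter: wraps a stream instead of pretending the stream already is a SeekableSource.
    static final class StreamBackedSource implements SeekableSource {
        private final InputStream in;

        StreamBackedSource(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }
    }

    // Returns true if casting the stream to the unrelated interface fails at run time.
    static boolean castFails(BufferedInputStream stream) {
        try {
            // Compiles: BufferedInputStream is not final, so a subclass could implement the interface.
            SeekableSource unused = (SeekableSource) stream;
            return false;
        } catch (ClassCastException e) {
            // The actual object does not implement SeekableSource.
            return true;
        }
    }

    // Reads the first byte through the adapter, which is the working alternative to casting.
    static int firstByteViaAdapter(byte[] data) {
        SeekableSource source =
                new StreamBackedSource(new BufferedInputStream(new ByteArrayInputStream(data)));
        try {
            return source.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {'%', 'P', 'D', 'F'};
        BufferedInputStream stream = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(castFails(stream));          // prints "true"
        System.out.println(firstByteViaAdapter(data));  // prints "37", the code of '%'
    }
}
```

In the answer's second snippet, `scratchFile.createBuffer(fileToParse)` plays the role of the adapter: it builds a real `RandomAccessRead` from the stream's contents rather than relabelling the stream.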

Thanks @mkl for the alternative. The previous issue is gone. But what I am now getting is - java.io.IOException: Error: End-of-File, expected line java.io.IOException: Error: End-of-File, expected line at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119) at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2570) at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2541) at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:213) — Suchandra, Jul 12 '18 at 12:52
*"The previous issue is gone."* - Great. If the solution was along the lines of my answer, please accept it. *"But what I am now getting is"* - please ask follow-up questions as separate questions here on Stack Overflow, not in a comment. That being said, though... `End-of-File` in `readLine` called by `parseHeader` sounds like the file in question was empty. — mkl, Jul 12 '18 at 13:00
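Following up on the empty-file guess in the comment above, one way to check it is to read the whole stream into memory and look at the bytes before handing them to PDFBox. The sketch below uses only the JDK. The class name `PdfBytesCheck` and its methods are hypothetical helpers, and the check assumes a well-formed PDF starts with the `%PDF-` marker as its very first bytes, which is stricter than what some parsers tolerate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class PdfBytesCheck {
    // Reads the stream to its end and returns everything that was received.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Classifies the bytes as "empty", "pdf" (starts with %PDF-), or "not-pdf".
    static String describe(byte[] data) {
        if (data.length == 0) {
            return "empty";
        }
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < marker.length) {
            return "not-pdf";
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return "not-pdf";
            }
        }
        return "pdf";
    }

    public static void main(String[] args) throws IOException {
        // In-memory samples stand in for whatever urlOfPdf.openStream() would deliver.
        byte[] nothing = readAll(new ByteArrayInputStream(new byte[0]));
        byte[] pdfLike = readAll(new ByteArrayInputStream(
                "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(describe(nothing));  // prints "empty"
        System.out.println(describe(pdfLike));  // prints "pdf"
    }
}
```

With the code from the answer, one could call `PdfBytesCheck.readAll(urlOfPdf.openStream())`, log the result of `describe(...)`, and only pass `new ByteArrayInputStream(bytes)` to `PDDocument.load` when the result is "pdf". A result of "empty" or "not-pdf" would suggest that the URL did not deliver the PDF to this separate connection at all.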
