
I'm using Jsoup to parse some HTML and extract the URL of a PDF.

The PDF is shown in an <embed> tag like:

<html>
<body marginwidth="0" marginheight="0" style="background-color: rgb(38,38,38)">
<embed width="100%" height="100%" name="plugin" src="http://www.domain.com/apdf_id.pdf?tp=&amp;arnumber=1253069&amp;isnumber=28038" type="application/pdf">
</body>
</html>

How can I get the PDF URL from that page so that I can download the file to my local machine?

BalusC
Jayyrus

1 Answer


Just select the <embed type="application/pdf"> element and get its src attribute as an absolute URL.

String pdfURL = document.select("embed[type=application/pdf]").first().absUrl("src");

You could also select specifically the <embed name="plugin"> instead.

String pdfURL = document.select("embed[name=plugin]").first().absUrl("src");

Then you can use java.net.URL to obtain the content as an InputStream.

InputStream input = new URL(pdfURL).openStream();

Finally just write it to an arbitrary OutputStream such as FileOutputStream the usual way.
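As a minimal sketch of that last step, assuming Java 7 or newer, Files.copy() can do the stream copying for you. The class name PdfDownloader and the methods save() and download() are hypothetical names chosen for illustration; only the java.net and java.nio.file calls are standard API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, not part of Jsoup or the JDK.
class PdfDownloader {

    // Copies the whole stream into the target file, replacing an existing file.
    // Closes the stream afterwards and returns the number of bytes written.
    static long save(InputStream input, Path target) throws IOException {
        try (InputStream in = input) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Opens the given URL (for example the value returned by absUrl("src"))
    // and saves its content to the target file.
    static long download(String pdfURL, Path target) throws IOException {
        return save(new URL(pdfURL).openStream(), target);
    }

    // Usage: java PdfDownloader <pdf-url> <target-file>
    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            long bytes = download(args[0], Paths.get(args[1]));
            System.out.println("Saved " + bytes + " bytes to " + args[1]);
        }
    }
}
```

With the pdfURL from above, you would call something like PdfDownloader.download(pdfURL, Paths.get("local.pdf")). The try-with-resources block makes sure the stream is closed even when the copy fails halfway.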

BalusC