
I am using PDFBox version 2.0.9 in my application. I have to parse large PDF files from the web. The following is the code I am using.

MimeTypeDetector class

    @Getter
    @Setter
    class MimeTypeDetector {
        private ByteArrayInputStream byteArrayInputStream;
        private BodyContentHandler bodyContentHandler;
        private Metadata metadata;
        private ParseContext parseContext;
        private Detector detector;
        private TikaInputStream tikaInputStream;

        MimeTypeDetector(ByteArrayInputStream byteArrayInputStream) {
            this.byteArrayInputStream = byteArrayInputStream;
            this.bodyContentHandler = new BodyContentHandler(-1);
            this.metadata = new Metadata();
            this.parseContext = new ParseContext();
            this.detector = new DefaultDetector();
            // CloseShieldInputStream keeps Tika from closing the underlying stream on detect()
            this.tikaInputStream = TikaInputStream.get(new CloseShieldInputStream(byteArrayInputStream));
        }
    }

    
    private void crawlAndSave(String url, DomainGroup domainGroup)  {
        MimeTypeDetector mimeTypeDetector = null;
        try {
            String decodeUrl = URLDecoder.decode(url, WebCrawlerConstants.UTF_8);
            ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(HTMLFetcher.fetch(WebCrawlerUtil.encodeUrl(url)));
            mimeTypeDetector = new MimeTypeDetector(byteArrayInputStream);
            String contentType = getContentType(mimeTypeDetector);
            if (isPDF(contentType)) {
                crawlPDFContent(decodeUrl, mimeTypeDetector, domainGroup);
            } else if (isWebPage(contentType)) {
                // fetching HTML web Page Content
            } else {
                log.warn("Skipping URL: " + url + ". Not a supported crawler format");
                linksVisited.remove(url);
            }
        } catch (IOException e) {
            log.error("crawlAndSave:: Error occurred while decoding URL:" + url + " : " + e.getMessage());
            // some catch operation
        } finally {
            if (Objects.nonNull(mimeTypeDetector)) {
                IOUtils.closeQuietly(mimeTypeDetector.getByteArrayInputStream());
            }
        }
    }

    private String getContentType(MimeTypeDetector mimeTypeDetector) throws IOException {
        TikaInputStream tikaInputStream = mimeTypeDetector.getTikaInputStream();
        String contentType = mimeTypeDetector.getDetector().detect(tikaInputStream, mimeTypeDetector.getMetadata()).toString();
        tikaInputStream.close();
        return contentType;
    }

    private void crawlPDFContent(String url, MimeTypeDetector mimeTypeDetector, DomainGroup domainGroup) {
        try {
            // Tika's org.apache.tika.parser.pdf.PDFParser (see the comments below)
            PDFParser pdfParser = new PDFParser();
            pdfParser.parse(mimeTypeDetector.getByteArrayInputStream(), mimeTypeDetector.getBodyContentHandler(),
                    mimeTypeDetector.getMetadata(), mimeTypeDetector.getParseContext());
            // Some Database operation
        } catch (IOException | TikaException | SAXException e) {
            //Some Catch operation
            log.error("crawlPDFContent:: Error in crawling PDF Content" + " : " + e.getMessage());
        }
    }

HTMLFetcher class

    public class HTMLFetcher {

    private HTMLFetcher() {
    }

    /**
     * Fetches the document at the given URL, using {@link URLConnection}.
     *
     * @param url the URL to fetch
     * @return the response body as a byte array
     * @throws IOException if the connection or read fails
     */
    public static byte[] fetch(final URL url) throws IOException {

        // Trust manager that accepts all certificates (disables TLS validation)
        TrustManager[] trustAllCerts = new TrustManager[]{new X509TrustManager() {
            public java.security.cert.X509Certificate[] getAcceptedIssuers() {
                return null;
            }

            public void checkClientTrusted(X509Certificate[] certs, String authType) {
            }

            public void checkServerTrusted(X509Certificate[] certs, String authType) {
            }

        }};

        SSLContext sc = null;
        try {
            sc = SSLContext.getInstance("SSL");
            sc.init(null, trustAllCerts, new java.security.SecureRandom());
            HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
        } catch (NoSuchAlgorithmException | KeyManagementException e) {
            e.printStackTrace();
        }

        // Create all-trusting host name verifier
        HostnameVerifier allHostsValid = (hostname, session) -> true;

        HttpsURLConnection.setDefaultHostnameVerifier(allHostsValid);

        setAuthentication(url);
        //Taken from Boilerpipe
        final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        InputStream in = conn.getInputStream();
        byte[] byteArray = IOUtils.toByteArray(in);
        in.close();
        conn.disconnect();
        return byteArray;
    }

    private static void setAuthentication(URL url) {
        AuthenticationDTO authenticationDTO = WebCrawlerUtil.getAuthenticationFromUrl(url);
        if (Objects.nonNull(authenticationDTO)) {
            Authenticator.setDefault(new Authenticator() {
                protected PasswordAuthentication getPasswordAuthentication() {
                    return new PasswordAuthentication(authenticationDTO.getUserName(),
                            authenticationDTO.getPassword().toCharArray());
                }
            });
        }
    }
    }

But when I check the memory stats, the memory usage increases constantly. I verified this using VisualVM and the YourKit Java profiler.

[Attached image: profiler graph showing memory usage steadily increasing]

Is there anything I am doing wrong? I searched for similar issues like this and this, but it was mentioned that this issue has been fixed in the latest versions.

    Please try with 2.0.11, that is the latest version. And load PDF files with PDDocument.load(). And include more code so that one can see what you are doing, and that you are closing the documents. The parse() call does not look like it is from PDFBox. Is it from Tika? In any case, you'd need to tell more what you're doing. – Tilman Hausherr Jul 31 '18 at 14:49
  • I have added some more code. Please have a look. – Richa Jul 31 '18 at 15:01
  • Tika is at 1.18. Can you reduce your code to get rid of all non-PDF stuff, e.g. only have one single PDF file processed again and again? Btw I'm not a Tika expert. But simplification would help to come closer to the cause. Btw some Type1 fonts (the "standard 14") will stay. But it should not "increase constantly". – Tilman Hausherr Jul 31 '18 at 15:12
  • I am getting PDFs from the web. So I fetch them by creating a URLConnection (which makes use of `HTMLFetcher`) and then find the MIME type using Apache Tika. If it is a PDF, I send that to the `crawlPDFContent` method, which parses it. So that is the main code that is doing PDF stuff. – Richa Jul 31 '18 at 15:21
  • But testing without URL downloading would clarify whether it happens only when loading URLs, or also when loading local files. If the latter, then your code to reproduce the effect would be smaller. Btw the biggest part of your used memory is a HashMap. Find out where this is used. And update to the latest version. – Tilman Hausherr Jul 31 '18 at 15:31
  • Thanks @TilmanHausherr - I will test it without URL downloading. – Richa Aug 01 '18 at 01:50
  • @Richa Did you discover anything with your test? Is this memory leak still present? We are thinking of using PDFBox in a commercial product. – simgineer Jan 25 '19 at 23:00
  • @simgineer: The issue was not there in the latest versions of the PDFBox library. There were some issues with our code itself. – Richa Jan 29 '19 at 05:50
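
As suggested in the comments, a standalone test that takes URL downloading out of the picture narrows down the cause. A minimal sketch along those lines (the file path and iteration count are placeholders, not from the original post):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfLeakTest {
        public static void main(String[] args) throws Exception {
            PDFParser pdfParser = new PDFParser();
            // Parse the same local file repeatedly and watch heap usage in the profiler
            for (int i = 0; i < 1000; i++) {
                try (InputStream in = new FileInputStream("/tmp/sample.pdf")) {
                    // Fresh handler, metadata and context per iteration, as the crawler does
                    pdfParser.parse(in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
                }
            }
        }
    }

If memory still climbs here, the leak is in the parsing path; if not, it is somewhere in the fetching and crawling code.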

1 Answer


Please use `MemoryUsageSetting.setupTempFileOnly()` while loading the document.
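
A minimal sketch of loading with that setting via PDFBox's `PDDocument` (the class name and the file-path argument are hypothetical):

    import java.io.File;

    import org.apache.pdfbox.io.MemoryUsageSetting;
    import org.apache.pdfbox.pdmodel.PDDocument;

    public class TempFileLoadExample {
        public static void main(String[] args) throws Exception {
            // setupTempFileOnly() buffers the document in temp files instead of the heap
            try (PDDocument document = PDDocument.load(new File(args[0]),
                    MemoryUsageSetting.setupTempFileOnly())) {
                System.out.println("Pages: " + document.getNumberOfPages());
            }
        }
    }

This keeps PDFBox's internal buffers out of main memory, which helps with large documents.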

– Shahid Hussain Abbasi