PDF comparison using Java

Question

I have to compare two PDFs. I cannot use a compare utility, as we want to automate the testing, and the documents are stored on different servers and at different locations.

I have used PDFBox to compare the PDFs. One document contains a QR code and the other contains a barcode. In addition, both documents contain one more PDF enclosed in them.

I am using the code below:

public List<Object> compareDoc(Message<PDFStore> message) throws Exception{
    List<String> insertedRows = compareResult.getInsertedRows(); 
    List<String> deletedRows = compareResult.getDeletedRows();
    List<String> oldValueOfChangedRows = compareResult.getOldValueOfChangedRows();
    List<String> newValueOfChangedRows = compareResult.getNewValueOfChangedRows();
    try{
        compareResult.init();
        PDFStore store = message.getPayload();
        byte[] scritturaPDF = store.getScritturaPDF();
        FileUtils.writeByteArrayToFile(new File("C:\\Old Sys BackUp\\Data\\text1.pdf"), scritturaPDF);

        byte[] reconfPDF = store.getReconfPDF();
        FileUtils.writeByteArrayToFile(new File("C:\\Old Sys BackUp\\Data\\text2.pdf"), reconfPDF);

        strip("C:\\Old Sys BackUp\\Data\\text1.pdf", "C:\\Old Sys BackUp\\Data\\text3.pdf");
        strip("C:\\Old Sys BackUp\\Data\\text2.pdf", "C:\\Old Sys BackUp\\Data\\text4.pdf");

        String[] fileArray = {"C:\\Old Sys BackUp\\Data\\text3.pdf" ,"C:\\Old Sys BackUp\\Data\\text3.txt"}; 
        String[] fileArray1 = {"C:\\Old Sys BackUp\\Data\\text4.pdf","C:\\Old Sys BackUp\\Data\\text4.txt"};

        ExtractText.main(fileArray);
        ExtractText.main(fileArray1);

        List<String> scritturaPDFlist = fileToLines("C:\\Old Sys BackUp\\Data\\text3.txt");
        List<String> reconfPDFList = fileToLines("C:\\Old Sys BackUp\\Data\\text4.txt");
        Patch patch = DiffUtils.diff(scritturaPDFlist, reconfPDFList);
        DiffRowGenerator.Builder builder = new DiffRowGenerator.Builder();
        builder.showInlineDiffs(true);
        DiffRowGenerator dfg = builder.build();
        List<DiffRow> diffList = dfg.generateDiffRows(scritturaPDFlist, reconfPDFList);
        if(patch.getDeltas().size() > 0){
            List<Delta> deltas = patch.getDeltas();
            for (Delta delta: deltas) {
                  logger.trace("difference is " +  delta);
            }
            logger.trace("difference count is " + patch.getDeltas().size());
        }
        if(diffList !=null){
            for(DiffRow diffRow : diffList){
                 if (diffRow.getTag().equals(DiffRow.Tag.INSERT)){
                    insertedRows.add(changeString( diffRow.getNewLine()));             
                 }
                 else if (diffRow.getTag().equals(DiffRow.Tag.DELETE)){
                     deletedRows.add(changeString( diffRow.getOldLine()));

                 }
                 else if (diffRow.getTag().equals(DiffRow.Tag.CHANGE)){
                     oldValueOfChangedRows.add(changeString( diffRow.getOldLine()));
                     newValueOfChangedRows.add(changeString( diffRow.getNewLine()));
                 }
            }
        }

    }catch(Exception e){
        e.printStackTrace();
    }

    List<Object> o = new ArrayList<Object> ();
    o.add(insertedRows);
    o.add(deletedRows);
    o.add(oldValueOfChangedRows);
    o.add(newValueOfChangedRows);
    return o;

}

public static void strip(String pdfFile, String pdfFileOut) throws Exception {
    PDDocument document = PDDocument.load(pdfFile);
    try {
        System.out.println("page count -> " + document.getNumberOfPages());
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        // Clear the images map of each page's resources (intended to leave images out of the copy)
        for (Object pageObj : catalog.getAllPages()) {
            PDPage page = (PDPage) pageObj;
            PDResources resources = page.findResources();
            if (resources != null) {
                resources.getImages().clear();
            }
        }
        document.save(pdfFileOut);
    } finally {
        // Close the document even if saving fails
        document.close();
    }
}

private static String changeString(String s){
    // Remove every "<...>" tag; stop at an unmatched "<" instead of recursing forever
    while (s != null && s.contains("<")){
         int i = s.indexOf("<");
         int j = s.indexOf(">", i);
         if (j < 0){
             break;
         }
         s = s.substring(0, i) + s.substring(j + 1);
    }
    return s;
}
private static List<String> fileToLines(String filename) {
    List<String> lines = new LinkedList<String>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(new FileReader(filename))) {
        String line;
        while ((line = in.readLine()) != null) {
            // Skip lines containing "Name:" or "Title:"
            if (!line.contains("Name:") && !line.contains("Title:")) {
                lines.add(line);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return lines;
}

I am not able to get 100% correct mismatches. Can anyone please help?

Maybe it is already sufficient for your needs to check the MD5 sum of the content? — MKorsch, Jun 23 '14 at 11:58
By design, no two PDFs are created equally, not even when running the same code. Hence the suggestion to use an md5 sum of the content is wrong. See http://stackoverflow.com/questions/20039691/reason-why-pdf-files-have-differences/ — Bruno Lowagie, Jun 23 '14 at 12:12
*I have to compare two PDFs* - Generic PDF comparison is not easy. Please define beforehand which differences shall be found and which not. E.g. shall differing metadata (for example the creation date or the author) be reported or ignored? Shall bar codes representing the same data but generated differently (e.g. one using a bar code font, one using a bar code image) be reported or ignored? Shall writing be compared only visually (by comparing them rendered to bitmaps)? Or shall the text it extracts to also be compared? Etc. pp. — mkl, Jun 23 '14 at 12:16
I have to ignore the signature images and barcodes used in the documents, as they differ between the two; hence I want to exclude that content from the comparison. — priyas, Jun 23 '14 at 12:28
Also, if I open the text file generated by the above code, it contains some encrypted-looking text (I am not sure what it actually is). I want to exclude that from the comparison too. The Document.isEncrypted() function of PDFBox is returning null. — priyas, Jun 23 '14 at 12:30
Ok, you want to ignore signature images, bar codes, and text you cannot extract. What do you want to *not ignore*? As you don't *get 100% correct mismatches*, you might want to explain which differences were falsely ignored and which non-differences were falsely reported. — mkl, Jun 23 '14 at 14:29
There is a migration process happening. These documents are generated by Thunderhead (an external tool). To ensure integrity, we have to compare the PDFs. The text will be more or less the same; there would be some data (maybe a date) that is dynamic, which I want to compare. — priyas, Jun 23 '14 at 16:52
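
One way to act on the comments above is to filter the extracted lines before handing them to the diff, so that known-irrelevant lines and unreadable text never reach the comparison. The following is only a rough sketch under stated assumptions: the class name `ExtractedTextFilter`, the ignore list, the punctuation whitelist and the 60% readability threshold are all hypothetical and not part of the original code. It generalizes the "Name:" / "Title:" check in `fileToLines` and adds a crude heuristic for the encrypted-looking text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ExtractedTextFilter {

    // Hypothetical ignore list: lines matching any of these patterns are dropped.
    private static final List<Pattern> IGNORED = Arrays.asList(
            Pattern.compile("Name:"),
            Pattern.compile("Title:"));

    // Assumed threshold: a line is treated as unreadable when fewer than 60% of
    // its non-whitespace characters are letters, digits or common punctuation.
    private static final double MIN_READABLE_RATIO = 0.6;

    // Returns true for blank lines and for lines that look like ordinary text.
    static boolean isReadable(String line) {
        int total = 0;
        int readable = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            total++;
            if (Character.isLetterOrDigit(c) || ".,:;/-()%'\"".indexOf(c) >= 0) {
                readable++;
            }
        }
        return total == 0 || (double) readable / total >= MIN_READABLE_RATIO;
    }

    // Keeps only lines that match no ignore pattern and pass the readability check.
    static List<String> filter(List<String> lines) {
        List<String> kept = new ArrayList<String>();
        for (String line : lines) {
            boolean ignored = false;
            for (Pattern p : IGNORED) {
                if (p.matcher(line).find()) {
                    ignored = true;
                    break;
                }
            }
            if (!ignored && isReadable(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "Policy number: 12345",
                "Name: John Doe",
                "\u0001\u0002\u0003\u0004#@",
                "Issued on 23/06/2014");
        // prints [Policy number: 12345, Issued on 23/06/2014]
        System.out.println(filter(raw));
    }
}
```

Both line lists would be passed through `filter` before the diff is computed. Note that the readability check is only a heuristic: garbled text that happens to decode to letters of any script would still pass, so the whitelist and threshold would need tuning against the real documents.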
