4

Tabula looks like a great tool for extracting tabular data from PDFs. There are plenty of examples of how to call it from the command line or from Python, but there doesn't seem to be any documentation for using it from Java. Does anyone have a worked example?

Note: Tabula does provide source code, but it seems inconsistent between versions. For example, the example on GitHub references a TableExtractor class, which does not seem to exist in the JAR.

https://github.com/tabulapdf/tabula-java

emd

2 Answers

8

You can use the following code to call Tabula from Java. Hope this helps.

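    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import technology.tabula.ObjectExtractor;
    import technology.tabula.Page;
    import technology.tabula.RectangularTextContainer;
    import technology.tabula.Table;
    import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

    public class TabulaExample {
        public static void main(String[] args) throws IOException {
            final String FILENAME = "../test.pdf";

            // PDDocument.load(File) is the PDFBox 2.x API; try-with-resources closes the document
            try (PDDocument pd = PDDocument.load(new File(FILENAME))) {
                int totalPages = pd.getNumberOfPages();
                System.out.println("Total Pages in Document: " + totalPages);

                ObjectExtractor oe = new ObjectExtractor(pd);
                SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
                Page page = oe.extract(1); // page numbers are 1-based

                // Detect the tables on the page, then print the text of each cell
                List<Table> tables = sea.extract(page);
                for (Table table : tables) {
                    List<List<RectangularTextContainer>> rows = table.getRows();
                    for (List<RectangularTextContainer> cells : rows) {
                        for (RectangularTextContainer cell : cells) {
                            System.out.print(cell.getText() + "|");
                        }
                        System.out.println(); // end each table row on its own line
                    }
                }
            }
        }
    }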
  • Is there any documentation or reference module for Tabula? – Akshit Gupta Jul 14 '20 at 14:57
  • How can I avoid a "rawtypes" warning (without suppressing the warning) when I don't parameterize `RectangularTextContainer` e.g. RectangularTextContainer> – RTF Nov 26 '21 at 14:43
0
    // Extract text from the detected tables and write it to an XLSX file.
    // Continues from the answer above: `sea` and `page` are already set up.
    // Needs Apache POI (poi-ooxml): org.apache.poi.xssf.usermodel.XSSFWorkbook,
    // org.apache.poi.ss.usermodel.Sheet/Row/Cell, and java.io.FileOutputStream.
    try (XSSFWorkbook wb = new XSSFWorkbook()) {
        Sheet sheet = wb.createSheet("Barang Baik");
        int nextRow = 0; // next free row in the sheet; tables are appended one after another

        List<Table> tables = sea.extract(page);
        for (Table t : tables) {
            List<List<RectangularTextContainer>> rows = t.getRows();
            for (List<RectangularTextContainer> cells : rows) {
                Row row = sheet.createRow(nextRow++);
                for (int j = 0; j < cells.size(); j++) {
                    Cell cell = row.createCell(j);
                    cell.setCellValue(cells.get(j).getText());
                }
            }
        }

        // Write the workbook once, after all tables have been added
        try (FileOutputStream fos = new FileOutputStream("C:\\your\\file.xlsx")) {
            wb.write(fos);
        }
    }
Zakee Fa
  • I edited the bottom half of the code Khalifa posted so that it writes the output to XLSX. Hope it helps if you are coming from Python and looking for something similar that works in Java. – Zakee Fa Oct 22 '20 at 16:18
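
If pulling in Apache POI just for the export is more than you need, the same rows can be written as CSV with only the standard library. This is a minimal sketch, not part of either answer above. It assumes the cell text has already been collected into a `List<List<String>>` (for example by calling `getText()` on each cell), and the names `CsvSketch`, `escape` and `toCsv` are made up for illustration. Fields are quoted only when they contain a comma, a double quote or a line break, following the usual CSV convention.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CsvSketch {

    // Quote a field only when it contains a comma, a double quote, or a line break.
    // Embedded double quotes are doubled, as is conventional for CSV.
    static String escape(String field) {
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join each row's fields with commas and end every row with CRLF.
    static String toCsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            sb.append(row.stream().map(CsvSketch::escape).collect(Collectors.joining(",")));
            sb.append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; in practice these strings would come from the Tabula cells.
        List<List<String>> rows = List.of(
                List.of("Item", "Qty"),
                List.of("Widget, large", "3"));
        System.out.print(toCsv(rows));
    }
}
```

The resulting string can be written to disk with `java.nio.file.Files.writeString` (Java 11+) and opens in any spreadsheet tool, at the cost of losing XLSX features such as multiple sheets.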