Getting number of Excel sheets using apache tika

Question

While extracting Excel content using Apache Tika, I can only extract content from the first sheet. How can I find the total number of sheets? The code I used is shown below.

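import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("D:\\ExtractExcel\\test.xlsx"));
ParseContext pcontext = new ParseContext();
OOXMLParser msofficeparser = new OOXMLParser();
msofficeparser.parse(inputstream, handler, metadata, pcontext);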

score 0 · Answer 1 · answered May 29 '15 at 13:54

If you only want to process certain sheets, you'll need to call Apache Tika in a way that outputs XHTML rather than plain text. One way to do that is shown in the Apache Tika "Parsing to XHTML" example. If you do that, you'll see that the XHTML you get from an Excel file looks something like this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="meta:last-author" content="RIBEN9"/>
<meta name="dcterms:modified" content="2007-10-01T16:31:43Z"/>
<title>Simple Excel document</title>
</head>
<body>
   <div class="page"><h1>Feuil1</h1>
   <table><tbody><tr>   <td>Sample Excel Worksheet - Numbers and their Squares</td></tr>
   <tr> <td/></tr>
   </tbody></table>
   </div>
   <div class="page"><h1>Feuil2</h1>
   <table><tbody><tr>   <td/></tr>
   </tbody></table>
   </div>
</body></html>

As you can see, each sheet is in its own <div class="page"> section, so you can split on that. Counting those sections gives you the number of sheets.
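To get the sheet count directly, one option is a small SAX handler that counts the <div class="page"> elements as they stream past. The sketch below is only an illustration: the names SheetCounter, SheetCountingHandler and scan are made up for this example, and it assumes the sheet title is the <h1> inside each page div, as in the sample output above. To keep it runnable without Tika on the classpath, it feeds the handler from the JDK's own SAX parser. Since Tika parsers write their output to an org.xml.sax.ContentHandler, the same handler should be usable in place of the BodyContentHandler in the question's parse(...) call.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SheetCounter {

    /** Counts <div class="page"> elements and records each <h1> title (hypothetical helper). */
    static class SheetCountingHandler extends DefaultHandler {
        final List<String> sheetNames = new ArrayList<>();
        private int sheetCount = 0;
        private StringBuilder title = null; // non-null only while inside an <h1>

        // Namespace-aware sources fill in localName; others only provide qName.
        private static String name(String localName, String qName) {
            return (localName == null || localName.isEmpty()) ? qName : localName;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            String n = name(localName, qName);
            if ("div".equals(n) && "page".equals(atts.getValue("class"))) {
                sheetCount++;
            } else if ("h1".equals(n)) {
                title = new StringBuilder(); // assumption: <h1> holds the sheet name
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (title != null) {
                title.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("h1".equals(name(localName, qName)) && title != null) {
                sheetNames.add(title.toString().trim());
                title = null;
            }
        }

        int getSheetCount() {
            return sheetCount;
        }
    }

    /** Runs the handler over an XHTML string using the JDK SAX parser. */
    static SheetCountingHandler scan(String xhtml) throws Exception {
        SheetCountingHandler handler = new SheetCountingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the XHTML sample shown in the answer.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><h1>Feuil1</h1><table><tbody><tr><td>x</td></tr></tbody></table></div>"
                + "<div class=\"page\"><h1>Feuil2</h1><table><tbody><tr><td/></tr></tbody></table></div>"
                + "</body></html>";
        SheetCountingHandler handler = scan(xhtml);
        System.out.println(handler.getSheetCount() + " sheets: " + handler.sheetNames);
    }
}
```

With Tika, the equivalent call would be along the lines of msofficeparser.parse(inputstream, new SheetCounter.SheetCountingHandler(), metadata, pcontext), after which getSheetCount() holds the number of page divs seen.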

If you look at the Tika example for "Fetching just certain bits of the XHTML", you'll see how to go about pulling out just one sheet.
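As a rough alternative to the linked example (this is not the approach that example takes), a plain SAX handler can keep the text of a single sheet by tracking which <div class="page"> it is currently inside. The class and method names below (SingleSheetText, SheetTextHandler, extract) are hypothetical, and the sketch assumes page divs are not nested, which matches the sample output above. It is driven by the JDK SAX parser here so it runs without Tika, but the handler could equally be passed to a Tika parse(...) call.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SingleSheetText {

    /** Collects character data only from the page div at a zero-based index (hypothetical helper). */
    static class SheetTextHandler extends DefaultHandler {
        private final int wanted;
        private final StringBuilder text = new StringBuilder();
        private int pageIndex = -1; // index of the most recently opened page div
        private int depth = 0;      // nesting depth inside the wanted page; 0 means outside it

        SheetTextHandler(int wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                depth++; // any element nested inside the wanted page
                return;
            }
            String n = (localName == null || localName.isEmpty()) ? qName : localName;
            if ("div".equals(n) && "page".equals(atts.getValue("class"))) {
                pageIndex++;
                if (pageIndex == wanted) {
                    depth = 1; // entering the wanted page
                }
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--; // reaches 0 when the wanted page div closes
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString().trim();
        }
    }

    /** Returns the text of the sheet at the given index, or "" if there is no such sheet. */
    static String extract(String xhtml, int sheetIndex) throws Exception {
        SheetTextHandler handler = new SheetTextHandler(sheetIndex);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<body>"
                + "<div class=\"page\"><h1>Feuil1</h1><p>first</p></div>"
                + "<div class=\"page\"><h1>Feuil2</h1><p>second</p></div>"
                + "</body>";
        System.out.println(extract(xhtml, 1));
    }
}
```

Cell text is concatenated without separators in this sketch; a real version would probably insert a tab or newline at each <td> or <tr> boundary.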
