Please refer to the background thread for a better understanding of my dilemma ;)

As mentioned in the above thread, I decided to use Tika as a generic interface for parsing documents and extracting their content. To do this, I convert each document to XML/HTML using the appropriate ContentHandler.

Below is a sample of the output:

    File type is application/vnd.openxmlformats-officedocument.wordprocessingml.document
    Handler <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="cp:revision" content="2" />
    <meta name="meta:last-author" content="ogilvie.f" />
    <meta name="Last-Author" content="ogilvie.f" />
    <meta name="meta:save-date" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Name" content="Microsoft Office Word" />
    <meta name="Author" content="ogilvie.f" />
    <meta name="dcterms:created" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Version" content="12.0000" />
    <meta name="Character-Count-With-Spaces" content="21667" />
    <meta name="date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:Template" content="Normal" />
    <meta name="meta:line-count" content="153" />
    <meta name="creator" content="ogilvie.f" />
    <meta name="publisher" content="Procter &amp; Gamble" />
    <meta name="Word-Count" content="3240" />
    <meta name="meta:paragraph-count" content="43" />
    <meta name="Creation-Date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:AppVersion" content="12.0000" />
    <meta name="meta:author" content="ogilvie.f" />
    <meta name="Line-Count" content="153" />
    <meta name="extended-properties:Application" content="Microsoft Office Word" />
    <meta name="Paragraph-Count" content="43" />
    <meta name="Last-Save-Date" content="2012-04-24T15:24:00Z" />
    <meta name="Last-Printed" content="2012-03-29T15:06:00Z" />
    <meta name="Revision-Number" content="2" />
    <meta name="meta:print-date" content="2012-03-29T15:06:00Z" />
    <meta name="meta:creation-date" content="2012-04-24T15:24:00Z" />
    <meta name="dcterms:modified" content="2012-04-24T15:24:00Z" />
    <meta name="Template" content="Normal" />
    <meta name="Page-Count" content="15" />
    <meta name="meta:character-count" content="18470" />
    <meta name="dc:creator" content="ogilvie.f" />
    <meta name="meta:word-count" content="3240" />
    <meta name="extended-properties:Company" content="Procter &amp; Gamble" />
    <meta name="Last-Modified" content="2012-04-24T15:24:00Z" />
    <meta name="custom:ContentTypeId" content="0x010100832DCE57D1DD144A851051A25C75E147" />
    <meta name="modified" content="2012-04-24T15:24:00Z" />
    <meta name="xmpTPg:NPages" content="15" />
    <meta name="dc:publisher" content="Procter &amp; Gamble" />
    <meta name="Character Count" content="18470" />
    <meta name="meta:page-count" content="15" />
    <meta name="meta:character-count-with-spaces" content="21667" />
    <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
    <title></title>
    </head>
    <body><p class="body_Text"><b>CONFIDENTIAL</b></p>
    <table><tbody><tr>  <td><p>principle</p>
    </td>   <td><p>optimum</p>
    </td>   <td><p>rationale</p>
    </td></tr>
    <tr>    <td><p>Number of  suppliers</p>
    </td>   <td><p class="list_Paragraph">2-3 per plant</p>
    <p class="list_Paragraph">&gt;80% with 5 per region/country cluster</p>
    </td>   <td><p class="list_Paragraph">Competition is local</p>
    <p class="list_Paragraph">Scale the spend with central accounts</p>
    </td></tr>
    <tr>    <td><p>Global/local suppliers</p>
    </td>   <td><p>Regional is sufficient</p>
    </td>   <td><p class="list_Paragraph">No advantage to global as scale is regional only and there is limited IP to transfer.</p>
    <p class="list_Paragraph">Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.</p>
    </td></tr>
    <tr>    <td><p>Approach to suppliers</p>
    </td>   <td><p>collaborative</p>
    </td>   <td><p>Competition to drive price is clear; preferential and value-add deals require collaboration</p>
    </td></tr>
    <tr>    <td><p>Make v buy</p>
    </td>   <td><p>buy</p>
    </td>   <td><p>Multiple suppliers; commoditised technologies</p>
    </td></tr>
    <tr>    <td><p>Distance of suppliers to plant</p>
    </td>   <td><p class="list_Paragraph">Max 300km for boxes (300miles in NA); up to 1000km for paper reels.</p>
    <p class="list_Paragraph">Can be longer for specialist print grades or to countries with no high quality local supply</p>
    </td>   <td><p class="list_Paragraph">Economic max as high volume product (air in the fluting)</p>
    <p class="list_Paragraph">Need recent built paper machines to produce paper strong enough to run on high-speed corrugators</p>
    </td></tr>
    <tr>    <td><p>Type of suppliers</p>
    </td>   <td><p class="list_Paragraph">Integrated with containerboard making</p>
    <p />
    <p class="list_Paragraph">Corrugators on-site</p>
    </td>   <td><p class="list_Paragraph">To assure supply and avoid being leveraged by paper making scale</p>
    <p class="list_Paragraph">Cost structure not competitive if have to buy in board (shipping air)</p>
    </td></tr>
    <tr>    <td><p>Purchase of feedstocks</p>
    </td>   <td><p>Not if integrated suppliers</p>
    </td>   <td><p>Integrated suppliers have 20x our scale</p>
    </td></tr>
    <tr>    <td><p>Length and nature of contracts</p>
    </td>   <td><p>Multiple year (2-3), but with fixed glidepath pricing/value every year</p>
    </td>   <td><p>Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.</p>
    </td></tr>
    <tr>    <td><p>Specifications</p>
    </td>   <td><p class="list_Paragraph">Standard board weights</p>
    <p />
    <p />
    <p class="list_Paragraph">Tailored box sizes</p>
    </td>   <td><p class="list_Paragraph">Paper scale much higher so uneconomic to make tailored weight</p>
    <p class="list_Paragraph">Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.</p>
    </td></tr>
    <tr>    <td><p>Terms</p>
    </td>   <td><p>Standard, including payment terms</p>
    </td>   <td><p>High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.</p>
    </td></tr>
    </tbody></table>
    <p>date</p>
    </td></tr>
    </tbody></table>
    <p />
    <p />
    <p>1</p>
    <p class="footer" />
    </body></html>

The challenge begins when I want to extract elements, say the tables, from the handler output. It was suggested that I use XPath, together with a regex, to get the tables. I understood the concept but wasn't able to do it with Tika as explained here.

After reading threads like this, I'm wondering whether I should drop Tika altogether in favour of JAXP, or use a combination of the two.

Can anyone point out where my assumptions or direction are wrong, and how I should proceed?
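To make the "combination" idea concrete, here is a minimal sketch of the JAXP half only: it takes XHTML of the kind shown in the sample output and pulls the table rows out with XPath. It rests on several assumptions. `TableExtractor` and `extractRows` are hypothetical names of my own, the Tika output is assumed to be already captured as a string and to be well-formed XML (the stray closing tags near the end of my sample would make the parse fail), and `local-name()` is used only to sidestep the default XHTML namespace that the handler emits.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Tika or JAXP.
class TableExtractor {

    /** Returns each table row in the XHTML as a list of whitespace-normalised cell texts. */
    static List<List<String>> extractRows(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The output declares xmlns="http://www.w3.org/1999/xhtml", so parse namespace-aware.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() matches elements regardless of their namespace.
        // Note: rows of nested tables, if any, are returned as well.
        NodeList rows = (NodeList) xpath.evaluate(
                "//*[local-name()='table']//*[local-name()='tr']",
                doc, XPathConstants.NODESET);

        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            // Direct <td> children of the current row.
            NodeList cells = (NodeList) xpath.evaluate(
                    "*[local-name()='td']", rows.item(i), XPathConstants.NODESET);
            List<String> row = new ArrayList<>();
            for (int j = 0; j < cells.getLength(); j++) {
                // Collapse the line breaks between <p> elements into single spaces.
                row.add(cells.item(j).getTextContent().trim().replaceAll("\\s+", " "));
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Shortened, well-formed excerpt of the sample output.
        String xhtml =
                "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><table><tbody>"
              + "<tr><td><p>principle</p></td><td><p>optimum</p></td><td><p>rationale</p></td></tr>"
              + "<tr><td><p>Make v buy</p></td><td><p>buy</p></td>"
              + "<td><p>Multiple suppliers; commoditised technologies</p></td></tr>"
              + "</tbody></table></body></html>";
        for (List<String> row : extractRows(xhtml)) {
            System.out.println(row);
        }
    }
}
```

In this arrangement Tika would only be responsible for turning the document into XHTML, and plain JAXP would handle the querying, so no regex is involved. Whether the real handler output always parses cleanly is something I have not verified.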

Kaliyug Antagonist
  • Your question is not a good fit for SO, for multiple reasons. First, we do not recommend tools here. Secondly, which tool to use is highly subjective on a major number of factors. Furthermore, you are missing the really important details (e.g. what does "wasn't able to do" mean? What didn't work, any error messages, etc. pp.?) – dirkk Aug 05 '14 at 11:59
  • Also, I didn't really get your problem. So you have this XML document now and want to extract some information. This should be an easy task using XPath (or if it is more complicated XPath 3.0 or XQuery). As you wrote something about regex here comes my usual rant: Do not use regex to parse XML - It simply can not be done correctly! – dirkk Aug 05 '14 at 12:01
  • I agree the question is not 'direct', but please check the background thread and the other mentioned threads, which will make it clear. Also, I do not wish anyone to 'recommend' any tool - I just want to establish which approach, and hence which tool, fits my scenario (mentioned in detail in the background thread). – Kaliyug Antagonist Aug 06 '14 at 03:14
  • First of all, you should include **all** details in your question, so we do not have to check background threads. But I actually did take a look at it and still have the questions as in my second comment. Also, clearly you are looking for a tool recommendation as the title already points out: You would like recommendation to either use Tika or JAXP. We can not decide that for you. – dirkk Aug 06 '14 at 05:19
