I am building a process to go through a large collection of XML files, looking for CALS model tables (most of the files contain at least one) and then store the tabular data back to a database. In the first instance, many of these tables will have dissimilar structures, but I want to use the database to analyse commonality of structure. I list below an (incomplete) extract from an example file.
<table frame="none">
<tgroup cols="6" colsep="0" rowsep="0">
<colspec colname="1" colnum="1" colwidth="127pt" align="center"/>
<colspec colname="2" colnum="2" colwidth="39pt" align="center"/>
<colspec colname="3" colnum="3" colwidth="30pt" align="center"/>
<colspec colname="4" colnum="4" colwidth="33pt" align="center"/>
<colspec colname="5" colnum="5" colwidth="33pt" align="center"/>
<colspec colname="6" colnum="6" colwidth="87pt"/>
<thead>
<row valign="bottom">
<entry align="center">Product</entry>
<entry>SKU</entry>
<entry>Length</entry>
<entry>Depth</entry>
<entry align="center">Weight</entry>
<entry align="center">Remarks</entry>
</row>
<row valign="bottom">
<entry></entry>
<entry></entry>
<entry>(m)</entry>
<entry>(m)</entry>
<entry align="center">(kg) </entry>
<entry align="center"> </entry>
</row>
</thead>
<tbody>
<row>
<entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
</row>
<row>
<entry>Albion</entry>
<entry>12345</entry>
<entry>398</entry>
<entry>15.5</entry>
<entry> </entry>
<entry>N/A </entry>
</row>
<row>
<entry>Rotorua</entry>
<entry>12346</entry>
<entry>398</entry>
<entry>15.5</entry>
<entry> </entry>
<entry> </entry>
</row>
<row>
<entry>Quintep</entry>
<entry>12347</entry>
<entry>398</entry>
<entry>15.5</entry>
<entry> </entry>
<entry> </entry>
</row>
Because of the dissimilar structures that I know I will encounter during the process, I am unable to create a table schema in the database that will hold all this data, other than to store it as entities with the following properties:
- doc_id
- table_id
- row_id
- col_id
- col_name
- col_units
- entry_value
This will allow me to store both string and numeric values as strings.
An additional property that is required in some instances is the 'category' as defined in the straddle row in the example above at tbody/row[1]. Here the original document creator has added a straddle to categorise the rows immediately following:
<row>
<entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
</row>
These straddle rows are causing me some difficulties. Where they occur, they 'categorise' the rows following UNTIL the next straddle occurs.
My initial document analysis has indicated that there are a number of possible 'types' for the table structure in relation to these straddles:
- Type 1. That no straddles occur in the tables - simple and easy to process
- Type 2. That the first row in the tbody is a straddle and that there are zero or more further straddles below this in the tbody
- Type 3. That the table contains straddles but NOT in the first row of the tbody
Type 3 'could' be treated as 2 separate tables, one of type 1 (all the rows up to but excluding the 1st straddle row) and a Type 2 (all the rows from the 1st straddle forwards)
So it seems that the key to solving this processing problem is to identify the position of the 1st straddle, treat everything (zero or more rows) before the straddle as Type1 and treat everything from the straddle forward as Type 2.
But I am having some difficulty identifying the position of there 1st straddle. My definition of 1st straddle - in Xpath terms - is
tbody/row[entry [@nameend > @namest]][1]
This allows for the possibility that the straddle is not always keyed from column 1 and does not always extend into the last column. Both of these possibilities do exist in the real world data.
So my problem still goes back to identifying the location of the 1st straddle.
There are many similar solutions listed on this page http://www.dpawson.co.uk/xsl/sect2/flatfile.html#d5010e13
But I am having difficulty applying them to my instances.
Something like this may work
<xsl:key name="straddles" match="row[entry[@nameend > @namest]]" use="entry/@namest"/>
But I'm unsure what to use to define the @use attribute of the key?
When I try to define a first-straddle variable, I don't have a defining value to pass to the key() function?
<xsl:variable name="first-straddle" select="table/tgroup/tbody/row[generate-id() = generate-id(key('straddles',?????))]"/>
How do I find the the location of the first straddle?
Thanks in advance
Additional Notes: ![Screen shot of example table rendered in browser]https://dl.dropboxusercontent.com/u/5065004/Screen%20Shot%202014-02-25%20at%2013.54.08.png The above screenshot shows the example table rendered in the browser. The straddle rows 'categorise' the data in the following rows - right up to the next straddle. The @nasmest and @nameend attributes define the number of COLUMNS straddled by the heading, not the number of ROWS until the next straddle. The latter value is IMPLICIT in the markup and is the XPath statement i am searching for. The former - the column straddle - is EXPLICIT in the markup.