0

I am building a process to go through a large collection of XML files, looking for CALS model tables (most of the files contain at least one) and then store the tabular data back to a database. In the first instance, many of these tables will have dissimilar structures, but I want to use the database to analyse commonality of structure. I list below an (incomplete) extract from an example file.

   <table frame="none">
        <tgroup cols="6" colsep="0" rowsep="0">
            <colspec colname="1" colnum="1" colwidth="127pt" align="center"/>
            <colspec colname="2" colnum="2" colwidth="39pt" align="center"/>
            <colspec colname="3" colnum="3" colwidth="30pt" align="center"/>
            <colspec colname="4" colnum="4" colwidth="33pt" align="center"/>
            <colspec colname="5" colnum="5" colwidth="33pt" align="center"/>
            <colspec colname="6" colnum="6" colwidth="87pt"/>
            <thead>
                <row valign="bottom">
                    <entry align="center">Product</entry>
                    <entry>SKU</entry>
                    <entry>Length</entry>
                    <entry>Depth</entry>
                    <entry align="center">Weight</entry>
                    <entry align="center">Remarks</entry>
                </row>
                <row valign="bottom">
                    <entry></entry>
                    <entry></entry>
                    <entry>(m)</entry>
                    <entry>(m)</entry>
                    <entry align="center">(kg) </entry>
                    <entry align="center"> </entry>
                </row>
            </thead>
            <tbody>
                <row>
                    <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
                </row>
                <row>
                    <entry>Albion</entry>
                    <entry>12345</entry>
                    <entry>398</entry>
                    <entry>15.5</entry>
                    <entry> </entry>
                    <entry>N/A </entry>
                </row>
                <row>
                    <entry>Rotorua</entry>
                    <entry>12346</entry>
                    <entry>398</entry>
                    <entry>15.5</entry>
                    <entry> </entry>
                    <entry> </entry>
                </row>
                <row>
                    <entry>Quintep</entry>
                    <entry>12347</entry>
                    <entry>398</entry>
                    <entry>15.5</entry>
                    <entry> </entry>
                    <entry> </entry>
                </row>

Because of the dissimilar structures that I know I will encounter during the process, I am unable to create a table schema in the database that will hold all this data, other than to store it as entities with the following properties:

  • doc_id
  • table_id
  • row_id
  • col_id
  • col_name
  • col_units
  • entry_value

This will allow me to store both string and numeric values as strings.

An additional property that is required in some instances is the 'category' as defined in the straddle row in the example above at tbody/row[1]. Here the original document creator has added a straddle to categorise the rows immediately following:

<row>
    <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
</row>

These straddle rows are causing me some difficulties. Where they occur, they 'categorise' the rows following UNTIL the next straddle occurs.

My initial document analysis has indicated that there are a number of possible 'types' for the table structure in relation to these straddles:

  • Type 1. That no straddles occur in the tables - simple and easy to process
  • Type 2. That the first row in the tbody is a straddle and that there are zero or more further straddles below this in the tbody
  • Type 3. That the table contains straddles but NOT in the first row of the tbody

Type 3 'could' be treated as 2 separate tables, one of type 1 (all the rows up to but excluding the 1st straddle row) and a Type 2 (all the rows from the 1st straddle forwards)

So it seems that the key to solving this processing problem is to identify the position of the 1st straddle, treat everything (zero or more rows) before the straddle as Type1 and treat everything from the straddle forward as Type 2.

But I am having some difficulty identifying the position of there 1st straddle. My definition of 1st straddle - in Xpath terms - is

tbody/row[entry [@nameend &gt; @namest]][1]

This allows for the possibility that the straddle is not always keyed from column 1 and does not always extend into the last column. Both of these possibilities do exist in the real world data.

So my problem still goes back to identifying the location of the 1st straddle.

There are many similar solutions listed on this page http://www.dpawson.co.uk/xsl/sect2/flatfile.html#d5010e13

But I am having difficulty applying them to my instances.

Something like this may work

<xsl:key name="straddles" match="row[entry[@nameend &gt; @namest]]" use="entry/@namest"/>

But I'm unsure what to use to define the @use attribute of the key?

When I try to define a first-straddle variable, I don't have a defining value to pass to the key() function?

<xsl:variable name="first-straddle" select="table/tgroup/tbody/row[generate-id() = generate-id(key('straddles',?????))]"/>

How do I find the the location of the first straddle?

Thanks in advance

Additional Notes: ![Screen shot of example table rendered in browser]https://dl.dropboxusercontent.com/u/5065004/Screen%20Shot%202014-02-25%20at%2013.54.08.png The above screenshot shows the example table rendered in the browser. The straddle rows 'categorise' the data in the following rows - right up to the next straddle. The @nasmest and @nameend attributes define the number of COLUMNS straddled by the heading, not the number of ROWS until the next straddle. The latter value is IMPLICIT in the markup and is the XPath statement i am searching for. The former - the column straddle - is EXPLICIT in the markup.

Feargal
  • 41
  • 8
  • the "straddles" that you refer is equal to `colspan` in html tables. you need to consider another attribute, `morerows`, which is equivalent to `rowspan` in html tables. – Joel M. Lamsen Feb 25 '14 at 10:23
  • Joel - the morerows attribute is not used anywhere in the data collection that I am analysing. I would need to go around inserting it. And to do that I would need to know where to insert it. Chicken and egg, methinks. – Feargal Feb 25 '14 at 10:53

1 Answers1

0

If I follow this correctly (big IF), each <entry> can find its corresponding category by looking at:

<xsl:value-of select="parent::row/preceding-sibling::row[entry[@namest]][1]/entry/hd4"/>

[EDIT ----------------------------------
To clarify: this is called from the context of an entry. Read this as follows:
1. Go up to the parent row;
2. Look backwards for the nearest row in the same parent tbody that has an entry that has a namest attribute;
3. Get the value of the hd4 element of the entry of the row found in step 2.
------------------------------------------- ]

However, it seems like that is not enough; the <entry> also needs to check if its position inside its parent row falls within the range set by the namest and nameend attributes? If that is correct, then it raises another question: is it possible to have two co-existing "straddles"? For example, one category for columns 1 to 3, and another category for columns 5 to 6?


EDIT

Please examine the following example:

<tbody>
    <row>
        <entry namest="1" nameend="3"><hd4>Colors</hd4></entry>
    </row>
    <row>
        <entry>Red</entry>
        <entry>Green</entry>
        <entry>Blue</entry>
    </row>
    <row>
        <entry>Black</entry>
        <entry>White</entry>
    </row>
    <row>
        <entry>Cyan</entry>
        <entry>Magenta</entry>
        <entry>Yellow</entry>
        <entry>Widget(not a color)</entry>
    </row>
    <row>
        <entry namest="4" nameend="5"><hd4>Shapes</hd4></entry>
    </row>
    <row>
        <entry>Brown</entry>
        <entry>Pink</entry>
        <entry>Golden</entry>
        <entry>Circle</entry>
    </row>
    <row>
        <entry>Azure</entry>
        <entry>Grey</entry>
        <entry>Orange</entry>
        <entry>Square</entry>
        <entry>Triangle</entry>
        <entry>Gadget(neither a color nor a shape)</entry>
    </row>
</tbody>

If I understand correctly, the second "straddle" does NOT supersede the first one, but comes in addition to it. So if we now try to apply the above xPath from the context of :

<entry>Pink</entry>

we will get the wrong result (Shapes) because we looked for the first preceding "straddle" - while we should have looked for the first preceding "straddle" that applies to column #2.

michael.hor257k
  • 113,275
  • 6
  • 33
  • 51
  • Yes. It is possible to have 2 straddles in a single row as you describe. – Feargal Feb 25 '14 at 10:40
  • Yes. It is possible to have 2 straddles in a single row as you describe. But I'm not sure it's a big issue. The presence of ANY straddle describes a row as a 'category' row. So the presence of 2 straddles is as good as just one in defining the row to use. I'm a little unclear on the use of multiple "::" axes. Could you expand on how that XPath is read? I'm presuming its read Right-to-Left? What is parent::row/preceding-sibling::row referring to? The parent element of the rows which occur B4 the row in question? Not sure that parent:: is necessary? – Feargal Feb 25 '14 at 10:51
  • @Feargal **1.** I have added a clarification to my answer. **2.** I am not sure you understood my concern regarding having two "straddles" **applying** to a single row - I have added an example to illustrate. – michael.hor257k Feb 25 '14 at 12:48
  • i have updated my original question and added a screenshot (although it is only linked at present - still getting the hang of markdown) - you have slightly the wrong end of the stick. The values of nameend and namest are irrelevant in deciding which straddle row applies to a current data row. It is ALWAYS the closest preceding straddle, whatever the values of namest or nameend. The straddle does not need to 'cover' the column in question. Does that make sense? – Feargal Feb 25 '14 at 14:05
  • @Feargal "*The values of nameend and namest are irrelevant in deciding which straddle row applies to a current data row. It is ALWAYS the closest preceding straddle, whatever the values of namest or nameend.*" I am not convinced that's the case in general - but if that's always true in your case, then my first answer should work for you just fine. – michael.hor257k Feb 25 '14 at 14:13