0

I have a CSV file of an SQL dump, and I am working with it in BaseX 8.4. The CSV headers contain a flattened representation of the SQL table structure.

CSV with header and first row:

country_id,country_code,country_name,publisher_id,publisher_name,country id,year_began,year_ended,series_id,series_name,sort_name,publisher_id
2,us,United States,78,Harvard University Press,2,1950,NULL,15,A New Series,New Series,78

The BaseX CSV parser produces the following XML representation:

<csv>
  <record>
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <publisher_id>78</publisher_id>
  </record>
</csv>

Regarding the original data, I know that the start of a table begins with its unique ID, but those ID names will also appear repeated in other tables as foreign keys.

I would like to create windows/groups that reconstruct the original table structure by matching the first occurrence of a table's unique ID (while ignoring each subsequent occurrence). What I have so far doesn't work because it matches every occurrence of the ID, not just the first one:

<tables>{
    for tumbling window $w in /csv/record/*
    start $s at $p when name($s) = ("country_id", 
                                    "publisher_id", 
                                    "series_id", 
                                    "issue_id", 
                                    "id_activity_fact", 
                                    "id_person_dim", 
                                    "id_location_dim", 
                                    "id_phys_loc_dim", 
                                    "id_letter_dim")
    return <table id_name="{name($s)}">{$w}</table>
}</tables>

Output:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
  </table>
  <table id_name="country_id">
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
  </table>
</tables>

Desired output:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>      
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>      
    <publisher_id>78</publisher_id>
  </table>
</tables>
tat
  • 321
  • 1
  • 19
  • 1
    It would help to show us some actual XML input (preferably simplified, we don't need to wade through all 15 columns) rather than expecting us to work out what it looks like when converted from your CSV. Your current query is starting a new group whenever it encounters a particular element name, and I can't see any resemblance between that and your description of the problem. – Michael Kay Mar 02 '16 at 08:38
  • @MichaelKay XML input added. – tat Mar 02 '16 at 12:30

2 Answers2

1

I think you may need to use your windowing solution for the initial splitting into segments, and then use "group by" on the result to merge segments having the same key.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
0

After working on this for quite some time, I gave up and decided simply to mark the subsequent occurrences of the ID names with an underscore, like so:

<csv>
  <record>
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <_country_id>2</_country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <_publisher_id>78</_publisher_id>
  </record>
</csv>

In this way, the window expression works as desired; then, I simply strip off the underscores to return the element names to their original form:

<tables>{
        for tumbling window $w in /csv/record/*
        start $s when $s/name() = ("country_id", 
                                   "publisher_id", 
                                   "series_id", 
                                   "issue_id", 
                                   "id_activity_fact", 
                                   "id_person_dim", 
                                   "id_location_dim", 
                                   "id_phys_loc_dim", 
                                   "id_letter_dim")
        return 
            <table id_name="{$s/name()}">{
                for $e in $w
                return 
                   if (starts-with($e/name(), "_")) then
                       element {$e/substring-after(name(), "_")} { $e/string() }
                   else $e
            }</table>
}</tables>

Final result:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <publisher_id>78</publisher_id>
  </table>
</tables>
tat
  • 321
  • 1
  • 19