I have a CSV file of an SQL dump, and I am working with it in BaseX 8.4. The CSV headers contain a flattened representation of the SQL table structure.
CSV with header and first row:
country_id,country_code,country_name,publisher_id,publisher_name,country id,year_began,year_ended,series_id,series_name,sort_name,publisher_id
2,us,United States,78,Harvard University Press,2,1950,NULL,15,A New Series,New Series,78
The BaseX CSV parser produces the following XML representation:
<csv>
<record>
<country_id>2</country_id>
<country_code>us</country_code>
<country_name>United States</country_name>
<publisher_id>78</publisher_id>
<publisher_name>Harvard University Press</publisher_name>
<country_id>2</country_id>
<year_began>1950</year_began>
<year_ended>NULL</year_ended>
<series_id>15</series_id>
<series_name>A New Series</series_name>
<sort_name>New Series</sort_name>
<publisher_id>78</publisher_id>
</record>
</csv>
Regarding the original data, I know that the start of a table begins with its unique ID, but those ID names will also appear repeated in other tables as foreign keys.
I would like to create windows/groups that reconstruct the original table structure by matching the first occurrence of a table's unique ID (while ignoring each subsequent occurrence). What I have so far doesn't work because it matches every occurrence of the ID, not just the first one:
<tables>{
for tumbling window $w in /csv/record/*
start $s at $p when name($s) = ("country_id",
"publisher_id",
"series_id",
"issue_id",
"id_activity_fact",
"id_person_dim",
"id_location_dim",
"id_phys_loc_dim",
"id_letter_dim")
return <table id_name="{name($s)}">{$w}</table>
}</tables>
Output:
<tables>
<table id_name="country_id">
<country_id>2</country_id>
<country_code>us</country_code>
<country_name>United States</country_name>
</table>
<table id_name="publisher_id">
<publisher_id>78</publisher_id>
<publisher_name>Harvard University Press</publisher_name>
</table>
<table id_name="country_id">
<country_id>2</country_id>
<year_began>1950</year_began>
<year_ended>NULL</year_ended>
</table>
<table id_name="series_id">
<series_id>15</series_id>
<series_name>A New Series</series_name>
<sort_name>New Series</sort_name>
</table>
<table id_name="publisher_id">
<publisher_id>78</publisher_id>
</table>
</tables>
Desired output:
<tables>
<table id_name="country_id">
<country_id>2</country_id>
<country_code>us</country_code>
<country_name>United States</country_name>
</table>
<table id_name="publisher_id">
<publisher_id>78</publisher_id>
<publisher_name>Harvard University Press</publisher_name>
<country_id>2</country_id>
<year_began>1950</year_began>
<year_ended>NULL</year_ended>
</table>
<table id_name="series_id">
<series_id>15</series_id>
<series_name>A New Series</series_name>
<sort_name>New Series</sort_name>
<publisher_id>78</publisher_id>
</table>
</tables>