How can I parse the XML output from Postgres as an input for Basex in Linux?

– gph

  • If you're going to post a self-answered question, at least *try* to actually ask a question... The body of your question is not for you to just tag spam every product you can think of. – Siyual Mar 16 '17 at 18:36

2 Answers

Oh, I see my answer is somewhat outdated; yet I'll leave it here, as in my opinion the approach you describe in your answer might be overkill for the task at hand.


I am not sure if you even have a question, yet I'd like to propose a fundamentally leaner approach ;-)

I hope it helps a little! Have fun!

For the current use case you can throw away awk, sed, Postgres and wget; everything you need can be done in about 25 lines of XQuery:

1) Some basics: fetch a file from a remote server:

fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')

2) Skip the first line. I decided to use the header that came with the original file, but you could of course handle the header differently.

fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)

So, in total, your requirements condense to:

RQ 1:

(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)

(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})

for $record in $csv/csv/record
  group by $date := $record/REF_DATE
  order by $date ascending
return element year_total {
  attribute date { $date },
  attribute population { sum($record/POP_TOTAL) => format-number("0000000")}
}

RQ 2:

(: Fetch CSV as Text, split it per line, skip the first line: :)
let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
=> tokenize(out:nl()) (: Split string by newline :)
=> tail() (: Skip first line :)
=> string-join(out:nl()) (: Join strings with newline :)

(: Parse the csv file, first line contains element names.:)
let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
for $record in $csv/csv/record
  group by $date := $record/REF_DATE
  order by $date ascending
return element year_total {
  attribute date { $date },
  attribute population { sum($record/POP_TOTAL) => format-number("0000000")},
  for $sub_item in $record
  group by $per-district := $sub_item/DISTRICT_CODE
  return element district {
    attribute name { $per-district },
    attribute population { sum($sub_item/POP_TOTAL) => format-number("0000000")}
  }
}

Including the file write, and with the date formatted in a more readable way:

(: wrap elements in single root element :)
let $result := element result {
  (: Fetch CSV as Text, split it per line, skip the first line: :)
  let $lines := fetch:text('https://www.wien.gv.at/statistik/ogd/vie_101.csv')
  => tokenize(out:nl()) (: Split string by newline :)
  => tail() (: Skip first line :)
  => string-join(out:nl()) (: Join strings with newline :)

  (: Parse the csv file, first line contains element names.:)
  let $csv := csv:parse($lines, map { "header": true(), "separator": ";"})
  for $record in $csv/csv/record
    group by $date := $record/REF_DATE
    order by $date ascending
  return element year_total {
    attribute date { $date => replace("^(\d{4})(\d{2})(\d{2})","$3.$2.$1")},
    attribute population { sum($record/POP_TOTAL) => format-number("0000000")},

    for $sub_item in $record
    group by $per-district := $sub_item/DISTRICT_CODE
    return element district {
      attribute name { $per-district },
      attribute population { sum($sub_item/POP_TOTAL) => format-number("0000000")},
      $sub_item
    }
  }
}

return file:write("result.xml", $result)

– michael

Setup

Data source: http://www.wien.gv.at/statistik/ogd/vie_101.csv

Research questions (RQ):

RQ1: How many people lived in Vienna in total per census?
RQ2: How many people lived in each Viennese district per census?

Preparation

In order to answer the RQs, the PostgreSQL database was chosen. Adhering to the proverbial saying "Where there's a shell, there's a way", this code shows a neat solution for the BASH (CLI, Debian/Ubuntu flavored). It is also much easier to interact with Postgres from the BASH when creating the files needed for further processing. Regarding the installation process, please consult: https://tecadmin.net/install-postgresql-server-on-ubuntu/
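
On a Debian/Ubuntu box the installation itself usually boils down to a couple of apt commands. The following is only a minimal sketch (package names and post-install steps can vary between releases, so the linked guide remains the reference):

# Install the PostgreSQL server and client tools (Debian/Ubuntu packages)
sudo apt-get update ;
sudo apt-get install -y postgresql postgresql-client ;

# Quick check that the server is running and psql can reach it
sudo -u postgres psql -c 'select version();' ;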

First download the file with wget:

cd /path/to/directory/ ;
wget -O ./vie_101.csv http://www.wien.gv.at/statistik/ogd/vie_101.csv ;

Then look at the file with your favorite spreadsheet program (e.g. LibreOffice Calc). vie_101.csv should be in UTF-8 encoding and probably uses a semicolon (;) delimiter. Open, check, change, save. Some reformatting is needed for ease of processing down the line. First, a header file is created with the appropriate column names. Second, the downloaded file is "beheaded" (the first 2 rows are removed) and "cut" (down to the columns of interest). Finally, it is appended to the header file.

echo 'DISTRICT,POPULATION,MALE,FEMALE,DATE' > ./vie.csv ;
declare=$(sed -e 's/,/ INT,/g' ./vie.csv)' INT' ;
sed 's/\;/\,/g' ./vie_101.csv | sed 's/\.//g' | tail -n+3 | cut -d ',' -f4,6-9 >> ./vie.csv ;

Postgres

In order to load data into Postgres, a table needs to be created first:

echo "create table vie ( $declare );" | sudo -u postgres psql ;

To actually load the data, the previously created and formatted file (vie.csv) needs to be copied into a folder accessible to the postgres user. Only then can the copy command be executed. Note that root privileges are required for this operation (sudo).

sudo cp ./vie.csv /var/lib/postgresql/ ;
echo "\copy vie from '/var/lib/postgresql/vie.csv' delimiter ',' csv header ;" | sudo -u postgres psql ;

XML Schema

Before we create our XML document, we have to design the structure of our file. We decided to create an XML schema (schema.xsd) instead of a DTD. Our schema defines a root element <table> and its child <row>, which are complex elements. The <row> element can occur any number of times. The children of <row> are <district>, <population>, <male>, <female> and <date>. These 5 elements (siblings) are simple elements, and the defined value type is always an integer.
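
The schema file itself is not reproduced here; below is a minimal sketch of what such a hand-written schema.xsd could look like. The lowercase element names are assumptions that mirror the structure the XQuery section further down queries (//table/row):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of schema.xsd: a <table> root with any number of <row> children,
     each holding the five integer columns described above (assumed names). -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="table">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="row" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="district"   type="xs:integer"/>
              <xs:element name="population" type="xs:integer"/>
              <xs:element name="male"       type="xs:integer"/>
              <xs:element name="female"     type="xs:integer"/>
              <xs:element name="date"       type="xs:integer"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>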

Create XML with Postgres

Since the ultimate goal is to answer the RQs via XQuery, an XML file is needed. This file (vie_data.xml) needs to be correctly formatted and well formed. As the next step, a query_to_xml command is piped to psql; the -Aqt flags are used to:

-A [unaligned output mode: removes column alignment and the "+" continuation marks at the ends of wrapped lines]
-q [quiet output]
-t [tuples only: removes the column header and the row-count footer]

echo "select query_to_xml( 'select * from vie order by date asc', true,
false, 'vie' ) ;" | sudo -u postgres psql -Aqt > ./vie_data.xml ;

Now, it is important to export the schema of the table with table_to_xmlschema().

echo "select table_to_xmlschema( 'vie', true, false, '') ;" | sudo -u
postgres psql -Aqt > ./vie_schema.xsd ;

This concludes all tasks within Postgres and the BASH. As the last command, the BaseX GUI can be launched:

basexgui

XQuery

Using BaseX, the XML file can easily be validated against the schema via:

validate:xsd('vie_data.xml', 'vie_schema.xsd')

The XML file can be imported by clicking (a shell-based alternative is sketched after this list):

  1. Database -> New
  2. General -> Browse: select the XML file.
  3. Parsing: turn on "Enable Namespaces" if it's not enabled.
  4. OK
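
Alternatively, the database can be created from the shell with the BaseX command line client (a sketch, assuming basex is on the PATH and vie_data.xml sits in the current directory; the database name "vie" is arbitrary):

# Create a BaseX database named "vie" from the exported XML file
basex -c "CREATE DB vie vie_data.xml" ;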

RQ1 can only be answered by grouping the data by DATE via a for-loop. Results are saved via file:write('/path/to/directory/file_name', ...):
file:write( '/path/to/directory/population_year_total.xml',
  for $row in //table/row
  group by $date := $row/date
  order by $date ascending
  return <year_total date="{$date}"
                     population="{sum($row/population)}"/>
)

RQ2 is answered by nesting two for-loops. The outer loop groups by DATE and returns the POPULATION total for each given DATE. The inner loop groups by DISTRICT and hence returns a per-district sub-sum of the POPULATION.

file:write( '/path/to/directory/district_year_subtotal.xml',
  for $row in //table/row
  group by $date := $row/date
  order by $date ascending
  return <sub_sum date="{$date}"
                  population="{sum($row/population)}">{
    for $sub_item in $row
    group by $district := $sub_item/district
    order by $district ascending
    return <sub_item district="{$district}"
                     population="{sum($sub_item/population)}"/>
  }</sub_sum>
)

Done

– gph