Data Example:

<?xml version='1.0' encoding='UTF-8'?><osm version="0.6" generator="osmconvert 0.7P" timestamp="2013-07-20T19:00:02Z">
.
   <way id="128725988" version="1" timestamp="2011-09-03T08:06:56Z" changeset="9198624" uid="42429" user="42429">
      <nd ref="1421727256"/>
      <nd ref="1421727264"/>
      <nd ref="1421727238"/>
      <nd ref="1421727237"/>
      <nd ref="1421727256"/>
      <tag k="addr:housenumber" v="43"/>
      <tag k="addr:street" v="Wilhelm-Ahrens-Straße"/>
      <tag k="building" v="yes"/>
   </way>
.
.
   <node id="1964468590" lat="53.068416" lon="8.779039" version="1" timestamp="2012-10-14T12:29:02Z" changeset="13491909" uid="715371" user="cracklinrain"/>
   <node id="1964468593" lat="53.0684177" lon="8.7798644" version="1" timestamp="2012-10-14T12:29:02Z" changeset="13491909" uid="715371" user="cracklinrain">
      <tag k="natural" v="tree"/>
   </node>
.
.
.
   <way id="128725989" version="1" timestamp="2011-09-03T08:06:57Z" changeset="9198624" uid="42429" user="42429">
      <nd ref="1421728028"/>
      <nd ref="1421728023"/>
      <nd ref="1421728016"/>
      <nd ref="1421728024"/>
      <nd ref="1421728028"/>
      <tag k="addr:housenumber" v="44"/>
      <tag k="addr:street" v="Alma-Rogge-Straße"/>
      <tag k="building" v="yes"/>
   </way>
.
.

This is an excerpt from an XML file that contains about 30 GB of data.

What I want to do is extract only the <tag> elements that contain specific attributes, such as addr:housenumber.

Each extracted tag must stay connected to the id of its parent element.

My main problem is how to handle a 30 GB document. If it were about a few hundred MB it would be no problem to solve it by myself.

What I already tried:

  1. XmlReader

    Works very well for getting specific attributes, but the connection to the parent id is lost.

  2. Things like XDocument, XmlDocument...

    The problem is the amount of data (30 GB).
    After loading about 1 GB into memory I get an OutOfMemoryException.
    I understand it would be crazy to load 30 GB into memory.

I already have a separate working solution that uses an open-source library for .pbf files (but I want to process the clean data). It extracts the needed data by iterating through every node and uses LinqToSql to add it to the database.

Final result:

I want to import every street, house number, postal code and city into a SQL Server database where StreetTable is connected with CityTable. (My first solution works well, but after about 10,000 processed items it becomes very slow.)

I hope it is understandable what I want to do.

Daniel

2 Answers

I'm not sure but these links might help:

https://wiki.openstreetmap.org/wiki/Osmconvert#Dispose_of_Ways_and_Relations_and_Convert_them_to_Nodes

https://wiki.openstreetmap.org/wiki/Osmconvert#Writing_CSV_Files

also useful: osmfilter, Osmosis

Some options of osmconvert and osmfilter require a strictly ordered input file: first all nodes, then all ways, and then all relations. Within each group the data should be sorted by id.
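To make the ordering rule concrete, here is a minimal Python sketch that checks it for a sequence of (type, id) pairs taken in file order. The function name `is_osm_ordered` and the `TYPE_RANK` table are made up for this illustration and are not part of osmconvert or osmfilter.

```python
# Hypothetical helper: nodes must come first, then ways, then relations.
TYPE_RANK = {"node": 0, "way": 1, "relation": 2}


def is_osm_ordered(elements):
    """Return True if (type, id) pairs are grouped by type and sorted by id."""
    previous = None
    for kind, element_id in elements:
        # Compare (group rank, numeric id) so both rules are checked at once.
        current = (TYPE_RANK[kind], int(element_id))
        if previous is not None and current < previous:
            return False
        previous = current
    return True
```

A file that lists a way before a node, or ids out of order within one group, would fail this check.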

Conversion and filtering will be faster if you use .o5m (or maybe .pbf) file format.

Markus
  • Hi Markus, as I wrote in my explanation, using .o5m or .pbf files is not the kind of solution I am looking for, because I already have one that uses .pbf files and it is really slow. (I think the library I am using to read the .pbf file is the bottleneck.) I checked Osmfilter and Osmosis, but the data they produce doesn't help my project. But thanks for your suggestions! – Daniel Jul 26 '13 at 06:02

I have no experience with C#, but as the XML file is very large and it is enough to read it only once, a simple SAX parser seems sufficient. C#'s XmlReader seems to be similar to a SAX parser. So whenever a <node> or <way> element is read and the corresponding event is triggered, you just store its id attribute. And whenever a <tag> element is read and the corresponding event is triggered, you assign all of its attributes to the previously stored id.
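As a rough illustration of that idea, here is a minimal sketch using Python's standard `xml.sax` module rather than C#. The class name `TagHandler`, the function `stream_tags`, the callback signature and the `WANTED_KEYS` set are all assumptions made for this example. The same pattern (remember the id of the enclosing element, then attach it to each matching <tag>) should carry over to a forward-only reader such as XmlReader.

```python
import io
import xml.sax

# Assumed set of keys of interest; adjust to whatever tags you need.
WANTED_KEYS = {"addr:street", "addr:housenumber", "addr:postcode", "addr:city"}
PARENT_ELEMENTS = ("node", "way", "relation")


class TagHandler(xml.sax.ContentHandler):
    """Passes each wanted <tag> to a callback together with its parent id."""

    def __init__(self, on_match, wanted_keys=WANTED_KEYS):
        super().__init__()
        self.on_match = on_match
        self.wanted_keys = wanted_keys
        self.parent = None  # (element name, id) of the enclosing element

    def startElement(self, name, attrs):
        if name in PARENT_ELEMENTS:
            # Remember the parent so later <tag> events can refer to it.
            self.parent = (name, attrs.get("id"))
        elif name == "tag" and self.parent is not None:
            key = attrs.get("k")
            if key in self.wanted_keys:
                self.on_match(self.parent[0], self.parent[1], key, attrs.get("v"))

    def endElement(self, name):
        if name in PARENT_ELEMENTS:
            self.parent = None  # leaving the parent element


def stream_tags(source, on_match, wanted_keys=WANTED_KEYS):
    """Parse a file name or binary stream without building a document tree."""
    xml.sax.parse(source, TagHandler(on_match, wanted_keys))


if __name__ == "__main__":
    sample = (
        '<osm><way id="128725988"><nd ref="1421727256"/>'
        '<tag k="addr:housenumber" v="43"/>'
        '<tag k="building" v="yes"/></way></osm>'
    ).encode("utf-8")
    stream_tags(io.BytesIO(sample), lambda *row: print(row))
```

Because the handler only keeps the current parent id in memory, the memory use does not grow with the file size. The callback is where rows would be collected into batches for the database.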

scai