
I have a set of data in a text file (fairly large), where each line represents a rectangle:

x1,y1,x2,y2

After reading the file, how do I bulk load the rectangles and build an R-tree index using JTS (http://www.vividsolutions.com/jts/javadoc/index.html)?

I checked its APIs, and it seems that only insert can be used for bulk loading.

Here is my test code:

    import com.vividsolutions.jts.geom.Envelope;
    import com.vividsolutions.jts.index.strtree.STRtree;

    STRtree rtree = new STRtree();

    // Envelope(x1, x2, y1, y2); min/max values are normalized internally
    rtree.insert(new Envelope(1.0, 2.0, 1.2, 3.4), 1);
    rtree.insert(new Envelope(4.0, 3.2, 1.9, 4.4), 2);
    rtree.insert(new Envelope(3.4, 3.8, 2.2, 5.2), 3);
    rtree.insert(new Envelope(2.1, 5.3, 5.2, 3.6), 4);
    rtree.insert(new Envelope(4.2, 2.2, 2.9, 10.3), 5);

    List<Object> list = rtree.query(new Envelope(1.4, 5.6, 2.0, 3.0));

Is this the right way to build an R-tree index (just using the insert method)?

Another question: suppose the input file is very large, e.g., GB or even TB scale, and stored in HDFS. In that case, I would like a parallel version of the code above based on Apache Spark.

Last, any ideas on saving the R-tree to a file so it can be recovered for later use?

Edit: Now I read an HDFS file to build the index; here is my code:

    val inputDataPath = "hdfs://localhost:9000/user/chenzhongpu/testData.dat"
    val conf = new SparkConf().setAppName("Range Query")

    // note that the function names for queries differ across systems;
    // here we simply use intersect.

    val sc = new SparkContext(conf)

    val inputData = sc.textFile(inputDataPath).cache()

    val strtree = new STRtree

    inputData.foreach { line =>
      val array = line.split(",").map(_.toDouble)
      strtree.insert(new Envelope(array(0), array(1), array(2), array(3)),
        new Rectangle(array(0), array(1), array(2), array(3)))
    }

I call insert inside foreach, but when I print the size of strtree, it is zero!

Why doesn't the insert method inside foreach work? Did I miss something?

chenzhongpu
  • This code is incorrect, you can't access the STRtree unless you broadcast it – aaronman Mar 19 '15 at 14:36
  • If I call `collect()` on `inputData`, it works. But that won't fit in memory for a big dataset. As you say, doing the `partition` first would be better. – chenzhongpu Mar 19 '15 at 15:02

1 Answer


It looks like you are building everything correctly. STRtree defers bulk loading until the first query, and after that it does not allow you to add or remove nodes. If you want to parallelize this with Apache Spark, you could write a custom partitioner (similar to a range partitioner) that partitions your area into a large grid, and then build an STRtree for each partition. In Spark (and standard Java) you can easily save an STRtree to a file, since it implements Serializable.
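Since the tree is Serializable, plain Java object streams are enough to persist it. Here is a minimal sketch; the helper names (`save`/`load`) and the file path are illustrative, and the demo uses a stand-in Serializable payload where you would pass your built STRtree:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class PersistIndex {
    // Write any Serializable object (e.g., a built STRtree) to a file.
    static void save(Serializable obj, String path) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(obj);
        }
    }

    // Read it back; cast to STRtree at the call site.
    static Object load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(path))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in payload; replace with your STRtree instance.
        ArrayList<String> payload = new ArrayList<>(Arrays.asList("a", "b"));
        String path = "index.ser";
        save(payload, path);
        Object restored = load(path);
        System.out.println(restored.equals(payload)); // prints true
        Files.deleteIfExists(Paths.get(path));
    }
}
```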

The code for RangePartitioner is pretty complex, since it samples the input data and creates a probabilistic partitioning of the range. If you already know your max bounds, you can do something simpler by making the grid based on the parallelism you want: the partitioner would essentially work by finding which part of the grid a geometry falls in and sending all the geometries in that cell to the same partition (the partitioner could also use an STRTree internally for speed).
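The grid lookup itself is just arithmetic. A hypothetical sketch of the cell-id computation such a partitioner could use (all names here are illustrative, not part of Spark or JTS):

```java
public class GridCell {
    // Map a point (e.g., a rectangle's center) to a cell id in a
    // cellsPerAxis x cellsPerAxis grid over [minX, maxX] x [minY, maxY].
    static int cellId(double x, double y,
                      double minX, double minY,
                      double maxX, double maxY,
                      int cellsPerAxis) {
        int col = (int) ((x - minX) / (maxX - minX) * cellsPerAxis);
        int row = (int) ((y - minY) / (maxY - minY) * cellsPerAxis);
        // clamp points on the max edge into the last cell
        col = Math.min(col, cellsPerAxis - 1);
        row = Math.min(row, cellsPerAxis - 1);
        return row * cellsPerAxis + col;
    }

    public static void main(String[] args) {
        // 4x4 grid over the unit square
        System.out.println(cellId(0.1, 0.1, 0, 0, 1, 1, 4)); // prints 0
        System.out.println(cellId(0.9, 0.9, 0, 0, 1, 1, 4)); // prints 15
    }
}
```

A Spark `Partitioner` would return this cell id from `getPartition`, so all rectangles in the same grid cell land in the same partition and can be indexed together by one STRtree.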

One more suggestion: for simplicity you could partition by range on just x or just y using the standard RangePartitioner in Spark, but a custom partitioner would likely work better.

aaronman
  • As you say, `STRTree does bulk loading until you query`. It seems that saving the STRtree object itself doesn't make sense, because when I later deserialize it from the file and call **query** again, the bulk-loading work will be executed again. Right? – chenzhongpu Mar 18 '15 at 16:20
  • @ChenZhongPu I doubt they would have implemented it that way, there is probably an internal data structure they use to store it, and only that is loaded on deserialization – aaronman Mar 18 '15 at 16:23
  • @ChenZhongPu the way you're using it won't work; ideally you should create it in mapPartitions and use it that way – aaronman Mar 19 '15 at 14:45