
I have a set of numbers, such as 1, 4, 10, 23, ..., and I would like to build a B-tree index for them using Apache Spark. The input has one record per line (separated by '\n'). I also have no idea what format the output file should take; I just want to find a recommended one.

The regular way of building a B-tree index is shown at https://en.wikipedia.org/wiki/B-tree, but I would like a distributed, parallel version in Apache Spark.

In addition, the Wikipedia article on B-trees describes a way to build a B-tree from a large existing collection of data (see https://en.wikipedia.org/wiki/B-tree). It seems that I should sort the data in advance, and I think that for a big data set sorting is quite time-consuming and may not even complete with limited memory. Is the method mentioned above a recommended one?

chenzhongpu

1 Answer


Sort the RDD with RDD.sortBy if it's not already sorted. Use RDD.mapPartitions to build an index for each partition. Then build a top-level index that connects the per-partition indices.
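The three steps above can be sketched roughly as follows. This is a minimal plain-Python illustration rather than a Spark job: the partitions are simulated as ordinary lists so that it runs without a cluster, and the sort step stands in for what `sortBy` would do, while the per-partition step stands in for a function passed to `mapPartitions`. The names `FANOUT`, `range_partition`, `bulk_load`, `build_index`, and `lookup`, as well as the dict-based node layout, are hypothetical choices made for this sketch and are not part of any Spark API.

```python
from bisect import bisect_left, bisect_right

FANOUT = 4  # hypothetical maximum number of keys (or children) per node


def range_partition(values, num_partitions):
    """Sort the values and cut them into contiguous ranges.

    This only simulates the effect of a distributed sort: every key in
    partition i is <= every key in partition i + 1.
    """
    data = sorted(values)
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def bulk_load(sorted_keys, fanout=FANOUT):
    """Build a B+-tree-like index bottom-up from already sorted keys."""
    if not sorted_keys:
        return {"leaf": True, "keys": [], "min": None}
    # Leaf level: pack the sorted keys into nodes of at most `fanout` keys.
    level = [
        {"leaf": True, "keys": sorted_keys[i:i + fanout], "min": sorted_keys[i]}
        for i in range(0, len(sorted_keys), fanout)
    ]
    # Internal levels: group nodes under parents until one root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            children = level[i:i + fanout]
            parents.append({
                "leaf": False,
                # Separator keys: the smallest key under each child but the first.
                "keys": [child["min"] for child in children[1:]],
                "children": children,
                "min": children[0]["min"],
            })
        level = parents
    return level[0]


def build_index(values, num_partitions=3, fanout=FANOUT):
    """Build one tree per partition plus a top-level index over the partitions."""
    partitions = range_partition(values, num_partitions)
    roots = [bulk_load(part, fanout) for part in partitions]  # per-partition step
    # Top-level index: the first key of every partition except the first.
    bounds = [part[0] for part in partitions[1:]]
    return {"bounds": bounds, "roots": roots}


def lookup(index, key):
    """Return True if `key` is present in the index."""
    if not index["roots"]:
        return False
    # Pick the partition whose key range may contain the key.
    node = index["roots"][bisect_right(index["bounds"], key)]
    # Descend from the partition root to a leaf.
    while not node["leaf"]:
        node = node["children"][bisect_right(node["keys"], key)]
    pos = bisect_left(node["keys"], key)
    return pos < len(node["keys"]) and node["keys"][pos] == key


index = build_index([23, 1, 10, 4, 7, 15, 42, 8, 16, 3], num_partitions=3, fanout=2)
found = lookup(index, 10)
missing = lookup(index, 5)
```

Because each partition's tree is built from sorted input, no node splitting or rebalancing is needed. How the resulting nodes should be serialized to an output file still depends on how the index will be read later, which this sketch leaves open.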

Daniel Darabos
  • Suppose the input file comes from `HDFS`, and I would like to save the B-tree index to another `HDFS` for persistence. Can you give me a more specific answer? – chenzhongpu Mar 08 '15 at 07:17
  • No. Your question does not include any details about how the input and output are structured and formatted. That's as specific as possible. – Daniel Darabos Mar 08 '15 at 07:43
  • Now suppose that the input file from `HDFS` contains a set of numbers, one record per line. I also have no idea what format the output file should take; I just want to find a recommended one. You can look at http://spatialhadoop.cs.umn.edu/spatial-index.html; that page shows the output file format for the `r-tree` and `grid` indexes. – chenzhongpu Mar 08 '15 at 08:31
  • How do you plan to use this index? – Daniel Darabos Mar 08 '15 at 15:23
  • The usage is not important. I have been asked to work on indexing spatial data in HDFS. A set of numbers can be considered as points in one-dimensional space. Therefore, I want to know how to design the output format of a `b-tree` index and an `r-tree` index. In short, I am doing something with `Apache Spark` similar to spatialhadoop.cs.umn.edu/spatial-index.html. – chenzhongpu Mar 09 '15 at 00:54
  • If you're not going to use the index, don't build it. If you are going to use it, build it according to how it will be used. If you are just interested in the generic approach, regardless of usage specifics, I've already answered that. Good luck! – Daniel Darabos Mar 09 '15 at 08:24