Get top 10 and last 10 from a million records

Question

I have a report that shows 2-4 million records. I fetch the records from Oracle into Java and write them to an Excel report. All of this is already done!

Now I also need to add a new tab with the top 10 and last 10 records. What would be the best way to do it?

Should I use Java's PriorityQueue or a binary tree to keep track of the top 10 and last 10? I don't need to store all the records in the data structure; I only need to keep 10 at a time. For example:

PriorityQueue<DataObject> queueTop10 = new PriorityQueue<DataObject>(10, topComparator);
PriorityQueue<DataObject> queueLast10 = new PriorityQueue<DataObject>(10, leastComparator);
while (data is coming from database)
{
    // push to excel stuff here
    queueTop10.add(dataObject);    // OR binarytreeTop.insert(dataObject)
    queueLast10.add(dataObject);   // OR binarytreeLeast.insert(dataObject)
}

Please let me know if I can use some other data structure as well.

Thanks

What do you mean by "top 10"? Does each record have some sort of score? Or are you looking for the most frequently occurring key values? Or what? – erickson, May 11 '15 at 16:37
IMO it is less work to get only the minimum element using a heap. A tree is more organized but requires more computation to maintain that organization. In your case you need to access the top 10 and bottom 10 records, and a heap may not work for you. I believe you should go with a tree implementation (`TreeMap`); the extra overhead is probably justified. – akhil_mittal, May 11 '15 at 16:40
Who reads these reports? This many records starts getting into the realm of "if we give a page of this report to everyone in the country..." or "if we stacked the pages we would have a pile X% of the way to the moon." Also, [OutOfMemoryError](http://docs.oracle.com/javase/8/docs/api/java/lang/OutOfMemoryError.html). – , May 11 '15 at 16:48
Hey! Thanks for the quick response. I am really sorry for the typo. I meant 2-4 million records, not billion. We save it in CSV format, and it's divided into different output files. – user1797559, May 11 '15 at 18:32
Yes, there is a score. So, the topComparator and leastComparator implement the logic. – user1797559, May 11 '15 at 18:33

score 2 · Accepted Answer · answered May 11 '15 at 16:46

Top-hit (top-k) algorithms typically use a min-heap (PriorityQueue in Java), but your algorithm needs a size check so the heap never holds more than 10 items. Suppose each item has a score, and you want to collect the 10 items with the highest score. PriorityQueue efficiently exposes the item with the lowest score, which is the one to evict when a better item arrives:

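// Min-heap of the best 10 seen so far; the head is the lowest-scoring of them.
PriorityQueue<DataObject> top = new PriorityQueue<>(10, comparator);
for (DataObject item : items) {
  if (top.size() < 10) {
    top.add(item);      // still filling up
  } else if (comparator.compare(top.peek(), item) < 0) {
    top.remove();       // evict the current lowest of the top 10
    top.add(item);
  }
}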
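
For a single pass that collects both ends at once, the same size-checked heap can be used twice: a min-heap for the top items and a max-heap (reversed comparator) for the bottom items. The following is a minimal sketch, not the answerer's code. The class `TopBottomTracker`, its method names, and the `Integer` scores in `main` are assumptions made for illustration; in the question's setting the type parameter would be `DataObject` and the comparator would order records by score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Keeps the k highest and k lowest elements seen so far (k >= 1), using two size-checked heaps. */
final class TopBottomTracker<T> {
    private final int k;
    private final Comparator<T> ascending;
    private final Comparator<T> descending;
    private final PriorityQueue<T> top;     // min-heap: head is the lowest of the current top k
    private final PriorityQueue<T> bottom;  // max-heap: head is the highest of the current bottom k

    TopBottomTracker(int k, Comparator<T> ascending) {
        this.k = k;
        this.ascending = ascending;
        this.descending = ascending.reversed();
        this.top = new PriorityQueue<>(k, this.ascending);
        this.bottom = new PriorityQueue<>(k, this.descending);
    }

    /** Offers one element to both heaps. */
    void offer(T item) {
        offerBounded(top, ascending, item);
        offerBounded(bottom, descending, item);
    }

    private void offerBounded(PriorityQueue<T> heap, Comparator<T> cmp, T item) {
        if (heap.size() < k) {
            heap.add(item);
        } else if (cmp.compare(heap.peek(), item) < 0) {
            heap.poll();    // evict the head, the weakest element currently kept
            heap.add(item);
        }
    }

    /** Highest elements, highest first. */
    List<T> topItems() { return sortedCopy(top, descending); }

    /** Lowest elements, lowest first. */
    List<T> bottomItems() { return sortedCopy(bottom, ascending); }

    private List<T> sortedCopy(PriorityQueue<T> heap, Comparator<T> cmp) {
        List<T> out = new ArrayList<>(heap);
        out.sort(cmp);
        return out;
    }

    public static void main(String[] args) {
        TopBottomTracker<Integer> tracker =
                new TopBottomTracker<>(3, Comparator.<Integer>naturalOrder());
        for (int score : new int[] {5, 1, 9, 7, 3, 8, 2}) {
            tracker.offer(score);
        }
        System.out.println(tracker.topItems());     // [9, 8, 7]
        System.out.println(tracker.bottomItems());  // [1, 2, 3]
    }
}
```

In the question's loop, `offer(dataObject)` would be called once per row next to the Excel write, so the tracker never holds more than 2k elements no matter how many rows are streamed.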

score 0 · Answer 2 · edited May 23 '17 at 12:32

You can use a priority queue, since Java's PriorityQueue is implemented as a heap. See the related question "How does Java's PriorityQueue differ from a min-heap? If no difference, then why was it named PriorityQueue and not Heap?"
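
As a quick illustration of that heap behaviour (a sketch with a hypothetical class name, `MinHeapDemo`, and natural `Integer` ordering), `poll()` always returns the smallest remaining element, which is what a min-heap provides:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class MinHeapDemo {
    /** Builds a PriorityQueue from the input and polls it until empty. */
    static List<Integer> drain(List<Integer> input) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(input);
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            out.add(heap.poll());  // smallest remaining element each time
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(drain(Arrays.asList(4, 1, 3, 2)));  // [1, 2, 3, 4]
    }
}
```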

– nullPointer, answered May 11 '15 at 16:43 (edited by Community, May 23 '17 at 12:32)

score 0 · Answer 3 · answered May 11 '15 at 16:49

PriorityQueue<T> will not work with your code as-is, because the 10 passed to the constructor is only the initial capacity, not a size limit; your queue will grow to 1B items as you go.
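
A small check of this point, as a sketch (the class `CapacityDemo` and its method are illustrative names, not part of the question's code): a queue constructed with capacity 10 accepts far more than 10 elements.

```java
import java.util.PriorityQueue;

final class CapacityDemo {
    /** Adds n integers to a queue created with initial capacity 10 and returns its final size. */
    static int sizeAfterAdding(int n) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(10);  // 10 is only the initial capacity
        for (int i = 0; i < n; i++) {
            queue.add(i);
        }
        return queue.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterAdding(1_000));  // 1000: the queue was never capped at 10
    }
}
```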

However, TreeSet<T> will work, with a small modification. You need to add code that removes the eleventh item every time the set grows past ten:

TreeSet<DataObject> top10 = new TreeSet<>(topComparator);
TreeSet<DataObject> bottom10 = new TreeSet<>(leastComparator);
while (data is coming from database) {   // pseudocode: loop over the fetched rows
    top10.add(dataObject);
    if (top10.size() > 10) {
        top10.pollLast();       // drop the eleventh element under topComparator
    }
    bottom10.add(dataObject);
    if (bottom10.size() > 10) {
        bottom10.pollLast();    // drop the eleventh element under leastComparator
    }
}
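
One caveat worth checking before using this approach: a `TreeSet` treats two elements as duplicates whenever its comparator returns 0, so records with equal scores would be collapsed into one unless the comparator breaks ties. The sketch below assumes a hypothetical `Row` class with a unique `id` and an `int` score standing in for `DataObject`, and a tie-breaking comparator named `HIGHEST_FIRST`; none of these names come from the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class TreeSetTopK {
    /** Hypothetical stand-in for DataObject: a unique id plus a score. */
    static final class Row {
        final long id;
        final int score;
        Row(long id, int score) { this.id = id; this.score = score; }
        @Override public String toString() { return id + ":" + score; }
    }

    /** Highest score first; ties broken by id so equal scores are not collapsed by the set. */
    static final Comparator<Row> HIGHEST_FIRST =
            Comparator.comparingInt((Row r) -> r.score).reversed()
                      .thenComparingLong(r -> r.id);

    /** Returns the k best rows under the given ordering, best first. */
    static List<Row> best(Iterable<Row> rows, int k, Comparator<Row> bestFirst) {
        TreeSet<Row> kept = new TreeSet<>(bestFirst);
        for (Row row : rows) {
            kept.add(row);
            if (kept.size() > k) {
                kept.pollLast();  // drop the worst of the k + 1 candidates
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(1, 50), new Row(2, 90), new Row(3, 90), new Row(4, 10));
        System.out.println(best(rows, 2, HIGHEST_FIRST));             // [2:90, 3:90]
        System.out.println(best(rows, 2, HIGHEST_FIRST.reversed()));  // [4:10, 1:50]
    }
}
```

Without the `thenComparingLong` tie-breaker, rows 2 and 3 would compare as equal and only one of them would be kept.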

– Sergey Kalinichenko, answered May 11 '15 at 16:49

Hey, thanks a lot for the quick response! If I can keep the PriorityQueue at only 10 elements, as erickson shows, which data structure do you think will be more efficient/faster? – user1797559 May 11 '15 at 18:46
@user1797559 I don't think there would be any difference at all, because the queue is tiny. In fact, you could probably change it to an array and do a linear search of ten items without seeing any difference (it's 3 comparisons at random locations in memory vs. 10 comparisons at contiguous locations in memory, so locality of reference may close the gap for you). If you go to 30..50 elements, the story may be different, but for 10 items it probably wouldn't matter. – Sergey Kalinichenko May 11 '15 at 18:53
@user1797559 `PriorityQueue`, like most heaps, is implemented with an array, so you get locality of reference with it. Because the array is so small, it may still be quicker to scan an array (despite the big-O scaling), because the code is so simple. – erickson May 11 '15 at 23:27
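
The array idea from these comments can be sketched as follows. This is an illustration only: it assumes plain `int` scores rather than `DataObject`, and the class and method names (`ArrayTopK`, `topK`) are made up for the example. Each candidate costs one linear scan over the k kept values, which for k = 10 is a handful of comparisons on contiguous memory.

```java
import java.util.Arrays;

final class ArrayTopK {
    /** Returns the k largest scores in descending order, using a plain array and linear scans. */
    static int[] topK(int[] scores, int k) {
        if (k <= 0) {
            return new int[0];
        }
        int[] kept = new int[k];
        int size = 0;
        for (int score : scores) {
            if (size < k) {
                kept[size++] = score;       // still filling the array
                continue;
            }
            int min = 0;                    // linear scan for the smallest kept score
            for (int i = 1; i < k; i++) {
                if (kept[i] < kept[min]) {
                    min = i;
                }
            }
            if (score > kept[min]) {
                kept[min] = score;          // overwrite the smallest with the better score
            }
        }
        int[] result = Arrays.copyOf(kept, size);
        Arrays.sort(result);
        for (int i = 0, j = result.length - 1; i < j; i++, j--) {  // reverse to descending
            int tmp = result[i];
            result[i] = result[j];
            result[j] = tmp;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(topK(new int[] {5, 1, 9, 7, 3, 8, 2}, 3)));  // [9, 8, 7]
    }
}
```

Whether this actually beats the heap for k = 10 would need to be measured; the comments only suggest that the difference is likely negligible at this size.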

score 0 · Answer 4 · edited Mar 20 '17 at 10:18

4 billion records in an Excel spreadsheet? No, that won't fit: https://superuser.com/questions/366468/what-is-the-maximum-allowed-rows-in-a-microsoft-excel-xls-or-xlsx

You should do this in the database rather than rely on the Java implementation. For this many records, the Java code is bound to be less efficient than an optimized DB query.

– NimChimpsky, answered May 11 '15 at 16:59 (edited by Community, Mar 20 '17 at 10:18)

Hey! Thanks for the quick response. I am really sorry for the typo. I meant 2-4 million records, not billion. We save it in CSV format, and it's divided into different output files. I didn't want to do it in the DB because the sorting logic is a bit complicated, so the query would need a lot of joins. As I am already fetching the data once, I thought it would be faster to reuse it and extract the top 10 and least 10 records using the comparators topComparator and leastComparator. Please let me know what you think. – user1797559 May 11 '15 at 18:36
