2

Can anyone tell me how aggregate functions are implemented in SQL databases, e.g. Oracle or SQL Server?

I mean, do these databases use some internal data structure or algorithm when an aggregate function is present in the SELECT clause?

The reason I am asking is that I have 100,000 records in a Java ArrayList, and when I try to sum all the values it takes around 1 minute; but when the same 100,000 records are stored in the DB and I use sum(column_nm), it executes in about 1/4 of the time.

I want to improve my Java code's performance in a similar way, which is why I want to know the internals of SQL aggregate functions.

Thanks.

Jeffrey Kemp
  • 59,135
  • 14
  • 106
  • 158
Ashay Batwal
  • 5,415
  • 7
  • 25
  • 30
  • One minute for summing up even a million records in an ArrayList sounds _horribly_ slow. Are you including the time it takes to fetch all the records from the database first? – Joachim Isaksson Feb 26 '13 at 05:47
  • That's one hundred thousand (one lakh) records, not one million. – minopret Feb 26 '13 at 05:50
  • Sorry guys, I meant 10,00,000 records and there isn't any DB operation but I do have some other memory operations in between. – Ashay Batwal Feb 26 '13 at 05:58
  • 1
    Your zeroes are still off. Not that it matters, but we get the idea :P. So.. a million records? – Scotch Feb 26 '13 at 06:02
  • The database is probably implemented in native code, and your `ArrayList` code might be running in a 'client' VM, especially from an IDE. If you load it all into memory and run in the 'server' VM, it might even outperform the database. – Andrew Mao Feb 26 '13 at 06:08
  • What does one of these records look like? Do you have some sense of how much memory is used by each record? How much memory have you allocated to the VM? Just to give you a reference point, my laptop adds 1 million integers stored in an ArrayList within 8ms, using two `System.nanoTime()` calls to measure the duration. – Dilum Ranatunga Feb 26 '13 at 06:29
  • As I'm pretty sure that Postgres would deliver the same kind of performance you might want to have a look into the Postgres source code (it's apparently very structured and easy to understand for someone who is good in C) –  Feb 26 '13 at 07:49
  • this paper https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=13&ved=0CD8QFjACOAo&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.50.5661%26rep%3Drep1%26type%3Dpdf&ei=4tnLUeanDMW6lAWdg4CIDw&usg=AFQjCNEVizH4HlQz4F-3AIhfFKSN2Ewn8A&sig2=8BKSl-ea9UnbqFSe_sy5fQ&bvm=bv.48340889,d.dGI&cad=rja and its references – didxga Jun 27 '13 at 06:24

5 Answers

2

Although this won't exactly match how the internally defined aggregates work, SQL Server does allow you to create user-defined aggregates. It may be instructive to see what methods such an aggregate must define:

  • Init:

The query processor uses this method to initialize the computation of the aggregation. This method is invoked once for each group that the query processor is aggregating. The query processor may choose to reuse the same instance of the aggregate class for computing aggregates of multiple groups. The Init method should perform any clean-up as necessary from previous uses of this instance, and enable it to re-start a new aggregate computation.

  • Accumulate:

... The query processor uses this method to accumulate the aggregate values. This is invoked once for each value in the group that is being aggregated. The query processor always calls this only after calling the Init method on the given instance of the aggregate-class. The implementation of this method should update the state of the instance to reflect the accumulation of the argument value being passed in.

  • Merge:

This method can be used to merge another instance of this aggregate class with the current instance. The query processor uses this method to merge multiple partial computations of an aggregation.

  • Terminate:

This method completes the aggregate computation and returns the result of the aggregation. ...

From the descriptions of Merge and Terminate, we can deduce that the server might perform multiple partial aggregations within a single group in parallel. Once each of these parallel accumulations has occurred, all of the results are Merged together before the final call to Terminate on one instance of the class produces the final aggregate result.

So, one obvious way to achieve speed up (if possible) is to parallelize the accumulation stage.
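For illustration, here is a minimal Java sketch of that four-method contract applied to SUM. The class and method names are my own (SQL Server's actual CLR interface is defined in .NET, not Java); this just shows how two partial accumulations can be merged before terminating:

```java
// Minimal sketch of the Init/Accumulate/Merge/Terminate contract for SUM.
// Names mirror the lifecycle described above but are otherwise hypothetical.
public class SumAggregate {
    private long total;

    public void init() { total = 0; }                       // Init: reset state for a new group

    public void accumulate(int value) { total += value; }   // Accumulate: called once per row

    public void merge(SumAggregate other) { total += other.total; } // Merge: combine partial sums

    public long terminate() { return total; }               // Terminate: produce the final result

    public static void main(String[] args) {
        // Two "parallel" partial aggregations over halves of the data...
        SumAggregate a = new SumAggregate(); a.init();
        SumAggregate b = new SumAggregate(); b.init();
        for (int i = 1; i <= 50; i++) a.accumulate(i);
        for (int i = 51; i <= 100; i++) b.accumulate(i);
        // ...merged before the single Terminate call.
        a.merge(b);
        System.out.println(a.terminate()); // 5050
    }
}
```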

Damien_The_Unbeliever
  • 234,701
  • 27
  • 340
  • 448
2

There is a pretty simple explanation for why the Java code is much slower:

You are using an ArrayList, so I assume you are putting Integer objects in it. These carry significant overhead compared to primitive ints stored in a plain array. Second, when you sum them up, each partial sum can box a new Integer, so your garbage collector eats the performance.

As stated in other answers,

  1. The DB will add ints with direct processor instructions, right in the registers - it cannot be faster.
  2. A good DB will not simply iterate but will map + reduce such aggregations as sum, min or max. Thus they gain the benefit of multiple processors and largely hide the I/O latency.

For you to solve it in code: use an int[]

 int[] parts;   // filled with your values as primitive ints, not Integer objects
 long sum = 0;  // long avoids overflow on large sums
 for (int i : parts) {
   sum += i;
 }

You might want to test whether splitting (mapping) the array according to your processor count and parallelizing this with Futures is useful - it depends on the size of your data.
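That splitting idea can be sketched with an ExecutorService and Futures; the chunking scheme here is just one possible choice, and whether it actually pays off depends on the array size:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: cut the array into one chunk per processor, sum each chunk
// in its own task, then combine the partial sums.
public class ParallelSum {
    public static long sum(int[] parts) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (parts.length + threads - 1) / threads;
        List<Future<Long>> futures = new ArrayList<>();
        for (int start = 0; start < parts.length; start += chunk) {
            final int from = start;
            final int to = Math.min(start + chunk, parts.length);
            futures.add(pool.submit(() -> {
                long s = 0;
                for (int i = from; i < to; i++) s += parts[i]; // sum one chunk
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> f : futures) total += f.get(); // combine partial sums
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        int[] data = new int[100];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data)); // 5050
    }
}
```

On Java 8+, `java.util.Arrays.stream(parts).parallel().sum()` does the same splitting for you via the fork/join pool.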

Jan
  • 1,042
  • 8
  • 22
1

The performance difference is simply because to calculate a SUM, you don't need to store all the data in memory at the same time.

When you issue the query asking for the SUM directly to the database, it can read each record from disk, accumulate the running total in a single variable in memory, then read the next record - it never needs to keep all the records in memory at the same time. More importantly, it doesn't need to send those records across the network to any other server for processing - it only needs to send the resulting SUM as a single number at the end.

In addition, because a SUM over the whole is equal to the SUM of any distinct subsets of the whole, a SUM can be parallelized - e.g. if the data is partitioned, the database can issue multiple queries to be run in different sessions, each will SUM its portion of the data, then the controlling session can simply SUM over the results of each partition.

When you calculate the sum in your Java program, it must first issue a query to the database asking for all the data it needs; all that data has to be transported from the database to the app server, and memory has to be allocated to store it. Only after that does your program iterate over the array in memory and calculate the sum; then the array probably needs to be garbage-collected.

If the volume of data is low, the performance difference will probably be insignificant. If the volume is great, however, the difference can be expected to be very significant.
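The difference can be sketched in Java, with a plain Iterator standing in for a database cursor (such as a JDBC ResultSet); the class and method names here are my own:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: a running total needs one long of state no matter how many
// records stream past, while the client-side approach must first hold
// every record in memory.
public class RunningTotal {
    // Streaming: consume records one at a time - O(1) extra memory.
    static long streamingSum(Iterator<Integer> records) {
        long total = 0;
        while (records.hasNext()) total += records.next();
        return total;
    }

    // Materialized: build the full list first (what fetching all rows
    // into an ArrayList does), then iterate - O(n) extra memory.
    static long materializedSum(Iterator<Integer> records) {
        List<Integer> all = new ArrayList<>();
        records.forEachRemaining(all::add);
        long total = 0;
        for (int v : all) total += v;
        return total;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) rows.add(1);
        System.out.println(streamingSum(rows.iterator())); // 100000
    }
}
```

Both return the same sum; the point is where the data lives while it is being accumulated.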

Jeffrey Kemp
  • 59,135
  • 14
  • 106
  • 158
0

Aggregates typically just iterate over the resultset and then perform their aggregation, whether it's sum, average, count, etc.

If you're talking about the complexity of the operation, it's pretty much always O(n) where n is the number of records in your resultset for a simple aggregation.

I don't see why it would take longer in Java, since your array is instantiated in main memory, which is faster than reading from disk as an RDBMS would. Honestly, an aggregation in an RDBMS should be slightly slower than an ArrayList aggregation.

To expand on this: if you wanted ONE row for a specific entry, it would be O(1) for an ArrayList (by index) and O(1) for an RDBMS with an appropriate index (for a standard LinkedList it would be O(n) to get that row, but the same as an ArrayList for the aggregate). Iterating over your entire dataset (be it array or table) and performing an aggregation will pretty much always be O(n).

Scotch
  • 3,186
  • 11
  • 35
  • 50
0

interesting question.

a well-written rdbms is the culmination of thousands of work hours by phd mathematicians and database gurus. your attempt to mimic the performance of MSSQL or postgresql is admirable, but is tilting at windmills (read: futile, if you are not familiar with don quixote).

a common misunderstanding with rdbms is that 'relational' means related tables. it actually refers to the mathematical relation. basically - rdbms focus on set theory. even with a great rdbms, developers can ruin performance by calculating things row by row instead of using the inherent native sets. this is an apt illustration of the performance difference you are experiencing.

if you are limited to doing this calculation in java instead of a db, you should look at optimizing the data structure (smallest data type) and loop efficiency. you still won't be able to compete with sql server or postgres. it may be worth storing the items in a db and querying them from java if you truly need the improved performance.

mson
  • 7,762
  • 6
  • 40
  • 70