Sort by two columns, why not do grouping first?

Question

I have two columns, one is a string field customer containing customer names and the other is a numeric field sales representing sales.

What I want to do is to group data by customer and then sort sales within group.

In SQL or Pandas, this is normally achieved by something like order by customer, sales on the table. But I am just curious about this implementation. Instead first sorting on customer and then sorting on sales, why not first group customer and sort sales. I don't really care about the order of the different customers since I only care about records of same customers being grouped together.

Grouping is essentially mapping and should run faster than sorting.

Why isn't there such implementation in SQL? Am I missing something?

Example data name,sales john,1 Amy,1 john,2 Amy,3 Amy,4

and I want it to group by name and then sort by sales: name,sales john,1 john,2 Amy,1 Amy,3 Amy,4

In SQL you probably would do select * from table order by name,sales

This would definitely do the job. But my confusion is since I don't care about the order of name, I should be able to do some kind of grouping (which should be cheaper than sorting) first and do sorting only on the numeric field. Am I able to do this? Why do a lot of examples from google simply uses sorting on the two fields? Thanks!

If I understand what you're asking, I don't know why this wouldn't be possible.You should be able to do this with both a `GROUP BY` and an `ORDER BY`. — Shawn, Jan 02 '18 at 18:43
Probably a great example here as to why you should provide your code/query and not just an (open ended) question. You would have an answer already if you did. Since we only are able to guess at what you are thinking, the answer we can give is "yes you can order and group in the same query". — Jacob H, Jan 02 '18 at 18:45
Example here: [SQL Group By with an Order By](https://stackoverflow.com/questions/27983/sql-group-by-with-an-order-by) — Jacob H, Jan 02 '18 at 18:48
Actually looking deeper into this issue, it may be one of the situations where Pandas is more complex than just generic SQL. I'm not familiar with Pandas, so it may be easier to do than I think, but again, this isn't a very complex operation. Follow Jacob's suggestion and write out an example of what you're trying to do. You may be able to answer your own question. (And if so, please post it back here.) — Shawn, Jan 02 '18 at 19:03
Thanks for your reply. This is actually an open question. I was trying to implement an external sorting algorithm and realized the first field (which is the name field) doesn't need to be sorted (although sorting will generate correct answers too). And I googled for a while and found a lot of examples and suggestions just using sorting on the two fields consecutively. I was just wondering if there was any particular reason that people prefer two sorts rather than group and sort. Or maybe I was wrong and people do use group and sort a lot. — Cheng, Jan 02 '18 at 20:27
@JacobH Thanks Jacob. I guess the link you provided won't work. The problem is groupby need aggregate function (which is "count" in your example) and that makes perfect sense for a reduce process. But in my case, I only need to group and reorder. — Cheng, Jan 03 '18 at 00:14
I added a Pandas tag. This question deals more with the specifics of how to GROUP and ORDER with Pandas. This is implemented and quite simple with generic SQL, however it seems to operate differently with Pandas. (But again, I am not fluent with Pandas, so it may be easier than I think.) — Shawn, Jan 03 '18 at 04:11
@Shawn I added more details in my questions. Can you specify how you would do this in SQL using only groupby and orderby? You need some sorts of aggregation function when you do groupby, and that field should also appear in your orderby clause, right? — Cheng, Jan 03 '18 at 04:28
Actually, depending on your actual data, you may not need to even use a `GROUP BY`. Very simple `SELECT name, sales FROM myTable ORDER BY name, sales` should work. You care about the order of `name` in the sense that you are using that to "group" your results together. — Shawn, Jan 03 '18 at 14:46
@Shawn Exactly why I asked the question at the first point. I knew order by two columns would work. I just want to know why not do "group and sort" since group should be computationally cheaper than sorting. I think know if you read through my original question, you would understand it better. — Cheng, Jan 03 '18 at 18:11
@cheng You don't need a `GROUP BY` because you aren't doing any aggregation here. If you were counting/summing/whatever your sales, you would need the grouping. The bigger issue here is that you seem to be using your data output to specify your display. These are two different areas of concern. Your data is your data; your application controls the display of that data. You shouldn't rely on your query to order or group your display. — Shawn, Jan 03 '18 at 18:20
@Shawn I didn't say I need GROUP BY in my question. I said grouping but I apologize if that confused you. This grouping should be similar operation to mapping. And the concern is more from computational standpoint. Please take a look at my reply to @ sqluser below. Thank you. — Cheng, Jan 03 '18 at 18:26
I think you may be making this issue much more complex than it really is. I understand what you are trying to say. From a "mapping" standpoint, you're trying to do the job that the query engine will already do for you; and do quite simply and quite well. From a query standpoint, there is no grouping in your example. There is only ordering of your result, again something that should not be dealt with in your query but in your display. You may need to come up with a more relevant example to the issue you're trying to solve. — Shawn, Jan 03 '18 at 18:44

score 0 · Answer 1 · answered Jan 03 '18 at 05:34

0

Here is the answer to it-

Grouping is done when you want to pull out the conclusion based on the entire group , like total of sales done,for each of the groups(in this case John and Amy) . It is used mostly with an aggregate function or sometimes to select distinct records only. What you wrote above is sorting the data in the order of name and sales , there is no grouping involved at all. Since the operation is sorting , its obvious that the command written for it would be sorting .

answered Jan 03 '18 at 05:34

jptr

178
10

Thanks for your answer. I certainly know sorting should do the job. I was just thinking from the computational efficiency standpoint. Maybe grouping is not the right term since it might be confused with groupby clause. I consider grouping similar to mapping. For records with same "name", they can be mapped to the same group, which takes O (n) operations to scan, and then sort within group. if using two sortings, the first sorting should be O(n log n) depending on the sorting algorithms and should be slower (although it should have better space complexity) . – Cheng Jan 03 '18 at 18:18
Sorting does give grouping result but the order of the group seems to be redundant information and shouldn't worth the additional complexity – Cheng Jan 03 '18 at 18:20
Complexity is a different concern altogether , and has more significance when you have multiple options to do the same thing. In a scenario like this you have just one option to achieve the desired outcome which is using sort.So if you rule out that stating the complexity , even though you may be correct it doesn't solve the problem. The order of the group is not redundant and cant be ruled out , because to group objects belonging to the same cluster the first thing you need is to sort the entire dataset and then only the nearest objects can be compared,whether its SQL or a Shell Script – jptr Jan 04 '18 at 03:44
I don't know why you say I don't have multiple options here. Scanning over all records and assign them into several groups can be easily achieved by say using a dictionary or nested lists in any programming languages. The order among different groups are redundant because I don't care if the names of the group are ranked alphabetically. But for SQL, you might be right order by two columns might be the only option. And that is exactly why I asked this question: if group and sort are more efficient than sort twice, why SQL doesn't include this implementation. Or is it because SQL only caresabout – Cheng Jan 04 '18 at 18:05
space complexity and using those data structures will consume a lot of memory? Or because SQL has indexing option so maybe that already serve as a grouping mechanism? – Cheng Jan 04 '18 at 18:07

Sort by two columns, why not do grouping first?

1 Answers1