Extract top 50 in a column, by factors in another column

Question

I have a dataframe of prescribing data from UK practices. The original data is at http://datagov.ic.nhs.uk/T201207.exe. I've wrangled it into a PCT level data frame, ordered by PCT and by the most common prescription (descending order in the 'items' column).

       pct sha chem.code items      nic act.cost
32360  5ZW Q39 0212000Y0 12421 17811.40 16888.21
28769  5ZW Q39 0209000A0  8741  7834.43  7554.72
4439   5ZW Q39 0103050P0  7733 21566.51 20210.05
...
82763  5D7 Q30 0603020L0     1     1.08     1.13
152673 5D7 Q30 1502010C0     1     0.92     0.85
5149   5D7 Q30 0104020N0     1     0.70     0.68
149501 5D7 Q30 1311060I0     1     0.50     0.49

There are 151 PCTs and each has over 1000 items. I want to extract the top 50 items for each PCT. I know I could write a for loop and just iterate over the levels of pct, but that's not the R way. I haven't figured out how to use apply or sapply to take the subset within each level. Those functions seem to be better at operating on entire columns than at selecting a subset of the rows.

[check this out](http://stackoverflow.com/questions/14800161/how-to-find-the-top-n-values-by-group-or-within-category-groupwise-in-an-r-dat) :) — Anthony Damico, Feb 24 '13 at 14:10
@Arun the executable file is a 'self-expanding zip' which contains 2 csv files, which are the data. Thank the NHS. — Suz, Feb 24 '13 at 14:56
Thank you @Anthony. I spent about an hour looking, but I guess I didn't use the correct terms. I've added a couple of tags to that one so it might be more findable for the next person. — Suz, Feb 24 '13 at 15:05
I was going to suggest that this can be done straightforwardly with data.table, but it appears @arun already pointed that out in the question Anthony linked to. Perhaps close this as a duplicate? – Ricardo Saporta, Feb 24 '13 at 19:30
yep. This is definitely a duplicate. I'll see if the close button works... Uh, ur. Nope. I need 3 others to vote to close as well. Any takers? — Suz, Feb 24 '13 at 21:08

score 1 · Answer 1 · answered Feb 24 '13 at 14:06

Not quite sure if I get it, but my best guess is this:

require(plyr)
ddply(df, .(pct), function(x) x[1:50, ])

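This'll pick the first 50 rows for each pct (assuming every pct definitely has at least 50 rows; otherwise `x[1:50, ]` pads that group with NA rows). Because it selects rows by position, it relies on the data already being sorted by descending items within each pct.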
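For readers who want to stay with base functions, the same "first n rows per group" idea can be sketched with `split()`, `lapply()` and `head()`. This is only an illustration: the toy data frame and the name `top_n` are made up here, and `n` is set to 2 so the result is easy to read (it would be 50 for the real data).

```r
# Toy stand-in for the prescribing data (values invented for illustration).
df <- data.frame(
  pct   = c("5ZW", "5ZW", "5ZW", "5D7", "5D7", "5A1"),
  items = c(12421, 8741, 7733, 3, 1, 40),
  stringsAsFactors = FALSE
)

# Sort by pct, then by descending items, so "first n rows" means "top n".
df <- df[order(df$pct, -df$items), ]

n <- 2  # would be 50 for the real data

# split() gives one data frame per pct; head() keeps at most n rows of each,
# so a pct with fewer than n rows comes back whole rather than padded with NA.
top_n <- do.call(rbind, lapply(split(df, df$pct), head, n))
rownames(top_n) <- NULL
top_n
```

The same `head(x, 50)` call should also work as the function body inside `ddply()` if the one-liner form is preferred.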

answered Feb 24 '13 at 14:06

Arun


This is a good answer, and it works. I've voted it up. I have been trying to learn the R way and staying with base functions, but I may have to give in. I keep seeing plyr used in useful ways. I have voted to close this question as it is identical to a previous one ('how to find top N values by group...'). However, the plyr way is not suggested on that question. Perhaps you could add it there. (I'm happy to vote it up..) – Suz Feb 24 '13 at 21:14
This answer and the one you were linked to are NOT the same. This just picks the first 50 elements, irrespective of ties. They are similar, but not identical. I don't mind voting to close the question since you've done so. But read the other post carefully and see if that's what you require, because from your question, it isn't obvious. – Arun Feb 24 '13 at 22:51
In this case, I don't care about ties. I have ordered the data on 3 fields. I'm using one as a factor to group the data, a second as a ranking that I'm interested in, and the 3rd to define the edges (break ties). So it's well resolved. The other question includes this case as a subset, and @Ista's 1st suggestion there answered my question. Answers on that page *also* address the question of ties in some detail, but as a secondary issue. I don't see that the questions are sufficiently distinct to keep this question open, but perhaps your point is that `ddply()` won't handle the ties. – Suz Feb 26 '13 at 13:44
Both Ista's and my first solution there answer this, to be precise. However, that wasn't supposed to be the answer to THAT question, as Anthony specifically asked for dealing with ties in the question. However, the issue seems to be resolved. The question seems to be closed. All is well. Good luck. – Arun Feb 26 '13 at 13:50
I should mention that the `ddply()` solution provided *here* won't handle ties. You can always get `ddply()` to handle ties within your function definition. – Arun Feb 26 '13 at 13:51
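One possible way to handle ties inside the per-group function, as the last comment suggests, is to rank the rows and keep every row whose rank falls within the top n. This is a sketch under assumptions, not the approach from the linked question: `keep_top_with_ties` is a hypothetical helper name, the data are invented, and "handling ties" is taken to mean "keep all rows tied at the cutoff".

```r
# Toy data with a tie for second place in pct 5ZW (values invented).
df <- data.frame(
  pct   = c("5ZW", "5ZW", "5ZW", "5ZW", "5D7", "5D7"),
  items = c(10, 7, 7, 2, 5, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows ranked in the top n by items.
keep_top_with_ties <- function(x, n) {
  # Rank 1 is the largest items value; tied rows share the lowest rank
  # of the tie, so all rows tied at the cutoff are kept together.
  r <- rank(-x$items, ties.method = "min")
  x[r <= n, , drop = FALSE]
}

n <- 2  # would be 50 for the real data
top_ties <- do.call(rbind, lapply(split(df, df$pct), keep_top_with_ties, n = n))
rownames(top_ties) <- NULL
top_ties
```

A group can therefore return more than n rows when values tie at the cutoff. Since `ddply()` forwards extra arguments to its function, the helper could presumably be used as `ddply(df, .(pct), keep_top_with_ties, n = 50)` as well.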
