I have a data frame with a text column name and a factor column city. It is ordered alphabetically, first by city and then by name. I need a data frame that contains only the nth element in each city, keeping this ordering. How can this be done neatly, without loops?

I have:

name    city
John    Atlanta
Josh    Atlanta
Matt    Atlanta
Bob     Boston
Kate    Boston
Lily    Boston
Matt    Boston

I want a function that returns the nth element by city, e.g., for the 3rd:

name    city
Matt    Atlanta
Lily    Boston

It should return NULL for name if n is out of range for the selected city, e.g., for the 4th:

name    city
NULL    Atlanta
Matt    Boston

Using only base R, please.

Gavin Simpson
sashkello
  • Could you give a reproducible example? Say, show a short example dataframe similar to what you have and another showing what you want it to become? – sebastian-c Oct 12 '12 at 00:00
  • with `plyr`: `ddply(yourdata, .(city), function(x, n) x[n,], n=10)` But what if you're selecting an `n` greater than the number of entries for a city? – Justin Oct 12 '12 at 00:02
  • can this be done using dplyr? – steadyfish Jul 30 '14 at 17:20

3 Answers

In base R using by:

Set up some test data, including an additional out of range value:

test <- read.table(text="name    city
John    Atlanta
Josh    Atlanta
Matt    Atlanta
Bob     Boston
Kate    Boston
Lily    Boston
Matt    Boston
Bob     Seattle
Kate    Seattle",header=TRUE)

Get the 3rd item in each city:

do.call(rbind,by(test,test$city,function(x) x[3,]))

Result:

        name    city
Atlanta Matt Atlanta
Boston  Lily  Boston
Seattle <NA>    <NA>

To get exactly what you want, here is a little function:

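nthrow <- function(dset,splitvar,n) {
    # take row n from each group; groups with fewer than n rows give an all-NA row
    result <- do.call(rbind,by(dset,dset[splitvar],function(x) x[n,]))
    # the group names survive as row names, so use them to refill the NA group labels
    nogroup <- is.na(result[,splitvar])
    result[,splitvar][nogroup] <- row.names(result)[nogroup]
    row.names(result) <- NULL
    return(result)
}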

Call it like:

nthrow(test,"city",3)

Result:

  name    city
1 Matt Atlanta
2 Lily  Boston
3 <NA> Seattle
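
If only the name column is needed, the same idea can be sketched more compactly with split and vapply. This is an illustration rather than part of the answer above: the helper name nth_by_group and its argument names are hypothetical, and it assumes the value column can be treated as character. Indexing past the end of a vector yields NA, which covers the out-of-range case from the question (here the 4th element):

```r
# Hypothetical helper (not from the original answer): nth value of one
# column within each group, base R only.
nth_by_group <- function(dset, valuevar, splitvar, n) {
    # split the values by group; as.character() avoids factor codes leaking out
    groups <- split(as.character(dset[[valuevar]]), dset[[splitvar]])
    # x[n] is NA when n is larger than the group size
    picked <- vapply(groups, function(x) x[n], character(1))
    out <- data.frame(unname(picked), names(picked), stringsAsFactors = FALSE)
    names(out) <- c(valuevar, splitvar)
    out
}

test <- data.frame(
    name = c("John", "Josh", "Matt", "Bob", "Kate", "Lily", "Matt", "Bob", "Kate"),
    city = c(rep("Atlanta", 3), rep("Boston", 4), rep("Seattle", 2))
)

nth_by_group(test, "name", "city", 4)
##   name    city
## 1 <NA> Atlanta
## 2 Matt  Boston
## 3 <NA> Seattle
```

Unlike nthrow, this sketch returns only the two named columns, so any other columns in the data frame are dropped.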
thelatemail
  • beat me to it. @sashkello please try to be as specific as possible in your initial questions, especially when using extra packages is out of the question since so much of R is built on user contributed features. – Justin Oct 12 '12 at 00:15

A data.table solution

library(data.table)
DT <- data.table(test)

# return all columns from the subset data.table
n <- 4
DT[,.SD[n,] ,by = city]
##      city name
## 1: Atlanta   NA
## 2:  Boston Matt
## 3: Seattle   NA

# if you just want the nth element of `name` 
# (excluding other columns that might be there)
# any of the following would work

DT[,.SD[n,] ,by = city, .SDcols = 'name']


DT[, .SD[n, list(name)], by = city]


DT[, list(name = name[n]), by = city]
mnel
  • `selectedCol <- "city"; step <- 4; DT[, .SD[seq(1, .N, by = step), ], by = selectedCol]` also works, though I don't know whether it's better or worse – Jojostack Oct 15 '21 at 10:32
You can use plyr for this:

dat <- structure(list(name = c("John", "Josh", "Matt", "Bob", "Kate", "Lily", "Matt"),
                      city = c("Atlanta", "Atlanta", "Atlanta", "Boston", "Boston", "Boston", "Boston")),
                 .Names = c("name", "city"), class = "data.frame", row.names = c(NA, -7L))

library(plyr)

ddply(dat, .(city), function(x, n) x[n,], n=3)

> ddply(dat, .(city), function(x, n) x[n,], n=3)
  name    city
1 Matt Atlanta
2 Lily  Boston
> ddply(dat, .(city), function(x, n) x[n,], n=4)
  name   city
1 <NA>   <NA>
2 Matt Boston

There are plenty of other options too using base R or data.table or sqldf...
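
One more such option, as a rough sketch for the dplyr question raised in the comments: it assumes the dplyr package is installed, and the variable name k is only illustrative. dplyr's nth() returns a missing value when the requested position is out of range, so cities with fewer than k names come back as NA and are not dropped:

```r
library(dplyr)

test <- data.frame(
    name = c("John", "Josh", "Matt", "Bob", "Kate", "Lily", "Matt", "Bob", "Kate"),
    city = c(rep("Atlanta", 3), rep("Boston", 4), rep("Seattle", 2))
)

k <- 4  # which element to take within each city (illustrative name)

res <- test %>%
    group_by(city) %>%
    summarise(name = nth(name, k))  # NA where a city has fewer than k rows

res
```

The result has one row per city, with the grouping column first.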

Justin