5

I currently have a data that looks like this for multiple ids (that range until around 1600)

id  year    name    status
1   1980    James   3
1   1981    James   3
1   1982    James   3
1   1983    James   4
1   1984    James   4
1   1985    James   1
1   1986    James   1
1   1987    James   1
2   1982    John    2
2   1983    John    2
2   1984    John    1
2   1985    John    1

I want to subset this data so that it only has the information for status=1 and the status right before that. I also want to eliminate multiple 1s and only save the first 1s. In conclusion I would want:

id  year    name    status
1   1984    James   4
1   1985    James   1
2   1983    John    2
2   1984    John    1

I'm doing this because I'm in the process of figuring out in what year how many people from certain status changed to status 1. I only know the subset command and I don't think I can get this data from doing subset(data, subset=(status==1)). How could I save the information right before that

I want to add to this question one more time - I did not get same results when I applied the first reply to this question (which uses plr packages) and the third reply which uses duplicated command. I found out that the first reply preserved information accurately while the third one did not.

halo09876
  • 2,725
  • 12
  • 51
  • 71

4 Answers4

4

This does what you want.

library(plyr)

ddply(d, .(name), function(x) {
  i <- match(1, x$status)
  if (is.na(i))
    NULL
  else
    x[c(i-1, i), ]
})

  id year  name status
1  1 1984 James      4
2  1 1985 James      1
3  2 1983  John      2
4  2 1984  John      1
Mark Heckmann
  • 10,943
  • 4
  • 56
  • 88
  • 1
    +1 and apologies for the failed edit, I'll leave it here as a comment: you can use `match(1, x$status)` instead of `which(x$status == 1)[1]` – flodel Jan 17 '14 at 19:52
  • @flodel yes, that's better, thanks! I should get used to using `match` more often! – Mark Heckmann Jan 17 '14 at 19:52
  • changed according to @flodel's suggestion. – Mark Heckmann Jan 17 '14 at 19:54
  • @MarkHeckmann I have one question - can I use the same thing to save five entries before status=1 through c(i-5)? OR will this just save the information on five years before turning into 1? – halo09876 Jan 18 '14 at 06:43
  • @MarkHeckmann By the way, thank you for the answer it worked like magic! – halo09876 Jan 18 '14 at 06:43
  • @ Rusuer9000 Thanks! ... Yes, you can use `i-5` to go back 5 entries as well. – Mark Heckmann Jan 18 '14 at 10:10
  • @MarkHeckmann Hi, I recently found out that your code doesn't work perfectly on the data set I'm working on...it doesn't seem to remember x[c(i-1, i), ] on some of the observations (person). Would the structure of the data matter here (for instance, do you think id, status, and job should be numeric)? I'm trying to figure out what may be causing the problem but I can't seem to figure this out. In addition, I ran the codes below and they all give me different answers. – halo09876 Feb 05 '14 at 12:49
  • @Rusuer9000. Best make a reproducible example/error from it. Otherwise it will be impossible to tell where the problem is. – Mark Heckmann Feb 06 '14 at 12:47
4

Here's a solution - for each grouping of numbers (the cumsum bit), it looks at the first one and takes that and the previous row if status is 1:

library(data.table)
dt = data.table(your_df)

dt[dt[, if(status[1] == 1) c(.I[1]-1, .I[1]),
        by = cumsum(c(0,diff(status)!=0))]$V1]
#   id year  name status
#1:  1 1984 James      4
#2:  1 1985 James      1
#3:  2 1983  John      2
#4:  2 1984  John      1
eddi
  • 49,088
  • 6
  • 104
  • 155
2

Using base R, here is a way to do this:

# this first line is how I imported your data after highlighting and copying (i.e. ctrl+c)
d<-read.table("clipboard",header=T)

# find entries where the subsequent row's "status" is equal to 1
# really what's going on is finding rows where "status" = 1, then subtracting 1  
# to find the index of the previous row
e<-d[which(d$status==1)-1 ,]
# be careful if your first "status" entry = 1...

# What you want
# Here R will look for entries where "name" and "status" are both repeats of a 
# previous row and where "status" = 1, and it will get rid of those entries
e[!(duplicated(e[,c("name","status")]) & e$status==1),]

   id year  name status
 5  1 1984 James      4
 6  1 1985 James      1
10  2 1983  John      2
11  2 1984  John      1
Jota
  • 17,281
  • 7
  • 63
  • 93
  • 1
    For Mac Users, `x <- read.delim(pipe("pbpaste"))` as per http://stackoverflow.com/questions/13438556/how-do-i-copy-and-paste-data-into-r – kdauria Jan 17 '14 at 21:08
0

I like the data.table solution myself, but there actually is a way to do it with subset.

# import data from clipboard
x = read.table(pipe("pbpaste"),header=TRUE)

# Get the result table that you want
x1 = subset(x, status==1 | 
               c(status[-1],0)==1 )
result = subset(x1, !duplicated(cbind(name,status)) )
kdauria
  • 6,300
  • 4
  • 34
  • 53