
In R I have a data.frame data where head(data) gives

user  action      information 
12    2012-01-01  12323
11    2014-03-02  24445
12    2012-02-05  32234
....

I want to create a new dataset that contains only each user and their birth, i.e. the date of their first action. For user 12, for example, that is 2012-01-01.

In sparkR I know how to do this but I was wondering how to do it in R. In sparkR I simply did this

new=groupBy(data, data$user)
new_data=agg(new, birth=first(data$action))
# Making it local (from a DataFrame to a data.frame)
local_new_data=collect(new_data)

Now this list can be saved as a CSV file with write.csv("...").

Thanks.

Update

I had a data set in sparkR on which I ran the sparkR code to get a list of users and their births. My problem is that I got a new computer and haven't installed sparkR on it yet (I'm still working hard on this). I simply need someone to run my code in sparkR so I can get the list. I have both the dataset and the code ready to execute. I really hope somebody can help me.

My answer

I tried solving it a different way, and for some reason it runs very fast. Since the action column is sorted, I simply did this

s=data[!duplicated(data$user),]

Now s contains the first row for each user, so action is their birth. To keep only those two columns I simply do this

ss=cbind(as.character(s$user), as.character(s$action))

and this runs very fast in R for some reason.
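The de-duplication idea above can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy data frame is made up to mirror the three rows shown in the question, the action column is assumed to be convertible with as.Date, and the names sorted and births are hypothetical. Sorting explicitly first means the result does not depend on the rows already being in date order.

```r
# Toy data mirroring the rows shown in the question (an assumption, not the real data)
data <- data.frame(
  user = c(12, 11, 12),
  action = as.Date(c("2012-01-01", "2014-03-02", "2012-02-05")),
  information = c(12323, 24445, 32234)
)

# Sort by action so that each user's earliest row comes first
sorted <- data[order(data$action), ]

# Keep the first row seen for each user, and only the two columns of interest
births <- sorted[!duplicated(sorted$user), c("user", "action")]
names(births)[2] <- "birth"
births
```

Note that duplicated is applied to the user column only; applying it to the whole data frame would drop only rows that are identical in every column.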

My question is not a duplicate; it differs considerably from the two other questions that some claim it duplicates.

1 Answer


In R, using dplyr, the syntax is almost the same, as it also has a first function along with group_by (in place of groupBy)

library(dplyr)
data %>%
     group_by(user) %>%
     summarise(birth = first(action))

Or another option is data.table

library(data.table)
setDT(data)[, .(birth = action[1L]) , by = user]
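Both snippets above take the first action in each group, so they give the birth only if the rows are already sorted by action. As a hedged alternative, here is a base R sketch that takes the minimum date per user and so does not depend on row order. The toy data is made up (deliberately unsorted), action is assumed to be a Date column, and per_user and births_min are hypothetical names.

```r
# Toy data, deliberately NOT sorted by action (an assumption for illustration)
data <- data.frame(
  user = c(12, 11, 12),
  action = as.Date(c("2012-02-05", "2014-03-02", "2012-01-01")),
  information = c(32234, 24445, 12323)
)

# Split the rows by user and take the earliest action within each piece
per_user <- lapply(split(data, data$user), function(d) {
  data.frame(user = d$user[1], birth = min(d$action))
})

# Stack the one-row results back into a single data.frame
births_min <- do.call(rbind, per_user)
rownames(births_min) <- NULL
births_min
```

This split/lapply approach is easy to read but is not tuned for large data; for big tables the grouped dplyr or data.table forms with min(action) in place of the first element are likely the better fit.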
akrun
  • I can run this code in R, but it takes an extremely long time even for only 10000 iterations. Is there a faster way to solve it? – user6678274 Aug 30 '16 at 12:55
  • @user6678274 I updated with a `data.table` option. Perhaps it helps. – akrun Aug 30 '16 at 14:15
  • I get clean output, but the birth column has the same value in every row, which it should not. Does the command 'action[1L]' give the first birth overall, and therefore the same value for all rows? – user6678274 Aug 31 '16 at 07:50
  • @user6678274 It is the data.table command that corresponds to the `dplyr` code. It groups by 'user' and selects the first 'action' for each 'user'. So I am not sure how you are getting the same value in all rows. – akrun Aug 31 '16 at 08:12
  • It is the last line 'setDF(data)[, .birth=action(1L),by=user ]' . Do you have sparkR on your computer? – user6678274 Aug 31 '16 at 08:14
  • @user6678274 No, I don't. But if you check it in plain R with an example like the one you showed, it gives the same output – akrun Aug 31 '16 at 08:15