In R I have a data.frame data, where head(data) gives:
user action information
12 2012-01-01 12323
11 2014-03-02 24445
12 2012-02-05 32234
....
I want to create a new dataset that contains only each user and their birth, i.e. the date of their first action. For user 12, for example, it's 2012-01-01.
I know how to do this in sparkR, but I was wondering how to do it in plain R. In sparkR I simply did this:
new=groupBy(data, data$user)
new_data=agg(new, birth=first(data$action))
# Making it local (from a DataFrame to a data.frame)
local_new_data=collect(new_data)
Now this list can be saved as a CSV file with write.csv("...").
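For comparison, here is a minimal base R sketch of the same group-and-aggregate step. It assumes the action column can be parsed as a Date, and it uses min() to pick the earliest action per user rather than the positional first() from the sparkR code, so it does not depend on row order. The toy data frame and the name birth are illustrative only.

```r
# Toy data mirroring the first rows shown in the question (illustrative only)
data <- data.frame(
  user = c(12, 11, 12),
  action = as.Date(c("2012-01-01", "2014-03-02", "2012-02-05")),
  information = c(12323, 24445, 32234)
)

# Group by user and keep the earliest action date in each group
birth <- aggregate(action ~ user, data = data, FUN = min)

# Rename the aggregated column to match the sparkR result
names(birth)[names(birth) == "action"] <- "birth"

print(birth)
```

The resulting data.frame has one row per user and can be passed straight to write.csv().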
Thanks.
Update
I had a dataset in sparkR on which I ran the sparkR code to get a list of users and their births. My problem is that I got a new computer and haven't installed sparkR on it yet (I'm still working hard on this). I simply need someone to run my code in sparkR so I can get the list. I have both the dataset and the code ready to execute. I really hope somebody can help me.
My answer
I tried to solve it a different way, and for some reason it runs very fast. Since the action column is sorted, I simply did this:
s=data[!duplicated(data$user),]
Now s contains one row per user, namely the row where action is their birth. To keep only the user and birth columns I simply do this:
ss=cbind(as.character(s$user), as.character(s$action))
This runs very fast in R for some reason.
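A self-contained sketch of this duplicated() idea is below. It assumes duplicates should be detected on the user column alone, and it sorts by action explicitly so the result does not rely on the input already being ordered. The toy data and the names sorted and first_rows are hypothetical.

```r
# Toy data, deliberately not sorted by action (illustrative only)
data <- data.frame(
  user = c(12, 11, 12),
  action = as.Date(c("2012-02-05", "2014-03-02", "2012-01-01")),
  information = c(32234, 24445, 12323)
)

# Sort by action so each user's earliest row comes first
sorted <- data[order(data$action), ]

# duplicated() flags every repeat of a user after its first occurrence,
# so negating it keeps exactly one (the earliest) row per user
first_rows <- sorted[!duplicated(sorted$user), c("user", "action")]

print(first_rows)
```

Both order() and duplicated() are vectorized, which is likely why this approach feels fast even on large data frames.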
My question is not a duplicate; it differs substantially from the two other questions that some claim it duplicates.