
In SparkR I have a DataFrame data that contains the columns id, amount_spent and amount_won.

For example for id=1 we have

head(filter(data, data$id==1))

and output is

1 30 10
1 40 100
1 22 80
1 14 2

I want to know whether a given id has more wins than losses. The amounts themselves can be ignored.

In plain R I can get this to run, but it is slow. Say we have 100 ids. In R I have done this:

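w <- c()
for (j in 1:100) {
  # Collect the rows for a fixed id into a local data.frame
  q <- collect(filter(data, data$id == j))
  # Per-row difference: positive means a win, negative means a loss
  d <- as.numeric(q$amount_won) - as.numeric(q$amount_spent)
  # 1 if the id has more wins than losses, 0 otherwise
  if (sum(d > 0) > sum(d < 0)) {
    w[j] <- 1
  } else {
    w[j] <- 0
  }
}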

Now w holds a 1 or a 0 for each id. In SparkR I want to do this in a faster way.

Ole Petersen

1 Answer


I am not sure whether this is exactly what you want, so feel free to ask for adjustments.

# Local sample data matching the rows shown in the question
df <- data.frame(id = c(1,1,1,1),
                 amount_spent = c(30,40,22,14),
                 amount_won = c(10,100,80,2))

# Per-row flag: 1 if more was won than spent on that row, 0 otherwise
DF <- createDataFrame(sqlContext, df)
DF <- withColumn(DF, "won", DF$amount_won > DF$amount_spent)
DF$won <- cast(DF$won, "integer")

# Count wins and games per id
grouped <- groupBy(DF, DF$id)
aggregated <- agg(grouped, total_won = sum(DF$won), total_games = n(DF$won))

# Fraction of games won per id
result <- withColumn(aggregated, "percentage_won" , aggregated$total_won/aggregated$total_games)

collect(result)

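I added a column won to DF that flags, per row, whether the id won more than it spent. The result contains, for each id, the number of games won (total_won), the number of games played (total_games) and the fraction of games won (percentage_won, a value between 0 and 1). An id has more wins than losses when percentage_won is above 0.5, treating ties as losses.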
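To get the 0/1 vector w from the question, the per-id counts can be compared on the Spark side, so that only one small row per id is collected. The following is a minimal sketch rather than a definitive solution. It assumes the Spark 1.x-style initialisation that goes with the sqlContext used above, the rows for id 2 are made up for illustration, the names counts and flags are hypothetical, and a tie on a row is treated as a loss.

```r
library(SparkR)

# Assumption: Spark 1.x-style setup, matching the sqlContext used above
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Sample data: id 1 is taken from the question, id 2 is made up for illustration
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2),
                 amount_spent = c(30, 40, 22, 14, 5, 10, 50),
                 amount_won = c(10, 100, 80, 2, 20, 30, 0))
DF <- createDataFrame(sqlContext, df)

# Per-row flag: 1 if more was won than spent on that row, 0 otherwise
DF <- withColumn(DF, "won", cast(DF$amount_won > DF$amount_spent, "integer"))

# Wins and games per id, computed by Spark instead of an R loop
counts <- agg(groupBy(DF, DF$id),
              total_won = sum(DF$won),
              total_games = n(DF$won))

# w = 1 when wins outnumber the remaining games (ties count as losses)
flags <- withColumn(counts, "w",
                    cast(counts$total_won * 2 > counts$total_games, "integer"))

# Only one row per id is brought back to the local R session
w_local <- collect(select(flags, "id", "w"))
w_local
```

Compared with the loop in the question, this runs a single grouped aggregation instead of one filter and collect per id, which is where most of the time goes.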

Wannes Rosiers