
I have a dataframe, similar to the example below, but larger (15000 rows):

df.example <-structure(list(Date = structure(c(3287, 3386, 4286, 5286, 6286), class = "Date"),v1 = c(1L, 1L, 1L, 1L, 1L), v2 = c(0.60378, 12.82581, 3.55357, 4.96079, 0.0422),perc = c(0.598, 0.598, 0.609, 1, 0.609), v3 = c(-99, -99, 5.83509031198686, 4.96079,0.0692939244663383)), .Names = c("Date", "v1", "v2", "perc", "v3"), row.names = c(1L, 100L, 1000L, 2000L, 3000L), class = "data.frame")

df.example:

           Date v1       v2  perc           v3
1    1979-01-01  1  0.60378 0.598 -99.00000000
100  1979-04-10  1 12.82581 0.598 -99.00000000
1000 1981-09-26  1  3.55357 0.609   5.83509031
2000 1984-06-22  1  4.96079 1.000   4.96079000
3000 1987-03-19  1  0.04220 0.609   0.06929392

What I would like to do is calculate the percentage of rows whose value in column "perc" is at or below a given threshold. I would like to repeat this for each of the threshold values given below:

### "certain threshold values":
seq(from =0, to = 1, by = 0.1)


### formula to be repeated/iterated/looped: (the i stands for "certain value")
100*sum(df.example$perc<=i)/nrow(df.example)

I would like the outcome to be a vector called "vector1", like the example below:

vector1 <- c(0,0,0,0,0,0,0.2,0.6,0.6,0.6,1.0)    

This is what I have so far, but it is not working:

### create vector to store calculated values in
vector1=c()
vector1[1]=3

### loop calculation of percentage of rows that are below "certain threshold value" in column df.example$perc
for(i in seq(0,1, by=0.1)){
vector1[i]=sum(df.example$perc<=i)/nrow(df.example)
}

I only get one value, which I would expect to be the last one of my vector1.

I have already looked at similar topics on SO, such as "R create a vector with loop structure" and "How to make a vector using a for loop".

Any suggestions?

By the way: please comment if the dput() output I posted doesn't create the data to work with; it's the first time I have used dput().

T. BruceLee
  • You may need `s1 <- seq(0, 1, 0.1); for(i in seq_along(s1)){vector1[i]=sum(df.example$perc<=s1[i])/nrow(df.example) }`. Also, initialize `vector1 <- numeric(length(s1))` – akrun Nov 07 '16 at 14:58
  • The difference between `for(i in seq_along(seq(0,1, by=0.1))){print(i)}` and `for(i in seq(0,1, by=0.1)){print(i)}` should explain the solution – joel.wilson Nov 07 '16 at 14:59

3 Answers


There is no need to compute the number of rows on every iteration; assign it to a variable once. Then you can use sapply:

nrow_df <- nrow(df.example)
sapply(seq(from =0, to = 1, by = 0.1), function(x) sum(df.example$perc<=x)/nrow_df)
# [1] 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.8 0.8 0.8 1.0

Or (vectorized)

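indx <- seq(0, 1, by=0.1)
# one row per threshold, each row holding every perc value; indx is recycled down the columns
rowSums(matrix(df.example$perc, length(indx), nrow(df.example), byrow=TRUE) <= indx) / nrow(df.example)
## [1] 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.8 0.8 0.8 1.0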
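
As a small variation on the sapply call above, the sketch below relies on the fact that mean() of a logical vector is the share of TRUE values, so the row count never has to be stored. The names thresholds, prop_below and pct_below are hypothetical, and perc is copied from the question's example data rather than read from df.example so that the snippet runs on its own.

```r
# perc column copied from the question's example data
perc <- c(0.598, 0.598, 0.609, 1, 0.609)
thresholds <- seq(from = 0, to = 1, by = 0.1)   # hypothetical variable name

# mean() of a logical vector is the share of TRUE values,
# so no separate division by the row count is needed
prop_below <- vapply(thresholds, function(x) mean(perc <= x), numeric(1))

# multiply by 100 to get percentages, as in the question's formula
pct_below <- 100 * prop_below
```

vapply is used here instead of sapply only because it checks that each result is a single number; sapply would return the same values.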
David Arenburg
Cath

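The original loop fails because i takes the values 0, 0.1, ..., 1 and is used directly as an index: R truncates fractional indices to 0, and assigning to index 0 does nothing, so only i = 1 stores a value. We need to initialize vector1 and loop over the positions of the sequence instead: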

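s1 <- seq(0, 1, 0.1)
# preallocate one slot per threshold (not per row); numeric(nrow(df.example)) would leave trailing zeros on a larger data frame
vector1 <- numeric(length(s1))
for(i in seq_along(s1)){
   vector1[i] <- sum(df.example$perc <= s1[i])/nrow(df.example)
}
vector1
#[1] 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.8 0.8 0.8 1.0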

Or a vectorized approach would be

rowSums(outer(s1, df.example$perc, FUN = `>=`))/nrow(df.example)
#[1] 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.8 0.8 0.8 1.0
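
To see why looping over the threshold values themselves stores only one result, here is a minimal sketch. The vectors v_values and v_positions are hypothetical demo names, not part of the question's code.

```r
s1 <- seq(0, 1, by = 0.1)

# Looping over the VALUES: i is 0, 0.1, ..., 1.
# R truncates fractional indices toward zero, so 0 to 0.9 all become
# index 0, and assigning to index 0 is silently ignored.
v_values <- numeric(0)
for (i in s1) {
  v_values[i] <- i
}
length(v_values)     # only the i = 1 iteration stored anything

# Looping over the POSITIONS: i is 1, 2, ..., 11.
v_positions <- numeric(length(s1))
for (i in seq_along(s1)) {
  v_positions[i] <- s1[i]
}
length(v_positions)  # one slot per threshold
```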
akrun
  • Your second vectorized approach also worked on the larger dataset. The first approach did not. Thanks for the help! – T. BruceLee Nov 07 '16 at 16:29

Here is a fourth method using outer and colSums:

colSums(outer(df.example$perc, seq(from=0, to=1, by=0.1), "<=")) / nrow(df.example)
# [1] 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.8 0.8 0.8 1.0

outer creates a logical matrix that holds the result of the threshold test for each element-threshold pair: one row per element of perc and one column per threshold. The "successes" in each column are summed with colSums, and this count is divided by the number of elements tested.
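
A minimal sketch of what that intermediate matrix looks like, assuming the perc values from the question's example data; the names perc, thresholds and m are hypothetical.

```r
# perc column copied from the question's example data
perc <- c(0.598, 0.598, 0.609, 1, 0.609)
thresholds <- seq(from = 0, to = 1, by = 0.1)

# m[r, k] is TRUE when perc[r] <= thresholds[k]
m <- outer(perc, thresholds, "<=")

dim(m)    # rows = elements of perc, columns = thresholds
m[, 7]    # the tests against the 7th threshold (about 0.6)

# proportion of elements at or below each threshold
colSums(m) / length(perc)
```

Note that the matrix has length(perc) * length(thresholds) cells, so memory use grows with both the number of rows and the number of thresholds.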

lmo