0

I have a data frame of about 3000 rows. I wish to find the longest run of positive and negative numbers.

My example dataframe: df

1   0.502310591 
2   -0.247577976    
3   -0.307256769    2
4   0.442253678 
5   -0.795770351    
6   2.08244648  
7   -0.01672777 
8   -0.164145656    2
9   0.610117365 
10  0.014758371 
11  0.381105476 
12  0.721386493 4
13  -0.363222383    
14  0.201409322 
15  0.724867214 
16  -1.586829584    
17  1.066288451 
18  0.182824494 
19  0.237447191 
20  -0.215475797

Longest positive run: 4 Longest negative run: 2

I am following this tutorial: https://ocw.mit.edu/ans7870/18/18.05/s14/html/r-tut-rle.html

I need to fund the longest run of values >0 and also <0. So is there any way to edit the above?

I guess this only finds longest lengths of 1,0? If that case then i would need a helper column ifelse 1,0 to split the negative , positive... then perhaps find longest length with this code:

> df$wins <- ifelse(df$V2 > 0, 1, 0) 
> df$loss <- ifelse(df$V2 < 0, 1, 0)
> win <- (c(df$wins))
> max(rle(win)$lengths)
[1] 4

This works for finding maximum wins...

This is for losses:

> print(df$loss)
 [1] 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
> df$loss <- ifelse(df$V2 < 0, 1, 0)
> print(df$loss)
 [1] 0 1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
> loss <- (c(df$loss))
> max(rle(loss)$lengths)
[1] 4

Not sure why it says 4... there is clearly 2 maximum losses, anyone see why? What am i missing, shouldn't the logic of the wins work exactly the same for the loss? I cant see any error in the code...

if value in df$loss is less than 0 print 1 else 0. make a vector containing the contents of df$loss column find max length of 1's using max(rle(loss)$lengths) again result is 4.. however, its clearly 2?

Andrew Bannerman
  • 1,235
  • 2
  • 16
  • 36

3 Answers3

1

I think rle will do the job based on your objective, but I'll post an alternative solution, with a little bit more data manipulation, but you'll be able to get some more info.

Typically, after answering one question you'll be asked to answer more. For example, get an idea of the distribution of the runs for positives vs. negatives, as the maximum value might not tell you much. Or, perform a statistical comparison to see if positives run longer than the negatives on average.

Also, with ~3000 rows I don't think you'll have any speed issues.

library(dplyr)

# example dataset
dt = data.frame(x = c(1,-1,-2,0,2,4,3,5,-5,-6,-7,0,0))

# get a dataset that assigns an id to all positive or negative series
dt %>%
  mutate(sign = ifelse(x > 0, "pos", ifelse(x < 0, "neg", "zero")), # get the sign of the value
         sign_lag = lag(sign, default = sign[1]),       # get previous value (exception in the first place)
         change = ifelse(sign != sign_lag, 1 , 0),      # check if there's a change
         series_id = cumsum(change)+1) %>%              # create the series id
  print() -> dt2                                        # print to screen and save it

#     x sign sign_lag change series_id
# 1   1  pos      pos      0         1
# 2  -1  neg      pos      1         2
# 3  -2  neg      neg      0         2
# 4   0 zero      neg      1         3
# 5   2  pos     zero      1         4
# 6   4  pos      pos      0         4
# 7   3  pos      pos      0         4
# 8   5  pos      pos      0         4
# 9  -5  neg      pos      1         5
# 10 -6  neg      neg      0         5
# 11 -7  neg      neg      0         5
# 12  0 zero      neg      1         6
# 13  0 zero     zero      0         6

You can remove the help columns sign_lag and change as I've included them only to demonstrate how the process works.

# Get longest runs
dt2 %>% 
  count(sign, series_id) %>%
  group_by(sign) %>%
  filter(n == max(n)) %>%
  select(-series_id) %>%
  ungroup

# # A tibble: 3 x 2
#    sign     n
#   <chr> <int>
# 1   neg     3
# 2   pos     4
# 3  zero     2


# Get all runs
dt2 %>% count(sign, series_id)

# # A tibble: 6 x 3
#    sign series_id     n
#   <chr>     <dbl> <int>
# 1   neg         2     2
# 2   neg         5     3
# 3   pos         1     1
# 4   pos         4     4
# 5  zero         3     1
# 6  zero         6     2

If you get all runs you will be able to plot distributions of the runs of positives vs. negatives, or perform statistical comparisons to see if the positives run longer on average than the negative.

AntoniosK
  • 15,991
  • 2
  • 19
  • 32
  • Can you check if your data can handle 0's? right now if my data == 0 it goes down as a neg. I think if we correct that then it should be good to go – Andrew Bannerman Jul 30 '17 at 01:09
  • 1
    Solution updated to treat zeros as a separate category. – AntoniosK Jul 30 '17 at 02:59
  • How would you modify this to then filter a certain number of rows prior to and after a change? Like let's say for instance you wanted to filter your data to only contain observations where there were so many positive values before they changed to negative, and also keep so many of the negative values after the change as well. For example if I wanted to set a criteria of retaining only rows where there were 5 positives followed by 5 negatives, or vise versa...5 negatives followed by 5 positives. – Jared Nov 04 '22 at 18:47
1

This is a simple manner to do it, I assumed that you began with a dataframe with only the values I also assumed you need a dataframe with the counts of the same digits run a column for the positive and a column for the negative.

set.seed(42)
df=data.frame(x= runif(300, -1.0, 1.0))
count_pos=c()
count_neg=c()
n1=df$x[1]
if (sign(n1)==1){
  count_pos[1]=1
  count_neg[1]=0
}else{
  count_neg[1]=1
  count_pos[1]=0
}
count=1
index=1
for (i in df$x[2:nrow(df)]){
  #print (i)
  index=index+1
  if (sign(n1)==sign(i)){
    count=count+1
  }
  else{
    count=1
  }
  if (sign(i)==1){
    count_pos[index]=count
    count_neg[index]=0
  }else{
    count_pos[index]=0
    count_neg[index]=count
  }
  n1=i
}

df2=data.frame(x=df$x,count_pos=count_pos,count_neg=count_neg)
#df2 contains the dataframe with columns that count the run for numbers with 
#same sign
print (paste("Maximum run of Positive numbers:",max(count_pos)))
print (paste("Maximun run of negative numbers:",max(count_neg)))

I know there is a more compact way to do it but this code makes the trick.

  • Cool code! I am only used to seeing R code, not done any other programming. Is the above formed around the base of another language? – Andrew Bannerman Jul 30 '17 at 03:10
1

To get the maximum positive/negative runs in a series ( in your case in a column of a data.frame the rle function is all you need:

set.seed(123)
df <- data.frame(col1=rnorm(20, mean = 0, sd = 1))

max. pos run:

max(rle(sign(df$col1))[[1]][rle(sign(df$col1))[[2]] == 1])
[1] 5

max. neg.run:

max(rle(sign(df$col1))[[1]][rle(sign(df$col1))[[2]] == -1])
[1] 3

let’s check the results:

> df
          col1
1  -0.56047565
2  -0.23017749
3   1.55870831
4   0.07050839
5   0.12928774
6   1.71506499
7   0.46091621
8  -1.26506123
9  -0.68685285
10 -0.44566197
11  1.22408180
12  0.35981383
13  0.40077145
14  0.11068272
15 -0.55584113
16  1.78691314
17  0.49785048
18 -1.96661716
19  0.70135590
20 -0.47279141
hvollmeier
  • 2,956
  • 1
  • 12
  • 17
  • Does this handle 0 values at all in the series? I think i am seeing something strange if there are consecutive 0 values in the series. – Andrew Bannerman Jul 30 '17 at 13:29
  • Ok i see what is happening...... if I have positive....and negative separated by 0's it is a different. So essentially if I had c(1,0,0,0,1,0,0,-1,0,0,-2,0,0,0) for positive count i would want to start the count at 1... and skip the 0... and count the next 1... so the positive run is 2.... and for negative run we find a -1 skip the 0 and count the -2.. so max negative run is 2.... maybe will make a new question! i wasnt aware of this during the original question – Andrew Bannerman Jul 30 '17 at 13:44
  • I can pull that data out of the data frame, remove all 0 then run same calc. should work.. – Andrew Bannerman Jul 30 '17 at 13:56
  • remove.zero <- df[-which(df$col1 == 0 ), ] # removing rows 0 values – Andrew Bannerman Jul 30 '17 at 14:02
  • 1
    @AndrewBannerman, you have several options to deal with zeros. One is to filter out zeroes before calculating the runs, second: regard zeroes as either wins or losses or calculate a helper column with i.e loss -> 0, win -> 1 and zero -> 2 and treat zeroes as separate runs. Regardless of which method you use the principle of how to calculate the max loss/wins stays the same.Hope this helps. – hvollmeier Jul 30 '17 at 14:15
  • for my own learning purposes, when you write: rle(sign(df$col1))[[2]] == 1]) we are looking for a positive sign in the data?? So anything that is a positive or + number and again for the negative... trying to understand the logic of the code here and how it works, thanks! – Andrew Bannerman Jul 30 '17 at 15:32
  • 1
    `sign(x)` returns a vector with the signs of the corresponding elements of x .(the sign of a real number is 1, 0, or -1 if the number is positive, zero, or negative, respectively). – hvollmeier Jul 30 '17 at 15:49
  • Great and these signify? [[1]] [[2]] – Andrew Bannerman Jul 30 '17 at 16:09
  • `rle` returns a list of 2. Try `rle(sign(df$col1)`. One of `lengths` and one of `values`. You can access them with double square brackets `[[]]` like regular list elements.[[1]] is accessing the `lengths` and [[2]] the respective `values`. – hvollmeier Jul 30 '17 at 16:17