0

I have the following dataset and looking to write a code that can help pull out which stocks have been positive or negative consecutively. The data would have first 3 column. last 2 columns are manually calculated in excel to depict expected results.

This is only sample, i would have data for 200+ stocks and few years of data with all stocks not trading every day.

In the end, i want to extract which stocks have say 3 or 4 or 5 consecutive positive or negative change for the day.

`    Stocks Date    Close Price Change for day  Positive/Negative Count
A   11/11/2020         11       
B   11/11/2020         50       
C   11/11/2020        164       
A   11/12/2020         19         8                 1
B   11/12/2020         62        12                 1
C   11/12/2020        125        -39               -1
A   11/13/2020          7        -12               -1
B   11/13/2020         63         1                 2
C   11/13/2020        165        40                 1
A   11/16/2020         17        10                 1
B   11/16/2020         70         7                 3
C   11/16/2020        170         5                 2
A   11/17/2020         24         7                 2
B   11/17/2020         52        -18               -1
C   11/17/2020        165         -5               -1
A   11/18/2020         31          7                3
B   11/18/2020         61          9                1
C   11/18/2020        157         -8               -2
A.B.
  • 83
  • 1
  • 8

1 Answers1

0

The difficulty is to have a function that makes the cumulative sum, both positive and negative, resetting the count when the sign changes, and starting the count with the first value. I managed to make one, but it is not terribly efficient and will probably get slow on a bigger dataset. I suspect there is a way to do better, if only with a simple for loop in C or C++.

library(tidyverse)


df <- read.table(text="Stocks Date    Close_Price Change_for_day  Positive/Negative_Count
A   11/11/2020         11       NA                 0
B   11/11/2020         50       NA                 0
C   11/11/2020        164       NA                 0
A   11/12/2020         19         8                 1
B   11/12/2020         62        12                 1
C   11/12/2020        125        -39               -1
A   11/13/2020          7        -12               -1
B   11/13/2020         63         1                 2
C   11/13/2020        165        40                 1
A   11/16/2020         17        10                 1
B   11/16/2020         70         7                 3
C   11/16/2020        170         5                 2
A   11/17/2020         24         7                 2
B   11/17/2020         52        -18               -1
C   11/17/2020        165         -5               -1
A   11/18/2020         31          7                3
B   11/18/2020         61          9                1
C   11/18/2020        157         -8               -2",
           header = TRUE) %>%
  select(1:3) %>%
  as_tibble()


# this formulation could be faster on data with longer stretches
nb_days_cons2 <- function(x){
  n <- length(x)
  if(n < 2) x
  out <- integer(n)
  y <- rle(x)
  cur_pos <- 1
  for(i in seq_len(length(y$lengths))){
    out[(cur_pos):(cur_pos+y$lengths[i]-1)] <- cumsum(rep(y$values[i], y$lengths[i]))
    cur_pos <- cur_pos + y$lengths[i]
  }
  out
}

# this formulation was faster on some tests, and would be easier to rewrite in C
nb_days_cons <- function(x){
  n <- length(x)
  if(n < 2) x
  out <- integer(n)
  out[1] <- x[1]
  for(i in 2:n){
    if(x[i] == x[i-1]){
      out[i] <- out[i-1] + x[i]
    } else{
      out[i] <- x[i]
    }
  }
  out
}

Once we have that function, the dplyr part is quite classic.

df %>%
  group_by(Stocks) %>%
  arrange(Date) %>%   # make sure of order
  mutate(change = c(0, diff(Close_Price)),
         stretch_duration = nb_days_cons(sign(change))) %>%
  arrange(Stocks)
#> # A tibble: 18 x 5
#> # Groups:   Stocks [3]
#>    Stocks Date       Close_Price change stretch_duration
#>    <chr>  <chr>            <int>  <dbl>            <dbl>
#>  1 A      11/11/2020          11      0                0
#>  2 A      11/12/2020          19      8                1
#>  3 A      11/13/2020           7    -12               -1
#>  4 A      11/16/2020          17     10                1
#>  5 A      11/17/2020          24      7                2
#>  6 A      11/18/2020          31      7                3
#>  7 B      11/11/2020          50      0                0
#>  8 B      11/12/2020          62     12                1
#>  9 B      11/13/2020          63      1                2
#> 10 B      11/16/2020          70      7                3
#> 11 B      11/17/2020          52    -18               -1
#> 12 B      11/18/2020          61      9                1
#> 13 C      11/11/2020         164      0                0
#> 14 C      11/12/2020         125    -39               -1
#> 15 C      11/13/2020         165     40                1
#> 16 C      11/16/2020         170      5                2
#> 17 C      11/17/2020         165     -5               -1
#> 18 C      11/18/2020         157     -8               -2
Created on 2020-11-19 by the reprex package (v0.3.0)

Of course, the final arrange() is just for easy visualization, and you can remove the columns you don't need anymore with select().

Alexlok
  • 2,999
  • 15
  • 20
  • the results in last column is not correct. For example, stock C- date 17 Nov should have value of -1 and 18 Nov should have value of -2 to show 2 consecutive negative close. It should never be zero because positive negative are not suppose to net off. – A.B. Nov 19 '20 at 03:16
  • Oh I didn't understand that you wanted this, right now it's the number of consecutive entries with the same sign, so that you can easily extract the ones with 3 or 4 days in a row. I'm editing my answer to add a signed column (just by multiplying with `pos`). – Alexlok Nov 19 '20 at 03:22
  • Alexlok, it still isnt correct. after the positive value of 1, if the stock closes negative next day, the result should not be zero. On every positive or negative close, it should start counting from -1 or +1 respectively. so answer in last column should never be zero. Hope i am able to clarify. my expected result for stock C on 17 Nov is -1 and 18 Nov is -2. Please help to improve the code. – A.B. Nov 19 '20 at 03:36
  • Does it correspond to expectations now? – Alexlok Nov 19 '20 at 06:08
  • Yes it correspond to the expectations. How can i give 5 star rating and to this website for such a brilliant support? Just one thing i would like to confirm because so far i am not able to confirm from results. How this code will be due to gaps in dates like weekend or any particular stock not traded on any particular date? – A.B. Nov 20 '20 at 04:03
  • Here I completely ignored the date (you can see it in the call to `mutate()`, there is only reference to the column `change`, not to the column `Date`). If you want to reset the count, for days when some stocks are missing not others, you can look at the function `complete()` that can create rows for the missing ones. To include days were nothing was traded, you'd have to `left_join()` a column with all possible dates. – Alexlok Nov 20 '20 at 05:22
  • facing this error Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: Java heap space – A.B. Nov 20 '20 at 14:05
  • This is a different problem, nothing here uses Java. – Alexlok Nov 20 '20 at 15:02
  • i further want to add a column in the results which shows the minimum stretch duration value for the stock in the whole data prior to the date in the given row we are looking at. For example for Stock C on 18th Nov would show that historical minimum stretch duration for Stock C was -1 before 18th November which it brings by evaluating Stretch duration values for all dates before 18th November. – A.B. Nov 27 '20 at 15:20
  • You can try with `cummin` and `cummax`, else this would be best asked as a separate question with a minimal exemple. – Alexlok Nov 27 '20 at 15:41