2

I have the following issue:

  1. I have two dfs, called df1 and df2.

  2. With the given ifelse function I try to find all rows in the df2$Index which starts with the same three characters as in df1$Index_1.

  3. Search all cells starting with CEE.Then take all the growth values and the df2$Index values that are at CEE and one before (Period 1). etc. The df: Result1 shows what the final result (just for the first row) looks like.

  4. The function should run through the entire vector Index_1 & Period. At the end there should be 5 dfs with the result.

  5. The ifelse function should identify the start point, and if the point was founded, then it should use the number in the period list to give the Growth and df2$Index values back, as already mentioned above. I tried it with the ifelse function but I am sure there is a better and easier way to do this, moreover, the iflese function is not complete due to my lack of knowledge in R. It would be very nice if someone can help me.

Period <- c(1,2,3,4,5)
Index_1 <- c("CEE","DAE","ABC","EBB","BDC")



Growth <- c(10:34)
Index <- c("ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB" )  

Df1 <- data.frame(Period, Index_1)
Df2 <- data.frame(Growth,Index)

ss <- as.data.frame(ifelse(substr(Df2$Index, 1,3) == df1$Index_1,1,0))

Result1 <- data.frame(Index, Growth)
Index <- c("BDCE","CEED","BDCE","CEED","BDCE","CEED","BDCE","CEED","BDCE","CEED")
Growth <- c(11,12,16,17,21,22,26,27,31,32)

Result2 <- data.frame(Index, Growth)
Index <- c("BDCE", "CEED", "DAEB","BDCE", "CEED", "DAEB","BDCE", "CEED", "DAEB","BDCE", "CEED", "DAEB","BDCE", "CEED", "DAEB")
Growth <- c(11,12,13,16,17,18,21,22,13,26,27,28,31,32,33)

Second Question


Period <- c(3,2,1,4,5)
Index_1 <- c("CEE","DAE","ABC","EBB","BDC")


Growth <- c(10:34)
Index <- c("ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB","ABCD","BDCE", "CEED", "DAEB", "EBBB" )  

Df1 <- data.frame(Period, Index_1)
Df2 <- data.frame(Growth,Index)

lst <- apply(Df1, 1, function(x) {
  match_rows <- which(substr(Df2$Index, 1, 3) == x[["Index_1"]])
  all_rows <- unlist(Map(`:`, match_rows - as.numeric(x[["Period"]]), match_rows))
  Df2[all_rows[all_rows > 0],]
})

lst[[1]] # This is how it looks now

#and this is how it should look like
Result3 <- data.frame(Index, Growth) 
Index <- c("NA","ABCD","BDCE", "CEED","EBBB","ABCD","BDCE", "CEED","EBBB","ABCD","BDCE", "CEED","EBBB","ABCD","BDCE", "CEED","EBBB","ABCD","BDCE", "CEED")
Growth <- c(NA,10,11,12,14,15,16,17,19,20,21,22,24,25,26,27)
Selina1
  • 141
  • 7
  • Is this what you want? https://stackoverflow.com/questions/47271685/fuzzy-matching-in-r & https://www.princeton.edu/~otorres/FuzzyMergeR101.pdf – Tung Sep 08 '21 at 18:03
  • At a certain point yes! But what I need additionally is to consider the period. If you read the result1 or 2 df you will see the difference. Let my code run and (without the ifelse function) and you will see what I try to achieve. Thank you for your help! :-) – Selina1 Sep 08 '21 at 18:18

2 Answers2

2
  1. Define a pattern to search for with str_detect
  2. filter for this pattern
  3. Use window function lead and lag and
  4. filter with if_any to fulfill request 3
  5. group to sets of 3 and
  6. use group_split to fulfill request 4
library(dplyr)
library(stringr)

pattern <- paste(Df1$Index_1, collapse = "|")

Df2 %>% 
    filter(str_detect(Index, pattern)) %>% 
    mutate(x = lead(Index),
           y = lag(Index)) %>% 
    filter(if_any(c(Index, x, y), ~str_detect(., "^CEE"))) %>% 
    select(Index, Growth) %>% 
    mutate(group = as.integer(gl(n(), 3, n()))) %>% 
    group_split(group)

output:

<list_of<
  tbl_df<
    Index : character
    Growth: integer
    group : integer
  >
>[5]>
[[1]]
# A tibble: 3 x 3
  Index Growth group
  <chr>  <int> <int>
1 BDCE      11     1
2 CEED      12     1
3 DAEB      13     1

[[2]]
# A tibble: 3 x 3
  Index Growth group
  <chr>  <int> <int>
1 BDCE      16     2
2 CEED      17     2
3 DAEB      18     2

[[3]]
# A tibble: 3 x 3
  Index Growth group
  <chr>  <int> <int>
1 BDCE      21     3
2 CEED      22     3
3 DAEB      23     3

[[4]]
# A tibble: 3 x 3
  Index Growth group
  <chr>  <int> <int>
1 BDCE      26     4
2 CEED      27     4
3 DAEB      28     4

[[5]]
# A tibble: 3 x 3
  Index Growth group
  <chr>  <int> <int>
1 BDCE      31     5
2 CEED      32     5
3 DAEB      33     5
TarJae
  • 72,363
  • 6
  • 19
  • 66
2

Here is an approach in base R. You can use apply to go through each row of Df1, and get the row numbers of matching rows in Df2. Then, you can take these matching values and expand based on Period (getting a range of values, such as 2 additional rows if Period is 2).

It is not exactly clear what to do if matching one of the first rows (with the Period offset). What if the first row of Df2 matches but Period is 5? So I just use rows that are available and subsetted where row positions are greater than zero. The end result is a list of data.frames.

lst <- apply(Df1, 1, function(x) {
  match_rows <- which(substr(Df2$Index, 1, 3) == x[["Index_1"]])
  all_rows <- unlist(Map(`:`, match_rows - as.numeric(x[["Period"]]), match_rows))
  Df2[all_rows[all_rows > 0],]
})

> lst[[1]]
   Growth Index
2      11  BDCE
3      12  CEED
7      16  BDCE
8      17  CEED
12     21  BDCE
13     22  CEED
17     26  BDCE
18     27  CEED
22     31  BDCE
23     32  CEED

> lst[[2]]
   Growth Index
2      11  BDCE
3      12  CEED
4      13  DAEB
7      16  BDCE
8      17  CEED
9      18  DAEB
12     21  BDCE
13     22  CEED
14     23  DAEB
17     26  BDCE
18     27  CEED
19     28  DAEB
22     31  BDCE
23     32  CEED
24     33  DAEB
Ben
  • 28,684
  • 5
  • 23
  • 45
  • Thank you for your help! It is exactly what I wanted. :-) There is no case in my df's that this will happen! – Selina1 Sep 09 '21 at 10:04
  • My bad, Indeed I got something like you mentioned before. It should be filled out with Na's if this is possible? – Selina1 Sep 09 '21 at 14:59
  • Thank you ;-). Maybe you have also and idea to another issue. After Df2[all_rows[all_rows > 0],] I wrote the following chunk to get after each run the list splitted to equal part: max <- Period x <- seq_along(Df2$Growth) d1 <- split(Df2$Growth, ceiling(x/max)), but it doesn't work! because I want to calculate the maxDrawdown. As an example: The first result (lst[[1]]) CEED and BDCE occurs 5 times, and I need to calculate the maxDD for each of those 5 parts, do you understand what I try to explain? – Selina1 Sep 09 '21 at 15:27
  • This is a bit confusing --- first, if you fill in with `NA`, what exactly should this look like? Can you edit your original question with what the output should be for `Period` of 3 (in which case this problem occurs)? Then, with calculating "maxDrawdown" - I'm not sure I follow. That could also be a separate question. If you post the code you're using, and indicate what the final result should look like, I'm sure I or someone else will be able to answer... – Ben Sep 09 '21 at 15:55
  • Thank you for your answer! I will update my original question in a few minutes and opening a new question with the word maxDrawdown in it. – Selina1 Sep 09 '21 at 16:09
  • Your `Result3` cannot be created, as `Index` and `Growth` vectors are different lengths. Also you have "Na" in quotes, but I think you just want `NA` instead? Once you create your `Result3` on exactly as it should look like, you can do `dput(Result3)` and then copy/paste the result into your question. It makes things easier for others to reproduce... – Ben Sep 09 '21 at 17:04
  • You can‘t reproduce this result, because it shows just how it should looks like! If you change at the period vector the first number (1) trough 3 you will get the result how it looks like with your function and my result which I already posted is how it should looks like. But I can also do it, I am just on the way home (using my cellphone). – Selina1 Sep 09 '21 at 17:52