1

I have a dataframe where I need to create a new column of values using the strings in Issue_key column and form the strings of values from the Issue_description column.

Issue_Description_needed is what I am seeking. I have used the lag and summarise functions from dplyr but it's not the correct approach.

input <- data.frame(
  
  Issue_Key = c("KLM391",
                "KLM391.02",
                "KLM391.01",
                "KLM391.04",
                "KLM391.01.0999993",
                "KLM391.01.0999993.898",
                "KLM391.01.0999993.898.asds",
                "KLM391.01.0999993.898.tyut",
                "KLM391.02.0999993",
                "KLM391.02.0999996",
                "KLM391.04.0999991",
                "KLM391.04.0999998"),
  
  Issue_description = c("L",
                        "M2",
                        "M1",
                        "M4",
                        "O123",
                        "P1234",
                        "Q12345",
                        "Q67809",
                        "XYq12",
                        "46525",
                        "4hrh4",
                        "fg785"),
  
  Issue_description_needed = c(
    "L",
    "L.M2",
    "L.M1",
    "L.M4",
    "L.M1.O123",
    "L.M1.O123.P1234",
    "L.M1.O123.P1234.Q12345",
    "L.M1.O123.P1234.Q67809",
    "L.M2.XYq12",
    "L.M2.46525",
    "L.M4.4hrh4",
    "L.M4.fg785"
  )
)
Output:

                    Issue_Key Issue_description    Issue_description_needed
1                      KLM391                 L                           L
2                   KLM391.01                M1                        L.M1
3           KLM391.01.0999993              O123                   L.M1.O123
4       KLM391.01.0999993.898             P1234             L.M1.O123.P1234
5  KLM391.01.0999993.898.asds            Q12345      L.M1.O123.P1234.Q12345
6  KLM391.01.0999993.898.tyut            Q67809      L.M1.O123.P1234.Q67809
7                   KLM391.02                M2                        L.M2
8           KLM391.02.0999993             XYq12                  L.M2.XYq12
9           KLM391.02.0999996             46525                  L.M2.46525
10                  KLM391.04                M4                        L.M4
11          KLM391.04.0999991             4hrh4                  L.M4.4hrh4
12          KLM391.04.0999998             fg785                  L.M4.fg785

I have searched the forum with best of my knowledge:

  1. R: Collapse partial duplicate rows into one row by ID but with different conditions per column

  2. How do I group together string rows with partial matching strings and then sum them together?

CaseebRamos
  • 684
  • 3
  • 18
  • 1
    I don't understand. What are you trying to match? I recognize you want answers from the third column above, but ... what are you matching against? What have you tried? – r2evans Jul 11 '23 at 19:39
  • Maybe I'm dumb, but I don't understand what you are concatenating with what and why. – Axeman Jul 11 '23 at 19:39
  • @r2evans, I want to create the third column using the values in Issue_description column by looking up the Issue_keys. Recently I tried this `input %>% dplyr::group_by(Issue_key_prefix) %>% dplyr::summarise(Issue_description_needed = paste(Issue_description, collapse = "."))` Where issue_key_prefix is the string before the first period. The final output doesn't match the third column. – CaseebRamos Jul 11 '23 at 19:46
  • 1
    Please [edit] your question and put that code, its returned value, and why that value is wrong into your question. But please check it, up front we don't have `Issue_key_prefix`. – r2evans Jul 11 '23 at 19:48
  • 1
    It looks like maybe you are accumulating the `Issue_description` based on `Issue_Key`? So, `Issue_description_needed` is the concatenation of `Issue_description` based on `Issue_Key`. For example, all `Issue_Key` that start with `"KLM391"` have the corresponding `Issue_description` "L" concatenated to its own `Issue_description`, which is why I think you were trying to use the `lag` function. Is that right? – LMc Jul 11 '23 at 19:49
  • @LMc that's correct :) – CaseebRamos Jul 11 '23 at 19:53

3 Answers3

2

Using some tidyverse tools:

library(dplyr)
library(stringr)
library(purrr)

input |>
  mutate(split = str_split(Issue_Key, "\\."),
         needed = map(split, ~ str_flatten(Issue_description[match(accumulate(.x, paste, sep = "."), Issue_Key)], "."))) |>
  select(-split)

How it works

  1. First we create a new list column split that splits up Issue_Key delimited by ".".
  2. We iterate over this list column using map and for each row we do the following. Using row 5 as an example:
  • .x on row 5 resolves to c("KLM391", "01", "0999993") and we use accumulate to paste this into these substrings: c("KLM391", "KLM391.01", "KLM391.01.0999993").

  • Next we use match to get the row indicies in Issue_Key that matches c("KLM391", "KLM391.01", "KLM391.01.0999993"), which would be c(1, 3, 5).

  • Finally, we use these indices to simply look up Issue_description (e.g., Issue_description[c(1, 3, 5)]) and collapse everything into one string using str_flatten again delimited by ".".

Output

                    Issue_Key Issue_description Issue_description_needed                 needed
1                      KLM391                 L                        L                      L
2                   KLM391.02                M2                     L.M2                   L.M2
3                   KLM391.01                M1                     L.M1                   L.M1
4                   KLM391.04                M4                     L.M4                   L.M4
5           KLM391.01.0999993              O123                L.M1.O123              L.M1.O123
6       KLM391.01.0999993.898             P1234          L.M1.O123.P1234        L.M1.O123.P1234
7  KLM391.01.0999993.898.asds            Q12345   L.M1.O123.P1234.Q12345 L.M1.O123.P1234.Q12345
8  KLM391.01.0999993.898.tyut            Q67809   L.M1.O123.P1234.Q67809 L.M1.O123.P1234.Q67809
9           KLM391.02.0999993             XYq12               L.M2.XYq12             L.M2.XYq12
10          KLM391.02.0999996             46525               L.M2.46525             L.M2.46525
11          KLM391.04.0999991             4hrh4               L.M4.4hrh4             L.M4.4hrh4
12          KLM391.04.0999998             fg785               L.M4.fg785             L.M4.fg785
LMc
  • 12,577
  • 3
  • 31
  • 43
  • I was most of the way through a `Reduce` implementation when this hit. Shockingly short, nice job. (I consider it "elegant" too, though I have a clear understanding of `accumulate` and friends ... I supposed others might find it a little "magic".) – r2evans Jul 11 '23 at 20:18
  • @r2evans Thank you! – LMc Jul 11 '23 at 20:19
  • I have a lot to learn. I don't know who you came up with neat solution. May I get some of your thinking process steps? – CaseebRamos Jul 11 '23 at 20:20
  • 1
    @CaseebRamos Stay tuned, I will post a "How it works" section to add detail to help you understand. – LMc Jul 11 '23 at 20:21
  • 2
    @r2evans, please give your response too, more way are helpful to others. – CaseebRamos Jul 11 '23 at 20:21
  • 1
    Amazing approach, thanks for teaching. – CaseebRamos Jul 11 '23 at 20:30
  • This takes a lot of time to execute with the data I have :( I am trying to make this data.table but failing. – CaseebRamos Jul 12 '23 at 14:53
  • 1
    @CaseebRamos one potential for improvement would be to use [`library(furrr)`](https://furrr.futureverse.org/) instead of `library(purrr)` which allows for parallel processing. After setting up a [`plan`](https://future.futureverse.org/reference/plan.html) you would then use `future_map` instead of `map`. How many rows are in your real data frame? – LMc Jul 12 '23 at 15:33
1

You could write a function to do the substitution/replacement. Since the replacement is non-unique ie 0999993 is to be replaced with both O123 and XYq12 we write two functions.

my_sub <- function(x, y){
  fn <- function(x, y){
    idx <- !grepl(".", x, fixed = TRUE)
    if(!is.null(nms<-names(x))) idx <- idx & is.na(nms) 
    if(any(idx)){
      names(x)[idx] <- y[idx]
      x[!idx] <- Reduce(\(i,j)gsub(j, '', i, fixed = TRUE), x[idx], init = x[!idx])
      x[!idx] <- trimws(x[!idx], whitespace = '[.]')
      Recall(x, y)
    }
    else x
  }
  r <- fn(x, y)
  z <- mapply(gsub, r, names(r), x, USE.NAMES = FALSE)
  Reduce(\(i,j)gsub(r[j], y[j], i), seq(x), init=z)
}

cbind(input[1:2],new = my_sub(input[,1], input[, 2]))
#>                     Issue_Key Issue_description                    new
#> 1                      KLM391                 L                      L
#> 2                   KLM391.02                M2                   L.M2
#> 3                   KLM391.01                M1                   L.M1
#> 4                   KLM391.04                M4                   L.M4
#> 5           KLM391.01.0999993              O123              L.M1.O123
#> 6       KLM391.01.0999993.898             P1234        L.M1.O123.P1234
#> 7  KLM391.01.0999993.898.asds            Q12345 L.M1.O123.P1234.Q12345
#> 8  KLM391.01.0999993.898.tyut            Q67809 L.M1.O123.P1234.Q67809
#> 9           KLM391.02.0999993             XYq12             L.M2.XYq12
#> 10          KLM391.02.0999996             46525             L.M2.46525
#> 11          KLM391.04.0999991             4hrh4             L.M4.4hrh4
#> 12          KLM391.04.0999998             fg785             L.M4.fg785

Created on 2023-07-11 with reprex v2.0.2

Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • Thank you! Unfortunately, when I put this in my real dataset, I get `Error in gsub(j, "", i, fixed = TRUE) : zero-length pattern Calls: cbind ... Recall -> fn -> Recall -> fn -> Reduce -> f -> gsub` – CaseebRamos Jul 12 '23 at 12:47
0

I used @r2evans response and formulated this answer which solves my problem at a considerable speed but doesn't solve fully (continue question here Having issues accumulating values from one column based on keys in another column):

fun <- function(st) strcapture("(.*)[.][^.]+$", st, list(L=""))$L

input <- input %>%
  mutate(K = fun(Issue_Key))



while (TRUE) {
  input <- left_join(input, select(input, Issue_Key, iss = Issue_description), by = c("K" = "Issue_Key")) %>%
    mutate(
      Issue_description= if_else(is.na(iss), Issue_description,
                          if_else(Issue_Key == K, Issue_description, paste(iss, Issue_description, sep = "."))),
      K = fun(K)) %>%
    select(-iss)
  if (all(is.na(input$K))) break
}



input$K <- NULL



# Update the PROJ_NAME column to include repeated parts for each unique PROJ_NAME
input$Issue_description <- sapply(strsplit(as.character(input$Issue_description), "\\."), function(x) {
  unique_parts <- unique(x)
  paste(unique_parts, collapse = ".")
})



# Print the updated dataframe
print(input)


1                      KLM391                      L
2                   KLM391.02                   L.M2
3                   KLM391.01                   L.M1
4                   KLM391.04                   L.M4
5           KLM391.01.0999993              L.M1.O123
6       KLM391.01.0999993.898        L.M1.O123.P1234
7  KLM391.01.0999993.898.asds L.M1.O123.P1234.Q12345
8  KLM391.01.0999993.898.tyut L.M1.O123.P1234.Q67809
9           KLM391.02.0999993             L.M2.XYq12
10          KLM391.02.0999996             L.M2.46525
11          KLM391.04.0999991             L.M4.4hrh4
12          KLM391.04.0999998             L.M4.fg785
CaseebRamos
  • 684
  • 3
  • 18