3

I have a data frame where each row represents interaction data per person.

actions = read.table('C:/Users/Desktop/actions.csv', header = F, sep = ',', na.strings = '', stringsAsFactors = F)

Each person can have one, or more of the following interactions:

eat, sleep, walk, jump, hop, wake, run

The action lengths being recorded for each person may differ as below:

P1: eat,  sleep, sleep, sleep
P2: wake, walk,  eat,   walk, walk, jump, jump, run, run
P3: wake, eat,   walk,  jump, run,  sleep

To make the lengths equal, I have NA padding at the end:

P1: eat,  sleep, sleep, sleep, NA,   NA,    NA,   NA,  NA
P2: wake, walk,  eat,   walk,  walk, jump,  jump, run, run
P3: wake, eat,   walk,  jump,  run,  sleep, NA,   NA,  NA

Now, my requirement is to update the per person entries (row wise data), so that no two consecutive entries are duplicates. It is very important to maintain the order. My required output is:

P1: eat,  sleep, NA,   NA,   NA,   NA,    NA,   NA,  NA
P2: wake, walk,  eat,  walk, jump, run,   NA,   NA,  NA 
P3: wake, eat,   walk, jump, run,  sleep, NA,   NA,  NA

The column names are by default V1, V2, V3 .... Vn where

n = maximum length of interactions string 

In the above example P2 has maximum length; so n = 9. So total columns in the above example are from V1-V9.

The output for the

dput(actions)

structure(list(V1 = c("S", "C", "R"), V2 = c("C", "C", "R"), 
V3 = c("R", "C", "R"), V4 = c("S", NA, "R"), V5 = c("C", 
NA, "R"), V6 = c("R", NA, NA), V7 = c("S", NA, NA), V8 = c("C", 
NA, NA), V9 = c("R", NA, NA)), class = "data.frame", row.names = c(NA,-3L))

The following question: Removing Only Adjacent Duplicates in Data Frame in R is bit similar to mine, however, there are several differences. I am unable to solve my problem even by incorporating the code from the above question.

Any suggestions on this would be highly appreciated!

Sandy
  • 1,100
  • 10
  • 18
  • Can you post sample data in `dput` format? Please edit **the question** with the output of `dput(df)`. Or, if it is too big with the output of `dput(head(df, 20))`. (`df` is the name of your dataset.) – Rui Barradas Nov 10 '18 at 11:51
  • Contributors: the "two consecutive" appears to be paramount. – hrbrmstr Nov 10 '18 at 12:18
  • @hrbrmstr Your solution has disappeared. It seemed to be leading me in the right direction... – Sandy Nov 10 '18 at 12:19
  • it's not, though. you'll get errant results with it since it doesn't handle the "only two consecutive" properly – hrbrmstr Nov 10 '18 at 12:26
  • I think I did not convey properly: sleep sleep sleep sleep should result in 'sleep'; not 'sleep sleep' as I do not want duplicates. – Sandy Nov 10 '18 at 12:38
  • @hrbrmstr, at least perhaps now you see that I didn't copy your solution since mine doesn't leave such duplicates. – Julius Vainora Nov 10 '18 at 13:01
  • mine didn't either @JuliusVainora – hrbrmstr Nov 10 '18 at 13:07

3 Answers3

3
library(tidyverse)

read.csv(text=gsub(" +", "", "P1, eat,  sleep, sleep, sleep, NA,   NA,    NA,   NA,  NA
P2, wake, walk,  eat,   walk,  walk, jump,  jump, run, run
P3, wake, eat,   walk,  jump,  run,  sleep, NA,   NA,  NA"), 
           header = FALSE, stringsAsFactors = FALSE) %>% 
  setNames(c("person", sprintf("i%s", 1:9))) %>% tbl_df() -> xdf

de_dup <- function(x) {
  # remove consecutive dups and keep order
  interactions <- rle(unlist(x, use.names = FALSE)[-1])$values
  # fill in NAs
  interactions <- c(interactions, rep(NA_character_, length(x[-1])-length(interactions)))
  # return a data frame
  as.data.frame(as.list(setNames(c(x[1], interactions), names(x))), stringsAsFactors=FALSE)
}

rowwise(xdf) %>% 
  do(de_dup(.)) %>% 
  ungroup()
## # A tibble: 3 x 10
##   person i1    i2    i3    i4    i5    i6    i7    i8    i9   
## * <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 P1     eat   sleep NA    NA    NA    NA    NA    NA    NA   
## 2 P2     wake  walk  eat   walk  jump  run   NA    NA    NA   
## 3 P3     wake  eat   walk  jump  run   sleep NA    NA    NA 

Requested Exposition

Since the dups are across columns, the most straightforward approach (not necessarily the fastest or least memory/CPU intensive) is to re-create the data frame row-by-row.

  • rowwise() is a tidyverse function that breaks the data frame into groups by rows
  • we then take each individual row (using do()) and pass that to a function we've created to make the code more readable and update-able (unlike confusing inline bracketed {} craziness with semicolons vs newlines). The . == the entire row
  • The x parameter in de_dup() will be a named list (read the docs on do)
  • We take that list and turn it into a vector with unlist()
  • We then pass that into the rle function but not the first element which is the person. This isn't completely necessary (the person will be unique) but it has mindful logic in that you know you're working on the interactions vs the person. Look at the output of rle(c("a", "a", "b", "c", "c", "c", "d)) to get an idea of what it does. It stands for run length encoding and it is purpose built for needs like yours
  • The return value of rle has a values element which has the de-duped elements without NAs.
  • Due to ^^ we have to re-pad the NAs again. Lots of ways to do this. I like this way.
  • We then have to return a data frame (check the docs on do() again) so we create a named character vector and turn it into a data frame
  • At the end of do() we still have a row-by-row grouped data frame so we then need to ungroup it
hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
  • didn't see the "V" note in your edited question before finishing – hrbrmstr Nov 10 '18 at 11:56
  • Thanks, the solution is mostly working and it is amazing!! Just a little trouble, i think if there are more than 3 or more duplicate entries in a row; the code produces 2 instances in the output. An entry which had 'sleep sleep sleep NA NA NA NA NA NA ' was reduced to 'sleep sleep NA NA NA NA NA NA NA' And in another case, 'walk walk walk walk walk NA NA NA NA' came down to 'walk walk NA NA NA NA NA NA NA' – Sandy Nov 10 '18 at 12:12
  • Oh. So the "two consecutive" is paramount. i.e. it's not just "all consecutive"? – hrbrmstr Nov 10 '18 at 12:15
  • Thanks! Your code works fine except at the first occurrence at the start of a line. – Sandy Nov 10 '18 at 21:32
  • lemme poke at that. i thought that might be an edge case – hrbrmstr Nov 10 '18 at 22:15
1

Here's a simple way using base R. I have simply created a function that will replace consecutive duplicates with NA and rearrange the new row in desired order -

# function to check consecutive duplicates
ccd <- function(x) {
  # first value can never be duplicate so initiating to 0
  test <- c(0, sapply(1:(length(x)-1), function(i) anyDuplicated(x[i:(i+1)])))
  x[test > 0] <- NA_character_
  x[order(test)]
}

# Original df from dput
> df
  V1 V2 V3   V4   V5   V6   V7   V8   V9
1  S  C  R    S    C    R    S    C    R
2  C  C  C <NA> <NA> <NA> <NA> <NA> <NA>
3  R  R  R    R    R <NA> <NA> <NA> <NA>

for(r in 1:nrow(df)) {
  df[r, ] <- ccd(as.character(df[r, ]))
}

> df
  V1   V2   V3   V4   V5   V6   V7   V8   V9
1  S    C    R    S    C    R    S    C    R
2  C <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3  R <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>

For the demo-ed example in post -

df <- read.csv(
text=gsub(" +", "", "P1, eat,  sleep, sleep, sleep, NA,   NA,    NA,   NA,  NA
P2, wake, walk,  eat,   walk,  walk, jump,  jump, run, run
                         P3, wake, eat,   walk,  jump,  run,  sleep, NA,   NA,  NA"), 
               header = FALSE, stringsAsFactors = FALSE)[, -1]

> df
    V2    V3    V4    V5   V6    V7   V8   V9  V10
1  eat sleep sleep sleep <NA>  <NA> <NA> <NA> <NA>
2 wake  walk   eat  walk walk  jump jump  run  run
3 wake   eat  walk  jump  run sleep <NA> <NA> <NA>

for(r in 1:nrow(df)) {
  df[r, ] <- ccd(as.character(df[r, ]))
}

> df
    V2    V3   V4   V5   V6    V7   V8   V9  V10
1  eat sleep <NA> <NA> <NA>  <NA> <NA> <NA> <NA>
2 wake  walk  eat walk jump   run <NA> <NA> <NA>
3 wake   eat walk jump  run sleep <NA> <NA> <NA>
Shree
  • 10,835
  • 1
  • 14
  • 36
1

A combination of dplyr, reshape2 and base R. First, it identifies the desired duplicates and replace them with NA. Then, it shifts the non-NA values to the left.

as.data.frame(t(apply(df %>%
          gather(var, val, -V1) %>% 
          group_by(V1) %>% 
          mutate(val2 = ifelse(val == lag(val), NA, val),
                 val2 = ifelse(var == "V2", paste(val), val2)) %>% 
          dcast(V1~var, value.var = "val2"), 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))

  V1   V2    V3   V4   V5   V6    V7   V8   V9  V10
1 P1  eat sleep <NA> <NA> <NA>  <NA> <NA> <NA> <NA>
2 P2 wake  walk  eat walk jump   run <NA> <NA> <NA>
3 P3 wake   eat walk jump  run sleep <NA> <NA> <NA>

Data (using the code from @Shree):

df <- read.csv(text = gsub(" +", "", "P1, eat,  sleep, sleep, sleep, NA,   NA,    NA,   NA,  NA
            P2, wake, walk,  eat,   walk,  walk, jump,  jump, run, run
            P3, wake, eat,   walk,  jump,  run,  sleep, NA,   NA,  NA"), 
               header = FALSE, stringsAsFactors = FALSE)
tmfmnk
  • 38,881
  • 4
  • 47
  • 67