I want to recode my data according to decision rules.

An example of the rules:

The data have more than 3 year variables.

First rule: correct the data when there is only one error:
if y ≤ y+2 and y+1 < y, then y+1 = y
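
One way to read the first rule, as a minimal base R sketch on a single row of year values. The function name `rule1` is hypothetical, and the sketch assumes that y, y+1 and y+2 stand for the values of three consecutive years and that all checks are made against the uncorrected values.

```r
# Hypothetical helper: apply rule 1 to one row of year values.
# Assumption: y, y+1 and y+2 are the values of three consecutive years.
rule1 <- function(x) {
  n <- length(x)
  if (n < 3) return(x)               # no y + 2 exists, nothing to check
  i <- seq_len(n - 2)                # positions of every y that has a y + 2
  # which() drops NA comparisons, so missing years never trigger a correction
  hit <- which(x[i] <= x[i + 2] & x[i + 1] < x[i])
  x[hit + 1] <- x[hit]               # overwrite y + 1 with y
  x
}

rule1(c(6, 7, 6, 8))
rule1(c(6, 7, 8, 7, 8))
```

On the two rows above this gives 6 7 7 8 and 6 7 8 8 8.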

After that correction, apply the second rule:

  • More than 2 identical years: keep the most frequent value
  • Equal frequencies: keep the higher value
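
The value-picking part of the second rule (most frequent value, ties going to the higher one) could be sketched in base R as follows. The name `most_frequent_high` is hypothetical, and the sketch only selects the value; which positions of a row should then be overwritten is not covered here.

```r
# Hypothetical helper: most frequent value of x, ties resolved to the higher value.
most_frequent_high <- function(x) {
  x <- x[!is.na(x)]                        # ignore missing years
  if (length(x) == 0) return(NA_real_)     # nothing to count
  counts <- table(x)                       # frequency of each distinct value
  top <- names(counts)[as.vector(counts) == max(counts)]
  max(as.numeric(top))                     # highest of the tied values
}

most_frequent_high(c(6, 7, 8, 6, 7))   # 6 and 7 both appear twice
most_frequent_high(c(7, 8, 8, 8))
```

The first call returns 7 (the higher of the tied values), the second returns 8.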

Maybe an example makes it a little clearer:

ID  y1  y2  y3  y4  y5
1   6   7   6   8   
2   6   7   7   6   8
3   6   7   8   7   8
4   6   7   8   6   7
6   3   4   5   6   3

The corrected data:

ID  y1  y2  y3  y4  y5
1   6   7   7   8   
2   6   7   7   7   8
3   6   7   8   8   8
4   6   7   7   7   7
6   3   4   5   6   3

If you have any idea how to correct a variable as a function of other variables, many thanks.


If I have an ID with 8 years of data, row 4 is no longer corrected. Do you know why? Is it a problem caused by the many NA values? Before the code:

ID  y1  y2  y3  y4  y5  y6  y7  y8
1   6   7   6   8   NA  NA  NA  NA
2   6   7   7   6   8   NA  NA  NA
3   6   7   8   7   8   NA  NA  NA
4   6   7   8   6   7   NA  NA  NA
5   3   4   5   6   3   NA  NA  NA
6   7   7   8   8   7   8   7   8

After the code:

    y1  y2  y3  y4  y5  y6  y7  y8
1   6   7   7   8   NA  NA  NA  NA
2   6   7   7   7   8   NA  NA  NA
3   6   7   8   8   8   NA  NA  NA
4   6   7   8   6   7   NA  NA  NA
5   3   4   5   6   3   NA  NA  NA
6   7   7   8   8   8   8   8   8

If you have a solution, I would be glad to hear it; otherwise I will select rows according to the number of non-empty fields.
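
Counting the non-empty fields per row, as mentioned above, might look like this in base R. The data frame `df` is rebuilt from the 8-year table, and `n_years` is a hypothetical name; this is only a sketch of the selection step, not of the correction itself.

```r
# The 8-year example data, typed in column by column
df <- data.frame(
  ID = 1:6,
  y1 = c(6, 6, 6, 6, 3, 7),
  y2 = c(7, 7, 7, 7, 4, 7),
  y3 = c(6, 7, 8, 8, 5, 8),
  y4 = c(8, 6, 7, 6, 6, 8),
  y5 = c(NA, 8, 8, 7, 3, 7),
  y6 = c(NA, NA, NA, NA, NA, 8),
  y7 = c(NA, NA, NA, NA, NA, 7),
  y8 = c(NA, NA, NA, NA, NA, 8)
)

# Number of non-missing year values per row (ID column excluded)
n_years <- rowSums(!is.na(df[, -1]))

# Example selection: rows with exactly 5 observed years
five_years <- df[n_years == 5, ]
```

Here `n_years` is 4, 5, 5, 5, 5, 8, so `five_years` holds IDs 2 to 5.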

Anoushiravan R
Nic

1 Answer

Updated solution: I made a slight modification so that it also works with observations containing NA values:

  • I used the pmap_df function from the purrr package, which applies a function row-wise to a data frame by passing each row's values as multiple arguments
  • c(...) captures all values in each row, including ID, which I then drop with [-1] so that only the y values remain
  • For your y + 2 I omitted the first two values of every row, since they cannot be a y + 2. Because we also check y + 1 for every y, the two expressions must have the same length, so I kept only those y + 1 for which a y + 2 exists
  • For the other rules I created a vector called z, which is x without y1. If z has exactly 3 unique values (so, with 4 values, 2 of them are the same), all values of z are set to the duplicated one
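
To make the slicing described in the bullets concrete, here is a small base R sketch on single rows. The names `x`, `y_1`, `y_2` and `z` follow the answer; `y` and `hit` are hypothetical names introduced only for this illustration.

```r
# One row of year values (ID already dropped)
x <- c(6, 7, 8, 7, 8)

y_2 <- x[-c(1, 2)]               # every value that can be a y + 2: 8 7 8
y_1 <- x[2:(length(y_2) + 1)]    # the y + 1 values that have a y + 2: 7 8 7
y   <- x[seq_along(y_2)]         # the matching y values: 6 7 8

# Positions of y where y <= y + 2 and y + 1 < y
hit <- which(y <= y_2 & y_1 < y)

# Second step on another row: x without y1
z <- c(7, 8, 6, 7)
if (length(unique(z)) == 3) {
  # duplicated() flags the repeated value (7), which then fills all of z
  z[seq_along(z)] <- z[duplicated(z)]
}
```

Here `hit` is 3, so the fourth value of `x` would be overwritten with the third, and `z` ends up as 7 7 7 7.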
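library(dplyr)
library(purrr)

df %>%
  pmap_df(~ {
    # Values of the current row: drop NAs, then drop ID (the first element)
    x <- c(...)[!is.na(c(...))][-1]
    # Rule 1: for each y that has a y + 2, if y <= y + 2 and y + 1 < y, set y + 1 to y
    y_2 <- x[-c(1, 2)]
    y_1 <- x[2:(length(y_2) + 1)]
    ids <- which((x[seq_along(y_2)] <= y_2) & (y_1 < x[seq_along(y_1)]))
    x[ids + 1] <- x[ids]
    # Rule 2: z is x without y1; if z has exactly 3 unique values,
    # set every value of z to the duplicated one
    z <- x[-1]
    if (length(unique(z)) == 3 && !anyNA(z)) {
      z[seq_along(z)] <- z[duplicated(z)]
    }
    c(x[1], z)
  })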

# A tibble: 5 x 5
     y1    y2    y3    y4    y5
  <int> <int> <int> <int> <int>
1     6     7     7     8    NA
2     6     7     7     7     8
3     6     7     8     8     8
4     6     7     7     7     7
5     3     4     5     6     3

Second data sample:

# A tibble: 6 x 8
     y1    y2    y3    y4    y5    y6    y7    y8
  <int> <int> <int> <int> <int> <int> <int> <int>
1     6     7     7     8    NA    NA    NA    NA
2     6     7     7     7     8    NA    NA    NA
3     6     7     8     8     8    NA    NA    NA
4     6     7     7     7     7    NA    NA    NA
5     3     4     5     6     3    NA    NA    NA
6     7     7     8     8     8     8     8     8
Anoushiravan R
  • Thank you. Is it possible to explain your code a little so that I can extend it with my other rules? – Nic Aug 09 '21 at 11:57
  • Sure, some edits might be needed for rules 2 and 3, but so far it gives your desired results. I will add some notes now. – Anoushiravan R Aug 09 '21 at 12:04
  • Please check my updates and let me know if I need to explain more. – Anoushiravan R Aug 09 '21 at 12:11
  • Thank you for the explanations. The line with ids is there to be sure we have more than 3 y variables. And many thanks for the code and the quick reply. – Nic Aug 09 '21 at 12:35