3

Edit: I changed Var4 to a string value as my question was not precise enough about my data and therefore answers were failing because of invalid types. Sorry for that

this is my first question here and I hope someone can help me.

I have the following data set:

ID Date N_Date Var1 Var2 Var3 Var4 type
1 4.7.22 50000 12 NA NA NA normal
1 4.7.22 50000 NA 23 NA NA normal
1 4.7.22 50000 NA NA 5 NA normal
1 4.7.22 50000 NA NA NA asd normal
2 4.7.22 50000 NA 2 NA NA normal
3 5.7.22 20000 7 NA NA NA normal

My goal is to have just one row for each ID. So what I want R to do, is to shift the Var column values for each ID up or somehow combine them. As you can see, at the moment, there is never more than one value in a Var column for each row. So it should be easy to rewrite the NAs with the corresponding "real value". I also found similiar questions but the answer did not help in my case:

How to combine rows with the same identifier R?

I think the problem in my case is, that I have columns like "date", "N_date" (which is the number of observations on that date) and "type". In these cases my code should see, that it is exactly the same value for the corresponding ID, and just take the first value for example.

So that in the end I just have 3 rows with same number of columns, containing all information.

Thank you very much for anyone who has an idea how to solve this.

YodagamaHeshan
  • 4,996
  • 2
  • 26
  • 36
Aisberg
  • 35
  • 1
  • 5
  • Welcome to SO. If you are *certain* that you have only one non-missing value for each ID in each column, summarising is probably your simplest option, as indicated in the post you have already found. Please indicate why that solution did not work for you. I believe it should, based on the information you have provided. Note that `summarise_all` has been superseded by `summarise(across(...))` in the latest versions of dplyr. I'm voting to close as a duplicate, but will retract if you explain why it isn't. – Limey Jul 04 '22 at 07:44
  • If you try this `df[is.na(df)] <- 0 df %>% group_by(ID) %>% summarise_if(is.numeric, sum, na.rm = TRUE) %>% distinct()`, and you will get the result expected. – Isaac Jul 04 '22 at 07:52
  • I am not sure but using `na.omit` as provided in the as duplicate classified answer will not solve the question by the OP (Their themselves indicate the answer on that site). If this is true (and again I am not sure), then this is not a duplicate! – TarJae Jul 04 '22 at 08:04
  • Hello Limey, when I try to run the code from the solution of the other question, than it does not work for me. I get the error "Error in summarise(): Problem while recycling "type = (function (object, ...) ....". Type must be size 0 or 1, not 2. An earlier column had size 0" – Aisberg Jul 04 '22 at 08:34
  • @IsaacBravo thank you for your reply. Unfortunately if I want to perform df<- 0 I get an error message saying: "Assigned data must be compatible with existing data. Cant convert double to character." The error occured for a var column that is not numeric but string. Sorry this is my fault as I have not made it clear in the question. Do you know a work around? – Aisberg Jul 04 '22 at 08:41
  • @Aisberg you need to do this `df[is.na(df)] <- 0` and not `df <- 0`, otherwise you will have problems with the code. – Isaac Jul 04 '22 at 08:59
  • sorry for the misunderstanding. I did df[is.na(df)] <- 0 and not df <- 0. Still the same error message (i believe because var4 is not a numeric but contains string). What I also checked now is to remove all string columns just to test if the code (df[is.na(df)] <- 0 df %>% group_by(ID) %>% summarise_if(is.numeric, sum, na.rm = TRUE) %>% distinct()) would work. I dont get an error message anymore but besides changing NAs to 0, it has not reduced the number of columns in my dataset – Aisberg Jul 04 '22 at 09:41

1 Answers1

3

Something like this: Here we first group for all except the Var variables, then we use summarise(across... as suggested by @Limey in the comments section. Main feature is to use na.rm=TRUE:

library(dplyr)

df %>% 
  group_by(ID, Date, N_Date, type) %>% 
  summarise(across(starts_with("Var"), ~sum(., na.rm = TRUE)))
     ID Date   N_Date type    Var1  Var2  Var3  Var4
  <int> <chr>   <int> <chr>  <int> <int> <int> <int>
1     1 4.7.22  50000 normal    12    23     5    54
2     2 4.7.22   4000 normal     0     2     0     0
3     3 5.7.22  20000 normal     7     0     0     0
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • Thank you very much TarJae. Unfortunately my data is not labeled like "var1" "var2" etc. but has very different names. In the "help" I found the command "contains" and tried to use your code with "contains" rather than "starts_with". But than I get an error at the first observation and code stops. Again, it is because of the character type of one of my Var which is not numeric because error says: "error in sum(), invalid type .... – Aisberg Jul 04 '22 at 08:59
  • Also here, I just deleted the string Var column. Than your code runs perfectly with "contains" rather than "starts_with". For the moment this solves my issue, as the string column is less important for me and I dropped it. Still, if you also know a way how it would work with string columns, i would be very thankfull. – Aisberg Jul 04 '22 at 10:03
  • 1
    Instead of `summarise(across(starts_with("Var"), ~sum(., na.rm = TRUE)))` you could use `summarise(across(c("colx", "coly", "colz"), ~sum(., na.rm = TRUE)))` – TarJae Jul 04 '22 at 10:47