I have two dataframes: one has gene names and their counts, and the second has the gene names and their ontological names. I want to replace the gene names in df1 with the names they are associated with in df2.

Sample data:

df1 <- data.frame(ID=c("gene1","gene2","gene4"), sample1=c(1,0,50), sample2=c(0,0,0), sample3=c(45,56,11))
rownames(df1) <- df1$ID
df1$ID <- NULL

> df1
      sample1 sample2 sample3
gene1       1       0      45
gene2       0       0      56
gene4      50       0      11

df2 <- data.frame(ID=c("gene1","gene2","gene3", "gene4"), name=c("hr1","gene2","exoc like exoc1 in drosophila", "ftp"), desc=c("protein","unknown","fake immunity known for fighting viruses", "like ftp1"))
rownames(df2) <- df2$ID
df2$ID <- NULL

> df2
                               name                                     desc
gene1                           hr1                                  protein
gene2                         gene2                                  unknown
gene3 exoc like exoc1 in drosophila fake immunity known for fighting viruses
gene4                           ftp                                like ftp1

What I want is for the row names of df1 to be replaced by the corresponding values in the "name" column of df2. df2 has every gene ID as a row name, with its ontological name in the first column; some of those genes are missing from df1.

Expected output:

> df1.new
      sample1 sample2 sample3
hr1       1       0      45
gene2     0       0      56
ftp      50       0      11

I'm not familiar enough with the tidyverse to do this, and the difficulty is that, given how my dataframes are loaded, what I need to update is the index (the row names) rather than a regular column. I've tried adapting the only similar question I could find (R - replace specific values in df with values from other df by matching row names), but that one replaces cell values, whereas I need to update the row names.

I've tried variations of:

df1 <- df1[na.omit(match(rownames(df1), df2$name)),] # throws an error

library(dplyr)
library(tibble)
rownames_to_column(df1) %>% rows_update(df2 %>% rownames_to_column(df1), by ="rowname") %>% column_to_rownames(df1) # Error, Names repair functions must return a character vector

I'm having trouble because what I want to match and update is the index, using a column from a second data frame.

3 Answers

4

Another one (btw, your code does not match the dataframes):

> map = df2$name
> names(map) = rownames(df2)
> df1.new = df1
> rownames(df1.new) = map[rownames(df1)]
> df1.new
      sample1 sample2 sample3
hr1         1       0      45
gene2       0       0      56
ftp        50       0      11
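
The named vector above works as a lookup table from gene ID to name. If a row name of df1 had no entry in df2, `map[rownames(df1)]` would give NA for it and the row-name assignment would fail. Below is a minimal sketch of one possible fallback, assuming df1 has rows gene1, gene2 and gene4 as printed in the question, and df2 is built as in the question's code (only the name column is kept here). The `gene5` row and the helper names `ids` and `new_names` are hypothetical, added only to show the fallback.

```r
# df1 as printed in the question, plus a hypothetical gene5 that df2 does not contain
df1 <- data.frame(sample1 = c(1, 0, 50, 7),
                  sample2 = c(0, 0, 0, 0),
                  sample3 = c(45, 56, 11, 3),
                  row.names = c("gene1", "gene2", "gene4", "gene5"))
df2 <- data.frame(name = c("hr1", "gene2", "exoc like exoc1 in drosophila", "ftp"),
                  row.names = c("gene1", "gene2", "gene3", "gene4"),
                  stringsAsFactors = FALSE)

# Lookup table: gene ID -> name
map <- setNames(df2$name, rownames(df2))

ids <- rownames(df1)
new_names <- unname(map[ids])

# Keep the original ID wherever df2 has no entry for it
missing <- is.na(new_names)
new_names[missing] <- ids[missing]

df1.new <- df1
# Row names must be unique; make.unique() appends a suffix to any duplicates
rownames(df1.new) <- make.unique(new_names)
df1.new
```

Whether falling back to the original ID is appropriate depends on the data; dropping unmatched rows or stopping with an error may be better choices.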
777moneymaker
  • This worked! Actually quite simple broken down like this. Thank you. The solution below also worked but did end up requiring a bit more cleaning of the final dataframe. – Katherine Chau Aug 10 '22 at 20:15
2

The code you have to create df1 and df2 does not match the df1 and df2 that you show, but here is a way to get the result column I think you want--you can then remove any columns you don't want.

library(dplyr)
library(tibble)
df1 %>%
  rownames_to_column(var = "gene") %>%
  left_join(
    df2 %>% rownames_to_column(var = "gene"),
    by = "gene"
  ) %>%
  mutate(result = ifelse(desc == "unknown", gene, desc))
#    gene sample1 sample2 sample3                          name                                     desc
# 1 gene1       1       0      45                           hr1                                  protein
# 2 gene2       0       0      56                Unknown origin                                  unknown
# 3 gene3      50       0      11 exoc like exoc1 in drosophila fake immunity known for fighting viruses
#                                     result
# 1                                  protein
# 2                                    gene2
# 3 fake immunity known for fighting viruses
Gregor Thomas
  • While this does add the columns and the proper cells from df2 to df1, this isn't the final output I need. I need the row names to update. I guess from this I could push the "name" column into the index, and then delete the last column but I thought there would be a cleaner way to do this in one step. – Katherine Chau Aug 10 '22 at 20:07
  • dplyr/tidyverse mostly prefers not to use row names at all, so if you want to use tidyverse there's probably not much of a better way. In base R, you could probably use `match` and update them more directly. – Gregor Thomas Aug 10 '22 at 20:09
  • And oops yes I updated my code to make the proper dataframes. – Katherine Chau Aug 10 '22 at 20:11
  • Yes you're right - match worked well for this problem. Thank you for your help! – Katherine Chau Aug 10 '22 at 20:15
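
The base R `match` approach mentioned in these comments could look like the sketch below. It assumes df1 has rows gene1, gene2 and gene4 as printed in the question, df2 is built as in the question's code (only the name column is kept here), and every row name of df1 occurs in `rownames(df2)`. The name `idx` is illustrative.

```r
df1 <- data.frame(sample1 = c(1, 0, 50),
                  sample2 = c(0, 0, 0),
                  sample3 = c(45, 56, 11),
                  row.names = c("gene1", "gene2", "gene4"))
df2 <- data.frame(name = c("hr1", "gene2", "exoc like exoc1 in drosophila", "ftp"),
                  row.names = c("gene1", "gene2", "gene3", "gene4"),
                  stringsAsFactors = FALSE)

# Position of each df1 row name within the row names of df2
idx <- match(rownames(df1), rownames(df2))

# Stop early if a gene in df1 has no entry in df2
stopifnot(!anyNA(idx))

df1.new <- df1
# Use those positions to pick the new names from df2$name
rownames(df1.new) <- df2$name[idx]
df1.new
```

Compared with the attempt in the question, the lookup here is against `rownames(df2)` rather than `df2$name`, and the resulting positions are used to pick names from df2 rather than to subset df1.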
1

Here is a slightly modified version of @Gregor Thomas's answer:

library(tibble)
library(dplyr)

left_join(df1 %>% 
            rownames_to_column("gene"), 
          df2 %>% 
            rownames_to_column("gene"), 
          by="gene") %>% 
  column_to_rownames("name") %>% 
  select(starts_with("sample"))
      sample1 sample2 sample3
hr1         1       0      45
gene2       0       0      56
ftp        50       0      11
TarJae