I have one data set that lists all the metabolite IDs from KEGG, and another data set with the metabolite IDs that I have found in my samples. The goal is to use the IDs I've found to select the matching rows from the KEGG data frame, and only those rows.

This may seem trivial, but my data contains only the IDs, not the molecule names, while the KEGG data includes the molecule names. I need the molecule names to do further research, and figuring this out would save me hours. I've tried the filter and mutate commands; my code is below. I am pretty new to R, so maybe this approach would work and I've just botched it somewhere.

We would have two data frames like this:

kegg_data <- data.frame("ID" = c("C00001", "C00002", "C00003", "C00004"),
                        "molecule" = c("H2O", "ATP", "NAD", "NADH"))

my_data <- data.frame("ID" = c("C00002", "C00004"))                         

Obviously, there would be many more IDs in both data sets.

Here is the code I have tried:

your_kegg_IDs <-  kegg_data %>%
  filter(my_data == my_data$ID)

Running the filter command gives this error: Error in filter_impl(.data, quo) : Evaluation error: level sets of factors are different.

Honestly, I do not know if I am on the right track here. Any help is appreciated. The ideal result would be a data frame that contains only the IDs I've found, together with their molecule names.

  • It looks like your string variables are factors, where you probably want them as characters. How are you importing the data? There is probably an argument that you can set like this: `stringsAsFactors = F`. This might fix your error. I would do this with a `join` operation from `dplyr`. – Paul Jul 03 '19 at 04:56
  • What is your expected output? Do you need `kegg_data[kegg_data$ID %in% my_data$ID , ]` ? – Ronak Shah Jul 03 '19 at 04:57
  • Please be consistent. The variables you provided are named `kegg_data` and `my_data`, but in your attempt we see `kegg_compound` and `matched_compounds`, along with column names that are absent from the data. – nicola Jul 03 '19 at 04:57

1 Answer

Not sure I understand, but why can't you just subset kegg_data to the rows whose ID appears in your data?

my_final_data <- subset(kegg_data, kegg_data$ID %in% my_data$ID)

my_final_data
      ID molecule
2 C00002      ATP
4 C00004     NADH
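
If you would rather stay in dplyr, as in the attempt in the question, the same selection can be sketched as below. This is only a minimal sketch on toy data modelled on the question: the names `by_filter` and `by_join` are made up for illustration, and `stringsAsFactors = FALSE` is set on the assumption that the "level sets of factors are different" error comes from comparing two factor columns with different levels.

```r
library(dplyr)

# Toy data modelled on the question; the ID columns are kept as character
# rather than factor (assumption: the real data can be read in the same way).
kegg_data <- data.frame(
  ID = c("C00001", "C00002", "C00003", "C00004"),
  molecule = c("H2O", "ATP", "NAD", "NADH"),
  stringsAsFactors = FALSE
)
my_data <- data.frame(ID = c("C00002", "C00004"), stringsAsFactors = FALSE)

# Option 1: keep the kegg_data rows whose ID occurs anywhere in my_data$ID.
by_filter <- kegg_data %>%
  filter(ID %in% my_data$ID)

# Option 2: a filtering join; keeps the matching kegg_data rows and adds
# no columns from my_data.
by_join <- semi_join(kegg_data, my_data, by = "ID")
```

Both objects should hold the same rows as `my_final_data` above. If your own data frame has extra columns (intensities, sample names) that you want next to the molecule names, `inner_join(my_data, kegg_data, by = "ID")` would be the variant to try.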
FGirosi