I have a dataframe like this:

Family            Genus           Species
Gemmatimonadaceae Roseisolibacter Roseisolibacter_agri
Bacillaceae       Bacillus        NA
Blastocatellaceae NA              NA

And I would like to modify it as follows:

Family            Genus                          Species
Gemmatimonadaceae Roseisolibacter                Roseisolibacter_agri
Bacillaceae       Bacillus                       Unclassified Bacillus
Blastocatellaceae Unclassified Blastocatellaceae Unclassified Blastocatellaceae

I was trying to do this:

 replace_na(
  list(Genus = paste("Unclassified",  Family),
       Species = paste("Unclassified",  Genus)))

or using

 replace_na(
  list(Genus = paste("Unclassified",  vars(Family)),
       Species = paste("Unclassified",  vars(Genus))))

But in both cases I end up with "Unclassified Genus" or "Unclassified ~Genus".

How can I make it inherit from the previous known variable?

I also thought of using fill(), but it fills values along a column, whereas here each missing value has to come from the same row. Naturally I could transpose the data.frame but there must be a more elegant/simple solution!

Rob
  • "I could transpose the data.frame but there must be a more elegant/simple solution!": some would say that the tidy solution *is* the elegant solution! ;=) But here, because you're using data from the same row, I think your current format is the one to use. – Limey Jun 24 '21 at 11:59
  • I would like to have it tidy, but the main package I am using for the analyses is phyloseq and this dataframe is imported row-wise :| – Rob Jun 24 '21 at 12:09
  • From the R documentation for taxonomyTable-class {phyloseq}: "An S4 class that holds taxonomic classification data as a character matrix." Description: "Row indices represent taxa, columns represent taxonomic classifiers." – Rob Jun 24 '21 at 12:09
  • Ah, how often we are constrained by the shortsightedness of others! – Limey Jun 24 '21 at 12:15
  • I'm sure this is a duplicate of a recently asked question – AnilGoyal Jun 24 '21 at 12:25

2 Answers

How about mutate & coalesce?

library(dplyr, warn.conflicts = FALSE)

df = data.frame(
    Family = c('Gemmatimonadaceae', 'Bacillaceae', 'Blastocatellaceae'),
    Genus = c('Roseisolibacter', 'Bacillus', NA),
    Species = c('Roseisolibacter_agri', NA, NA))

df %>%  
    mutate(Genus = coalesce(Genus, paste('Unclassified', Family)),
           Species = coalesce(Species, 
                              if_else(grepl('^Unclassified', Genus),
                                      Genus, paste('Unclassified', Genus))))
#>              Family                          Genus
#> 1 Gemmatimonadaceae                Roseisolibacter
#> 2       Bacillaceae                       Bacillus
#> 3 Blastocatellaceae Unclassified Blastocatellaceae
#>                          Species
#> 1           Roseisolibacter_agri
#> 2          Unclassified Bacillus
#> 3 Unclassified Blastocatellaceae

Created on 2021-06-24 by the reprex package (v2.0.0)
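
To see why coalesce() works here, it may help to look at it on plain vectors, outside mutate(). This is only a minimal sketch: the names genus, family and filled are made up for illustration and are not part of the answer above. coalesce() returns, position by position, the first non-missing value among its arguments, so the pasted fallback is only used where genus is NA.

```r
library(dplyr, warn.conflicts = FALSE)

# Illustrative vectors (hypothetical names), mirroring the example data
genus  <- c("Roseisolibacter", "Bacillus", NA)
family <- c("Gemmatimonadaceae", "Bacillaceae", "Blastocatellaceae")

# paste() builds a fallback for every row; coalesce() keeps the original
# genus wherever it is not NA and takes the fallback otherwise
filled <- coalesce(genus, paste("Unclassified", family))
filled
```

Only the third element changes, because that is the only position where genus is missing.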

Ian Gow
BlacKnight
  • I had forgotten about `coalesce`. Good answer. – Limey Jun 24 '21 at 12:47
  • You want `dplyr`, not `tidyr`. As written, your code doubled up "Unclassified". I made edits to your answer. (I also made it easier to read the data on screen.) – Ian Gow Jun 24 '21 at 12:50

I think replace_na() isn't going to give you what you want because of the logic you need to apply to rows with missing Species: exactly how they are unclassified depends on whether or not Genus is missing.

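This seems to give you what you want, except that "Unclassified" is appended after the name rather than placed before it (swap the arguments to paste() to change that):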

df %>% 
  mutate(
    MissingGenus=is.na(Genus),
    Genus=ifelse(MissingGenus, paste(Family, "Unclassified"), Genus),
    Species=ifelse(
             is.na(Species), 
             ifelse(
               MissingGenus, 
               paste(Family, "Unclassified"), 
               paste(Genus, "Unclassified")
             ), 
             Species)
  ) %>% 
  select(-MissingGenus)
# A tibble: 3 x 3
  Family            Genus                          Species                       
  <chr>             <chr>                          <chr>                         
1 Gemmatimonadaceae Roseisolibacter                Roseisolibacter_agri          
2 Bacillaceae       Bacillus                       Bacillus Unclassified         
3 Blastocatellaceae Blastocatellaceae Unclassified Blastocatellaceae Unclassified
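
The same row-wise logic could also be sketched with dplyr's case_when() instead of nested ifelse() calls. This is an alternative formulation, not the code above: it puts "Unclassified" in front of the name, as the question asked, and it rebuilds the example data (df) and stores the result in a made-up name (res) so that the block runs on its own.

```r
library(dplyr, warn.conflicts = FALSE)

# Example data rebuilt here so the sketch is self-contained
df <- data.frame(
  Family  = c("Gemmatimonadaceae", "Bacillaceae", "Blastocatellaceae"),
  Genus   = c("Roseisolibacter", "Bacillus", NA),
  Species = c("Roseisolibacter_agri", NA, NA)
)

res <- df %>%
  mutate(
    # Species is computed first, while Genus still holds its original NAs,
    # so no helper column such as MissingGenus is needed
    Species = case_when(
      !is.na(Species) ~ Species,                        # keep known species
      is.na(Genus)    ~ paste("Unclassified", Family),  # genus unknown too
      TRUE            ~ paste("Unclassified", Genus)    # only species unknown
    ),
    Genus = if_else(is.na(Genus), paste("Unclassified", Family), Genus)
  )
res
```

The order of the two assignments matters: if Genus were filled in first, the is.na(Genus) test for Species would never be true.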

A small request: next time, please dput() your example data rather than posting images or tables. It makes testing a solution much easier.
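
For example, a minimal sketch of what that looks like, assuming the data frame is called df (the names code and df2 are only for illustration): dput() prints R code that rebuilds the object, which can be pasted straight into a question.

```r
# Example data, assumed to be called df
df <- data.frame(
  Family  = c("Gemmatimonadaceae", "Bacillaceae", "Blastocatellaceae"),
  Genus   = c("Roseisolibacter", "Bacillus", NA),
  Species = c("Roseisolibacter_agri", NA, NA)
)

# dput() writes out R code for the object; capture it as text here
code <- capture.output(dput(df))
cat(code, sep = "\n")  # this is the text to paste into the question

# Anyone can turn that text back into an equivalent data frame
df2 <- eval(parse(text = code))
```

In practice, running dput(df) at the console and copying the printed output is enough; the capture and eval steps are only there to show the round trip.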

Limey