0

I have a dataframe (df1) that includes abundances of different species in each sample:

> SampleID   Sp1   Sp2   Sp3   Sp4   ... Spn
> asb-001      3     0     0    23         9
> asb-002      4    15    10    56        98
> asb-003      8    45     8   453         0
> asb-004      0     5     0     3         6
> asb-005    120    56     0     0         0
...

Each column represents a different species.

I have another dataframe (df2)

Sp     Fam
Sp1   Fam1
Sp2   Fam2
Sp3   Fam1
Sp4   Fam3
Sp5   Fam2
Sp6   Fam1
...

There are fewer Family names than there are Species names.

I would like to add columns together depending on the species name to get a total for the family (e.g. Sp1 + Sp3 + Sp6 = Fam1). I don't want to keep the original column with the species name.

If all goes according to plan, my new dataframe (df3) will look something like this:

> SampleID  Fam1  Fam2  Fam3 
> asb-001     12     0     9 
> asb-002     14    18   112 
> asb-003     28    58    18
> asb-004     10    12    10
> asb-005    142    65     0
...

I could manually go through and add individual species together, but this seems tedious for a large dataset, and likely to produce errors. I have the feeling that I should melt df1, but I'm not confident in the details. Any advice would be appreciated!

ayesha
  • 135
  • 15
  • You probably want to `gather` df1, and then do a left join on `Sp`. Start with something like `tidyr::gather(df1, key = 'Sp', value = 'value', 2:n) %>% dplyr::left_join(df1, df2, by = 'Sp')`. – C. Braun Feb 12 '18 at 18:20

2 Answers2

4

Change your original data to be in the long format and join against the species-family mapping. You can then compute summation for each family using group_by. Finally, spread the data back out to be in wide format.

library( tidyverse )

df1 %>% gather( Sp, Value, -SampleID ) %>%     # Convert to long format
  inner_join( df2 ) %>%                        # Combine with family mapping
  group_by( SampleID, Fam ) %>%                # Work on each sample/family pair
  summarize( ValSum = sum(Value) ) %>%         # Compute the sum across species
  ungroup %>% spread( Fam, ValSum, fill=0 )    # Convert back to wide format

Depending on whether each sample is represented in each family, you may get NA after converting back to wide format. The optional parameter fill=0 takes care of converting these NA to 0.

Artem Sokolov
  • 13,196
  • 4
  • 43
  • 74
0

Another way, using library(data.table):

setDT(df1); setDT(df2) # Convert data.frames into data.tables

x = df2[melt(df1, variable.name = 'Sp'), on = 'Sp'] # Join melted df1 to df2 on species
df3 = dcast(x, SampleID~Fam, fun.aggregate = sum) # cast to wide format by summing total values per family
dww
  • 30,425
  • 5
  • 68
  • 111