2

I have two data frames in R:

df1

    Site_code  Species_code  Abundance
    1           MRN          50
    1           TFP          100
    2           MRN          5
    2           XNP          20
    2           AMP          15

and in df2 I have a bunch of information, as well as the site code, but there is only one row for each site. The other columns are not of interest.

    Site_code   latitude   mean_temp ...etc
    1               55          15
    2               56          10 

I want to create a new dataframe (or matrix) with one row for each site in df2, based on the information in df1, with one column for each species, and with the column name the same as the species code, and the information on the abundance of each species in the column, with a 0 value if the species was not recorded at the site, so that I would get a df like:

    Site_code   AMP  MRN    TFP   XNP
    1           0     50     100  0  
    2           15    5      0    20

I also want to do this several times, since I have many df2's, and I want to make up a new dataframe for each of them.

I've read through many questions on SO and have not come across one that answers this. If it has already been answered, I would really appreciate being pointed in the right direction.

Frieda
  • 65
  • 6

2 Answers2

3

R has several functions for reshaping data from long format to wide, including reshape in the pre-installed stats package and dcast in the reshape2 package. In my view, spread from tidyr package has the most intuitive syntax:

library(tidyr)
spread(df1, Species_code, Abundance, fill = 0)

data

df1 <- read.table(text = 
"Site_code  Species_code  Abundance
    1           MRN          50
    1           TFP          100
    2           MRN          5
    2           XNP          20
    2           AMP          15",
header = TRUE)
davechilders
  • 8,693
  • 2
  • 18
  • 18
1

You can use the dcast function from reshape2:

library(reshape2)
df2 <- dcast(df1, Site_code ~ Species_code, fill = 0)

df2
# Site_code AMP MRN TFP XNP
#         1   0  50 100   0
#         2  15   5   0  20

Short and simple.

You could also use reshape from the stats package, which requires no external libraries.

# Transpose the data frame by site
df2 <- reshape(df1,
               idvar = "Site_code",
               timevar = "Species_code",
               direction = "wide")

# Reset NA values to 0
df2[is.na(df2)] <- 0

# Remove "Abundance." from the column names
colnames(df2) <- gsub("Abundance.", "", colnames(df2))

df2
# Site_code MRN TFP XNP AMP
#         1  50 100   0   0
#         2   5   0  20  15

The reshape function fills values not present in the original, untransposed dataset with NA, so these must be manually reset to 0.

It also attaches the name of the transposed variable to the new column names, but that can be removed using gsub.

Alex A.
  • 5,466
  • 4
  • 26
  • 56
  • After accepting the answer from DMC, I noticed on closer inspection of my df that was produced that the site_code had multiple entries for the same number. The code using reshape2 above from Alex A. worked fine. – Frieda May 05 '15 at 20:46
  • To finish off the project, as I said I had many files to make based on sub-sets of sites, I made the following loop using df2 as described above, and several other df's: `for (Species in CODES){ Tab<-read.table(file=c(paste0("F:/Data/",Species,".txt")),sep=",",header=TRUE) set<-df2[df2$Site_code %in% intersect (df2$Site_code,Tab$Site_code),] write.table(set,file=c(paste0("F:/Data/ABUN/",Sp_name,".txt")), sep=",", col.names=TRUE, append=FALSE) }` – Frieda May 05 '15 at 21:17