
I have a dataframe that looks like this.

input dataframe

position,mean_freq,reference,alternative,sample_id
1,0.002,A,C,name1
2,0.04,G,T,name1
3,0.03,A,C,name2

These data are nucleotide differences at given positions in a hypothetical genome. mean_freq is the frequency of the alternative nucleotide relative to the reference, so the first row means the proportion of C's is 0.002, implying the A's are at 0.998.

I want to transform this to a different structure by creating new columns such that,

desired_output

position,G,C,T,A,sampleid
1,0,0.002,0,0.998,name1
2,0.96,0,0.04,0,name1
3,0,0.03,0,0.97,name2

I have attempted this approach

per_position_full_nt_freq <- function(x){
  df <- data.frame(A=0, C=0, G=0, T=0)
  idx <- names(df) %in% x$alternative
  df[,idx] <- x$mean_freq
  idx2 <- names(df) %in% x$reference 
  df[,idx2] <- 1 - x$mean_freq
  df$position <- x$position
  df$sampleName <- x$sampleName
  return(df)
}

desired_output_dataframe <- per_position_full_nt_freq(input_dataframe)

I ran into an error

In matrix(value, n, p) :
  data length [8905] is not a sub-multiple or multiple of the number of columns 

Additionally, I feel there has to be a more intuitive solution, presumably using tidyr or dplyr. How do I conveniently transform the input dataframe to the desired output dataframe format?

Thank you.

eastafri

2 Answers


One option is to create a matrix of 0s with the column names 'G', 'C', 'T', 'A', match the 'alternative' and 'reference' values against those column names, use the resulting row/column indices to assign the values, and then cbind the matrix with the 'position' and 'sample_id' columns of the original dataset.

m1 <- matrix(0, ncol=4, nrow=nrow(df1), dimnames = list(NULL, c("G", "C", "T", "A")))
m1[cbind(seq_len(nrow(df1)), match(df1$alternative, colnames(m1)))]  <-  df1$mean_freq
m1[cbind(seq_len(nrow(df1)), match(df1$reference, colnames(m1)))]  <-  1 - df1$mean_freq
cbind(df1['position'], m1, df1['sample_id'])
#   position    G     C    T     A sample_id
#1        1 0.00 0.002 0.00 0.998     name1
#2        2 0.96 0.000 0.04 0.000     name1
#3        3 0.00 0.030 0.00 0.970     name2
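
As a quick sanity check on the matrix-indexing idea, each row should sum to 1, because it holds exactly one reference and one alternative frequency. The sketch below rebuilds the toy data from the question in base R; the names `nt`, `rows`, and `row_totals` are introduced here for illustration only.

```r
# Toy input copied from the question; df1 is the name used in the answer.
df1 <- data.frame(
  position    = 1:3,
  mean_freq   = c(0.002, 0.04, 0.03),
  reference   = c("A", "G", "A"),
  alternative = c("C", "T", "C"),
  sample_id   = c("name1", "name1", "name2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper names: nt (column order) and rows (row indices).
nt   <- c("G", "C", "T", "A")
rows <- seq_len(nrow(df1))
m1   <- matrix(0, nrow = nrow(df1), ncol = length(nt), dimnames = list(NULL, nt))

# A two-column index matrix assigns one cell per row: (row, matched column).
m1[cbind(rows, match(df1$alternative, nt))] <- df1$mean_freq
m1[cbind(rows, match(df1$reference, nt))]   <- 1 - df1$mean_freq

# Every row should add up to 1 (compared with a numeric tolerance).
row_totals <- rowSums(m1)
stopifnot(isTRUE(all.equal(row_totals, rep(1, nrow(df1)))))
```

If the check fails on real data, a likely cause is a nucleotide code that is not among the four column names, since `match()` then returns NA.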
akrun

The following should do the trick:

library(readr)
library(dplyr)
library(tidyr)

input_df <- read_csv(
  'position,mean_freq,reference,alternative,sample_id
  1,0.002,A,C,name1
  2,0.04,G,T,name1
  3,0.03,A,C,name2'
)

input_df %>%
  mutate(ref_val = 1 - mean_freq) %>%
  spread(alternative, mean_freq, fill=0) %>%
  spread(reference, ref_val, fill=0) %>%
  select( position, G, C, T, A, sample_id )

One assumption here is that the alternative and reference are distinct; otherwise you will get two columns with the same name but different values. If need be, handle that case with a couple of commands at the beginning of your code.
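
The distinctness assumption can be checked before reshaping. This is a minimal base R sketch, not part of the pipeline above; the fourth row and the names `same_allele` and `clean_df` are hypothetical and exist only to show the check firing.

```r
# Toy input from the question plus one hypothetical bad row
# in which reference and alternative are the same nucleotide.
input_df <- data.frame(
  position    = c(1, 2, 3, 4),
  mean_freq   = c(0.002, 0.04, 0.03, 0.03),
  reference   = c("A", "G", "A", "A"),
  alternative = c("C", "T", "C", "A"),
  sample_id   = c("name1", "name1", "name2", "name15"),
  stringsAsFactors = FALSE
)

# Flag rows where the two alleles coincide.
same_allele <- input_df$reference == input_df$alternative

if (any(same_allele)) {
  message(sum(same_allele), " row(s) have identical reference and alternative")
}

# One possible policy: drop the offending rows before reshaping.
clean_df <- input_df[!same_allele, ]
```

Dropping rows is only one choice; depending on the data, such rows may instead point to an upstream error worth investigating.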

Plamen Petrov
  • reference and alternative cannot be distinct since you have multiple sample_ids and positions. That would be an erroneous assumption. The only distinct values are the combinations of sample_id and position, since we can constrain one position per sample. – eastafri Nov 11 '17 at 09:06
  • I meant that the alternative and reference need to be distinct values row-wise, i.e. no rows like 4, 0.03, A, A, name15. Such a row will return two A columns, so you need to work out which was which yourself. If alternative and reference are different on each row, the above should work. – Plamen Petrov Nov 11 '17 at 09:27
  • OK and many thanks. I seem to get an exception "Error: Data source must be a dictionary", when I provide the actual data. – eastafri Nov 11 '17 at 09:52
  • Well, this might be due to a number of factors. But I would venture to say it is probably caused by the select statement at the end, which just rearranges the columns in the order you wanted. Either write `T` with backticks around it, since T is a built-in symbol for TRUE, or skip the select statement and reorder the columns with a native R statement. – Plamen Petrov Nov 11 '17 at 10:19
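
The "native R statement" suggested in the last comment could look like the sketch below, which reorders columns by a character vector of names so that `T` is never evaluated as a symbol. Whether this resolves the reported error is not established; the data frame `wide` is a hypothetical stand-in for the reshaped result, and `wanted` and `result` are names introduced here.

```r
# Hypothetical stand-in for the reshaped data, with columns in arbitrary order.
wide <- data.frame(
  position  = 1:3,
  sample_id = c("name1", "name1", "name2"),
  C = c(0.002, 0, 0.03),
  T = c(0, 0.04, 0),
  A = c(0.998, 0, 0.97),
  G = c(0, 0.96, 0),
  stringsAsFactors = FALSE
)

# Column names given as strings, so "T" is just a name, not the TRUE alias.
wanted <- c("position", "G", "C", "T", "A", "sample_id")
result <- wide[, wanted]
```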