Manipulation of tab delimited file for top 5 values based and printing adjacent columns values with it

Question

I have a tab delimited file abc.txt

contig  score   guide
1:100-101   7   AAA
1:100-101   6   BBB
1:100-101   5   CCC
1:100-101   4   DDD
1:100-101   3   EEE
1:100-101   2   FFF
1:100-101   1   GGG
1:100-101   90  HHH
1:100-101   111 III
1:100-101   1111    JJJ
1:200-203   503.5333333 KKK
1:200-203   570.7212121 LLL
1:200-203   637.9090909 MMM
1:200-203   705.0969697 NNN
1:200-203   772.2848485 OOO
1:200-203   839.4727273 PPP
1:200-203   906.6606061 QQQ
1:200-203   973.8484848 RRR
2:300-301   1041.036364 SSS
2:300-301   1108.224242 TTT
2:300-301   1175.412121 UUU
2:300-301   1242.6  VVV
2:300-301   1309.787879 ABC
2:300-301   1376.975758 CGA
2:300-301   1444.163636 ACD

Column 1-Contig has multiple repeat values, column two has scores and column three has guide letters corresponding to column-2 scores. I need to select top 5 scores for the similar values in column one (contig) and print there corresponding column 3 values.

Output should look like this, with first column having the unique column 1-Contig entry and next 10 rows for the top 5 scores and corresponding column-3 guide letters

    Score-1 Guide-1 Score-2 Guide-2 Score-3 Guide-3 Score-4 Guide-4 Score-5 Guide-5
1:100-101   1111    JJJ 111 III 90  HHH 7   AAA 6   BBB
1:200-203   973.8484848 RRR 906.6606061 QQQ 839.4727273 PPP 772.2848485 OOO 705.0969697 NNN
2:300-301   1444.163636 ACD 1376.975758 CGA 1309.787879 ABC 1242.6  VVV 1175.412121 UUU

I used "dplyr" and "desctools" packages, however I am running with some error.

library(dplyr)
library(DescTools)
file <- "abc.txt"
x=read.table(file)
b <- Large(x, k=5, unique = FALSE, na.last=NA)

and getting this error

Error in Large(x, k = 5, unique = FALSE, na.last = NA) : 
  Not compatible with requested type: [type=character; target=double].

I was manged to do this in excel using 'sumproduct, large, iferror and vllokup' formulas, however for large datasets I want to extract file using R.

Any help will be much appreciated

Philippe Poirier · Accepted Answer · 2019-06-21T20:14:46.953

1

The problem is large expects a numeric vector, not an entire dataframe. This is just a guess since I dont have a reproducible example, but you might want to do something along these lines:

library(dplyr)
library(DescTools)
file <- "./abc.txt"
x=read.table(file)
colnames(x)<-c("contig","score","guide")
x<-x[-1,]

list <- split(x , f = x$contig )
columntitles<-c()
for (i in 1:5)
  columntitles<-c(columntitles,paste0("guide-",i),paste0("score-",i))
x = data.frame(matrix(NA, nrow = 1, ncol = 10)) 
colnames(x)<-columntitles

for (i in 1:3){
  singlerow<-c()
  partialdata<-list[[i]]
  partialdata<-partialdata%>% top_n(5, score)
  partialdata<-partialdata[Rev(order(partialdata$score)),]
  for (j in 1:5){
    singlerow<-c(singlerow,toString(partialdata$guide[j]),toString(partialdata$score[j]))

  }
  x<-rbind(x,singlerow)
}
x<-x[-1,]

edited Jun 21 '19 at 20:14

answered Jun 21 '19 at 17:07

Philippe Poirier

56
3

I am having difficulty in getting dataframe wherein I have to account for similar contigs in column 1, as for each unique contig value i need those top 5 values. – RonicK Jun 21 '19 at 18:42
In that case you want to first split them in a list according to contig, `list <- split( x , f = x$contig )` Then, you want to apply `x<-x %>% top_n(5, score)` to each element in the list with a loop. Finally, you want to merge them all together (cbind or rbind). – Philippe Poirier Jun 21 '19 at 18:54
If I loop it like `for (i in list) { x<-x %>% top_n(5, score)}` , getting following error `Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "NULL"` – RonicK Jun 21 '19 at 20:03
I've edited my original answer to provide you with a full script of what I meant. It's probably not the most line-efficient way of doing this, but it should be fairly simple to understand. You might need to change a few things, but it gives the exact output you need on my side. – Philippe Poirier Jun 21 '19 at 20:16
Thanks! However it doesn't display the correponding contig in the output file. – RonicK Jun 21 '19 at 21:37

Andri Signorell · Answer 2 · 2019-06-23T12:24:54.367

There's no need to search for packages that solve this problem - base R has everything you need.

First we have to deal with the groupwise operation, so we split the data frame according to the groupvariable contig by means of the function split(). We get a list of data frames. In order to apply our operations to each of these we can use lapply(), which applies the given function to all the list elements. We build a function which selects the first 5 rows of the provided data frame x, ordered by score (descending with -x$score). We assign the result to z. The following do.call(data.frame, split…) structure only reshapes the result to the intended form.

The only thing that remains is to pack the list to a data frame and set the columnnames.

ll <- lapply(split(abc, abc$contig), function(x) {
      z <- x[order(-x$score)[1:5], 2:3]
      do.call(data.frame, split(z, 1:5))
    })

ll         # we're almost there …

$`1:100-101`
  X1.score X1.guide X2.score X2.guide X3.score X3.guide X4.score X4.guide X5.score X5.guide
10     1111      JJJ      111      III       90      HHH        7      AAA        6      BBB

$`1:200-203`
   X1.score X1.guide X2.score X2.guide X3.score X3.guide X4.score X4.guide X5.score X5.guide
    18 973.8485      RRR 906.6606      QQQ 839.4727      PPP 772.2848      OOO  705.097      NNN

$`2:300-301`
   X1.score X1.guide X2.score X2.guide X3.score X3.guide X4.score X4.guide X5.score X5.guide
25 1444.164      ACD 1376.976      CGA 1309.788      ABC   1242.6      VVV 1175.412      UUU

# … only pack the list to a data frame and set the columnnames
d.frm <- (do.call(rbind, ll))
colnames(d.frm) <- c(paste("Guide", 1:5), paste("Score", 1:5))[as.vector(t(matrix(1:10, nrow=5)))]

d.frm

            Guide 1 Score 1   Guide 2 Score 2   Guide 3 Score 3   Guide 4 Score 4  Guide 5 Score 5
1:100-101 1111.0000     JJJ  111.0000     III   90.0000     HHH    7.0000     AAA    6.000     BBB
1:200-203  973.8485     RRR  906.6606     QQQ  839.4727     PPP  772.2848     OOO  705.097     NNN
2:300-301 1444.1636     ACD 1376.9758     CGA 1309.7879     ABC 1242.6000     VVV 1175.412     UUU

Manipulation of tab delimited file for top 5 values based and printing adjacent columns values with it

2 Answers2