
I have a large data frame with 4 columns and many rows (a small example is shown below).

#what I have
Arm <- c("5prime","3prime","5prime","CoMature","3prime","5prime","3prime","3prime")
Family <- c("LET-7","LET-7","LET-7","MIR-10","MIR-103","MIR-124","MIR-124","MIR-124")
Sequence <- c("ATCGGCA","ATGCTAC","ATCGGCA","ATCGTTT","TGAGGAG","TGATCAG","AATTCAG","AATTCAG")
Star_seq <- c("TTCAGGT","TATACTG","TTCAGGT","GAGATCA","CAAAAGC","CACATGC","AATATGC","AATATGC")
my_data_frame <- data.frame(Arm,Family,Sequence,Star_seq)

For each distinct value in the Family column, I want to count how many times '5prime', '3prime' and 'CoMature' occur in the Arm column. Then, for the most frequent arm in that family, I want to keep the corresponding values from the third and fourth columns (Sequence and Star_seq). In short, I need a final table with one row per family that shows the three counts, the most frequent arm, and the sequences that belong to that arm.

#what I want as output
five_prime_counts <- c("2","0","0","1")
three_prime_counts <- c("1","0","1","2")
CoMature_counts <- c("0","1","0","0")
Arm_new <- c("5prime","CoMature","3prime","3prime")
Family_new <- c("LET-7","MIR-10","MIR-103","MIR-124")
Sequence_new <- c("ATCGGCA","ATCGTTT","TGAGGAG","AATTCAG")
Star_seq_new <- c("TTCAGGT","GAGATCA","CAAAAGC","AATATGC")
my_data_frame_new <- data.frame(five_prime_counts,three_prime_counts,CoMature_counts,Arm_new,Family_new,Sequence_new,Star_seq_new)
Apex

1 Answer


We can add a count column for each Family/Arm combination, take the Sequence, Star_seq and Arm values from the row with the maximum count in each Family, and then reshape the counts to wide format (one column per arm).

library(dplyr)

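my_data_frame %>%
  # n = number of rows for each Family/Arm combination
  add_count(Family, Arm) %>%
  group_by(Family) %>%
  # which.max() returns the first row with the highest count, so ties
  # are resolved in favour of the arm that appears first in the data
  mutate(Sequence = Sequence[which.max(n)], 
         Star_seq =  Star_seq[which.max(n)], 
         Arm_new = Arm[which.max(n)]) %>%
  # drop the repeated rows left over after the mutate()
  distinct() %>%
  # one count column per arm; pivot_wider() needs tidyr >= 1.0.0
  tidyr::pivot_wider(names_from = Arm, values_from = n, values_fill = list(n = 0))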

#  Family  Sequence Star_seq Arm_new  `5prime` `3prime` CoMature
#  <fct>   <fct>    <fct>    <fct>       <int>    <int>    <int>
#1 LET-7   ATCGGCA  TTCAGGT  5prime          2        1        0
#2 MIR-10  ATCGTTT  GAGATCA  CoMature        0        0        1
#3 MIR-103 TGAGGAG  CAAAAGC  3prime          0        1        0
#4 MIR-124 AATTCAG  AATATGC  3prime          1        2        0
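
As a quick sanity check on the counts above, the same table can be sketched in base R with `table()`. This is only an illustration, not part of the pipeline: the names `arm_counts` and `top_arm` are made up here, and `stringsAsFactors = FALSE` is an assumption so the columns stay character on R < 4.0. Note that ties are broken differently from the dplyr code: here the first column of the table wins, there the first row of the data wins.

```r
# Rebuild the two relevant columns of the example data from the question.
Arm <- c("5prime","3prime","5prime","CoMature","3prime","5prime","3prime","3prime")
Family <- c("LET-7","LET-7","LET-7","MIR-10","MIR-103","MIR-124","MIR-124","MIR-124")
my_data_frame <- data.frame(Arm, Family, stringsAsFactors = FALSE)

# Cross-tabulate Family against Arm: one row per family, one column per arm.
arm_counts <- table(my_data_frame$Family, my_data_frame$Arm)
print(arm_counts)

# Most frequent arm per family. which.max() returns the FIRST maximum,
# so a tie goes to whichever arm column comes first in the table.
top_arm <- colnames(arm_counts)[apply(arm_counts, 1, which.max)]
names(top_arm) <- rownames(arm_counts)
print(top_arm)
```

The `top_arm` vector should agree with the `Arm_new` column in the output above as long as no family has a tie between arms.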
Ronak Shah
  • Thank you for your answer, but I get this error: Error: 'pivot_wider' is not an exported object from 'namespace:tidyr', and I don't have the count columns either – Apex Jan 28 '20 at 12:50
  • @MortezaAslanzadeh I think you need to update `tidyr` package. Do `install.packages('tidyr')` Or use `spread` instead of `pivot_wider` if you have an older version. – Ronak Shah Jan 28 '20 at 12:51
  • I used spread instead of pivot_wider, but this error popped up and again there are no count columns: Error in tidyr::spread(., names_from = Arm, values_from = n, values_fill = list(n = 0)) : unused arguments (names_from = Arm, values_from = n, values_fill = list(n = 0)) – Apex Jan 28 '20 at 12:57
  • And this is the error when I install tidyr: xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun ERROR: compilation failed for package ‘tidyr’ * removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/tidyr’ * restoring previous ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/tidyr’ Warning in install.packages : installation of package ‘tidyr’ had non-zero exit status – Apex Jan 28 '20 at 13:00
  • @MortezaAslanzadeh Use `my_data_frame %>% add_count(Family, Arm) %>% group_by(Family) %>% mutate(Sequence = Sequence[which.max(n)], Star_seq = Star_seq[which.max(n)], Arm_new = Arm[which.max(n)]) %>% distinct() %>% tidyr::spread(Arm, n, fill = 0)` – Ronak Shah Jan 28 '20 at 13:03
  • Maybe this will be useful for someone else: the error occurred because I'm on a Mac. I ran `xcode-select --install` in the terminal, and after that finished, tidyr installed successfully – Apex Jan 28 '20 at 13:22
  • Yes, I can see the counts in the console, but when I do View(my_data_frame) it still has four columns and no counts. – Apex Jan 28 '20 at 13:29
  • Please assign the results `my_data_frame_new <- my_data_frame %>% add_count(Family, Arm) %>% ...rest of the code....` and then check `my_data_frame_new`. – Ronak Shah Jan 28 '20 at 13:31
  • That solved it. Thank you so much, you have five stars from me :) – Apex Jan 28 '20 at 13:35
  • Dear Ronak, can I have my new data frame in this format? – Apex Jan 28 '20 at 13:52
  • five_prime_counts <- c("2","2","0","0","1","1"); three_prime_counts <- c("1","1","0","1","2","2"); CoMature_counts <- c("0","0","1","0","0","0"); Arm_new <- c("5prime","5prime","CoMature","3prime","3prime","3prime"); Family_new <- c("LET-7","LET-7","MIR-10","MIR-103","MIR-124","MIR-124"); Sequence_new <- c("ATCGGCA","ATCGGCA","ATCGTTT","TGAGGAG","AATTCAG","AATTCAG"); Star_seq_new <- c("TTCAGGT","TTCAGGT","GAGATCA","CAAAAGC","AATATGC","AATATGC"); my_data_frame_new <- data.frame(five_prime_counts,three_prime_counts,CoMature_counts,Arm_new,Family_new,Sequence_new,Star_seq_new) – Apex Jan 28 '20 at 13:52
  • It is basically the same analysis, but keeping every row that belongs to the most frequent arm for each family. The sequences vary at the nucleotide level, so I don't want to lose them by keeping just a single row per family – Apex Jan 28 '20 at 13:54
  • Sorry, the code is not clear in the comments. Would you mind asking a new question with appropriate code and data? – Ronak Shah Jan 28 '20 at 14:45
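
For the follow-up in the last few comments (keep every row of the most frequent arm per family, together with the three counts), one possible sketch is below. It is not from the answer; it replaces the `which.max()` and `distinct()` steps with a `filter()` on the maximum count. The result name `my_data_frame_all` is made up for illustration, the count column names are borrowed from the asker's comment, and `stringsAsFactors = FALSE` is an assumption. If two arms tie within a family, this keeps the rows of both.

```r
library(dplyr)

# Example data from the question.
Arm <- c("5prime","3prime","5prime","CoMature","3prime","5prime","3prime","3prime")
Family <- c("LET-7","LET-7","LET-7","MIR-10","MIR-103","MIR-124","MIR-124","MIR-124")
Sequence <- c("ATCGGCA","ATGCTAC","ATCGGCA","ATCGTTT","TGAGGAG","TGATCAG","AATTCAG","AATTCAG")
Star_seq <- c("TTCAGGT","TATACTG","TTCAGGT","GAGATCA","CAAAAGC","CACATGC","AATATGC","AATATGC")
my_data_frame <- data.frame(Arm, Family, Sequence, Star_seq, stringsAsFactors = FALSE)

my_data_frame_all <- my_data_frame %>%
  add_count(Family, Arm) %>%               # n = rows per Family/Arm pair
  group_by(Family) %>%
  mutate(five_prime_counts  = sum(Arm == "5prime"),    # per-family arm totals
         three_prime_counts = sum(Arm == "3prime"),
         CoMature_counts    = sum(Arm == "CoMature")) %>%
  filter(n == max(n)) %>%                  # keep all rows of the top arm(s)
  ungroup() %>%
  select(five_prime_counts, three_prime_counts, CoMature_counts,
         Arm, Family, Sequence, Star_seq)

print(my_data_frame_all)
```

On the example data this keeps six rows: two for LET-7, one each for MIR-10 and MIR-103, and two for MIR-124.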