% of matching between rows and find duplications based on specific column

Question

Hi Friends please help me !

I have large data set with duplications of the first and third cloumns.

test<-matrix(c("Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_D","Line_D","Line_D","Line_D","Line_E","Line_E","Line_E","Line_E","F5","F5","F5","F5","F6","F6","F6","F6","F5","F5","F5","F5","F6","F6","F6","F6","F7","F7","F7","F7","F5","F5","F5","F5","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","A/A","T/T","T/T","C/C","A/A","T/T","A/T","C/C","A/A","T/T","T/T","C/C","-/-","T/T","A/T","C/C","A/A","C/T","A/A","G/G","A/A","T/T","T/T","C/C","A/A","C/T","T/A","C/C","A/A","T/T","T/T","C/C","A/A","C/C","A/A","G/G"),nrow=36)

colnames(test)<-c("Line","Year","Marker","data")

I converted into a dataframe

test1<-data.frame(test)

There are duplicates in the dataset. For example row 1 and row 5, row 2 and row 6 etc..which has same data in cloumn 1 (Line) and column 3 (Marker).

Line         Year      Marker      data
Line_A         F5          M1       A/A
Line_A         F5          M2       T/T
Line_A         F5          M3       T/T
Line_A         F5          M4       C/C
Line_A         F6          M1       A/A
Line_A         F6          M2       T/T
Line_A         F6          M3       A/T
Line_A         F6          M4       C/C
Line_B         F5          M1       A/A
Line_B         F5          M2       T/T
Line_B         F5          M3       T/T
.
.
.

I would like to have a table as shown below to list the lines which are duplicated followed by non-duplicated. I would like to get the % of match between the duplicates between years by comparing the marker data. Such as

Line            Year                 Duplication        Matching_Marker %  
Line A          F5 / F6                  Yes                  75                  
Line B          F5 / F6                  Yes                  75                   
Line B          F5 / F7                  Yes                  50                   
Line B          F6 / F7                  Yes                  25                   
Line C          F5 / F6                  Yes                  50                   
Line D          F6                         No                 NA                  
Line D          F6                         No                 NA

Thanks in advance !

The best way of showing result should be

Line    Year        Duplication   MatchCount Mismatchcount     Matching_Marker %  Het%Year1   Het%Year2   Missing%Year1    Missing%Year2
Line A   F5 / F6      Yes            3           1                75               0            25            0                0     
Line B   F5 / F6      Yes            2           1                75               0            33.3          0                25                                               
Line B   F5 / F7      Yes            1           3                50               0            25             0                0                      
Line B   F6 / F7      Yes            0           3                0               25            25             25               0                       
Line C   F5 / F6      Yes            2           2                50               0            50             0                0                     
Line D   F6            No            NA           NA              NA               0            NA             0                NA                             
Line D   F6            No            NA           NA              NA               0           NA              0                NA

second example data set is

test<-matrix(c("Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_D","Line_D","Line_D","Line_D","Line_E","Line_E","Line_E","Line_E","Line_E","Line_E","Line_E","Line_E","F5","F5","F5","F5","F6","F6","F6","F6","F5","F5","F5","F5","F6","F6","F6","F6","F7","F7","F7","F7","F5","F5","F5","F5","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","A/A","T/T","T/T","C/C","A/A","T/T","A/T","C/C","A/A","T/T","T/T","C/C","-/-","T/T","A/T","C/C","A/A","C/T","A/A","G/G","A/A","T/T","T/T","C/C","A/A","C/T","T/A","C/C","A/A","T/T","T/T","C/C","A/A","C/C","A/A","G/G","A/A","C/C","A/A","G/G"),nrow=40)
colnames(test)<-c("Line","Year","Marker","data")
test1<-data.frame(test)

Third example data set where the one of the year data for a line is missing

test<-matrix(c("Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_A","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_B","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_C","Line_D","Line_D","Line_D","Line_D","Line_E","Line_E","Line_E","Line_E","Line_E","Line_E","Line_E","Line_E","F5","F5","F5","F5","F6","F6","F6","F6","F5","F5","F5","F5","F6","F6","F6","F6","F7","F7","F7","F7","F5","F5","F5","F5","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","F6","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","M1","M2","M3","M4","-/-","-/-","-/-","-/-","A/A","T/T","A/T","C/C","A/A","T/T","T/T","C/C","-/-","T/T","A/T","C/C","A/A","C/T","A/A","G/G","A/A","T/T","T/T","C/C","A/A","C/T","T/A","C/C","A/A","T/T","T/T","C/C","A/A","C/C","A/A","G/G","A/A","C/C","A/A","G/G"),nrow=40)

see `?duplicated` or `?by` I think you are looking for the latter, but it's unclear based on your question. You might also want to check in the search for `[r] data.table by` — Ricardo Saporta, Jul 02 '14 at 20:54
Sorry, if I'm missing something, but how is "Matching_Marker %" computed? — alexis_laz, Jul 02 '14 at 22:07
Thanks, Alex for trying to help me. The Matching_Marker % is based on the Line A which has two sets of values for M1, M2,M3, M4. When compared the two sets only one data is not matching out of 4, so the percentange of match is 75%. — user3794715, Jul 02 '14 at 22:24
Shouldn't "Line_C" be 50% ("M2" and "M3" -2 out of 4- differ between "F5" and "F6")? Also, why are there two "F5 / F6"s in "Line_B"? — alexis_laz, Jul 02 '14 at 22:37
Alexis: Yes you are absolutely correct. Line C should be 50%.and also I corrected the question on Line B. Sorry ! — user3794715, Jul 02 '14 at 23:54
Alexis: Thanks. As usual (?!) i was asked additional requirements. Could you please help me? 1. For calculating the match if one or both the line has -/- this should not be used in calculation 2. Additional column in the result to show count of match for each combination. 2. Additional column in the result to show count of mismatch 3. Additional column to show the count of mist match (het) for each year. Please see the additional result table in my question. I am sorry for expecting too much from you, this is a wonderful learning experience. — user3794715, Jul 04 '14 at 13:33
Also if there are identical years of the same line, this should also treated as duplication (Ex. F6/F6). A quick note on defining het is when a marker has A/T or A/C or A/G or C/G or C/T or G/T. — user3794715, Jul 04 '14 at 14:34

AndrewMacDonald · Answer 1 · 2014-07-03T04:39:35.820

Here is an example using dplyr. The first step is to "spread" the year data, then calculate similarity for pairs.

library(dplyr)
library(tidyr)

test %>% 
  data.frame(stringsAsFactors = FALSE) %>% 
  spread(Year,data) %>%
  group_by(Line) %>%
  summarise(F5_F6 = sum(F5 == F6)/length(F5)*100,
            F5_F7 = sum(F5 == F7)/length(F5)*100,
            F6_F7 = sum(F6 == F7)/length(F6)*100) %>%
  gather(Year, Matching_marker, F5_F6:F6_F7, na.rm = FALSE) %>%
  arrange(Line) %>%
  group_by(Line) %>%
  mutate(Duplicated = all(is.na(Matching_marker))) %>% 
  filter(!((!Duplicated) & is.na(Matching_marker)))

     Line  Year Matching_marker Duplicated
1  Line_A F5_F6              75      FALSE
2  Line_B F5_F6              50      FALSE
3  Line_B F5_F7              25      FALSE
4  Line_B F6_F7               0      FALSE
5  Line_C F5_F6              50      FALSE
6  Line_D F5_F6              NA       TRUE
7  Line_D F5_F7              NA       TRUE
8  Line_D F6_F7              NA       TRUE
9  Line_E F5_F6              NA       TRUE
10 Line_E F5_F7              NA       TRUE
11 Line_E F6_F7              NA       TRUE

AndrewMacDonald: Thanks. I tried after installing ' dplyr'. But it throws an error as 'Error in eval(expr, envir, enclos) : could not find function "spread"'. How/when to define the input file 'test1' ? sorry for asking questions, as i am a student learning R. — user3794715, Jul 03 '14 at 04:04
Whoops, sorry! I was supposed to add that I am also using the package `tidyr`, which contains the functions `gather` and `spread`. *Edited* to show both the loading of `tidyr` and the conversion to a data.frame — AndrewMacDonald, Jul 03 '14 at 04:36
Thanks AndrewMacDonald !..now i am stuck with additional requirement. I have updated the question and added details in comments section. — user3794715, Jul 04 '14 at 14:47

alexis_laz · Answer 2 · 2014-07-05T11:37:06.273

EDIT I'm making a second attempt for the revised question; previous answer is deleted from below. I 've, on purpose, tried to make the answer straightforward so that it 's easy to be read through and modified, wherever and whenever needed, easily (especially for trivial changes like in the updated question) Also, the below doesn't seem too efficient; hope it helps anyway.

The basic idea is:

split "test1" by "Line" (#1)
find all possible combinations of 2 "Year"s for each "Line" (#2)
manipulate entries with "-/-" (#3)
compare the "data" between each pair of "Year"s for each "Line" (#4)
format output accordingly (#5)

All the above, except #1, are defined in a helper function for the sake of handiness.

helper_ff = function(x) 
{
   if(length(unique(x[["Year"]])) > 1) {
       combs = combn(as.character(unique(x[["Year"]])), 2, simplify = F)  #2: get all combinations
       do.call(rbind, 
               lapply(combs,
                      function(z) {
                         Y1 = x[["data"]][x[["Year"]] == z[1]]
                         Y2 = x[["data"]][x[["Year"]] == z[2]]
                         misY1 = Y1 == "-/-"   
                         misY2 = Y2 == "-/-"
                         mis = misY1 | misY2  #3: ignore "-/-"
                         Y1b = Y1[!mis]
                         Y2b = Y2[!mis]
                         matches = Y1b == Y2b #4: find matches of the non '-/-'
                         data.frame(Line = x[["Line"]][1], #5: formatting stuff
                                    Year = paste(z, collapse = "/"),
                                    Duplication = "Yes",
                                    MatchCount = sum(matches),
                                    MismatchCount = sum(!matches),
                                    Matching_Marker = (sum(matches) / (sum(matches) + sum(!matches))) * 100,
                                    Het1 = (sum(sapply(strsplit(as.character(Y1b), "/"), function(x) x[1] != x[2])) / length(Y1b)) * 100,
                                    Het2 = (sum(sapply(strsplit(as.character(Y2b), "/"), function(x) x[1] != x[2])) / length(Y2b)) * 100,
                                    Mis1 = (sum(misY1) / length(Y1)) * 100,
                                    Mis2 = (sum(misY2) / length(Y2)) * 100)
                      }))
   } else {
       Y = x[["data"]]
       misY = Y == "-/-"
       Yb = Y[!misY]  #3: ignore "-/-"
       data.frame(Line = x[["Line"]][1], #5: formatting stuff
                  Year = x[["Year"]][1],
                  Duplication = "No",
                  MatchCount = NA,
                  MismatchCount = NA,
                  Matching_Marker = NA,
                  Het1 = (sum(sapply(strsplit(as.character(Yb), "/"), function(x) x[1] != x[2])) / length(Yb)) * 100,
                  Het2 = NA,
                  Mis1 = (sum(misY) / length(Y)) * 100,
                  Mis2 = NA)
   }     
}

res = do.call(rbind, 
              lapply(split(test1, test1[["Line"]]), #1: split
                     helper_ff))
rownames(res) = NULL
res
#    Line  Year Duplication MatchCount MismatchCount Matching_Marker     Het1     Het2 Mis1 Mis2
#1 Line_A F5/F6         Yes          3             1        75.00000  0.00000 25.00000    0    0
#2 Line_B F5/F6         Yes          2             1        66.66667  0.00000 33.33333    0   25
#3 Line_B F5/F7         Yes          1             3        25.00000  0.00000 25.00000    0    0
#4 Line_B F6/F7         Yes          0             3         0.00000 33.33333 33.33333   25    0
#5 Line_C F5/F6         Yes          2             2        50.00000  0.00000 50.00000    0    0
#6 Line_D    F6          No         NA            NA              NA  0.00000       NA    0   NA
#7 Line_E    F6          No         NA            NA              NA  0.00000       NA    0   NA

@user3794715 : Sorry, I hadn't seen your comments (to notify a user, either leave a comment below their answer or call them using "@username"). I modified the answer to match your updated output. It's a bit messy, because I tried to use a lot of stuff and in a clear way, so that you'll be able to work out any extra small changes that might be needed. Hope you can make something out of the above! — alexis_laz, Jul 05 '14 at 11:43
:Thank you very much. Yes the detailed script will help me to learn rather than 'library' solution. The script is perfect. However a minor modification to address when there are identical years of the same line, this should also be treated as duplication (Ex. F6/F6). On this regard I had posted comment yesterday. Sorry I was not aware of sending comments directly to the user. I have given new example data set for a Line_E which has two same year F6 data. I am sorry for asking too much...! I have learnt a lot an thanks again — user3794715, Jul 05 '14 at 15:06
@user3794715 : Using your new "test1" and running the code in my answer, it seems that I get a correct result, because your years are "F6" and "F6.1" so are already separated. Is your data really like that or you have only "F6" for "Line_E"? If the second, then you could use something like `test1$Year = ave(as.character(test1$Year), test1$Line, test1$Marker, FUN = make.unique)` so that your duplicate same-years are separated and, then, run the code in my answer. — alexis_laz, Jul 05 '14 at 15:31
Sorry..my concern was if there are two F6 years for the same Line. For better understanding, I have changed F6.1 into F6 for the Line_E in the test data ! — user3794715, Jul 05 '14 at 15:36
@user3794715 : In that case, you could turn "F6" to "F6.1" (and "F6.2" etc) :). A way to do this would be to just run `test1$Year = ave(as.character(test1$Year), test1$Line, test1$Marker, FUN = make.unique)` before running my answer (as I mentioned in the previous comment). — alexis_laz, Jul 05 '14 at 15:43
test1$Year = ave(as.character(test1$Year), test1$Line, test1$Marker, FUN = make.unique) works perfect !!!!!..thanks again !..this ends my chain of request ! — user3794715, Jul 05 '14 at 15:43
Sorry again !..in my large data, I got the following error, where one of the line has missing data for the one of the year for all markers.Error in sum(sapply(strsplit(as.character(Y1b), "/"), function(x) x[1] != : invalid 'type' (list) of argument. I have included the third example in my question. where Line_A has missing data for the year F5 for all markers !..In this case it should be Line_A, Year=F5/F6,match_count=0 and unmatch_count=0, Match=0%, het_1 = 0, het_2=25%, mis_1=0%, mis_2=100% Do you think you can help me ? — user3794715, Jul 08 '14 at 12:17
@user3794715 : Sorry for the delay, I didn't have time earlier to give it a look. You could define a function "find_het" that handles such cases: `find_het <- function(x) if(!length(x)) 0 else (sum(sapply(strsplit(as.character(x), "/"), function(z) z[1] != z[2])) / length(x)) * 100`. And in "helper_ff" replace the lines inside "if", "Het1" and "Het2", using: `Het1 = find_het(Y1[!misY1])` and `Het2 = find_het(Y2[!misY2])` and in "else" `Het1 = find_het(Y1b)`. It seems valid this way. Note that "Match" will be `NaN` because of 0 matches / 0 mismatches, but, I guess, you can easily fix that. — alexis_laz, Jul 10 '14 at 15:44

% of matching between rows and find duplications based on specific column

2 Answers2