0

I have two tables, the start of each is given below:

Table 1: All SNPs

   SNp           Gene
rs1798922   ENSG00000167634
rs4677723   ENSG00000167634
rs1609823   ENSG00000104450
rs11597390  ENSG00000104643
rs7824557   ENSG00000104643
rs1371867   ENSG00000104450

Table 2: Best SNP per gene

  SNP            Gene
rs1371867   ENSG00000104450
rs7824557   ENSG00000104643
rs1671152   ENSG00000167634
rs11597390  ENSG00000095485
rs285757    ENSG00000185442

Table 1 shows a list of genes with their corresponding SNPs. As can be seen, the same gene is repeated in many places in the table.

Table 2 is the result after filtering through all the SNPs for each gene in Table 1, and keeps only one SNP per gene (keeps best SNP according to the p-value, although that's not relevant here).

So in other words, there are some SNPs in Table 1 that's not included in Table 2, since Table 2 only keeps the best SNP for each gene.

For each gene, I want to use R to compare the 2 tables and output back the SNPs that weren't included in Table 2 for that gene. So the specification for comparison is the Gene name, which will change constantly since there are many genes in the table.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
zfz
  • 153
  • 3
  • 16

2 Answers2

3

Assuming that all the SNPs are different, try this:

subset(t1,!(t1$SNp %in% t2$SNP))
eyanquenb
  • 193
  • 7
  • thanks this worked! is there a way to lump together the result for each gene? this output gives me the genes out of order, but to be easy on the eyes, can i order the output of this code by the gene name? so the same gene will appear one after another without a different gene interrupting the order. – zfz Jun 27 '13 at 09:54
  • 2
    +1 - I was severely overworking my answer to this question. Nice one. @zfzhao look at `order`, e.g. `t3 <- subset(t1,!(t1$SNp %in% t2$SNP)) ; t3 <- t3[ order(t3$Gene , t3$SNP) , ]` – Simon O'Hanlon Jun 27 '13 at 09:56
  • thanks :-) Sorry one more thing, is there a way to NOT include the **common** SNP for each gene between the 2 tables? This code gives me all the SNPs for each gene between the 2 tables, but I really just want the SNPs that are missing in Table 2 compared to Table 1 for every gene. – zfz Jun 27 '13 at 10:08
  • @zfzhao um, not it doesn't, otherwise that would just be `t1`??! Check the results on your subset of data you posted. – Simon O'Hanlon Jun 27 '13 at 10:12
1

Another solution is to merge the 2 dataset and take the incomplete cases:

res <- merge(dat.all,dat.best,by.x='SNP',by.y='SNP',
             suffixes =c('.all','.best'),
             all.x=TRUE,all.y=TRUE)

Then keep only genes that doesn't exist within best's:

res[is.na(res$Gene.best),]
        SNP        Gene.all Gene.best
3 rs1609823 ENSG00000104450      <NA>
4 rs1798922 ENSG00000167634      <NA>
5 rs4677723 ENSG00000167634      <NA>
agstudy
  • 119,832
  • 17
  • 199
  • 261