1

I have a df like:

   SampleID Chr Start End    Strand  Value
1:   rep1     1 11001 12000     -     10
2:   rep1     1 15000 20100     -     5
3:   rep2     1 11070 12050     -     1
4:   rep3     1 14950 20090     +     20
...

And I want to join the rows that share the same chr and strand and that have similar starting and end points (say like with 100 +/- distance). For those columns that the row join is performed, I would also like to concatenate the SampleID names and the Value. With the previous example, something like:

   SampleID Chr Start End    Strand  Value
1:rep1,rep2   1 11001 12000     -     10,1
2:   rep1     1 15000 20100     -     5
4:   rep3     1 14950 20090     +     20
...

Ideas? Thanks!

EDIT:

I found the fuzzyjoin package for R (https://cran.r-project.org/web/packages/fuzzyjoin/index.html). Does anyone have experience with this package?

EDIT2:

It would be also nice if just one of the variables (SampleID or Value) would be concatenated.

Tato14
  • 425
  • 1
  • 4
  • 9

1 Answers1

1

We could do group by 'Chr', 'Strand', create a grouping ID based on the difference between adjacent elements in 'Start' and 'End' columns after ordering by 'Start', 'End', then grouped by 'Chr', 'Strand' and 'ind', get the first element of 'Start', 'End', while pasteing the elements in 'SampleID' and 'Value' column

library(data.table)
df[order(Start, End), ind := rleid((Start - shift(Start, fill = Start[1])) < 100 & 
     (End -  shift(End, fill = End[1])) < 100), by =.(Chr, Strand)
   ][, .(Start = Start[1], End = End[1], 
     SampleID = toString(SampleID), Value = toString(Value)) , .(Strand, Chr, ind),]
#     Strand Chr ind Start   End   SampleID Value
#1:      -   1   1 11001 12000 rep1, rep2 10, 1
#2:      -   1   2 15000 20100       rep1     5
#3:      +   1   1 14950 20090       rep3    20

NOTE: Assumed that 'df' is a data.table

akrun
  • 874,273
  • 37
  • 540
  • 662
  • WOW! That's great, I would like to know if you could be a little bit more explicit with the command that you just used. Correct me if I'm wrong: You generate a "Label" for each condition looking at the `Chr`, `Strand` and the `Start` and `End` which fulfills the requirement that I asked (+/- 100). Then you joined the ones with the same "Label" and add the `SampleID` and `Value` as a string. Am I right? Another curious thing that happens in my `data.table` is that the rows that are exactly at the same position do not perform the join. Where would be the problem? – Tato14 Nov 18 '17 at 13:09
  • @Tato14 In the first set of `df[i, j, by]` we specify the `i` with `order` so that the values are ordered in the ascending order. The `by` is the grouping variables 'Chr', 'Strand', In the 'j', we assign the run-length-id (`rleid`) of logical output as a new columns, then in the second set, we are doing a group by operation. It is not clear when u say that `rows that are exactly at the sam position do not perform the join` – akrun Nov 18 '17 at 13:16
  • Thanks for the explanation! Regarding the last part, by "position" I meant same `Chrom`, `Strand`, `Start` and `End`. – Tato14 Nov 18 '17 at 13:23
  • I also found in the output some rows with joined `SampleID`s that do not match the conditions about `Start` and `End` but get the same `id`. If I only run the first part of the command a warning message appear: `RHS 1 is length 344 (greater than the size (32) of group 1). The last 312 element(s) will be discarded. RHS 1 is length 344 (greater than the size (5) of group 2)...` – Tato14 Nov 18 '17 at 14:00