1

I'm looking to merge a list of 6 GRanges objects into one, with the resulting table have a common set of breakpoints across all 6 objects as shown in Figure 1 below. I've tried to do this using findOverlaps(), however the this is equivalent to a left_join (or right depending how you set it up), which doesn't exactly take both coordinate sets into account.

I've experimented with plyranges however, these duplicate the row coordinates rather than breaking them into individual breakpoints, and I'm not exactly sure what's happening with the metadata. Example below:

Any suggestions on how to combine the sets ? I use GenomicRanges, but I'm open to other suggestions since the data can be coerced to a data.frame, etc.

thank you,

Concept figure and data example:

enter image description here Figure 1: Depiction of genomic segments in three samples (S1..S3), along a chromosome region. The unified breakpoint ranges are marked by the black vertical lines, labeled 1..9.

Example using plyranges:

3   > library(GenomicRanges)
4   > library(tidyverse)
5   > library(plyranges)
6   *** output flushed ***
7   > data(grL)
8   > grL[1:2]
9   GRangesList object of length 2:
10  $A
11  GRanges object with 14212 ranges and 7 metadata columns:
12            seqnames            ranges strand |        Type X.log10.q.value
13               <Rle>         <IRanges>  <Rle> | <character>       <numeric>
14        [1]        1     564424-846680      * |         Amp               0
15        [2]        1     852863-974841      * |         Amp               0
16        [3]        1    986491-1565745      * |         Amp               0
17        [4]        1   1566563-2719525      * |         Amp               0
18        [5]        1   2728545-2761325      * |         Amp               0
19        ...      ...               ...    ... .         ...             ...
20    [14208]       22 50750387-50763701      * |         Del        1.129787
21    [14209]       22 50780464-50885937      * |         Del        2.941190
22    [14210]       22 50887704-51067512      * |         Del        0.011509
23    [14211]       22 51067512-51173133      * |         Del        0.400633
24    [14212]       22 51174234-51178205      * |         Del        0.480761
25              a.score           b.score   c.score      cohort     subtype
26            <numeric>         <numeric> <numeric> <character> <character>
27        [1]  0.163684          1.322534  0.054545    NYC        A
28        [2]  0.137531          1.689206  0.036364    NYC        A
29        [3]  0.064422          0.889654  0.036364    NYC        A
30        [4]  0.092361          1.195796  0.036364    NYC        A
31        [5]  0.101790          0.881394  0.054545    NYC        A
32        ...       ...               ...       ...         ...         ...
33    [14208]  0.068896          0.305572  0.181818    NYC        A
34    [14209]  0.112202          0.305572  0.181818    NYC        A
35    [14210]  0.015748          0.251801  0.181818    NYC        A
36    [14211]  0.037401          0.251801  0.181818    NYC        A
37    [14212]  0.039369          0.278090  0.181818    NYC        A
38    -------
39    seqinfo: 24 sequences from an unspecified genome; no seqlengths
40
41  $B
42  GRanges object with 25101 ranges and 7 metadata columns:
43            seqnames            ranges strand |        Type X.log10.q.value
44               <Rle>         <IRanges>  <Rle> | <character>       <numeric>
45        [1]        1     564424-835336      * |         Amp               0
46        [2]        1    837386-1458139      * |         Amp               0
47        [3]        1   1464470-1555673      * |         Amp               0
48        [4]        1   1556444-1565745      * |         Amp               0
49        [5]        1   1566563-2633368      * |         Amp               0
50        ...      ...               ...    ... .         ...             ...
51    [25097]       22 50710971-50742016      * |         Del        0.926994
52    [25098]       22 50780464-50885937      * |         Del        1.208032
53    [25099]       22 50887704-50914027      * |         Del        0.825840
54    [25100]       22 50915850-51173133      * |         Del        0.844370
55    [25101]       22 51174234-51218950      * |         Del        0.893049
56              a.score           b.score   c.score      cohort     subtype
57            <numeric>         <numeric> <numeric> <character> <character>
58        [1]  0.263871          0.408836  0.269231    NYC        B
59        [2]  0.239878          0.400169  0.250000    NYC        B
60        [3]  0.262476          0.436118  0.250000    NYC        B
61        [4]  0.223937          0.400252  0.230769    NYC        B
62        [5]  0.271884          0.451085  0.250000    NYC        B
63        ...       ...               ...       ...         ...         ...
64    [25097]  0.224858          0.350757  0.346154    NYC        B
65    [25098]  0.247760          0.362135  0.365385    NYC        B
66    [25099]  0.214448          0.338461  0.384615    NYC        B
67    [25100]  0.216530          0.338461  0.384615    NYC        B
68    [25101]  0.220694          0.378862  0.346154    NYC        B
69    -------
70    seqinfo: 24 sequences from an unspecified genome; no seqlengths
71
72  > a <- join_overlap_left(grL[[2]], grL[[1]])
73  > a
74  GRanges object with 73802 ranges and 14 metadata columns:
75            seqnames            ranges strand |      Type.x X.log10.q.value.x
76               <Rle>         <IRanges>  <Rle> | <character>         <numeric>
77        [1]        1     564424-835336      * |         Amp                 0
78        [2]        1     564424-835336      * |         Amp                 0
79        [3]        1    837386-1458139      * |         Amp                 0
80        [4]        1    837386-1458139      * |         Amp                 0
81        [5]        1    837386-1458139      * |         Amp                 0
82        ...      ...               ...    ... .         ...               ...
83    [73798]       22 50915850-51173133      * |         Del          0.844370
84    [73799]       22 50915850-51173133      * |         Del          0.844370
85    [73800]       22 50915850-51173133      * |         Del          0.844370
86    [73801]       22 51174234-51218950      * |         Del          0.893049
87    [73802]       22 51174234-51218950      * |         Del          0.893049
88            a.score.x           b.score.x   c.score.x    cohort.x   subtype.x
89            <numeric>           <numeric>   <numeric> <character> <character>
90        [1]  0.263871            0.408836    0.269231    NYC        B
91        [2]  0.263871            0.408836    0.269231    NYC        B
92        [3]  0.239878            0.400169    0.250000    NYC        B
93        [4]  0.239878            0.400169    0.250000    NYC        B
94        [5]  0.239878            0.400169    0.250000    NYC        B
95        ...       ...                 ...         ...         ...         ...
96    [73798]  0.216530            0.338461    0.384615    NYC        B
97    [73799]  0.216530            0.338461    0.384615    NYC        B
98    [73800]  0.216530            0.338461    0.384615    NYC        B
99    [73801]  0.220694            0.378862    0.346154    NYC        B
100   [73802]  0.220694            0.378862    0.346154    NYC        B
101                Type.y X.log10.q.value.y a.score.y           b.score.y
102           <character>         <numeric> <numeric>           <numeric>
103       [1]         Del          0.006434  0.007874            0.201368
104       [2]         Amp          0.000000  0.163684            1.322534
105       [3]         Del          0.006434  0.007874            0.201368
106       [4]         Amp          0.000000  0.163684            1.322534
107       [5]         Amp          0.000000  0.137531            1.689206
108       ...         ...               ...       ...                 ...
109   [73798]         Amp          0.000000  0.000000            0.000000
110   [73799]         Del          0.011509  0.015748            0.251801
111   [73800]         Del          0.400633  0.037401            0.251801
112   [73801]         Amp          0.000000  0.000000            0.000000
113   [73802]         Del          0.480761  0.039369            0.278090
114             c.score.y    cohort.y   subtype.y
115             <numeric> <character> <character>
116       [1]    0.018182    NYC        A
117       [2]    0.054545    NYC        A
118       [3]    0.018182    NYC        A
119       [4]    0.054545    NYC        A
120       [5]    0.036364    NYC        A
121       ...         ...         ...         ...
122   [73798]    0.000000    NYC        A
123   [73799]    0.181818    NYC        A
124   [73800]    0.181818    NYC        A
125   [73801]    0.000000    NYC        A
126   [73802]    0.181818    NYC        A
127   -------
128   seqinfo: 24 sequences from an unspecified genome; no seqlengths
129
130 > sessionInfo()
131 R version 4.0.5 Patched (2021-06-20 r80951)
132 Platform: x86_64-apple-darwin17.0 (64-bit)
133 Running under: macOS Big Sur 10.16
134
135 Matrix products: default
136 BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.0.dylib
137 LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
138
139 locale:
140 [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
141
142 attached base packages:
143 [1] parallel  stats4    stats     graphics  grDevices utils     datasets
144 [8] methods   base
145
146 other attached packages:
147  [1] plyranges_1.10.0     forcats_0.5.1        stringr_1.4.0
148  [4] dplyr_1.0.8          purrr_0.3.4          readr_2.1.2
149  [7] tidyr_1.2.0          tibble_3.1.6         ggplot2_3.3.6
150 [10] tidyverse_1.3.1      GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
151 [13] IRanges_2.24.1       S4Vectors_0.28.1     BiocGenerics_0.36.1
152
153 loaded via a namespace (and not attached):
154  [1] lattice_0.20-45             lubridate_1.8.0
155  [3] Rsamtools_2.6.0             Biostrings_2.58.0
156  [5] assertthat_0.2.1            rprojroot_2.0.3
157  [7] utf8_1.2.2                  R6_2.5.1
158  [9] cellranger_1.1.0            backports_1.4.1
159 [11] reprex_2.0.1                httr_1.4.3
160 [13] pillar_1.7.0                zlibbioc_1.36.0
161 [15] rlang_1.0.2                 readxl_1.4.0
162 [17] rstudioapi_0.13             Matrix_1.4-1
163 [19] BiocParallel_1.24.1         RCurl_1.98-1.6
164 [21] munsell_0.5.0               DelayedArray_0.16.3
165 [23] broom_0.8.0                 rtracklayer_1.50.0
166 [25] compiler_4.0.5              modelr_0.1.8
167 [27] pkgconfig_2.0.3             tidyselect_1.1.2
168 [29] SummarizedExperiment_1.20.0 GenomeInfoDbData_1.2.4
169 [31] matrixStats_0.61.0          XML_3.99-0.9
170 [33] fansi_1.0.3                 crayon_1.5.1
171 [35] tzdb_0.3.0                  dbplyr_2.1.1
172 [37] withr_2.5.0                 GenomicAlignments_1.26.0
173 [39] bitops_1.0-7                grid_4.0.5
174 [41] jsonlite_1.8.0              gtable_0.3.0
175 [43] lifecycle_1.0.1             DBI_1.1.3
176 [45] magrittr_2.0.3              scales_1.2.0
177 [47] cli_3.2.0                   stringi_1.7.6
178 [49] XVector_0.30.0              fs_1.5.2
179 [51] xml2_1.3.3                  ellipsis_0.3.2
180 [53] generics_0.1.2              vctrs_0.4.0
181 [55] tools_4.0.5                 Biobase_2.50.0
182 [57] glue_1.6.2                  hms_1.1.1
183 [59] MatrixGenerics_1.2.1        colorspace_2.0-3
184 [61] rvest_1.0.2                 haven_2.4.3
185 [63] usethis_2.1.6
186 >

RodrigoGM
  • 68
  • 5

0 Answers0